Compare commits

3649 Commits

Author SHA1 Message Date
83ad8e01b1 fix the problem that cpu_fallback for aten::triu_indices on custom device crashed (#121306)
Fixes #121289

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121306
Approved by: https://github.com/ezyang
2024-03-26 01:29:45 +00:00
5e66bf5f42 Avoid COW materialize in nn.functional forward ops (3) (#122443)
Affected ops:
* repeat
* unfold
* logsigmoid
* pixel_shuffle/unshuffle
* remaining norm ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122443
Approved by: https://github.com/ezyang
2024-03-26 00:56:57 +00:00
b6982bf2b2 [dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)
Fixes #114844

In the linked issue we have
```
compiled_module = torch.compile(module)
compiled_module.x = ...
compiled_module(...)  # Mutates self.x
```
Where since the module mutates `self.x` you would expect `compiled_module.x`
to be updated but actually `compiled_module.x = ...` sets an attribute "x"
on the `OptimizedModule` object while the forward method of the module mutates
`module.x`.

This gives the expected behavior by forwarding `compiled_module.__setattr__`
down to `module.__setattr__`. There is already a corresponding `__getattr__`
so now `compiled_module.x` becomes an alias for `module.x`.
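A minimal sketch of that forwarding (simplified and hypothetical; the real `OptimizedModule` does more bookkeeping):

```python
class WrapperSketch:
    """Hypothetical stand-in for OptimizedModule's attribute forwarding."""

    def __init__(self, mod):
        # Store the wrapped module without triggering our own __setattr__.
        object.__setattr__(self, "_orig_mod", mod)

    def __getattr__(self, name):
        # Reads fall through to the wrapped module.
        return getattr(self._orig_mod, name)

    def __setattr__(self, name, value):
        # Writes are forwarded too, so wrapper.x stays an alias for module.x.
        setattr(self._orig_mod, name, value)
```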

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-03-26 00:52:12 +00:00
eda279c997 [CpuInductor] Implement masked_load for integral types (#122608)
Use `if constexpr` to separate float vs integral masked load for avx512
Discovered while looking at `test_comprehensive_fft_ihfft2_cpu_int64` on
non-AVX512 capable CPUs, where the (5, 6, 7) shape was big enough to start a vectorized loop

Added `test_pad_cast` regression test

Fixes https://github.com/pytorch/pytorch/issues/122606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122608
Approved by: https://github.com/jansel
ghstack dependencies: #122607
2024-03-25 22:44:54 +00:00
57a3d00b06 [AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562)
Summary:
During tracing, some constants (tensor_constant{idx}) are generated internally.
Those constants are neither parameters nor buffers, and users have zero control over them.

To accommodate this, we should allow users not to pass in those internally generated constants while still being able to use the constants in the model.

Test Plan:
Included in commit.
```
build/bin/test_aot_inductor
```

Differential Revision: D55286634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122562
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2024-03-25 22:05:20 +00:00
ebde6c72cb Precompile triton templates (#121998)
Before this PR we were not precompiling triton templates in parallel. Compilation would occur during benchmarking.

Triton benchmarking templates were emitted as :

```
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

In order to precompile we need to give the full kernel specification, as we do when we emit the template in the final output code generation.

```
@triton_heuristics.template(
    num_stages=3,
    num_warps=8,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]},
    inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': 'cdeecfeccd31ad7810f96b5752194b1c2406d0a81e39a6ca09c8ee150baae183'},
)
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121998
Approved by: https://github.com/jansel
2024-03-25 21:33:36 +00:00
9b095c3fe6 [dynamo] Config to not emit runtime asserts (#122603)
Repeat of https://github.com/pytorch/pytorch/pull/122406, which was squashed & merged by mistake

Differential Revision: [D55312394](https://our.internmc.facebook.com/intern/diff/D55312394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122603
Approved by: https://github.com/ezyang
2024-03-25 21:17:44 +00:00
1f67da5105 [executorch hash update] update the pinned executorch hash (#122152)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122152
Approved by: https://github.com/pytorchbot
2024-03-25 20:56:34 +00:00
46a76cfef5 [ROCm] Fix test_trace_rule_update.py (#121524)
- Add missing torch APIs to the trace rules and ignore APIs with manual trace rules.

This PR fixes test/dynamo/test_trace_rule_update.

Possibly related to https://github.com/pytorch/pytorch/pull/121142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121524
Approved by: https://github.com/jansel, https://github.com/pruthvistony
2024-03-25 20:53:24 +00:00
bc7f3859b3 Update jvp to support symbolic execution. (#120338)
Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions.

List of changes:
- Update`_has_same_storage_numel` to use `sym_nbytes`
- Symintify `_efficientzerotensor_meta`
- Introduce `empty_generic_symint` with the first argument `size` as symbolic integer
- Update gen_variable_type.py script to call the symint version of zeros_fn function (zeros_symint / _efficientzerotensor_symint)
- Update `has_same_meta` to call `sym_*` functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338
Approved by: https://github.com/soulitzer
2024-03-25 20:50:12 +00:00
1c1268b6e9 seg-fault of "basic_string::_M_construct null not valid" fix for getNcclErrorDetailStr (#121905)
When working on testing all-reduce with an alternative rccl replacement backend, my test script crashed. After debugging, I found that `ncclGetLastError(NULL)` returned null, and the code then used the return value to construct a std::string, which seg-faulted with the exception `basic_string::_M_construct null not valid`.

This pull request fixes this edge condition so that the program exits gracefully with useful information.

**Test:**
Before the fix, my test script exits like below:
```
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2051, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: basic_string::_M_construct null not valid
```

After this fix, my test script exited with useful message like,
```
[rank0]:   File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank0]:     work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:272, internal error - please report this issue to the NCCL developers, NCCL version 0.4.2
[rank0]: ncclInternalError: Internal check failed.
[rank0]:  Last error: Unknown NCCL Error
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121905
Approved by: https://github.com/wconstab
2024-03-25 20:49:34 +00:00
05bbcae5bb Refactor functorch meta conversion (#122202)
At a high level, the goal of this refactor was to make it so that `MetaConverter.__call__` has a straightforward code structure in three steps: (1) check if we support doing meta conversion, (2) describe the tensor into MetaTensorDesc, (3) call `meta_tensor` on MetaTensorDesc. However, this is not so easy to do, because there is a big pile of special cases for functional tensor inside `__call__`.
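A pseudocode sketch of that three-step shape (helper names here are hypothetical, not the actual attribute names):

```python
def meta_converter_call_sketch(converter, t):
    # (1) check whether meta conversion is supported for this tensor
    if not converter.is_supported(t):
        return NotImplemented
    # (2) describe the tensor into a serializable MetaTensorDesc
    desc = converter.describer.describe_tensor(t)
    # (3) reconstruct the meta/fake tensor from the description alone
    return converter.meta_tensor(desc)
```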

The primarily complication is handling the ambient functionalization state: specifically, the functorch dynamic layer stack and the Python functionalization dispatch. The old code demands that meta tensor conversion happen with this state disabled. But I discovered that when I reconstruct functorch tensors it demands that the functorch layers be active; in fact a batch tensor will have a pointer to the internal functorch layer.

I had some discussion with Richard Zou about what code structure here makes sense. In particular, one of the goals of the refactor here is that I can inflate MetaTensorDesc from an entirely different process, which may not have all of the functorch layers activated at the time we do reconstruction. So it seems to me that we should make it explicit in MetaTensorDesc that there was some functorch layer active at the time the functorch tensor was serialized, so that we could potentially know we need to reconstruct these layers on the other side. This is NOT implemented yet, but there's some notes about how potentially it could proceed. But the important thing here is we SHOULD disable everything when we run `meta_tensor`, and internally be responsible for restoring the stack. Actually, the necessary infra bits in functorch don't exist to do this, so I added some simple implementations in pyfunctorch.py.

The rest is splitting up the manipulations on tensor (we do things like sync the real tensor before describing it; Describer is responsible for this now) and I also tried to simplify the not supported condition, based on my best understanding of what the old thicket of conditions was doing. You may notice that the internal meta_tensor handling of functional tensor is inconsistent with surrounding code: this is because I *exactly* replicated the old reconstruction behavior; a further refactor would be to rationalize this.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122202
Approved by: https://github.com/zou3519
2024-03-25 20:47:21 +00:00
9223b2cb31 Pop codegened parent graph from wrapper in GraphLowering (#122469)
Summary: Previously, we kept a reference to `V.graph` in the `codegened_graph_stack` of the wrapper. Memory regression analysis of https://github.com/pytorch/pytorch/issues/121887 shows that this has led to a slightly higher memory utilization during lowering of the `llama_v2_7b_16h` model. Here we refactor the code to pop the parent subgraph from the `codegened_graph_stack` when codegen-ing is done.

Fixes https://github.com/pytorch/pytorch/issues/121887.

Test Plan: CI, also see https://github.com/pytorch/pytorch/issues/121887#issuecomment-2014209104.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122469
Approved by: https://github.com/eellison
2024-03-25 20:27:59 +00:00
b2c496ba24 Revert "[TorchGen] Add mutable parameter to valuetype_type function in api/cpp.py (#121415)"
This reverts commit c1fe09dc37358d8121f119d66e9e8c8d57035158.

Reverted https://github.com/pytorch/pytorch/pull/121415 on behalf of https://github.com/ezyang due to I think this needs to be reverted to after https://github.com/pytorch/pytorch/pull/120076 revert ([comment](https://github.com/pytorch/pytorch/pull/121415#issuecomment-2018828813))
2024-03-25 20:14:40 +00:00
f84e3bf36d [ez] Fix XLA auto hash updates (#122630)
The xla pin is located in .github/ci_commit_pins not .ci/docker/ci_commit_pins
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122630
Approved by: https://github.com/huydhn
2024-03-25 19:45:56 +00:00
9d1de31634 [BE][CPUInductor] Use C++17 helper templates (#122607)
Such as `std::is_same_v` ,`std::is_integral_v` and C++14 one `std::enable_if_t`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122607
Approved by: https://github.com/jansel, https://github.com/Skylion007
2024-03-25 19:01:44 +00:00
2d4197c9b7 add case for creating storage on ort (#122446)
Fixes #122445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122446
Approved by: https://github.com/mikaylagawarecki
2024-03-25 18:59:20 +00:00
2db7d874a9 [inductor] Improve error message for shape errors in slice_scatter (#122543)
Fixes #122291

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122543
Approved by: https://github.com/shunting314
2024-03-25 18:57:16 +00:00
db506762d1 Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076)"
This reverts commit a52b4e22571507abc35c2d47de138497190d2e0a.

Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/atalman due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-2018680656))
2024-03-25 18:52:05 +00:00
c7bf5871ce CUDAEvent::elapsed_time could accidentally initialize a non-used GPU (#122538)
This sets the device before calling cudaEventElapsedTime to avoid the case
where the "cudaGetCurrentDevice" device would be initialized even though
neither event is on that device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122538
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-03-25 17:49:50 +00:00
198927170d Avoid COW materialize in nn.functional forward ops (2) (#121992)
Affected ops:
* dropout
* embedding
* embedding_bag
* multi_head_attention_forward
* grid_sample
* ctc_loss
* nll_loss
* pdist

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121992
Approved by: https://github.com/ezyang
ghstack dependencies: #122437, #121991
2024-03-25 17:31:19 +00:00
55becf02bc Avoid COW materialize in nn.functional forward ops (1) (#121991)
Affected ops:
* Remaining norm ops
* pad
* margin_loss ops
* fractional_max_pool
* linear
* prelu
* rrelu
* scaled_dot_product_attention
* logsigmoid
* threshold
* binary_cross_entropy
* gelu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121991
Approved by: https://github.com/ezyang
ghstack dependencies: #122437
2024-03-25 17:31:19 +00:00
4c70ab26ef [MPS] Enable index_select for complex types (#122590)
Surprisingly, as of MacOS-14.14, MPS `gatherWithUpdatesTensor:indicesTensor:axis:batchDimensions:name:` still does not support complex types, so emulate them using the `at::view_as_real` trick

Fixes https://github.com/pytorch/pytorch/issues/122427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122590
Approved by: https://github.com/Skylion007
2024-03-25 16:57:35 +00:00
e6a37eeb06 run some cuda testcases on other devices if available. (#122182)
If users want to run some CUDA test cases on other devices by setting an environment variable, in order to test performance on custom devices, I think it can be done as in this PR.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122182
Approved by: https://github.com/ezyang
2024-03-25 16:40:03 +00:00
70ac13b876 [ez][TD] Hide errors in llm retrieval job (#122615)
The new ghstack no longer has a base on main, so finding the base for ghstacked PRs is harder.  Something similar to https://github.com/pytorch/pytorch/pull/122214 might be needed, but then I'm worried about tokens

Either way, this is a quick workaround to hide these errors for ghstack users
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122615
Approved by: https://github.com/huydhn
2024-03-25 16:35:00 +00:00
47a9725de9 Implement prefer_deferred_runtime_asserts_over_guards (#122090)
Fixes https://github.com/pytorch/pytorch/issues/121749

As promised, it is pretty easy.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122090
Approved by: https://github.com/lezcano
2024-03-25 16:31:16 +00:00
e49a38973f Update DimOrDims typing in torch.sparse (#122471)
I noticed the typing of the `torch.sparse.sum`'s `dim` parameter wasn't allowing an int tuple as input and tracked the issue to this type.
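For illustration, a widened alias along these lines would accept either form (a sketch; the actual definition lives in `torch/sparse`):

```python
from typing import List, Optional, Tuple, Union

# Accept a single dim, a tuple of dims, or a list of dims (or None for "all dims").
DimOrDims = Optional[Union[int, Tuple[int, ...], List[int]]]
```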

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122471
Approved by: https://github.com/soulitzer
2024-03-25 16:25:56 +00:00
06f22537ca [dynamo] Suppress warning about torch.autograd.Function() (#122566)
PR #120577 got reverted due to issues in fbcode.  This hides warning
that PR was trying to fix until we can debug the fbcode issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122566
Approved by: https://github.com/yanboliang
2024-03-25 16:18:43 +00:00
0465a90b00 [export][reland] Fix unflattened submodule ordering. (#122341) (#122507)
Summary:

Make sure the order of submodules is the same as the original eager module.

bypass-github-export-checks

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_unflatten_submodule_ordering

Differential Revision: D55251277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122507
Approved by: https://github.com/tugsbayasgalan
2024-03-25 15:22:01 +00:00
11dfa72153 [BE] Remove unnecessary state dict update. (#122528)
From what I can see, the following is a redundant/unnecessary setting of a dict element.

Differential Revision: [D55191396](https://our.internmc.facebook.com/intern/diff/D55191396/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122528
Approved by: https://github.com/Skylion007
2024-03-25 15:21:44 +00:00
5152945441 GPT2 SDPA inference pattern-matching for Inductor-CPU (#121866)
### Summary
With this PR, SDPA pattern of GPT2 is being mapped to `torch.nn.functional.scaled_dot_product_attention`.
While GPT2 supports both a causal mask & an attention mask, this PR considers the case of attention mask being absent.
TorchBench inference workload for GPT2 also doesn't use an attention-mask.

This pattern's replacement is being disabled for CUDA because [CUDA AOT Inductor](https://github.com/pytorch/pytorch/actions/runs/8319111885/job/22762567770) CI job's `GPT2ForSequenceClassification` accuracy test failed, although all other trunk CUDA Inductor CI checks had passed.
Created #122429 to track that particular issue.
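For reference, the replacement target with a causal mask and no explicit attention mask looks like this (illustrative shapes):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) -- illustrative sizes only
q = torch.randn(1, 12, 128, 64)
k = torch.randn(1, 12, 128, 64)
v = torch.randn(1, 12, 128, 64)

# Causal masking, no explicit attention mask -- the case this pattern covers.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=True)
```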

### CPU performance data with TorchBench
|MODEL |BATCH SIZE | DTYPE | BEFORE: Speedup over eager-mode with the default Inductor implementation | AFTER: Speedup over eager mode with SDPA op mapped| Perf boost = (AFTER - BEFORE)/BEFORE * 100|
|--------------------------|-------------|---------|-----------------------------|--------------------------|------------|
|hf_GPT2| 1 | FP32 | 1.522x | 1.791x| 17.67%|
|hf_GPT2| 1 | BF16 (AMP) | 1.795x | 2.387x| 32.98%|
|hf_GPT2| 2 | FP32 |  1.313x |1.629x | 19.3%|
|hf_GPT2|2| BF16 (AMP) | 1.556x | 1.924x | 23.65%|
|hf_GPT2_large| 1 | FP32 | 1.380x |1.585x | 12.93%|
|hf_GPT2_large| 1 | BF16 (AMP) | 1.208x | 1.567x | 22.91%|
|hf_GPT2_large| 2 | FP32 | 1.188x | 1.490x | 25.42%|
|hf_GPT2_large|2| BF16 (AMP) | 0.991x | 1.575x | 58.93%|

Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen Sapphire Rapids)
48 physical cores were used. Intel OpenMP & libtcmalloc were preloaded.

Example command -
```
 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 -C 0-47 python benchmarks/dynamo/torchbench.py --performance --inference --inductor --float32 -dcpu --only hf_GPT2_large --freezing --batch-size 1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121866
Approved by: https://github.com/Valentine233, https://github.com/jgong5, https://github.com/desertfire
2024-03-25 15:04:03 +00:00
4dc09d6aa4 Revert "Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)"
This reverts commit e9dcda5cba92884be6432cf65a777b8ed708e3d6.

Reverted https://github.com/pytorch/pytorch/pull/114068 on behalf of https://github.com/ezyang due to memory leak in another ci ([comment](https://github.com/pytorch/pytorch/pull/114068#issuecomment-2018044527))
2024-03-25 13:49:04 +00:00
cyy
b9d6f8cc18 Fix clang-tidy warnings in aten/src/ATen/core/*.cpp (#122572)
This PR fixes clang-tidy warnings in aten/src/ATen/core/*.cpp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122572
Approved by: https://github.com/ezyang
2024-03-25 13:46:24 +00:00
1e404c9b12 Remove redundant query to tensor_to_context (#122278)
from_real_tensor will query it again, so this query is strictly
dominated.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122278
Approved by: https://github.com/eellison
ghstack dependencies: #122044, #122270, #122271
2024-03-25 13:16:21 +00:00
49b81af45f Delete dead memoized_only kwarg in FakeTensor (#122271)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122271
Approved by: https://github.com/eellison
ghstack dependencies: #122044, #122270
2024-03-25 13:16:21 +00:00
f32ce4e28e Delete FakeTensorConverter.__call__ in favor of from_real_tensor (#122270)
It's annoying grepping for `__call__` call-sites so they're now all explicit now. I'd do this to MetaConverter too but that one is way more public, a lot more sites.
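Illustrative before/after at a call site (a sketch; the argument order of `from_real_tensor` is assumed here, not confirmed by this log):

```python
def convert_sketch(converter, fake_mode, real_tensor):
    # before this PR: fake = converter(real_tensor)  (implicit __call__)
    # after this PR: the call site names the method explicitly, so it is easy to grep for
    return converter.from_real_tensor(fake_mode, real_tensor)
```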

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122270
Approved by: https://github.com/eellison
ghstack dependencies: #122044
2024-03-25 13:16:13 +00:00
069270db60 [dynamo] Fix list comparison ops (#122559)
Fixes #122376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122559
Approved by: https://github.com/Skylion007
2024-03-25 07:03:23 +00:00
5891c5b3a6 Factor meta conversion through serializable MetaTensorDesc (#122044)
Fixes https://github.com/pytorch/pytorch/issues/121085

This PR is pretty involved so pay attention to this description.  At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.

However, this transformation is not always entirely mechanical.  Here
is what you need to pay attention to:

- The memo table from real Tensor -> meta/fake Tensor is now broken
  into two memo tables: real Tensor -> stable int id -> meta/fake
  Tensor.  The stable int id is needed so that when we do serialization,
  we know when tensors/storages alias each other and can ensure we preserve
  this aliasing upon deserialization.

  The way I have implemented changes the weak reference behavior.
  Previously, when either the real Tensor OR the meta/fake Tensor went
  dead, we would remove the entry from the memo table.  Now, this only
  removes entries from one of the two memo tables.  This semantically
  makes sense, because the user may have held on to the stable int id
  out of band, and may expect a real Tensor to continue to be numbered
  consistently / expect to be able to lookup a meta/fake tensor from
  this id.  If this is unacceptable, it may be possible to rejigger
  the memo tables so that we have real Tensor -> stable int id
  and real Tensor -> meta/fake Tensor, but TBH I find the new
  implementation a lot simpler, and arranging the memo tables in this
  way means that I have to muck around with the real tensor to save
  to the memo table; in the current implementation, I never pass the
  Tensor to meta_tensor function AT ALL, which means it is impossible
  to accidentally depend on it.

- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
  to be careful not to poke fields when they are not valid.  Previously,
  preconditions were implicitly checked via the conditional structure
  ("is this sparse? is this nested?") that is tested before we start
  reading attributes.  This structure has to be replicated in
  describe_tensor, and I have almost assuredly gotten it wrong on my
  first try (I'll be grinding through it on CI; a careful audit will
  help too, by auditing that I've tested all the same conditionals that
  the original access was guarded by.)

- I originally submitted https://github.com/pytorch/pytorch/pull/121821
  for the symbolic shapes change, but it turned out the way I did it
  there didn't actually work so well for this PR.  I ended up just
  inlining the symbolic shapes allocation logic into MetaConverter
  (look for calls to maybe_specialize_sym_int_with_hint), maybe there
  is a better way to structure it, but what I really want is to
  just read sizes/strides/offset directly off of MetaTensorDesc; I
  don't want another intermediate data structure.

- Some fields aren't serializable. These are documented as "NOT
  serializable".  ctx/type should morally be serializable and I just
  need to setup a contract with subclasses to let them be serialized.
  The fake_mode is used solely to test if we are refakefying with
  a pre-existing ShapeEnv and we want to reuse the SymInt
  directly--serializing this case is hopeless but I am kind of hoping
  after this refactor we do not need this at all.  view_func is not
  serializable because it's a bound C implemented method.  Joel has
  promised me that this is not too difficult to actually expose as a
  true data structure, but this is the edgiest of edge cases and there
  is no reason to deal with it right now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
2024-03-25 06:21:17 +00:00
cf06189a2d [CPPInductor] Fix another out-of-bounds access (#122580)
Not sure what the idea was behind the `{self.tiling_factor}*sizeof(float)/sizeof({DTYPE_TO_CPP[dtype]})` size calculation (perhaps a copy-n-paste error during the refactor made by https://github.com/pytorch/pytorch/pull/97626), but `Vectorized::store(ptr, tiling_factor)` needs at least `tiling_factor` elements, not `tiling_factor/2` (which would be the case with the original calculation if the data type is a 64-bit value such as int64).
Discovered while trying to enable aarch64 vectorized inductor.
Minimal reproducer (reproducible on ARMv8 or any  x86_64 machine that does not support AVX512):
```python
import torch
def do_ds(x, y):
    return torch.diagonal_scatter(x, y)

x=torch.ones(10, 10, dtype=torch.int64)
y=torch.tensor([ 1,  2, -8,  8,  5,  5, -7, -8,  7,  0])
dsc = torch.compile(do_ds)
assert torch.allclose(torch.diagonal_scatter(x, y), dsc(x, y))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122580
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-03-25 04:49:20 +00:00
deeeaded1f Add metas for randint/rand factory functions out overload (#122375)
Fixes https://github.com/pytorch/pytorch/issues/121897

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122375
Approved by: https://github.com/lezcano
2024-03-25 04:01:38 +00:00
cyy
a01d35c7f6 [TorchGen] Remove unused variables (#122576)
This PR removes some unused Python variables from TorchGen scripts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122576
Approved by: https://github.com/Skylion007
2024-03-25 03:31:41 +00:00
e75ecd5618 [BE][veclib] Use is_same_v/enable_if_t (#122533)
`enable_if_t` helper is part of C++14
`is_same_v` helper is part of C++17

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122533
Approved by: https://github.com/Skylion007
2024-03-24 20:57:41 +00:00
14e348b7ad Handle JIT test failure when the GPU is newer than the CUDA compiler or vice versa (#122400)
The test may fail because it either uses target flags newer than the GPU, resulting in failures loading the compiled binary, or targets a GPU for which CUDA has no support yet/anymore
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122400
Approved by: https://github.com/ezyang
2024-03-24 13:58:06 +00:00
36188360dd [dynamo] support torch.distributed.{group.WORLD, GroupMember.WORLD, distributed_c10d._get_default_group} (#120560)
Fixes https://github.com/pytorch/pytorch/issues/120431
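A minimal usage sketch of the now-traceable accessors (assumes an initialized default process group):

```python
import torch
import torch.distributed as dist
from torch.distributed import distributed_c10d

@torch.compile
def fn(x):
    # These lookups previously caused graph breaks; dynamo can now trace them.
    pg = distributed_c10d._get_default_group()
    return x + dist.get_world_size(pg) + dist.get_rank(dist.group.WORLD)
```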

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120560
Approved by: https://github.com/wconstab
2024-03-24 11:13:05 +00:00
3e4a4bea12 [dynamo] Graph break on SymNode control flow (#122546)
Fixes #111918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122546
Approved by: https://github.com/ezyang
2024-03-24 07:22:02 +00:00
adeedc060f [Inductor] Fix unbacked symbol in stride when using item() (#122298)
Fixes #122296

Test: python test/inductor/test_torchinductor_dynamic_shapes.py -k test_item_unbacked_stride_nobreak_cuda

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122298
Approved by: https://github.com/ezyang
2024-03-24 06:27:15 +00:00
cyy
c1fe09dc37 [TorchGen] Add mutable parameter to valuetype_type function in api/cpp.py (#121415)
This PR is a follow-up of #120076; it moves the std::optional<Generator> detection logic into `valuetype_type` of api/cpp.py by adding the mutable parameter, which facilitates future value type changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121415
Approved by: https://github.com/ezyang
2024-03-24 06:11:08 +00:00
ca9606f809 Update COW OpInfo test to include kwargs and expected materialization (#122437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122437
Approved by: https://github.com/ezyang
2024-03-24 06:07:30 +00:00
9d4218c23e Handle JIT test failure when the GPU is newer than the CUDA compiler (#122402)
The test uses the CUDA compute capabilities of the current device to
compile an extension. If nvcc is older than the device, it will fail
with a message like "Unsupported gpu architecture 'compute_80'"
resulting in a `RuntimeError: Error building extension 'cudaext_archflags'`
ultimately failing the test.

This checks for this case and allows execution to continue

Fixes https://github.com/pytorch/pytorch/issues/51950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122402
Approved by: https://github.com/ezyang
2024-03-24 05:36:24 +00:00
cyy
808a035658 [Dynamo][4/N] Enable clang-tidy coverage on torch/csrc/dynamo/* (#122534)
This PR enables clang-tidy coverage on torch/csrc/dynamo/* and also contains other small improvements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122534
Approved by: https://github.com/Skylion007
2024-03-24 05:26:32 +00:00
f0d461beac [vision hash update] update the pinned vision hash (#122536)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122536
Approved by: https://github.com/pytorchbot
2024-03-24 03:42:21 +00:00
5f7e71c411 [dynamo] Add HASATTR guard for UserDefinedObject attrs (#122555)
Fixes #111522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122555
Approved by: https://github.com/Skylion007
2024-03-24 03:41:58 +00:00
07d037674f [inductor] Fix issue with randint + symbolic shapes (#122428)
Fixes #122405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122428
Approved by: https://github.com/ezyang
2024-03-24 03:41:13 +00:00
476585b190 Preserve unbacked SymInt on SymNode (#120816)
Previously, when we applied a replacement, a SymInt that was
previously an unbacked SymInt would then transmute into whatever
we replaced it into (e.g., a constant).

This has a major downside: we often look at SymInts associated with
FX nodes (e.g., the meta of x.item() return) to find out where the
unbacked SymInt was allocated.  If we replace it, we no longer can
find out where, e.g., u1 was allocated!  But we need to know this
so we can generate deferred runtime asserts like u1 == s0.

To solve this problem, I have a special mode for replace, resolve_unbacked=False, which lets you disable substitutions on unbacked SymInts. When reporting node.expr, we preferentially avoid applying unbacked SymInt substitutions. To understand if we might accidentally reapply the substitution later, before we have reached the deferred runtime assert, we must study the calls to simplify() in ShapeEnv. My audit turns up these sites:

* `produce_guards`: this is fine, deferred runtime asserts never show up here, we must NOT have unbacked SymInts show up here. Similarly `get_nontrivial_guards`.
* `_maybe_evaluate_static`: this is fine, we are using this to determine if it is necessary to produce a guard/runtime assert. We don't want to reissue a runtime assert if we've already asserted on it, and replacements can help us understand if this has occurred.
* `_simplify_floor_div`: this is a legitimate bug, it needs to be `resolve_unbacked=False`
* `_refine_ranges`: this is fine, a refined range doesn't affect what runtime asserts we issue
* `_update_divisible`: this updates the `self.divisible` set, which specifies when we can simplify away divisibility constraints. Since this affects replacements only, it won't cause us to oversimplify a user provided expression.

There are some situations where we DO want to always apply the substitution, specifically when we have the duplicate symbol problem (we retrace an item call and get u0 and u1 which refer to the same thing.) I don't want two symbols in this case, so a special `rename_unbacked_to` is provided which sets up the unconditional renaming.

Along the way, I make a refinement to `_update_var_to_range`: if you update a var range for a size-like unbacked SymInt, you are now no longer allowed to set its lower bound below 2. This is because if you could, then our size oblivious tests for it would be inconsistent. Actually, I think there is still some inconsistency, because if you assert `u0 == 0` we will still end up with this in deferred runtime asserts, and we will then use this to simplify these statements to be True everywhere else. Maybe we should forbid this kind of refinement; not done in this PR.

Fixes https://github.com/pytorch/pytorch/issues/119689

Fixes https://github.com/pytorch/pytorch/issues/118385

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120816
Approved by: https://github.com/lezcano
2024-03-24 02:56:16 +00:00
cyy
a52b4e2257 Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use `const std::optional<Generator>&` for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-24 02:12:08 +00:00
788638fcdc Suggest TORCHDYNAMO_EXTENDED_DEBUG_ envvars when appropriate (#122473)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122473
Approved by: https://github.com/lezcano
2024-03-24 01:02:20 +00:00
cdc7f0fd3b Fixed failing pyhpc_equation_of_state due to cpp nodes fusion with compatible ranges (#122420)
Fixes #122283

Description:

PR https://github.com/pytorch/pytorch/pull/120077 introduced cpp nodes fusion with compatible ranges under the assumption that all scheduler nodes inside the fused nodes are the same; however, it turned out that snodes can have different indexing expressions. This PR fixes the incorrect assumption.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122420
Approved by: https://github.com/lezcano
2024-03-24 00:40:31 +00:00
4758837930 [BE] Do not use importlib.load_module (#122542)
To get rid of the annoying
```
<frozen importlib._bootstrap>:283: DeprecationWarning: the load_module() method is deprecated and slated for removal in Python 3.12; use exec_module() instead
```
using recipe from https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly
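That recipe boils down to the following (module name and path here are placeholders):

```python
import importlib.util

spec = importlib.util.spec_from_file_location("my_module", "/path/to/my_module.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)  # replaces the deprecated loader.load_module()
```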

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122542
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-03-23 17:22:26 +00:00
bf40e3f880 [EZ][BE] Add missing acosh op to vec256_float_neon.h (#122513)
As base class has it
ed15370aab/aten/src/ATen/cpu/vec/vec_base.h (L367-L369)

Discovered while attempting to enabling Inductor vectorization on ARM platform

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122513
Approved by: https://github.com/Skylion007
2024-03-23 14:18:02 +00:00
a39e638707 Update bsr_dense_addmm kernel parameters for sizes 3 x 2 ^ N (#122506)
As in the title. The speed-ups for a particular set of input sizes range from about 7 to 85%, depending on the BSR tensor block sizes used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122506
Approved by: https://github.com/cpuhrsch
2024-03-23 11:54:33 +00:00
8a209344c9 Fix access to unitialized memory in VSX vector functions for quantized values (#122399)
Similar to https://github.com/pytorch/pytorch/pull/89833, those functions may access uninitialized memory, leading to undefined behavior/results.
Initialize with zeros as done before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122399
Approved by: https://github.com/ezyang
2024-03-23 06:11:30 +00:00
c677221798 remove torchao dependency (#122524)
Test Plan:
CI

```
buck2 run mode/dev-nosan mode/inplace executorch/examples/models/llama2:export_llama -- -c ~/llama/ultra_new_checkpoint.pt -p ~/llama/params.json -kv -E 8,8 -d fp32 --pt2e_quantize "xnnpack_dynamic" -2
```

```
buck run //executorch/backends/xnnpack/test:test_xnnpack_ops -- executorch.backends.xnnpack.test.ops.linear.TestLinear.test_qd8_fp32_per_token_weight_per_channel_group_int4
```

Differential Revision: D55263008

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122524
Approved by: https://github.com/jerryzh168
2024-03-23 03:18:43 +00:00
19d27a13ea [CPUInductor] Fix out-of-bounds read/write in cvt_int64_to_[fp32|int32] (#122511)
Discovered while debugging regressions in enabling vectorization on ARM platform

Without this change `test_div2_cpu` will fail with invalid values on non-x86 CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122511
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-03-23 01:45:07 +00:00
4d8a3f8bb3 changed aliasing checks to properly recurse for computing last usage (#122444)
Fixes https://github.com/pytorch/pytorch/issues/122457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122444
Approved by: https://github.com/yifuwang, https://github.com/jansel
ghstack dependencies: #121624, #122474
2024-03-23 01:43:21 +00:00
50036ec781 [Inductor] Add a test for creating a cpu inductor-> triton backend (#122396)
Summary: Currently there is a test for adding a backend in test/inductor/test_extension_backend.py for a cpp backend with a new device. However, there is no such test for the Triton backend; it should be possible for a user to create and register their own ExtensionWrapperCodegen and ExtensionScheduling for another non-CUDA device and be able to generate Triton code. For simplicity I have chosen to use a CPU device, as I think it's plausible someone might want to create a CPU Triton backend.
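A hedged sketch of the kind of registration being tested (class names mirror the test; the helper and base classes are inductor internals whose exact names and signatures may differ):

```python
from torch._inductor.codegen.common import register_backend_for_device
from torch._inductor.codegen.wrapper import WrapperCodeGen
from torch._inductor.scheduler import BaseScheduling

class ExtensionWrapperCodegen(WrapperCodeGen):
    pass

class ExtensionScheduling(BaseScheduling):
    pass

# Route CPU codegen through the extension's scheduling and wrapper classes.
register_backend_for_device("cpu", ExtensionScheduling, ExtensionWrapperCodegen)
```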

Unfortunately the generation and running of the code is quite tightly coupled so I've had to use a mocked function to extract the code before running. Suggestions are welcome for better ways to do this.

This is a stepping off point for some additional PRs to make the Triton code path less CUDA specific, as currently there would be no way to test this avenue.

Test plan:
```
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('intermediate_hooks', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 0.394s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122396
Approved by: https://github.com/jansel
2024-03-23 01:14:57 +00:00
41d69ff324 Add a shape inference tool (#120097)
Summary:
Add a shape inference tool that helps to infer each node shape of a given graph module.
1. Given a fx graph and an example input (it doesn't need to be an accurate input that can run forward, but it should have valid dims and data structures), `infer shape` creates an input of symbolic shape
2. Propagating shapes with this symbolic input can catch runtime or value exceptions.
3. These errors are constraints for symbol values, and the constraint solver `infer symbolic values` helps us figure out specific values for each symbol.
4. Finally, we run the shape propagation based on input tensor to get tensor shapes for all nodes in the FX traced module.

Test Plan:
### 1. Test `infer symbol values`
Command:
```
buck2 test mode/opt //caffe2/test:fx_experimental -- test_infer_symbol_values
```

### 2. Test `infer shape`
Command:
```
buck2 test mode/opt //caffe2/test:fx_experimental -- test_infer_symbol_values
```
Inferred shape result like: P897560514

Differential Revision: D53593702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120097
Approved by: https://github.com/yf225
2024-03-23 00:23:24 +00:00
29bca8547b Fix failing test_cpu_repro without vectorization support (#117262)
At least the following tests fail when there is no supported vector ISA:
test_lowp_fp_neg_abs
test_non_contiguous_index_with_constant_stride
test_scalar_mul_bfloat16
test_transpose_non_contiguous
test_transpose_sum2d_cpu_only
test_transpose_sum_outer
test_transpose_vertical_sum_cpu_only
test_vertical_sum_cpu_only

Those tests assert `metrics.generated_cpp_vec_kernel_count` is nonzero
which is never the case without a supported vector ISA, e.g. on PPC and
maybe on AArch.

Skip those tests with a new decorator and use the simpler one where an equivalent is already used

Some usages of `metrics.generated_cpp_vec_kernel_count` were guarded by a check instead of skipping the test. I tried to apply that instead of a skip where the test looked similar enough to where that was previously done.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117262
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-23 00:03:55 +00:00
a84f1d3def [effects] Fix backwards handling (#122346)
I didn't previously test the `.backwards()` call, and when testing on #122348 I realized we were missing some token handling in some places.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122346
Approved by: https://github.com/zou3519
2024-03-22 23:31:52 +00:00
e7fa3f7812 AOTDispatch: allow subclasses to correct when we guess metadata of tangents incorrectly (#118670)
This PR is enough to fix https://github.com/pytorch/pytorch/issues/118600.

More description of the problem is in the issue, but the high-level problem is similar to the "tangents might be non-contiguous" problem that we handle today, via forcing all tangents to be contiguous. There, the problem was something like:

"We guessed the tangent strides incorrectly, because strides on the runtime tangents were different from strides on the forward outputs, which we used to generate tangents"

Here, the problem is similar:

"We guessed the tangent tensor subclass's metadata incorrectly, because the runtime tangent was a subclass with different metadata than the forward output subclass".

This happened in an internal DTensor issue, where the metadata in question was the `placements` (shard vs. replicate vs. Partial).

One option is to solve this problem via backward guards. This is needed to unblock internal though, so I figured handling this similarly to how we handle non-contiguous tangents would be reasonable. I did this by:

(1) Assert that the metadata on subclass tangents is the same as what we guessed, and if not raise a loud error

(2) In the error message, provide the name of an optional method that the subclass must implement to handle this case:

`def __force_same_metadata__(self, metadata_tensor):`: If the forward output had a `Replicate()` placement, but the runtime tangent had a `Shard(1)` placement, this method allows a subclass to take the tangent and "convert" it to one with a `Replicate()` placement.

`__force_standard_metadata__(self)`: One issue is that there is another placement called `_Partial`, and its semantics are such that DTensor is **unable** to convert a DTensor with some placement type into another DTensor with a `_Partial` placement.

`__force_standard_metadata__` is now called on all (fake) subclass forward outs at trace-time to generate tangents, and gives subclasses a chance to "fix" any outputs with metadata that they cannot convert to later. Morally, this is similar to the fact that we force a `contiguous()` call on all tangents at trace-time.
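A hedged sketch of a subclass opting into the hooks described above (the dunder names come from this PR's description; the helpers are hypothetical no-ops for illustration):

```python
import torch


def convert_to_matching_metadata(t, metadata_tensor):
    # Hypothetical helper: convert t's subclass metadata to match metadata_tensor's.
    return t


def pick_standard_metadata(t):
    # Hypothetical helper: pick metadata that any runtime tangent can be converted to.
    return t


class MySubclassTensor(torch.Tensor):
    def __force_same_metadata__(self, metadata_tensor):
        # e.g. convert a Shard(1) runtime tangent to the Replicate() guessed at trace time.
        return convert_to_matching_metadata(self, metadata_tensor)

    def __force_standard_metadata__(self):
        # e.g. avoid generating tangents with a _Partial-like placement at trace time.
        return pick_standard_metadata(self)
```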

I'm interested in thoughts/feedback! Two new dunder methods on traceable subclasses are definitely a contentious change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118670
Approved by: https://github.com/ezyang
2024-03-22 23:16:08 +00:00
f7b8d8e249 Support for sapling scm (#122072)
We can use Sapling (hg) with the pytorch repo, but there are a couple of minor issues; this teaches our scripting to be happier with having either a git or hg repo.

This change fixes some issues in:
- setup.py
- lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122072
Approved by: https://github.com/ezyang
2024-03-22 22:59:16 +00:00
cyy
482f6c4693 [Dynamo][3/N] Fix clang-tidy warnings in torch/csrc/dynamo/* (#122392)
This PR continues to clean clang-tidy warnings in torch/csrc/dynamo/*, following #122362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122392
Approved by: https://github.com/ezyang
2024-03-22 22:57:41 +00:00
3f99306452 [export] Remove from_export flag (#122500)
Summary: The flag from_export was incorrectly included in a previous diff (https://www.internalfb.com/diff/D54314379) - it was intended for helping with ExportedProgram verification, but was no longer needed in the final implementation.

Test Plan: Changes no functionality, test/export already covers everything

Differential Revision: D55205857

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122500
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2024-03-22 22:55:14 +00:00
03184a82dd [TD] TD on ASAN PR jobs (#122332)
Low impact CPU jobs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122332
Approved by: https://github.com/huydhn
2024-03-22 22:32:51 +00:00
271cc687de Audit retracibility errors and fix some ez ones (#122461)
Summary: Title

Test Plan: CI

Differential Revision: D55227094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122461
Approved by: https://github.com/zhxchen17
2024-03-22 21:31:51 +00:00
29132c2e47 Prevent dup initializers when ONNXProgram.save is called many times (#122435)
Fixes https://github.com/pytorch/pytorch/issues/122351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122435
Approved by: https://github.com/titaiwangms
ghstack dependencies: #122196, #122230
2024-03-22 21:03:15 +00:00
4eaa000acc Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo
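A minimal usage sketch of what this enables (dynamo tracing through a `torch.func.jvp` call):

```python
import torch
from torch.func import jvp

@torch.compile
def f(x, t):
    # Returns (sin(x), cos(x) * t); previously this call was not traceable by dynamo.
    return jvp(torch.sin, (x,), (t,))

primal_out, tangent_out = f(torch.randn(3), torch.randn(3))
```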

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-22 20:25:47 +00:00
3795ebe925 Revert "[Inductor] Make codecache CUDA compilation more robust & flexible (#121490)"
This reverts commit 6bbd697306851b785b51b4d0545c1ef9365ddaa6.

Reverted https://github.com/pytorch/pytorch/pull/121490 on behalf of https://github.com/huydhn due to Sorry for reverting you change but I think it is failing on ROCm, i.e. 700c92e1b9 ([comment](https://github.com/pytorch/pytorch/pull/121490#issuecomment-2015829464))
2024-03-22 20:11:47 +00:00
97d3bf71b9 Revert "[Inductor Cutlass backend] GEMM size threshold for Cutlass backend usage (#121491)"
This reverts commit 700c92e1b9cb6fae2610d08e5a960273c4dd1697.

Reverted https://github.com/pytorch/pytorch/pull/121491 on behalf of https://github.com/huydhn due to Sorry for reverting you change but I think it is failing on ROCm, i.e. 700c92e1b9 ([comment](https://github.com/pytorch/pytorch/pull/121490#issuecomment-2015829464))
2024-03-22 20:11:47 +00:00
8013c4409f [inductor] config to control whether we assume inputs are aligned (#122158)
**Motivation**: https://github.com/pytorch/pytorch/issues/112771

**Summary**: Inductor generates triton that assumes that inputs are going to be 16-byte aligned. If the inputs aren't aligned, Inductor clones the inputs. This PR introduces a config option to not do this: when assume_aligned_inputs=False, Inductor will _not_ pass inputs as being divisible_by_16, and Inductor will not make clones. This can generate code that might be a bit slower, but this tradeoff can be worth it in some scenarios where you might otherwise make a lot of clones.
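A minimal usage sketch, assuming the option is toggled via the inductor config module:

```python
import torch
from torch._inductor import config as inductor_config

# Opt out of the 16-byte alignment assumption: no divisible_by_16 hints, no clones.
inductor_config.assume_aligned_inputs = False

@torch.compile
def f(x):
    return x * 2

out = f(torch.randn(8, 8))
```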

Ideally, we could do this on a per-tensor basis. But this would be a lot of work, and attempts to add guards on storage offsets to do this automatically have run into issues: recompilations and excessive time to generate/check guards.

**Tests** https://github.com/pytorch/pytorch/pull/122159 flips this to False. It didn't run through all errors, but the ones we see are all expected failures: divisible_by_16 changes; triton kernel caching fails if we call the same triton kernel multiple times (this makes sense because the first call will have unaligned inputs, but subsequent calls have aligned inputs); and some xfailed tests start passing.

**Alternatives/RFC**:
* Is this the right thing to do with cudagraphs?
* Elias and Jason mentioned that we probably still want to make clones if we're dealing with unaligned inputs to matmuls. Is this something we should add in this config option? (In the use case I'm targeting, it seems like we don't need this optimization right now)

Differential Revision: [D55079094](https://our.internmc.facebook.com/intern/diff/D55079094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122158
Approved by: https://github.com/ezyang
2024-03-22 20:03:38 +00:00
5790096059 [dynamo] Remove uses of raise unimplemented (#122136)
`unimplemented` is a function that raises an error, so
`raise unimplemented(...)` never reaches the `raise`.
Another related issue is that `raise unimplemented(...) from e`
doesn't attach the exception cause correctly. I fix this by adding
a `from_exc` argument to `unimplemented`.
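A simplified sketch of the corrected pattern (the exception type here is a stand-in, not dynamo's actual class):

```python
class UnsupportedSketch(RuntimeError):
    pass

def unimplemented(msg, *, from_exc=None):
    # The helper itself raises, so callers should not wrap it in another `raise`.
    if from_exc is not None:
        raise UnsupportedSketch(msg) from from_exc  # cause attached correctly
    raise UnsupportedSketch(msg)

# before: raise unimplemented("msg") from e   # outer raise/from is dead code
# after:  unimplemented("msg", from_exc=e)    # exception cause preserved
```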

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122136
Approved by: https://github.com/lezcano
2024-03-22 19:29:58 +00:00
ed15370aab [aoti] Add handling of ir.Constants in promote_constants (#122419)
This issue popped up when enabling predispatch IR on the benchmarks (https://github.com/pytorch/pytorch/pull/122225)

On the following model:
```
class M(torch.nn.Module):
    def __init__(self, device):
        super().__init__()
        self.device = device

    def forward(self, x):
        t = torch.tensor(x.size(-1), device=self.device, dtype=torch.float)
        t = torch.sqrt(t * 3)
        return x * t
```

We get the following error:
```
======================================================================
ERROR: test_constant_abi_compatible_cuda (__main__.AOTInductorTestABICompatibleCuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 2741, in wrapper
    method(*args, **kwargs)
  File "/data/users/angelayi/pytorch/test/inductor/test_torchinductor.py", line 9232, in new_test
    return value(self)
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor.py", line 922, in test_constant
    self.check_model(M(self.device), (torch.randn(5, 5, device=self.device),))
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor.py", line 91, in check_model
    actual = AOTIRunnerUtil.run(
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor_utils.py", line 102, in run
    so_path = AOTIRunnerUtil.compile(
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor_utils.py", line 40, in compile
    so_path = torch._inductor.aot_compile_ep(
  File "/data/users/angelayi/pytorch/torch/_inductor/__init__.py", line 150, in aot_compile_ep
    return compile_fx_aot(
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1005, in compile_fx_aot
    compiled_lib_path = compile_fx(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1111, in compile_fx
    return compile_fx(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1145, in compile_fx
    return compile_fx(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1336, in compile_fx
    return inference_compiler(unlifted_gm, example_inputs_)
  File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1266, in fw_compiler_base
    return inner_compile(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/data/users/angelayi/pytorch/torch/_inductor/debug.py", line 304, in inner
    return fn(*args, **kwargs)
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 447, in compile_fx_inner
    compiled_graph = fx_codegen_and_compile(
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 707, in fx_codegen_and_compile
    graph.run(*example_inputs)
  File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 612, in run
    return super().run(*args)
  File "/data/users/angelayi/pytorch/torch/fx/interpreter.py", line 145, in run
    self.env[node] = self.run_node(node)
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 957, in run_node
    result = super().run_node(n)
  File "/data/users/angelayi/pytorch/torch/fx/interpreter.py", line 202, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 819, in call_function
    raise LoweringException(e, target, args, kwargs).with_traceback(
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 816, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 298, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 5340, in mul
    return make_pointwise(fn)(a, b)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 409, in inner
    inputs = promote_constants(inputs, override_return_dtype)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 373, in promote_constants
    ex = next(x for x in inputs if isinstance(x, (TensorBox, ExpandView)))
torch._inductor.exc.LoweringException: StopIteration:
  target: aten.mul.Tensor
  args[0]: Constant(value=5.0, dtype=torch.float32, device=device(type='cuda', index=0))
  args[1]: 3
```

So I added an additional case in `promote_constants` to handle ir.Constants, and now it works! Although please let me know if this is the wrong approach. Here's a paste of the full run with the inductor logs: P1198927007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122419
Approved by: https://github.com/eellison, https://github.com/desertfire, https://github.com/chenyang78
2024-03-22 18:39:36 +00:00
cyy
52e9049ffa Remove unused variables (#122496)
This PR removes several unused variables in the code base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122496
Approved by: https://github.com/ezyang
2024-03-22 18:04:09 +00:00
bbe846f430 Add symbolic_opset19.py and symbolic_opset20.py to support opset 19/20, extend opset 18 support (#118828)
Start to fix https://github.com/pytorch/pytorch/issues/114801

Co-authored-by: Thiago Crepaldi <thiagofc@microsoft.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118828
Approved by: https://github.com/thiagocrepaldi
2024-03-22 18:01:33 +00:00
34d33df056 [DCP] Check if pg exists in async before checking for cpu PG (#122316)
Check if the PG exists before checking for a CPU PG in the async save path.

This PR enables using async_save even if PG is not initialized.

Differential Revision: [D54868689](https://our.internmc.facebook.com/intern/diff/D54868689/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D54868689/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122316
Approved by: https://github.com/shuqiangzhang, https://github.com/XilunWu
2024-03-22 18:01:11 +00:00
400cc518fc pt2 dper passes: run shape prop before each pass (#122451)
Summary: Most passes rely on shape info, so we need to run shape prop after each pass.

Reviewed By: frank-wei

Differential Revision: D55221119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122451
Approved by: https://github.com/frank-wei
2024-03-22 17:57:25 +00:00
152fa9ecc2 skip moondream for training (#122483)
The model shows up as a failing model on the dashboard for training, but the model is not implemented for training (at least for now):
2196021e9b/torchbenchmark/models/moondream/__init__.py (L6)

Skip it in the dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122483
Approved by: https://github.com/eellison
2024-03-22 17:35:52 +00:00
a3d4eaf253 [inductor] device guard for max autotune benchmark (#122479)
Internal users reported failures with max-autotune when tensors are not on device 0. It turns out that we may take tensors on, say, device 6 and run the benchmarking kernel on them on device 0.

This PR enforces that we do benchmarking for max-autotune on the correct device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122479
Approved by: https://github.com/xintwfb, https://github.com/Chillee
2024-03-22 17:27:53 +00:00
3db64c1955 [NCCL PG] Enable ncclCommDevIdxMap unconditionally (#122049)
Differential Revision: D54993977

### Summary
The initial purpose of ncclCommDevIdxMap is to support NCCL zero copy algorithms. Therefore, it is only enabled (with its values filled) if useTensorRegisterAllocatorHook_ is set to true. However, now we rely on it to support dumping NCCL information in a single PG. So we need it to be always available, regardless of whether we enabled useTensorRegisterAllocatorHook_.
Move the code of filling ncclCommDevIdxMap out of if (useTensorRegisterAllocatorHook_) statement.

### Test Plan
See diff

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122049
Approved by: https://github.com/shuqiangzhang
2024-03-22 17:10:33 +00:00
f305c96cac [DCP] Add bytesIO object to test_e2e_save_and_load (#122112)
Added a TestTrainstate that includes BytesIO checkpoint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122112
Approved by: https://github.com/LucasLLC
2024-03-22 16:57:13 +00:00
86082f1fdc [aot_inductor] added runtime checks for input/output tensors in debug compile mode (#122047)
This PR added runtime checks to guard the dtypes and shapes of input/output tensors.
Currently, we enable these only for debug compilation
(i.e. aot_inductor.debug_compile is True) in abi_compatible mode.

Differential Revision: [D54993148](https://our.internmc.facebook.com/intern/diff/D54993148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122047
Approved by: https://github.com/desertfire
2024-03-22 16:40:33 +00:00
90a13c3c5b Added a check in register_lowering to avoid decomposed ops (#117632)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117632
Approved by: https://github.com/lezcano
2024-03-22 16:38:31 +00:00
9347a79f1c [Watchdog Timer] Clear timer for already terminated process (#122324)
Summary:
Handle cases where a worker process is terminated without releasing its timer request; this scenario causes the process to be reaped at expiry.

Remove the non-existent process during clear timer.

Test Plan: unit tests

Differential Revision: D55099773

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122324
Approved by: https://github.com/d4l3k
2024-03-22 15:48:03 +00:00
018f5e2c32 Fix unused variable warning in int4mm.cu (#122286)
Fix the following warning while compilation:
```
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu: In function ‘at::Tensor at::native::_weight_int4pack_mm_cuda(const at::Tensor&, const at::Tensor&, int64_t, const at::Tensor&)’:
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu:871:6: warning: variable ‘stream’ set but not used [-Wunused-but-set-variable]
  871 |   auto stream = at::cuda::getCurrentCUDAStream();
      |      ^~~~~~
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu: In function ‘at::Tensor at::native::_convert_weight_to_int4pack_cuda(const at::Tensor&, int64_t)’:
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu:1044:6: warning: variable ‘stream’ set but not used [-Wunused-but-set-variable]
 1044 |   auto stream = at::cuda::getCurrentCUDAStream();
      |      ^~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122286
Approved by: https://github.com/soulitzer
2024-03-22 15:46:18 +00:00
7fd14ebb52 [export] Use randomized inputs to examples. (#122424)
Summary: As titled; replacing all torch.ones with torch.randn.

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D55206441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122424
Approved by: https://github.com/tugsbayasgalan
2024-03-22 15:32:28 +00:00
60bc29aa0b Revert "[Quant] [PT2] Add SiLU into X86InductorQuantizer Conv2d Unary Annotation (#122267)"
This reverts commit 2c6eeb26d3f61fba352ad51fd8653120937a20f3.

Reverted https://github.com/pytorch/pytorch/pull/122267 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
b30b396d05 Revert "[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op SiLU (#122268)"
This reverts commit 99f0fec7d0873d627e8c7f2dec65818d725424b0.

Reverted https://github.com/pytorch/pytorch/pull/122268 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
777ac511cc Revert "[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardSwish with int8-mix-bf16 (#122373)"
This reverts commit 783fd89ff1cf401e484c20d14b16823abf20d87d.

Reverted https://github.com/pytorch/pytorch/pull/122373 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
dbedc6bb7c Revert "[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardTanh with int8-mix-bf16 (#122374)"
This reverts commit 23a6d74f9352e0afb37750fee300d077c4ba9393.

Reverted https://github.com/pytorch/pytorch/pull/122374 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
02fee6caec Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076)"
This reverts commit ecbe82b9cec75324b7efb58e1d9cae6b35b71bdc.

Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/jeanschmidt due to Reverting in order to check if this will fix XLA trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-2015272644))
2024-03-22 14:53:45 +00:00
e6986e4317 Public API for NJT construction from jagged components (#121518)
This PR introduces `torch.nested.nested_tensor_from_jagged(values, offsets=None, lengths=None, jagged_dim=1)` (bikeshedding welcome). This is intended to be the main entrypoint for getting an NJT from the `(values, offsets, lengths)` components. The returned NJT is a view of the `values` component.

Note that `torch.nested.nested_tensor()` / `torch.nested.as_nested_tensor()` already exist for constructing an NJT from a list of tensors.
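
A small usage sketch of the new entrypoint (shapes chosen purely for illustration):

```
import torch

values = torch.randn(9, 4)             # 2 + 3 + 4 rows packed back to back
offsets = torch.tensor([0, 2, 5, 9])   # constituent boundaries into `values`
nt = torch.nested.nested_tensor_from_jagged(values, offsets=offsets)
# nt is a jagged-layout NestedTensor with 3 variable-length constituents and
# is a view of `values`, so gradients flow back into `values`.
```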

TODO:
* Some doc formatting; suggestions welcome there
* Tests / examples using `jagged_dim != 1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121518
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #113279, #113280
2024-03-22 14:48:22 +00:00
65c37fe05a AOTAutograd: ensure traced tangent subclass metadata takes non-contiguous outputs into account (#118669)
Fixes https://github.com/pytorch/pytorch/issues/118596.

The issue was as follows:

(1) Whenever AOTAutograd sees an output that is non-contiguous, that it needs a tangent for, it forces the tangent that it generates to be contiguous during tracing

(2) However: if this tangent is a subclass, we need to generate code to flatten/unflatten the subclass at runtime.

(3) To do so, we use the metadata stashed here: https://github.com/pytorch/pytorch/blob/main/torch/_functorch/_aot_autograd/schemas.py#L231

(4) However, this metadata was **wrong** - it was generated by inspecting the tangent, **before** we made the tangent contiguous.

The fix in this PR basically moves the logic make `traced_tangents` contiguous earlier, at the time that we first generate `ViewAndMutationMetadata`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118669
Approved by: https://github.com/zou3519
ghstack dependencies: #118803, #119947
2024-03-22 14:42:27 +00:00
09be5800c8 dynamo: support placement kwargs for DTensor.to_local() (#119947)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119947
Approved by: https://github.com/wanchaol, https://github.com/yoyoyocmu
ghstack dependencies: #118803
2024-03-22 14:42:27 +00:00
2e44b12dd4 dynamo: handle DTensor.device_mesh.device_type (#118803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118803
Approved by: https://github.com/wanchaol, https://github.com/yanboliang
2024-03-22 14:42:22 +00:00
ea8e0c75c7 [quant][pt2] Fix create FQ with FixedQParamsQSpec (#122104)
Summary: Before we just returned a _PartialWrapper object when
using FixedQParamsQuantizationSpec in QAT. This is wrong and
we should return a FQ object instead.

Differential Revision: [D55021106](https://our.internmc.facebook.com/intern/diff/D55021106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122104
Approved by: https://github.com/jerryzh168
2024-03-22 14:23:05 +00:00
6e6891e843 [jit] Fix _batch_norm_with_update shape function (#122430)
Summary: We used `native_batch_norm`'s shape function before,
but the schemas are actually different. We need to create new
shape functions for `_batch_norm_with_update` specifically.

Test Plan:
buck2 test '@fbcode//mode/opt-tsan' fbcode//caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - TestShapeGraphLinting.Basic'

Reviewers: bdhirsh, davidberard98, eellison

Differential Revision: [D55211182](https://our.internmc.facebook.com/intern/diff/D55211182)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122430
Approved by: https://github.com/eellison, https://github.com/bdhirsh
2024-03-22 14:21:57 +00:00
23a6d74f93 [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardTanh with int8-mix-bf16 (#122374)
**Summary**
Enable the fusion pattern of `QConv2d -> hardtanh` lowering for int8-mixed-bf16 case.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardtanh_int8_mixed_bf16_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122374
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267, #122268, #122373
2024-03-22 13:13:14 +00:00
f65373e278 Revert "Factor meta conversion through serializable MetaTensorDesc (#122044)"
This reverts commit e2d89e970480d7e5b10a77928442d8caf94e0e84.

Reverted https://github.com/pytorch/pytorch/pull/122044 on behalf of https://github.com/jeanschmidt due to Seems that some landrace caused this PR to break lint ([comment](https://github.com/pytorch/pytorch/pull/122044#issuecomment-2015025490))
2024-03-22 12:46:21 +00:00
700c92e1b9 [Inductor Cutlass backend] GEMM size threshold for Cutlass backend usage (#121491)
* Adds a configurable GEMM size threshold for the usage of Cutlass GEMM Kernels **_inductor.config.cutlass_backend_min_gemm_size**

 * During GEMM algorithm choice generation: **if no viable choices can be generated using the configured backends, the ATen backend will be used as a fallback backend**, even if it is not enabled in **_inductor.config.max_autotune_gemm_backends**

Test plan:
CI
Additional unit test in test_cutlass_backend.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121491
Approved by: https://github.com/jansel
ghstack dependencies: #121490
2024-03-22 10:58:43 +00:00
d34514f8db Renamed mutationlayout/aliasedlayout (#122474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122474
Approved by: https://github.com/jansel
ghstack dependencies: #121624
2024-03-22 08:32:14 +00:00
eca30df846 Added load_args to repro (#121624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121624
Approved by: https://github.com/ezyang
2024-03-22 08:32:14 +00:00
783fd89ff1 [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardSwish with int8-mix-bf16 (#122373)
**Summary**
Enable the fusion pattern of `QConv2d -> hardswish` lowering for int8-mixed-bf16 case.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardswish_int8_mixed_bf16_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122373
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267, #122268
2024-03-22 08:17:57 +00:00
99f0fec7d0 [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op SiLU (#122268)
**Summary**
Enable the fusion pattern of `QConv2d -> silu` lowering to `swish` as `QConv2d` post operator.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_int8_mixed_bf16_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_silu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122268
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267
2024-03-22 08:15:28 +00:00
bb75313f0a [dynamo] Optimize handling of BINARY_OP (#122465)
This saves ~0.1s on https://dev-discuss.pytorch.org/t/a-torchdynamo-trace-time-ablation-study/1961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122465
Approved by: https://github.com/oulgen
2024-03-22 08:14:58 +00:00
2c6eeb26d3 [Quant] [PT2] Add SiLU into X86InductorQuantizer Conv2d Unary Annotation (#122267)
**Summary**
Add `SiLU` into X86InductorQuantizer Conv2d Unary Annotation

**TestPlan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_unary
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_unary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122267
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266
2024-03-22 08:12:23 +00:00
6bbd697306 [Inductor] Make codecache CUDA compilation more robust & flexible (#121490)
Minor changes which make the CUDA compilation within _inductor/codecache.py
more robust and flexible.

Test plan:
CI
Additional test in test_codecache.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121490
Approved by: https://github.com/jansel
2024-03-22 08:12:11 +00:00
a337ee0a3a [Quant] Enable QConv2d with silu post op (#122266)
**Summary**
Enable QConv2d implementation with post op `silu`

**Test Plan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_silu_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122266
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-03-22 07:58:45 +00:00
b78e8c0d37 remove duplicate method run_subtests (#122421)
Fixes #121654

I have removed the duplicate test `run_subtests` from `common_dtensor.py` and `common_fsdp.py` and moved it to `common_distributed.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122421
Approved by: https://github.com/soulitzer
2024-03-22 07:00:49 +00:00
6ba85cfc2a Fixed memory leak in Python dispatcher w.r.t. THPDevice. (#122439)
Fixes the memory leak reported in #122417.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122439
Approved by: https://github.com/soulitzer
2024-03-22 06:44:12 +00:00
3600778ede Do not create a new node if no normalization is needed (#122330)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122330
Approved by: https://github.com/jansel
2024-03-22 05:51:28 +00:00
e2d89e9704 Factor meta conversion through serializable MetaTensorDesc (#122044)
Fixes https://github.com/pytorch/pytorch/issues/121085

This PR is pretty involved, so pay attention to this description.  At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.

However, this transformation is not always entirely mechanical.  Here
is what you need to pay attention to:

- The memo table from real Tensor -> meta/fake Tensor is now broken
  into two memo tables: real Tensor -> stable int id -> meta/fake
  Tensor (see the sketch after this list).  The stable int id is needed so that
  when we do serialization, we know when tensors/storages alias each other and
  can ensure we preserve this aliasing upon deserialization.

  The way I have implemented changes the weak reference behavior.
  Previously, when either the real Tensor OR the meta/fake Tensor went
  dead, we would remove the entry from the memo table.  Now, this only
  removes entries from one of the two memo tables.  This semantically
  makes sense, because the user may have held on to the stable int id
  out of band, and may expect a real Tensor to continue to be numbered
  consistently / expect to be able to lookup a meta/fake tensor from
  this id.  If this is unacceptable, it may be possible to rejigger
  the memo tables so that we have real Tensor -> stable int id
  and real Tensor -> meta/fake Tensor, but TBH I find the new
  implementation a lot simpler, and arranging the memo tables in this
  way means that I have to muck around with the real tensor to save
  to the memo table; in the current implementation, I never pass the
  Tensor to meta_tensor function AT ALL, which means it is impossible
  to accidentally depend on it.

- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
  to be careful not to poke fields when they are not valid.  Previously,
  preconditions were implicitly checked via the conditional structure
  ("is this sparse? is this nested?") that is tested before we start
  reading attributes.  This structure has to be replicated in
  describe_tensor, and I have almost assuredly gotten it wrong on my
  first try (I'll be grinding through it on CI; a careful audit will
  help too, by auditing that I've tested all the same conditionals that
  the original access was guarded by.)

- I originally submitted https://github.com/pytorch/pytorch/pull/121821
  for the symbolic shapes change, but it turned out the way I did it
  there didn't actually work so well for this PR.  I ended up just
  inlining the symbolic shapes allocation logic into MetaConverter
  (look for calls to maybe_specialize_sym_int_with_hint), maybe there
  is a better way to structure it, but what I really want is to
  just read sizes/strides/offset directly off of MetaTensorDesc; I
  don't want another intermediate data structure.

- Some fields aren't serializable. These are documented as "NOT
  serializable".  ctx/type should morally be serializable and I just
  need to setup a contract with subclasses to let them be serialized.
  The fake_mode is used solely to test if we are refakefying with
  a pre-existing ShapeEnv and we want to reuse the SymInt
  directly--serializing this case is hopeless but I am kind of hoping
  after this refactor we do not need this at all.  view_func is not
  serializable because it's a bound C implemented method.  Joel has
  promised me that this is not too difficult to actually expose as a
  true data structure, but this is the edgiest of edge cases and there
  is no reason to deal with it right now.
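
A minimal sketch of the two-level memo structure mentioned in the first bullet above (names here are illustrative, not the actual implementation):

```
from torch.utils.weak import WeakIdKeyDictionary

class MemoTablesSketch:
    def __init__(self):
        self.next_id = 0
        self.tensor_to_id = WeakIdKeyDictionary()  # real Tensor -> stable int id
        self.id_to_meta = {}                       # stable int id -> meta/fake Tensor

    def stable_id(self, t):
        # the stable id survives even if the meta/fake Tensor entry goes away
        if t not in self.tensor_to_id:
            self.tensor_to_id[t] = self.next_id
            self.next_id += 1
        return self.tensor_to_id[t]
```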

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
ghstack dependencies: #122018
2024-03-22 03:56:34 +00:00
cyy
ecbe82b9ce Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-22 03:49:31 +00:00
ef0d470eb3 [vision hash update] update the pinned vision hash (#122453)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122453
Approved by: https://github.com/pytorchbot
2024-03-22 03:37:11 +00:00
fb57d1699b [export] Fix handling output in remove_effect_tokens_pass (#122357)
Added handling for updating the output_spec in the graph signature if the result of a with_effects call is an output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122357
Approved by: https://github.com/zhxchen17
2024-03-22 03:35:59 +00:00
09eb07bee8 Introduce XPU implementation for PyTorch ATen operators (#120891)
As a follow-up to #114835 and #119682, we add a limited set of ATen operator implementations for XPU. With this PR, the blocking issue for oneDNN operations and the Inductor XPU backend will be resolved, as the two components depend on these operations to support their basic features.

The added ATen operators include:

- `copy_`, `_to_copy`, `_copy_from_and_resize`, `clone`
- `view`, `view_as_real`, `view_as_complex`,
- `as_strided`, `_reshape_alias`, `resize_`, `resize_as_`,
- `add`/`add_`, `sub`/`sub_`, `mul`/`mul_`, `div`/`div_`, `abs`,
- `empty`, `empty_strided`,
- `fill_`, `zeros_`.

Co-authored-by: Wang, Eikan <eikan.wang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120891
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/atalman
2024-03-22 03:31:04 +00:00
e419011471 [inductor] Add torch.while_loop support to JIT Inductor (#122069)
Summary: `torch.while_loop` HOP support is added to JIT Inductor. The test coverage is limited due to the functionality constraints of the upstream `torch.while_loop` op in Dynamo / Export. When those are lifted, we'll add more tests (see TODO-s in the test file).

AOT Inductor support will be added in a follow-up PR.

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 38 tests in 159.387s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122069
Approved by: https://github.com/jansel, https://github.com/eellison
2024-03-22 02:45:27 +00:00
5e0440edb4 Revert "Optimize multi_tensor_apply (take 2) (#119764)"
This reverts commit 0b68a28c87df2c6eb2cf530be4659b5a2f8a95b0.

Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm job in trunk 0b68a28c87.  Please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2014190124))
2024-03-22 02:18:28 +00:00
470b44c048 Support for torch.nested.as_nested_tensor(t) (#113280)
This PR adds support for tensor inputs to `as_nested_tensor()`. The tensor is treated as a batch of consistently-sized constituents. It utilizes `_nested_view_from_values_offsets()` to return a real view that allows for propagating gradients into inputs.
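
A hedged usage sketch (the `layout` keyword is shown for illustration; the exact defaults may differ):

```
import torch

t = torch.randn(3, 5, 4)   # treated as a batch of 3 consistently-sized (5, 4) constituents
nt = torch.nested.as_nested_tensor(t, layout=torch.jagged)
# nt is a real view over t, so gradients computed through nt propagate back into t.
```
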
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113280
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #113279
2024-03-22 02:12:37 +00:00
cd6bfc7965 Proper view support for jagged layout NestedTensor (#113279)
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
    * `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
    * This op is implemented on the Python side using torch.library so we can return a subclass instance
    * `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
    * The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
    * `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
    * `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
    * Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)

With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.

Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
2024-03-22 02:12:36 +00:00
bde22835c6 [PT2] - Guard oblivious on meta registrations (#122216)
Summary:
```
[trainer0|0]:Potential framework code culprit (scroll up for full backtrace):
[trainer0|0]:  File "/mnt/xarfuse/uid-539346/56d4bb3d-seed-nspid4026531836_cgpid183208940-ns-4026531840/torch/_meta_registrations.py", line 5043, in scatter_gather_dtype_check
[trainer0|0]:    if index.numel() != 0:
```
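
A hedged sketch of the size-oblivious pattern such fixes typically apply (the wrapper function here is hypothetical; the actual call site is the meta registration shown above):

```
import torch
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

def dtype_check_sketch(index: torch.Tensor):
    # instead of the plain data-dependent `if index.numel() != 0:`, make the
    # branch size-oblivious so unbacked symbolic sizes don't force a guard
    if guard_size_oblivious(index.numel() != 0):
        pass  # ...dtype checks would go here...
```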

Test Plan: General CI.

Reviewed By: ezyang

Differential Revision: D54689183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122216
Approved by: https://github.com/ezyang
2024-03-22 01:36:03 +00:00
4f93b3d958 [Dort] Reduce excessive warning to info (#122442)
No need to warn when an op can be exported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122442
Approved by: https://github.com/thiagocrepaldi
2024-03-22 01:09:33 +00:00
a001b4b048 Inductor: Don't clamp views when the views come from split_with_sizes (#122149)
Summary:
Fixes #122126

`split_with_sizes` don't need clamping.

Test Plan: Added test + CI

Differential Revision: D55043320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122149
Approved by: https://github.com/ezyang
2024-03-22 00:55:36 +00:00
b1fa0ce4aa [export] build the infra to rollout predispatch export. (#122326)
Test Plan:
fbcode:caffe2/test/quantization:test_quantization
fbcode:bolt/nn/executorch/backends/tests:qnn_test
fbcode:on_device_ai/helios/compiler_tests/...
fbcode:pyspeech/tests:pyspeech_utils_test_oss
fbcode:caffe2/test:quantization_pt2e_qat
fbcode:on_device_ai/Assistant/Jarvis/tests:test_custom_ops
fbcode:modai/test:test_modai
fbcode:executorch/exir/backend/test:test_partitioner

Differential Revision: D55133846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122326
Approved by: https://github.com/tugsbayasgalan
2024-03-22 00:55:10 +00:00
4b535906aa Better handle test-config labels on PR (#122155)
I have some minor fixes in the scripts to

1. Fix the bug where the empty test matrix was confusingly print as unstable https://github.com/pytorch/pytorch/pull/121381#issuecomment-2004558588
1. Replace `print` with `logging.info`
1. Remove the hardcoded `VALID_TEST_CONFIG_LABELS` list.  It's out of date and not many people use this feature besides `test-config/default`, so why bother.  The behavior here is simpler now:
    1. If the PR has some `test-config/*` labels, they will be applied
    1. If the PR has none of them, all test configs are applied
1. Add log for the previous 2 cases to avoid confusion

### Testing

```
python filter_test_configs.py --workflow "Mac MPS" --job-name "macos-12-py3-arm64 / build" --event-name "push" --schedule "" --branch "" --tag "ciflow/mps/121381" \
  --pr-number 121065 \
  --test-matrix "{ include: [
    { config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-stable" },
    { config: "mps", shard: 1, num_shards: 1, runner: "macos-m2-14" },
  ]}
 ```

Also running on this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122155
Approved by: https://github.com/clee2000
2024-03-21 23:20:52 +00:00
bce640709c Revert "Precompile triton templates (#121998)"
This reverts commit b8df2f0ca530ebe01fa079c891c170a1f4b22823.

Reverted https://github.com/pytorch/pytorch/pull/121998 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is causing all ROCm trunk job to fail b8df2f0ca5 ([comment](https://github.com/pytorch/pytorch/pull/121998#issuecomment-2014003037))
2024-03-21 23:05:59 +00:00
c4486d3e88 Allow fake models to run with ONNXProgram.__call__ (#122230)
In order for a fake model to run through the ONNXProgram.__call__
interface, we need to save the model to disk along with its external data
before executing it. This is what this PR implements.

An alternative would be for ONNXProgram.__call__ to detect that the model
was exported with fake mode and explicitly raise an exception when
ONNXProgram.__call__ is executed. The exception message would instruct
the user to call ONNXProgram.save and manually execute the model using
the ONNX runtime of choice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122230
Approved by: https://github.com/BowenBao
ghstack dependencies: #122196
2024-03-21 22:28:05 +00:00
4ba51bb2c4 Add keys used for templated attention impls (#122423)
# Summary

Mypy will complain that these attributes dont exist for this PR: https://github.com/pytorch/pytorch/pull/121845/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122423
Approved by: https://github.com/bdhirsh
2024-03-21 22:16:53 +00:00
224beecee6 Revert "Proper view support for jagged layout NestedTensor (#113279)"
This reverts commit 5855c490f09a028bfdfefea8b93c9833eb55dc5c.

Reverted https://github.com/pytorch/pytorch/pull/113279 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/113279#issuecomment-2013899762))
2024-03-21 22:03:01 +00:00
12e7602cf9 Revert "Support for torch.nested.as_nested_tensor(t) (#113280)"
This reverts commit 17c9c7026521be1c194cae278b76ac8e8f7d145b.

Reverted https://github.com/pytorch/pytorch/pull/113280 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/113280#issuecomment-2013893099))
2024-03-21 22:00:44 +00:00
816db3bd29 Revert "Public API for NJT construction from jagged components (#121518)"
This reverts commit d4dff9cf5e7b734a8621b571e8f5a761dc43e1e0.

Reverted https://github.com/pytorch/pytorch/pull/121518 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/121518#issuecomment-2013879641))
2024-03-21 21:56:29 +00:00
48afb5c325 [inductor] Use python constants in IndexPropagation (#122031)
In the next PR I have the IR `ops.neg(ops.constant(0.0, torch.float32))`
which should be folded to `ops.constant(-0.0, torch.float32)` but it seems that
`sympy.Float(-0.0)` doesn't respect the sign of the zero and so we instead
get a positive zero constant.

Here, I work around this by doing the constant folding with python arithmetic
which does respect signed zeros.
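
A tiny illustration of the signed-zero behaviour described above (assuming sympy behaves as stated):

```
import sympy

print(-1.0 * 0.0)          # -0.0: Python arithmetic keeps the signed zero
print(sympy.Float(-0.0))   # prints a positive zero; the sign is lost
```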

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122031
Approved by: https://github.com/lezcano
2024-03-21 21:53:22 +00:00
99055ae165 [aoti] Fix compilation bug for buffer mutations (#121688)
I realized there's a bug when unlifting buffer mutations in AOTI.
There also seems to be a bug during tracing where AOTI mutates the buffer; I didn't take the time to investigate, so I left it as a TODO for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121688
Approved by: https://github.com/chenyang78, https://github.com/bdhirsh
2024-03-21 21:51:32 +00:00
332456c44d triton_kernel_wrap shouldn't access FakeTensor.data_ptr (#122418)
The comment suggests that we need to replace all FakeTensors with real
tensors. `torch.empty` doesn't actually return a real Tensor because
FakeTensorMode is active!

We disable torch dispatch so that torch.empty actually returns a real Tensor.
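
One way to picture the effect (the exact mechanism used in the PR may differ; `_disable_current_modes` is just one way to temporarily drop the active fake mode):

```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.utils._python_dispatch import _disable_current_modes

with FakeTensorMode():
    fake = torch.empty(4)            # a FakeTensor: no real storage behind it
    with _disable_current_modes():   # temporarily drop active dispatch modes
        real = torch.empty(4)        # now an ordinary, materialized Tensor
```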

The motivation for this PR is that we're trying to ban
FakeTensor.data_ptr (or at least warn on it) in torch.compile. See the
next PR up in the stack

Test Plan:
- Existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122418
Approved by: https://github.com/oulgen
2024-03-21 21:48:07 +00:00
621fdc9db8 infer_schema can add alias annotations when passed a list of mutated args (#122343)
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122343
Approved by: https://github.com/ezyang
ghstack dependencies: #122319, #122320
2024-03-21 21:39:07 +00:00
639d6201b4 Expand the types infer_schema can infer (#122320)
This PR allows it to infer:
- None return as ()
- List[Tensor] as Tensor[]

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122320
Approved by: https://github.com/ezyang, https://github.com/soulitzer
ghstack dependencies: #122319
2024-03-21 21:39:07 +00:00
0dd78f1828 Add standalone tests for infer_schema (#122319)
We're gonna reuse this helper in the new python custom ops API. Given a
function with type annotations, `infer_schema(fun)` returns an inferred
schema.
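
Roughly, for functions like the following (the schema strings in the comments are an approximation; the helper itself lives in an internal module):

```
from typing import List
from torch import Tensor

def mysin(x: Tensor) -> Tensor:
    ...

def split3(x: Tensor) -> List[Tensor]:
    ...

def inplace_op(x: Tensor) -> None:
    ...

# infer_schema(mysin)      -> roughly "(Tensor x) -> Tensor"
# infer_schema(split3)     -> roughly "(Tensor x) -> Tensor[]"
# infer_schema(inplace_op) -> roughly "(Tensor x) -> ()"
```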

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122319
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2024-03-21 21:39:04 +00:00
23524710e6 [dynamo] use proxies to nn.Module in dynamo generated GraphModules (#120756)
Fixes remaining refleaks found when debugging https://github.com/pytorch/pytorch/issues/119607, tests added in https://github.com/pytorch/pytorch/pull/120657.

Also fixes some tests that xfail: https://github.com/pytorch/pytorch/issues/120631 (not entirely sure why), but introduced tests now fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120756
Approved by: https://github.com/jansel
2024-03-21 21:23:12 +00:00
2cd0a5d516 [Inductor] Fix for WrapperCodeGen.statically_known_int_or_none (#121808)
There's obviously a small typo in WrapperCodeGen.statically_known_int_or_none,
where the return value of a call to V.graph._shape_env._maybe_evaluate_static
is being discarded.

This fix changes that to work how it was likely intended to.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121808
Approved by: https://github.com/lezcano, https://github.com/jansel, https://github.com/aakhundov
2024-03-21 21:15:32 +00:00
968c4c4154 Revert "Refactor gpu trace to be device-agnostic (#121794)"
This reverts commit 74deacbf31d032a2659dc1633dc3e5248921d466.

Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks ROCm jobs in trunk 74deacbf31, please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013674083))
2024-03-21 20:33:17 +00:00
13afbcfc85 Revert "Support gpu trace on XPU (#121795)"
This reverts commit 91ead3eae4cd6cbf50fe7a7b4a2f9f35302bc9b2.

Reverted https://github.com/pytorch/pytorch/pull/121795 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks ROCm jobs in trunk 74deacbf31, please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013674083))
2024-03-21 20:33:16 +00:00
182bb0f2ca Revert "Introduce XPU implementation for PyTorch ATen operators (#120891)"
This reverts commit 148a8de6397be6e4b4ca1508b03b82d117bfb03c.

Reverted https://github.com/pytorch/pytorch/pull/120891 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert it to resolve a conflict in trunk https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013434523.  Please help reland the change after ([comment](https://github.com/pytorch/pytorch/pull/120891#issuecomment-2013668563))
2024-03-21 20:30:20 +00:00
628dcde136 [AOTI] Disable stack allocation when there is a fallback op (#122367)
Summary: Stack allocation is disabled when there is an aten fallback op, see c84f81b395/torch/_inductor/codegen/cpp_wrapper_cpu.py (L974). But we need to do the same when there is a custom op fallback.

Test Plan: CI

Reviewed By: mikekgfb

Differential Revision: D55149369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122367
Approved by: https://github.com/mikekgfb
2024-03-21 20:02:33 +00:00
af9b71c82f fix typo in while_loop_test (#122416)
As titled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122416
Approved by: https://github.com/angelayi
2024-03-21 19:42:08 +00:00
d131cbc44f Fuse the input -> p2p buffer copy into one-shot all-reduce kernel when the input is small (#121213)
This improves the gpt-fast llama2 70B 8xH100 (non-standard) TP benchmark from 86 tok/s to 88 tok/s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121213
Approved by: https://github.com/Chillee
2024-03-21 18:25:57 +00:00
765c3fc138 fix breaking changes for ONNX Runtime Training (#122000)
Fixes breaking changes for ONNX Runtime Training.

PR https://github.com/pytorch/pytorch/pull/121102 introduced incompatibility with ORT training because of change in parameter type. Creating a PR to add previous parameter types and verified that it works with ORT training.

Error with current scenario:

```
site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/aten_op_executor/aten_op_executor.cc:60:40: error: invalid conversion from ‘const DLManagedTensor*’ to ‘DLManagedTensor*’ [-fpermissive]
at::Tensor tensor = at::fromDLPack(dlpack);

site-packages/torch/include/ATen/DLConvertor.h:15:46: note:   initializing argument 1 of ‘at::Tensor at::fromDLPack(DLManagedTensor*)’
TORCH_API Tensor fromDLPack(DLManagedTensor* src);
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122000
Approved by: https://github.com/malfet
2024-03-21 18:10:22 +00:00
c2651a7f0e Make check_is_size clamp to sys.maxsize - 1, so sys.maxsize comparison returns False (#122372)
Partially fixes https://github.com/pytorch/pytorch/issues/113002

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122372
Approved by: https://github.com/lezcano
ghstack dependencies: #122370
2024-03-21 17:14:42 +00:00
780f70b728 Make expected stride test in torch._prims_common size oblivious (#122370)
Partially addresses https://github.com/pytorch/pytorch/issues/113002

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122370
Approved by: https://github.com/lezcano
2024-03-21 17:14:42 +00:00
25bf5f7e61 Revert "Enable x86 CPU vectorization on windows [submodule sleef] (#118980)"
This reverts commit aa74a8b9e5b34eaa700a64064818adc7a12942ca.

Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/huydhn due to Sorry for revert your change one more time but the hard part is that it breaks lot of internal builds ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-2013043364))
2024-03-21 17:07:17 +00:00
b8df2f0ca5 Precompile triton templates (#121998)
Before this PR we were not precompiling triton templates in parallel. Compilation would occur during benchmarking.

Triton benchmarking templates were emitted as :

```
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

In order to precompile we need to give the full kernel specification, as we do when we emit the template in the final output code generation.

```
@triton_heuristics.template(
    num_stages=3,
    num_warps=8,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]},
    inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': 'cdeecfeccd31ad7810f96b5752194b1c2406d0a81e39a6ca09c8ee150baae183'},
)
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121998
Approved by: https://github.com/jansel
ghstack dependencies: #121996, #120275, #121997
2024-03-21 17:04:53 +00:00
17175cdbc7 [Docs] Add extended debugging options for troubleshooting (#122028)
Fixes #120889

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122028
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-03-21 17:00:45 +00:00
c20bc18d59 [export] allow static constraints in dynamic_shapes (#121860)
This PR allows users to specify int values for dimensions in dynamic_shapes as well as None, for example:

```
class Foo(torch.nn.Module):
    def forward(self, x, y, z):
        ...

foo = Foo()
inputs = (torch.randn(4, 6), torch.randn(5, 4), torch.randn(3, 3))

for dynamic_shapes in [
    None,
    ((4, 6), (5, 4), (3, 3)),
    ((None, 6), None, {0: 3, 1: 3}),
]:
    _ = export(foo, inputs, dynamic_shapes=dynamic_shapes)
```

All of the above should produce the same ExportedProgram.

This is done by temporarily creating a static dim constraint during analysis, where vr.lower == vr.upper. These constraints are then deleted during _process_constraints(), and do not show up in the final ExportedProgram's range_constraints.

Additionally, export() will also fail if the shapes are mis-specified, for example:
```
_ = export(foo, inputs, dynamic_shapes=((5, None), None, None))
```
leads to `torch._dynamo.exc.UserError: Static shape constraint of 5 does not match input size of 4, for L['x'].size()[0]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121860
Approved by: https://github.com/avikchaudhuri
2024-03-21 16:59:59 +00:00
16935de961 Support alias for NestedTensorCPU/CUDA (#117711)
Fixes #ISSUE_NUMBER

Co-authored-by: Vincent Moens <vmoens@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117711
Approved by: https://github.com/albanD
2024-03-21 16:05:52 +00:00
148a8de639 Introduce XPU implementation for PyTorch ATen operators (#120891)
As a follow-up to #114835 and #119682, we add a limited set of ATen operator implementations for XPU. With this PR, the blocking issue for oneDNN operations and the Inductor XPU backend will be resolved, as the two components depend on these operations to support their basic features.

The added ATen operators include:

- `copy_`, `_to_copy`, `_copy_from_and_resize`, `clone`
- `view`, `view_as_real`, `view_as_complex`,
- `as_strided`, `_reshape_alias`, `resize_`, `resize_as_`,
- `add`/`add_`, `sub`/`sub_`, `mul`/`mul_`, `div`/`div_`, `abs`,
- `empty`, `empty_strided`,
- `fill_`, `zeros_`.

Co-authored-by: Wang, Eikan <eikan.wang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120891
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/atalman
2024-03-21 15:42:20 +00:00
204fd69ca6 Make ONNXProgram.model_proto and disk file the same (#122196)
Currently, the in-memory ONNX program model proto does
not contain the initializers that are saved into the on-disk version.

This PR changes this behavior so that both versions are
identical. This is important for running models with fake
tensors from ONNXProgram.model_proto directly, without a file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122196
Approved by: https://github.com/BowenBao
2024-03-21 15:29:31 +00:00
f9996ed764 [BE] Enable torch inductor tests running on MacOS (#122360)
The original idea was to limit the testing to just x86 Macs, but right now it will be skipped on all Apple Silicon ones, as all of them have a Metal-capable GPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122360
Approved by: https://github.com/jansel
2024-03-21 14:47:05 +00:00
456b112dca [inductor] Support non-Tensor predicate in torch.cond (#122378)
Summary: Previously, we only supported a torch.Tensor boolean scalar predicate in `torch.cond` in Inductor. This PR adds support for SymBool and Python bool predicates, to match the `torch.cond` [semantics](https://pytorch.org/docs/stable/generated/torch.cond.html) in Dynamo / Export.
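
A small hedged example of a non-Tensor predicate (a SymBool derived from a shape comparison when shapes are dynamic, or a plain Python bool otherwise):

```
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile
def f(x, n):
    # predicate is a bool/SymBool computed from a size, not a boolean Tensor
    return torch.cond(x.shape[0] > n, true_fn, false_fn, (x,))

print(f(torch.randn(5, 4), 3))
```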

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 34 tests in 56.980s

OK

$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 54 tests in 460.093s

OK (skipped=4)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122378
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-21 14:35:01 +00:00
0b68a28c87 Optimize multi_tensor_apply (take 2) (#119764)
### Take 2

The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153:
- Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication.
- Ensure the optimization is compatible with cuda graph.

### Summary

Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops.

Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.

This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.

### Benchmark (WIP)

The only benchmark I've conducted so far is `_foreach_copy_` on a set of sizes that resembles an internal workload. I need to benchmark more problem sizes; the speedup should vary among them. **However, I believe this PR should not be slower than the previous impl on any problem size.**

The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa).

**Baseline**

A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```

**This PR**

A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764
Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar
2024-03-21 11:53:31 +00:00
0d8e960f74 Revert "[Sparsity] add support for H100 compute capability 9.x (#121768)"
This reverts commit 91fdaa1b416ab8ac8be30f3c3428751e236657cd.

Reverted https://github.com/pytorch/pytorch/pull/121768 on behalf of https://github.com/jeanschmidt due to Agreed on reverting and fixing rocm tests ([comment](https://github.com/pytorch/pytorch/pull/121768#issuecomment-2011893826))
2024-03-21 10:42:08 +00:00
cyy
7f8bb1de83 [Dynamo][2/N] Fix clang-tidy warnings in torch/csrc/dynamo/* (#122362)
This PR continues to clean clang-tidy warnings in torch/csrc/dynamo/*, following #122259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122362
Approved by: https://github.com/ezyang
2024-03-21 09:41:41 +00:00
ea1cd31b50 [c10d] Log the target of FR dump (#122345)
Summary: It would be useful to log the destination of the trace dump (either Manifold or a local file) so users can quickly locate the dump.

Test Plan: Modified unit tests

Differential Revision: D54972069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122345
Approved by: https://github.com/wconstab
2024-03-21 08:03:05 +00:00
365e89a591 Add tensor step to adadelta (#122252)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes Adadelta step update while compiling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122252
Approved by: https://github.com/janeyx99
2024-03-21 07:28:47 +00:00
7fa1be506b Add an option to sdpa benchmark to specify backend (#122368)
# Summary
Adds the ability to specify sdpa backend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122368
Approved by: https://github.com/cpuhrsch
2024-03-21 07:00:40 +00:00
18c164ef7c [Inductor] Match insignficiant strides on outputs (#122239)
Fix for https://github.com/pytorch/pytorch/issues/116433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122239
Approved by: https://github.com/Chillee
2024-03-21 05:35:59 +00:00
b915877deb Support numpy array in Tensor.__eq__ (#122249)
When the `other` arg of `Tensor.__eq__` is a numpy array, it is converted to a PyTorch tensor view of the numpy array, which is then given as the `other` arg to a `Tensor.eq` call
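
A minimal example of the behaviour:

```
import numpy as np
import torch

t = torch.tensor([1, 2, 3])
a = np.array([1, 0, 3])
# `a` is wrapped as a tensor view of the numpy array, then Tensor.eq is called
print(t == a)   # tensor([ True, False,  True])
```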

Fixes #119965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122249
Approved by: https://github.com/ezyang
2024-03-21 04:55:01 +00:00
bf18e967b4 [c10d] disable compute_duration by default (#122138)
Summary:
Computing durations incurs additional CUDA overhead and can possibly
increase GPU memory usage or cause hangs, so we want to disable it by default and enable it only
when needed, or at least only when timing is enabled.

Test Plan:
Test with existing unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122138
Approved by: https://github.com/wconstab
2024-03-21 04:45:37 +00:00
ea6f67853e [inductor fbcode] Add python include paths for Python.h (#122363)
Summary:
We're getting errors that Python.h is not found because we didn't have
the proper include path set up for it.

bypass-github-export-checks

Test Plan: I can only get this to show up in Bento: N5106134

Reviewed By: hl475, chenyang78

Differential Revision: D55133110

Co-authored-by: Bert Maher <bertrand@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122363
Approved by: https://github.com/bertmaher
2024-03-21 04:32:17 +00:00
d4dff9cf5e Public API for NJT construction from jagged components (#121518)
This PR introduces `torch.nested.nested_tensor_from_jagged(values, offsets=None, lengths=None, jagged_dim=1)` (bikeshedding welcome). This is intended to be the main entrypoint for getting an NJT from the `(values, offsets, lengths)` components. The returned NJT is a view of the `values` component.

Note that `torch.nested.nested_tensor()` / `torch.nested.as_nested_tensor()` already exist for constructing an NJT from a list of tensors.

TODO:
* Some doc formatting; suggestions welcome there
* Tests / examples using `jagged_dim != 1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121518
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #113280
2024-03-21 04:14:17 +00:00
17c9c70265 Support for torch.nested.as_nested_tensor(t) (#113280)
This PR adds support for tensor inputs to `as_nested_tensor()`. The tensor is treated as a batch of consistently-sized constituents. It utilizes `_nested_view_from_values_offsets()` to return a real view that allows for propagating gradients into inputs.
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113280
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2024-03-21 04:13:55 +00:00
77bed8f7f2 [ONNX] model_type flag is only supported under SKIP_XFAIL_SUBTESTS (#122336)
Fixes #120918

To address the confusion that developers usually have about which list to put xfail and skip entries in, this PR provides guidance that `model_type`- and `matcher`-specified xfail/skip entries should go in `SKIP_XFAIL_SUBTESTS`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122336
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2024-03-21 04:10:32 +00:00
cc0cadaf4c [vision hash update] update the pinned vision hash (#122154)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122154
Approved by: https://github.com/pytorchbot
2024-03-21 03:59:12 +00:00
61f69c7fc4 [audio hash update] update the pinned audio hash (#122153)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122153
Approved by: https://github.com/pytorchbot
2024-03-21 03:53:24 +00:00
885fb9742d Handle special kwargs in user-written Triton kernel calls (#122280)
Summary: Special kwargs like `num_warps`, `num_stages`, and `num_ctas` can be passed to the Triton kernel call as kwargs. These kwargs are handled in a special way, not being passed to the underlying kernel function directly. In this PR, we move those special kwargs from `kwargs` of the `TritonKernelVariable` in dynamo to `Autotuner`'s `Config` instances (either already existing or newly created for this purpose). As a result, the special kwargs can be codegened correctly as a part of `Config`, not as direct arguments to the kernel `.run`.
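
For illustration, a user-written kernel call of the kind affected might look like this (a sketch, assuming a CUDA device; the kernel itself is hypothetical):

```
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

@torch.compile
def f(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 128),)
    # num_warps/num_stages are launch options, not kernel arguments; with this
    # change they are captured in a Config rather than passed to the kernel's .run
    add_kernel[grid](x, y, out, n, BLOCK=128, num_warps=4, num_stages=2)
    return out
```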

Test Plan:

```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_kwargs
...
----------------------------------------------------------------------
Ran 6 tests in 6.783s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122280
Approved by: https://github.com/oulgen
2024-03-21 03:34:07 +00:00
3e6fdea390 [ONNX] Fix list dtype finding bug in dispatcher (#122327)
Fixes #122166

Before this PR, the dispatcher assumed that the first input provides a reasonable dtype, but `aten::index` exposes cases with `None` at the front of the inputs. This PR addresses that by taking the dtype from the first non-`None` input.
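
A minimal sketch of the selection rule described above; the helper name is hypothetical and only illustrates the behavior, it is not the dispatcher's actual code:

```python
import torch

def first_non_none_dtype(inputs):
    # aten::index can pass None placeholders before the real index tensors,
    # so skip leading Nones instead of assuming inputs[0] carries the dtype.
    for inp in inputs:
        if inp is not None and hasattr(inp, "dtype"):
            return inp.dtype
    return None

print(first_non_none_dtype([None, torch.tensor([1]), torch.tensor([2.0])]))  # torch.int64
```
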
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122327
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2024-03-21 02:54:58 +00:00
ae913175c3 Fix GraphModuleDeserializer (#122342)
Summary: self.constants is used in self.deserialize_signature()

Test Plan: CI

Differential Revision: D55152971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122342
Approved by: https://github.com/zhxchen17
2024-03-21 02:27:39 +00:00
e9dcda5cba Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)
See #113541

The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality.

cc  @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068
Approved by: https://github.com/ezyang
2024-03-21 01:57:08 +00:00
91ead3eae4 Support gpu trace on XPU (#121795)
# Motivation
Support GPU trace on the XPU backend by adding GPU trace to the XPU runtime. This will be beneficial for generalizing the device caching allocator in the next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121795
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #121794
2024-03-21 01:56:42 +00:00
74deacbf31 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic so that it can be shared among device backends.

# Solution
Move `_cuda_trace.py` to `_gpu_trace.py`, so that each device backend owns its own callbacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-21 01:52:58 +00:00
57734202c6 [HSTU][TGIF] Provide a API to check whether running in torch_dispatch mode (#122339)
Summary: We provide an `is_in_torch_dispatch_mode` API that returns a `bool` indicating whether the program is running in torch dispatch mode.
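
A minimal sketch of how the new query could be used, assuming it is exposed from torch.utils._python_dispatch as the related diffs suggest:

```python
from torch.utils._python_dispatch import TorchDispatchMode, is_in_torch_dispatch_mode

class NoopMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        return func(*args, **(kwargs or {}))

print(is_in_torch_dispatch_mode())      # False outside any mode
with NoopMode():
    print(is_in_torch_dispatch_mode())  # True inside the mode
```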

Test Plan:
- OSS CI
- Tested by publishing hstu models with this diff and the following diffs D54964288, D54964702, D54969677, D55025489; runtime errors are no longer raised during publish

Differential Revision: D55091453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122339
Approved by: https://github.com/jiayisuse
2024-03-21 01:37:23 +00:00
e38d60bc07 Remove some stale xla dynamo backend (#122128)
`torchxla_trace_once ` and `aot_torchxla_trivial ` should be removed.

In our internal torchbench daily runs (hopefully the dashboard can be open-sourced soon), the `openxla` backend has a much higher passing rate and similar performance to `openxla_eval` (the non-AOTAutograd backend). We still use `openxla_eval` in the llama2 example, but I think we should move users to the `openxla` backend going forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122128
Approved by: https://github.com/alanwaketan, https://github.com/jansel
2024-03-21 01:13:50 +00:00
c20cf97366 Move some cudagraphs checks into C++ (#122251)
Based off of https://github.com/pytorch/pytorch/pull/111094
This + cpp guards improves TIMM geomean optimizer performance by about 20%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122251
Approved by: https://github.com/eellison
2024-03-21 01:02:23 +00:00
be5863de39 Remove usage of deprecated volatile (#122231)
Summary:
When building our iOS app, we get a compile error about the deprecated `volatile` keyword.

This diff attempts to fix it by replacing the usage of the deprecated `volatile` keyword with `atomic` as suggested by malfet

Test Plan: Successfully built the iOS app that previously had a compile error

Differential Revision: D55090518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122231
Approved by: https://github.com/malfet
2024-03-21 00:55:16 +00:00
1686e2d1e4 [symbolic shapes][compile-time] Minor compile time optimization in has_free_symbols (#122144)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122144
Approved by: https://github.com/lezcano
ghstack dependencies: #120726
2024-03-21 00:48:57 +00:00
cyy
c2eedb7f8a [Dynamo][1/N] Fix clang-tidy warnings in torch/csrc/dynamo/* (#122259)
This PR begins a series of works to ensure dynamo C++ code is clang-tidy clean.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122259
Approved by: https://github.com/ezyang
2024-03-21 00:43:25 +00:00
c80601f35a Revert "Avoid COW materialize in conv, log sigmoid, repeat, group_norm, batch_norm (#121537)"
This reverts commit a2a88f39ee991f471f2a2c54571886d70f5cd2e6.

Reverted https://github.com/pytorch/pytorch/pull/121537 on behalf of https://github.com/kurtamohler due to flaky CI failures ([comment](https://github.com/pytorch/pytorch/pull/121537#issuecomment-2010937226))
2024-03-21 00:03:30 +00:00
eqy
d5b5012dc4 [CUDA] Raise softmax_forward_64bit_indexing GPU memory requirement (#116075)
printing `torch.cuda.memory_summary()` shows ~41GiB reserved at the end of this test, not sure how it was passing previously on CUDA.

CC @ptrblck @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116075
Approved by: https://github.com/ptrblck, https://github.com/malfet
2024-03-21 00:03:17 +00:00
5855c490f0 Proper view support for jagged layout NestedTensor (#113279)
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
    * `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
    * This op is implemented on the Python side using torch.library so we can return a subclass instance
    * `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
    * The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
    * `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
    * `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
    * Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)

With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.
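
A small sketch of the "true view" semantics described above, assuming the public jagged-layout constructors; mutating the values buffer is visible through the nested tensor:

```python
import torch

ts = [torch.randn(2, 5), torch.randn(3, 5)]
nt = torch.nested.as_nested_tensor(ts, layout=torch.jagged)

vals = nt.values()   # a view of the underlying values buffer, not a copy
vals.zero_()         # mutating the buffer is reflected in the nested tensor
print(nt.unbind()[0].abs().sum())   # tensor(0.)
```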

Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
2024-03-20 23:45:34 +00:00
057892f4be [CPU] optimize Lp norm for 1-dimensional vector (#122143)
Fixes https://github.com/pytorch/pytorch/issues/120229

- Optimize vector norm by simplifying vector norm formula for 1-dimensional vector.
- Vector norm formula for 1-dimensional vector simplifies to `abs(x)`. See below for proof.
- Next step, we can similarly optimize matrix norm (`torch.linalg.matrix_norm`) for 1 x 1 matrix.
- Additionally, this avoids overflow in the power `abs(x) ** p` for large `p` or `x` for 1-dimensional vectors.

### Performance
Avg Latency (ms) of `torch.norm` and `torch.linalg.vector_norm` for
`torch.norm(torch.randn(2**18, 1), ord, -1)`
`torch.linalg.vector_norm(torch.randn(2**18, 1), ord, -1)`
Tested on 28 physical cores/socket, 1 socket on Skylake.

|                          	|                 	|         	|         	| **Avg Latency (ms)**  	|                       	|                                        	|
|--------------------------	|-----------------	|---------	|---------	|-----------------------	|-----------------------	|----------------------------------------	|
| **op**                   	| **input shape** 	| **dim** 	| **ord** 	| **baseline (master)** 	| **optimized (7102f1ef372b248414d36cbd0c51a546b6b6a41a)** 	| **speedup ratio (baseline/optimized)** 	|
| torch.norm               	| (2**18, 1)      	| -1      	| fro     	| 34.3755531            	| 0.0125408             	| 2741.094                               	|
|                          	|                 	|         	| inf     	| 34.0952635            	| 0.0122237             	| 2789.271                               	|
|                          	|                 	|         	| -inf    	| 34.3674493            	| 0.0120759             	| 2845.953                               	|
|                          	|                 	|         	| 0       	| 34.1004515            	| 0.0175261             	| 1945.69                                	|
|                          	|                 	|         	| 1       	| 34.1688442            	| 0.0121593             	| 2810.089                               	|
|                          	|                 	|         	| -1      	| 33.949492             	| 0.0120282             	| 2822.487                               	|
|                          	|                 	|         	| 2       	| 34.3669581            	| 0.0120401             	| 2854.366                               	|
|                          	|                 	|         	| -2      	| 33.9252067            	| 0.0121069             	| 2802.139                               	|
|                          	|                 	|         	|         	|                       	|                       	|                                        	|
| torch.linalg.vector_norm 	| (2**18, 1)      	| -1      	| inf     	| 34.090879             	| 0.0095105             	| 3584.545                               	|
|                          	|                 	|         	| -inf    	| 34.3708754            	| 0.0099111             	| 3467.931                               	|
|                          	|                 	|         	| 0       	| 34.0880775            	| 0.0141716             	| 2405.38                                	|
|                          	|                 	|         	| 1       	| 34.1392851            	| 0.0093174             	| 3664.036                               	|
|                          	|                 	|         	| -1      	| 33.925395             	| 0.0092483             	| 3668.302                               	|
|                          	|                 	|         	| 2       	| 34.3854165            	| 0.0092459             	| 3719.002                               	|
|                          	|                 	|         	| -2      	| 33.932972             	| 0.0093007             	| 3648.429                               	|

### Proof
<details>
<summary>For those interested :)</summary>

<img width="382" alt="1_dim_vector_norm_proof1" src="https://github.com/pytorch/pytorch/assets/93151422/59b1e00b-8fcd-47cb-877d-d31403b5195b">
<img width="432" alt="1_dim_vector_norm_proof2" src="https://github.com/pytorch/pytorch/assets/93151422/236bea15-2dd5-480b-9871-58b2e3b24322">

</details>
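
A compact text restatement of the proof above, for readers who cannot view the images (single-element vector, finite nonzero `p`; the infinity norms reduce the same way):

```latex
\|(x)\|_p = \bigl(|x|^p\bigr)^{1/p} = |x| \quad (p \neq 0,\ p \text{ finite}), \qquad
\|(x)\|_{\infty} = \max\{|x|\} = |x|, \qquad
\|(x)\|_{-\infty} = \min\{|x|\} = |x|.
```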

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122143
Approved by: https://github.com/lezcano
2024-03-20 23:20:25 +00:00
aa74a8b9e5 Enable x86 CPU vectorization on windows [submodule sleef] (#118980)
Enable VEC on Windows OS.
1. Fix some type definition gaps between Windows and Linux.
2. Fix some operators not supported on Windows, such as `[]` and `/`.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.
5. Upgrade submodule sleef lib, which fixes a build issue on Windows.
6. Fix bazel build issues.
7. Fix the test app not linking to sleef on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
2024-03-20 22:41:13 +00:00
666d6291af Cast checkpoint weights to match model parameter's dtype (#122100)
Fixes #121986
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122100
Approved by: https://github.com/BowenBao
2024-03-20 22:01:40 +00:00
2289fa5f5a [while_loop] fix mode not on stack error (#122323)
Fixes https://github.com/pytorch/pytorch/issues/121453.

This is caused by missing  `with mode` in FakeTensor mode.

Test Plan:
add new tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122323
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #122244
2024-03-20 21:17:33 +00:00
512251c8f3 Use tree_map to get device ids and device types for activation checkpointing (#121462)
`get_device_states` doesn't recursively look into nested lists/dicts to find tensors. As a result, activation checkpointing for such inputs silently produces incorrect results: `get_device_states` returns an empty result, so no RNG state is saved here: https://github.com/pytorch/pytorch/blob/main/torch/utils/checkpoint.py#L188 since `fwd_device_states` is empty.

Fixed this by using `tree_map` for both `get_device_states` and `_infer_device_type`. Also added appropriate unit tests. See the sketch below.
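
An illustrative sketch (the helper name is hypothetical, not the actual checkpoint.py code) of collecting devices from arbitrarily nested inputs with tree_map:

```python
import torch
from torch.utils._pytree import tree_map

def collect_devices(*args, **kwargs):
    devices = []

    def visit(x):
        if isinstance(x, torch.Tensor) and x.device.type != "cpu":
            devices.append(x.device)
        return x

    tree_map(visit, (args, kwargs))   # traverses nested lists/dicts, unlike a flat scan
    return devices

print(collect_devices({"inputs": [torch.randn(2)]}))   # [] on CPU; device list otherwise
```
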
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121462
Approved by: https://github.com/soulitzer
2024-03-20 21:09:21 +00:00
cyy
1dd1899fd6 Add missing throw of std::runtime_error in dynamo/guards.cpp (#122306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122306
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-03-20 20:50:01 +00:00
d2a8d3864c [PT2][Inductor] Change the log for the group batch fusion (#122245)
Summary: Instead of logging with the generic "batch_fusion" and "group_fusion" labels, we log with the specific pass name, which better summarizes how often each pattern hits and makes debugging easier.

Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```

Differential Revision: D55103303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122245
Approved by: https://github.com/jackiexu1992
2024-03-20 20:45:37 +00:00
61ff41f0ca [while_loop] disable closure capturing and manually set the inputs. (#122244)
For the while_loop operator, it's important to keep the output ordering consistent with the input ordering. Previously, we were using set_graph_inputs="automatic", which doesn't respect that ordering. This PR changes it to "manual" and respects the original user inputs' ordering. We disable closures for the body and cond functions as they require some additional design; this PR is just to stop the bleeding.

 Repro:
```python
import torch
from torch._higher_order_ops.while_loop import while_loop
from torch._functorch.aot_autograd import aot_export_module

class Nested(torch.nn.Module):
    def forward(self, ci, cj, a, b):
        def cond_fn(i1, j1, x1, y1):
            return i1 > 0
        def body_fn(i1, j1, x1, y1):
            def cond_fn_nested(i2, j2, x2, y2):
                return j2 > 0
            def body_fn_nested(i2, j2, x2, y2):
                return i2.clone(), j2 - 1, x2 + 3.14, y2 - 2.71
            i1, j1, x1, y1 = while_loop(
                cond_fn_nested, body_fn_nested, [i1, j1, x1, y1]
            )
            return i1 - 1, j1.clone(), x1 * 2, y1 / 2
        return while_loop(cond_fn, body_fn, (ci, cj, a, b))

nested = Nested()
torch.compile(nested, backend="eager", fullgraph=True)(torch.tensor(2), torch.tensor(2), torch.randn(2, 2), torch.randn(2, 2))
```

Test plan:
add new test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122244
Approved by: https://github.com/aakhundov
2024-03-20 20:14:35 +00:00
2f6e8e84c5 Fix _chunk_cat.out issue (#122076)
# PR
Vectors allocated inside `get_chunk_cat_metadata()` go out of scope before they are used in `_chunk_cat_out_cuda_contiguous()`. This PR fixes the issue by returning the vectors from `get_chunk_cat_metadata()`.
This PR also adds a few unit tests to cover more edge cases.

# Tests
This PR was tested with the following commands and no errors appear, so the flaky test failure should be resolved.

- `PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test/test_ops.py -v -k test_out__chunk_cat_cuda_float32`
- `PYTORCH_NO_CUDA_MEMORY_CACHING=1 python test/test_ops.py -v -k test_out__chunk_cat_cuda_float32 --repeat 1500`

Fixes #122026
Fixes #121950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122076
Approved by: https://github.com/yifuwang
2024-03-20 20:01:38 +00:00
c84f81b395 [export] add pass to remove auto functionalized hop (#122246)
Summary: Adds a pass that blindly removes the auto-functionalized HOP without checking whether doing so is safe. Useful for ExecuTorch today and other use cases that have additional logic to reason about when this pass is safe to use.

Test Plan: added unit test

Differential Revision: D55103867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122246
Approved by: https://github.com/angelayi
2024-03-20 19:31:52 +00:00
d813474363 [Pytorch] auto format _python_dispatch file (#122226)
Summary: Auto format the _python_dispatch file, to make D55091453 easier to review

Test Plan: `arc lint`

Differential Revision: D55091454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122226
Approved by: https://github.com/aakhundov
2024-03-20 19:28:39 +00:00
821ad56ea6 [CI] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)
Introduce changes to enable ARC to run the build for linux-jammy-py3.8-gcc11.

Depends on:
* https://github.com/pytorch/pytorch/pull/121908
* https://github.com/pytorch/pytorch/pull/121907
* Force docker to update credentials: https://github.com/pytorch/test-infra/pull/4991
* Add permissions to role to access ECR: acc0154aa0
* Add permissions to the role to access relevant S3 bucket: 496b0422c3

## Reasoning for introducing a new `_linux-build-rg.yml`

The old-style `runs-on` definition accepts a string; the new-style `runs-on` requires an object in the following format:

```
--- old
...
  runs-on: "linux.2xlarge"
...
--- new
...
  runs-on:
    group: "running-group"
...
```

In other words, to specify a group, the format of the YAML needs to be changed. Unfortunately, there is no way I am aware of to accomplish this with any trick in the book, because GitHub Actions YAML is not templatable and supports only minimal functions/replacements. A few examples that did not work:
* [`e234f25` (#119544)](e234f25ba1 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`087de4a` (#119544)](087de4ad8b (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`f03512e` (#119544)](f03512e344 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`67581fb` (#119544)](67581fb737 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121930
Approved by: https://github.com/seemethere
2024-03-20 19:06:10 +00:00
91fdaa1b41 [Sparsity] add support for H100 compute capability 9.x (#121768)
Summary: as title

Test Plan: buck test mode/opt //caffe2/test/...

Differential Revision: D54792168

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121768
Approved by: https://github.com/SherlockNoMad
2024-03-20 19:00:54 +00:00
d1e8b97387 [export] Log module hierarchy. (#121970)
Summary:
We can also log the module hierarchy in the following format:
```
:ToplevelModule
sparse:SparshArch
dense:DenseArch
```
So that we can have more information recorded about the model's identity.

Test Plan: CI

Differential Revision: D54921097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121970
Approved by: https://github.com/angelayi
2024-03-20 18:59:42 +00:00
0696db8202 Revert "Teach dynamo about torch.func.jvp (#119926)"
This reverts commit 17489784b635187316c6c856c5fe6b6a28d8a15a.

Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/peterbell10 due to broken mac jobs on main ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2010327997))
2024-03-20 18:34:43 +00:00
1d13c82559 Precompile in background (#121997)
Precompile benchmarking choices in parallel, and then wait on those choices prior to benchmarking. In the case of deferred templates, we wait only on the choices needed in the scheduler, allowing multiple separate lowerings to compile in parallel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121997
Approved by: https://github.com/jansel
ghstack dependencies: #121996, #120275
2024-03-20 18:34:12 +00:00
65eb22158e Revert "Update jvp to support symbolic execution. (#120338)"
This reverts commit afc4c9382ff8b55da848ef40b4a17a92fb3d2ad6.

Reverted https://github.com/pytorch/pytorch/pull/120338 on behalf of https://github.com/huydhn due to Broke dynamo tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/120338#issuecomment-2010276712))
2024-03-20 18:04:53 +00:00
072935917b Update cuda_to_hip_mappings.py (#122110)
Added one datatype mapping (cuda_bf16.h), and a number of cub/hipcub mappings. Note: the missing mappings were discovered when hipifying the Mamba model's (https://github.com/state-spaces/mamba) forward kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122110
Approved by: https://github.com/jithunnair-amd, https://github.com/Skylion007
2024-03-20 17:17:53 +00:00
334f7e43f9 [TD] Remove credentials requirement for retrieval (#122279)
Made the bucket readable by public
https://s3.console.aws.amazon.com/s3/buckets/target-determinator-assets?region=us-east-1&bucketType=general&tab=permissions

The only jobs that matter here are the retrieval and td jobs, which were both successful

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122279
Approved by: https://github.com/huydhn
2024-03-20 15:55:46 +00:00
2e02e1efad Skip nonzero unbacked SymInt memo in inference mode (#122147)
Summary: In `torch.inference_mode()`, fake tensors don't have `_version`s. This breaks unbacked SymInt memoization in `torch.nonzero` tracing. Here we disable the latter in inference mode.

Fixes https://github.com/pytorch/pytorch/issues/122127
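
A repro-style sketch of the scenario, assuming the usual dynamo config flag for data-dependent output shapes:

```python
import torch
import torch._dynamo.config as dynamo_config

dynamo_config.capture_dynamic_output_shape_ops = True   # let nonzero produce an unbacked size

@torch.compile(fullgraph=True)
def f(x):
    return torch.nonzero(x).float().sum()

with torch.inference_mode():                             # fake tensors have no _version here
    print(f(torch.tensor([0.0, 1.0, 0.0, 2.0])))
```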

Test Plan:

```
$ python test/inductor/test_unbacked_symints.py -k test_nonzero_in_inference_mode
...
----------------------------------------------------------------------
Ran 2 tests in 14.060s

OK
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122147
Approved by: https://github.com/ezyang
2024-03-20 14:44:55 +00:00
15a8185cd3 Revert "Enable x86 CPU vectorization on windows [submodule sleef] (#118980)"
This reverts commit 2b060983809e5fe8706acd085fff67b6a27bfb5f.

Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/zou3519 due to This caused build failures for 2+ pytorch devs, so we're reverting it to be safe ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-2009661069))
2024-03-20 14:10:12 +00:00
06db0a9f78 Revert "Upgrade submodule sleef to fix build warning (#122168)"
This reverts commit eec8b252b70b2489aee7281d336eb9c32dd85483.

Reverted https://github.com/pytorch/pytorch/pull/122168 on behalf of https://github.com/zou3519 due to trying to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/122168#issuecomment-2009653474))
2024-03-20 14:05:58 +00:00
8a94005d46 [dynamo][runtime_asserts] Ignore failures on sorting sympy relations (#122205)
Differential Revision: [D55075500](https://our.internmc.facebook.com/intern/diff/D55075500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122205
Approved by: https://github.com/ezyang
2024-03-20 13:25:37 +00:00
afc4c9382f Update jvp to support symbolic execution. (#120338)
Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions.

List of changes:
- Update `_has_same_storage_numel` to use `sym_nbytes`
- Symintify `_efficientzerotensor_meta`
- Introduce `empty_generic_symint` with the first argument `size` as symbolic integer
- Update gen_variable_type.py script to call the symint version of zeros_fn function (zeros_symint / _efficientzerotensor_symint)
- Update `has_same_meta` to call `sym_*` functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338
Approved by: https://github.com/soulitzer
ghstack dependencies: #119926
2024-03-20 13:09:19 +00:00
17489784b6 Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo
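
A minimal sketch of what becomes traceable with this change (eager backend; assumes the support described above is in place):

```python
import torch

def f(x):
    return x.sin().sum()

@torch.compile(backend="eager", fullgraph=True)
def g(x, t):
    # forward-mode JVP of f at x along tangent t, now captured by dynamo
    return torch.func.jvp(f, (x,), (t,))

x, t = torch.randn(3), torch.randn(3)
out, tangent_out = g(x, t)
print(out, tangent_out)
```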

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-20 13:09:19 +00:00
eb1d6ed9f9 [Inductor] fix addmm fusion check (#121953)
Fixes #121253.

To avoid functional issues, disable the pattern match for `addmm` when `beta` is not 1 or 0, or when `alpha != 1`, since neither `mkl_linear` nor `mkldnn_linear` accepts `beta` or `alpha` as parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121953
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-03-20 09:22:51 +00:00
ee6ce31b1d [BE][fix] fix test_tp_random_state and add it to periodic test list (#122248)
Fixes #122184. Add the test to the periodic test list so that we can catch this error in CI in the future.

**Test**:
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122248
Approved by: https://github.com/wanchaol
2024-03-20 08:24:14 +00:00
a1d02b423c XFAIL detectron2_maskrcnn_r_101_c4 CPU inductor accuracy (#122263)
This starts to fail in trunk after the stack https://github.com/pytorch/pytorch/pull/122066 lands

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122263
Approved by: https://github.com/jansel
2024-03-20 08:03:34 +00:00
477d154ffd [dynamo] Add missing _nonvar_fields annotations (#122219)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122219
Approved by: https://github.com/anijain2305
ghstack dependencies: #122218
2024-03-20 07:53:18 +00:00
46bf37b3f7 [dynamo] Replace VariableTracker.apply with visit/realize_all (#122218)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122218
Approved by: https://github.com/anijain2305
2024-03-20 07:53:18 +00:00
a0db2e4237 [dynamo] Fixed handling of ImportError (#122222)
Fixes #122088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122222
Approved by: https://github.com/anijain2305
2024-03-20 07:52:01 +00:00
7832efb242 [export] skip nn_module_stack verifier for non-fx.GraphModule modules (#122210)
Downstream users of torch.export may have different module classes (e.g. LoweredBackendModule), which cannot be checked for metadata in the same way. Add lines to skip this for non-fx.GraphModule modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122210
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2024-03-20 07:40:48 +00:00
7d2b2dec4b [Pytoch][Vulkan] Register run_conv1d_context (#122172)
Summary: We have rewritten `conv1d` as `create_conv1d_context` and `run_conv1d_context` to enable prepack of `weight` and `bias`. We have registered `create_conv1d_context` but not `run_conv1d_context`. We add the registration in this diff.

Test Plan:
```
[luwei@devbig439.ftw3 /data/users/luwei/fbsource (f89a7de33)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*conv1d*"
Using additional configuration options from /home/luwei/.buckconfig.d/experiments_from_buck_start
Recommended: For faster builds try buck2: replace 'buck' with 'buck2'
NOTE: buck-out/ has changed: look for files in fbsource/buck-out/v2/
'buck2 build --show-output //xplat/caffe2:pt_vulkan_api_test_bin' will print the new output paths.

If you are building in fbsource//xplat and have questions, post in 'Cross Platform Dev Discussions': https://fb.workplace.com/groups/xplat.qa

  Targets matching .buckconfig buck2.supported_projects:
  {'//xplat/caffe2:pt_vulkan_api_test_bin': '//xplat'}

  To suppress this warning: touch ~/.config/.dont_hint_buck2

Building: finished in 0.1 sec (100%) 394/394 jobs, 0/394 updated
  Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.conv1d_simple
[       OK ] VulkanAPITest.conv1d_simple (208 ms)
[ RUN      ] VulkanAPITest.conv1d
[       OK ] VulkanAPITest.conv1d (81 ms)
[----------] 2 tests from VulkanAPITest (289 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (289 ms total)
[  PASSED  ] 2 tests.
```

full test result
```
...
[----------] 427 tests from VulkanAPITest (22583 ms total)

[----------] Global test environment tear-down
[==========] 427 tests from 1 test suite ran. (22583 ms total)
[  PASSED  ] 426 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 11 DISABLED TESTS
```

Differential Revision: D55052816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122172
Approved by: https://github.com/nathanaelsee
2024-03-20 07:36:23 +00:00
e7141d117f [IntraNodeComm] refactor rendezvous into a separate method for better code organization and error handling (#120968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120968
Approved by: https://github.com/wanchaol
2024-03-20 06:54:25 +00:00
cyy
9f572b99a6 [Clang-tidy header][29/N] Enable clang-tidy warnings in aten/src/ATen/core/*.h (#122190)
This PR enables clang-tidy in `aten/src/ATen/core/*.h`, which ends the series of patches beginning from #122015.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122190
Approved by: https://github.com/Skylion007
2024-03-20 06:17:37 +00:00
11e64b4ba8 [dtensor] aten.cat to use stack strategy approach (#122209)
This PR switch aten.cat to use the strategy approach that is similar to
aten.stack, as these two ops share similar semantics

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122209
Approved by: https://github.com/wz337
2024-03-20 04:19:25 +00:00
5b7ceab650 Support auto_functionalize in pre-dispatch (#122177)
Summary: Title

Test Plan: CI

Differential Revision: D55042061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122177
Approved by: https://github.com/zou3519
2024-03-20 04:17:58 +00:00
dc89d8b74a Fix broken lint after #116876 (#122253)
Trivial fixes, so let's do this instead of reverting the change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122253
Approved by: https://github.com/clee2000
2024-03-20 04:09:00 +00:00
de950039fc Use .get in xml parsing (#122103)
Check that the `classname` attribute actually exists.
#122017
I expect this route to happen very rarely

At a certain point, we should just remove this parsing altogether since everything uses pytest now...
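
A minimal illustration of the defensive `.get` pattern (element and attribute names mirror typical JUnit XML, not the CI script itself):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<testsuite><testcase name="t1"/></testsuite>')
for case in root.iter("testcase"):
    # .get returns None when the attribute is missing instead of raising KeyError
    print(case.attrib.get("classname"))
```
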
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122103
Approved by: https://github.com/huydhn
2024-03-20 04:07:49 +00:00
6662627c89 Add APIs for custom device using TensorIteratorBase. (#120792)
1) add operand and get_dim_names API;
2) set will_resize to true when output tensor is undefined;
3) add abs_stub for dummy device and calculate on cpu device;
4) support dummy device copy with stride;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120792
Approved by: https://github.com/ezyang
2024-03-20 03:51:09 +00:00
f8565c4a28 [sigmoid] Clean up serialization API. (#122102)
Summary: Entirely remove the old serializer code to avoid further confusion and code bloat.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D54857118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122102
Approved by: https://github.com/tugsbayasgalan
2024-03-20 03:45:36 +00:00
1f8177dedf [Inductor][CPU] fix flash attention last_stride!=1 issue (#122083)
Fixes #121174.

Conv converts the input of SDPA to channels-last, resulting in an accuracy issue. Ensure the correct layout in lowering.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122083
Approved by: https://github.com/eellison, https://github.com/jgong5
2024-03-20 02:22:33 +00:00
cyy
55310e58a9 Use constexpr for index variables (#122178)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122178
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-03-20 02:20:17 +00:00
eec8b252b7 Upgrade submodule sleef to fix build warning (#122168)
Subsequent PR to https://github.com/pytorch/pytorch/pull/118980, fix sleef build warning.

submodule sleef, include this sleef PR: https://github.com/shibatch/sleef/pull/514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122168
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-03-20 02:14:56 +00:00
cbbed46377 Defer selection of triton template (#120275)
Our prior approach to epilogue fusion was to select a choice from a set of triton templates and extern calls based on benchmarking inputs, then unconditionally fuse epilogues. This can be sub-optimal in the following ways:

- We select an extern kernel, however an epilogue like relu() exists such that choosing a triton template + relu would have been faster
- We select a triton template, epilogue fuse, and register spilling occurs causing it to be slower than not epilogue fusing.

In this PR we defer selecting either the Triton template or the extern kernel until we have benchmarking results for the kernel itself and its epilogue. As soon as a successful fusion occurs where a fused Triton template + epilogue is faster than the unfused choice, we finalize the MultiTemplateBuffer as that specific template. If no fusion occurs, we finalize the MultiTemplateBuffer after the fusion pass completes.

Note: if there are multiple epilogue fusions (not super likely), even though we select a template after the first fusion, we will still benchmark to see if subsequent epilogue are worth fusing. We could potentially defer choosing template in this case in a follow up at expense of compile time.

Gives 4% HF training win, 10% TIMM inference win. Increases compilation time which I will be trying to address more in follow up prs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120275
Approved by: https://github.com/jansel
ghstack dependencies: #121996
2024-03-20 01:40:33 +00:00
e5e0685f61 Revert "[dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)"
This reverts commit 88ebdbc97c103271766203df6662240e95a09b42.

Reverted https://github.com/pytorch/pytorch/pull/122098 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the distributed failure looks legit as it is also failing in trunk 88ebdbc97c ([comment](https://github.com/pytorch/pytorch/pull/122098#issuecomment-2008483316))
2024-03-20 01:12:24 +00:00
19d6004b97 add int8 woq mm pattern matcher (#120985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120985
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/eellison
2024-03-20 00:23:41 +00:00
6fefc52a2b Set py3.x build-environment name consistently (#122247)
https://github.com/pytorch/pytorch/pull/122157 checks for the Python version using `"$BUILD_ENVIRONMENT" != *py3.8*`, but some build environments use a different style with `py3_8` instead, causing numpy 2.x to be installed there incorrectly, e.g. 03b987fe3f
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122247
Approved by: https://github.com/malfet
2024-03-19 23:56:38 +00:00
6c659bbc36 [codemod][lowrisk] Remove unused exception parameter from caffe2/c10/mobile/CPUCachingAllocator.cpp (#116875)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: kimishpatel, palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116875
Approved by: https://github.com/Skylion007
2024-03-19 23:52:09 +00:00
6b95dc8884 [codemod][lowrisk] Remove unused exception parameter from caffe2/torch/csrc/jit/frontend/lexer.cpp (#116876)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116876
Approved by: https://github.com/Skylion007
2024-03-19 23:51:26 +00:00
d0153ca755 use make_storage_impl to create storages for COWStorage. (#121896)
Thanks to https://github.com/pytorch/pytorch/pull/118459, `make_storage_impl` will use the function registered for other backends to create a StorageImpl.

`make_storage_impl` completely covers `make_intrusive<StorageImpl>`, so it makes sense to replace `make_intrusive<StorageImpl>` with `make_storage_impl` for creating storage in COW.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121896
Approved by: https://github.com/ezyang
2024-03-19 23:40:15 +00:00
4aaf25bc38 delete useless cast_outputs call in unary_op_impl_float_out (#120486)
The cast_outputs function is only used for the CPU device, and it is already called in the cpu_xxx_vec helpers, like cpu_kernel_vec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120486
Approved by: https://github.com/ezyang
2024-03-19 23:37:06 +00:00
2980779d0b [codemod] Remove unused variables in caffe2/caffe2/experiments/operators/tt_pad_op.h (#120177)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120177
Approved by: https://github.com/Skylion007
2024-03-19 23:36:52 +00:00
2239b55cd1 Add some more sanity asserts to checkPoolLiveAllocations (#122223)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122223
Approved by: https://github.com/eellison
2024-03-19 23:26:19 +00:00
139647d317 Fix #83241: torch.nn.TripletMarginLoss allowed margin less or equal to 0 (#121978)
Documentation states that the margin parameter of torch.nn.TripletMarginLoss must be greater than 0; however, any value was being accepted. Also fixed torch.nn.TripletMarginWithDistanceLoss, which had the same problem, and added error test inputs for the new ValueError.
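
A short sketch of the post-fix behavior (exact error messages may differ):

```python
import torch

try:
    torch.nn.TripletMarginLoss(margin=0.0)               # now rejected: margin must be > 0
except ValueError as e:
    print("TripletMarginLoss:", e)

try:
    torch.nn.TripletMarginWithDistanceLoss(margin=-1.0)  # same validation applies here
except ValueError as e:
    print("TripletMarginWithDistanceLoss:", e)
```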

Fixes #83241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121978
Approved by: https://github.com/mikaylagawarecki
2024-03-19 23:19:11 +00:00
a843bbdb21 [codemod] Remove unused variables in caffe2/caffe2/opt/nql/graphmatcher.cc (#118116)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: malfet, dmm-fb

Differential Revision: D52981072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118116
Approved by: https://github.com/Skylion007
2024-03-19 22:45:43 +00:00
f05af9e377 [codemod] Remove unused variables in caffe2/caffe2/opt/nql/ast.h (#120176)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D53779579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120176
Approved by: https://github.com/Skylion007
2024-03-19 22:44:51 +00:00
03b987fe3f [CI] Test that NumPy-2.X builds are backward compatible with 1.X (#122157)
By compiling PyTorch against 2.x RC, but running all the tests with Numpy-1.X

This has no effect on binary builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122157
Approved by: https://github.com/atalman
2024-03-19 22:40:26 +00:00
f8becb626f [codemod] Remove unused variables in caffe2/caffe2/contrib/fakelowp/spatial_batch_norm_fp16_fake_op.h (#120178)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D53779549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120178
Approved by: https://github.com/Skylion007
2024-03-19 22:36:38 +00:00
94eb940a02 [codemod] Remove unused variables in caffe2/caffe2/operators/softmax_op_cudnn.cc (#121995)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D54931224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121995
Approved by: https://github.com/Skylion007
2024-03-19 22:35:58 +00:00
a6aa3afa77 [codemod] Remove unused variables in caffe2/caffe2/video/video_decoder.cc (#122151)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D54378401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122151
Approved by: https://github.com/Skylion007
2024-03-19 22:34:17 +00:00
a80c60ad8f [codemod] Remove unused variables in caffe2/caffe2/operators/conv_op_cudnn.cc (#122161)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122161
Approved by: https://github.com/Skylion007
2024-03-19 22:33:19 +00:00
02f436da6d [codemod][bugfix] Fix addressing bug in caffe2/caffe2/video/video_input_op.h (#121856)
Summary:
# Diff Specific

The signature of `copyFrom` is
```
void Tensor::CopyFrom(const Tensor& src, bool async) {
```
so the `&context` always evaluated to true.

I could dig around to see if anyone cares about what the flag should actually be, but this is old code in caffe2, so I've just used `true` and we'll keep using whatever behaviour we've been using since 2019 or so when this was written.

# General

A bug in this code was identified by `-Waddress`, which we are working to enable globally.

This diff fixes the bug. There are a few types of fixes it might employ:

The bug could be `const_char_array == "hello"` which compares two addresses and therefore is almost always false. This is fixed with `const_char_array == std::string_view("hello")` because `string_view` has an `==` operator that makes an appropriate comparison.

The bug could be `if(name_of_func)` which always returns true because the function always has an address. Likely you meant to call the function here!

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121856
Approved by: https://github.com/Skylion007
2024-03-19 22:28:06 +00:00
1c4887d52b fix dlrm accuracy test in max-autotune (#122012)
torchrec_dlrm training fails the accuracy check when max-autotune is enabled.

I found there is no real issue in PT2; we simply fail to get fp64 reference results for the accuracy check. In max-autotune mode, numerics may change a bit and cause the cosine-similarity check to fail. Using an fp64 baseline is more reliable and makes the test pass.

The reason we were not using an fp64 baseline earlier is that torchrec uses a dataclass [Batch](99e6e669b5/torchrec/datasets/utils.py (L28)) to represent the input. We use pytree to cast the model and inputs to fp64, but pytree cannot look into a dataclass. My fix is to convert the dataclass to a namedtuple to be more pytree-friendly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122012
Approved by: https://github.com/jansel, https://github.com/eellison
2024-03-19 22:23:42 +00:00
c71554b944 Revert "[aot_inductor][easy] enable test_triton_kernel_multi_output_arg (#122052)"
This reverts commit 206da97b8b61f51041f67de68e68e9a1875589ab.

Reverted https://github.com/pytorch/pytorch/pull/122052 on behalf of https://github.com/huydhn due to Although this look fixed on OSS, it is still failing internally.  I have added the reproducible buck command in the diff D55046262 ([comment](https://github.com/pytorch/pytorch/pull/122052#issuecomment-2008253185))
2024-03-19 22:22:12 +00:00
7678be4667 Replace numel with sym_numel in is_int_or_symint (#122145)
Fixes https://github.com/pytorch/pytorch/issues/122124

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122145
Approved by: https://github.com/Skylion007
2024-03-19 21:58:43 +00:00
6915a5be70 Increase numel limit to 2^63 for replicatepad1d (#122199)
Summary: As title

Test Plan:
```
CUDA_VISIBLE_DEVICES=5 buck2 test mode/opt //caffe2/test:test_nn_cuda -- test_replicatepad_64bit_indexing
```

Also benchmarked in N5106027
```
device_ms, cpu_ms, gb/device_ms*1000
# before changes
11.058772478103638 18.912256770000006 735.4118906278957
# after changes
10.621162576675415 18.58972748 765.7121070725207
```

Differential Revision: D55030372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122199
Approved by: https://github.com/ezyang
2024-03-19 21:55:34 +00:00
b12d297b44 [AARCH64] Hide FP16 scalar arithmetic behind proper feature flag (#122204)
On Apple Silicon:
```
% sysctl machdep.cpu.brand_string; clang -dM -E - < /dev/null|grep __ARM_FEATURE_FP16
machdep.cpu.brand_string: Apple M1
#define __ARM_FEATURE_FP16_FML 1
#define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1
#define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1
```
On Graviton2 with respective `-march` flag:
```
# ./cpuinfo/build/cpu-info |grep Microarch -A1; gcc -dM -E - -march=armv8.2-a+fp16 </dev/null | grep __ARM_FEATURE_FP16
Microarchitectures:
	8x Neoverse N1
#define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1
#define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1
```
Test Plan: CI

Reviewed By: dimitribouche

Differential Revision: D55033347

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122204
Approved by: https://github.com/huydhn
2024-03-19 21:18:09 +00:00
901ba2be86 [quant][pt2e] Add support for conv transpose + bn + {relu} weights fusion in PTQ (#122046)
Summary:

also added some utils in xnnpack_quantizer_utils.py
* annotate_conv_transpose_bn_relu and annotate_conv_transpose_bn -> this is for QAT
* annotate_conv_transpose_relu

conv_transpose + bn weight fusion is performed automatically and cannot currently be disabled;
we can add support for disabling this fusion later if needed

Test Plan:
python test/test_quantization.py -k test_conv_transpose_bn_fusion

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122046
Approved by: https://github.com/andrewor14
2024-03-19 21:00:57 +00:00
bc1fef113d Respect TORCH_DISABLE_ADDR2LINE in symbolizer (#121359)
If TORCH_DISABLE_ADDR2LINE is set, the symbolizer will instead give the
filename of the shared library as the filename, the offset in that library as the linenumber,
and use dladdr to get the function name if possible. This is much faster than using addr2line,
and the symbols can be later resolved offline using addr2line if desired.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121359
Approved by: https://github.com/aaronenyeshi
2024-03-19 20:50:26 +00:00
7718a1cd4f T159183991: Error: EXC_SOFTWARE / SIGABRT at IGPyTorchFramework:-[MPSImageWrapperTrampoline endSynchronization:] (MPSImageWrapper.mm<line_num>):cpp_exception_clas (#122132)
Summary: Prevent crash by not throwing a C++ exception.

Test Plan: spongebobsandcastle

Reviewed By: SS-JIA

Differential Revision: D55036050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122132
Approved by: https://github.com/SS-JIA
2024-03-19 20:01:33 +00:00
c0b2e56c8f Support triton.language.dtype with torch.compile -- Second Attempt (#122141)
This PR is the second attempt at supporting `triton.language.dtype`: instead of putting it on the graph, we now put it in the side table, since it is a constant.
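
A hedged sketch of the pattern this enables: passing a `triton.language.dtype` constant to a user-written kernel from compiled code (requires a CUDA GPU with Triton; the kernel is illustrative, not from this PR):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def cast_kernel(x_ptr, out_ptr, n_elements, DST_DTYPE: tl.constexpr, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x.to(DST_DTYPE), mask=mask)

@torch.compile
def to_half(x):
    out = torch.empty(x.shape, dtype=torch.float16, device=x.device)
    n = x.numel()
    # tl.float16 is a constant; per this PR it goes to a side table rather than the fx graph
    cast_kernel[(triton.cdiv(n, 1024),)](x, out, n, DST_DTYPE=tl.float16, BLOCK=1024)
    return out

print(to_half(torch.randn(4096, device="cuda")).dtype)   # torch.float16
```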

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122141
Approved by: https://github.com/jansel
ghstack dependencies: #122140
2024-03-19 19:40:52 +00:00
58a805da71 [UserDefinedTriton] Move constant args out of the fx graph (#122140)
@ezyang mentioned that we should not put constant args on the graph, especially when there are args that would be trickier to put on the graph; e.g., the next PR needs `triton.language.dtype` as an argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122140
Approved by: https://github.com/jansel
2024-03-19 19:40:52 +00:00
c5ffebebab [export] allow Dim(1,2) for export dynamic shapes (v2 after revert) (#121910)
Creating this after [PR](https://github.com/pytorch/pytorch/pull/121642) got reverted.

The current dynamic shapes implementation fixes the lower bound of Dims to 2 for analysis, but allows 0/1 shapes at runtime. This leads to failures when initializing Dim(1, 2). This PR sets the lower bound to 0 and avoids erroring out when it conflicts with the generated (2, maxsize) constraint during analysis.

Also resolves a derived dim constraints issue with the following code:
```
class Bar(torch.nn.Module):
    def forward(self, x, y):
        return x + y[1:]

dx = Dim("dx", min=1, max=3)
ep = export(
    Bar(),
    (torch.randn(2, 2), torch.randn(3, 2)),
    dynamic_shapes=({0: dx, 1: None}, {0: dx+1, 1: None})
)
print(ep.range_constraints)
```

In main:
```
{s0: ValueRanges(lower=2, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=3, upper=4, is_bool=False)}
```

This PR:
```
{s0: ValueRanges(lower=1, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=2, upper=4, is_bool=False)}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121910
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2024-03-19 19:08:05 +00:00
d56ab7b020 Revert "[torch export][serialize] create a more compact stacktrace format for serialization (#121675)"
This reverts commit eae89138d891d0310483c4d86dcb69b16de0a6b5.

Reverted https://github.com/pytorch/pytorch/pull/121675 on behalf of https://github.com/jeanschmidt due to It seems that this PR broke lint jobs, I am reverting to confirm if this is the case ([comment](https://github.com/pytorch/pytorch/pull/121675#issuecomment-2007919486))
2024-03-19 19:02:09 +00:00
36e5c1dcab Revert "Teach dynamo about torch.func.jvp (#119926)"
This reverts commit edd04b7c16cc6715411119bb7db234a9df59065f.

Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/jeanschmidt due to lots of breakages in pull jobs, checking if reverting this one will help ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2007915919))
2024-03-19 18:59:46 +00:00
88999674a0 Revert "Update jvp to support symbolic execution. (#120338)"
This reverts commit 39877abee2c3ad1956013d467b0f6e86cd20acfb.

Reverted https://github.com/pytorch/pytorch/pull/120338 on behalf of https://github.com/jeanschmidt due to lots of breakages in pull jobs, checking if reverting this one will help ([comment](https://github.com/pytorch/pytorch/pull/120338#issuecomment-2007898831))
2024-03-19 18:50:12 +00:00
e0d57001ef [codemod] Remove unused variables in caffe2/caffe2/experiments/operators/fully_connected_op_prune.h (#122165)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: dmm-fb

Differential Revision: D54380402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122165
Approved by: https://github.com/Skylion007
2024-03-19 18:41:16 +00:00
6bd2d12bc7 release gil in prepareProfiler (#121949)
Initializing the profiler while holding the GIL can lead to deadlocks, as it makes some presumably synchronizing CUDA calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121949
Approved by: https://github.com/aaronenyeshi
2024-03-19 18:05:21 +00:00
7fb2d69282 [PT2] - Fix cat backwards wrapping on symints (#121527)
Summary:
Wrapping was comparing SymInts and ints, forcing a guard. Rewrite it with TORCH_GUARD_SIZE_OBLIVIOUS.
```
[trainer0|0]:  File "<invalid>", line 0, in THPEngine_run_backward(_object*, _object*, _object*)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&, std::vector<at::Tensor, std::allocator<at::Tensor>> const&, bool, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&, std::vector<at::Tensor, std::allocator<at::Tensor>> const&, bool, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor>>&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::generated::CatBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor>>&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::generated::details::cat_tensors_backward(at::Tensor const&, std::vector<std::vector<c10::SymInt, std::allocator<c10::SymInt>>, std::allocator<std::vector<c10::SymInt, std::allocator<c10::SymInt>>>> const&, std::vector<c10::ScalarType, std::allocator<c10::ScalarType>> const&, long)
[trainer0|0]:  File "<invalid>", line 0, in c10::operator==(c10::SymInt const&, int)
[trainer0|0]:  File "<invalid>", line 0, in c10::SymBool::guard_bool(char const*, long) const
[trainer0|0]:  File "<invalid>", line 0, in torch::impl::PythonSymNodeImpl::guard_bool(char const*, long)
```

Test Plan: Regular CI

Differential Revision: D54667300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121527
Approved by: https://github.com/ezyang
2024-03-19 18:03:02 +00:00
8de4d86479 Back out "[fx] Preserve Fx graph node order in partitioner across runs (#115621)" (#122113)
Summary:
Original commit changeset: 6578f47abfdb

Original Phabricator Diff: D54913931

Differential Revision: D55027171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122113
Approved by: https://github.com/osalpekar
2024-03-19 18:00:37 +00:00
eae89138d8 [torch export][serialize] create a more compact stacktrace format for serialization (#121675)
Summary:
- we want fx nodes' stack trace format in the exported program to stay backward compatible and the same as before
- however, in the serialized format we want a more compact stack_trace representation, otherwise the node attributes are dominated by stack traces
- this diff implements the minimal change in the serialization process to dedupe node stack traces by introducing a fileinfo_list and a filename_to_abbrev map, so an index can represent each filename and a lineno can represent each line (see the sketch below)
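
A minimal sketch of the deduplication idea (helper name and data layout are illustrative, not the actual serialization schema):

```python
def compact_stack_traces(stack_traces):
    # stack_traces: iterable of (filename, lineno) pairs
    fileinfo_list = []       # ordered list of unique filenames
    filename_to_abbrev = {}  # filename -> index into fileinfo_list
    compact = []
    for filename, lineno in stack_traces:
        if filename not in filename_to_abbrev:
            filename_to_abbrev[filename] = len(fileinfo_list)
            fileinfo_list.append(filename)
        compact.append((filename_to_abbrev[filename], lineno))
    return fileinfo_list, compact
```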

Test Plan:
# llm
base on D54497918
```
buck2 run @//mode/dev-nosan fbcode//executorch/examples/models/llama2:export_llama -- -c ~/stories110M.pt -p ~/params.json
```
set up breakpoint after serialization/deserialization
- serialize
```
(Pdb) v_meta = [n.meta for n in exported_program.graph_module.graph.nodes]
(Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(v_meta)).number
1193647450
(Pdb) json_program = json.dumps(_dataclass_to_dict(serialized_graph.co_fileinfo_ordered_list),cls=EnumEncoder)
(Pdb) json_bytes = json_program.encode('utf-8')
(Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(json_bytes)).number
1193604333
(Pdb) sys.getsizeof(json_bytes)
3846
(Pdb) compressed_bytes = zstd.ZstdCompressor().compress(json_bytes)
(Pdb) sys.getsizeof(compressed_bytes)
1139
```
in P1193647450 (before serialization), search for `stack_trace`
in P1193604333 (after serialization), search for `stack_trace` and `co_fileinfo_ordered_list`

[note: didn't do compression in this diff since the size is pretty small and it adds complexity if we do compression]
- deserialize
```
(Pdb) v_meta = [n.meta for n in deserialized_exported_program.graph_module.graph.nodes]
(Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(v_meta)).number
1193629435
```
in P1193629435, search for `stack_trace`

# ads

Differential Revision: D54654443

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121675
Approved by: https://github.com/angelayi
2024-03-19 17:58:12 +00:00
eqy
271b12c790 [Functorch] Bump tolerances for test_per_sample_grads_embeddingnet_mechanism_functional_call_cuda (#122014)
the `rtol` was indeed a problem on Grace Hopper

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122014
Approved by: https://github.com/zou3519
2024-03-19 17:52:39 +00:00
ba9a1d96a4 Add scuba logging for TorchScript usage (#121936)
Summary: Infra to log live usage of TorchScript internally

Test Plan: manually tested

Differential Revision: D54923510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121936
Approved by: https://github.com/zhxchen17
2024-03-19 17:38:27 +00:00
4819da60ab [TD] Add LLM retrieval + heuristic (#121836)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121836
Approved by: https://github.com/osalpekar
2024-03-19 17:31:47 +00:00
cec0fd6f2f [pt2] add symbolic shape support for decompose mm and expose max_block to user config (#121440)
Summary:
1) As described in https://fb.workplace.com/groups/1075192433118967/permalink/1381918665779674/
As a follow-up, we can increase max_block["y"] to solve the issue
2) Add symbolic shape support for the decompose mm pass. I did not find a good way to compare a SymInt with an int, so when there is a symbolic shape, I assume it is a "large" dim (see the sketch below).
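
A minimal sketch of the heuristic in 2) (not the actual inductor pass; the helper name and threshold are illustrative):

```python
import torch

def is_large_dim(dim, threshold: int = 1024) -> bool:
    # Comparing a SymInt with an int may force a guard, so treat any
    # symbolic dimension as "large" for the decompose-mm decision.
    if isinstance(dim, torch.SymInt):
        return True
    return dim >= threshold
```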

Test Plan:
Without change block: aps-pt2-7c23cea900

increase y_block: aps-pt2_dynamic_shape-25a027423c

Differential Revision: D54525453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121440
Approved by: https://github.com/mengluy0125, https://github.com/Yuzhen11
2024-03-19 17:31:16 +00:00
764eae9c4e Revert "Add Flash Attention support on ROCM (#121561)"
This reverts commit a37e22de7059d06b75e4602f0568c3154076718a.

Reverted https://github.com/pytorch/pytorch/pull/121561 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this needs more work to be able to land in fbcode because https://github.com/ROCm/aotriton is not available there atm.  We are working to reland this change before 2.3 release ([comment](https://github.com/pytorch/pytorch/pull/121561#issuecomment-2007717091))
2024-03-19 17:14:28 +00:00
88ebdbc97c [dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)
Fixes #114844

In the linked issue we have
```
compiled_module = torch.compile(module)
compiled_module.x = ...
compiled_module(...)  # Mutates self.x
```
Where since the module mutates `self.x` you would expect `compiled_module.x`
to be updated but actually `compiled_module.x = ...` sets an attribute "x"
on the `OptimizedModule` object while the forward method of the module mutates
`module.x`.

This gives the expected behavior by forwarding `compiled_module.__setattr__`
down to `module.__setattr__`. There is already a corresponding `__getattr__`
so now `compiled_module.x` becomes an alias for `module.x`.
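
A minimal, simplified sketch of the forwarding idea (illustrative only, not the actual `torch._dynamo.OptimizedModule` implementation):

```python
import torch

class Wrapper:
    def __init__(self, mod):
        # bypass our own __setattr__ while storing the wrapped module
        object.__setattr__(self, "_orig_mod", mod)

    def __getattr__(self, name):
        return getattr(self._orig_mod, name)

    def __setattr__(self, name, value):
        # forward writes to the wrapped module so wrapper.x aliases mod.x
        setattr(self._orig_mod, name, value)

mod = torch.nn.Linear(2, 2)
w = Wrapper(mod)
w.x = 1
assert mod.x == 1
```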

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-03-19 16:51:43 +00:00
2164b7f746 Flatten/Unflatten micro optimization in proxy_tensor.py (#121993)
Lowers compile time by 1s across all suites on average
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121993
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/zou3519
2024-03-19 16:49:28 +00:00
42624bceb6 Fixes nan with large bf16 values (#122135)
Fixes #121558

Performance on main:
``` Markdown
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
| batch_size | num_heads | q_seq_len | kv_seq_len | embed_dim | is_causal |     dtype      |    forward_time    |   backward_time    |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
|     1      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.608132004970683 | 65.90210803551601  |
|     1      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.75877740024589  | 64.83824399765581  |
|     1      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 16.465420153690506 |  67.6770955324173  |
|     1      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 17.398148600477725 | 68.19829455344006  |
|     1      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 29.053532000398263 | 99.58901099162175  |
|     1      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 |  27.826815698063   | 98.05690299253911  |
|     1      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 49.89655229728669  | 178.24282555375248 |
|     1      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 48.840098950313404 | 174.5950729819015  |
|     1      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 505.66218036692584 | 1865.9265094902366 |
|     1      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 295.0534054543823  | 967.3831606050952  |
|     1      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.496030446141958 | 55.11070846114308  |
|     1      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.47399884648621  | 55.452342028729625 |
|     1      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 13.216444296995178 | 55.14447903260589  |
|     1      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 12.763233599252999 | 55.142355500720434 |
|     1      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 19.409965351223946 |  74.9107634765096  |
|     1      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 19.02470579952933  | 74.84168506925926  |
|     1      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 46.37695319834165  | 172.19150450546294 |
|     1      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 45.225963747361675 | 185.19691249821335 |
|     1      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 634.3090848531574  | 2249.057865119539  |
|     1      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 320.47313248040155 | 1053.0515247955916 |
|     4      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 13.448987301671878 | 63.63581650657579  |
|     4      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 12.509283400140703 | 63.059300999157124 |
|     4      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 19.71098779467866  | 105.55780201684684 |
|     4      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 18.264925852417946 | 105.12311349157244 |
|     4      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 45.218703348655254 | 222.87272597895935 |
|     4      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 43.55393464793451  | 230.63290398567915 |
|     4      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 134.02968645095825 | 514.6893998607993  |
|     4      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 157.13709802366793 | 624.5892751030624  |
|     4      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 1776.7079547047617 | 6353.551096981391  |
|     4      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1143.6000745743513 | 3811.8767354171723 |
|     4      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.717129248427227 | 55.35991647047922  |
|     4      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.746983398916198 | 55.76716404175386  |
|     4      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 17.255573300644752 | 106.47456656442955 |
|     4      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 16.46409669774584  | 108.07770595420152 |
|     4      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 46.63354124641045  | 213.74862996162847 |
|     4      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 47.01801469782367  | 240.78139301855117 |
|     4      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 127.76448752265424 | 508.08745552785695 |
|     4      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 168.6308984644711  | 667.2996102133766  |
|     4      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 2268.1598202325404 | 7727.2648515645415 |
|     4      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1242.8469699807465 | 4161.965740495361  |
|     8      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 14.340955897932872 | 93.72280450770633  |
|     8      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 13.25262250029482  |  93.2030284893699  |
|     8      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 27.598425600444898 | 183.23776399483904 |
|     8      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 26.362583553418514 | 183.51862096460536 |
|     8      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 84.52303148806094  | 383.50319798337296 |
|     8      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 89.41743348259479  | 432.5502900755964  |
|     8      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 217.76640450116247 | 943.9354750793427  |
|     8      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 303.0781910638325  | 1225.4394043702632 |
|     8      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 3470.8542854059488 | 12194.579601055011 |
|     8      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2268.1174043100327 | 7608.0941944383085 |
|     8      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.289720651460811 | 95.88620596332476  |
|     8      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.618648946750909 | 95.56685149436818  |
|     8      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 31.567946751601994 | 180.62468653079122 |
|     8      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 28.611703700153157 | 189.4215695792809  |
|     8      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 84.11306998459621  | 385.25596749968827 |
|     8      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 93.82540901424363  | 455.77428903197875 |
|     8      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 226.80530551588163 | 965.8026450779289  |
|     8      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 327.4116570246406  | 1312.5067745568228 |
|     8      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 4445.5064804060385 | 15020.768146496266 |
|     8      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2433.0302356975153 | 8300.016750581563  |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
```

Performance on this branch:
```Markdown
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
| batch_size | num_heads | q_seq_len | kv_seq_len | embed_dim | is_causal |     dtype      |    forward_time    |   backward_time    |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
|     1      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.783618393586949 | 65.59692794689909  |
|     1      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 12.064015300711617 | 56.99719698168337  |
|     1      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 16.629025398287922 | 68.65267595276237  |
|     1      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 17.462356004398313 | 68.35797848179936  |
|     1      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 |  29.5476081490051  | 101.22994752600789 |
|     1      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 28.395320149138573 | 98.62275794148445  |
|     1      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 50.50016101449728  | 181.4357690163888  |
|     1      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 49.450615647947416 | 175.86063902126625 |
|     1      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 506.06461532879626 | 1866.0613044630736 |
|     1      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 299.9336270149797  | 976.4662646921353  |
|     1      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.45752210286446  | 58.79682704107836  |
|     1      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.407129396684468 | 58.14061599085107  |
|     1      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 13.822759891627355 | 56.56979401828722  |
|     1      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 13.39154909946956  |  56.7130644340068  |
|     1      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 20.282494352431968 | 77.29688903782517  |
|     1      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 19.899454596452415 |  75.4446149803698  |
|     1      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 48.494275606935844 | 177.5322465109639  |
|     1      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 46.84524350450374  | 189.1778860008344  |
|     1      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 635.1026654010639  | 2248.0451600858937 |
|     1      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 335.1591735263355  | 1080.4320796160027 |
|     4      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 13.63953539985232  | 65.50709309522063  |
|     4      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 12.858113402035087 | 63.021871959790595 |
|     4      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 19.98318645055406  | 105.87883047992364 |
|     4      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 18.619045056402683 | 104.90188701078296 |
|     4      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 45.91175540117546  | 226.00732848513871 |
|     4      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 44.39614630537107  | 232.39317198749632 |
|     4      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 135.5409600073472  | 522.7949097752571  |
|     4      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 158.79383607534692 | 628.5856699105352  |
|     4      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 1775.9978299727663 | 6343.203847063706  |
|     4      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1160.680354805663  | 3842.235009651631  |
|     4      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.553713708417488 | 65.50691701704638  |
|     4      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.486379051348194 |  56.9980075233616  |
|     4      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 17.56585600087419  | 107.89892700267956 |
|     4      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 16.828144202008843 | 109.05519902007653 |
|     4      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 48.23235589428805  | 217.8974545095116  |
|     4      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 49.09284680034033  | 244.73925953498107 |
|     4      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 134.77827049791813 | 522.7259948151186  |
|     4      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 176.60772847011688 | 681.5171707421541  |
|     4      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 2267.821540008299  | 7720.425300067291  |
|     4      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1295.3941145678982 | 4272.425139788538  |
|     8      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 14.514714101096615 |  94.2192979855463  |
|     8      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 13.553097198018804 |  93.244242540095   |
|     8      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 27.95821905019693  | 185.0469880155288  |
|     8      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 26.709681446664035 | 184.22623950755226 |
|     8      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 85.85420495364815  | 388.3417735341937  |
|     8      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 89.97473795898259  | 434.4228169647977  |
|     8      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 220.6919804448262  | 958.9654899900779  |
|     8      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 306.55586952343583 | 1233.2170095760375 |
|     8      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 3470.7326447824016 | 12183.611298678443 |
|     8      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2299.064100370742  | 7669.618452200666  |
|     8      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.427107692928985 | 96.96270158747211  |
|     8      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.856995843118057 | 96.38117247959599  |
|     8      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 |  32.9956392000895  | 182.52741603646427 |
|     8      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 29.397601098753512 | 191.0755339777097  |
|     8      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 89.06024845782667  | 392.2585004474967  |
|     8      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 97.78487798757851  | 462.07307645818213 |
|     8      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 |  240.521906001959  | 992.4693452194335  |
|     8      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 341.98952303268015 | 1339.2950996058062 |
|     8      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 4445.311005110853  | 15001.030603889374 |
|     8      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2535.9767401823774 | 8528.990152990447  |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
```

```
{'avg_forward_time_nan_fix': 399.7900972732653,
 'avg_backward_time_nan_fix': 1409.652114014413,
 'avg_forward_time_main_branch': 394.6807206988645,
 'avg_backward_time_main_branch': 1399.4055472857629,
 'geo_mean_nan_fix': 150.95049601244946,
 'geo_mean_main_branch': 148.3381648508822}
 ```

The y axis label is wrong (the values are in microseconds), but the relative comparison still holds.
<img width="790" alt="Screenshot 2024-03-18 at 3 34 15 PM" src="https://github.com/pytorch/pytorch/assets/32754868/ca278c15-b815-4535-bdcd-07e522055466">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122135
Approved by: https://github.com/cpuhrsch
2024-03-19 16:32:00 +00:00
e26280ad8b Fix typing for autograd.Function with ctx-less forward (#122167)
Previously, typing an autograd.Function like the following would lead to
a mypy error (which expects the first arg to forward to be named `ctx`).

This PR fixes that by deleting the ctx arg.

```py
class MySin(torch.autograd.Function):
    @staticmethod
    def forward(x: torch.Tensor) -> torch.Tensor:
        return x.sin()

    @staticmethod
    def setup_context(*args, **kwargs):
        pass

    @staticmethod
    def backward(ctx, grad):
        if grad.stride(0) > 1:
            return grad.sin()
        return grad.cos()
```

Test Plan:
- tested locally (I don't know how to put up a test in CI for this).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122167
Approved by: https://github.com/soulitzer
2024-03-19 16:15:23 +00:00
f9ed1c432d Revert "Refactor gpu trace to be device-agnostic (#121794)"
This reverts commit 0ff1109e2688b8c841c9dd0eeecfba16f027b049.

Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/jeanschmidt due to Reverting to see if rocm trunk errors are related ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2007519408))
2024-03-19 15:40:26 +00:00
c05bf0037d [dynamo] Remove copy_graphstate/restore_graphstate (#122067)
Some dead code cleanup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122067
Approved by: https://github.com/oulgen
2024-03-19 15:37:53 +00:00
7673cb534a Revert "Skip nonzero unbacked SymInt memo in inference mode (#122147)"
This reverts commit 5e2687391229cee6e4dc0214f9208b4ecbe058c1.

Reverted https://github.com/pytorch/pytorch/pull/122147 on behalf of https://github.com/jeanschmidt due to Reverting to see if trunk error in inductor are related ([comment](https://github.com/pytorch/pytorch/pull/122147#issuecomment-2007513000))
2024-03-19 15:37:24 +00:00
cyy
6c01c25319 [Clang-tidy header][28/N] Fix clang-tidy warnings in aten/src/ATen/core/*.h (#122175)
This PR fixes various clang-tidy warnings on aten/src/ATen/core/*.h, following https://github.com/pytorch/pytorch/pull/122023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122175
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-03-19 14:08:54 +00:00
6c50308801 [ATen-Vulkan][EZ] Small fixes: fix gpu size calculation and Half scalartype ctype mapping (#122096)
Summary:
## Context

Some small fixes to the ATen-Vulkan backend.

The first is that GPU sizes for a 4 dimensional tensor with width packing had a small bug:

```
      case 4:
        switch (memory_layout) {
          case api::GPUMemoryLayout::TENSOR_WIDTH_PACKED:
            gpu_sizes.at(0) = sizes.at(0);
            gpu_sizes.at(1) = sizes.at(1);
            // should be gpu_sizes.at(2) == sizes.at(2)
            gpu_sizes.at(2) = sizes.at(3);
            gpu_sizes.at(3) = api::utils::align_up(sizes.at(3), INT64_C(4));
            break;
```

This was fixed by simplifying the logic of GPU size calculation for texture storage.

The second was to modify the ctype mapping of the `api::kHalf` scalar type to be `float` instead of `unsigned short`. This is because GLSL does not natively support `float16`, so even with an FP16 texture type, CPU/GPU transfer shaders will have to read from and write to `float` buffers.

In the future, we will look into integrating [VK_KHR_shader_float16_int8](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_KHR_shader_float16_int8.html) into ATen-Vulkan to allow for 16 bit and 8 bit types to be referenced explicitly.

Test Plan: CI

Differential Revision: D55018171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122096
Approved by: https://github.com/jorgep31415
2024-03-19 13:27:27 +00:00
39877abee2 Update jvp to support symbolic execution. (#120338)
Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions.

List of changes:
- Update `_has_same_storage_numel` to use `sym_nbytes`
- Symintify `_efficientzerotensor_meta`
- Introduce `empty_generic_symint` with the first argument `size` as symbolic integer
- Update gen_variable_type.py script to call the symint version of zeros_fn function (zeros_symint / _efficientzerotensor_symint)
- Update `has_same_meta` to call `sym_*` functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338
Approved by: https://github.com/soulitzer
ghstack dependencies: #119926
2024-03-19 13:06:42 +00:00
edd04b7c16 Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-19 13:06:42 +00:00
6b5259e507 [lint] bump lint dependency PyYAML to 6.0.1 to support Python 3.12 (#122022)
[PyYAML 6.0.0](https://pypi.org/project/PyYAML/6.0) was released 2.5 years ago and it is not installable with Python 3.12.

This PR bumps the version of [PyYAML to 6.0.1](https://pypi.org/project/PyYAML/6.0.1) in `lintrunner` configuration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122022
Approved by: https://github.com/Skylion007
2024-03-19 12:23:49 +00:00
8168338063 Add CPU implementation for torch._int_mm (s8*s8->s32) (#121792)
Fixes #121647

**Description**
Currently, the op `torch._int_mm` only supports CUDA devices. This PR adds a CPU implementation for it.
Besides the request from the issue, this op may also be useful for planned CPU implementations of [LLM.int8()](https://arxiv.org/abs/2208.07339) in [Bitsandbytes](https://github.com/TimDettmers/bitsandbytes).

The implementation prefers mkldnn (oneDNN) kernels. If mkldnn is not available, a reference implementation with nested for loops is used.
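
A hedged usage sketch (shapes chosen arbitrarily; backend-specific shape constraints may still apply):

```python
import torch

a = torch.randint(-128, 128, (16, 32), dtype=torch.int8)
b = torch.randint(-128, 128, (32, 8), dtype=torch.int8)
c = torch._int_mm(a, b)  # int8 x int8 matmul accumulating into int32
assert c.dtype == torch.int32
```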

**Test plan**
`python test/test_linalg.py -k test__int_mm_cpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121792
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-03-19 08:44:33 +00:00
0d845f7b07 Fix auto_functionalize (#121990)
Differential Revision: D54964130

When we re-export, the auto_functionalize HOP will be in the graph. Therefore, we need to implement a proper functionalization rule for it. Since the content inside auto_functionalize is guaranteed to be functional, it is OK to just fall through it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121990
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-03-19 07:11:11 +00:00
a2a88f39ee Avoid COW materialize in conv, log sigmoid, repeat, group_norm, batch_norm (#121537)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121537
Approved by: https://github.com/ezyang
2024-03-19 06:15:00 +00:00
0ff1109e26 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor GPU trace to be device-agnostic. GPU trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic and shareable among device backends.

# Solution
Move `_cuda_trace.py` to `_gpu_trace.py`, so that each device backend owns its own callbacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-19 06:02:28 +00:00
09ce76809c Improve compiler detection on MacOS (#121406)
By relying on the `is_apple_clang` helper function rather than on the compiler name (as `gcc` is clang on macOS):
```
% which gcc; gcc -v
/usr/bin/gcc
Apple clang version 15.0.0 (clang-1500.3.9.4)
Target: arm64-apple-darwin23.3.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
```
But
```
% /opt/homebrew/bin/gcc-13 -v
Using built-in specs.
COLLECT_GCC=/opt/homebrew/bin/gcc-13
COLLECT_LTO_WRAPPER=/opt/homebrew/Cellar/gcc/13.2.0/bin/../libexec/gcc/aarch64-apple-darwin23/13/lto-wrapper
Target: aarch64-apple-darwin23
Configured with: ../configure --prefix=/opt/homebrew/opt/gcc --libdir=/opt/homebrew/opt/gcc/lib/gcc/current --disable-nls --enable-checking=release --with-gcc-major-version-only --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-13 --with-gmp=/opt/homebrew/opt/gmp --with-mpfr=/opt/homebrew/opt/mpfr --with-mpc=/opt/homebrew/opt/libmpc --with-isl=/opt/homebrew/opt/isl --with-zstd=/opt/homebrew/opt/zstd --with-pkgversion='Homebrew GCC 13.2.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --with-system-zlib --build=aarch64-apple-darwin23 --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.0 (Homebrew GCC 13.2.0)
```

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121406
Approved by: https://github.com/malfet, https://github.com/jansel
2024-03-19 05:32:08 +00:00
FEI
8499767e96 add sdpa choice for DeviceType::PrivateUse1 (#121409)
Fixes  #116854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121409
Approved by: https://github.com/drisspg
2024-03-19 05:08:46 +00:00
5bc7f7f977 [dynamo] Make tx.next_instruction lazy (#122066)
Improves benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py
from 2.5s to 2.4s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122066
Approved by: https://github.com/oulgen, https://github.com/anijain2305
ghstack dependencies: #122039, #122043, #122055, #122058, #122060, #122063
2024-03-19 04:23:30 +00:00
153a01833b [dynamo] Optimize SourcelessBuilder (#122063)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 2.7s to 2.5s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122063
Approved by: https://github.com/anijain2305
ghstack dependencies: #122039, #122043, #122055, #122058, #122060
2024-03-19 04:23:30 +00:00
8082adcf65 [dynamo] Only rename a proxy once (#122060)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 3.9s to 2.7s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122060
Approved by: https://github.com/oulgen
ghstack dependencies: #122039, #122043, #122055, #122058
2024-03-19 04:23:27 +00:00
2bec55c5f9 [dynamo] Remove VariableTracker.parents_tracker (#122058)
This is leftover from mutable variable tracker days and no longer needed.

Improves benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py
from 4.2s to 3.9s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122058
Approved by: https://github.com/oulgen, https://github.com/anijain2305
ghstack dependencies: #122039, #122043, #122055
2024-03-19 04:23:24 +00:00
3c706bf483 [dynamo] Optimize BuiltinVariable (#122055)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 5.1s to 4.2s (compared to 2 PRs ago).

This works by precomputing (and caching) the parts of `BuiltinVariable.call_function` that don't depend on the values of args/kwargs.
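
A generic, hedged illustration of the caching idea (names are illustrative, not dynamo's actual code):

```python
import functools
import operator

@functools.lru_cache(maxsize=None)
def _lookup_handler(op_name, lhs_type, rhs_type):
    # the arg-value-independent part: computed once per (op, type, type) key
    return getattr(operator, op_name)

def call_builtin(op_name, lhs, rhs):
    handler = _lookup_handler(op_name, type(lhs), type(rhs))
    return handler(lhs, rhs)

assert call_builtin("add", 1, 2) == 3
```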

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122055
Approved by: https://github.com/oulgen, https://github.com/anijain2305
ghstack dependencies: #122039, #122043
2024-03-19 04:23:20 +00:00
07caea5c12 [dynamo] Refactor COMPARE_OP and comparison builtins (#122043)
This removes the duplicate handling of comparison ops between symbolic_convert and builtin and refactors the handling to use the binop infrastructure.  This change regresses overheads a bit, but this is fixed in the next PR.

New test skips are variants of `type(e) is np.ndarray` previously falling back to eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122043
Approved by: https://github.com/anijain2305
ghstack dependencies: #122039
2024-03-19 04:23:17 +00:00
769ff86b91 [dynamo] Optimize COMPARE_OP (#122039)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 5.6 to 5.1s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122039
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2024-03-19 04:23:14 +00:00
cyy
e1706bba3b [Clang-tidy header][27/N] Fix clang-tidy warnings in aten/src/ATen/core/*.h (#122023)
This PR fixes various clang-tidy warnings on aten/src/ATen/core/*.h, following #122015
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122023
Approved by: https://github.com/ezyang
2024-03-19 03:26:15 +00:00
5e26873912 Skip nonzero unbacked SymInt memo in inference mode (#122147)
Summary: In `torch.inference_mode()`, fake tensors don't have `_version`s. This breaks unbacked SymInt memoization in `torch.nonzero` tracing. Here we disable the latter in inference mode.
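
A hedged sketch of the scenario (not the actual test; the config flag is assumed to be needed to capture data-dependent output shapes without a graph break):

```python
import torch
import torch._dynamo.config as dynamo_config

dynamo_config.capture_dynamic_output_shape_ops = True  # assumption, see above

@torch.compile(fullgraph=True)
def f(x):
    # nonzero's output length is data-dependent -> unbacked SymInt when traced
    return torch.nonzero(x)

with torch.inference_mode():
    print(f(torch.tensor([0.0, 1.0, 0.0, 2.0])))
```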

Test Plan:

```
$ python test/inductor/test_unbacked_symints.py -k test_nonzero_in_inference_mode
...
----------------------------------------------------------------------
Ran 2 tests in 14.060s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122147
Approved by: https://github.com/ezyang
2024-03-19 03:20:33 +00:00
8860c625ea [dynamo][guards-cpp-refactor] Integrate cpp guard manager with CheckFnManager (#120726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120726
Approved by: https://github.com/jansel
2024-03-19 03:11:31 +00:00
f84d560236 [dynamo] Raise accumulated cache size limit (#122130)
Fixes #114511

This was raised by IBM folks, where an LLM compile was failing because the model had more than 64 layers.
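
A hedged illustration (the config knob name is taken from `torch._dynamo.config` as I understand it; treat it as an assumption):

```python
import torch._dynamo.config as dynamo_config

# Models with many compiled layers/sub-modules may need a higher limit to
# avoid falling back to eager once the accumulated cache fills up.
dynamo_config.accumulated_cache_size_limit = 256
```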

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122130
Approved by: https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #121954, #122005
2024-03-19 02:35:48 +00:00
7084528eb9 [dynamo][model_output] Do not include none for CustomizedDictVariable (#122005)
Fixes https://github.com/pytorch/pytorch/issues/120923

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122005
Approved by: https://github.com/weifengpy, https://github.com/jansel
ghstack dependencies: #121954
2024-03-19 02:35:48 +00:00
2b06098380 Enable x86 CPU vectorization on windows [submodule sleef] (#118980)
Enable VEC on Windows OS.
1. Fix some type definition gaps between Windows and Linux.
2. Fix some operators not supported on Windows, such as [] and /.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.
5. Upgrade the sleef submodule, which fixes a build issue on Windows.
6. Fix bazel build issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
2024-03-19 02:22:04 +00:00
6502c888cf Enable fx graph cache in torch_test.py when using PYTORCH_TEST_WITH_INDUCTOR=1 (#122010)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122010
Approved by: https://github.com/eellison
2024-03-19 02:17:10 +00:00
18d94d7165 Make FX nodes sortable (#122071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122071
Approved by: https://github.com/oulgen
2024-03-19 01:40:56 +00:00
1f4d4d3b78 [fx] preserver partiioner order fix (#122111)
Summary:
The previous implementation seems to introduce a key-value pair of {"node": None}. This causes an error in logging later on, because we extract the name from "node" but it is a string instead of a torch.fx.Node.

With this change, the tests seem to pass.

Test Plan:
CI

ExecuTorch CI:
buck test mode/dev-nosan //executorch/backends/xnnpack/test:test_xnnpack_models

Reviewed By: larryliu0820

Differential Revision: D55026133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122111
Approved by: https://github.com/mikekgfb
2024-03-19 01:00:44 +00:00
34f36a28df [MPS] Fwd-fix for clamp regression (#122148)
Forward fix for regressions introduced by https://github.com/pytorch/pytorch/pull/121381 as we failed to run MPS CI twice on it

- Do not call `minimumWithNaNPropagationWithPrimaryTensor` for integral tensors as it will crash with
  ```
    /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSKernelDAG.mm:805: failed assertion `Error getting visible function: (null) Function isNaN_i16_i8 was not found in the library'
   ```
- Change the order of the max and min calls, as it is apparently important for
  consistency: `min(max(a, b), c)` might not equal `max(min(a, c), b)` if `c` is not always less than or equal to `b` (see the snippet below)
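
A small numeric snippet showing why the order matters when the bounds are inconsistent (hi < lo):

```python
a, lo, hi = 5, 2, 1
print(min(max(a, lo), hi))  # 1
print(max(min(a, hi), lo))  # 2
```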

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122148
Approved by: https://github.com/huydhn
2024-03-19 00:52:45 +00:00
ae983d2d6e Fix typo in sparse.rst (#121826)
Change word "on" to "one" when talking in the third person.

Fixes #121770
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121826
Approved by: https://github.com/janeyx99
2024-03-19 00:17:19 +00:00
e6cf3e90a5 [AOTAutograd / Functionalization] Fix incorrect expand_inverse (#122114)
This is a rebase of https://github.com/pytorch/pytorch/pull/114538,
originally submitted by @jon-chuang.

Fixes #114302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122114
Approved by: https://github.com/bdhirsh
2024-03-18 22:52:57 +00:00
ba69dc6675 [Easy] add option to print compilation time (#121996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121996
Approved by: https://github.com/davidberard98
2024-03-18 22:42:41 +00:00
2ab8b34433 Error out in case of in-source builds (#122037)
Such builds cannot succeed, as the arch-specific ATen dispatch mechanism creates temporary files that get added to the build system on every rebuild, which results in build failures.

Fixes https://github.com/pytorch/pytorch/issues/121507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122037
Approved by: https://github.com/PaliC, https://github.com/kit1980
2024-03-18 21:48:18 +00:00
e6a461119a [functorch] Add batch rule for linalg.lu_unpack (#121811)
Fixes: https://github.com/pytorch/pytorch/issues/119998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121811
Approved by: https://github.com/peterbell10, https://github.com/zou3519
2024-03-18 21:24:16 +00:00
773ae817f7 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-18 21:01:30 +00:00
a17cd226d6 [inductor] Enable FX graph caching on another round of inductor tests (#121994)
Summary: Enabling caching for these tests was blocked by https://github.com/pytorch/pytorch/pull/121686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121994
Approved by: https://github.com/eellison
2024-03-18 20:55:18 +00:00
7c5e29ae71 Back out "Support triton.language.dtype with torch.compile (#121690)" (#122108)
Summary: Some hard-to-deal-with package import/export related problems. Let's revert and start with a clean slate.

Test Plan: CI

Differential Revision: D55024877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122108
Approved by: https://github.com/ezyang
2024-03-18 20:50:28 +00:00
685ace3834 [compiled autograd] add dynamo segfault test (#122004)
To catch issues like https://github.com/pytorch/pytorch/issues/121862 in CI. This passes because we reverted the PRs, and https://github.com/pytorch/pytorch/pull/121870 confirms that this test can catch it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122004
Approved by: https://github.com/eellison
2024-03-18 20:07:15 +00:00
40acc84aaf Fix torch.clamp in MPS to handle NaN correctly (#121381)
Fixes #120899

So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381
Approved by: https://github.com/malfet
2024-03-18 19:38:15 +00:00
0a1b3be216 chore: add unit test to verify split_by_tags output_type (#121262)
Add a test case as per https://github.com/pytorch/pytorch/pull/120361#issuecomment-1979163324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121262
Approved by: https://github.com/atalman
2024-03-18 19:19:26 +00:00
676a77177e Revert "[BE] Migrate pull.yml to use S3 pytorch-ci-artifacts bucket for linux-jammy-py3_8-gcc11 and docs builds/tests (#121908)"
This reverts commit 4cbf963894e78d1cfedffe4f829740dc99163caa.

Reverted https://github.com/pytorch/pytorch/pull/121908 on behalf of https://github.com/jeanschmidt due to this is due to OIDC can't work on forked PR due to token write permissions can't be shared ([comment](https://github.com/pytorch/pytorch/pull/121908#issuecomment-2004707582))
2024-03-18 19:03:11 +00:00
df1cdaedeb Log restart reasons and extra compile time in CompilationMetrics (#121827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121827
Approved by: https://github.com/ezyang, https://github.com/yanboliang
2024-03-18 18:59:25 +00:00
74c09a757b Simplify Storage meta conversion with PyObject preservation (#122018)
Thanks to https://github.com/pytorch/pytorch/pull/109039 we can rely on
finalizers on Storage PyObject to handle removal from dict.

Irritatingly, we still have to attach a finalizer, because we don't have
a dict that is weak in both key AND value (only one or the other).
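
A pure-Python analogue of the idea (hedged; the actual change is at the C++/PyObject level and the names below are illustrative):

```python
import weakref

class FakeStorage:  # stand-in object; real Storages are handled in C++
    pass

meta_cache = {}

def remember(storage, meta):
    key = id(storage)
    meta_cache[key] = meta
    # the finalizer removes the entry once the key object is collected,
    # since a dict weak in both key and value is not available
    weakref.finalize(storage, meta_cache.pop, key, None)

s = FakeStorage()
remember(s, "meta-storage")
del s
assert not meta_cache
```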

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122018
Approved by: https://github.com/eellison, https://github.com/kurtamohler
2024-03-18 18:55:58 +00:00
32410f80ec [Caffe2 CPU tests] Update CMakeLists.txt (#119643)
I was trying to build PyTorch with USE_GLOG=ON (so we could get better timestamps around the nccl logging) and ran into this error

```
[1/7] Linking CXX executable bin/verify_api_visibility
FAILED: bin/verify_api_visibility
: && /opt/rh/gcc-toolset-11/root/usr/bin/c++ -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O2 -g -DNDEBUG -rdynamic     -Wl,--no-as-needed caffe2/CMakeFiles/verify_api_visibility.dir/__/aten/src/ATen/test/verify_api_visibility.cpp.o -o bin/verify_api_visibility -L/lib/intel64   -L/lib/intel64_win   -L/lib/win-x64 -Wl,-rpath,/lib/intel64:/lib/intel64_win:/lib/win-x64:/usr/local/cuda/lib64:/root/conda/lib:/mnt/code/pytorch/build/lib:  lib/libgtest_main.a  -Wl,--no-as-needed,"/mnt/code/pytorch/build/lib/libtorch.so" -Wl,--as-needed  -Wl,--no-as-needed,"/mnt/code/pytorch/build/lib/libtorch_cpu.so" -Wl,--as-needed  lib/libprotobuf.a  /root/conda/lib/libmkl_intel_lp64.so  /root/conda/lib/libmkl_gnu_thread.so  /root/conda/lib/libmkl_core.so  -fopenmp  /usr/lib64/libpthread.so  -lm  /usr/lib64/libdl.so  -Wl,--no-as-needed,"/mnt/code/pytorch/build/lib/libtorch_cuda.so" -Wl,--as-needed  lib/libc10_cuda.so  lib/libc10.so  /root/conda/lib/libglog.so.0.4.0  /root/conda/lib/libgflags.so.2.2.2  -lpthread  /usr/local/cuda/lib64/libcudart.so  /usr/local/cuda/lib64/libnvToolsExt.so  lib/libgtest.a  -pthread && /root/conda/bin/cmake -E __run_co_compile --lwyu="ldd;-u;-r" --source=bin/verify_api_visibility && :
/opt/rh/gcc-toolset-11/root/usr/bin/ld: /mnt/code/pytorch/build/lib/libtorch.so: undefined reference to symbol '_ZTVN10__cxxabiv117__class_type_infoE@@CXXABI_1.3'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: /usr/lib64/libstdc++.so.6: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
```

Adding stdc++ explicitly to the list of libraries to link seems to fix the build, and I was able to get a working build of PyTorch.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119643
Approved by: https://github.com/zdevito
2024-03-18 18:35:32 +00:00
5d52b163d1 [dynamo] Optimize load/store/const op handling (#122038)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 6.7s to 5.6s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122038
Approved by: https://github.com/Skylion007
ghstack dependencies: #122032, #122033, #122034, #122035
2024-03-18 18:08:06 +00:00
4034873a31 [dynamo] Optimize builtin handling (#122035)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 7.3s to 6.7s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122035
Approved by: https://github.com/Skylion007
ghstack dependencies: #122032, #122033, #122034
2024-03-18 18:08:06 +00:00
6ca0323615 [dynamo] Optimize VariableTracker.__post_init__ (#122034)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 8.6s to 7.3s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122034
Approved by: https://github.com/Skylion007
ghstack dependencies: #122032, #122033
2024-03-18 18:08:06 +00:00
115c9c6d6b Remove __getattribute__ on autograd.Function (#122033)
Improves `benchmarks/dynamo/microbenchmarks/overheads.py` from 38.7us to
34.3us.

See #122029
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122033
Approved by: https://github.com/zou3519, https://github.com/soulitzer
ghstack dependencies: #122032
2024-03-18 18:08:06 +00:00
5a10b56083 [dynamo] Small microbenchmark changes (#122032)
Used to generate numbers in #122029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122032
Approved by: https://github.com/yanboliang
2024-03-18 18:08:06 +00:00
1a58e9d357 [TD] LLM indexer to run daily (#121835)
Run indexer daily
Run indexer in docker container

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121835
Approved by: https://github.com/osalpekar, https://github.com/malfet
2024-03-18 16:34:01 +00:00
ceb1910bad Revert "[BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)"
This reverts commit 11b36e163df66196d24fbded4b37ef8f8c032640.

Reverted https://github.com/pytorch/pytorch/pull/121930 on behalf of https://github.com/huydhn due to New action is breaking current ci in not rebased PRs ([comment](https://github.com/pytorch/pytorch/pull/121930#issuecomment-2004393980))
2024-03-18 16:33:23 +00:00
11b36e163d [BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)
Introduce changes related to enable ARC to run on build for linux-jammy-py3.8-gcc11

Depends on:
* https://github.com/pytorch/pytorch/pull/121908
* https://github.com/pytorch/pytorch/pull/121907
* Force docker to update credentials: https://github.com/pytorch/test-infra/pull/4991
* Add permissions to role to access ECR: acc0154aa0
* Add permissions to the role to access relevant S3 bucket: 496b0422c3

## Reasoning for introducing a new `_linux-build-rg.yml`

Old-style `runs-on` definitions accept a string, while the new style requires an object in the format:

```
--- old
...
  runs-on: "linux.2xlarge"
...
--- new
...
  runs-on:
    group: "running-group"
...
```

In other words, specifying a group requires changing the format of the YAML. Unfortunately, there is no way to accomplish this with any trick in the book that I am aware of, because GitHub Actions YAML is not templatable and supports only minimal functions / replacements. A few examples that did not work:
* [`e234f25` (#119544)](e234f25ba1 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`087de4a` (#119544)](087de4ad8b (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`f03512e` (#119544)](f03512e344 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`67581fb` (#119544)](67581fb737 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121930
Approved by: https://github.com/seemethere
2024-03-18 15:40:43 +00:00
c4d24b5b7f special-case cuda array interface of zero size (#121458)
Fixes #98133
retry of #98134
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121458
Approved by: https://github.com/bdice, https://github.com/ptrblck, https://github.com/mikaylagawarecki
2024-03-18 15:21:38 +00:00
f7908d9fa8 enable reshape+linear+reshape fusion for dynamic shapes (#121116)
reshape+linear+reshape fusion for dynamic shapes has been disabled in https://github.com/pytorch/pytorch/pull/107123.
Re-enable it by comparing the symbolic values in case of dynamic shapes. This will improve the performance for dynamic shape cases.
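
As a rough illustration, here is a minimal sketch of the reshape + linear + reshape pattern this fusion targets (names and shapes are made up; the actual fusion is part of the mkldnn fusion pass):
```
import torch

lin = torch.nn.Linear(64, 64)

@torch.compile(dynamic=True)
def f(x):                       # x: (batch, seq, 64)
    b, s, c = x.shape
    y = lin(x.reshape(b * s, c))
    return y.reshape(b, s, c)

print(f(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])
```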

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121116
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-18 14:46:27 +00:00
f2f8eeea94 Inductor: fix Conv output stride for dynamic shapes (#121400)
Fixes https://github.com/pytorch/pytorch/issues/120873.
Fixes the output stride of Conv in the case of dynamic shapes. The previous logic in inductor assumed that the output stride of Conv is always channels last while it is actually contiguous if `dynamic_shapes and is_contiguous_storage_and_layout(x)`.

### Static shape
In static shape cases, since weight is prepacked (`weight_t.is_mkldnn()` will be `true`), we'll always force output to be channels last in the Conv kernel, thus it's fine to have the assumption in Inductor that the output stride of Conv is always channels last.
96ed37ac13/aten/src/ATen/native/mkldnn/Conv.cpp (L357-L358)

### Dynamic shape
In dynamic shape cases, we don't prepack the weight for Conv; instead, the Conv kernel decides the output layout based on the input and weight layouts.
96ed37ac13/torch/_inductor/fx_passes/mkldnn_fusion.py (L1024-L1025)

For input with `channels = 1`, e.g. a tensor of size `(s0, 1, 28, 28)` and stride `(784, 784, 28, 1)`, with `req_stride_order` in channels-last order, Inductor's `require_stride_order` on `x` won't change the tensor's strides, since the stride of size-1 dimensions is ignored
96ed37ac13/torch/_inductor/ir.py (L5451)

The Conv kernel, however, considers such a tensor **contiguous** rather than channels last, so the output of the Conv kernel will be in contiguous format.
96ed37ac13/aten/src/ATen/native/ConvUtils.h (L396-L404)

To align with the behavior of the Conv kernel, we set the output_stride in such cases to be contiguous instead of channels last.
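
As a standalone illustration of the ambiguity described above (a sketch that uses a concrete batch size instead of `s0`):
```
import torch

# A (N, 1, 28, 28) tensor whose default strides match the contiguous layout.
x = torch.randn(8, 1, 28, 28)
print(x.stride())                                           # (784, 784, 28, 1)
print(x.is_contiguous())                                    # True
# The stride of the size-1 channel dimension is ignored by the layout checks,
# so the very same tensor also passes the channels-last check, which is why
# require_stride_order leaves its strides untouched.
print(x.is_contiguous(memory_format=torch.channels_last))   # True
```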

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121400
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-18 10:56:58 +00:00
206da97b8b [aot_inductor][easy] enable test_triton_kernel_multi_output_arg (#122052)
Looks like we already support aoti_torch_cuda_sort in the C shim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122052
Approved by: https://github.com/oulgen
2024-03-18 09:14:35 +00:00
65ccac6f17 Fix triton import time cycles (#122059)
Summary: `has_triton` causes some import-time cycles. Let's use `has_triton_package`, which is sufficient.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//fblearner/flow/projects/model_processing/pytorch_model_export_utils/logical_transformations/tests:filter_inference_feature_metadata_test -- --exact 'fblearner/flow/projects/model_processing/pytorch_model_export_utils/logical_transformations/tests:filter_inference_feature_metadata_test - test_collect_features_from_graph_module_nodes (fblearner.flow.projects.model_processing.pytorch_model_export_utils.logical_transformations.tests.filter_inference_feature_metadata_test.FilterInferenceFromFeatureMetadataTest)'
```
now passes

Differential Revision: D55001430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122059
Approved by: https://github.com/aakhundov
2024-03-18 05:50:32 +00:00
bc9d054260 [executorch hash update] update the pinned executorch hash (#122061)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122061
Approved by: https://github.com/pytorchbot
2024-03-18 05:02:27 +00:00
7380585d97 [vision hash update] update the pinned vision hash (#122062)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122062
Approved by: https://github.com/pytorchbot
2024-03-18 03:41:50 +00:00
e39aedfcc5 Fix fx graph triton import bug (#122041)
Summary: Unless we register triton as a special import, the FX graph import mechanism imports it as `from fx-generated._0 import triton as triton`, which is obviously broken.

Test Plan:
I could not figure out how to write a test for this but
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//tgif/lib/tests/gpu_tests:lowering_pass_test -- -r test_default_ait_lowering_multi_hardwares
```
now passes

Differential Revision: D54990782

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122041
Approved by: https://github.com/aakhundov
2024-03-17 22:48:51 +00:00
5030913d6a [test] Delete variables that have been declared but not referenced di… (#121964)
Delete variables that have been declared but not referenced in aten/src/ATen/test/cuda_distributions_test.cu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121964
Approved by: https://github.com/janeyx99
2024-03-17 09:45:05 +00:00
cyy
d9460758df [Clang-tidy header][26/N] Fix clang-tidy warnings in aten/src/ATen/core/*.h (#122015)
This PR fixes various clang-tidy warnings on aten/src/ATen/core/*.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122015
Approved by: https://github.com/ezyang
2024-03-17 07:56:45 +00:00
c568b84794 [dynamo][guards] Move backend match to eval_frame (#121954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121954
Approved by: https://github.com/jansel
2024-03-17 06:52:10 +00:00
fc504d719f [executorch hash update] update the pinned executorch hash (#122036)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122036
Approved by: https://github.com/pytorchbot
2024-03-17 04:56:37 +00:00
6f74b76072 Move get_unwrapped outside of disable_functorch (#121849)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121849
Approved by: https://github.com/albanD
2024-03-16 22:25:07 +00:00
3bd38928ba [export] Improve consistency for nn_module_stack metadata, add checks to _trace.py (#120661)
We would like to improve consistency for nn_module_stack metadata in torch.export.

This PR ensures that all tests in test/export/test_export.py has the following constraints:
- Remove nn_module_stack for all placeholder & output nodes, for all modules and submodules
- Ensure nn_module_stack is present for all other node types for the top-level module (there is still an issue with torch.cond submodules having empty fields)
- Add these checks to _export() in _trace.py (we would add this in the Verifier, but downstream apps construct ExportedPrograms separate from _export(), and metadata may not be maintained there)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120661
Approved by: https://github.com/avikchaudhuri
2024-03-16 21:44:52 +00:00
6d9588a12b [inductor] disable linear weight prepacking pass on double (#121478)
Fix #121175

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121478
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-03-16 13:24:21 +00:00
9990d1bc22 Add 'profiler/python' to the package. (#121892)
Fixes #ISSUE_NUMBER
expose the `py_symbolize` interface for use.
thank you
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121892
Approved by: https://github.com/zdevito
2024-03-16 11:11:26 +00:00
5f601a41e0 Pin protobuf to 3.20.2 on macOS (#121918)
The newer protobuf 5.26.0, released on March 13th, is causing failures in `test_hparams_*` from `test_tensorboard`, where the stringified metadata is wrong when escaping double quotes. For example, 3bc2bb6781.  This looks like an upstream issue from Tensorboard, which doesn't work with this brand new protobuf version https://github.com/tensorflow/tensorboard/blob/master/tensorboard/pip_package/requirements.txt#L29

The package has been pinned on Docker https://github.com/pytorch/pytorch/blob/main/.ci/docker/requirements-ci.txt#L155, so it should be pinned on macOS too.  We want to eventually just have one requirements.txt file.

Fixes https://github.com/pytorch/pytorch/issues/122008
Fixes https://github.com/pytorch/pytorch/issues/121927
Fixes https://github.com/pytorch/pytorch/issues/121946
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121918
Approved by: https://github.com/kit1980
2024-03-16 09:48:05 +00:00
4d9d5fe540 [executorch hash update] update the pinned executorch hash (#122009)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122009
Approved by: https://github.com/pytorchbot
2024-03-16 04:46:45 +00:00
4d92928fe2 [dynamo] Add tests for fake FSDP (#121610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121610
Approved by: https://github.com/yanboliang
ghstack dependencies: #121735, #120965
2024-03-16 04:29:59 +00:00
0b7d9711d4 [dynamo] Add support for nn.Parameter constructor (part 2) (#120965)
This handles the case where the tensor isn't an input.

The changes to dynamo tests are cases where we would previously fall back to eager.
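
A rough sketch of the newly handled case (illustrative only, not taken from the PR's tests), where the tensor fed to `nn.Parameter` is created inside the compiled region:
```
import torch

@torch.compile(backend="eager")
def make_param(x):
    w = torch.ones_like(x) * 2        # intermediate tensor, not a graph input
    p = torch.nn.Parameter(w)         # constructing a Parameter inside the graph
    return p * x

print(make_param(torch.randn(3)))
```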

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120965
Approved by: https://github.com/yanboliang
ghstack dependencies: #121735
2024-03-16 04:29:58 +00:00
040b925753 [Compiled Autograd] Reorder accumulate grad nodes (#121735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121735
Approved by: https://github.com/xmfan
2024-03-16 04:29:56 +00:00
f0b9a8344a [vision hash update] update the pinned vision hash (#121177)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121177
Approved by: https://github.com/pytorchbot
2024-03-16 03:25:08 +00:00
b94691700e [FSDP] Avoided CPU sync in clip_grad_norm_ (#122001)
Copying a scalar 0 tensor on CPU to GPU or constructing a scalar 0 tensor on GPU requires a CPU sync with the GPU. Let us avoid doing ops that involve it.

`FSDP.clip_grad_norm_` already first checks if all parameters are not sharded and calls into `nn.utils.clip_grad_norm_`, so at the point of the code changes, there is guaranteed to be some sharded parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122001
Approved by: https://github.com/wanchaol
2024-03-16 03:01:49 +00:00
7bc91d5dc2 [mergebot][BE] If we don't have any required checks, don't run required checks (#121921)
This PR addresses the issue identified in #121920. The existing problem is that all tests are deemed mandatory if none are selected as required. This behavior is particularly noticeable during a force merge operation.

In the context of a force merge, it may not be necessary to execute any tests which are not required (imo). However, this proposed change could be seen as controversial, hence it has been separated from the main update for further discussion and review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121921
Approved by: https://github.com/huydhn
ghstack dependencies: #121920
2024-03-16 01:35:21 +00:00
2b71b21a3f Don't use Proxy torch function in the sym size calls (#121981)
Fixes #ISSUE_NUMBER

Changes from https://github.com/pytorch/pytorch/pull/121938 + adds test

@bypass-github-pytorch-ci-checks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121981
Approved by: https://github.com/davidberard98
2024-03-16 01:20:26 +00:00
37e563276b Document complex optimizer semantic behavior (#121667)
<img width="817" alt="image" src="https://github.com/pytorch/pytorch/assets/31798555/565b389d-3e86-4767-9fcb-fe075b50aefe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121667
Approved by: https://github.com/albanD
2024-03-16 00:43:47 +00:00
12662900f9 [inductor] FX graph cache: Fix bug handling constants (#121925)
Summary: During key calculation for FX graph caching: Rather than specialize on "small" vs. "large" tensor constants (i.e., inlined vs. not inlined), always hash on the tensor value. Doing so avoids the complication of trying to later attach the constant values as attributes to an already-compiled module. Instead, different constants will cause an FX graph cache miss and we'll just compile.

Test Plan: New unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121925
Approved by: https://github.com/eellison
2024-03-16 00:11:51 +00:00
cyy
6b0f61891f [Clang-tidy header][25/N] Fix clang-tidy warnings and enable clang-tidy on c10/cuda/*.{cpp,h} (#121952)
This PR enables clang-tidy to code in c10/cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121952
Approved by: https://github.com/Skylion007
2024-03-16 00:09:54 +00:00
0cc60a05da Revert "Fix torch.clamp in MPS to handle NaN correctly (#121381)"
This reverts commit ca80d07ac71c1bfc9b13c3281a713fed89f15e0f.

Reverted https://github.com/pytorch/pytorch/pull/121381 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think its test is failing in trunk https://github.com/pytorch/pytorch/actions/runs/8302739752/job/22725865151#step:7:644, we should have ciflow/mps to run the test on PR.  Please take a look a reland the change ([comment](https://github.com/pytorch/pytorch/pull/121381#issuecomment-2000685856))
2024-03-15 23:53:05 +00:00
07ec3356b9 Revert "Force upsample to be float32 (#121324)"
This reverts commit 2770e3addd9f05101705f0fef85a163e0034b8a5.

Reverted https://github.com/pytorch/pytorch/pull/121324 on behalf of https://github.com/huydhn due to I think it is better to revert and reland this next week 2770e3addd ([comment](https://github.com/pytorch/pytorch/pull/121324#issuecomment-2000617536))
2024-03-15 23:20:01 +00:00
256c0ec1e5 [docs] Added comment on replicate -> partial for _NormPartial (#121976)
Add a version of https://github.com/pytorch/pytorch/pull/121945#discussion_r1525697167 as a comment in the code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121976
Approved by: https://github.com/wanchaol
ghstack dependencies: #121747, #121869, #121945
2024-03-15 23:04:06 +00:00
b717aa6f36 Revert "[BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)"
This reverts commit 2c33e3a372c077badc561b4aad4997e52c03610a.

Reverted https://github.com/pytorch/pytorch/pull/121930 on behalf of https://github.com/huydhn due to I am seeing lots of inductor jobs failing after this change 2c33e3a372.  They looks unrelated though but this change updates Docker image so may be something sneaks in.  I will try to revert this to see if it helps and will reland the change after ([comment](https://github.com/pytorch/pytorch/pull/121930#issuecomment-2000547641))
2024-03-15 22:05:21 +00:00
ca80d07ac7 Fix torch.clamp in MPS to handle NaN correctly (#121381)
Fixes #120899

So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri
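
For reference, a small sketch of the CPU-matching semantics this change brings to MPS:
```
import torch

x = torch.tensor([float("nan"), -2.0, 5.0])
# NaN should propagate through clamp rather than being clamped into [0, 1].
print(torch.clamp(x, min=0.0, max=1.0))   # tensor([nan, 0., 1.])
# With this fix, the same is expected for x.to("mps") on Apple-silicon machines.
```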

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381
Approved by: https://github.com/malfet
2024-03-15 21:54:50 +00:00
26aaabb979 [c10d] initialize lastEnqueuedSeq_ and lastCompletedSeq_ (#121980)
Summary:
We found that these 2 uninitialized numbers were logged with some very
large or negative values, which is confusing, so we need to initialize
them. Now -1 indicates the number is invalid or that no work has been completed or
enqueued yet; 0 could be a legit seq id.
Test Plan:
Build

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121980
Approved by: https://github.com/xw285cornell, https://github.com/wconstab, https://github.com/kwen2501, https://github.com/XilunWu
2024-03-15 21:45:15 +00:00
dfc5e9325d format caffe2/torch/_export/serde/serialize.py (#121670)
Summary: black caffe2/torch/_export/serde/serialize.py

Test Plan: tests

Differential Revision: D54654847

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121670
Approved by: https://github.com/angelayi
2024-03-15 21:30:16 +00:00
53d2188df9 Update get_aten_graph_module (#121937)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121937
Approved by: https://github.com/andrewor14
2024-03-15 20:35:55 +00:00
af86d67d61 [Doc][NVTX] Add documentation for nvtx.range (#121699)
The context manager `torch.cuda.nvtx.range` has been around for about 4 years (see #42925). Unfortunately, it was never documented and as a consequence users are just unaware of it (see #121663).
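
A minimal usage sketch of the context manager being documented (assuming a CUDA device and an external profiler such as Nsight Systems to actually display the range):
```
import torch

x = torch.randn(1024, 1024, device="cuda")
with torch.cuda.nvtx.range("my_matmul"):   # pushes/pops an NVTX range
    y = x @ x
torch.cuda.synchronize()
```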

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121699
Approved by: https://github.com/janeyx99
2024-03-15 20:26:44 +00:00
b92daff6e9 [DTensor] Enable ASGD foreach optimizer and add the associated unit test (#121942)
Enable ASGD foreach optimizer and add DTensor optimizer unit test for ASGD.

Note that we need to investigate why, when using ASGD, we need higher atol and rtol when comparing model parameters. Listing it as a TODO for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121942
Approved by: https://github.com/wanchaol
2024-03-15 20:21:27 +00:00
f4dd2fda51 [DTensor] Supported 2D clip_grad_norm_ (#121945)
This PR adds support for 2D `clip_grad_norm_` (`foreach=True`).
- This PR changes `OpSchema.args_spec` to use pytree if the runtime schema info specifies it.
- This PR includes a unit test for 2D FSDP2 + SP with `clip_grad_norm_` enabled, which serves as a complete numerics test for 2D.

Note: With this PR patched, 2-way SP + 4-way FSDP matches 8-way FSDP numerics on Llama-7B (doubling local batch size for the 2-way SP run).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121945
Approved by: https://github.com/wanchaol
ghstack dependencies: #121747, #121869
2024-03-15 20:11:24 +00:00
2c33e3a372 [BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)
Introduce changes related to enable ARC to run on build for linux-jammy-py3.8-gcc11

Depends on:
* https://github.com/pytorch/pytorch/pull/121908
* https://github.com/pytorch/pytorch/pull/121907
* Force docker to update credentials: https://github.com/pytorch/test-infra/pull/4991
* Add permissions to role to access ECR: acc0154aa0
* Add permissions to the role to access relevant S3 bucket: 496b0422c3

## Reasoning for introducing a new `_linux-build-rg.yml`

Old-style `runs-on` definitions accept a string, while the new style requires an object in the format:

```
--- old
...
  runs-on: "linux.2xlarge"
...
--- new
...
  runs-on:
    group: "running-group"
...
```

In other words, specifying a group requires changing the format of the YAML. Unfortunately, there is no way to accomplish this with any trick in the book that I am aware of, because GitHub Actions YAML is not templatable and supports only minimal functions / replacements. A few examples that did not work:
* [`e234f25` (#119544)](e234f25ba1 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`087de4a` (#119544)](087de4ad8b (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`f03512e` (#119544)](f03512e344 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`67581fb` (#119544)](67581fb737 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121930
Approved by: https://github.com/seemethere
2024-03-15 20:09:50 +00:00
6f4fa8e9a1 [inductor] FX graph cache: simplify "current callable" logic (#121903)
Summary: The handling of the current_callable and compiled_artifact fields in the CompiledFxGraph object is unnecessarily complicated and confusing. We can simplify by storing only the callable. That field is not serializable, so the caching approach is to store a path to the generated artifact and reload from disk on a cache hit. We can just reload inline in the FX cache hit path. This change has the added benefit that it makes it easier to fallback to a "cache miss" if the path somehow doesn't exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121903
Approved by: https://github.com/eellison
2024-03-15 20:00:08 +00:00
d0d09f5977 Fix torch.compile links (#121824)
Fixes https://github.com/pytorch/pytorch.github.io/issues/1567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121824
Approved by: https://github.com/svekars, https://github.com/peterbell10, https://github.com/malfet
ghstack dependencies: #121823
2024-03-15 19:49:37 +00:00
8a5a377190 Move doc links to point to main (#121823)
The previous links were pointing to an outdated branch

Command: `find . -type f -exec sed -i "s:docs/main:docs/master:g" {} + `

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121823
Approved by: https://github.com/albanD, https://github.com/malfet
2024-03-15 19:49:37 +00:00
535bc71d03 Enable FX graph caching in another batch of inductor tests (#121697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121697
Approved by: https://github.com/eellison
2024-03-15 19:38:51 +00:00
3ee319c49c Fall back to eager mode when viewing with differing bitwidths (#120998) (#121786)
The inductor lowering code for viewing a tensor as a type with a different bitwidth currently doesn't generate valid triton code. This change compares the source and destination dtypes and, if their sizes differ, falls back to the eager-mode aten implementation. Prior to this change, this condition would throw an exception.

Fixes #120998.
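
A hedged repro sketch of the pattern that now falls back to eager instead of raising (a bit-width-changing `Tensor.view(dtype)` inside a compiled function):
```
import torch

@torch.compile
def reinterpret(x):
    return x.view(torch.int8)          # float32 -> int8: different bitwidths

out = reinterpret(torch.randn(4))
print(out.shape)                        # torch.Size([16]): 4 floats -> 16 bytes
```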

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121786
Approved by: https://github.com/peterbell10, https://github.com/bertmaher
2024-03-15 19:33:30 +00:00
409b1a6081 Add lowering for cummax, cummin (#120429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120429
Approved by: https://github.com/peterbell10
2024-03-15 19:04:38 +00:00
d04faf4531 [dynamo][compile-time] Remove preserve rng state per op (#121923)
We already have one globally - 02bb2180f4/torch/_dynamo/convert_frame.py (L477)

I don't think we need per op.

Saves ~2 seconds on this benchmark

~~~
def fn(x):
    for _ in range(10000):
        x = torch.ops.aten.sin(x)
    return x
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121923
Approved by: https://github.com/jansel
2024-03-15 18:24:46 +00:00
67ec870234 Fix FakeTensorUpdater logic for updating fake tensors (#116168)
Fixes https://github.com/pytorch/pytorch/issues/114464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116168
Approved by: https://github.com/peterbell10
2024-03-15 18:22:24 +00:00
239d87af5e combine loops so fn_name correct in error message (#121601)
The error message shown when input aliasing is detected in `while_loop_func` may not have the correct `fn_name`, as it is set only in the previous for loop. This change merges the two loops so that `fn_name` has the correct value.

No Issue Number for this minor change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121601
Approved by: https://github.com/albanD
2024-03-15 17:14:56 +00:00
39fdde7f84 [release] Increase version 2.3.0->2.4.0 (#121974)
Branch cut for 2.3.0 completed hence advance main version to 2.4.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121974
Approved by: https://github.com/jeanschmidt
2024-03-15 17:09:33 +00:00
565d1e28ab update kineto submodule commit id (#121843)
Summary: Update kineto submodule commit id so that pytorch profiler can pick up kineto changes from https://github.com/pytorch/kineto/pull/880

Test Plan: CI

Differential Revision: D54828357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121843
Approved by: https://github.com/aaronenyeshi
2024-03-15 16:55:25 +00:00
3c3d7455a3 Disable inductor (default) and inductor (dynamic) by default in the perf run launcher (#121914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121914
Approved by: https://github.com/desertfire
2024-03-15 16:46:24 +00:00
ef25d83a62 [export] Add serialization support for tokens (#121552)
Differential Revision: [D54906766](https://our.internmc.facebook.com/intern/diff/D54906766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121552
Approved by: https://github.com/zhxchen17
2024-03-15 16:15:11 +00:00
014f91a9d9 [FSDP2] implement HSDP (#121569)
support HSDP in per-parameter sharding FSDP: https://github.com/pytorch/pytorch/issues/121023

HSDP is a hybrid of FSDP and DDP: reduce-scatter grads intra-node (FSDP), and all-reduce grads inter-node (DDP)

for unit test, we are testing 2 + 2 GPUs in single node: ``pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp``

allreduce overlaps with next reduce-scatter in profiler traces
<img width="886" alt="Screenshot 2024-03-14 at 3 02 52 PM" src="https://github.com/pytorch/pytorch/assets/134637289/98f1f2b5-c99d-4744-9938-10d0431487e5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121569
Approved by: https://github.com/awgu
2024-03-15 10:00:18 +00:00
4cbf963894 [BE] Migrate pull.yml to use S3 pytorch-ci-artifacts bucket for linux-jammy-py3_8-gcc11 and docs builds/tests (#121908)
Switch to use LF S3 bucket for pull on linux-jammy-py3_9-gcc and docs jobs. This is required to migrate to ARC and move to use LF resources.

Depends on https://github.com/pytorch/pytorch/pull/121907
Follow up issue https://github.com/pytorch/pytorch/issues/121919
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121908
Approved by: https://github.com/malfet
2024-03-15 09:09:53 +00:00
2770e3addd Force upsample to be float32 (#121324)
Fixes #121072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121324
Approved by: https://github.com/ezyang
2024-03-15 07:50:45 +00:00
e25054b248 [compiled autograd] free stack objects before calling compiled graph (#121707)
Moved the compilation code into _compiled_autograd_impl, which frees stack-allocated objects (e.g. AutogradCompilerCall) before the compiled graph is called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121707
Approved by: https://github.com/jansel
2024-03-15 07:12:38 +00:00
5a2b4fc8f0 [dynamo] Convert invalid args into graph breaks (#121784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121784
Approved by: https://github.com/yanboliang
2024-03-15 06:51:27 +00:00
fc33bbf827 better support set_default_dtype(torch.float16), update doc (#121730)
1. Fixes #121300
2. Previously, calling `torch.tensor([2j])` after `torch.set_default_dtype(torch.float16)` would cause a runtime error. This PR also fixes that and enables the test (see the sketch below).
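
A minimal sketch of the scenario from point 2 (the exact resulting dtype is an assumption on my part):
```
import torch

torch.set_default_dtype(torch.float16)
t = torch.tensor([2j])                  # previously raised a runtime error
print(t.dtype)                          # presumably the half-precision complex dtype
torch.set_default_dtype(torch.float32)  # restore the usual default
```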

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121730
Approved by: https://github.com/peterbell10
2024-03-15 06:48:42 +00:00
8fdd8125b6 [executorch hash update] update the pinned executorch hash (#121871)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121871
Approved by: https://github.com/pytorchbot
2024-03-15 05:25:36 +00:00
cyy
fb10e13000 [Clang-tidy header][24/N] Fix clang-tidy warnings on c10/cuda/*.{cpp,h} (#120781)
This PR begins to clean clang-tidy warnings of code in c10/cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120781
Approved by: https://github.com/ezyang
2024-03-15 05:03:22 +00:00
e4fda049c2 DTensor: add comm tests to test_tp_examples (#121669)
This adds some basic comm tests to test_tp_examples. This validates that the expected distributed calls are being made for `test_transformer_training`.

Fixes #121649

Test plan:

```
pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121669
Approved by: https://github.com/wanchaol
2024-03-15 03:37:48 +00:00
02083f5452 [DCP][DSD] Add AdamW to distributed state dict unit tests (#121774)
Thanks @fegin for removing the fsdp root module check in DCP to unblock test updates. https://github.com/pytorch/pytorch/pull/121544

This PR adds "optimzer_class" as a kwarg for the subtests of the following tests to add AdamW as an option.

- test_fsdp
- test_compiled_fsdp
- test_fsdp2
- test_ddp
- test_fsdp_ddp
- test_cpu_offload_full_state_dict

In addition, we temporarily remove the two _verify_osd_by_load calls in _test_save_load, as state dict loading seems to affect parameters. Creating an issue https://github.com/pytorch/pytorch/issues/121186 to keep track.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121774
Approved by: https://github.com/Skylion007
ghstack dependencies: #121773
2024-03-15 03:33:33 +00:00
efbeefbb84 [executorch] Make trymerge force merges actually work with executorch (#121920)
This PR addresses an issue with the trymerge function for executorch, which currently uses Facebook CLA instead of Easy CLA. This bug has been patched in #121921. However, the patch is potentially controversial, and we still want to verify Facebook CLA if it exists. Therefore, this PR includes Facebook CLA in our set of mandatory checks.

Additionally, this PR removes Facebook CLA from one of the mocks. This change is necessary because the specific PR used for testing fails due to the presence of Facebook CLA in the mock.

## Testing:
We run `find_matching_merge_rule(pr = GitHubPR("pytorch", "executorch", 2326), skip_mandatory_checks=True, skip_internal_checks=True)` to check if things work

https://pastebin.com/HHSFp2Gw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121920
Approved by: https://github.com/huydhn
2024-03-15 03:21:44 +00:00
a623666066 [dynamo][compile-time] Make output_graph new_var linear (#121858)
Fixes https://github.com/pytorch/pytorch/issues/121679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121858
Approved by: https://github.com/jansel
2024-03-15 03:20:04 +00:00
3bc2bb6781 use two pass reduction for deterministic reduction order (#115620)
## Motivation
Address the [non-deterministic reduction order](https://github.com/pytorch/pytorch/issues/93542#issuecomment-1411294181) issue for `omp parallel reduction`.

## Latest update on 1.15:
55d81901bc.
Do not reduce into the array inside the loop. Instead, reduce into a local scalar and write it to the array after the local reduction is done. This allows the compiler to keep the reduction variable in a register instead of reading/writing memory. If the working set of the loop body is large, the gap between register and memory accesses is significant.
```
vaddss (%xmm0, %xmm11, %xmm11) -> accumulate in register %xmm0
vaddssl ((%rdx, %rdi, 4), %xmm0, %xmm0) -> accumulate in memory address (%rdx, %rdi, 4)
```
Examples code:
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    #pragma omp for
    for(...){
        ....
        tmp0_acc_arr[tid] = tmp0_acc_arr[tid] + tmp_x;  // array accesses always go through memory
    }
}
```
will be changed to
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    **auto tmp0_acc_local = 0;**
    #pragma omp for
    for(...){
        ....
        **tmp0_acc_local**  = tmp0_acc_local + tmp_x;
    }
    **tmp0_acc_arr[tid] = tmp0_acc_local;**
}
```

## Descriptions
Following aten to use `two pass reduction` with `omp parallel` for deterministic reduction order.
9c3ae37fc4/aten/src/ATen/Parallel-inl.h (L39)
9c3ae37fc4/aten/src/ATen/native/TensorIteratorReduce.cpp (L24)
```
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            // init reduction buffer per thread
            float tmp_acc0_arr[64];
            at::vec::Vectorized<float> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_arr[tid] = 0;
                tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(3964928L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0));
                    auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0));
                    auto tmp2 = tmp0 - tmp1;
                    auto tmp3 = tmp2 * tmp2;
                    // reduce to per thread buffers
                    tmp_acc0_vec_arr[tid] = tmp_acc0_vec_arr[tid] + tmp3;
                }
            }
            // second pass reduce
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 + tmp_acc0_arr[tid];
                tmp_acc0_vec = tmp_acc0_vec + tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec);
            out_ptr0[static_cast<long>(0L)] = static_cast<float>(tmp_acc0);
```

## Test results
I tested this PR with the dynamo benchmarks on a 32-core ICX system.
Result (avg speedup):
| suite |  before this PR   | after this PR  |
| ---- |  ----  | ----  |
| torchbench | 1.303  | 1.301 |
| huggingface | 1.346  | 1.343 |
| timm | 1.971 | 1.970 |

```
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

multi_threads_test() {
    CORES=$(lscpu | grep Core | awk '{print $4}')
    export OMP_NUM_THREADS=$CORES
    end_core=$(expr $CORES - 1)
    numactl -C 0-${end_core} --membind=0 python benchmarks/dynamo/${SUITE}.py --${SCENARIO} --${DT} -dcpu -n50 --no-skip --dashboard --only "${MODEL}" ${Channels_extra} ${BS_extra} ${Shape_extra} ${Mode_extra} ${Wrapper_extra} ${Flag_extra} --timeout 9000 --backend=inductor --output=${LOG_BASE}/${SUITE}.csv
}

SCENARIO=performance
DT=float32
export TORCHINDUCTOR_FREEZING=1
Flag_extra="--freezing"
Mode_extra="--inference"

for suite in timm_models huggingface torchbench
do
  export SUITE=$suite
  echo $SUITE
  export LOG_BASE=`date +%m%d%H%M%S`
  mkdir $LOG_BASE
  multi_threads_test
done
```
System info
```
ubuntu@ip-172-31-18-205:~/hz/pytorch$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            6
    BogoMIPS:            5800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic mo
                         vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xs
                         aveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   1.5 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    40 MiB (32 instances)
  L3:                    54 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-63
Vulnerabilities:
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT Host state unknown
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115620
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-15 02:03:10 +00:00
0cd094a4fd Revert "[aoti] Fix compilation bug for buffer mutations (#121688)"
This reverts commit 9f314d4aa82169ee552ae2a8ad701bd0441a12b7.

Reverted https://github.com/pytorch/pytorch/pull/121688 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/121688#issuecomment-1998740094))
2024-03-15 01:34:04 +00:00
01d7c948e2 Make torch/_inductor/comms.py recognize native funcol IRs as collective IRs (#118498)
### Summary

As title. After this PR, Inductor should recognize native funcol IRs as collectives wherever the existing funcol IRs are recognized as collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118498
Approved by: https://github.com/wanchaol
2024-03-15 01:24:36 +00:00
60ccf81490 [dynamo] Refactor update_block_stack into a seperate function (#121810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121810
Approved by: https://github.com/williamwen42
ghstack dependencies: #121790
2024-03-15 01:01:05 +00:00
1e9a7df8fe [dynamo] Compile time optimizations in tx.step() (#121790)
`python benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
- Before: `symbolic_convert_overhead_stress_test: 10.7s`
- After: `symbolic_convert_overhead_stress_test: 8.6s`

`tx.step()` is a small part of that benchmark, so likely the speedup in that isolated function is larger than the top line.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121790
Approved by: https://github.com/oulgen
2024-03-15 01:01:05 +00:00
1afa8e0985 Fix #83153: torch.nn.hardtanh allowed min_val to be greater than max_val (#121627)
Fixes #83153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121627
Approved by: https://github.com/albanD
2024-03-15 00:57:45 +00:00
710446b1eb [dtensor] refactor and generalize stack strategy (#121869)
This PR rewrites the stack strategy to be more generalized. Basically, the
follow pattern for stack/cat-like strategies needs to be smarter, i.e. it
should be able to identify:
1. PR, PP, RP -> follow PP
2. RR, SR, RS -> follow SS

So this PR refactors how the follow strategy works, and makes sure
we start following the strategy that incurs the lowest cost, i.e. for
multiple PR, RP placements, we should be able to further delay the
pending sum reductions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121869
Approved by: https://github.com/awgu
2024-03-15 00:34:25 +00:00
92ed8553a6 Revert "Switch cudagraph backend to cudagraph trees (#121019)" and "Add Cudagraphs disable checking (#121018)" (#121864)
This reverts commit 9373ad0bb87b364375a468c296d2daef0e8817d7.

Revert "Add Cudagraphs disable checking (#121018)"

This reverts commit 4af0e634bf02309583dfe3b5c3421442fda5ec7e.

Causes compilation time increase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121864
Approved by: https://github.com/eellison
2024-03-15 00:03:09 +00:00
d604ab81a2 [PyTorch] Fix static runtime sigrid_hash precomputed multiplier pass (#120851)
This pass was broken.

Differential Revision: [D54336561](https://our.internmc.facebook.com/intern/diff/D54336561/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D54336561/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120851
Approved by: https://github.com/houseroad
2024-03-15 00:02:38 +00:00
cceabe873f [jit] ClassType hashing: hash on compilation_unit as well (#121928)
Following up on #121874 - it turns out that in our case, we're seeing repeated class names that are from different compilation units.  Our previous hash function wasn't considering the compilation unit, leading to hash collisions (and then exponential memory usage in the number of copies of this class name)

Differential Revision: [D54916455](https://our.internmc.facebook.com/intern/diff/D54916455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121928
Approved by: https://github.com/eellison
ghstack dependencies: #121874
2024-03-14 23:16:08 +00:00
2d9cee20a2 [jit] AliasDB type hash - don't always return 0 (#121874)
This hash was missing an assignment, so for almost all types it was returning "0".

c10::flat_hash_map turns out to have really bad behavior with a terrible hash like this, nearly exponential in memory usage.

Differential Revision: [D54916424](https://our.internmc.facebook.com/intern/diff/D54916424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121874
Approved by: https://github.com/eellison
2024-03-14 23:16:08 +00:00
57b20c51b9 Don't record autograd state ops while torch.compile in pre-dispatch export (#121736)
Summary: Refer to OSS PR for details

Test Plan: CI

Differential Revision: D54812833

In pre-dispatch export, we have a special proxy torch mode where we intercept the torch._C._set_grad_enabled op to correctly capture the user's intention on train/eval. However, this is a bit problematic when we trace torch.cond during export, as it calls torch.compile internally. As a result, we end up capturing unwanted autograd context manager calls that happen inside dynamo framework code because the top-level tracer is still active. We fix it by turning off this proxy torch mode. We can still capture autograd ops inside cond branches because dynamo will translate them into a HOP for us, so we don't have to intercept with the special proxy mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121736
Approved by: https://github.com/anijain2305, https://github.com/ydwu4
2024-03-14 23:06:10 +00:00
bd7beef529 [Inductor] Update the cpp_wrapper entry function signature (#121745)
Summary: Update the entry function to use AtenTensorHandle instead of at::Tensor. This makes the compilation of the generated cpp wrapper code much faster: test_cpu_cpp_wrapper.py from 35 min to 21 min, and test_cuda_cpp_wrapper.py from 21 min to 14 min.

Differential Revision: [D54818715](https://our.internmc.facebook.com/intern/diff/D54818715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121745
Approved by: https://github.com/chenyang78, https://github.com/jansel
ghstack dependencies: #121523, #121743, #121744
2024-03-14 22:23:00 +00:00
8be80706b4 [AOTI] Add pybind for tensor_converter util functions (#121744)
Differential Revision: [D54818716](https://our.internmc.facebook.com/intern/diff/D54818716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121744
Approved by: https://github.com/chenyang78
ghstack dependencies: #121523, #121743
2024-03-14 22:20:51 +00:00
46493ee9b5 [AOTI][refactor] Update tensor_converter util functions (#121743)
Summary: Update the signatures of unsafe_alloc_new_handles_from_tensors and alloc_tensors_by_stealing_from_handles. This is a preparation step towards adding pybind for these two functions, which will be used by cpp_wrapper JIT Inductor.

Differential Revision: [D54818717](https://our.internmc.facebook.com/intern/diff/D54818717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121743
Approved by: https://github.com/chenyang78
ghstack dependencies: #121523
2024-03-14 22:17:54 +00:00
3df1b3b0ad [jit] support getattr/hasattr on NamedTuple (#121863)
getattr is already supported on objects and, for the most part, on NamedTuples as well. The only remaining gap seems to be that hasattr only accepted objects, not NamedTuples. This PR adds support and some basic tests.
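
A small sketch of the newly accepted pattern (illustrative only, not the PR's actual test):
```
from typing import NamedTuple

import torch

class Point(NamedTuple):
    x: float
    y: float

@torch.jit.script
def read_x(p: Point) -> float:
    if hasattr(p, "x"):                 # hasattr previously only accepted objects
        return p.x
    return 0.0

print(read_x(Point(1.0, 2.0)))          # 1.0
```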

Differential Revision: [D54888612](https://our.internmc.facebook.com/intern/diff/D54888612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121863
Approved by: https://github.com/eellison
2024-03-14 22:07:28 +00:00
818b14025a [AOTI][refactor] Remove is_legacy_abi_kernel and abi_compatible_kernel (#121523)
Summary: is_legacy_abi_kernel was used for _scaled_dot_product_flash_attention fallback. It is only needed for C shim kernel name matching now, and the name matching is done with a direct string comparison. Also consolidate the fallback cpp kernel naming logic in CppWrapperCpu.

Differential Revision: [D54727789](https://our.internmc.facebook.com/intern/diff/D54727789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121523
Approved by: https://github.com/chenyang78
2024-03-14 22:05:38 +00:00
43e243180b Add gpt-fast as a static benchmark (#121886)
Run:
```
python benchmarks/gpt_fast/benchmark.py
```
It generated a cvs file ```gpt_fast_benchmark.csv``` with the content like:
```
name,mode,target,actual,percentage
Llama-2-7b-chat-hf,bfloat16,104,103.458618,99.48%
Llama-2-7b-chat-hf,int8,155,158.964615,102.56%
Mixtral-8x7B-v0.1,int8,97,99.760132,102.85%
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121886
Approved by: https://github.com/Chillee
2024-03-14 21:46:59 +00:00
0e68eb1505 Add privateuseone flags for c10::EventFlag (#121118)
Fixes #117341
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121118
Approved by: https://github.com/albanD
2024-03-14 20:07:12 +00:00
9f314d4aa8 [aoti] Fix compilation bug for buffer mutations (#121688)
I realized there's a bug when unlifting buffer mutations in AOTI.
However, there seems to be a bug during tracing where AOTI mutates the buffer. I didn't take the time to investigate, so I left it as a TODO for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121688
Approved by: https://github.com/chenyang78
2024-03-14 19:35:26 +00:00
0636c11811 [AOTInductor] Include build cmds at the end of wrapper file (#121872)
Summary:
For easier debugging, include build commands at the end of codegen wrapper.

{F1468438991}

Test Plan: CI

Differential Revision: D54882164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121872
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-03-14 18:41:17 +00:00
c409292197 [sigmoid] Use deserializer from oss. (#121839)
Summary:
Old path:
thrift -> thrift deserializer -> graph module.
new path:
thrift -> python dataclass -> oss deserializer -> graph_module

Test Plan:
CI
buck2 test mode/dev-nosan caffe2/test/inductor/fb:test_aot_inductor_pt2_inference

Reviewed By: SherlockNoMad

Differential Revision: D54855251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121839
Approved by: https://github.com/angelayi
2024-03-14 18:38:58 +00:00
499136a4dd [Inductor] Fix a dynamic shape problem when lowering diagonal (#121881)
Summary: When computing the diagonal size, we need to use the correct symbolic min/max functions.

Differential Revision: [D54884899](https://our.internmc.facebook.com/intern/diff/D54884899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121881
Approved by: https://github.com/aakhundov
2024-03-14 18:36:37 +00:00
5b1642516f [with_effects] Skip over profiler.record_function_exit (#121829)
Summary:
tldr: User calls to `torch.autograd.profiler.record_function` fail when tracing with non-strict pre-dispatch export due to an effect token failure, so the solution is to skip over these operators 😅

Some user code contains calls to a `torch.autograd.profiler.record_function` context, like https://fburl.com/code/uesgknbq and https://fburl.com/code/iogbnsfw, which is used for adding user-defined events into the profiler.

Currently these function calls will be skipped/removed in dynamo (https://fburl.com/code/fkf7qmai) but **non-strict pre-dispatch export** will hit these operators during tracing. However, it seems that although these operators get hit by the dispatcher, they don't actually show up in the final graph (maybe they get DCE-d).

However, an issue comes up with a recent change with effect tokens (D54639390) which creates tokens if it sees a ScriptObject during tracing. The operator `torch.ops.profiler.record_function_exit` takes in a ScriptObject, so the effect tokens framework now tries to add an effect token to this operator, but results in the following error: (https://www.internalfb.com/intern/everpaste/?handle=GI-hvBknzj2ZxYkBABNzdztDxJVAbsIXAAAB, P1195258619)

The reason is that this operator only gets hit during pre-dispatch, not post-dispatch tracing. During pre-dispatch tracing, we first trace using post-dispatch to collect metadata needed for functionalization, and then we do pre-dispatch tracing to construct the graph. The metadata collection phase is also when we determine which operators need effect tokens and create those tokens. However, since the operator only shows up in pre-dispatch tracing, we do not create any tokens. During the actual pre-dispatch tracing to create the graph, we then run into this operator and try to get a token, but none exists, causing an error :(

This PR just blocks the record_function operator from being looked at by the effect tokens framework. But a proper fix might be to have functionalization run on the pre-dispatch graph or have the operator also show up in the post-dispatch graph. But since in the PT2 stack dynamo just gets rid of this operator so that it won't show up anywhere downstream, I think we can also just ignore this operator.
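
For context, a sketch of the kind of user code involved (the export call and the `strict=False` flag are illustrative assumptions):
```
import torch

class M(torch.nn.Module):
    def forward(self, x):
        with torch.autograd.profiler.record_function("my_block"):
            return x.sin() + 1

# Non-strict export hits the record_function enter/exit ops during tracing;
# with this change the effect-token logic simply skips them.
ep = torch.export.export(M(), (torch.randn(4),), strict=False)
print(ep)
```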

Test Plan: Fixed test for P1195258619

Differential Revision: D54857444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121829
Approved by: https://github.com/BoyuanFeng, https://github.com/tugsbayasgalan
2024-03-14 18:09:43 +00:00
f1f7c5c31e [ez] Document for add_var_to_val (#121850)
Summary: Add doc for ShapeEnv.add_var_to_val

Test Plan: doc only change

Reviewed By: izaitsevfb

Differential Revision: D54872335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121850
Approved by: https://github.com/izaitsevfb
2024-03-14 18:01:09 +00:00
4c3a052acf [BE] Add S3 bucket argument to number of workflows (#121907)
Namely, it adds the `s3-bucket` argument to the following workflows (with the default value set to `gha-artifacts`):
- _docs
- _linux-test workflows
- download-build-artifacts
- pytest-cache-download
- upload-test-artifacts

This prerequisite is required in order to start migrating to other S3 buckets for asset storage; it is one of the required steps to migrate to ARC and move our assets away from our S3 to the Linux Foundation S3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121907
Approved by: https://github.com/malfet
2024-03-14 17:57:05 +00:00
38d7d366b9 [FSDP2] Added 2D DCP save/load test (#121747)
To prepare for FSDP2 + TP/SP in torchtrain, we should verify that we can resume training correctly with DCP save/load. For loading into a new model/optimizer instance, torchtrain uses lightweight `ModelWrapper` and `OptimizerWrapper`. In the added unit test, we use `get_optimizer_state_dict` directly to show the minimal requirement for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121747
Approved by: https://github.com/wz337
2024-03-14 17:24:17 +00:00
443444dc7f [c10d] Add generic scuba logging capability into c10d (#121859)
Summary:
This diff periodically (e.g., every 30s) logs critical collective
progress status to a Scuba table, starting with a few metrics such as the last
enqueued seq id.

With the Scuba table, we hope to easily detect the straggler of a PG,
e.g., a rank that has not progressed its seq_ for X seconds while other ranks in the same PG have a larger seq_.

The implementation needs to make sure that Scuba will be used only for FB internal use
cases.

For OSS, we still provide a generic logger data struct and logger that can be
easily extended. If users do not register the logger, nothing will be logged.

Test Plan:
Re-use the existing unit test for the fb side of operations, such as
test_register_and_dump in test_c10d_manifold, change the dump period to a
very small number, e.g., 1ms, and verify that the logs are correctly shown in the Scuba table:
https://fburl.com/scuba/c10d_work_update/9trhwnmy

Reviewed By: wconstab

Differential Revision: D54556219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121859
Approved by: https://github.com/wconstab
2024-03-14 16:03:45 +00:00
83f8e51404 Add CUTLASS kernel as choice for (u)int8/(b)float16 mixed MM autotuning (#119986)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119986
Approved by: https://github.com/kadeng
ghstack dependencies: #119685
2024-03-14 16:03:10 +00:00
be0bdf111c relax tol for flaky nansum_out_dtype_cuda_float32 test (#121550)
TestReductionsCUDA.test_nansum_out_dtype_cuda_float32 would fail or pass depending on the random inputs. This was observed by ROCm internal QA testing, but the same problematic random inputs break the test on CUDA as well, verified on a V100.

There is precedent in another test within the same file to relax tolerance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121550
Approved by: https://github.com/albanD
2024-03-14 15:28:45 +00:00
7e13b5ba29 Checkout release branch rather then commit_hash when building triton release (#115379) (#121901)
Cherry pick of https://github.com/pytorch/pytorch/pull/115379 from Release 2.2 that should be applied to main and Release 2.3 as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121901
Approved by: https://github.com/DanilBaibak, https://github.com/jeanschmidt
2024-03-14 14:42:29 +00:00
956059fa2e [Fix] Fixed behaviour for the conversion of complex tensors to bool (#121803)
Fixes #120875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121803
Approved by: https://github.com/lezcano
2024-03-14 13:35:15 +00:00
1251f0fa31 Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119685
Approved by: https://github.com/cpuhrsch, https://github.com/kadeng
2024-03-14 13:25:23 +00:00
38d9bb5abc Make PyTorch compilable against upcoming Numpy-2.0 (#121880)
Test plan:
```
% python -c "import torch;import numpy;print(numpy.__version__, torch.tensor(numpy.arange(3, 10)))"
2.1.0.dev0+git20240312.9de8a80 tensor([3, 4, 5, 6, 7, 8, 9])
% python -c "import torch;print(torch.rand(3, 3).numpy())"
[[0.0931946  0.44874293 0.8480404 ]
 [0.93877375 0.10188377 0.67375803]
 [0.02520031 0.89019287 0.5691561 ]]

```
Fixes https://github.com/pytorch/pytorch/issues/121798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121880
Approved by: https://github.com/albanD
2024-03-14 05:36:50 +00:00
b4c53aa0ec Do not compile FP16 arith internally (#121844)
Also, decorate unused args with `C10_UNUSED` to fix linter warnings
Test Plan: `buck2 build -c fbcode.arch=aarch64  //caffe2:ATen-cpu`

Differential Revision: D54870507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121844
Approved by: https://github.com/osalpekar
2024-03-14 05:19:02 +00:00
3eb322ff29 Handle transitive replacements in Triton kernel mutation analysis (#121867)
Summary: Previously, we didn't handle transitive replacements in MLIR walk-based function info mining in the Triton kernel mutation analysis pass. As a result, for the TTIR below:

```
tt.func private @cumsum__fp32S1_16S__1cconstexpr_1__2cconstexpr_False_(%arg0: tensor<1x16xf32> loc("...":296:0)) -> tensor<1x16xf32> attributes {noinline = false} {
    %0 = "tt.scan"(%arg0) <{axis = 1 : i32, reverse = false}> ({
    ^bb0(%arg1: f32 loc(unknown), %arg2: f32 loc(unknown)):
      %1 = tt.call @_sum_combine__fp32_fp32__(%arg1, %arg2) : (f32, f32) -> f32 loc(#loc16)
      tt.scan.return %1 : f32 loc(#loc16)
    }) : (tensor<1x16xf32>) -> tensor<1x16xf32> loc(#loc16)
    tt.return %0 : tensor<1x16xf32> loc(#loc18)
  } loc(#loc15)
```

the mined function dict looked like this:

```
{Intermediate(idx=25): [Op(name='tt.call',
                           fn_call_name='_sum_combine__fp32_fp32__',
                           args=[Intermediate(idx=26),
                                 Intermediate(idx=26)])],
 Intermediate(idx=27): [Op(name='tt.scan.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=25)])],
 Intermediate(idx=-4): [Op(name='tt.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=27)])]}
```

whereas it should look like this (note the `Param(idx=0)` arguments of the `tt.call`):

```
{Intermediate(idx=25): [Op(name='tt.call',
                           fn_call_name='_sum_combine__fp32_fp32__',
                           args=[Param(idx=0),
                                 Param(idx=0)])],
 Intermediate(idx=27): [Op(name='tt.scan.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=25)])],
 Intermediate(idx=-4): [Op(name='tt.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=27)])]}
```

This is fixed in the PR.

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_cumsum
.
----------------------------------------------------------------------
Ran 1 test in 1.771s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121867
Approved by: https://github.com/oulgen
2024-03-14 04:06:37 +00:00
4cd503c1f3 Enable FX graph cache for a batch of inductor tests (#121696)
Summary: Get more FX graph cache coverage by enabling it for these unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121696
Approved by: https://github.com/eellison
2024-03-14 03:39:59 +00:00
15abc56bd5 Graph break on step closure in optimizer (#121777)
Fixes https://github.com/pytorch/pytorch/issues/116494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121777
Approved by: https://github.com/yanboliang
2024-03-14 03:18:23 +00:00
f85f58bf86 Fix quantized linear vulkan tests (#120960)
Summary: Fixed quantized linear Vulkan tests by using an old pack_biases function.

Test Plan:
**Vulkan quantized api tests**
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

...
...
...
[ RUN      ] VulkanAPITest.linear_2d_flat
[       OK ] VulkanAPITest.linear_2d_flat (5 ms)
[ RUN      ] VulkanAPITest.linear_2d_small
[       OK ] VulkanAPITest.linear_2d_small (0 ms)
[ RUN      ] VulkanAPITest.linear_2d_large
[       OK ] VulkanAPITest.linear_2d_large (4 ms)
[ RUN      ] VulkanAPITest.linear_3d_flat
[       OK ] VulkanAPITest.linear_3d_flat (2 ms)
[ RUN      ] VulkanAPITest.linear_3d_small
[       OK ] VulkanAPITest.linear_3d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_3d_large
[       OK ] VulkanAPITest.linear_3d_large (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_flat
[       OK ] VulkanAPITest.linear_4d_flat (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_small
[       OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_large
[       OK ] VulkanAPITest.linear_4d_large (2 ms)
...
...
[----------] 85 tests from VulkanAPITest (1704 ms total)

[----------] Global test environment tear-down
[==========] 85 tests from 1 test suite ran. (1704 ms total)
[  PASSED  ] 85 tests.

  YOU HAVE 8 DISABLED TESTS

**Vulkan api tests**
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

[----------] Global test environment tear-down
[==========] 426 tests from 1 test suite ran. (4997 ms total)
[  PASSED  ] 423 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] VulkanAPITest.log_softmax_underflow
[  FAILED  ] VulkanAPITest.log_softmax

Differential Revision: D54396367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120960
Approved by: https://github.com/yipjustin
2024-03-14 02:23:00 +00:00
a37caa6ed3 [Quant][Inductor] Enable quantization linear pattern fusion with int8_mixed_bf16 for gelu (#116004)
**Summary**
Enable the QLinear unary pattern for gelu with int8_mixed_bf16

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_gelu_int8_mixed_bf16

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116004
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
ghstack dependencies: #114853, #114854
2024-03-14 01:52:12 +00:00
43d68e9c8f [Quant][Inductor] Enable quantization linear pattern fusion for gelu inside inductor (#114854)
**Summary**
Enable QLinear Unary pattern for gelu with int8

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_gelu_cpu

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114854
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #114853
2024-03-14 01:49:14 +00:00
25e00545bb [Quant][PT2E] Enable linear and linear-unary post-op gelu quant recipe for x86 inductor quantizer (#114853)
**Summary**
Add Gelu for linear-unary post-op quantization recipe to x86 inductor quantizer.

**Test plan**
python -m pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_linear_unary_gelu
python test/test_quantization.py -k test_linear_unary_with_quantizer_api
Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114853
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2024-03-14 01:46:35 +00:00
a04e7fca8e Use memcache versioning for autotune remote cache (#121748)
Summary: The internal training platform doesn't get updated very frequently, so let's use versioning for memcache.

Test Plan: existing tests

Differential Revision: D54818197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121748
Approved by: https://github.com/aakhundov, https://github.com/jansel
2024-03-14 00:36:10 +00:00
7e076c75bd [C10D] Fix coalescedCollective op Flight Recording (#120430)
Also noticed and filed https://github.com/pytorch/pytorch/issues/120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120430
Approved by: https://github.com/kwen2501
2024-03-13 23:55:00 +00:00
bf7ac4ddf7 Revert "[export] allow Dim(1,2) for export dynamic shapes (#121642)"
This reverts commit a8dcbf2749f2081f939621db2d38fd15ab7e34a3.

Reverted https://github.com/pytorch/pytorch/pull/121642 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/121642#issuecomment-1996121710))
2024-03-13 23:51:20 +00:00
3e02a7efcd Only FA2 doesn't support attn-mask (#121825)
Fixes #121783

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121825
Approved by: https://github.com/cpuhrsch
2024-03-13 23:03:39 +00:00
a8dcbf2749 [export] allow Dim(1,2) for export dynamic shapes (#121642)
Current dynamic shapes implementation fixes lower range of Dims to be 2 for analysis, but allows 0/1 shapes during runtime. This leads to failures when initializing Dim(1,2). This PR sets the lower bound to 0, and avoids erroring out when conflicting with the generated (2, maxsize) constraint during analysis.

Also resolves a derived dim constraints issue with the following code:
```
class Bar(torch.nn.Module):
    def forward(self, x, y):
        return x + y[1:]

dx = Dim("dx", min=1, max=3)
ep = export(
    Bar(),
    (torch.randn(2, 2), torch.randn(3, 2)),
    dynamic_shapes=({0: dx, 1: None}, {0: dx+1, 1: None})
)
print(ep.range_constraints)
```

In main:
```
{s0: ValueRanges(lower=2, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=3, upper=4, is_bool=False)}
```

This PR:
```
{s0: ValueRanges(lower=1, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=2, upper=4, is_bool=False)}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121642
Approved by: https://github.com/avikchaudhuri
2024-03-13 22:59:07 +00:00
70c6f542f2 Revert "[dynamo] Convert invalid args into graph breaks (#121784)"
This reverts commit 0df39480f6a74c9094555e8a61a8c8bb01716d4e.

Reverted https://github.com/pytorch/pytorch/pull/121784 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks ONNX test in trunk 0c1ac4484d ([comment](https://github.com/pytorch/pytorch/pull/121784#issuecomment-1995979435))
2024-03-13 22:12:43 +00:00
aaff8d274a CUDA fast path for _chunk_cat() (#120678)
This PR provides CUDA fast path implementation for ATen Op `_chunk_cat` (#121081).

Performance on a production benchmark:

- Float16 in, Float16 out: 249 -> 500
- BFloat16 in, BFloat16 out: 248 -> 500
- BFloat16 in, Float32 out: 126 -> 278
- Float32 in, Float32 out: 153 -> 260
- Float64 in, Float64 out: 79 -> 132
- int8 in, int8 out: 332 -> 908
- int16 in, int16 out: 250 -> 489
- int32 in, int32 out: 153 -> 260
- int64 in, int64 out: 79 -> 132

Unit: Billion elements per second. Hardware: H100. Baseline: [Existing FSDP implementation](7b3febdca7/torch/distributed/_composable/fsdp/_fsdp_collectives.py (L176))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120678
Approved by: https://github.com/yifuwang
2024-03-13 22:02:06 +00:00
c53e3f57b5 allow fp16 in quant/dequant decompositions (#121738)
Test Plan:
```
buck2 run mode/dev-nosan mode/inplace executorch/examples/models/llama2:export_llama -- -c ~/llama/ultra_new_checkpoint.pt -p ~/llama/params.json -kv -E 8,8 -d fp16 --pt2e_quantize "xnnpack_dynamic" -2
```

Reviewed By: kirklandsign

Differential Revision: D54785950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121738
Approved by: https://github.com/jerryzh168
2024-03-13 21:45:08 +00:00
c7193f4099 [DDP][PT2D][2D] Enable DDP + TP and add test for compiled DDP + TP (#120479)
This PR enables DDP + TP using a TP internal API. This should not be the final implementation. A more sound implementation is to inline the TP internal API in DDP. In other words, DDP needs to be aware of DTensor so that we can support 2D state_dict.

This PR adds a compiled DDP + TP test to ensure the new compiled DDP fusion doesn't break TP all_reduce.

**TODOs**

- [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass.
- [x] Add unit tests to ensure the fusion doesn't break DDP + TP.
- [ ] Group different PG and data type of all_reduces.
- [ ] Mixed precision supports and tests
- [ ] Implement the fusions with Inductor IR.
- [ ] Add auto bucketing based on Inductor profiling.

Differential Revision: [D54105050](https://our.internmc.facebook.com/intern/diff/D54105050/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120479
Approved by: https://github.com/wz337
ghstack dependencies: #113209
2024-03-13 21:41:22 +00:00
dd568f4207 [Export, AOTInductor] Populate ShapeEnv's var_to_val during deserialization (#121759)
Summary:
Deserialization didn't populate ShapeEnv's `var_to_val` field properly, and AOTInductor is relying on this field to compile dynamic shape properly.
As a result, when AOTI failed at compiling a deserialized ExportedProgram.

Test Plan: buck2 test  mode/dev-nosan caffe2/test/inductor/fb:test_aot_inductor_pt2_inference

Differential Revision: D54559494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121759
Approved by: https://github.com/avikchaudhuri
2024-03-13 21:28:25 +00:00
a2a4693c1b Revert "Init CUDA instead of faking memory stats (#121698)"
This reverts commit 2460f0b1c7bb6e088aca1f6e9bb62c834053d71b.

Reverted https://github.com/pytorch/pytorch/pull/121698 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks inductor CPU tests 5b90074540 ([comment](https://github.com/pytorch/pytorch/pull/121698#issuecomment-1995868090))
2024-03-13 21:23:42 +00:00
45a835cef2 Revert "[compiled autograd] free stack objects before calling compiled graph (#121707)"
This reverts commit 5b90074540577267c29f5f784be123ee54f6491d.

Reverted https://github.com/pytorch/pytorch/pull/121707 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks inductor CPU tests 5b90074540 ([comment](https://github.com/pytorch/pytorch/pull/121698#issuecomment-1995868090))
2024-03-13 21:23:42 +00:00
8b1b61bc70 [compiled autograd] support custom ops backed by c++ autograd::Function (#120681)
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports

limitations:
- requires the user to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable, and possibly implement additional compiled_args + apply_with_saved methods. This was the only way I could think of to guarantee soundness.
- will throw if we can't hash the saved_data, i.e. for any type other than list and dict that is not implemented in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- can technically fail silently if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph is called that contains a different custom autograd::Function with an identical implementation. This case seems extremely unlikely, and the only alternative to hashing I can think of is compiling with reflection.
- tensors not saved via save_variables are not lifted, and are specialized on TensorImpl*'s hash (treated as a memory address). If needed, we can lift them.

Differential Revision: [D54818488](https://our.internmc.facebook.com/intern/diff/D54818488)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
2024-03-13 21:13:21 +00:00
58ff55aac5 Add support for tt.scan to triton kernel mutation analysis (#121828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121828
Approved by: https://github.com/aakhundov, https://github.com/Skylion007
2024-03-13 20:37:56 +00:00
8e6d572b4e [DDP][PT2D] Allreduce fusion fx pass using concat and all_reduce_coalesced (#113209)
Differential Revision: [D49858057](https://our.internmc.facebook.com/intern/diff/D49858057/)

**TL;DR**
This PR implements 2 different DDP all_reduce fusions in Inductor post_grad fx passes. The two fusions are 1) fusion with a concat op and 2) fusion with all_reduce_coalesced. When DDP detects that the Python reducer is being used, DDP will automatically turn on the fusion.

This PR does not invent any algorithm and simply reflects the bucket size users set to DDP.

**Implementation Details**
*Fusion with concat op*
The idea of this fusion is to use a concat op to concatenate all the gradients into one tensor and perform one `all_reduce`. After the `wait` op of the `all_reduce`, splitting and reshaping are also performed to recover the individual gradients.

Because DDP needs to perform gradient scaling, the benefit of using this fusion is that we can perform the gradient scaling once over the concatenated buffer.
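A minimal standalone sketch of this concat-based fusion (illustrative eager-mode code using plain collectives, not the actual Inductor post_grad implementation):

```python
import torch
import torch.distributed as dist


def fused_allreduce_grads(grads, world_size):
    # concatenate all flattened gradients into one buffer and issue a single all_reduce
    flat = torch.cat([g.reshape(-1) for g in grads])
    dist.all_reduce(flat)
    # gradient scaling can be done once over the concatenated buffer
    flat.div_(world_size)
    # split and reshape back into the individual gradients
    outs = torch.split(flat, [g.numel() for g in grads])
    return [o.reshape(g.shape) for o, g in zip(outs, grads)]
```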

*Fusion with `all_reduce_coalesced`*
The idea of this fusion is to use the `all_reduce_coalesced` op to directly perform the `all_reduce` over multiple buffers. This avoids the copy overhead but may not achieve the best NCCL performance. In addition, because there are multiple buffers, we cannot do one simple gradient scaling but have to rely on `foreach_div` for the gradient scaling.

**Limitations**
Current fusions do not distinguish `all_reduce` generated by different DDP modules. This is okay if all DDP instances use the same PG and data type. The support of multiple DDP instances with different PG and data type will come in the later PRs.

**TODOs**
- [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass.
- [ ] Add unit tests to ensure the fusion doesn't break DDP + TP.
- [ ] Group different PG and data type of `all_reduce`s.
- [ ] Mixed precision supports and tests
- [ ] Implement the fusions with Inductor IR.
- [ ] Add auto bucketing based on Inductor profiling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113209
Approved by: https://github.com/yf225
2024-03-13 20:37:09 +00:00
0c1ac4484d Support call_method in DDPOptimizer (#121771)
This PR fixes Issue #111279.

While #111279 reported the issue with `MultiheadAttention`, a minimal reproduction would be:
```python
class ToyModel(nn.Module):
    def __init__(self,):
        super().__init__()
        self.linear = nn.Linear(128, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear.forward(x) # Error
        # return self.linear(x) # OK
```

Dynamo treats `self.linear(x)` as `call_module` while treating `self.linear.forward(x)` as a [`get_attr` and a `call_method`](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/variables/nn_module.py#L358-L378). However, existing DDPOptimizer assumes, for a `get_attr` node, `getattr(gm, node.target)` gives a tensor with the `requires_grad` attribute. Existing DDPOptimizer also does not support `call_method` nodes.

This PR adds support for `call_method` and check on `get_attr`. It also checks if a module's parameters have been added to a bucket to support multiple method calls from the same module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121771
Approved by: https://github.com/yf225
2024-03-13 20:03:15 +00:00
0df39480f6 [dynamo] Convert invalid args into graph breaks (#121784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121784
Approved by: https://github.com/yanboliang
ghstack dependencies: #121615, #121616
2024-03-13 20:02:33 +00:00
5b90074540 [compiled autograd] free stack objects before calling compiled graph (#121707)
Moved the compilation code into _compiled_autograd_impl, which frees stack-allocated objects (e.g. AutogradCompilerCall) before the compiled graph is called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121707
Approved by: https://github.com/jansel
ghstack dependencies: #121698
2024-03-13 19:31:44 +00:00
2460f0b1c7 Init CUDA instead of faking memory stats (#121698)
This is very confusing when checking memory usage while allocations are only happening via the C API. We should change it to a warning/error or just init CUDA. Codepaths that run in non-CUDA environments shouldn't call into these functions in the first place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121698
Approved by: https://github.com/jansel
2024-03-13 19:31:44 +00:00
cd949d133e Support setUpClass & tearDownClass with instantiate_device_type_tests() (#121686)
Summary: instantiate_device_type_tests() creates dynamic test case classes that derive from a "template class". By default, the test harness will call the setUpClass() and tearDownClass() methods defined by the template class (if the template class defines them). We can explicitly create these methods in the dynamic class and arrange to call those methods in both base classes. That allows us to support setUpClass & tearDownClass test classes used with instantiate_device_type_tests().
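A minimal sketch of the pattern this enables (the template class and its test body are hypothetical):

```python
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests


class TestSharedSetupTemplate(TestCase):
    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        cls.shared_resource = object()  # expensive per-class setup

    @classmethod
    def tearDownClass(cls):
        cls.shared_resource = None
        super().tearDownClass()

    def test_uses_shared_resource(self, device):
        self.assertIsNotNone(self.shared_resource)


# generates per-device classes (CPU, CUDA, ...) that now also invoke the
# template's setUpClass/tearDownClass
instantiate_device_type_tests(TestSharedSetupTemplate, globals())

if __name__ == "__main__":
    run_tests()
```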
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121686
Approved by: https://github.com/ezyang, https://github.com/eellison
2024-03-13 18:28:42 +00:00
ffabb25c48 Count the number of entries directly in avg_pool2d lowering (#121429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121429
Approved by: https://github.com/peterbell10
ghstack dependencies: #116085
2024-03-13 18:19:47 +00:00
a19a05fd1d Add lowering for avg_pool{1, 3}d (#116085)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116085
Approved by: https://github.com/peterbell10
2024-03-13 18:19:47 +00:00
79fac48bb3 Use pytorch bot's labeler (#121762)
Change corresponds to https://github.com/pytorch/test-infra/pull/4995
Testing (very light) in https://github.com/malfet/deleteme/pull/81
Should help with https://github.com/pytorch/test-infra/issues/4950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121762
Approved by: https://github.com/huydhn
2024-03-13 17:16:49 +00:00
05df03ec1b Allow custom attributes for torch function subclasses (#121693)
Added custom attribute access with test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121693
Approved by: https://github.com/anijain2305
2024-03-13 17:01:57 +00:00
92a2b214f8 Make translation validation more user friendly (#120880)
Two main changes:

- Don't rethrow the exception when we fail in TV, just throw the entire
  thing and trust the user will inspect logs / backtrace to see we
  failed in TV

- Don't add an event to the TV logs until we've confirmed that the event
  actually runs without erroring.  This prevents us from putting events
  that e.g., fail because the guard on data dependent size, and the
  failing in TV.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120880
Approved by: https://github.com/lezcano, https://github.com/ysiraichi
2024-03-13 15:21:59 +00:00
b1d5998956 Upgrade to tlparse 0.3.7 (#121772)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121772
Approved by: https://github.com/Skylion007
2024-03-13 15:21:20 +00:00
5498804ec2 [MPS] Fix naive matmul for BFloat16 (#121731)
Will only work on MacOS14 or newer, so compile the shader with `MTLLanguageVersion_3_1` when appropriate

Fixes https://github.com/pytorch/pytorch/issues/121583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121731
Approved by: https://github.com/albanD
2024-03-13 14:34:03 +00:00
559ca13b3f [dynamo] Refactor TorchInGraphFunctionVariable for compile time (#121616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121616
Approved by: https://github.com/oulgen
ghstack dependencies: #121615
2024-03-13 14:21:21 +00:00
51cf57c6c6 Revert "Include torch warn in each error in cudnn/Conv_v8.cpp (#120719)"
This reverts commit 5fd7f5c4e336c2c3041e10529990c620cc8cf9a5.

Reverted https://github.com/pytorch/pytorch/pull/120719 on behalf of https://github.com/janeyx99 due to sorry but am reverting as this prints unwanted warnings even when an exception is not thrown  ([comment](https://github.com/pytorch/pytorch/pull/120719#issuecomment-1994491826))
2024-03-13 14:09:38 +00:00
a157a0d00d [constraints] Fix scalar type for constraint_range to Long (#121752)
Differential Revision: [D54822125](https://our.internmc.facebook.com/intern/diff/D54822125)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121752
Approved by: https://github.com/ezyang
2024-03-13 11:11:09 +00:00
7fe0cc53e9 make _process_dynamic_shapes an implementation detail (#121713)
Summary: `_process_dynamic_shapes` converts new dynamic shapes to old constraints, but in the future may not need to do so. Preparing for that future.

Test Plan: CI

Differential Revision: D54780374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121713
Approved by: https://github.com/tugsbayasgalan
2024-03-13 08:33:00 +00:00
5088e4956e Add quantized conv transpose2d op (#120151)
Test Plan:
Run vulkan api test:
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 418 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 418 tests from VulkanAPITest
....
[----------] Global test environment tear-down
[==========] 418 tests from 1 test suite ran. (4510 ms total)
[  PASSED  ] 417 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 9 DISABLED TESTS

Run quantized vulkan api test: Note that the quantized linear tests are failing but all the convolution tests still pass. Linear failures are being debugged.
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 86 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 86 tests from VulkanAPITest
...
[  PASSED  ] 77 tests.
[  FAILED  ] 9 tests, listed below:
[  FAILED  ] VulkanAPITest.linear_2d_flat
[  FAILED  ] VulkanAPITest.linear_2d_small
[  FAILED  ] VulkanAPITest.linear_2d_large
[  FAILED  ] VulkanAPITest.linear_3d_flat
[  FAILED  ] VulkanAPITest.linear_3d_small
[  FAILED  ] VulkanAPITest.linear_3d_large
[  FAILED  ] VulkanAPITest.linear_4d_flat
[  FAILED  ] VulkanAPITest.linear_4d_small
[  FAILED  ] VulkanAPITest.linear_4d_large

 9 FAILED TESTS
  YOU HAVE 8 DISABLED TESTS

Differential Revision: D52344261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120151
Approved by: https://github.com/yipjustin
2024-03-13 08:09:57 +00:00
e99fa0042c Back out "[DeviceMesh] Add support for nD slicing (#119752)" (#121763)
Summary:
Original commit changeset: e52b8809c8d8

Original Phabricator Diff: D54778906

We have to back out this diff.
D54778906 seems to be causing test failures for APF, blocking trunk health and hence the release. We are just starting to look at the issue. T182209248

Test Plan: Sandcastle

Reviewed By: satgera

Differential Revision: D54825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121763
Approved by: https://github.com/osalpekar
2024-03-13 07:22:08 +00:00
be33d31ae2 add std::ostream& operator<< for BFloat16 in BFloat16.h (#121302)
This PR moves `operator<<` of `BFloat16` to `BFloat16.h`.

Previously, this function lived in `TensorDataContainer.h`. If you need to `std::cout` a `BFloat16` variable when debugging, `TensorDataContainer.h` has to be included. This is inconvenient and counterintuitive.

Other dtypes such as `Half` define their `operator<<` in the headers where they are defined, such as `Half.h`. Therefore, it makes more sense to move `operator<<` of `BFloat16` to `BFloat16.h`.
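A minimal sketch of the usage this enables (assuming the usual `c10/util` include path; previously `TensorDataContainer.h` would have had to be included):

```cpp
#include <c10/util/BFloat16.h>
#include <iostream>

int main() {
  c10::BFloat16 v(1.5f);
  // operator<< now lives next to the type, so plain iostream printing works
  std::cout << "v = " << v << std::endl;
  return 0;
}
```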

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121302
Approved by: https://github.com/ezyang
2024-03-13 06:47:34 +00:00
5986552ebe [nit][DCP][DSD] Remove variables not being used in test_state_dict.py #121204 (#121773)
Replacing https://github.com/pytorch/pytorch/pull/121204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121773
Approved by: https://github.com/Skylion007
2024-03-13 06:35:04 +00:00
da2a9a0512 _foreach_copy with different src/dst dtypes (#121717)
Fixes #115171

```
torch.version.git_version = '6bff6372a922fe72be5335c6844c10e2687b967d', torch.cuda.get_device_name() = 'NVIDIA RTX 6000 Ada Generation'
[------------------ foreach copy - self: torch.float32 - shape: (512, 512) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          14.2        |          12.6        |           12.7
      num_tensors: 256   |         688.0        |         510.3        |          514.0
      num_tensors: 1024  |        2768.0        |        2053.3        |         2047.7

Times are in microseconds (us).

[------------------ foreach copy - self: torch.float16 - shape: (512, 512) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          10.0        |           8.9        |            8.8
      num_tensors: 256   |         497.6        |         344.3        |          348.3
      num_tensors: 1024  |        1991.9        |        1392.0        |         1389.0

Times are in microseconds (us).

[----------------- foreach copy - self: torch.bfloat16 - shape: (512, 512) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          10.0        |           8.8        |            8.8
      num_tensors: 256   |         497.5        |         344.5        |          348.0
      num_tensors: 1024  |        1993.2        |        1390.4        |         1387.5

Times are in microseconds (us).

[------------------ foreach copy - self: torch.float32 - shape: (515, 515) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          19.0        |          17.9        |           18.1
      num_tensors: 256   |         707.2        |         540.2        |          543.1
      num_tensors: 1024  |        2900.6        |        2156.6        |         2159.2

Times are in microseconds (us).

[------------------ foreach copy - self: torch.float16 - shape: (515, 515) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          13.8        |          13.7        |           13.1
      num_tensors: 256   |         513.2        |         352.6        |          350.4
      num_tensors: 1024  |        2047.6        |        1404.4        |         1400.4

Times are in microseconds (us).

[----------------- foreach copy - self: torch.bfloat16 - shape: (515, 515) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          13.6        |          12.8        |           14.2
      num_tensors: 256   |         511.9        |         351.8        |          350.6
      num_tensors: 1024  |        2045.4        |        1402.2        |         1401.4

Times are in microseconds (us).

```
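A minimal usage sketch matching the benchmark shapes above (the exact signature of this private op is an assumption):

```python
import torch

dsts = [torch.empty(512, 512, dtype=torch.bfloat16, device="cuda") for _ in range(32)]
srcs = [torch.randn(512, 512, dtype=torch.float32, device="cuda") for _ in range(32)]
# each destination is overwritten with the corresponding source cast to the destination dtype
torch._foreach_copy_(dsts, srcs)
```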
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121717
Approved by: https://github.com/janeyx99
2024-03-13 05:42:28 +00:00
a13dd92d88 [dynamo] Minor compile time optimizations in torch.py (#121615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121615
Approved by: https://github.com/oulgen
2024-03-13 05:36:22 +00:00
d619be57c0 [executorch hash update] update the pinned executorch hash (#121056)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121056
Approved by: https://github.com/pytorchbot
2024-03-13 04:54:16 +00:00
0c1d59b72f CI: Fix flaky artifact upload step (#121733)
This PR changes the upload artifact step of the wheels and conda build to write
each matrix entry to a different file. This is because updating the same file
from multiple jobs can be flaky as is warned in the docs for upload-artifact

> Warning: Be careful when uploading to the same artifact via multiple jobs as artifacts may become corrupted. When uploading a file with an identical name and path in multiple jobs, uploads may fail with 503 errors due to conflicting uploads happening at the same time. Ensure uploads to identical locations to not interfere with each other.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121733
Approved by: https://github.com/huydhn
ghstack dependencies: #121268
2024-03-13 04:42:52 +00:00
52ed35bb64 [inductor] Update triton pin (#121268)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121268
Approved by: https://github.com/oulgen, https://github.com/malfet
2024-03-13 04:42:52 +00:00
07330ff7b6 [MPS][BE] Define _compute_tolerances (#121754)
Right now logic is mostly duplicated between `test_output_match` and `test_output_gradient_match`
So move tolerance definition logic into a shared `_compute_tolerances` function and
only keep differences (for example, grad checks are completely skipped for `torch.unique`) in the respective test functions.

Also, increase tolerance for `pow` and `__rpow__` only on MacOS-13.3 or older and remove GRAD xfaillist for those

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121754
Approved by: https://github.com/albanD
2024-03-13 04:08:06 +00:00
f83392b677 cublasLt workspace warning info is misleading, the unit of measuremen… (#121073)
The cublasLt workspace warning message is misleading: the unit of measurement should be KiB instead of bytes.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121073
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-03-13 03:37:40 +00:00
e755dab0d1 [ROCm] Enable several test_unary_ufuncs UTs on ROCm (#121104)
Enabled:
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small_atan_cuda_complex64
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small__refs_atan_cuda_complex64
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small__refs_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal__refs_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal_atanh_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal__refs_atanh_cuda_complex128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121104
Approved by: https://github.com/jeffdaily, https://github.com/ezyang
2024-03-13 03:34:22 +00:00
f24ae66abf [AOTInductor] Skip tests on RoCM for duplicate_constant_folding (#121750)
Summary: Skip AMD tests for duplicated kernels in constant folding

Test Plan: Diff is test

Differential Revision: D54820804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121750
Approved by: https://github.com/huydhn
2024-03-13 03:21:21 +00:00
9f235971f0 Gate tt.reduce Triton mutation tests on Triton version (#121753)
Summary: The goal is to make the `test_argmax` and `test_reduce_sum` to work both before and after https://github.com/openai/triton/pull/3191 is included into the Triton pin. This is important to make those tests work during the Triton pin update process both in OSS and internally.

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_reduce_sum -k test_argmax
..
----------------------------------------------------------------------
Ran 2 tests in 1.906s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121753
Approved by: https://github.com/Skylion007
2024-03-13 01:43:02 +00:00
7d05c4c093 Remove error anti-pattern when dealing with dynamic shape output (#121681)
There are cases where capture_dynamic_output_shape_ops=True and we will still see a DynamicOutputShapeException, for example when an op doesn't have a meta kernel implemented to return the correct dynamic shape output. If we blindly give users instructions to set capture_dynamic_output_shape_ops to True, they will try it and see no change, as witnessed in this issue:
https://github.com/pytorch/pytorch/issues/121036#issuecomment-1985221435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121681
Approved by: https://github.com/tugsbayasgalan
2024-03-13 00:45:23 +00:00
9df0dca7f6 Revert "[ Inductor ] Shape padding honors output stride preservation (#120797)"
This reverts commit 57fc35a3af09f7657b2be593a1046f0ac2dd50ab.

Reverted https://github.com/pytorch/pytorch/pull/120797 on behalf of https://github.com/williamwen42 due to perf regression on dashboard ([comment](https://github.com/pytorch/pytorch/pull/120797#issuecomment-1992857428))
2024-03-13 00:43:34 +00:00
02bb2180f4 [torch export] replace traceback.extract_stack with CapturedTraceback.extract (#121449)
Summary:
with a simple bench in TestDeserializer.test_basic function:
```
time_start = time.time()
for i in range(1000):
    self.check_graph(MyModule(), inputs)
warnings.warn(f"time_taken: {time.time() - time_start}")
```
and forcing FakeTensorConfig.debug to True, record_stack_traces to True, and the logging level to debug, it shows that the changed code is consistently around 20 secs faster (~90s vs. originally ~110s).
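A minimal sketch of the replacement (the import path and the `format()` call are assumptions based on the description):

```python
from torch.utils._traceback import CapturedTraceback

# capture the current stack lazily instead of eagerly formatting it with
# traceback.extract_stack(); only pay the formatting cost when actually needed
tb = CapturedTraceback.extract(skip=1)
print("".join(tb.format()))
```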

Test Plan:
test passed, see summary

compared debug trace before and after:
- exactly the same for fake tensor and proxy callsite https://www.internalfb.com/intern/diffing/?paste_number=1189883685
- slightly different for the user frame in proxy node https://www.internalfb.com/intern/diffing/?paste_number=1189884347

Differential Revision: D54237017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121449
Approved by: https://github.com/angelayi
2024-03-13 00:19:05 +00:00
7a53dedb07 CI: Specify libc and libstdcxx versions in conda environments (#121556)
Without this we get mismatches between the GLIBC and GLIBCXX ABI used
by conda packages vs pytorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121556
Approved by: https://github.com/isuruf, https://github.com/malfet
2024-03-13 00:12:54 +00:00
68be750e17 Cleanup some exception handling in triton mutation tracking (#121739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121739
Approved by: https://github.com/Skylion007
ghstack dependencies: #121690
2024-03-13 00:02:36 +00:00
a9274c9a2c Fix aoti doc to avoid cannot bind non-const lvalue reference error (#121672)
This PR corrects the example in the AOTInductor documentation, which currently fails with:
```
/home/ubuntu/test/inference.cpp:21:62: error: cannot bind non-const lvalue reference of type ‘std::vector<at::Tensor>&’ to an rvalue of type ‘std::vector<at::Tensor>’
   21 |     std::cout << runner.run({torch::randn({2, 10}, at::kCPU)})[0] << std::endl;
      |
```
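A hedged sketch of one way to fix the snippet, as the description above implies: bind the inputs to a named lvalue instead of passing a temporary vector:

```cpp
std::vector<torch::Tensor> inputs = {torch::randn({2, 10}, at::kCPU)};
std::cout << runner.run(inputs)[0] << std::endl;
```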

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121672
Approved by: https://github.com/desertfire
2024-03-12 23:43:40 +00:00
79ee6bbde3 Support triton.language.dtype with torch.compile (#121690)
Putting this PR as an RFC since I have resorted to some horrible hacks in order to make this work.
```
(Pdb) p triton.language.float32
triton.language.fp32
(Pdb) p str(triton.language.float32)
'fp32'
(Pdb) p repr(triton.language.float32)
'triton.language.fp32'
```
This means that we need to "rewrite" them for fx graph and inductor execution.

This PR allows Mamba2 to work with `torch.compile`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121690
Approved by: https://github.com/Skylion007
2024-03-12 23:21:46 +00:00
22bb24986d [dynamo][guards] Use lazy variable tracker for func defaults (#121388)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121388
Approved by: https://github.com/jansel
2024-03-12 22:48:48 +00:00
519151a062 [fx] Preserve Fx graph node order in partitioner across runs (#115621)
Fixes #ISSUE_NUMBER
The partitioner generates a different graph on recompilation in each run.
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621
Approved by: https://github.com/izaitsevfb
2024-03-12 22:18:43 +00:00
a95ceb51a2 Release fix pinning slow-tests.json (#121746)
The apply-release-changes script adds a version to SLOW_TESTS_FILE, which should not change.

Test:
```
SLOW_VER=test
sed -i -e s#/slow-tests.json#"/slow-tests.json?versionId=${SLOW_VER}"#  tools/stats/import_test_stats.py
```
Output:
```
SLOW_TESTS_FILE = ".pytorch-slow-tests.json"
...
url = "https://ossci-metrics.s3.amazonaws.com/slow-tests.json?versionId=test"
```

related to: https://github.com/pytorch/pytorch/pull/121726
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121746
Approved by: https://github.com/huydhn
2024-03-12 22:04:55 +00:00
a5ec45f2ec [Inductor Cutlass backend] Move tests to separate file (#121489)
Move Cutlass backend related tests to test/inductor/test_cutlass_backend.py - no changes to the tests themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121489
Approved by: https://github.com/jansel
2024-03-12 21:59:48 +00:00
844bfbbd2e feat: Update Dockerfile default versions for Python, OS, and CUDA arch list (#121560)
- Update Dockerfile default versions for Python, OS, and CUDA arch list
	- Python 3.8 is EOL later this year, the `docker.Makefile` has 3.10 as default
	- `docker.Makefile` is using 22.04 so this just aligns that
	- The GPU feature list is quite dated, most of those architectures are long past EOL and we aren't getting the newer cards (A100, H100) into that list until now https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121560
Approved by: https://github.com/seemethere, https://github.com/Neilblaze, https://github.com/atalman, https://github.com/malfet
2024-03-12 21:43:26 +00:00
d62bdb087d [Profiler] add missing field device_resource_id (#121480)
Fixes #121479

Co-authored-by: Aaron Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121480
Approved by: https://github.com/aaronenyeshi
2024-03-12 21:42:53 +00:00
5478a4e348 Don't run non-strict for test case that doesn't need non-strict (#121710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121710
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #121652, #121678, #121687
2024-03-12 21:32:33 +00:00
5b506c8bce Revert "[dynamo][guards] Use lazy variable tracker for func defaults (#121388)"
This reverts commit 04a5d6e8d3f09ee6741484bcfea022228f747b09.

Reverted https://github.com/pytorch/pytorch/pull/121388 on behalf of https://github.com/osalpekar due to causing executorch model-test failures internally. See [D54707529](https://www.internalfb.com/diff/D54707529) ([comment](https://github.com/pytorch/pytorch/pull/121388#issuecomment-1992619251))
2024-03-12 21:31:18 +00:00
522d972924 [eazy] add more log when accuracy check fail (#121656)
Add these log to debug the regress of accuracy test for dm_nfnet_f0 model for training.

With these extra log when the accuracy check fail, we can verify if it's close to succeed or not. If yes that indicates there is no real issue but just flaky and we probably can tune the tolerance to fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121656
Approved by: https://github.com/jansel, https://github.com/Skylion007
2024-03-12 20:58:20 +00:00
f50c652422 avoid aten dispatch shadowing type with variable (#121659)
Summary:
`DECLARE_DISPATCH` shadows the dispatch-stub type with a variable of the same name:
`extern TORCH_API struct name name` -> `extern TORCH_API struct gemm_stub gemm_stub`, for instance.
This is probably dangerous behavior to rely on, as the compiler always needs to resolve to the type and/or the variable based on context. The previous macro fails with VS2022.

Test Plan: `buck2 build arvr/mode/win/vs2022/cpp20/opt //xplat/caffe2:aten_pow_ovrsource`

Differential Revision: D54699849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121659
Approved by: https://github.com/albanD
2024-03-12 20:50:47 +00:00
6d8a7d6e58 [pytorch] optional zero points on dequantize per channel (#121724)
Summary:
X-link: https://github.com/pytorch/executorch/pull/2364

bypass-github-export-checks

Test Plan: sandcastle

Reviewed By: mikekgfb

Differential Revision: D54709217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121724
Approved by: https://github.com/mikekgfb
2024-03-12 19:54:11 +00:00
a6149eba12 [easy] Refactor MultiOutput. codegen_list_tuple_access to use subclass type checks (#121662)
Summary:
# Why?

Right now I'm running into a case where `itype` is `torch.fx.immutable_collections.immutable_list`, which is a subclass of `list`. However, currently we're checking the concrete types (i.e. `list`), and `immutable_list` isn't explicitly supported here.

Thus, we use a runtime check that looks at the subclass so we can support subclasses -- such as immutable_list -- as well.
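A minimal sketch of the difference between the two checks (illustrative, not the actual codegen):

```python
from torch.fx.immutable_collections import immutable_list

vals = immutable_list([1, 2, 3])

# concrete-type check: misses subclasses such as immutable_list
print(type(vals) is list)      # False

# subclass-aware runtime check: accepts list and any of its subclasses
print(isinstance(vals, list))  # True
```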

Test Plan: ci

Differential Revision: D54764829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121662
Approved by: https://github.com/aakhundov
2024-03-12 19:27:56 +00:00
90e886aa6c Sanity check for non-strict (#121687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121687
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #121652, #121678
2024-03-12 18:21:32 +00:00
443e241cc5 Don't cache predispatch kernels (#121712)
Summary: Title

Test Plan: CI

Differential Revision: D54791087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121712
Approved by: https://github.com/ydwu4
2024-03-12 18:05:59 +00:00
a26480a4d1 [dtensor] move early return check into redistribute autograd function (#121653)
This PR fixes a redistribute bug by moving the early-return check into the
redistribute autograd function, so that even when we redistribute to the
same placement, the grad_placements from the `to_local` call might be
different and the redistribute backward still needs to happen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121653
Approved by: https://github.com/awgu
2024-03-12 17:37:30 +00:00
00a53b58dd Refactor release only changes to two step execution (#121728)
Refactor release-only changes to a two-step execution.

1. Step ``tag-docker-images.sh``: tags the latest docker images for the current release. This step takes about 30 min to complete. It may fail due to space issues on the local host or HTTP connection issues when pulling images, and hence should be rerun if it fails.

2. Apply release-only changes: ``apply-release-changes.sh`` prepares a PR with the release-only changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121728
Approved by: https://github.com/jeanschmidt
2024-03-12 17:22:22 +00:00
4e63d9065a [dynamo] Delete record replay tests as they are not maintained (#121705)
Fixes https://github.com/pytorch/pytorch/issues/115518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121705
Approved by: https://github.com/mlazos
2024-03-12 17:16:34 +00:00
cd1751b14f [dynamo] Measure Dynamo cache latency lookup (#121604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121604
Approved by: https://github.com/jansel
ghstack dependencies: #121614, #121622
2024-03-12 17:09:11 +00:00
22489bfe70 [dynamo][guards-cpp-refactor] Directly call root guard manager in eval_frame (#121622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121622
Approved by: https://github.com/jansel
ghstack dependencies: #121614
2024-03-12 17:09:11 +00:00
2348e8e4e7 [dynamo][guards-cpp-refactor] Simplify DYNAMIC_INDICES guard (#121614)
Use NO_HASATTR guard for the common part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121614
Approved by: https://github.com/jansel
2024-03-12 17:08:56 +00:00
0398dc9e8e Revert "[DCP] Makes fsspec public (#121508)"
This reverts commit d482614fec5fb9bccb49bf4ee4ab561e872c0f50.

Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))
2024-03-12 17:02:43 +00:00
b84f94f6a3 Restore timestamps on C++ logs without glog (#121384)
It looks like it was commented out because the original implementation was not sufficiently portable. I had to do some rewrites to the innards to make it portable. No Windows nanoseconds support because I'm lazy.

I tested by running `build/bin/TCPStoreTest` and observing the log messages there.  I am actually not sure how to look at the log messages from Python though.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121384
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-03-12 17:01:32 +00:00
704e15307e [caffe2] replace refernces to np.asscalar (#121332) (#121545)
Summary:

`np.asscalar` was deprecated and removed in a recent Numpy. It used to be implemented the following way, and the recommended alternative is to call `item()` directly:
```python
def asscalar(a):
    return a.item()
```
This fixes all of the references.

Test Plan: visual inspection and automated tests

Differential Revision: D54697760

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121545
Approved by: https://github.com/malfet
2024-03-12 16:58:47 +00:00
d1715c3adb [export] Update error message for set_grad (#121666)
Context: https://fb.workplace.com/groups/222849770514616/posts/381979051268353/?comment_id=383334957799429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121666
Approved by: https://github.com/ydwu4
2024-03-12 16:41:45 +00:00
3c8c7e2a46 [dynamo] Tweak naming for module hook bw_state (#121609)
Some minor changes not related to the other PRs in the stack

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121609
Approved by: https://github.com/yanboliang
2024-03-12 16:27:56 +00:00
7a68e0a3e8 [DCP][state_dict] Remove the check of FSDP has root (#121544)
Root may not exist due to FSDP lazy initialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121544
Approved by: https://github.com/Skylion007
ghstack dependencies: #121273, #121276, #121290
2024-03-12 15:43:19 +00:00
85dc254364 [DTensor] Moved Transformer sharding to staticmethod (#121660)
To support FSDP + TP/SP unit tests, let us factor out the canonical TP/SP sharding of `Transformer` to a staticmethod that can be called by other unit tests.

Test Plan:
```
pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121660
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
ghstack dependencies: #121360, #121357
2024-03-12 15:08:57 +00:00
cc51e100f5 [ET-VK] Enable Dynamic shape support via tensor virtual and physical resizing (#121598)
Summary:
## Context

This changeset lays the foundations for supporting dynamic shapes in the ExecuTorch Vulkan delegate via allowing Tensors to be resized in one of two ways:

1. Discarding underlying `vkImage` or `vkBuffer` and reallocating a new `vkImage` or `vkBuffer` with updated sizes. This method is intended to be used when the current `vkImage` or `vkBuffer` is not large enough to contain the new sizes.
2. Update the tensor's size metadata without reallocating any new resources. This allows shaders to interpret the underlying `vkImage` or `vkBuffer` as if it were smaller than it actually is, and allows command buffers to be preserved when sizes are changed.

Test Plan: Check CI. Tests have also been added to `vulkan_compute_api_test` that test the two methods of tensor resizing.

Differential Revision: D54728401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121598
Approved by: https://github.com/jorgep31415
2024-03-12 14:32:00 +00:00
2a99e6f299 Update error message (#121644)
Summary:
We don't want people to move to NCCL exp without an explicit opt-in. It seems that sparse allreduce was accidentally called and people were confused about whether they should use NCCL exp instead.

Update the error message to explicitly say that sparse_allreduce is not supported.

Test Plan: sandcastle

Differential Revision: D54759307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121644
Approved by: https://github.com/awgu
2024-03-12 13:04:21 +00:00
edf22f3a48 Modify signature of dequantize ops for decomposed quantized Tensor (#119173) (#121450)
Summary:
X-link: https://github.com/pytorch/executorch/pull/2308

Note: The initial purpose of this PR is to draw suggestion and feedback regarding better alternative, if any.

At present, the dequantize op for the decomposed quantized Tensor representation, e.g. dequantize_per_tensor(), assumes the output dtype is torch.float and hence does not have the output dtype in its operator argument list. However, this op signature becomes unusable when that assumption breaks: when the output dtype is different from torch.float, there is no way to specify it during dequantization.

This change is aimed at generalizing the signature of dequantize ops like dequantize_per_tensor() for wider use cases where the output dtype can be different from torch.float and needs to be passed during dequantization. The proposal is to use an additional argument named 'output_dtype' to solve the problem. However, we would also welcome suggestions and feedback regarding any better alternative that could be used instead.

cc jerryzh168 jianyuh raghuramank100 jamesr66a vkuzo jgong5 Xia-Weiwen leslie-fang-intel

Reviewed By: digantdesai

Differential Revision: D53590486

Pulled By: manuelcandales

Co-authored-by: kausik <kmaiti@habana.ai>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121450
Approved by: https://github.com/jerryzh168
2024-03-12 12:36:31 +00:00
06d2392003 Support tt.reduce in Triton kernel analysis pass (#121706)
Summary: Previously, we bailed out of the Triton kernel analysis pass when seeing a `tt.reduce` op. In this PR, we support the op and don't bail out anymore.

Test Plan: This is a bit tricky, as the extension is added to the MLIR walk-based analysis code path which is active only on when the MLIR bindings added in https://github.com/openai/triton/pull/3191 are available. So for now I've run the `test_argmax` and `test_reduce_sum` manually with a newer Triton version than the current pin. When pin updates, we'll make those tests official (left a TODO comment).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121706
Approved by: https://github.com/jansel
2024-03-12 11:38:28 +00:00
78b4793c96 [dynamo][compile-time] Caching VTs to reduce compile-time (#121031)
Reduces the `torch.compile(backend="eager")` for this code

~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)

    return x
~~~

From 18 seconds to 12 seconds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121031
Approved by: https://github.com/jansel
2024-03-12 09:19:50 +00:00
52ad2b682c Generate predispatch tests (#121678)
In this PR, we create another dynamic test class for TestExport tests that basically serializes/deserializes pre-dispatch IR. I encountered 4 additional failures; 3 of them are due to a different operator showing up in the graph, and only one is a legitimate failure, which is tracked by another task internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121678
Approved by: https://github.com/angelayi
ghstack dependencies: #121652
2024-03-12 08:34:50 +00:00
656134c38f [ROCm] enable complex128 in test_addmm_sizes_all_sparse_csr for rocm for trivial (k,n,m) cases (#120504)
This PR enables `test_addmm_sizes_all_sparse_csr_k_*_n_*_m_*_cuda_complex128` for ROCm for trivial cases  (m or n or k = 0)

CUSPARSE_SPMM_COMPLEX128_SUPPORTED is also used for `test_addmm_all_sparse_csr` and `test_sparse_matmul`, and both of them are skipped for ROCm by `@skipIfRocm` or `@skipCUDAIf(not _check_cusparse_spgemm_available())`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120504
Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang
2024-03-12 07:29:57 +00:00
86a2d67bb9 Simplify guards using info from previous guards (#121463)
Let me see what CI thinks about this one. Will add tests tomorrow.

Fixes https://github.com/pytorch/pytorch/issues/119917
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121463
Approved by: https://github.com/ezyang
2024-03-12 04:22:20 +00:00
703e83e336 Fix AARCH64 builds (#121700)
After https://github.com/pytorch/pytorch/pull/119992 was landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121700
Approved by: https://github.com/janeyx99, https://github.com/huydhn
2024-03-12 04:17:47 +00:00
159f30331f [quant][pt2e] Call sub-quantizers' transform_for_annotation in ComposableQuantizer (#121548)
Test Plan:
```
buck run caffe2/test:quantization_pt2e
```

Differential Revision: D54454707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121548
Approved by: https://github.com/jerryzh168
2024-03-12 02:59:12 +00:00
7fc497711d Also test predispatch serialization (#121652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121652
Approved by: https://github.com/zhxchen17, https://github.com/angelayi
2024-03-12 02:37:59 +00:00
6ca9ae4f86 Express y grid > 2^16 in terms of z grid (#121554)
CUDA has a max y_grid of 65535. If we're computing larger than that we can compose it in terms of z grid, which is currently unused in inductor codegen.
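A minimal sketch of the idea (this is not Inductor's actual codegen; the helper and constant names are made up here):

```python
MAX_Y_GRID = 65535  # CUDA's per-dimension limit for the y grid

def split_y_grid(total_y_blocks: int) -> tuple[int, int]:
    # Express an oversized y extent as a (y, z) pair so that y * z covers
    # total_y_blocks while each launch dimension stays within the limit.
    if total_y_blocks <= MAX_Y_GRID:
        return total_y_blocks, 1
    z = -(-total_y_blocks // MAX_Y_GRID)  # ceiling division
    y = -(-total_y_blocks // z)
    return y, z

# Inside the kernel, the logical row index is then reconstructed as, roughly,
#   y_index = program_id(1) + program_id(2) * y_grid
print(split_y_grid(200_000))  # (50000, 4) under these assumptions
```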

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121554
Approved by: https://github.com/aakhundov
2024-03-12 02:36:19 +00:00
fb1d7935bb [optim][BE] move complex_2d (last of complex tests) to OptimInfo (#120618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120618
Approved by: https://github.com/albanD
2024-03-12 02:33:21 +00:00
a37e22de70 Add Flash Attention support on ROCM (#121561)
This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton)

- [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
    * MI300X is supported. More architectures will be added once Triton supports them.
- [x] Only supports power of two sequence lengths.
    * Now it supports arbitrary sequence lengths
- [ ] No support for varlen APIs.
    * varlen API will be supported in the next release of AOTriton
- [x] Only support head dimension 16,32,64,128.
    * Now it supports arbitrary head dimensions <= 256
- [x] Performance is still being optimized.
    * Kernel is selected according to autotune information from Triton.

Other improvements from AOTriton include
* Allow more flexible Tensor storage layout
* More flexible API

This is a more extensive fix to #112997
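A hedged usage sketch (requires a supported GPU build; the context manager below is the standard way to request the flash backend and is not specific to this PR):

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16) for _ in range(3))
# Ask for the flash-attention backend only; this errors out if it is unavailable.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```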

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/malfet, https://github.com/atalman
2024-03-12 01:16:53 +00:00
3a5f48d55f Port remove_split_ops to PT2 pre-grad passes (#121674)
Summary: For OEMAE, this contributes 14% of the total DPER pass perf gain.

Test Plan:
Run test cases

Run the oemae lower benchmark with and without this fix. FLOP/s 29 -> 34.

Reviewed By: frank-wei

Differential Revision: D54711064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121674
Approved by: https://github.com/frank-wei
2024-03-12 01:15:19 +00:00
5b5d423c2e Benchmark templates (#118880)
Adding support for benchmarking templates in `benchmark_fusion`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118880
Approved by: https://github.com/shunting314
2024-03-11 23:55:13 +00:00
7676433012 [AOTInductor] Reuse generated kernels between constant graph and main graph (#121564)
Summary: We copy the src_to_kernel map from the constant graph to the main graph so that we avoid generating duplicate kernels, and pass through the name counter so that no duplicated names are generated.

Test Plan: Included in commit

Differential Revision: D54706767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121564
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2024-03-11 22:44:38 +00:00
272cf29e4d [FSDP2][BE] Refactored check_1d_sharded_parity to use mesh (#121357)
Eventually, we should just have one unified way to check for parity between a `DTensor`-sharded model and a replicated model. This PR is a small refactor to work toward that. One current gap to use this `check_sharded_parity` function for 2D is that FSDP's `(Shard(0), Shard(0))` layout differs from that of the `DTensor` APIs since FSDP shards on dim-0 after TP shards on dim-0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121357
Approved by: https://github.com/weifengpy
ghstack dependencies: #121360
2024-03-11 22:34:42 +00:00
cd1dc5e484 Delete requirements-flake8.txt (#121657)
The file seems to be unused and also has a different flake8 version compared to .lintrunner.toml, creating confusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121657
Approved by: https://github.com/huydhn, https://github.com/seemethere, https://github.com/malfet
2024-03-11 22:29:25 +00:00
fd0dbcd891 Revert "Batch Norm Consolidation (#116092)"
This reverts commit 7b4f70eda519ccd7f28de17689edd43c52743bc9.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))
2024-03-11 22:22:41 +00:00
498a94a7f5 Don't install torchfix for python<3.9 (#121655)
Fixes https://github.com/pytorch/pytorch/issues/121591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121655
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-03-11 22:18:42 +00:00
b2f09c1859 Revert "[compiled autograd] support custom ops backed by c++ autograd::Function (#120681)"
This reverts commit d27509c384c9847cd2ac1f5d63ec143704b50591.

Reverted https://github.com/pytorch/pytorch/pull/120681 on behalf of https://github.com/xmfan due to breaking internal builds, see D54707287 ([comment](https://github.com/pytorch/pytorch/pull/120681#issuecomment-1989542344))
2024-03-11 22:18:36 +00:00
d1f45a93af Check for releasing GIL at compiletime (#116695)
Introduce `conditional_gil_scoped_release` and use it in `wrap_pybind_function*` to avoid a runtime branch, making the code cleaner and faster.

@albanD This is the GIL change extracted from #112607 as discussed.

Also fixes a potential use of a moved-from object introduced in #116560:
- `f` is captured by value in a lambda that may be called multiple times
- After `std::move(f)` the lambda is not safe to call anymore

CC @cyyever for that change
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116695
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-03-11 22:04:56 +00:00
fd13a56f61 Refactor some testing helpers for FX graph cache testing (#121520)
Summary: I plan to enable the FX graph cache for more inductor unit tests. This PR does some refactoring to prepare by moving the `TestCase` base class to `torch._inductor.test_case` (which mirrors the existing `torch._dynamo.test_case`). In a subsequent diff, I'll modify tests importing `torch._dynamo.test_case.TestCase` to use `torch._inductor.test_case.TestCase` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121520
Approved by: https://github.com/eellison
2024-03-11 21:46:27 +00:00
e01b07e1e8 [ROCm] Autocast RNN Support (#121539)
Fixes #116361

Implements an Autocast wrapper for MIOpen RNNs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121539
Approved by: https://github.com/albanD, https://github.com/jeffdaily
2024-03-11 21:14:43 +00:00
fc712311ce port fuse_parallel_linear (without changing weights) to PT2 pre-grad (#121617)
Summary: Does not change the weights structure, so it is compatible with const folding and realtime weight updates

Test Plan: run added test cases

Reviewed By: frank-wei

Differential Revision: D53843428

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121617
Approved by: https://github.com/frank-wei
2024-03-11 20:51:11 +00:00
3461404869 [pt2 export]fix name collision on constant name (#121145)
Summary: Taking the rightmost part of the fqn causes name conflicts when there are multiple instances of the same class. Changed to replace "." in the fqn with "_" to avoid invalid syntax in input args.

Test Plan: added test case

Differential Revision: D54435230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121145
Approved by: https://github.com/zhxchen17
2024-03-11 20:40:59 +00:00
b091a32909 Add a section on release wiki about pytorchbot cherry-pick command (#121648)
I added a section about the new `pytorchbot cherry-pick` command to the release wiki so that more people know about it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121648
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-03-11 20:09:58 +00:00
dd2062c737 fix CMake FindCUDA module for cross-compiling (#121590)
Fix two cross-compiling issues in `FindCUDA.cmake` (xref: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/224).

1. `setup.py` reads the cached `CUDA_TOOLKIT_ROOT_DIR`, so it must be cached.
41286f1505/setup.py (L593)

I also submitted it to the upstream CMake: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/9323.

2. [SBSA toolkit](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=arm64-sbsa&Compilation=Cross&Distribution=Ubuntu&target_version=20.04&target_type=deb_network_cross) is in `sbsa-linux` directory. See also https://gitlab.kitware.com/cmake/cmake/-/issues/24192

I also submitted it to the upstream CMake: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/9324
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121590
Approved by: https://github.com/malfet
2024-03-11 20:09:52 +00:00
5fd7f5c4e3 Include torch warn in each error in cudnn/Conv_v8.cpp (#120719)
Fixes #120702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120719
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-03-11 20:05:42 +00:00
9aa3fedb75 Slightly faster FX graph iterator (#121611)
Before:
```
iterating over 100000000 FX nodes took 5.9s (16830686 nodes/s)
```

After:
```
iterating over 100000000 FX nodes took 5.0s (19937698 nodes/s)
```
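The numbers above come from a micro-benchmark; a minimal, illustrative way to exercise graph iteration yourself (at a much smaller scale) is:

```python
import torch.fx as fx

def f(x):
    for _ in range(1000):
        x = x + 1
    return x

gm = fx.symbolic_trace(f)  # the Python loop unrolls into ~1000 add nodes
print(sum(1 for _ in gm.graph.nodes))
```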

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121611
Approved by: https://github.com/oulgen
2024-03-11 20:00:19 +00:00
ae22bdaefe Update torchbench commit pin, add sam_fast benchmark (#121420)
After this, the sam_fast benchmark can now be run in the pytorch repo:
```
SEGMENT_ANYTHING_FAST_USE_FLASH_4=0 benchmarks/dynamo/torchbench.py --inference --amp --performance --backend=inductor --explain --only sam_fast
```

sam_fast is designed for inference only, with cuda and amp on. The code adds these restrictions to the benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121420
Approved by: https://github.com/oulgen, https://github.com/msaroufim
2024-03-11 19:48:53 +00:00
dccc1ca839 [torch] Use __prepare_scriptable__ for closures (#121553)
Summary:
This fixes a case left incomplete by https://github.com/pytorch/pytorch/pull/106229
The object is using __prepare_scriptable__ correctly inside of torch.jit.script()
but the clousre that is obtained below is using the non-prepared version.
This causes issues when the prepared and non-prepared versions are in different python modules.

Test Plan:
```
buck2 run mode/opt caffe2/test:jit -- -r test_decorator
```

Differential Revision: D54308741

Re-exporting, as #120806 #121307 were not properly merged.

Co-authored-by: Daniel Herrera <dherrera@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121553
Approved by: https://github.com/huydhn, https://github.com/seemethere
2024-03-11 19:14:19 +00:00
b4160fd9c7 Clean up macOS x86 binaries build jobs (#116726)
This will stop building binaries for MacOS x86 on PyTorch including nightly and all future releases.  If we want this for 2.2, this can be cherry-picked there.

* [x] https://github.com/pytorch/pytorch/pull/116725
* [ ] https://github.com/pytorch/pytorch/pull/116726

Fixes https://github.com/pytorch/pytorch/issues/114602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116726
Approved by: https://github.com/atalman
2024-03-11 19:09:39 +00:00
8d03c59d59 Bring torch_xla pin to the latest torch_xla commit (03/08/2024). (#121529)
Update the torch_xla pin to a more recent one (03/08/2024). We need to make sure the torch_xla pin stays up-to-date so that pytorch can test against an up-to-date version of torch_xla.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121529
Approved by: https://github.com/atalman
2024-03-11 18:25:42 +00:00
39ed038f41 [TEST] Prepare test_cumulative_trapezoid for SciPy 1.12 (#121541)
Follow up on #119326 with addressed comment: https://github.com/pytorch/pytorch/pull/119326#issuecomment-1939428705:
> I'd like to propose a slightly different approach. We could check if scipy is version `1.12.0`. If so, overload `scipy_cumulative_trapezoid` with a function that specifically checks `t.shape[axis] == 0`, and in that case return an array of the same shape as `t`, which is the expected behavior as far as I understand. That way, we're not just skipping the test cases

I would like to add that the version check is not necessary as in any case the outcome is the same.
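A minimal sketch of the overload described in the quoted comment (assuming the test's reference helper is a thin wrapper around SciPy):

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid

def scipy_cumulative_trapezoid(y, *args, axis=-1, **kwargs):
    # For an empty integration axis, return an array of the same shape as y,
    # which is the expected reference behavior regardless of the SciPy version.
    if y.shape[axis] == 0:
        return np.empty_like(y)
    return cumulative_trapezoid(y, *args, axis=axis, **kwargs)

print(scipy_cumulative_trapezoid(np.zeros((3, 0)), axis=1).shape)  # (3, 0)
```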

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121541
Approved by: https://github.com/nWEIdia, https://github.com/albanD
2024-03-11 17:48:29 +00:00
6801595349 Fix round robin sharding (#121022)
Fix round robin sharding when there are no test times and sort_by_time=False

Adds more tests to test_test_selections for sort_by_time=False
Adds more checks to test_split_shards_random for serial/parallel ordering + ordering of tests
Refactoring of dup code

Tested locally by running `python test/run_test.py --shard 3 5` with no test times downloaded and checked that it wasn't an empty list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121022
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-03-11 17:30:12 +00:00
e2ac2dc13a Update NCCL submodule to v2.20.5 (#121635)
Updates the NCCL submodule to 2.20.5. Includes a lot of bugfixes for reduction and connection issues. Should also improve performance. We have been running 2.20.5 internally for a few weeks; the binary pip wheels have finally been published, so we can update main.

Release notes here: https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-20-5.html#rel_2-20-5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121635
Approved by: https://github.com/malfet
2024-03-11 17:23:59 +00:00
89add71168 fix synchronization behavior for copies with type change (#121341)
Fixes #121320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121341
Approved by: https://github.com/albanD
2024-03-11 17:09:45 +00:00
03717430cc Fix lower precision check for MKLDNN on Windows (#121618)
Fixes #120788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121618
Approved by: https://github.com/xuhancn, https://github.com/jgong5, https://github.com/mingfeima, https://github.com/seemethere
2024-03-11 16:09:20 +00:00
e29004615f Add NEON accelerated torch.mv kernel (#119992)
This reduces `torch.mv` time for a 256x768 matrix by a 256-element vector from 209 usec to 16 usec for the non-transposed case and from 104 to 18 usec when transposed
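A rough, illustrative way to time the non-transposed and transposed cases on your own machine (not the benchmark used for the table below; numbers will vary):

```python
import torch
from torch.utils.benchmark import Timer

m = torch.rand(256, 768)
v = torch.rand(768)   # for torch.mv(m, v)
w = torch.rand(256)   # for torch.mv(m.t(), w)
print(Timer("torch.mv(m, v)", globals={"torch": torch, "m": m, "v": v}).blocked_autorange())
print(Timer("torch.mv(m.t(), w)", globals={"torch": torch, "m": m, "w": w}).blocked_autorange())
```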

Also, add an fp16-accumulation flavor to the same ops (controlled by the private `torch._C._set_cpu_allow_fp16_reduced_precision_reduction`, which yields slightly better numbers), summarized in the following table:

| op | original | F32+NEON | F16+NEON|
| ---| -------- | ---------- | ----- |
| torch.mv(m, v) | 209.53 usec | 16.25 usec | 14.68 usec |
| torch.mv(m.t(), v) |  104.80 usec | 28.68 usec | 24.82 usec |

Test plan: CI on MacOS for both CPU and MPS tests fp32<->fp16 matmul consistency (for example, "test_output_grad_match_nn_functional_linear_cpu_float16" passes if fp32 reductions are performed, but fails if fp16 accumulation is used)

To investigate:
 - why replacing `sum0Vec = vaddq_f32(sum0Vec, vmulq_f32(a0Vec, xVec));` with `sum0Vec = vfmaq_f32(sum0Vec, a0Vec, xVec);` slows down gemv from 16.2 to 18.2 usec

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119992
Approved by: https://github.com/mikekgfb
2024-03-11 16:00:01 +00:00
fac06a12c8 CI sanity check test for env vars (#120519)
Make a test that fails on purpose to trigger retries.  Check the opposite of success (that env vars exist)

It's a bit hacky because I want it to fail on the normal flow in order to trigger reruns, but I don't want to expose the failures to users since it's confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
2024-03-11 15:35:45 +00:00
6c11d3ce0c Add support to save safetensors checkpoint directly into onnx (#121001)
Currently, when `torch.onnx.dynamo_export` is called within `torch.onnx.enable_fake_mode`, all the external PyTorch checkpoint files used to initialize the model are automatically detected and used by `torch.onnx.ONNXProgram.save` to recreate the initializers for the newly exported ONNX model.

This API extends the mechanism for HuggingFace models that use safetensors weights. This PR detects safetensors state files and converts them to PyTorch format using mmap on a temporary file, which is deleted after conversion is finished.

Without this PR, the user would have to convert the safetensors files to PyTorch format and feed them to `torch.onnx.ONNXProgram.save` manually.
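A hedged usage sketch of the flow described above; the toy module, the file names, and the `model_state` keyword are illustrative assumptions and may not match the final API exactly:

```python
import torch

class TinyModel(torch.nn.Module):  # stand-in for a HuggingFace model
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return self.linear(x)

with torch.onnx.enable_fake_mode() as fake_context:
    model = TinyModel()
    example_input = torch.randn(1, 16)
    export_options = torch.onnx.ExportOptions(fake_context=fake_context)
    onnx_program = torch.onnx.dynamo_export(model, example_input, export_options=export_options)

# With this change, a safetensors checkpoint can be handed to save() directly.
onnx_program.save("model.onnx", model_state="model.safetensors")
```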
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121001
Approved by: https://github.com/BowenBao, https://github.com/malfet
2024-03-11 15:21:59 +00:00
485f8ebc07 add __repr__ function to FunctionSchema for Python (#121484)
Fixes #118566

Unlike **OpOverload** or **OpOverloadPacket**, the schema contains a lot of complex information, so keeping it as-is is probably a good choice for me, but in theory the **\_\_repr__** function should show the class name as well as some other key information.
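For example (the exact printed form is indicative only):

```python
import torch

schema = torch.ops.aten.add.Tensor._schema  # a torch._C.FunctionSchema
print(repr(schema))
# e.g. aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
```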

If you have a better alternative, please let me know. Thank you.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121484
Approved by: https://github.com/Skylion007
2024-03-11 15:16:50 +00:00
d1510e01fa Upgrade submodule onednn to v3.3.5 (#120767)
This upgrade contains the fixes to the known issues brought by oneDNN v3.3.2, including issues https://github.com/pytorch/pytorch/issues/115346, https://github.com/pytorch/pytorch/issues/120211 and https://github.com/pytorch/pytorch/issues/120406 and those listed in PR #112700.

Issue https://github.com/pytorch/pytorch/issues/115346 (perf regression) was fixed by oneDNN v3.3.4. No new regression was found with v3.3.5. The detailed results of v3.3.4 are given below and compared with v3.1.1 (the oneDNN version in PyTorch before it was updated to v3.3.2).
1. A performance regression with 5.8% perf drop from `pytorch_stargan-train` (see https://github.com/pytorch/benchmark/issues/2076#issuecomment-1847545843)
Validation results with this patch: Latency increased by 0.60%
```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)
oneDNN v3.1.1
metrics-1484287.json
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
    },
    "metrics": {
        "latency": 418.851717
    }
}
oneDNN v3.3.4
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
    },
    "metrics": {
        "latency": 421.381313
    }
}
```

2. Performance regression of FP32 rexnet_100 with Inductor, dynamic shape, multi-threads (see https://github.com/pytorch/pytorch/issues/115346#issue-2030859592)
Validation results with this patch: Latency reduced by 3.23%
```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)
oneDNN v3.1.1
(inductor speedup over eager mode) 2.876x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,2.875904,113.314765,18.455283,0.990437,1302.636134,1315.212902,351,1,0,0

oneDNN v3.3.4
(inductor speedup over eager mode) 3.003x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,3.003012,109.653012,91.547260,0.990048,1302.532506,1315.625370,351,1,0,0
```

3. Performance regression of AMP hf_T5_generate and tinynet_a with Inductor, static shape, multi-threads (see https://github.com/pytorch/pytorch/issues/115346#issuecomment-1856029962)
Validation results with this patch: Latency reduced by 0.85%
```
Tested on an AWS spr metal instance
oneDNN v3.1.1
(inductor speedup over eager mode) 1.120x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.120018,1197.807729,205.905466,0.442803,125.179904,282.698957,10550,48,8,4

oneDNN v3.3.4
(inductor speedup over eager mode) 1.134x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.133594,1187.701514,205.855527,0.422012,128.405094,304.268493,10550,48,8,4
```

The following issues about functionality are fixed by this upgrade. Test cases are also added for these issues.
- https://github.com/pytorch/pytorch/issues/120211
- https://github.com/pytorch/pytorch/issues/120406
- https://github.com/pytorch/pytorch/issues/120547

-----

Below are detailed data of torchbench CPU userbenchmark test and Inductor FP32/AMP inference tests. No regression of perf or functionality was found.
I.  *torchbench CPU userbenchmark test*
Suite | Speedup
-- | --
eager_throughtput_bf16_infer | 1.001848
eager_throughtput_fp32_infer | 1.000257
eager_throughtput_fx_int8 | 1.003069
jit_llga_throughtput_amp_bf16 | 1.000682
jit_llga_throughtput_fp32 | 1.000313
eager_throughtput_bf16_train | 0.998222
eager_throughtput_fp32_train | 1.003384

II. *Inductor FP32/AMP inference tests*
i.  FP32 static default
suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.09
timm_models | tinynet_a | multiple | 128 | 1.14

ii.  FP32 dynamic default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | alexnet | multiple | 128 | 1.08
torchbench | basic_gnn_edgecnn | multiple | 1 | 0.98
torchbench | timm_efficientnet | multiple | 64 | 1.08

iii. AMP static default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | hf_distil_whisper | multiple | 1 | 1.18
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | BartForConditionalGeneration | multiple | 2 | 1.19
timm_models | eca_halonext26ts | multiple | 128 | 1.13
timm_models | nfnet_l0 | multiple | 128 | 1.13
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | spnasnet_100 | multiple | 128 | 1.15
timm_models | tf_efficientnet_b0 | multiple | 128 | 1.22
timm_models | tinynet_a | multiple | 128 | 1.49
torchbench | hf_Bert_large | single | 1 | 1.16
huggingface | XLNetLMHeadModel | single | 1 | 1.07

iv.  AMP dynamic default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | PLBartForConditionalGeneration | multiple | 4 | 1.14
timm_models | nfnet_l0 | multiple | 128 | 1.15
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | tinynet_a | multiple | 128 | 1.34
huggingface | XLNetLMHeadModel | single | 1 | 1.09

-----

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120767
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman
2024-03-11 12:56:59 +00:00
605c0a28aa [dtensor][debug] force visualize_sharding not to print for empty tensors (#121217)
**Summary**
The current `visualize_sharding` code cannot print empty DTensor objects, which leads to an exception. This PR skips the print logic if the DTensor passed in has 0 elements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121217
Approved by: https://github.com/wanchaol
ghstack dependencies: #121385, #121382
2024-03-11 09:22:49 +00:00
3a5ab17bdc [dtensor][debug] visualize_sharding skip if the current rank is not in mesh (#121382)
**Summary**
We should skip the `visualize_sharding()` function on those ranks that are not a part of the DTensor's mesh. Otherwise, an exception will be thrown in the current visualize logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121382
Approved by: https://github.com/wanchaol
ghstack dependencies: #121385
2024-03-11 09:22:49 +00:00
b383123e37 [dtensor][debug] visualize_sharding only compute offset on the first rank in mesh (#121385)
**Summary**
avoid computing on ranks where we do not plan to visualize the DTensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121385
Approved by: https://github.com/wanchaol
2024-03-11 09:22:31 +00:00
9c50ecc84b Fix get_rank under a non-default group. (#120481)
Fixes #120213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120481
Approved by: https://github.com/yifuwang
2024-03-11 05:40:54 +00:00
7cc476ea16 [dynamo] Fix support for nn.Parameter constructor (part 1) (#120163)
This captures calls to `torch.nn.Parameter` by lifting them to graph inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120163
Approved by: https://github.com/albanD, https://github.com/yanboliang
ghstack dependencies: #121086
2024-03-11 05:14:42 +00:00
32488b0664 [dynamo] Support _unsafe_set_version_counter (#121086)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121086
Approved by: https://github.com/yanboliang
2024-03-11 05:14:42 +00:00
7a4e451184 [Dynamo] Fix function overrides (#120885)
To check for the existence of `__torch_function__`, the code intended to iterate over each element but got a `TupleVariable` when the ordinary `has_torch_function()` was being used. This case needs a further unpack.

Fixes #120653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120885
Approved by: https://github.com/yanboliang
2024-03-11 02:18:43 +00:00
f11f2b0d55 split predispatch pass into multiple passes (#121592)
Summary:
It's very difficult to debug the passes' ineffectiveness with them mingled in one single pass container. Here we extract them into separate passes with diagnostics info.

This is also required for a later change, where we must run shape prop on each of these passes, in order for the subsequent passes to have the correct shape information.

Reviewed By: frank-wei

Differential Revision: D53579545

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121592
Approved by: https://github.com/frank-wei
2024-03-11 00:30:55 +00:00
13e8181b7b relax assertion on fake shape (#121599)
Summary: It seems like if you use `capture_pre_autograd_graph`, fake tensor shapes can be ints instead of symints.

Test Plan: fixes the AssertionError in N5057219

Differential Revision: D54729142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121599
Approved by: https://github.com/angelayi, https://github.com/BoyuanFeng
2024-03-10 22:51:10 +00:00
660ec3d38d [Export] Fix bug removing node from wrong graph (#121574)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121574
Approved by: https://github.com/ydwu4
2024-03-10 04:46:11 +00:00
41286f1505 [IntraNodeComm] fix a hybridCubeMeshAllReduceKernel breakage caused by a recent refactor (#121575)
`hybridCubeMeshAllReduceKernel` uses the latter half of the p2p buffers as relay buffers. The relay buffer address is calculated using a bf16 base pointer and the buffer size in bytes. The breakage was caused by not taking the element size into account.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121575
Approved by: https://github.com/Chillee
2024-03-10 00:55:25 +00:00
60cd2a43ca [DeviceMesh] Add support for nD slicing (#119752)
Fixes one of the issue mentioned in #118639
@mvpatel2000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119752
Approved by: https://github.com/wanchaol
2024-03-10 00:16:37 +00:00
e90cddb0d3 [inductor] Log triton kernel source and metadata on failure (#120494)
If Triton compilation fails it's much easier to debug when given the
kernel source directly, versus a PyTorch repro.

This would have helped root cause
https://github.com/pytorch/pytorch/issues/118589 almost immediately

Differential Revision: [D54119568](https://our.internmc.facebook.com/intern/diff/D54119568/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120494
Approved by: https://github.com/peterbell10, https://github.com/eellison, https://github.com/jansel
2024-03-09 20:12:27 +00:00
168a04e752 [inductor] Changes to support newer triton pin (#121267)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121267
Approved by: https://github.com/lezcano
ghstack dependencies: #121438
2024-03-09 18:17:36 +00:00
459c5bca58 [inductor] Refactor common triton imports into one function (#121438)
This means when codegen depends on a particular import we only need to
add it in one place and it's applied to all triton kernels.

This also changes codegen slightly so instead of generating
`@pointwise` we now generate `@triton_heuristics.pointwise` just so
the imports are the same for all kernel types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121438
Approved by: https://github.com/lezcano
2024-03-09 18:17:36 +00:00
8c96b4367a Remove opmath cast for im2col decomp (#121363)
It is unclear why an opmath cast is needed for the im2col decomp, given that the decomposition mainly performs padding, slicing, indexing, and shape manipulation. There is no need to perform these operations in a higher precision; doing so requires more memory and yields lower performance.

Sample script to demonstrate the inserted cast before this change:

```python
import torch
from torch._decomp.decompositions import im2col

def func(x):
    return torch.nn.functional.unfold(
        x, kernel_size=[3, 1], padding=[2, 0], dilation=1, stride=1
    )

x = torch.rand(1, 1, 5, 5, dtype=torch.float16)

eo = torch._dynamo.export(
    func, aten_graph=True, decomposition_table={torch.ops.aten.im2col.default: im2col}
)(x)
eo.graph_module.print_readable()
```

```
class GraphModule(torch.nn.Module):
    def forward(self, x):
        arg0: "f16[1, 1, s0, s0]";

        arg0, = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
        arg0_1 = arg0

        _to_copy: "f32[1, 1, s0, s0]" = torch.ops.aten._to_copy.default(arg0_1, dtype = torch.float32)
        ...
        constant_pad_nd: "f32[1, 1, s0 + 4, s0]" = torch.ops.aten.constant_pad_nd.default(_to_copy, [0, 0, 2, 2], 0.0);  _to_copy = None
        ...
        slice_1: "f32[1, 1, s0 + 4, s0]" = torch.ops.aten.slice.Tensor(constant_pad_nd, 0, 0, 9223372036854775807);  constant_pad_nd = None
        slice_2: "f32[1, 1, s0 + 4, s0]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 9223372036854775807);  slice_1 = None
        index: "f32[1, 1, 3, s0 + 2, 1, s0]" = torch.ops.aten.index.Tensor(slice_2, [None, None, unsqueeze_5, add_3]);  slice_2 = unsqueeze_5 = add_3 = None
        permute: "f32[1, 1, 3, 1, s0 + 2, s0]" = torch.ops.aten.permute.default(index, [0, 1, 2, 4, 3, 5]);  index = None
        ...
        view: "f32[1, 3, s0**2 + 2*s0]" = torch.ops.aten.view.default(permute, [1, 3, mul]);  permute = mul = None
        _to_copy_1: "f16[1, 3, s0**2 + 2*s0]" = torch.ops.aten._to_copy.default(view, dtype = torch.float16);  view = None
        return pytree.tree_unflatten([_to_copy_1], self._out_spec)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121363
Approved by: https://github.com/lezcano
2024-03-09 15:37:27 +00:00
71d0202627 [dynamo] support rewriting dist.all_reduce with explicitly specified reduce op (#120181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120181
Approved by: https://github.com/wconstab, https://github.com/awgu
2024-03-09 08:28:22 +00:00
cf9742371c Revert "Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685)"
This reverts commit 752d164b2f0d401042de4a75f36f7e84bae91daa.

Reverted https://github.com/pytorch/pytorch/pull/119685 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is crashing on ROCm 752d164b2f ([comment](https://github.com/pytorch/pytorch/pull/119685#issuecomment-1986773384))
2024-03-09 07:20:53 +00:00
761783a4ff [profiler] Fix recorded profiler step number (#121127)
Fixes [121126](https://github.com/pytorch/pytorch/issues/121126)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121127
Approved by: https://github.com/briancoutinho
2024-03-09 06:54:51 +00:00
242e03ba86 [dtensor] add async_op option to redistribute and some refactor (#121477)
The async output option was only available in the `full_tensor()` call, but it's
generally good to make this option available in the `redistribute` call directly
so that the user can control it.

This PR adds an async_op option to the redistribute call, allowing the user to control
whether to perform tensor redistribution asynchronously or not.

By default we set this to False; this follows the semantics of the c10d
collectives.
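A hedged usage sketch of the new option (run under torchrun with one GPU per rank; the DTensor APIs were still under the private torch.distributed._tensor namespace around the time of this change):

```python
import os
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Replicate, Shard

world_size = int(os.environ["WORLD_SIZE"])
mesh = init_device_mesh("cuda", (world_size,))
dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])
# async_op=True returns without waiting on the underlying collective.
out = dt.redistribute(mesh, [Replicate()], async_op=True)
```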

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121477
Approved by: https://github.com/wz337
2024-03-09 06:17:23 +00:00
a6a67da333 [quant] Add error check for input_edge annotation (#121536)
Summary:
Raises an error when an input edge contains non-Node elements, like constant values, in the annotation.

Test Plan:
python test/test_quantization.py -k test_input_edge_sanity_check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121536
Approved by: https://github.com/andrewor14
2024-03-09 06:13:04 +00:00
e8836759d0 [export] Add effect token to export (#121424)
Following the creation of effect tokens (https://github.com/pytorch/pytorch/pull/120296), we want to now add support for these tokens in export because the calling/returning convention has changed. The inputs are now `(tokens, params, buffers, constants, user_inputs)` and the outputs are `(tokens, buffer_mutations, user_mutations, user_outputs)`. The graph looks something like:
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %attr : [num_users=2] = placeholder[target=attr]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %with_effects : [num_users=2] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%arg0_1, _TorchScriptTesting.takes_foo.default, %attr, %arg1_1), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 0), kwargs = {})
    %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 1), kwargs = {})
    %with_effects_1 : [num_users=2] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%getitem, _TorchScriptTesting.takes_foo.default, %attr, %getitem_1), kwargs = {})
    %getitem_2 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects_1, 0), kwargs = {})
    %getitem_3 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects_1, 1), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %getitem_3), kwargs = {})
    return (getitem_2, add)
```

During unlifting, we will first remove the tokens and with_effect calls using the `remove_effect_tokens` pass. (cc @SherlockNoMad on the pass to remove tokens). This is so that this won't change the calling conventions when retracing. The graph after unlifting looks something like:
```
graph():
    %attr_1 : [num_users=2] = get_attr[target=attr]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %takes_foo_default_1 : [num_users=1] = call_function[target=torch.ops._TorchScriptTesting.takes_foo.default](args = (%attr_1, %arg1_1), kwargs = {})
    %takes_foo_default : [num_users=1] = call_function[target=torch.ops._TorchScriptTesting.takes_foo.default](args = (%attr_1, %takes_foo_default_1), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %takes_foo_default), kwargs = {})
    return (add,)
```

Serialization support will be added in a followup.
Note: tokens only affect custom ops that take in ScriptObjects, not ScriptObject methods yet.

Differential Revision: [D54639390](https://our.internmc.facebook.com/intern/diff/D54639390)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121424
Approved by: https://github.com/tugsbayasgalan
2024-03-09 02:43:26 +00:00
eb3919944d [C10d][NCCL] Refactor complex all_reduce and broadcast (#121045)
The necessity of this PR lies in the fact that autograd engine + DDP calls `all_reduce` from C++, so the changes must be made in C++.

```
[rank0]: Traceback (most recent call last):
[rank0]:   File "~/complex_ddp.py", line 72, in <module>
[rank0]:     main()
[rank0]:   File "~/complex_ddp.py", line 64, in main
[rank0]:     loss.backward()
[rank0]:   File "/home/usr/pytorch/torch/_tensor.py", line 525, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/home/usr/pytorch/torch/autograd/__init__.py", line 267, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/home/usr/pytorch/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: TypeError: Input tensor data type is not supported for NCCL process group: ComplexFloat
```

I believe that, to minimize the Python overhead, the same could be done for the rest of the ops; what do you think @kwen2501?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121045
Approved by: https://github.com/eqy, https://github.com/kwen2501
2024-03-09 02:00:54 +00:00
752d164b2f Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119685
Approved by: https://github.com/cpuhrsch
2024-03-09 02:00:50 +00:00
13a25c647f [export] improve binary op fast path broadcast check (#121546)
# Context
I believe we have an incorrect guard being created during FakeTensor's binary op fast path.

Consider this case
```
# op.shape: (10, 192); final_shape: (s0, 10, 192)
# Guard Ne(s0, 10) is created when we create SymBool(10 == s0)
if isinstance(op, torch.Tensor) and op.shape == final_shape:
    break
```

As of right now, `op.shape == final_shape` checks whether one of the binary op's operands has the same shape as the binary op's output.
* If one of them is a dynamic shape, then we'll create a guard via`SymBool` creation (i.e. `s0 == 10`).
* If the `SymBool` expr resolves to `false`, then we'll create the guard `Ne(s0, 10)`.

This is a problem when the number of dimensions isn't the same between `op.shape` & `final_shape`. Take the case above for example: `op.shape: (10, 192); final_shape: (s0, 10, 192)`. Although the shapes aren't the same, that doesn't necessarily mean that `s0 != 10`.

Some thoughts (feel free to ignore): what if the number of dimensions is equal but one of the shapes has symbols? Here are three cases:
  1. `op.shape: (9000, 10, 192); final_shape: (s0, 10, 192)` -- not broadcastable.
  2. `op.shape: (1, 10, 192); final_shape: (s0, 10, 192)` -- 0/1 specialization wins?
  3. `op.shape: (100, 10, 192); final_shape: (s0, 10, 192) where s0 = 100` -- Ask user to mark `s0` as a constant.

# Test
```
$ TORCHDYNAMO_VERBOSE=1 PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_dynamic_shapes.py -k test_export_fast_binary_broadcast_check_dynamic_shapes

torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (dim0)! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of dim0 = L['a'].size()[0] in the specified range 3 <= dim0 <= 1024 satisfy the generated guard Ne(L['a'].size()[0], 3).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121546
Approved by: https://github.com/aakhundov
2024-03-09 01:49:42 +00:00
d482614fec [DCP] Makes fsspec public (#121508)
Fixes #118033

Also removes the `_checkpointer.py` class
Original PRs:
- https://github.com/pytorch/pytorch/pull/121330
- https://github.com/pytorch/pytorch/pull/121329

We're also disabling `test_fsdp` since it is failing on random PRs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121508
Approved by: https://github.com/fegin
2024-03-09 01:14:18 +00:00
6791b0c09e Change default torch_function behavior to be disabled when torch_dispatch is defined (take 2) (#120632)
This does not introduce a new test but is tested by checking that all the classes we already have still behave as before now that they don't explicitly disable torch_function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120632
Approved by: https://github.com/ezyang
2024-03-09 01:08:37 +00:00
ca9678405a [CUDA graphs] Pool argument for make_graphed_callables (#121475)
It is just a nice feature to have for situations where users want multiple graph captures and/or graphed callables to share the same memory pool.
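A hedged sketch of sharing one pool between a graphed callable and a manual capture (requires CUDA; the keyword name follows the description above):

```python
import torch

pool = torch.cuda.graph_pool_handle()

lin = torch.nn.Linear(64, 64, device="cuda")
x = torch.randn(8, 64, device="cuda")
graphed_lin = torch.cuda.make_graphed_callables(lin, (x,), pool=pool)

g = torch.cuda.CUDAGraph()
y = torch.randn(8, 64, device="cuda")
with torch.cuda.graph(g, pool=pool):  # same memory pool as the graphed callable
    out = y * 2
```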

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121475
Approved by: https://github.com/eellison, https://github.com/eqy
2024-03-09 00:15:38 +00:00
b2f19dd284 [C10d][UCC] Retain CUDA context in progress_loop (#121446)
UCC requires a CUDA context to be present, while `progress_loop` (f61192b014/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp (L333)) runs on a side thread that does not have the context present (even though it sets the device).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121446
Approved by: https://github.com/kwen2501
2024-03-09 00:09:47 +00:00
ed8eebd1c2 Changed cublas reproducibility URL (#121534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121534
Approved by: https://github.com/Skylion007
2024-03-08 23:46:21 +00:00
b0a0850a5c [DCP] Replaced storage() with untyped_storage() (#121538)
Let us try to remove this warning 😄 :
```
[rank0]:/data/users/andgu/pytorch/torch/distributed/checkpoint/filesystem.py:150: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
[rank0]:  if tensor.storage().size() != tensor.numel():
```
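A minimal sketch of the replacement pattern; note that untyped storage sizes are reported in bytes, so the element size has to enter the comparison:

```python
import torch

tensor = torch.randn(8)
# Old (warns): tensor.storage().size() != tensor.numel()
if tensor.untyped_storage().size() != tensor.numel() * tensor.element_size():
    print("tensor does not own a dense storage of exactly numel elements")
```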

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121538
Approved by: https://github.com/wz337, https://github.com/fegin
2024-03-08 23:46:17 +00:00
8887c95004 [inductor] Skip welford combine on first reduction loop iteration (#121488)
On the first iteration we short circuit `welford_reduce` since we know
the accumulators are filled with the default values.

This is split out from #120330 to hopefully avoid the meta-internal failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121488
Approved by: https://github.com/lezcano
2024-03-08 23:40:48 +00:00
fe78cf040b [profiler] add a function to allow adding preset user-defined metadata to traces (#121487)
Summary:
The `add_metadata_json` function in the profiler only works when called during trace collection. However, sometimes we want to pass in some user-defined metadata and amend it to the trace before trace collection starts, e.g. when the profiler is defined.
This PR adds a function `preset_metadata_json` for this purpose. The preset metadata will be stored and amended to the trace later.
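A hedged usage sketch; whether the new function lives on the profiler object and its exact argument names follow the description above and are assumptions here:

```python
import torch
from torch.profiler import profile

prof = profile()
# Assumed shape of the API: metadata is registered before collection starts,
# then attached to the trace when it is produced.
prof.preset_metadata_json("run_config", '{"model": "demo", "batch_size": 8}')
with prof:
    torch.randn(128, 128) @ torch.randn(128, 128)
prof.export_chrome_trace("trace.json")
```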

Differential Revision: D54678790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121487
Approved by: https://github.com/aaronenyeshi
2024-03-08 23:18:48 +00:00
9eb8fae02d Revert "Fix round robin sharding (#121022)"
This reverts commit effdea5fc62c6bf13cb8035f7bfcc205f05a8b6a.

Reverted https://github.com/pytorch/pytorch/pull/121022 on behalf of https://github.com/clee2000 due to made sharding really uneven ([comment](https://github.com/pytorch/pytorch/pull/121022#issuecomment-1986552662))
2024-03-08 23:16:24 +00:00
bc02fca358 [dtensor] to_local backward grad placement passthrough (#121474)
to_local accepts a `grad_placements` argument if the user chooses to pass one; previously
we enforced the grad_out to have the "same" placement as the current
DTensor for safety.

But I realized that we DO NOT need to enforce this constraint. Why?
The backward placement does not need to be the same as the forward tensor placement; this
is already the case for param vs. param.grad (i.e. param can be replicate
and grad can be partial), so we should not restrict this for activation
vs. activation grad either.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121474
Approved by: https://github.com/awgu, https://github.com/yoyoyocmu, https://github.com/yifuwang
2024-03-08 23:11:49 +00:00
9373ad0bb8 Switch cudagraph backend to cudagraph trees (#121019)
Switch torch.compile(..., backend="cudagraphs") to use cudagraph trees. Enabled a few tests in cudagraph_trees; note that there is another existing test suite for the cudagraphs backend: https://github.com/pytorch/pytorch/blob/main/test/dynamo/test_cudagraphs.py.

This is basically the inductor cudagraphs without inductor.
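A minimal usage sketch (requires a CUDA device; the backend name comes straight from the description above):

```python
import torch

def f(x):
    return torch.sin(x) + torch.cos(x)

compiled = torch.compile(f, backend="cudagraphs")
x = torch.randn(1024, device="cuda")
print(compiled(x).shape)
```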

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121019
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #121017, #121018
2024-03-08 22:56:26 +00:00
7b3febdca7 Change assertion throw to error message for const_run_impl call. (#121396)
Summary:
Currently we do not have an easy mechanism to distinguish between models created with some specific config.
We use a warning instead of failing directly.

Test Plan: Messaging change only.

Reviewed By: aakhundov

Differential Revision: D54622522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121396
Approved by: https://github.com/chenyang78
2024-03-08 22:48:43 +00:00
038b2e8780 [c10d] Add complex support for P2P (#121240)
Fixes the following error when `tensor` is a complex tensor:
```
[rank0]:     return pg.send([tensor], dst, tag)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Unconvertible NCCL type ComplexFloat
```
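A hedged two-rank sketch of the now-supported pattern (run under torchrun with NCCL and one GPU per rank):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
t = torch.randn(4, dtype=torch.complex64, device="cuda")
if rank == 0:
    dist.send(t, dst=1)
else:
    dist.recv(t, src=0)
dist.destroy_process_group()
```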

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121240
Approved by: https://github.com/shuqiangzhang
2024-03-08 22:47:49 +00:00
4af0e634bf Add Cudagraphs disable checking (#121018)
Adds the same cudagraphs disable checking from inductor - cudagraph trees to cudagraphs backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121018
Approved by: https://github.com/ezyang
ghstack dependencies: #121017
2024-03-08 22:47:24 +00:00
7d0ad5c6f0 [FSDP2] Zeroed padded tensor in _apply (#121509)
This PR replaces the `Tensor.resize_` call with explicit zeroing of the padded tensor. Uninitialized padding is not good since it can give false-positive NaNs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121509
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
2024-03-08 22:31:19 +00:00
f2d5e96db4 [export] Add docs for 2.3 release (#121466)
- Added docs about non-strict export
- Added example using derived dims
- Added api docs for ep.run_decompositions() (https://github.com/pytorch/pytorch/issues/119480)
- Tried to include/cover everything in https://docs.google.com/document/d/1kZ_BbB3JnoLbUZleDT6635dHs88ZVYId8jT-yTFgf3A/edit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121466
Approved by: https://github.com/zhxchen17
2024-03-08 22:29:48 +00:00
2c2d6ce515 Revert "CI sanity check test for env vars (#120519)"
This reverts commit f43b9c56c598b3a0f4d8e1d85f1e67b8f273d235.

Reverted https://github.com/pytorch/pytorch/pull/120519 on behalf of https://github.com/clee2000 due to broken on slow d27509c384 https://github.com/pytorch/pytorch/actions/runs/8208843198/job/22453617568 ([comment](https://github.com/pytorch/pytorch/pull/120519#issuecomment-1986480624))
2024-03-08 22:01:35 +00:00
35d3adb4b0 Add ATen Op _chunk_cat and _chunk_cat.out (#121081)
# Motivation

In the backward pass of per-parameter-sharding FSDP, each rank performs reduce-scatter to sync gradients across ranks. A rank chunks each gradient tensor into `world_size` slices along the 0th dimension and concatenates all slices along the 1st dimension. Gradient tensors are padded before concatenation when tensor.size(0) % world_size != 0.

### Example 1
Consider `world_size=3` and tensors A (2x4), B (3x3), C (1x2):

Input tensors:
```
AAAA   BBB   CC
AAAA   BBB
       BBB
```

Reduce-scatter-copy-in Output:
```
AAAABBBCC
AAAABBB00
0000BBB00
```

### Example 2
Consider `world_size=2` and tensors A (2x4), B (3x3), C(1x2), D(4x2):

Input tensors:
```
AAAA   BBB   CC   DD
AAAA   BBB        DD
       BBB        DD
                  DD
```

Reduce-scatter-copy-in first pad:
```
AAAA   BBB   CC   DD
AAAA   BBB   00   DD
       BBB        DD
       000        DD
```

Then chunk and cat along dim as the output:
```
AAAABBBBBBCCDDDD
AAAABBB00000DDDD
```

The performance of reduce-scatter-copy-in is critical to per-parameter-sharding FSDP. However, implementing reduce-scatter-copy-in by composing existing ATen ops involves `cat` and irregular `pad`, leading to redundant data copies and unsatisfactory performance.

# PR
We provide aten native support for reduce-scatter-copy-in, namely `_chunk_cat()`:

```
_chunk_cat(Tensor[] tensors, int dim, int num_chunks) -> Tensor
```

This PR includes the registration of `_chunk_cat` and `_chunk_cat.out`, OpInfo tests, and a basic implementation composing existing ATen ops.
In the next PR, we will add the CUDA implementation. Compared with baselines composing existing ATen ops, the `_chunk_cat()` CUDA implementation improves copy bandwidth from 498 GB/s to 966 GB/s on a production benchmark.
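A hedged sketch reproducing Example 1 above with the new (private, underscore-prefixed) op; the expected output layout follows the examples in this description:

```python
import torch

A = torch.ones(2, 4)          # the "AAAA" rows
B = torch.full((3, 3), 2.0)   # the "BBB" rows
C = torch.full((1, 2), 3.0)   # the "CC" row
out = torch._chunk_cat([A, B, C], dim=0, num_chunks=3)
print(out.shape)  # expected torch.Size([3, 9]) per Example 1
```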

## Requirements on input

1. If input tensors have different ndims, dim should be non-negative and less than the ndims of every input tensor. If all input tensors have the same ndims, we support both negative and non-negative dim.
2. For wrapped_dim, all tensors should have the same size for 0,...,wrapped_dim-1 dimensions. No requirements for (wrapped_dim, ...)-th dimension.
3. Expect positive num_chunks
4. Expect non-empty input tensor list and each input tensor should have at least 1 element

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121081
Approved by: https://github.com/albanD
2024-03-08 21:48:12 +00:00
a656e12bf5 Disable test_torch_name_rule_map_updated in code (#120627)
I am getting tired of this test  ;-;

It gets disabled because it's broken, and then gets fixed, but something breaks it while it's disabled, so it's still broken and the infra is not handling it well.

Disable issue is https://github.com/pytorch/pytorch/issues/114831
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120627
Approved by: https://github.com/yanboliang
2024-03-08 21:00:51 +00:00
82bb06334d Update python binding for in-place foreach to return List[Tensor] (#121405)
fixes #104817
taking over #118622

```c++
// _foreach_atan_
static PyObject * THPVariable__foreach_atan_(PyObject* self_, PyObject* args, PyObject* kwargs)
{
  HANDLE_TH_ERRORS
  static PythonArgParser parser({
    "_foreach_atan_(TensorList self)",
  }, /*traceable=*/false);

  ParsedArgs<1> parsed_args;
  auto _r = parser.parse(nullptr, args, kwargs, parsed_args);
  if(_r.has_torch_function()) {
    return handle_torch_function(_r, nullptr, args, kwargs, THPVariableFunctionsModule, "torch");
  }
  // aten::_foreach_atan_(Tensor(a!)[] self) -> ()

  // auto dispatch__foreach_atan_ = [](at::TensorList self) -> at::TensorList {
  auto dispatch__foreach_atan_ = [](at::TensorList self) -> void {
    pybind11::gil_scoped_release no_gil;
    at::_foreach_atan_(self);
  };
  dispatch__foreach_atan_(_r.tensorlist(0));
  PyObject* self_tensorlist = _r.args[0];
  Py_INCREF(self_tensorlist);
  return self_tensorlist;
  Py_RETURN_NONE;
  END_HANDLE_TH_ERRORS
}
...
// _foreach_div_
static PyObject * THPVariable__foreach_div_(PyObject* self_, PyObject* args, PyObject* kwargs)
{
  HANDLE_TH_ERRORS
  static PythonArgParser parser({
    "_foreach_div_(TensorList self, ScalarList scalars)",
    "_foreach_div_(TensorList self, Tensor other)",
    "_foreach_div_(TensorList self, TensorList other)",
    "_foreach_div_(TensorList self, Scalar scalar)",
  }, /*traceable=*/false);

  ParsedArgs<2> parsed_args;
  auto _r = parser.parse(nullptr, args, kwargs, parsed_args);
  if(_r.has_torch_function()) {
    return handle_torch_function(_r, nullptr, args, kwargs, THPVariableFunctionsModule, "torch");
  }
  switch (_r.idx) {
    case 0: {
      // aten::_foreach_div_.ScalarList(Tensor(a!)[] self, Scalar[] scalars) -> ()

      // auto dispatch__foreach_div_ = [](at::TensorList self, at::ArrayRef<at::Scalar> scalars) -> at::TensorList {
      auto dispatch__foreach_div_ = [](at::TensorList self, at::ArrayRef<at::Scalar> scalars) -> void {
        pybind11::gil_scoped_release no_gil;
        at::_foreach_div_(self, scalars);
      };
      dispatch__foreach_div_(_r.tensorlist(0), _r.scalarlist(1));
      PyObject* self_tensorlist = _r.args[0];
      Py_INCREF(self_tensorlist);
      return self_tensorlist;
    }
    case 1: {
      // aten::_foreach_div_.Tensor(Tensor(a!)[] self, Tensor other) -> ()

      // auto dispatch__foreach_div_ = [](at::TensorList self, const at::Tensor & other) -> at::TensorList {
      auto dispatch__foreach_div_ = [](at::TensorList self, const at::Tensor & other) -> void {
        pybind11::gil_scoped_release no_gil;
        at::_foreach_div_(self, other);
      };
      dispatch__foreach_div_(_r.tensorlist(0), _r.tensor(1));
      PyObject* self_tensorlist = _r.args[0];
      Py_INCREF(self_tensorlist);
      return self_tensorlist;
    }
    case 2: {
      // aten::_foreach_div_.List(Tensor(a!)[] self, Tensor[] other) -> ()

      // auto dispatch__foreach_div_ = [](at::TensorList self, at::TensorList other) -> at::TensorList {
      auto dispatch__foreach_div_ = [](at::TensorList self, at::TensorList other) -> void {
        pybind11::gil_scoped_release no_gil;
        at::_foreach_div_(self, other);
      };
      dispatch__foreach_div_(_r.tensorlist(0), _r.tensorlist(1));
      PyObject* self_tensorlist = _r.args[0];
      Py_INCREF(self_tensorlist);
      return self_tensorlist;
    }
    case 3: {
      // aten::_foreach_div_.Scalar(Tensor(a!)[] self, Scalar scalar) -> ()

      // auto dispatch__foreach_div_ = [](at::TensorList self, const at::Scalar & scalar) -> at::TensorList {
      auto dispatch__foreach_div_ = [](at::TensorList self, const at::Scalar & scalar) -> void {
        pybind11::gil_scoped_release no_gil;
        at::_foreach_div_(self, scalar);
      };
      dispatch__foreach_div_(_r.tensorlist(0), _r.scalar(1));
      PyObject* self_tensorlist = _r.args[0];
      Py_INCREF(self_tensorlist);
      return self_tensorlist;
    }
  }
  Py_RETURN_NONE;
  END_HANDLE_TH_ERRORS
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121405
Approved by: https://github.com/soulitzer
2024-03-08 21:00:01 +00:00
d27509c384 [compiled autograd] support custom ops backed by c++ autograd::Function (#120681)
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports

limitations:
- requires users to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable, and possibly add compiled_args + apply_with_saved implementations. This was the only way I could think of to keep things sound
- will throw if we can't hash the saved_data, i.e. for any unimplemented type other than list and dict in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph is then called that contains a different custom autograd::Function with an identical implementation. This case seems extremely unlikely, and the only alternative to hash collisions I can think of is compiling with reflection
- tensors not saved via save_variables are not lifted, and are specialized on the TensorImpl*'s hash (treated as a memory address). If needed, we can lift them.
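As an illustrative sketch of the Python side, driving the backward pass through compiled autograd looks roughly like the code below. The `torch._dynamo.compiled_autograd.enable` entry point and the choice of `torch.compile` as the compiler function are assumptions for illustration; the C++ opt-in (is_traceable, compiled_args, apply_with_saved) described above happens separately in the custom autograd::Function.

```python
import torch
import torch._dynamo.compiled_autograd as compiled_autograd

model = torch.nn.Linear(8, 8)
loss = model(torch.randn(4, 8)).sum()

# Capture the backward pass with compiled autograd so the autograd graph
# (including any opted-in C++ autograd::Functions) can be traced and compiled.
with compiled_autograd.enable(torch.compile):
    loss.backward()
```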

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
2024-03-08 20:43:29 +00:00
f43b9c56c5 CI sanity check test for env vars (#120519)
Make a test that fails on purpose to trigger retries.  Check the opposite of success (that env vars exist)

It's a bit hacky because I want it to fail in the normal flow in order to trigger reruns, but I don't want to expose the failures to users since it's confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
2024-03-08 20:28:50 +00:00
75bb049d38 Skip AOT Inductor test_cond_* tests on ROCm (#121522)
Summary: The newly added tests in https://github.com/pytorch/pytorch/pull/121120 are failing in the `ciflow/periodic` jobs. Here we skip those on ROCm to avoid the need to disable those tests manually on ROCm.

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_cond_nested
...
----------------------------------------------------------------------
Ran 6 tests in 72.122s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121522
Approved by: https://github.com/huydhn, https://github.com/malfet
ghstack dependencies: #121120
2024-03-08 20:13:55 +00:00
53d5276d69 Improve Dynamo support for torch function and class methods in general (#121365)
I was originally trying to solve https://github.com/pytorch/pytorch/issues/120799 but got sidetracked along the way.
This PR contains a couple fixes. Let me know if you want me to split them up!

- Properly handle invalid user code when "super()" is called from non-method/classmethod. It will now properly raise the same error as CPython
- Fix base VariableTracker `__str__` method shadowing all `__repr__` methods defined in subclasses
- Fix accessing a classmethod on a user object to bind "cls" and not "self"
- Fix custom class handling of super() call to properly handle mixed regular/class/static methods
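As a hedged illustration of the classmethod fix (not taken from the PR's tests), the pattern it targets looks like this; the exact behavior before the fix is an assumption:

```python
import torch

class Helper:
    @classmethod
    def scale(cls, x):
        # With the fix, "cls" is bound to Helper even when the classmethod
        # is looked up through an instance inside compiled code.
        return x * 2

def fn(x):
    return Helper().scale(x)

out = torch.compile(fn)(torch.randn(3))
```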

Locally, test_repros.py -k test_batch_norm_act still fails; the generated graph module is:
```
Call using an FX-traced Module, line 8 of the traced Module's generated forward function:
    x = self.forward(l_x_);  self = l_x_ = None
    x_1 = self.L__self___act(x);  x = None
```
note that "self" is being unset on the first line even though it is used on the second one.
For reference, this is the test c268ce4a6d/test/dynamo/test_repros.py (L1368-L1369)
I cannot figure out where the generated forward function comes from though, any hint would be welcome!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121365
Approved by: https://github.com/jansel
2024-03-08 20:03:49 +00:00
c0996866f4 Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076)"
This reverts commit 4305c64fea154ee1ab566e19bd7568753fc30916.

Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/izaitsevfb due to breaking internal builds(take 3) ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-1986338164))
2024-03-08 20:01:03 +00:00
c78f72d7e7 [c10d] Deprecate torch.distributed.pipeline (#121464)
In favor of PiPPy (Pipeline Parallelism for PyTorch) https://github.com/pytorch/PiPPy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121464
Approved by: https://github.com/wz337, https://github.com/awgu
2024-03-08 19:55:02 +00:00
27a0900946 Revert "[fx] Preserve Fx graph node order in partitioner across runs (#115621)"
This reverts commit 25c74a93cdf67545a4e3e1bedf2dbabbddfc5845.

Reverted https://github.com/pytorch/pytorch/pull/115621 on behalf of https://github.com/izaitsevfb due to depends on #120076, which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/115621#issuecomment-1986324796))
2024-03-08 19:50:57 +00:00
937e89f252 cudagraphs backend refactoring (#121017)
This is just some refactoring.. no functional changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121017
Approved by: https://github.com/ezyang
2024-03-08 19:47:41 +00:00
bc117898f1 Revert "Update XLA pin (#121501)"
This reverts commit 9d83f9dc0e4535f6535389201bc3c4a37f3305e3.

Reverted https://github.com/pytorch/pytorch/pull/121501 on behalf of https://github.com/malfet due to We are trying to revert underlying change first ([comment](https://github.com/pytorch/pytorch/pull/121501#issuecomment-1986289409))
2024-03-08 19:34:44 +00:00
22cd2658b4 Disable GroupRegistry's thread isolation by default (#121457)
Today `GroupRegistry` employs thread isolation by default, i.e. every thread sees its own process group registry. This is intended to work for one-device-per-process (for python use cases) and one-device-per-thread case (for custom native runtimes).

However, there's a problem - there are python use cases that initialize/register process groups in one thread and run collectives in another thread. This use case should be supported. However, since `GroupRegistry` employs thread isolation by default, collectives in different threads can't find the registered process groups.

This PR fixes the issue by:
- Make `GroupRegistry` work in non-thread isolation mode by default. This would match the behavior w/o the native process group registry.
- Introduces `set_thread_isolation_mode` so one-device-per-thread runtimes can enable thread isolation mode explicitly.

Differential Revision: [D54658515](https://our.internmc.facebook.com/intern/diff/D54658515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121457
Approved by: https://github.com/wanchaol
2024-03-08 19:31:24 +00:00
2c9c57c061 Only profiling when it's enabled. (#121404)
Summary:
The profiling, even when disabled, takes up about 1.5% cpu for a model I'm looking into.

This patch just splits into with/without profile runs.

The potential downside is that now the script can't enable profiling from within itself. That capability doesn't seem to be used anywhere. If that's a crucial use case, we can do something about it, but ideally we wouldn't.

Test Plan:
Link with profiles:
https://fburl.com/scuba/strobelight_services/ihxsl7pj

```
buck2 run fbcode//caffe2/test/cpp/jit:jit
```

Reviewed By: zhxchen17

Differential Revision: D54066589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121404
Approved by: https://github.com/zhxchen17
2024-03-08 19:23:14 +00:00
df06b94778 Add complex support to parametrizations.spectral_norm (#121452)
Fixes https://github.com/pytorch/pytorch/issues/121091
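A minimal sketch of what complex support enables here (illustrative only; the specific dtype and layer below are assumptions, not the PR's test code):

```python
import torch
from torch.nn.utils.parametrizations import spectral_norm

# Apply the spectral_norm parametrization to a complex-valued linear layer.
layer = spectral_norm(torch.nn.Linear(4, 4, dtype=torch.complex64))
y = layer(torch.randn(2, 4, dtype=torch.complex64))
```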

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121452
Approved by: https://github.com/ezyang, https://github.com/peterbell10
2024-03-08 19:17:20 +00:00
0f3f4f5534 Revert "[nit][DCP][DSD] Remove Unused Variables in test_state_dict.py (#121204)"
This reverts commit 4186c365313e909dfc8574c4469e5015439c2924.

Reverted https://github.com/pytorch/pytorch/pull/121204 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but the failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/121204#issuecomment-1986252526))
2024-03-08 19:08:50 +00:00
d55d803812 Add operator length hint support (#121495)
Seemed like an easy operator to squeeze into the 2.3 release. Added a simple test. Partially addresses #116396
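A hedged sketch of the pattern this enables under `torch.compile` (illustrative; the function below is not from the PR's test):

```python
import operator
import torch

def fn(x, seq):
    # operator.length_hint(seq) is now handled by dynamo
    return x + operator.length_hint(seq)

out = torch.compile(fn)(torch.ones(2), [1, 2, 3])  # adds 3
```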

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121495
Approved by: https://github.com/albanD
2024-03-08 19:08:33 +00:00
9b03a06288 [BE] [MPS] Fix out resize logic in torch.where (#121476)
By deleting `where_mps`  and registering MPS dispatch for `where_kernel`.
As result of this change resizing and type-checking logic is shared between MPS, CPU and  CUDA backends.

Add a test case to `TestMPS.test_where` (it should eventually be removed once `out` OpInfo testing is enabled for MPS).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121476
Approved by: https://github.com/albanD, https://github.com/Skylion007
ghstack dependencies: #121473, #121494
2024-03-08 18:59:37 +00:00
9cc89970a9 [BE] Cleanup where_self_out (#121494)
- Avoid extra assignments by using a ternary instead of if-else
- Do not call type-cast unless it is needed (in most cases only one of the two arguments will need to be cast)
- Avoid an extra assignment for condition_ by calling `cast` under the `if` condition
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121494
Approved by: https://github.com/albanD, https://github.com/Skylion007
ghstack dependencies: #121473
2024-03-08 18:59:37 +00:00
1866ee6735 Enable out OpInfo testing for torch.where (#121473)
And fix behavior discrepancy between CPU and CUDA by raising an error when `out.dtype` is unexpected
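Roughly, the behavior being aligned looks like this (a sketch, not the PR's test code; the exact error raised for a mismatched dtype is an assumption):

```python
import torch

cond = torch.tensor([True, False, True])
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.zeros(3)

out = torch.empty(3)
torch.where(cond, a, b, out=out)  # matching dtype: fine on CPU and CUDA

bad_out = torch.empty(3, dtype=torch.int64)
# torch.where(cond, a, b, out=bad_out)  # unexpected out.dtype now errors on both backends
```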

Fixes https://github.com/pytorch/pytorch/issues/121397
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121473
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-03-08 18:59:37 +00:00
0dd21c0c34 Update Quantizable LSTM to support QAT (#121448)
Summary: Title.

Test Plan:
* CI
* N3684627

Differential Revision: D54653542

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121448
Approved by: https://github.com/andrewor14
2024-03-08 18:55:50 +00:00
b52e0bf131 Deprecate torch.autograd.function.traceable, is_traceable (#121413)
- There are no usages of this internally.
- There are very few usages of this in OSS (most of these are forks of old
repositories).
- This flag doesn't do anything.

We're deprecating it to prevent confusion. I will delete it immediately
after the branch cut.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121413
Approved by: https://github.com/albanD, https://github.com/soulitzer
2024-03-08 18:41:07 +00:00
08460f4bae [tp] remove deprecated tp_mesh_dim arg (#121432)
This PR removes the deprecated tp_mesh_dim arg to prepare for release.
As we deprecated this arg for a while (by throwing deprecating
messages), we should remove it before the release

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121432
Approved by: https://github.com/wz337
ghstack dependencies: #121431
2024-03-08 17:46:44 +00:00
30982ce072 [tp] doc fixes (#121431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121431
Approved by: https://github.com/wz337
2024-03-08 17:46:44 +00:00
effdea5fc6 Fix round robin sharding (#121022)
Fix round robin sharding when there are no test times and sort_by_time=False

Adds more tests to test_test_selections for sort_by_time=False
Adds more checks to test_split_shards_random for serial/parallel ordering + ordering of tests
Refactoring of dup code

Tested locally by running `python test/run_test.py --shard 3 5` with no test times downloaded and checked that it wasn't an empty list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121022
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-03-08 17:01:34 +00:00
9d83f9dc0e Update XLA pin (#121501)
To 8078b8f38c

Fixes regression caused by https://github.com/pytorch/pytorch/pull/120076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121501
Approved by: https://github.com/Skylion007, https://github.com/aakhundov, https://github.com/albanD
2024-03-08 16:53:10 +00:00
a2a8c1fda0 [AOTDispatch] Return mutated inputs directly when keeping mutations (#120514)
Fixes #120242

The example from the issue now results in the graph
```python
def forward(self, arg0_1, arg1_1):
    sin = torch.ops.aten.sin.default(arg0_1);  arg0_1 = None
    copy_ = torch.ops.aten.copy_.default(arg1_1, sin);  arg1_1 = sin = None
    return (copy_,)
```

and the corresponding inductor kernel eliminates the intermediate buffer
completely

```python
def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    assert_size_stride(arg0_1, (5, ), (1, ))
    assert_size_stride(arg1_1, (5, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        # Source Nodes: [sin], Original ATen: [aten.sin]
        stream0 = get_raw_stream(0)
        triton_poi_fused_sin_0.run(arg0_1, arg1_1, 5, grid=grid(5), stream=stream0)
        del arg0_1
    return (arg1_1, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120514
Approved by: https://github.com/ezyang, https://github.com/oulgen, https://github.com/lezcano
2024-03-08 16:33:26 +00:00
f7ec984b1b [DTensor][XLA] support XLA backend in distribute_module API (#121355)
Addresses #92909  cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121355
Approved by: https://github.com/wanchaol
2024-03-08 15:47:33 +00:00
7b4f70eda5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-08 15:07:15 +00:00
c253d1c1db Add links to _ex variants in all linalg functions that support them (#121451)
Fixes https://github.com/pytorch/pytorch/issues/96632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121451
Approved by: https://github.com/ezyang
2024-03-08 12:19:16 +00:00
975d428425 [Quant] Add the operator of decomposed fake quant per channel (#121297)
**Summary**
Add the operator of `quantized_decomposed.fake_quant_per_channel` and test the forward and backward of this op with comparing to ATen.

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_decomposed_fake_quant_per_channel
```

**Next Step**
Optimize the performance: from the generated code of forward and backward graph, the code didn't vectorize.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121297
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
2024-03-08 10:51:37 +00:00
8ed0932172 Update link to OpenVINO backend in torch.compiler.rst (#121303)
This is a permalink, so it will remain active regardless of documentation version changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121303
Approved by: https://github.com/soulitzer
2024-03-08 08:17:13 +00:00
b3f24b57fb fix accidental specialization with faketensor input checks (#121460)
Summary: When fake tensors are passed to a graph module and we do runtime assertions on them, we can accidentally trigger specialization guards. It's better to just relax the checking for these.

Test Plan: confirmed that problem in T181400371 is now fixed

Differential Revision: D54658960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121460
Approved by: https://github.com/angelayi
2024-03-08 08:02:37 +00:00
2e789ad522 [DCP][state_dict][doc] Update the distributed state_dict document (#121290)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121290
Approved by: https://github.com/LucasLLC
ghstack dependencies: #121273, #121276
2024-03-08 07:58:18 +00:00
e628f2cc66 suggested fixes for congruences (#121418)
Differential Revision: D54636152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121418
Approved by: https://github.com/zhxchen17
2024-03-08 07:19:51 +00:00
96ed37ac13 [DCP] Makes async_save public (#121325)
Makes async_save public

Differential Revision: [D54593610](https://our.internmc.facebook.com/intern/diff/D54593610/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121325
Approved by: https://github.com/wz337
ghstack dependencies: #121317
2024-03-08 05:13:13 +00:00
13366a101a [DCP][state_dict][doc] Fix the documents for distributed_state_dict (#121276)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121276
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #121273
2024-03-08 03:29:47 +00:00
72dd9b2430 [inductor] Make some improvements to FX graph caching (#117888)
Summary: This is in preparation to enable FX graph caching by default. First fix some bugs uncovered by running all unit tests under `test/inductor/`. I'll enable in a separate diff in case we need to revert. Summary of changes:
* Turn off caching for tests that require a compilation, e.g., when checking that a relevant counter was incremented
* Bypass caching when we see mkldnn tensors as constants (they currently don't serialize, so we can't save to disk)
* Include various global settings that could affect compilation in the cache key calculation.
* Handle a few config settings that break key calculation.
* Handle code paths where no ShapeEnv is available (the cache impl requires a shape env as part of handling guards)
* Skip caching when freezing is enabled (Freezing can embed constants that wouldn't be static across runs).
* Fix the clear() method to not throw when the cache /tmp dir doesn't exist

Test Plan: Ran all tests under `test/inductor/` twice with TORCHINDUCTOR_FX_GRAPH_CACHE=1 to exercise any test that might be affected by caching.
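For reference, a minimal sketch of opting into the cache from Python (assuming the `fx_graph_cache` inductor config flag; the test plan above uses the environment variable for the same purpose):

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.fx_graph_cache = True  # same effect as TORCHINDUCTOR_FX_GRAPH_CACHE=1

@torch.compile
def f(x):
    return torch.sin(x) + 1

f(torch.randn(8))
```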

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117888
Approved by: https://github.com/eellison
2024-03-08 02:30:49 +00:00
909d73d8cb [DCP] Removes no_dist and coordinator_rank from public DCP API's (#121317)
[DCP] Removes `no_dist` and `coordinator_rank` from public DCP API's

Differential Revision: [D54591181](https://our.internmc.facebook.com/intern/diff/D54591181/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121317
Approved by: https://github.com/fegin
2024-03-08 02:14:12 +00:00
23ac0cd561 more passing dynamo tests (#121378)
These are just tests that I noticed passed on current main

Running:
```
PYTORCH_TEST_WITH_DYNAMO=1 pytest test/dynamo/test_dynamic_shapes.py test/dynamo/test_compile.py -k 'test_export_decomp_dynamic_shapes or test_export_dynamic_dim_cleanup_dynamic_shapes or test_export_multi_dynamic_dim_constraint_dynamic_shapes or test_export_multi_dynamic_dim_unsafe_relationship_dynamic_shapes or test_export_no_raise_dynamic_shapes or test_export_preserve_constraints_as_metadata_scalar_dynamic_shapes or test_export_raise_on_relationship_dynamic_shapes or test_exported_graph_serialization_dynamic_shapes  or test_retracibility_dict_container_inp_out_dynamic_shapes or test_retracibility_nested_list_out_dynamic_shapes or test_exception_table_e2e_2_dynamic_shapes or test_exception_table_e2e_dynamic_shapes or test_exception_table_parsing_dynamic_shapes or test_inference_mode_dynamic_shapes or test_inplace_view_on_graph_input_dynamic_shapes or test_numpy_torch_operators_dynamic_shapes or test_py311_jump_offset_dynamic_shapes or test_lazy_module_no_cls_to_become_dynamic_shapes or test_batchnorm_e2e_dynamic_shapes or test_functools_wraps_dynamic_shapes or test_jit_trace_errors_dynamic_shapes or test_multi_import_dynamic_shapes or test_requires_grad_guards_with_grad_mode2_dynamic_shapese or test_dynamo_signatures'
```
BEFORE: `1 failed, 1 passed, 22 skipped, 1372 deselected`
AFTER: `24 passed, 1372 deselected`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121378
Approved by: https://github.com/oulgen
2024-03-08 01:59:01 +00:00
4186c36531 [nit][DCP][DSD] Remove Unused Variables in test_state_dict.py (#121204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121204
Approved by: https://github.com/Skylion007
2024-03-08 01:54:25 +00:00
0f8c9acc29 Revert "[fake_impls] Fix seed/offset device for attention kernels (#120839)" (#121447)
This reverts commit df3c8b8390bc601072b0ee9b2c39e07adf370fe2.

It regressed cudagraphs+PT2 performance on SDPA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121447
Approved by: https://github.com/Chillee
2024-03-08 01:48:23 +00:00
dc514b967e [dtensor][TP] check funcol calls and improve doc for loss parallel (#121366)
Since CommDebugMode is fixed, we can check that loss parallel is working as expected.

Under loss parallel, the forward computation should invoke 3 all-reduces, and the backward computation should invoke no functional collectives.

Co-authored-by: Wanchao <wanchaol@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121366
Approved by: https://github.com/wanchaol
2024-03-08 01:41:31 +00:00
25c74a93cd [fx] Preserve Fx graph node order in partitioner across runs (#115621)
Fixes #ISSUE_NUMBER
partitioner generates different graph in recompilation on each run
Co-authored-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621
Approved by: https://github.com/ezyang
2024-03-08 01:37:53 +00:00
7dc1ab8989 make dynamo work with _LazyGraphModule.lazy_forward (#121259)
Fix https://github.com/pytorch/pytorch/issues/121198 .

We previously already triggered the real recompilation for LazyGraphModule when it runs through the dynamo context. But people may pass in LazyGraphModule._lazy_forward rather than the LazyGraphModule instance itself. This PR handles that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121259
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-03-08 01:37:39 +00:00
9bff1599b6 [Torch Elastic][Draft] Refactor SubprocessHandler to separate module for easier subclass (#120373)
Summary:
## No Functional Change
- Refactor Subprocess Handler into a separate folder for easier subclassing
- SubprocessHandler
    - added `local_rank_id` in `SubprocessHandler` to make it available as a field in the class
    - pass in `local_rank_id` from subprocess start

Test Plan: No functional changes.

Differential Revision: D54038627

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120373
Approved by: https://github.com/kurman
2024-03-08 01:37:34 +00:00
c86a1ce125 [dynamo][guards-cpp-refactor] Func defaults and kwdefaults accessor (#121338)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121338
Approved by: https://github.com/jansel
ghstack dependencies: #121327
2024-03-08 01:24:00 +00:00
79a04f2df9 [dynamo][guards-cpp-refactor] Permit dict version guard in DictGuardManager (#121327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121327
Approved by: https://github.com/jansel
2024-03-08 01:24:00 +00:00
962c1b4c69 Update XNNPACK revision to fcbf55a (#120583)
Update XNNPACK dependency to revision fcbf55a. This is part of a larger, synchronized update of the dependency version for PyTorch, ExecuTorch, and FB internal targets.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120583
Approved by: https://github.com/mcr229
2024-03-08 01:19:22 +00:00
090616d9a1 [Indutor] Support auto-tuned custom PT ops in abi compatible mode (#120877)
Differential Revision: D54344556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120877
Approved by: https://github.com/aakhundov
2024-03-08 01:16:57 +00:00
04a5d6e8d3 [dynamo][guards] Use lazy variable tracker for func defaults (#121388)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121388
Approved by: https://github.com/jansel
2024-03-08 01:10:46 +00:00
5d8e4126b6 Fixup test_trace_rules (#121351)
Summary:
Fixes
https://www.internalfb.com/intern/testinfra/diagnostics/7599824578133672.281475099376195.1709732674/

(for some reason this test didn't run in OSS)?

Reached out to Yanbo Liang for additional context:
 {F1465435684}

Test Plan:
Local:
https://www.internalfb.com/intern/testinfra/testconsole/testrun/16325548673376150/

Differential Revision: D54605075

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121351
Approved by: https://github.com/malfet, https://github.com/yanboliang
2024-03-08 00:50:45 +00:00
af62a70fab [export] Fix nn_module_stack in retracing (#121423)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1391916691446538/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121423
Approved by: https://github.com/zhxchen17
2024-03-08 00:34:11 +00:00
4f120dc2a6 Clean up mode handling in python dispatcher (#121083)
Things that were bad before this PR:
1. Temporarily unsetting functional tensor mode and proxy mode both had duplicate implementations
2. There are variants of mode-handling private utils that have duplicate implementations (different APIs calling the same repeated implementation, so I refactored them)
3. The _push_mode API used to take a dispatch key argument which is not necessary.
4. There are unused APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121083
Approved by: https://github.com/zou3519
2024-03-08 00:30:34 +00:00
0811f15270 [DCP][state_dict] Let _offload_state_dict_to_cpu to return the companion_obj if it exist. (#121273)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121273
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-03-08 00:24:29 +00:00
f76e541ec7 [BE] NO MORE discrepancy between forloop foreach capturable YAY (#121269)
and I will not let it happen again

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121269
Approved by: https://github.com/albanD
ghstack dependencies: #121260, #121264
2024-03-08 00:00:30 +00:00
9d6c5be781 Add ASGD capturable API for forloop (#121264)
@tfsingh I got to it first--wanted to land this stack and close the gap ASAP.

This PR also fixes a discrepancy between `_init_group` and `__set_state__` because we have the constants live on params' device always.

There are some next steps though:
- ASGD can be made faster by keeping etas, mus, and steps on CPU when NOT capturable. (I had mistakenly thought foreachifying was faster and so we landed https://github.com/pytorch/pytorch/pull/107857, but it is slower). No one has complained yet though.  ¯\_(ツ)_/¯

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121264
Approved by: https://github.com/albanD
ghstack dependencies: #121260
2024-03-08 00:00:30 +00:00
24821fec26 Add RAdam capturable API for forloop (#121260)
Implementation thanks to @MarouaneMaatouk in https://github.com/pytorch/pytorch/pull/118697, though I've since cleaned it up a lot to save perf on the rect < 5 eager case. It also just looks better now :) Added tests and the cudagraph health check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121260
Approved by: https://github.com/mlazos
2024-03-08 00:00:30 +00:00
b1657beac1 feat: Add min, max ranges to mark_dynamic API (#119737)
Fixes https://github.com/pytorch/pytorch/issues/115137

This PR adds:

- mark_dynamic API will accept `min`, `max` values to create a bounded constraint on the dim.
- test case in test_misc.py which checks if `ConstraintViolationError` is triggered if `torch.compile` gets a input dimension out of bounds.
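A minimal sketch of the bounded form of the API (illustrative; the bounds below are arbitrary):

```python
import torch
import torch._dynamo

x = torch.randn(8, 32)
# Constrain dim 0 to stay within [2, 64] while remaining dynamic.
torch._dynamo.mark_dynamic(x, 0, min=2, max=64)

compiled = torch.compile(lambda t: t * 2)
compiled(x)
# Inputs whose dim 0 falls outside [2, 64] should trigger ConstraintViolationError.
```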

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119737
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-03-07 23:26:03 +00:00
e0c534fe02 Revert "[Inductor] Add support for NEON ISA in the Inductor C++ backend (#105590)"
This reverts commit 156954d6a2a05f3ce8288dd054691102e596e461.

Reverted https://github.com/pytorch/pytorch/pull/105590 on behalf of https://github.com/ezyang due to https://github.com/pytorch/pytorch/issues/121288#issuecomment-1981980699 ([comment](https://github.com/pytorch/pytorch/pull/105590#issuecomment-1984745827))
2024-03-07 23:06:29 +00:00
3d089de851 Add torch.cond support to AOT Inductor (#121120)
Summary: In this PR, `torch.cond` support and the necessary codegening infrastructure is added to C++ wrapper (AOTInductor and friends).

Notable additions:

- A new mechanism in the Python wrapper codegen to precompile and save the Triton kernels (generated and user-defined) which haven't been covered by the active path through the control flow given the sample inputs. As we can't do the runtime autotuning of the kernels outside the active path, we precompile and save them with the `launchers[0]` (corresponding to the first config).

- Codegen infra for `torch.cond` in the C++ wrapper (ABI- and non-ABI-compatible). The `torch.cond` codegen has been slightly refactored to avoid duplication across the Python and C++ wrappers.

- More extensions of the caching sites in the wrapper code to cache per codegened graph (e.g., `codegen_int_array_var`) + some infra for tracking the current codegened graph in the wrapper (both during codegen-ing in the `Scheduler.codegen` and in the `WrapperCodeGen.generate` functions).

- New unit tests to cover the added AOT Inductor + `torch.cond` functionality.
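For orientation, a small `torch.cond` program of the kind exercised here looks roughly like the sketch below; the export call is just one way to feed such a module into the AOT Inductor flow and is not the PR's exact test:

```python
import torch

class M(torch.nn.Module):
    def forward(self, p, x):
        def true_fn(x):
            return torch.sin(x)

        def false_fn(x):
            return torch.cos(x)

        return torch.cond(p, true_fn, false_fn, (x,))

ep = torch.export.export(M(), (torch.tensor(True), torch.randn(4)))
```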

Codegen examples from the new unit tests:

- [`test_cond_simple_abi_compatible_cpu`](https://gist.github.com/aakhundov/862d5de9aa460f5df399e1387f7b342e)
- [`test_cond_simple_abi_compatible_cuda`](https://gist.github.com/aakhundov/d70b81f95fa8cc768cedef9acacb25bb)
- [`test_cond_simple_non_abi_compatible_cpu`](https://gist.github.com/aakhundov/c0ae7a8cbb6fa311c838e1b580f9a3f6)
- [`test_cond_simple_non_abi_compatible_cuda`](https://gist.github.com/aakhundov/08b945d4e8a32c97b7f9ff6272f4a223)
- [`test_cond_nested_abi_compatible_cuda`](https://gist.github.com/aakhundov/ce664f433c53e010ce4c0d96a6c13711)
- [`test_cond_with_parameters_abi_compatible_cuda`](https://gist.github.com/aakhundov/77afbeb8eaab5c5b930a3f922a7baf12)
- [`test_cond_with_multiple_outputs_abi_compatible_cuda`](https://gist.github.com/aakhundov/8cc06105ec8a3fe88be09b3f6e32c690)

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 42 tests in 170.619s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121120
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-07 22:39:57 +00:00
26740f853e Remove unnecessary use of ctx.resolve_tools. (#120493)
In this case, it's simpler to use ctx.actions.run(executable = ...), which already ensures that the runfiles associated with the executable are present.

(It's also possible to use ctx.actions.run_shell(tools = ...) with a custom command line, but it's unclear to me that indirecting through the shell is needed here.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120493
Approved by: https://github.com/ezyang
2024-03-07 22:33:17 +00:00
d14d62b7aa [dynamo] add more refleak tests (#120657)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120657
Approved by: https://github.com/jansel
2024-03-07 22:25:43 +00:00
6490441d8f Remove dead get_shape_groups (#120813)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120813
Approved by: https://github.com/albanD
2024-03-07 22:20:30 +00:00
18d574a07a [Inductor] Use indices for constants in triton_meta (#121427)
@bertmaher pointed out that constants are passed with their indices, not their names. Looking at triton source, this appears to be true 392370b303/python/triton/runtime/jit.py (L381-L385)
I'm guessing both indices and names work here, but let's be consistent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121427
Approved by: https://github.com/aakhundov
2024-03-07 21:59:43 +00:00
f61192b014 Fix for Wait kernel lowering in inductor not accepting MultiOutputs from non-collective calls (#121428)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121428
Approved by: https://github.com/yifuwang
2024-03-07 21:29:25 +00:00
76f1461892 [export] Serialize union fields with single entry dict. (#121263) (#121337)
Summary:

remove "$type" and "$value" fields, instead only serialize as {type: value} for union fields directly.

bypass-github-export-checks

Test Plan: CI

Differential Revision: D54600943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121337
Approved by: https://github.com/tugsbayasgalan
2024-03-07 21:24:28 +00:00
4c58f2b675 [PyTorch] Use uint32_t for ProcessedNode::num_outputs (#121335)
We already use uint32_t for indexing, and the notion of a single graph node with more than four billion outputs stretches credulity.

Differential Revision: [D54598821](https://our.internmc.facebook.com/intern/diff/D54598821/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121335
Approved by: https://github.com/Skylion007
2024-03-07 21:15:05 +00:00
ea8f6e2e54 Subclass view fake-ification via reified ViewFuncs (#118405)
This PR:
* Uses reified ViewFuncs to swap in fake tensors / symbolic SymInts for view replay during subclass view fake-ification
* Enables automatic dynamic on view bases -> fakeifies according to the resultant symbolic context instead of the old "all-static" approach
* Covers the following view types:
    * subclass -> dense
    * dense -> subclass
    * subclass -> subclass
* Dense -> dense views are handled the old way via an `as_strided()` call, as it's likely there is no view func available

Differential Revision: [D54269082](https://our.internmc.facebook.com/intern/diff/D54269082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118405
Approved by: https://github.com/ezyang
2024-03-07 19:56:16 +00:00
63ec5cd158 TD Heuristic for tests mentioned in PR body, less verbose TD printing (#120621)
Move tests that are mentioned in PR body or commit message to front.  Also attempts to find any issues/PRs mentioned in the PR body and search for those too (ex if you link a disable issue and that issue contains the test file that it was failing on)

looking for: dynamo/test_export_mutations

Also removes some printed information in TD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120621
Approved by: https://github.com/osalpekar
2024-03-07 19:36:11 +00:00
c7a65f58b0 [CI] Script to fetch creds from current AWS session (#121426)
Some implementations, like OpenDAL, do not work with AWS IMDSv2; this script bridges the gap and enables more recent `sccache` releases (which switched from simple-s3 to OpenDAL) to work in the current CI system

When launched it prints something like:
```
export AWS_ACCESS_KEY_ID=XXXXX
export AWS_SECRET_ACCESS_KEY=YYYY
export AWS_SESSION_TOKEN=ZZZZ
```
which can be `eval`ed, after which sccache can use those credentials.

Validated in https://github.com/pytorch/pytorch/pull/121323
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121426
Approved by: https://github.com/Skylion007
2024-03-07 19:25:54 +00:00
2b1661c7a0 Revert "[compiled autograd] support custom ops backed by c++ autograd::Function (#120681)"
This reverts commit 05c256849b464deee16ccd70152fd54071c6c79c.

Reverted https://github.com/pytorch/pytorch/pull/120681 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D54617701 ([comment](https://github.com/pytorch/pytorch/pull/120681#issuecomment-1984214079))
2024-03-07 18:53:51 +00:00
60aaba4128 create function to get ProcessGroupNCCL uid (#121132)
Summary: expose ProcessGroupNCCL uid

Differential Revision: D54446056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121132
Approved by: https://github.com/aaronenyeshi
2024-03-07 18:34:38 +00:00
83d095c213 [BE] Remove unnecessary requires_cuda in common_optimizers.py (#121249)
@mlazos had already added the needed decorator on the test itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121249
Approved by: https://github.com/Skylion007, https://github.com/mlazos, https://github.com/albanD
ghstack dependencies: #121183
2024-03-07 17:57:02 +00:00
53bdae736d Add capturable single tensor Adamax (#121183)
Finishes the work started in https://github.com/pytorch/pytorch/pull/118697. Thanks @MarouaneMaatouk for the attempt, but due to inactivity I have opened this PR for Adamax. Note that the new capturable implementation is much simpler and I've modified the foreach capturable impl--it now calls fewer kernels and is more easily comparable to forloop.
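A minimal usage sketch of the new flag (illustrative only; capturable optimizer state is expected to live on a CUDA device):

```python
import torch

model = torch.nn.Linear(8, 8, device="cuda")
opt = torch.optim.Adamax(model.parameters(), lr=1e-3, capturable=True)

loss = model(torch.randn(4, 8, device="cuda")).sum()
loss.backward()
opt.step()  # state tensors stay on device, so the step can be CUDA-graph captured
```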

Next steps:
* This PR discovered two bugs: #121178 and #121238.
* Move the now hefty graph optim tests in test_cuda to use OptimInfo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121183
Approved by: https://github.com/albanD
2024-03-07 17:57:02 +00:00
af88425cdc Forward fix lint after 121202 (#121425)
Forward fix after #121202, where the lintrunner job failed due to being unable to checkout the pytorch repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121425
Approved by: https://github.com/ezyang, https://github.com/aakhundov, https://github.com/malfet
2024-03-07 17:20:13 +00:00
suo
c3c15eb9a6 [export] update docs to not export raw functions (#121272)
as title

Differential Revision: [D54555101](https://our.internmc.facebook.com/intern/diff/D54555101/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121272
Approved by: https://github.com/zhxchen17
2024-03-07 17:18:07 +00:00
862b99b571 Revert "[ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)"
This reverts commit 3239f86a3df133b5977d988324639e0de7af8749.

Reverted https://github.com/pytorch/pytorch/pull/120925 on behalf of https://github.com/malfet due to Breaks internal tests, likely due to the increased memory requirements ([comment](https://github.com/pytorch/pytorch/pull/120925#issuecomment-1983875400))
2024-03-07 16:16:07 +00:00
eea37c6db4 [profiler] record nccl version in distributed info (#121044)
Summary: Add a field of NCCL version in distributed info if backend is NCCL

Differential Revision: D54432888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121044
Approved by: https://github.com/aaronenyeshi
2024-03-07 15:56:02 +00:00
cyy
3aa512cd72 [Clang-tidy header][23/N] Enable clang-tidy coverage on aten/src/ATen/*.{cpp,h} (#121380)
This PR finishes the works beginning with #https://github.com/pytorch/pytorch/pull/120763 by enabling clang-tidy on aten/src/ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121380
Approved by: https://github.com/Skylion007
2024-03-07 15:11:07 +00:00
9a45001905 [dynamo] relax missing symbols runtime assert (#121339)
Differential Revision: [D54603361](https://our.internmc.facebook.com/intern/diff/D54603361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121339
Approved by: https://github.com/ezyang
2024-03-07 14:53:38 +00:00
0339f1ca82 [Inductor] Allocate another shard for testing cpp-wrapper JIT (#121310)
Summary: The ABI-compatible mode for cpp wrapper has not been turned on by default, so test it separately. Expect to add more tests for the shard.

Differential Revision: [D54617287](https://our.internmc.facebook.com/intern/diff/D54617287)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121310
Approved by: https://github.com/chenyang78
ghstack dependencies: #121309
2024-03-07 14:24:21 +00:00
7e598c0053 [Inductor] Enable ABI-compatible mode for cpp-wrapper JIT (#121309)
Differential Revision: [D54617284](https://our.internmc.facebook.com/intern/diff/D54617284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121309
Approved by: https://github.com/chenyang78
2024-03-07 14:22:06 +00:00
57fc35a3af [ Inductor ] Shape padding honors output stride preservation (#120797)
This fix makes sure that shape padding honors inductors 'keep_output_strides' setting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120797
Approved by: https://github.com/eellison
2024-03-07 13:52:29 +00:00
cyy
4305c64fea Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-07 09:52:21 +00:00
1ce5049692 [inductor] fix the layout problem for nll_loss2d_backward (#121173)
Fixes https://github.com/pytorch/pytorch/issues/120759 .

The CUDA implementation of nll_loss2d_backward.default requires the 'self' tensor to be contiguous. This implicit assumption may be broken by layout optimizations. The fix here is to add the constraint when we explicitly define the fallback for the op.

Not sure if we can improve the cuda kernel to relax the constraint though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121173
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-03-07 09:05:07 +00:00
b3065f6899 add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
2024-03-07 08:41:43 +00:00
e8e3049f57 [FSDP2] Relaxed check for parent mesh (#121360)
Mixing 1D and 2D `DTensor`s in the same sharded state dict should be okay, so we can remove the check that a parameter for FSDP to shard must be a `DTensor` if passing a child mesh to FSDP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121360
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
ghstack dependencies: #120351, #121328
2024-03-07 08:09:25 +00:00
db36d21f5c Add SDPA pattern for HuggingFace models BF16 (#121202)
### Description

- Add pattern for bf16 input type with fp32 attention mask. (Example model: ElectraForCausalLM)
- Add pattern with batch_size=1 to avoid some clones in graph. (Example model: text-classification+prajjwal1-bert-tiny)

### Newly matched models
Dtype: bf16, machine: SPR

#### Dynamo HuggingFace models

- ElectraForCausalLM (speedup=2.09x)
- ElectraForQuestionAnswering (speedup=4.22x)
- AlbertForQuestionAnswering (speedup=1.36x)
- AlbertForMaskedLM (speedup=1.39x)

#### OOB HuggingFace models

- multiple-choice+google-electra-base-discriminator
- text-classification+prajjwal1-bert-tiny
- text-classification+prajjwal1-bert-mini
- text-classification+google-electra-base-generator
- text-classification+bert-large-cased
- casual-language-modeling+xlm-roberta-base
- text-classification+roberta-base
- text-classification+xlm-roberta-base
- text-classification+albert-base-v2
- token-classification+google-electra-base-generator
- masked-language-modeling+bert-base-cased

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121202
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-07 07:40:00 +00:00
953c6c37cb Wrap remote cache creation with a try-catch (#121340)
Summary: In production I am seeing errors like "AttributeError: module 'triton.runtime' has no attribute 'fb_memcache'", likely due to some package skew. Until this is resolved, let's wrap this code with a try-catch.

Test Plan: CI

Differential Revision: D54604339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121340
Approved by: https://github.com/aakhundov
2024-03-07 07:05:49 +00:00
291ce86a6c Modify StorageImplCreateHelper (#118459)
I want to use tensor.untyped_storage()[a:b] for the ``PrivateUse1`` backend but it fails. The code goes into ``THPStorage_get``:
bb6eba189f/torch/csrc/Storage.cpp (L525-L540)

Here ``torch`` will create a new ``c10::StorageImpl`` but does not take the ``PrivateUse1`` backend into account.
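For reference, the slicing pattern in question looks like this on CPU (a sketch; the point of the change is to make the same path respect a custom PrivateUse1 backend):

```python
import torch

t = torch.arange(8, dtype=torch.uint8)
storage = t.untyped_storage()
chunk = storage[2:6]   # goes through THPStorage_get and builds a new StorageImpl
print(len(chunk))      # 4 bytes
```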

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118459
Approved by: https://github.com/albanD
2024-03-07 06:26:55 +00:00
f848e9c646 [Quant][Inductor] Fix q/dq per channel lowering with 64-bit qparams (#120984)
Fixes #120869

Fix lowering of `quantize_per_channel` and `dequantize_per_channel` with float64 scale and int64 zero point.
The generated code is incorrect without explicit type conversion. Add type conversions to the lowering pass, i.e., float64 (double) -> float32 and int64 -> int32.

**Test plan**
python test/inductor/test_cpu_repro.py -k test_per_channel_fake_quant_module_uint8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120984
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-03-07 06:23:52 +00:00
4f9d4e1ab0 [DTensor][XLA] refactor DTensor _xla API (#113214)
In response to the change pytorch/xla#5776 and #92909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113214
Approved by: https://github.com/wanchaol
2024-03-07 06:18:05 +00:00
cyy
c723514ef4 [CUDACachingAllocator] Simplify update_stat and avoid casts (#120964)
update_stat in CUDACachingAllocator.cpp was split into increase and decrease functions in this PR to simplify the implementation and avoid type casts throughout the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120964
Approved by: https://github.com/albanD
2024-03-07 05:55:38 +00:00
55232c4e1c Make CausalBias a torch.Tensor subclass again (#121358)
# Summary
This was removed in #116071 in order to enable compile support and re-adding this seems to still work with compile
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121358
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-03-07 05:20:47 +00:00
df2ad1fecc [dtensor][debug] have visualize_sharding correctly print for sub-mesh DTensor (#121216)
**Summary**
In `visualize_sharding` we chose to only print on rank 0 (the global rank), which means calling `visualize_sharding` will never print anything when the dtensor object's mesh doesn't include rank 0 (i.e. a sub-mesh). This PR has `visualize_sharding` always print on the rank whose mesh coordinate is (0, 0, ..., 0) instead of the rank whose global rank is 0.
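A rough usage sketch (the import path and the 4-rank launch under torchrun are assumptions, not taken from this PR):

```python
# torchrun --standalone --nproc-per-node=4 this_script.py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard
from torch.distributed._tensor.debug import visualize_sharding

mesh = init_device_mesh("cpu", (4,))
dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])
visualize_sharding(dt)  # now printed by the rank at mesh coordinate (0, ..., 0)
```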

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121216
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179, #120260
2024-03-07 04:50:15 +00:00
77873f6fe5 [dtensor][1/N] add torchrec even row-wise sharding example (#120260)
**Summary**
our goal is to demonstrate DTensor's capability to represent TorchRec's parameter sharding. Currently this is done with `ShardedTensor`, and theoretically `DTensor` can replace it with minor changes.

This PR serves as a start of this effort by adding an example test that represents TorchRec's `ShardingType.ROW_WISE` using DTensor. Note that this PR only covers the even sharding case.

**Test Run**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120260
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179
2024-03-07 04:50:15 +00:00
9cc0f23e5c [dtensor][debug] allow visualize_sharding to print header (#121179)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121179
Approved by: https://github.com/wanchaol
2024-03-07 04:50:06 +00:00
a2854ae904 Bugfix consume_prefix_in_state_dict_if_present function to keep the order of the state_dict (#117464)
This PR proposes to keep the same order as the original state_dict, as the issue creator proposed. It also fixes a bug concerning how ``_metadata`` is handled (see below), and includes other small changes to properly remove the prefix when it is present.

In the original code, ``_metadata`` was handled as a ``key``.

```
    # also strip the prefix in metadata if any.
    if "_metadata" in state_dict:
```

This is not the case, ``_metadata`` is actually an ``attribute``. Hence, the previous condition is changed to:

```
    # also strip the prefix in metadata if any.
    if hasattr(state_dict, "_metadata"):
```

This PR also includes the necessary test.
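A small usage sketch of the helper (illustrative; the dict below stands in for a DDP-style state_dict):

```python
import torch
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present

state_dict = {"module.weight": torch.zeros(2, 2), "module.bias": torch.zeros(2)}
consume_prefix_in_state_dict_if_present(state_dict, "module.")
print(list(state_dict.keys()))  # ['weight', 'bias']; the original ordering is preserved
```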

Fixes #106942

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117464
Approved by: https://github.com/mikaylagawarecki
2024-03-07 04:00:49 +00:00
edd80f87b8 Prevent infinite recursion within Tensor.__repr__ (#120206)
`Tensor.__repr__` calls functions which can perform logging, which ends up logging `self` (via `__repr__`), causing an infinite loop. Instead of logging all the args in FakeTensor.dispatch, log the actual parameters (and use `id` to log the tensor itself).

The change to torch/testing/_internal/common_utils.py came up during testing - in some ways of running the test, `parts` was `('test', 'test_testing.py')`, so `i` was 0 and we were doing a join on `()`, which caused an error.

Repro:
```
import torch
from torch.testing import make_tensor
from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode
t = torch.sparse_coo_tensor(((0, 1), (1, 0)), (1, 2), size=(2, 2))
t2 = FakeTensor.from_tensor(t, FakeTensorMode())
print(repr(t2))
```
and run with `TORCH_LOGS=+all`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120206
Approved by: https://github.com/yanboliang, https://github.com/pearu
2024-03-07 02:24:45 +00:00
eb4d87f237 graph break on sparse tensors constructions (#120458)
Fix some tests in https://github.com/pytorch/pytorch/issues/119780
sparse_bsc_tensor is not supported
https://github.com/pytorch/pytorch/pull/117907

Also more about the issue here.
https://docs.google.com/document/d/1EIb4qG88-SjVFn5TloLERliYdxIu2hwYoAA8skjOVfo/edit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120458
Approved by: https://github.com/ezyang
2024-03-07 02:17:41 +00:00
1a28ebffb3 [TP] Introduce Sequence Parallel Style for Laynorm/RMSNorm/Dropout (#121295)
As titled, this PR introduces a dedicated `ParallelStyle` to shard the
nn.LayerNorm/nn.Dropout/RMSNorm layers. We were mainly using manual
distribute_module calls before when sharding the RMSNorm layer, but I
think we should have a dedicated TP API to easily shard those layers,
instead of users manually using DTensors.
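A rough sketch of applying the new style (illustrative; the module names, the 8-rank mesh, and running under torchrun are assumptions):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, SequenceParallel

mesh = init_device_mesh("cuda", (8,))
block = torch.nn.TransformerEncoderLayer(d_model=256, nhead=8)
# Shard the LayerNorm layers along the sequence dimension.
parallelize_module(block, mesh, {"norm1": SequenceParallel(), "norm2": SequenceParallel()})
```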

I call this SequenceParallel, which might cause some confusion since we
technically "deprecated" a SequenceParallel style months ago. But this
time the SequenceParallel style is significantly different from the
previous one (which used to shard two consecutive Linear layers). I
believe getting the name right is the first priority, rather than
worrying about the issue of reusing the old name

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #121294
2024-03-07 02:04:59 +00:00
967dd31621 [cuDNN] Cleanup cuDNN < 8.1 ifdefs (#120862)
Follow-up of #95722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120862
Approved by: https://github.com/Skylion007
2024-03-07 01:46:25 +00:00
b9087f8571 [profiler] Add execution_trace_observer as an optional argument to profiler (#119912)
# Update Profiler API to collect Execution Traces

## TLDR
We would like to simplify collecting Execution Trace and Kineto together. Execution Trace and Kineto both provide meaningful information that can be combined to enable benchmarking, performance analysis and simulating new hardware.
```
import torch

def main():
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        …
        excution_trace_observer=ExecutionTraceObserver() # <<<<<<< NEW
    ) as prof:
        ...
        prof.step()
```

See test/profiler/test_profiler.py 'test_execution_trace_with_kineto' for an example of using this API.

## What are Execution Traces?
[Chakra Execution Traces](https://github.com/mlcommons/chakra/wiki) offer a graph based representation of AI/ML workloads.  It stands apart from conventional AI/ML frameworks by focusing on replay benchmarks, simulators, and emulators, prioritizing agile performance modeling and adaptable methodologies.
- Chakra is part of ML Commons industry standard and is being adopted by other companies besides NVIDIA too.
- At Meta we have instrumented PyPer framework to collect Execution Traces. More details on our [PyTorch implementation of Chakra can be found here](https://github.com/mlcommons/chakra/wiki)

Chakra essentially enables benchmarking and co-design for ML models without having to reproduce entire software stacks and helps companies collaborate [[chakra paper](https://arxiv.org/pdf/2305.14516.pdf)]

## Why correlate Execution Trace with PyTorch/Kineto Trace

Both Execution Traces and Kineto traces provide different types of information, and combining them is valuable. While PyTorch ETs focus on CPU operators with explicit dependencies between them, Kineto traces encode GPU operators with their start and end times. In addition, collecting them at different timestamps would be inaccurate as several operations (NCCL, Embedding lookup) are data dependent and may not match correctly.
Thus, it makes sense to collect both ET and Kineto together. The problem is that there are two code paths.

## Proposal
The proposal is to modify the PyTorch profiler (Kineto) API to enable execution trace to be collected simultaneously, see TLDR section

# Testing
Updated the unit test for collecting kineto and Execution Trace together.
- Check the collected ET has right range of events.
- Compare two sets of IDs - record func Ids in ET and external IDs in Kineto. We check if these have a constant difference.

```
pytest test/profiler/test_profiler.py  -k test_execution_trace_with_kineto -rP

Running 1 items in this shard

test/profiler/test_profiler.py [W execution_trace_observer.cpp:682] Enabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[W execution_trace_observer.cpp:694] Disabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119912
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2024-03-07 01:30:26 +00:00
eb1145436a [DCP] Adds main in format utils (#120128)
Adds main in format utils. Usage:

`python -m torch.distributed.checkpoint.format_utils dcp_to_torch dcp_dir torch_file.pt`

or

`python -m torch.distributed.checkpoint.format_utils torch_to_dcp torch_file.pt dcp_dir`

Differential Revision: [D53791355](https://our.internmc.facebook.com/intern/diff/D53791355/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120128
Approved by: https://github.com/fegin, https://github.com/wz337
2024-03-07 01:18:17 +00:00
cyy
5cc511f72f Use c10::irange and fix other index types in ForeachReduceOp.cu (#121123)
This PR follows the suggestions in #121066 and changes most loops to c10::irange.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121123
Approved by: https://github.com/soulitzer
2024-03-07 00:11:27 +00:00
c268ce4a6d Make ATen-cpu cuda/rocm agnostic (#121082)
Summary: This specific rocm logic will make aten-cpu code diverge between rocm and cuda. This is not good because we won't be able to share aten-cpu.so between rocm and cuda. More specifically, it will prevent us from building aten-hip by default, which would require us to set up rocm-specific rules - an extra burden for our build system.

Test Plan: sandcastle + oss ci

Differential Revision: D54453492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121082
Approved by: https://github.com/jeffdaily, https://github.com/aaronenyeshi, https://github.com/albanD
2024-03-06 23:51:40 +00:00
e50ded03a6 Use type check for also is_not (#113859)
Handle `is_not` for:

9647a251cb/torch/_dynamo/variables/builtin.py (L1314-L1317)

I noticed that https://github.com/pytorch/pytorch/issues/111713 exists; I think there's no harm in landing this first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113859
Approved by: https://github.com/Skylion007
2024-03-06 23:12:42 +00:00
a88356f45c [dtensor] make add_.Tensor/div_.Scalar to be linear pointwise instead (#121294)
add_.Tensor and div_.Scalar should support linearity so that we can delay resolving the partial
results.

This fixes the additional collective in the layer norm layer that we have seen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121294
Approved by: https://github.com/tianyu-l
2024-03-06 22:52:18 +00:00
2f064d895c Switch TORCH_TRACE to accept a directory by default (#121331)
Directory is better because it works smoothly with distributed
runs; otherwise you'd need to modify torchrun to set up distinct
log names for each file.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D54597814](https://our.internmc.facebook.com/intern/diff/D54597814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121331
Approved by: https://github.com/albanD
2024-03-06 22:46:18 +00:00
372f192050 [DTensor] Initialized RNG tracker if needed (#121328)
Since we are already checking if the RNG tracker is initialized, there is no real performance difference between erroring vs. just initializing a default RNG tracker (which we choose to be the `OffsetBasedRNGTracker`).

```
pytest test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121328
Approved by: https://github.com/wanchaol
ghstack dependencies: #120351
2024-03-06 22:21:44 +00:00
b0e2ed4d67 removing some macros (#120314)
Summary: Will be making some changes in the surrounding code; they are going to be easier without macros.

Differential Revision: D54001770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120314
Approved by: https://github.com/zhxchen17
2024-03-06 22:06:05 +00:00
69cedc16c5 Add padding dimension checks and tests (#121298)
Fixes #121093

Previously, calling the following functions with invalid padding dimensions would cause a segmentation fault:
```
torch._C._nn.replication_pad1d, torch._C._nn.replication_pad2d, torch._C._nn.replication_pad3d
```

To fix this, condition checks were added so that a RuntimeError with a descriptive message is raised instead, specifying the input dimensions required.
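
A minimal sketch of the new behavior (a hedged example; the exact error wording and the 4-D/5-D input requirement noted in the comment are assumptions based on the description above):
```python
import torch

x = torch.randn(2, 3)  # 2-D input; replication_pad3d expects a 4-D or 5-D input
try:
    # Previously this could segfault; with the added checks it raises a
    # RuntimeError describing the required input dimensions.
    torch._C._nn.replication_pad3d(x, (1, 1, 1, 1, 1, 1))
except RuntimeError as e:
    print(e)
```
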
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121298
Approved by: https://github.com/mikaylagawarecki
2024-03-06 21:55:34 +00:00
d7a5e59647 [dynamo] support group=None when rewriting collectives (#121043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121043
Approved by: https://github.com/awgu
2024-03-06 21:37:19 +00:00
3fee05f242 Triage the remaining fallbacks (#121312)
Building off work from @amjames. There may be some misclassifications; feel free to flag them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121312
Approved by: https://github.com/jansel
2024-03-06 21:23:47 +00:00
e865700f6a [FSDP2] Added initial meta-device init support (#120351)
This PR adds initial support for meta-device initialization for pre-training without loading from a state dict. The idea is to allow `fully_shard(module)` to return and still have sharded parameters on meta device. Then, the user is free to initialize them as they please, e.g. using `to_empty()`.

We override `_apply` to achieve the following:
- Reshard the parameters to ensure that sharded parameters are registered (for correctness) -- we will always need this
- Pad new local tensors and use the padded local tensors (to handle uneven sharding) -- we will remove this once `DTensor` pads its local tensor

We use the `swap_tensors` path in `_apply`. For now, this requires setting `torch.__future__.set_swap_module_params_on_conversion(True)`; however, in the future, this may be enabled by default for wrapper subclasses and will not need any explicit API call. If requiring this call is too intrusive in the short term, we can also call it in `_apply` or when importing `fully_shard`.

```
# Pre-training flow (no checkpoint)
global_mesh = init_device_mesh(..., mesh_dim_names=("dp", "tp"))
dp_mesh, tp_mesh = global_mesh["dp"], global_mesh["tp"]
with torch.device("meta"):
  model = ...
  parallelize_module(model, tp_mesh, ...)
  fully_shard(model, mesh=dp_mesh, ...)
for param in model.parameters():
  assert param.device.type == "meta"

model.to_empty(device="cuda")
random.manual_seed(42, global_mesh)
for module in model.modules():
  if hasattr(module, "reset_parameters"):
    module.reset_parameters()
```

This PR includes some minor changes to allow the user to similarly cast the module to a different dtype after construction time but before forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120351
Approved by: https://github.com/wanchaol
2024-03-06 21:18:25 +00:00
3cf02c5e06 [Dev Container] Fix container build by preventing conda prompt (#121128)
Without this, the build freezes at the prompt:
  Proceed ([y]/n)?

I'm using rootless podman in vscode instead of docker, but I think that should not affect this.
...or does conda somehow detect Docker but not Podman? Anyway, this should not break anything.

Btw, I also had to uncomment the line "remoteUser": "root" in devcontainer.json to finish the post-installation properly, but I guess there might be other workarounds - and perhaps you don't want to run as root if your container has root privileges.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121128
Approved by: https://github.com/drisspg
2024-03-06 20:50:40 +00:00
58ac4a2007 Remove llava from ci_expected_accuracy as it's flaky (#121322)
https://github.com/pytorch/pytorch/pull/121029 added it to CI, but the test is flaky on HUD; it alternates between fail_accuracy and fail_to_run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121322
Approved by: https://github.com/desertfire
2024-03-06 20:47:01 +00:00
23fb37fa41 Revert "[export] Serialize union fields with single entry dict. (#121263)"
This reverts commit 7feabe9b73e6ba7724b62ea91df27049defdf378.

Reverted https://github.com/pytorch/pytorch/pull/121263 on behalf of https://github.com/osalpekar due to A large number of inductor benchmarking jobs failing starting this PR. See for details: 7feabe9b73 ([comment](https://github.com/pytorch/pytorch/pull/121263#issuecomment-1981680049))
2024-03-06 19:58:55 +00:00
76f3663efe Fixed a memory leak when calling from_numpy on a numpy array with an … (#121156)
…unsupported dtype.

Fixes #121138.

The lambda function that DECREFs the object is not called when the dtype conversion function throws. This PR moves the conversion before the INCREF, which prevents the memory leak.
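
A hedged repro sketch of the leak scenario, assuming a structured NumPy dtype is one of the unsupported dtypes and that the failed conversion raises an exception:
```python
import numpy as np
import torch

# A dtype torch does not support (structured dtype); the conversion throws.
arr = np.zeros(8, dtype=[("a", np.float32), ("b", np.int32)])

for _ in range(10_000):
    try:
        torch.from_numpy(arr)
    except Exception:
        # Before the fix, each failed call leaked a reference to `arr`
        # because the DECREF lambda never ran after the INCREF.
        pass
```
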

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121156
Approved by: https://github.com/soulitzer, https://github.com/albanD
2024-03-06 19:37:38 +00:00
360761f7d0 [Torchelastic] Create root log directory by default (#121257)
Summary:
After the refactoring in https://github.com/pytorch/pytorch/pull/120691, the default behavior unintentionally changed from creating a tempdir for logging to the torch Elastic Agent not capturing any logs.

Reverting the behavior to:
- make a tempdir when the log dir is not specified
- allow a non-empty root log dir
    - Note: in case the attempt folder exists, it will be pruned here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/api.py#L294

Differential Revision: D54531851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121257
Approved by: https://github.com/d4l3k
2024-03-06 18:50:38 +00:00
418568d2e3 Add Float8 support to onnx exporter (#121281)
Fixes #106877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121281
Approved by: https://github.com/BowenBao, https://github.com/titaiwangms
2024-03-06 18:46:56 +00:00
cyy
5a2527db22 [Clang-tidy header][22/N] Fix clang-tidy warnings in aten/src/ATEN/*.{cpp,h} (#121102)
This PR continues to fix clang-tidy warnings in aten/src/ATEN/*, following #120763.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121102
Approved by: https://github.com/Skylion007
2024-03-06 18:36:31 +00:00
c5ef4df274 guard on grads being None in compiled optimizers (#121291)
Fixes #115607

We were missing guards when the grads were set to `None`. So if we compiled the optimizer with the grads set to their proper values, and then with the grads set to `None`, we'd continuously run the `None` version because all of the guards would pass and it would be ordered before the correct version in the cache.
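
A hedged sketch of the scenario being guarded against (the simple SGD setup is an assumption; the cache-ordering detail above is internal):
```python
import torch

p = torch.nn.Parameter(torch.randn(4))
opt = torch.optim.SGD([p], lr=0.1)

@torch.compile
def step():
    opt.step()

p.grad = torch.ones_like(p)
step()                           # compiled while a real grad is present

opt.zero_grad(set_to_none=True)  # grads are now None
step()                           # without a guard on "grad is None", both calls
                                 # could match the same cached version
```
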

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121291
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2024-03-06 18:33:23 +00:00
7feabe9b73 [export] Serialize union fields with single entry dict. (#121263)
Summary: Remove the "$type" and "$value" fields; instead, serialize union fields directly as {type: value}.

Test Plan: CI

Differential Revision: D54553770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121263
Approved by: https://github.com/tugsbayasgalan
2024-03-06 18:16:16 +00:00
c66d68ba51 [PT2] Add tolist() to FunctionalTensor for torch.export (#121242)
Adding tolist() to FunctionalTensor so torch.export can handle TorchRec data types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121242
Approved by: https://github.com/ezyang
2024-03-06 18:10:44 +00:00
05c256849b [compiled autograd] support custom ops backed by c++ autograd::Function (#120681)
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports

limitations:
- Requires users to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable, plus possibly an additional compiled_args + apply_with_saved implementation. This was the only way I could think of to ensure soundness.
- Will throw if we can't hash the saved_data, i.e. for any type other than list and dict not implemented in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- Can technically fail silently if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function with an identical implementation is called. This case seems extremely unlikely, and the only alternative to hashing I can think of is compiling with reflection.
- Tensors not saved via save_variables are not lifted and are specialized on the TensorImpl*'s hash (treated as a memory address). If needed, we can lift them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
2024-03-06 18:01:56 +00:00
b27d76949b [ROCm] Enable several fake_crossref UTs on ROCm (#121112)
Enabled unit tests:

test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_amp_linalg_norm_subgradients_at_zero_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_linalg_norm_subgradients_at_zero_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_amp_norm_nuc_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_norm_nuc_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_amp_svd_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_svd_cuda_float32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121112
Approved by: https://github.com/ezyang
2024-03-06 17:36:47 +00:00
b529c19bdf Revert "Batch Norm Consolidation (#116092)"
This reverts commit 5680f565d5b7d4aa412a3988d3d91ca4c5679303.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))
2024-03-06 17:10:01 +00:00
8dd4b6a78c Fix venv compatibility issue by updating python_lib_path (#121103)
The path referenced by sys.executable is the absolute path of the executable binary for the Python interpreter, which may not be appropriate here. Instead, sys.base_exec_prefix is more suitable, and this change correctly resolves the library when using a venv. I have tested it with a venv created by rye.

https://docs.python.org/3.6/library/sys.html#sys.executable

> A string giving the absolute path of the executable binary for the Python interpreter, on systems where this makes sense. If Python is unable to retrieve the real path to its executable, [sys.executable](https://docs.python.org/3.6/library/sys.html#sys.executable) will be an empty string or None.

https://docs.python.org/3.6/library/sys.html#sys.exec_prefix

> A string giving the site-specific directory prefix where the platform-dependent Python files are installed; by default, this is also '/usr/local'. This can be set at build time with the --exec-prefix argument to the configure script. Specifically, all configuration files (e.g. the pyconfig.h header file) are installed in the directory exec_prefix/lib/pythonX.Y/config, and shared library modules are installed in exec_prefix/lib/pythonX.Y/lib-dynload, where X.Y is the version number of Python, for example 3.2.

https://docs.python.org/3.6/library/sys.html#sys.base_exec_prefix

> Set during Python startup, before site.py is run, to the same value as [exec_prefix](https://docs.python.org/3.6/library/sys.html#sys.exec_prefix). If not running in a [virtual environment](https://docs.python.org/3.6/library/venv.html#venv-def), the values will stay the same; if site.py finds that a virtual environment is in use, the values of [prefix](https://docs.python.org/3.6/library/sys.html#sys.prefix) and [exec_prefix](https://docs.python.org/3.6/library/sys.html#sys.exec_prefix) will be changed to point to the virtual environment, whereas [base_prefix](https://docs.python.org/3.6/library/sys.html#sys.base_prefix) and [base_exec_prefix](https://docs.python.org/3.6/library/sys.html#sys.base_exec_prefix) will remain pointing to the base Python installation (the one which the virtual environment was created from).
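
For illustration, a small snippet showing the difference inside a virtual environment (the paths in the comments are examples only):
```python
import sys

# Inside a venv, sys.executable points at the venv's interpreter, while
# sys.base_exec_prefix keeps pointing at the base installation that actually
# ships the platform-dependent library files.
print(sys.executable)        # e.g. /home/user/.venvs/proj/bin/python
print(sys.base_exec_prefix)  # e.g. /usr
```
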
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121103
Approved by: https://github.com/ezyang
2024-03-06 17:00:46 +00:00
a427d90411 add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU; both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-06 16:25:53 +00:00
54d92f2e37 Add jacrev support in torch.compile (#121146)
The changes are simple: moved a few entries in trace_rules.py and included tests that compare the graph generated by jacrev.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121146
Approved by: https://github.com/zou3519
2024-03-06 16:05:33 +00:00
49d1fd31cf Fuse nodes with sizes (s0*s1*...,) and (s0, s1, s2, ...) (#120077)
Description:
- This PR tries to fuse nodes with compatible sizes, for example `node1: (s0, s1, s2)` and `node2: (s0 * s1 * s2)`. On `main` these two nodes cannot be fused due to their different sizes. With this PR we can recompute node2's size, body, etc. using node1's indexing constraints and thus fuse the two nodes.
- This should only affect the cpu device.

Example:
```python
from unittest.mock import patch
import torch
from torch._inductor.graph import GraphLowering
from torch._inductor import config

# Force multple scheduler nodes creation to fuse them
config.realize_opcount_threshold = 1

@torch.compile(fullgraph=True, dynamic=True)
def fn(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    o1 = x * w1.view(1, 1, 1, -1)
    o2 = x * w2.view(1, 1, 1, -1)
    output = o1 + o2
    return output

in_nodes = []
outputs = []
run_node = GraphLowering.run_node

graph_lowering_obj = None

def run_node_alt(self, n):
    global graph_lowering_obj

    graph_lowering_obj = self
    in_nodes.append(n)
    output = run_node(self, n)
    outputs.append(output)

    return output

x = torch.rand(1, 3, 32, 32)
w1 = torch.randn(32)
w2 = torch.randn(32)

with patch.object(GraphLowering, "run_node", run_node_alt):
    fn(x, w1, w2)

print("graph_lowering_obj.buffers:", graph_lowering_obj.buffers)
print("graph_lowering_obj.scheduler:", graph_lowering_obj.scheduler.nodes)
```

Output on `main`:
```
graph_lowering_obj.buffers: [ComputedBuffer(name='buf0', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg1_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul,
  origins={mul}
)), ComputedBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg4_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul_1,
  origins={mul_1}
)), ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf0, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(buf1, i3 + i1 * s0**2 + i2 * s0)
      tmp2 = tmp0 + tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=add,
  origins={add}
))]
graph_lowering_obj.scheduler: [FusedSchedulerNode(nodes=buf0_buf1), SchedulerNode(name='buf2')]
```
Output on this PR:
```
graph_lowering_obj.buffers: [ComputedBuffer(name='buf0', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg1_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul,
  origins={mul}
)), ComputedBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg4_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul_1,
  origins={mul_1}
)), ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf0, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(buf1, i3 + i1 * s0**2 + i2 * s0)
      tmp2 = tmp0 + tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=add,
  origins={add}
))]
graph_lowering_obj.scheduler: [FusedSchedulerNode(nodes=buf0_buf1_buf2)]
```

Context:
While working on https://github.com/pytorch/pytorch/pull/120411 (the upsampling bicubic decomposition), I saw an extra for-loop in the generated C++ code summing up two buffers. Exploring the cause, it happened because the buffer's number of ops went beyond `config.realize_opcount_threshold`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120077
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/peterbell10
2024-03-06 12:19:45 +00:00
aa0b0944d5 [dynamo] Re-dispatch torch.Tensor.new into torch.Tensor.new_empty method. (#121075)
Fix: https://github.com/pytorch/xla/issues/6009

This PR adds another case to `TensorVariable.method_new` special case, where it
re-dispatches `new` into `new_empty`.

Since we are using fake tensors, the `new` call doesn't actually get to the corresponding
backend (e.g. XLA). So, things like the following might happen:

```python
@torch.compile(backend="openxla")
def foo(x):
    new_x = x.new(*x.size())

    # new_x.device() == "xla"
    # x.device() == "xla:0"

    return new_x + x

a = torch.arange(10)
foo(a.to(xm.xla_device()))
```

Resulting in the following error:

```python
Traceback (most recent call last):
  ...
  File "torch/_dynamo/utils.py", line 1654, in get_fake_value
    ret_val = wrap_fake_exception(
  File "torch/_dynamo/utils.py", line 1190, in wrap_fake_exception
    return fn()
  File "torch/_dynamo/utils.py", line 1655, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "torch/_dynamo/utils.py", line 1776, in run_node
    raise RuntimeError(make_error_message(e)).with_traceback(
  File "torch/_dynamo/utils.py", line 1758, in run_node
    return node.target(*args, **kwargs)
  File "torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "torch/_subclasses/fake_tensor.py", line 885, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1224, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 955, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1445, in _dispatch_impl
    return self.wrap_meta_outputs_with_default_device_logic(
  File "torch/_subclasses/fake_tensor.py", line 1575, in wrap_meta_outputs_with_default_device_logic
    return tree_map(wrap, r)
  File "torch/utils/_pytree.py", line 900, in tree_map
    return treespec.unflatten(map(func, *flat_args))
  File "torch/utils/_pytree.py", line 736, in unflatten
    leaves = list(leaves)
  File "torch/_subclasses/fake_tensor.py", line 1550, in wrap
    ) = FakeTensor._find_common_device(func, flat_args)
  File "torch/_subclasses/fake_tensor.py", line 625, in _find_common_device
    merge_devices(arg)
  File "torch/_subclasses/fake_tensor.py", line 620, in merge_devices
    raise RuntimeError(
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., device='xla', size=(10,), dtype=torch.int64), FakeTensor(..., device='xla:0', size=(10,), dtype=torch.int64)), **{}):
Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices xla, xla:0
```

Using `new_empty` instead fixes this error because it uses the device from the source
tensor rather than inferring it from the current dispatch key set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121075
Approved by: https://github.com/jansel
2024-03-06 11:49:27 +00:00
e3bd6efe72 [dynamo][guards-cpp-refactor] Prevent duplication of leaf guards (#121164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121164
Approved by: https://github.com/jansel
ghstack dependencies: #121121, #121147, #121154
2024-03-06 08:36:45 +00:00
b6b2d5b00a [dynamo][guards-cpp-refactor] Pass source name for debug ease (#121154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121154
Approved by: https://github.com/jansel
ghstack dependencies: #121121, #121147
2024-03-06 08:36:45 +00:00
52d89d8491 [dynamo][guards-cpp-refactor] Simplify DictGuardManager by removing KeyValueDictGuardManager (#121147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121147
Approved by: https://github.com/jansel
ghstack dependencies: #121121
2024-03-06 08:36:45 +00:00
af7f55ffc8 [dynamo][guards-cpp-refactor] Add argnames in pybind'ings (#121121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121121
Approved by: https://github.com/jansel
2024-03-06 08:36:45 +00:00
0b9bfcf9bb [non-strict export] support tensor attribute without other args (#121176)
Summary: Without args we have a hard time detecting fake modes. This causes a fake mode mismatch error in non-strict (specifically, `aot_export_module`) when the module contains tensor attributes, because we create a fresh fake mode when we cannot detect one. The fix is to pass the same fake mode throughout.

Test Plan: added test

Differential Revision: D54516595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121176
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
2024-03-06 08:10:00 +00:00
8087912622 Revert "[XPU][Profiler] Add Logic To The Profiler For Processing XPU-backend Data (#120185)"
This reverts commit 0ab2ec37383e44fa00c520de6e2b40845fccc6f3.

Reverted https://github.com/pytorch/pytorch/pull/120185 on behalf of https://github.com/briancoutinho due to This PR contains a list search in '_parse_kineto_events()' that can lead to very high cost of running this post trace, training jobs getting stuck for mins ([comment](https://github.com/pytorch/pytorch/pull/120185#issuecomment-1980180774))
2024-03-06 06:39:51 +00:00
099ff51d45 torch check the division by zero in batch_norm_update_stats (#120882)
Fixes #120803

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120882
Approved by: https://github.com/CaoE, https://github.com/malfet
2024-03-06 05:40:21 +00:00
2eec0e7c5f [BE] Remove __inline__ from __global__ (#121246)
in layer_norm_kernel.cu since the qualifier seems to be ignored according to:

```
[18/263] Building CUDA object
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o
/home/mkozuki/ghq/github.com/crcrpar/torch-3/aten/src/ATen/native/cuda/layer_norm_kernel.cu(300):
warning #20050-D: inline qualifier ignored for "__global__" function

Remark: The warnings can be suppressed with "-diag-suppress
<warning-number>"

/home/mkozuki/ghq/github.com/crcrpar/torch-3/aten/src/ATen/native/cuda/layer_norm_kernel.cu(300):
warning #20050-D: inline qualifier ignored for "__global__" function

Remark: The warnings can be suppressed with "-diag-suppress
<warning-number>"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121246
Approved by: https://github.com/eqy, https://github.com/malfet
2024-03-06 05:16:52 +00:00
31bfa59970 Capture primitive data type arguments for profiling python_function (#120949)
RECORD_FUNCTION in python_function only captures arguments that are Tensors. However, it is very common for users to pass non-tensor arguments to custom ops, for example, the sequence length in a GPT attention custom op. My previous PR tried to capture all non-tensor arguments, but it turned out to be very expensive in some cases.

This PR adds support for primitive arguments (or containers of primitives) in RECORD_FUNCTION.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120949
Approved by: https://github.com/soulitzer
2024-03-06 05:09:22 +00:00
5680f565d5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-06 04:50:46 +00:00
f72eb5ae4c __grid__constant is only supported on cuda version >= 11.8 (#121275)
Summary: Update the macros to avoid using __grid__constant when compiling for devices > sm80 with a CUDA version < 11.8.

Test Plan: buck2 build --keep-going --config buck2.log_configured_graph_size=true --flagfile fbcode//mode/dev fbcode//sigrid/predictor/client/python:ig_sigrid_client_pybinding

Differential Revision: D54556796

Co-authored-by: Driss Guessous <drisspg@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121275
Approved by: https://github.com/drisspg
2024-03-06 03:44:59 +00:00
dad1b76584 Introduce EphemeralSource for symbols that should be simplified out (#120948)
Context: view fake-ification should handle closed-over state in ViewFuncs for use in view replay by:
* fake-ifying tensors
* symbolicizing SymInts

This avoids invalid specialization during view replay. However, the symbols / tensors created as intermediates in the view chain should not stick around or be guarded on. This PR introduces an `EphemeralSource` intended to be used as a source for this purpose. It has the following properties:
* Considered first to be simplified out in symbol simplification logic
* Errors if guarded on

Differential Revision: [D54561597](https://our.internmc.facebook.com/intern/diff/D54561597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120948
Approved by: https://github.com/ezyang
2024-03-06 02:30:52 +00:00
d968fc442b [FSDP] restore fully_shard after exit from mock.patch (#121058)
Manually restore fully_shard after \_\_exit\_\_ from the mock.patch ctx. This will fix flaky CIs in trunk.
```
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py
```

This is a workaround to make mock.patch(fully_shard) work with multiple threads:
* thread 1 sets func.\_\_module\_\_[fully_shard] = patched function
* thread 2 reads func.\_\_module\_\_[fully_shard], thinks it is the original, and fails to restore fully_shard during \_\_exit\_\_
* this PR manually restores fully_shard after \_\_exit\_\_

Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121058
Approved by: https://github.com/awgu
2024-03-06 02:14:59 +00:00
eqy
8dafc81ba9 [cuBLAS][cuBLASLt] Fix expected failures for int_mm on sm75 (turing) (#121277)
CC @malfet @atalman @ptrblck @tinglvv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121277
Approved by: https://github.com/malfet
2024-03-06 01:51:01 +00:00
ce6a7d56fc Don't merge qnnpack (#120676)
Summary: The qnnpack library merge fails on some applications. This fix implements the Android build team's recommendation to prevent merging for qnnpack.

Test Plan:
1. Measure the binary size impact
1. Release build failed previously; now it should succeed

Differential Revision: D54048156

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120676
Approved by: https://github.com/kimishpatel
2024-03-06 01:42:13 +00:00
4b3903379a Add assign argument to torch.Tensor.module_load (#121158)
Make `torch.__future__.get_swap_module_params_on_conversion() == True` account for `assign` argument to `nn.Module.load_state_dict`

Similar to when `torch.__future__.set_swap_module_params_on_conversion()` is `False`, `assign=True` means that we do not incur a `self.copy_(other)` and the properties of `other` will be preserved
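
A minimal sketch of the intended behavior, assuming a plain float64 state dict for an `nn.Linear` (names and shapes are illustrative only):
```python
import torch
import torch.nn as nn

torch.__future__.set_swap_module_params_on_conversion(True)

m = nn.Linear(2, 2)
sd = {"weight": torch.randn(2, 2, dtype=torch.float64),
      "bias": torch.randn(2, dtype=torch.float64)}

# With assign=True there is no self.copy_(other), so the properties of the
# incoming tensors (here the float64 dtype) are preserved on the module.
m.load_state_dict(sd, assign=True)
print(m.weight.dtype)  # torch.float64
```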

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121158
Approved by: https://github.com/albanD
ghstack dependencies: #121157
2024-03-06 01:32:06 +00:00
27389e03f0 [easy] Fixed requires_grad preservation for nn.Module.load_state_dict(assign=True) (#121157)
Always preserve requires_grad of param in module. Documentation fixed in PR stacked above.
Also fix the test case to load a state_dict generated with `keep_vars=False` (the default).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121157
Approved by: https://github.com/albanD
2024-03-06 01:32:06 +00:00
87a533ed1b c10:intrusive_ptr, self assignment (#119275)
Summary:
In C++ books/sources, the self-assignment check is often considered a bad practice, since self-assignment is very unlikely.

See, for example, that libc++ doesn't have it:
cf94e0082e/libcxx/include/__memory/shared_ptr.h (L651)

How about we remove it?

Test Plan:
This check accounts for roughly 1% of the cycles assigned to intrusive_ptr::operator=
https://fburl.com/scuba/strobelight_services/9qqnrkdn

This is not a lot in absolute cycles, but since these are GPU machines, it can be substantial.

Differential Revision: D53471639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119275
Approved by: https://github.com/cyyever, https://github.com/ezyang
2024-03-06 01:11:56 +00:00
412c687e2e Fix permuted sum precision issue for lower precision on CPU (#108559)
Fixes #83149
There is a limitation of `TensorIterator` reductions:
The non-permuted input tensor will be coalesced down to a 2-d tensor by `TensorIterator`, whereas the permuted case may become a >2-d operation (for example, two reduced dimensions and a non-reduced dim).
Since the CPU reduction loop of `TensorIterator` only operates on two dimensions at a time, the intermediate sums will be truncated to lower precision.
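
A hedged illustration of the kind of discrepancy this addresses, comparing the contiguous and permuted reductions against a float64 reference (exact magnitudes depend on the random input):
```python
import torch

x = torch.randn(64, 128, 256, dtype=torch.bfloat16)
ref = x.double().sum()

# Before the fix, the permuted (non-contiguous) reduction could accumulate in
# lower precision than the contiguous one on CPU.
print((x.sum().double() - ref).abs())
print((x.permute(2, 0, 1).sum().double() - ref).abs())
```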

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108559
Approved by: https://github.com/mingfeima, https://github.com/peterbell10
2024-03-06 01:01:35 +00:00
34e3f6f3c9 fix segfault in torch.native_channel_shuffle when input is empty (#121199)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

fix https://github.com/pytorch/pytorch/issues/121092

`torch.channel_shuffle` already handles empty inputs correctly. `torch.native_channel_shuffle` bypassed the `numel == 0` check, which caused a division by zero in the underlying kernel.
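
A small sketch of the fixed case, assuming an empty batch dimension:
```python
import torch

x = torch.empty(0, 4, 2, 2)  # empty input (batch size 0)
# torch.channel_shuffle already handled this; native_channel_shuffle skipped
# the numel == 0 check and divided by zero inside the kernel.
out = torch.native_channel_shuffle(x, 2)
print(out.shape)  # torch.Size([0, 4, 2, 2])
```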

* __->__ #121199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121199
Approved by: https://github.com/malfet
2024-03-06 00:46:36 +00:00
8473cd92e4 remove compute capability 3.5 for CUDA 12 (#114930)
CUDA 12 has removed compute capability 3.5. NVCC throws the error: `nvcc fatal   : Unsupported gpu architecture 'compute_35'`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114930
Approved by: https://github.com/malfet
2024-03-06 00:40:57 +00:00
d13ed8503c CI: Add aarch64 docker build and ciflow tags (#120931)
Adding workflows for the aarch64 Linux docker build with ACL installed as a system dependency.

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120931
Approved by: https://github.com/atalman, https://github.com/malfet
2024-03-06 00:31:22 +00:00
cac36e232e [PyTorch] Split StaticModule out of test_static_runtime (#121028)
I want to use StaticModule in another (internal) test, so splitting it out.

Differential Revision: [D54384817](https://our.internmc.facebook.com/intern/diff/D54384817/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121028
Approved by: https://github.com/suo
2024-03-05 23:14:07 +00:00
f5391dad82 Update docs to point to new sdpa_kernel context manager (#121180)
# Summary

Updates the SDPA docs to fix some small inaccuracies and point to the new sdpa_kernel context manager. The enum-like SDPBackend type bound from C++ does not render its fields for some reason, so they are listed manually for now.
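
A hedged usage sketch of the context manager the docs now point to (the backend choice, CUDA availability, and tensor shapes are assumptions here):
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q = k = v = torch.randn(2, 4, 8, 16, device="cuda", dtype=torch.float16)

# Restrict scaled_dot_product_attention to a single backend for this region.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```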

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121180
Approved by: https://github.com/mikaylagawarecki
2024-03-05 22:19:48 +00:00
8bb3e0b643 [pytorch] Name the main and autograd threads for better debugging (#121170)
The main thread and the autograd threads are latency-critical. They launch CPU/GPU/accelerator kernels, and if for some reason they get preempted, the rank can become a straggler in a distributed training application. Naming these threads lets us debug performance issues that impact the latency-sensitive threads.

I used Kineto traces to verify if the thread names were propagated:

<img width="851" alt="Screenshot 2024-03-04 at 3 07 43 PM" src="https://github.com/pytorch/pytorch/assets/23515689/68b4a09c-b8e5-4f14-a5c0-6593f866c03f">

Also:

```
nvidia-smi
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3065920      C   ...me#python#py_version_3_10     1968MiB |
|    1   N/A  N/A   3065926      C   ...me#python#py_version_3_10     1978MiB |
|    2   N/A  N/A   3065930      C   ...me#python#py_version_3_10     2084MiB |
|    3   N/A  N/A   3065936      C   ...me#python#py_version_3_10     2016MiB |
|    4   N/A  N/A   3065939      C   ...me#python#py_version_3_10     1998MiB |
|    5   N/A  N/A   3065943      C   ...me#python#py_version_3_10     2070MiB |
|    6   N/A  N/A   3065948      C   ...me#python#py_version_3_10     2026MiB |
|    7   N/A  N/A   3065952      C   ...me#python#py_version_3_10     2070MiB |
+-----------------------------------------------------------------------------+
[me@myhost ~]$ ps -T -p 3065920
    PID    SPID TTY          TIME CMD
3065920 3065920 pts/14   00:01:04 pt_main_thread
...
3065920 3092181 pts/14   00:00:40 pt_autograd_d0
3065920 3092182 pts/14   00:00:00 pt_autograd_d1
3065920 3092183 pts/14   00:00:00 pt_autograd_d2
3065920 3092184 pts/14   00:00:00 pt_autograd_d3
3065920 3092185 pts/14   00:00:00 pt_autograd_d4
3065920 3092186 pts/14   00:00:00 pt_autograd_d5
3065920 3092187 pts/14   00:00:00 pt_autograd_d6
3065920 3092188 pts/14   00:00:00 pt_autograd_d7
...

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121170
Approved by: https://github.com/albanD
2024-03-05 22:15:39 +00:00
24944f6717 [doc] Fix math display in ChannelShuffle doc (#121247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121247
Approved by: https://github.com/mikaylagawarecki
2024-03-05 21:30:51 +00:00
b3a9d677a3 [ez] Add super() calls in test_custom_ops (#121239)
Some disable issues are getting spammed.
Check that test_impl_invalid_devices gets skipped by the disable issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121239
Approved by: https://github.com/zou3519
2024-03-05 21:16:06 +00:00
34a28f01dd [Autograd] Improve error for leaf tensors as out argument to fallback (#121089)
Closes  #120988

Currently operators that hit the autograd fallback call `check_inplace`
on all mutated inputs, including out arguments. This leads to a slightly
confusing error message:
```
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```

Compared to functions that don't fallback, which raise
```
RuntimeError: add(): functions with out=... arguments don't support automatic differentiation, but one of the arguments requires grad.
```

This changes the error message to make clear the issue is with the out argument,
but does not tighten the check to outright ban out arguments that require grad.
Instead, I use the same checks from `check_inplace` which allows non-leaf tensors
that require grad to pass without error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121089
Approved by: https://github.com/lezcano, https://github.com/soulitzer
ghstack dependencies: #121142
2024-03-05 21:13:27 +00:00
eae9751e82 Fix linalg_eigvals invalid use of composite dispatch key (#121142)
`linalg_eigvals_out` calls into a dispatch stub, so only supports CPU and CUDA
strided tensors but incorrectly claimed to be a composite op. `linalg_eigvals`
also shouldn't defer to the out variant inside a `CompositeImplicitAutograd` op
as not all types support out variants. Instead, I add a new helper
`_linalg_eigvals` which does the same thing in a non-composite operator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121142
Approved by: https://github.com/lezcano
2024-03-05 21:13:27 +00:00
393b4ab432 Fixes issue_119785 (#121048)
Fixes #119785

- Removed all sentinel files of `test_causal_variants_.*`.

- The `test_causal_variants_causal_variant_` tests could pass after removing the dynamo_skips files.

- The `test_causal_variants_compile_causal_variant` tests fail with `PYTORCH_TEST_WITH_DYNAMO=1`. These tests already call torch.compile, so @skipIfTorchDynamo was added to skip them under `PYTORCH_TEST_WITH_DYNAMO`.

**Tests**
```
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test_transformers.py -v -k "test_causal_variants"
================================================================== test session starts ==================================================================
platform linux -- Python 3.10.13, pytest-7.4.0, pluggy-1.0.0 -- /home/shuqiyang/.conda/envs/pytorch/bin/python
cachedir: .pytest_cache
rootdir: /data/users/shuqiyang/pytorch
configfile: pytest.ini
collected 77250 items / 77218 deselected / 32 selected
Running 32 items in this shard

test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cpu PASSED [0.7745s]                  [  3%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cpu PASSED [0.8020s]                  [  6%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cpu SKIPPED [0.0385s] (Lower righ...) [  9%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cpu PASSED [0.5046s]                  [ 12%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape0_cpu PASSED [0.6483s]                   [ 15%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape1_cpu PASSED [0.8537s]                   [ 18%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape2_cpu PASSED [0.8388s]                   [ 21%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape3_cpu PASSED [0.4859s]                   [ 25%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cpu SKIPPED [0.0084s] (Th...) [ 28%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cpu SKIPPED [0.0086s] (Th...) [ 31%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cpu SKIPPED [0.0081s] (Th...) [ 34%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cpu SKIPPED [0.0085s] (Th...) [ 37%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape0_cpu SKIPPED [0.0082s] (Thi...) [ 40%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape1_cpu SKIPPED [0.0085s] (Thi...) [ 43%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape2_cpu SKIPPED [0.0081s] (Thi...) [ 46%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape3_cpu SKIPPED [0.0085s] (Thi...) [ 50%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda PASSED [9.4185s]                [ 53%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cuda PASSED [0.4273s]                [ 56%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cuda SKIPPED [0.0280s] (Lower ri...) [ 59%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cuda PASSED [8.0999s]                [ 62%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape0_cuda PASSED [0.3785s]                 [ 65%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape1_cuda PASSED [0.3818s]                 [ 68%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape2_cuda PASSED [0.3864s]                 [ 71%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape3_cuda PASSED [0.7668s]                 [ 75%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda SKIPPED [0.0089s] (...) [ 78%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cuda SKIPPED [0.0087s] (...) [ 81%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cuda SKIPPED [0.0087s] (...) [ 84%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cuda SKIPPED [0.0084s] (...) [ 87%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape0_cuda SKIPPED [0.0087s] (T...) [ 90%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape1_cuda SKIPPED [0.0087s] (T...) [ 93%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape2_cuda SKIPPED [0.0084s] (T...) [ 96%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape3_cuda SKIPPED [0.0087s] (T...) [100%]

=================================================== 14 passed, 18 skipped, 77218 deselected in 39.72s ===================================================
```
```
$ pytest test_transformers.py -v -k "test_causal_variants"
================================================================== test session starts ==================================================================
platform linux -- Python 3.10.13, pytest-7.4.0, pluggy-1.0.0 -- /home/shuqiyang/.conda/envs/pytorch/bin/python
cachedir: .pytest_cache
rootdir: /data/users/shuqiyang/pytorch
configfile: pytest.ini
collected 77250 items / 77218 deselected / 32 selected
Running 32 items in this shard

test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cpu PASSED [0.2410s]                  [  3%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cpu PASSED [0.3984s]                  [  6%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cpu SKIPPED [0.0011s] (Lower righ...) [  9%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cpu PASSED [0.0095s]                  [ 12%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape0_cpu PASSED [0.1749s]                   [ 15%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape1_cpu PASSED [0.2138s]                   [ 18%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape2_cpu PASSED [0.2715s]                   [ 21%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape3_cpu PASSED [0.0108s]                   [ 25%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cpu PASSED [0.4864s]          [ 28%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cpu PASSED [0.5346s]          [ 31%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cpu SKIPPED [0.0011s] (Lo...) [ 34%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cpu PASSED [0.1722s]          [ 37%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape0_cpu PASSED [0.2341s]           [ 40%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape1_cpu PASSED [0.4786s]           [ 43%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape2_cpu PASSED [0.4635s]           [ 46%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape3_cpu PASSED [0.0861s]           [ 50%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda PASSED [9.7579s]                [ 53%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cuda PASSED [0.0044s]                [ 56%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cuda SKIPPED [0.0007s] (Lower ri...) [ 59%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cuda PASSED [9.2065s]                [ 62%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape0_cuda PASSED [0.0081s]                 [ 65%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape1_cuda PASSED [0.0063s]                 [ 68%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape2_cuda PASSED [0.0059s]                 [ 71%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape3_cuda PASSED [0.0055s]                 [ 75%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda PASSED [0.1200s]        [ 78%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cuda PASSED [0.1032s]        [ 81%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cuda SKIPPED [0.0010s] (...) [ 84%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cuda PASSED [0.1151s]        [ 87%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape0_cuda PASSED [0.0705s]         [ 90%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape1_cuda PASSED [0.0713s]         [ 93%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape2_cuda PASSED [0.0696s]         [ 96%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape3_cuda PASSED [0.1516s]         [100%]

=================================================== 28 passed, 4 skipped, 77218 deselected in 39.23s ====================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121048
Approved by: https://github.com/zou3519
2024-03-05 20:19:02 +00:00
8ccf8b2c47 Avoid COW input materialize in more forward ops (#121070)
Affected operators are: addr, cdist, sparse.sampled_addmm, sparse.mm,
matrix_exp, softmax, cross_entropy

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121070
Approved by: https://github.com/ezyang
2024-03-05 19:47:24 +00:00
81dbc487c7 ci: add "typing_extensions" package to ci requirements list (#121136)
This is required for torchgen.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121136
Approved by: https://github.com/malfet, https://github.com/atalman
2024-03-05 18:26:01 +00:00
3239f86a3d [ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)
According to the [cuBLAS API Reference](https://docs.nvidia.com/cuda/cublas/index.html#cublassetworkspace) the recommended workspace size for Hopper is 32 MiB and for the rest architectures 4 MiB. This PR increases the workspace size accordingly. I am not aware of the recommended workspace size for HIP, that is why I am keeping it unchanged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120925
Approved by: https://github.com/eqy, https://github.com/malfet
2024-03-05 18:13:05 +00:00
8aeb247a3d [export] Remove WrapperModule. (#121042)
Summary: WrapperModule seems like a good idea but may introduce some surprising behavior for users. For example, it never registers enclosed modules as submodules, and therefore it's unclear what the state dict for the exported program should look like: some people may argue for including every state in the state dict, while others want to keep them as constants.

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D54326331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121042
Approved by: https://github.com/angelayi
2024-03-05 18:10:22 +00:00
0e604becc5 [NJT] support chunk on batch dim (#119713)
- support chunk op on batch dim
- support empty_like op
- add tests for the like ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119713
Approved by: https://github.com/jbschlosser
2024-03-05 17:57:50 +00:00
ae4c85960f Add Deberta pass (#121206)
Adding DebertaForQuestionAnswering to inductor benchmark pass, as it did not show up before

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121206
Approved by: https://github.com/desertfire
2024-03-05 17:56:25 +00:00
5abf7972d1 [DCP][state_dict] Implement pin_memory and shared_memory copy for _offload_state_dict_to_cpu (#120378)
**Summary**
This PR extends `_offload_state_dict_to_cpu` to accept a `cpu_offload_state_dict` argument. If `cpu_offload_state_dict` is not None, `_offload_state_dict_to_cpu` will use `copy_` to copy the GPU data to the CPU tensors. This allows users to pass a pin_memory or share_memory version of `cpu_offload_state_dict`.

This PR also adds `_create_cpu_state_dict` to allow users to easily create a pin_memory or share_memory cpu state_dict.

**Performance improvement**
```
# The micro-benchmark has a source state_dict with 150 tensors, and each tensor is 50MB.
# The micro-benchmark is run on a H100 machine with PCIe 5

cpu_state_dict_2 = _create_cpu_state_dict(state_dict, pin_memory=True)
cpu_state_dict_3 = _create_cpu_state_dict(state_dict, share_memory=True)

# GPU->CPU memory: 4.6556 seconds
cpu_state_dict = _offload_state_dict_to_cpu(state_dict)

# GPU->pin memory: 0.1566 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)

# GPU->shared memory: 0.5509 seconds (variation is quite large)
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_3)

# GPU->pin memory->shared memory: 0.2550 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)
_offload_state_dict_to_cpu(cpu_state_dict_2, cpu_offload_state_dict=cpu_state_dict_3)
```

Differential Revision: [D54045845](https://our.internmc.facebook.com/intern/diff/D54045845/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120378
Approved by: https://github.com/LucasLLC
2024-03-05 17:48:15 +00:00
cyy
6ecd65886a Remove unnecessary const_casts (#121225)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121225
Approved by: https://github.com/soulitzer
2024-03-05 17:34:24 +00:00
85c807b3fd [export] Ensure optional fields always have default value. (#121163)
Summary: Add additional check to make sure we can always unset an optional field.

Test Plan: CI

Differential Revision: D54504243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121163
Approved by: https://github.com/tugsbayasgalan
2024-03-05 17:16:49 +00:00
35004b8ab4 [dynamo] Fix handling of invalid args (#121110)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121110
Approved by: https://github.com/yanboliang
ghstack dependencies: #121106
2024-03-05 17:16:04 +00:00
4f19b5f7ef [dynamo] Remove extra guard for tensor constant attrs (#121106)
Also deletes some unused code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121106
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-03-05 17:16:04 +00:00
e4352182bd Disable remote cache test on ROCM (#121210)
Fixes #121194
Fixes #121166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121210
Approved by: https://github.com/aakhundov
2024-03-05 16:35:40 +00:00
f25a25fde5 Fix lintrunner-noclang (#121205)
Fix lintrunner-noclang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121205
Approved by: https://github.com/Skylion007
2024-03-05 16:18:36 +00:00
fbf36d01a0 Update Triton (#119457)
Fix pytorch nightly compilation for cuda linking

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119457
Approved by: https://github.com/lezcano
2024-03-05 15:04:12 +00:00
59d9f1e227 Spectral norm value test (#121068)
The spectral norm implementation has extensive tests, but there doesn't appear to be any check that the spectral norm (= top singular value) is actually calculated correctly. There should be at least one such test case.

This adds one such testcase for the parameterizations.py implementation of spectral norm.
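
A minimal sketch of what such a value check might look like (not the actual test added; shapes, iteration count, and tolerance here are assumptions):

```
import torch
from torch.nn.utils.parametrizations import spectral_norm

lin = spectral_norm(torch.nn.Linear(8, 8))
for _ in range(50):
    lin(torch.randn(4, 8))  # forward passes run the power iteration

# The parametrized weight should have top singular value ~= 1.
top_sv = torch.linalg.matrix_norm(lin.weight.detach(), ord=2)
assert torch.allclose(top_sv, torch.tensor(1.0), atol=5e-2)
```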

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121068
Approved by: https://github.com/soulitzer
2024-03-05 14:46:31 +00:00
d621e3e3b8 Add exhaustive module and optimizer tests for torch.load(state_dict, weights_only=True) (#121049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121049
Approved by: https://github.com/janeyx99
2024-03-05 14:27:50 +00:00
42821d462a [ATen][Native][CUDA] Decrease max_threads in ctc_loss (#120746)
There will be some changes in CUDA 12.4 that would require a smaller number of threads per block with double precision in `ctc_loss`. This PR addresses that change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120746
Approved by: https://github.com/ptrblck, https://github.com/janeyx99
2024-03-05 14:14:41 +00:00
12191f4b3e Fix make triton command on release branch (#121169)
Fixes #120044

Should fix build from source instructions on release branch here: https://github.com/pytorch/pytorch#from-source

Please note we are using /test/ channel for release here to make sure it works, before actual release is completed.

Test main:
```
make triton
pip3 uninstall -y triton
WARNING: Skipping triton as it is not installed.
Looking in indexes: https://download.pytorch.org/whl/nightly/
Collecting pytorch-triton==3.0.0+a9bc1a3647
  Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.0.0%2Ba9bc1a3647-cp310-cp310-linux_x86_64.whl (239.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 239.0/239.0 MB 8.7 MB/s eta 0:00:00
Requirement already satisfied: filelock in /home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages (from pytorch-triton==3.0.0+a9bc1a3647) (3.13.1)
Installing collected packages: pytorch-triton
  Attempting uninstall: pytorch-triton
    Found existing installation: pytorch-triton 2.2.0
    Uninstalling pytorch-triton-2.2.0:
      Successfully uninstalled pytorch-triton-2.2.0
Successfully installed pytorch-triton-3.0.0+a9bc1a3647
```

Test release/2.2:
```
make triton
pip3 uninstall -y triton
WARNING: Skipping triton as it is not installed.
Looking in indexes: https://download.pytorch.org/whl/test/
Collecting pytorch-triton==2.2.0
  Using cached https://download.pytorch.org/whl/test/pytorch_triton-2.2.0-cp310-cp310-linux_x86_64.whl (183.1 MB)
Requirement already satisfied: filelock in /home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages (from pytorch-triton==2.2.0) (3.13.1)
Installing collected packages: pytorch-triton
  Attempting uninstall: pytorch-triton
    Found existing installation: pytorch-triton 3.0.0+a9bc1a3647
    Uninstalling pytorch-triton-3.0.0+a9bc1a3647:
      Successfully uninstalled pytorch-triton-3.0.0+a9bc1a3647
Successfully installed pytorch-triton-2.2.0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121169
Approved by: https://github.com/seemethere
2024-03-05 13:53:53 +00:00
ee557d8f61 skip detectron2_fcos_r_50_fpn in dynamic shape test (#120697)
As reported in https://github.com/pytorch/pytorch/issues/119434, `detectron2_fcos_r_50_fpn` failed with dynamic shape testing, so we propose to skip the dynamic batch size testing of this model in this PR.

* Error msg is
```
  File "/home/jiayisun/pytorch/benchmarks/dynamo/common.py", line 3877, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```

* Root Cause is
The benchmark code only annotates an input dim as dynamic when its size equals the batch size (c617e7b407/benchmarks/dynamo/common.py (L3867-L3871)). If it fails to find any dim equal to the batch size, the above error is thrown.
However, the inputs of `detectron2_fcos_r_50_fpn` are as follows:

```
([{'file_name': '/home/jiayisun/benchmark/torchbenchmark/data/.data/coco2017-minimal/coco/val2017/000000001268.jpg', 'height': 427, 'width': 640, 'image_id': 1268, 'image': tensor([[[147., 124.,  82.,  ...,   3.,   4.,   5.],
         [125., 104.,  65.,  ...,   3.,   3.,   4.],
         [ 87.,  68.,  34.,  ...,   2.,   2.,   2.],
         ...,
         [ 47.,  45.,  41.,  ...,  45.,  45.,  45.],
         [ 46.,  44.,  40.,  ...,  44.,  45.,  46.],
         [ 46.,  44.,  40.,  ...,  43.,  45.,  46.]],

        [[154., 129.,  84.,  ...,   3.,   4.,   5.],
         [133., 110.,  69.,  ...,   3.,   3.,   4.],
         [ 95.,  76.,  43.,  ...,   2.,   2.,   2.],
         ...,
         [ 44.,  42.,  38.,  ...,  34.,  37.,  39.],
         [ 43.,  41.,  37.,  ...,  35.,  39.,  41.],
         [ 43.,  41.,  37.,  ...,  35.,  40.,  43.]],

        [[171., 140.,  85.,  ...,   3.,   4.,   5.],
         [147., 120.,  71.,  ...,   3.,   3.,   4.],
         [103.,  83.,  47.,  ...,   2.,   2.,   2.],
         ...,
         [ 46.,  44.,  40.,  ...,  16.,  20.,  22.],
         [ 45.,  43.,  39.,  ...,  17.,  22.,  26.],
         [ 45.,  43.,  39.,  ...,  18.,  24.,  28.]]])}, ... ],)
```

None of the input dims equal the input batch size, so I think we need to skip the dynamic batch size testing for this model.
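
A rough sketch of the harness logic described above (function and variable names are assumed, not the actual benchmark code):

```
import torch

def mark_batch_dims_dynamic(example_inputs, batch_size):
    marked = False
    for t in example_inputs:
        if not isinstance(t, torch.Tensor):
            continue
        for dim, size in enumerate(t.shape):
            if size == batch_size:
                torch._dynamo.mark_dynamic(t, dim)
                marked = True
    # detectron2_fcos_r_50_fpn trips this assert: no input dim matches the batch size
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
```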

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120697
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/desertfire
2024-03-05 12:12:18 +00:00
c4a1570864 Temporarily increased compile time limit of #GPUs to 120. (#121076)
Fixes #115331.

This is a temporary fix that increases the compile-time limit on the number of GPUs to 120 until #119639 can be merged. Changing the parameter to 128 leads to annoying errors, as some checks would become tautological (`int8_t` is always < 128).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121076
Approved by: https://github.com/albanD
2024-03-05 11:39:14 +00:00
de8af28083 [FSDP][StateDict] Allow FULL_STATE_DICT option for 2D (#120837)
Fixes #120722

TL;DR for the issue:
As users are expected to use get_model_state_dict to do state_dict retrieval, I think it's fine to remove the warning and RuntimeError.
More context in #120722.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120837
Approved by: https://github.com/Skylion007
2024-03-05 10:03:44 +00:00
cyy
507611f9ae [CUDACachingAllocator] Turn Allocator::allocate into non-const (#120969)
Ideally, the method should be non-const since it changes the allocator state. Some const_casts are also removed along the way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120969
Approved by: https://github.com/albanD
2024-03-05 09:53:05 +00:00
46c9d646dd [Dynamo] Fix inspect.getattr_static doesn't work well for torch.utils._cxx_pytree.PyTreeSpec (#120812)
Fixes #118793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120812
Approved by: https://github.com/zou3519
2024-03-05 09:05:26 +00:00
311cc564f6 Fix README Typo (#120892)
Fixes a README typo so that the prompt is consistent with VSCode 1.87.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120892
Approved by: https://github.com/albanD, https://github.com/drisspg
2024-03-05 09:05:21 +00:00
a7e93c341f [hoo] Add with_effects to handle side effectful ops (#120296)
Proposal: https://docs.google.com/document/d/179QyhicGzTXJ5jvTAoAosP_Nzgf3PpgZwU_E3VV9PlM/edit#heading=h.bnm38nu3yfno
Implementation discussion: https://docs.google.com/document/d/179QyhicGzTXJ5jvTAoAosP_Nzgf3PpgZwU_E3VV9PlM/edit#heading=h.bj61609o1buq

Result with print:
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %with_effects : [num_users=1] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%arg0_1, aten.print.default, moo), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 0), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %arg1_1), kwargs = {})
    return [getitem, add]
```

Follow ups:
* Add handling to auto_functionalize
* Add support for tokens on the export side
* Add support for tokens on the inductor side

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120296
Approved by: https://github.com/zou3519
2024-03-05 08:58:32 +00:00
29976519a1 Make configs hash part of remote cache key (#121152)
Summary:
While testing I noticed that if we generate different configs, we will fail to use the remote cache, so let's include the configs in the cache key.
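
An illustrative sketch of the idea (the hashing scheme and names here are assumptions, not the actual implementation):

```
import hashlib
import json

def remote_cache_key(kernel_source: str, configs) -> str:
    # Fold the generated configs into the key so two runs that produce
    # different configs never share (and clobber) the same cache entry.
    configs_repr = json.dumps(sorted(repr(c) for c in configs))
    return hashlib.sha256((kernel_source + configs_repr).encode()).hexdigest()
```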

Not sure how to write a deterministic test for this.

Test Plan: existing tests

Differential Revision: D54500957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121152
Approved by: https://github.com/aakhundov
2024-03-05 08:01:24 +00:00
43416e3059 Correctly read the cache key for remote cache (#121151)
Summary: While investigating why we were calling put each time, I noticed that the memcache backend returns a list instead of a direct result, which means we were correctly fetching the cached result but not using it.

Test Plan: The test should now work as expected

Differential Revision: D54500851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121151
Approved by: https://github.com/aakhundov
2024-03-05 07:33:20 +00:00
9e16622397 Move JK check to on-demand (#121182)
Summary: Some tests are failing due to checking JK during forking. Let's move the JK check to on-demand.

Differential Revision: D54518293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121182
Approved by: https://github.com/aakhundov
2024-03-05 07:03:25 +00:00
9ccff0aff9 Remove ids_of_folded_args from test_triton_kernel_equal_to_1_arg (#121192)
Summary: Due to the Triton pin update in https://github.com/pytorch/pytorch/pull/119457, `test_triton_kernel_equal_to_1_arg` started to break, as `ids_of_folded_args` has vanished from the upstream Triton codebase.

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 6 tests in 6.790s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121192
Approved by: https://github.com/oulgen, https://github.com/bertmaher
2024-03-05 06:35:04 +00:00
4b49bc19e8 [export][reland] Disable exported_program.__call__ (#120019)
Summary: Reland of D53075378 / https://github.com/pytorch/pytorch/pull/119466

Test Plan: CI

Differential Revision: D53827930

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120019
Approved by: https://github.com/ydwu4
2024-03-05 05:29:46 +00:00
6ddf5cf85e [AOTI] Update cpp wrapper codegen to use v2 C shim (#120714)
Summary: To use the torchgen-ed v2 C shim interface, cpp wrapper codegen needs to update its rule for generating the right parameter and function call. Because changing the emitted code will cause a FC breakage, we add a flag to control the behavior.

Differential Revision: [D54258086](https://our.internmc.facebook.com/intern/diff/D54258086)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120714
Approved by: https://github.com/chenyang78
ghstack dependencies: #120513
2024-03-05 04:32:32 +00:00
bd19d6d822 [AOTI] Use torchgen to generate C shim functions (#120513)
Summary: The current C shim layer manually implements a C interface for a handful of ops. Obviously that's not scalable if we want to extend it to cover all aten ops. This new torchgen script automatically generates C shim interfaces for CPU and CUDA backends. The interface follows the same parameter passing rules as the current C shim layer, such as

* Use plain C data types to pass parameters
* Use AtenTensorHandle to pass at::Tensor
* Use pointer type to pass optional parameter
* Use pointer+length to pass list
* Use device_type+device_index to pass device
* When a parameter is a pointer of pointer, e.g. AtenTensorHandle**, the script generates either a list of optional values or an optional list of values

https://gist.github.com/desertfire/83701532b126c6d34dae6ba68a1b074a is an example of the generated torch/csrc/inductor/aoti_torch/generated/c_shim_cuda.cpp file. The current version doesn't generate C shim wrappers for all aten ops, and probably generates more wrappers than needed on the other hand, but it should serve as a good basis.

This PR by itself won't change AOTI codegen and thus won't introduce any FC breakage. The actual wrapper codegen changes will come in another PR with some version control flag to avoid FC breakage.

Differential Revision: [D54258087](https://our.internmc.facebook.com/intern/diff/D54258087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120513
Approved by: https://github.com/jansel
2024-03-05 04:28:44 +00:00
ffe45a8188 [ATen-vulkan] Implement global shader registry (#121088)
Differential Revision: D54447700

## Context

This changeset updates Vulkan SPIR-V codegen to introduce a global SPIR-V shader registry and register shaders dynamically at static initialization time. This change makes it possible to define and link custom shader libraries to the ATen-Vulkan runtime.

Before:

* `gen_vulkan_spv.py` generated two files, `spv.h` and `spv.cpp` which would contain the definition and initialization of Vulkan shader registry variables.

After:

* Introduce the `ShaderRegistry` class in `api/`, which encapsulates functionality of the `ShaderRegistry` class previously defined in the generated `spv.h` file
* Introduce a global shader registry (defined as a static variable in the `api::shader_registry()` function)
* Define a `ShaderRegisterInit` class (taking inspiration from `TorchLibraryInit`) that allows for dynamic shader registration
* `gen_vulkan_spv.py` now only generates `spv.cpp`, which defines a static `ShaderRegisterInit` instance that triggers registration of the compiled shaders to the global shader registry.

Benefits:

* Cleaner code base; we no longer have `ShaderRegistry` defined in a generated file, and don't need a separate implementation file (`impl/Registry.*`) to handle shader lookup. All that logic now lives under `api/ShaderRegistry.*`
* Makes it possible to compile and link separate shader libraries, providing similar flexibility as defining and linking custom ATen operators

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121088
Approved by: https://github.com/manuelcandales, https://github.com/jorgep31415
2024-03-05 03:56:57 +00:00
c3c618c750 Update torchbench pin (#121029)
Fixes https://github.com/pytorch/pytorch/issues/117280 after bumping the HF version in https://github.com/pytorch/benchmark/pull/2179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121029
Approved by: https://github.com/desertfire
2024-03-05 03:21:32 +00:00
a15c02562a Fix dynamo failure (#121167)
Summary: Title

Test Plan: CI

Differential Revision: D54509198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121167
Approved by: https://github.com/izaitsevfb
2024-03-05 03:19:59 +00:00
3381f282c3 Revert "Update Triton (#119457)"
This reverts commit d49864f6a526d3def25f8da2fa9b8815b3347b9d.

Reverted https://github.com/pytorch/pytorch/pull/119457 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing test_triton_kernels in trunk d49864f6a5 ([comment](https://github.com/pytorch/pytorch/pull/119457#issuecomment-1977792634))
2024-03-05 01:46:44 +00:00
9deaa2e812 [BE]: FURB187 Use inplace reverse on lists: faster, more readable. (#121140)
Use the `reverse()` method, as it's faster and in-place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121140
Approved by: https://github.com/albanD
2024-03-05 01:36:17 +00:00
ec4146c535 [inductor] skip foreach kernel for benchmark fusion (#121168)
Benchmark fusion currently does not support foreach kernels. If we don't explicitly skip foreach kernels, we end up with exceptions in `codegen_node_schedule` because individual nodes in a foreach kernel may have incompatible shapes from a pointwise/reduction perspective.

cc Manman Ren ( @manman-ren ) who reported the issue when turning on benchmark fusion on BertForMaskedLM.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121168
Approved by: https://github.com/Chillee
2024-03-05 01:27:55 +00:00
bcf35c6ae6 [tensorboard] Handle bfloat16 type in add_histogram (#120087)
Summary:
add_histogram fails for this data type. Updating conversion code to handle it.

Stack trace for the failure -

`
[trainer0]Traceback (most recent call last):
[trainer0]  File "<torch_package_0>.tensorboard/logging/summary_v2.py", line 203, in unscriptable_record_summary
[trainer0]    unscriptable_histogram(name, t, step, ranks)
[trainer0]  File "<torch_package_0>.tensorboard/logging/fx_v1.py", line 146, in unscriptable_histogram
[trainer0]    Adhoc.writer().add_histogram(tag, x, step.int())
[trainer0]  File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/writer.py", line 40, in wrapper
[trainer0]    resp = super_method(*args, **kwargs)
[trainer0]  File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/writer_oss.py", line 526, in add_histogram
[trainer0]    histogram(tag, values, bins, max_bins=max_bins), global_step, walltime
[trainer0]  File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/summary.py", line 482, in histogram
[trainer0]    values = make_np(values)
[trainer0]  File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/_convert_np.py", line 23, in make_np
[trainer0]    return _prepare_pytorch(x)
[trainer0]  File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/_convert_np.py", line 30, in _prepare_pytorch
[trainer0]    x = x.detach().cpu().numpy()
[trainer0]TypeError: Got unsupported ScalarType BFloat16
`
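
A minimal sketch of the kind of conversion fix described (the actual change is in `_convert_np.py`; the exact code here is an assumption):

```
import torch

def to_numpy_safe(x: torch.Tensor):
    # numpy has no bfloat16 dtype, so upcast before calling .numpy()
    if x.dtype == torch.bfloat16:
        x = x.float()
    return x.detach().cpu().numpy()
```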

Test Plan: Updated unit test that was failing before but passes after this change.

Reviewed By: hamzajzmati, jcarreiro

Differential Revision: D53841197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120087
Approved by: https://github.com/jcarreiro, https://github.com/yanboliang
2024-03-05 00:27:21 +00:00
a3a8137484 [onnxrt, dynamo] Fix run with inputs on mix devices (#121159)
`onnxrt` previously assumed all tensors are on the same device; this PR fixes that by setting the device individually for each tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121159
Approved by: https://github.com/thiagocrepaldi
2024-03-04 23:39:33 +00:00
83c312990f Add missing newline to repro and some utility thing in repro (#121051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121051
Approved by: https://github.com/ezyang, https://github.com/shunting314, https://github.com/eellison
2024-03-04 22:52:54 +00:00
eba28a6f91 [VK-API][Op Redesign][3/n] Expose new Context and Resource APIs (#121060)
Summary: For use in the next diff.

Test Plan: sc

Differential Revision: D54397862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121060
Approved by: https://github.com/SS-JIA
2024-03-04 22:26:07 +00:00
70c23a51ac Revert "[ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)"
This reverts commit 0a38a6ac8046e4d3f9cfaba86b7ec6517038646f.

Reverted https://github.com/pytorch/pytorch/pull/120925 on behalf of https://github.com/clee2000 due to broke inductor models and caused accuracy regression on nightly dashboard 0a38a6ac80 https://github.com/pytorch/pytorch/actions/runs/8118465367/job/22193590228 ([comment](https://github.com/pytorch/pytorch/pull/120925#issuecomment-1977556485))
2024-03-04 22:13:23 +00:00
df3c8b8390 [fake_impls] Fix seed/offset device for attention kernels (#120839)
1) Fix fake_impls to return the correct device for these attention
   kernels.
2) Remove special-casing and test file xfails
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120839
Approved by: https://github.com/drisspg
2024-03-04 22:02:32 +00:00
6a5c7d5f95 [ATen-vulkan] Enable deferred descriptor pool initialization (#121134)
Differential Revision: D54487619

## Context

Allow the descriptor pool of an `api::Context` object to be initialized in a deferred fashion, instead of forcing initialization upon construction. This mode of operation will be used in the ExecuTorch Vulkan delegate, where the exact number of descriptor sets can be determined once the graph is built instead of needing to "guess" an adequate amount.

## Implementation Details

* Check `config.descriptorPoolMaxSets > 0` to check if the descriptor pool should be initialized
* Introduce `DescriptorPool::init()` function to trigger initialization
* Introduce safeguards against using an uninitialized descriptor pool

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121134
Approved by: https://github.com/manuelcandales
2024-03-04 21:37:32 +00:00
0c07c0c15f Revert "add int4 packed gemm support on CPU device (#117475)"
This reverts commit 30befa592e0675cc694f87a4f6fb80894709e719.

Reverted https://github.com/pytorch/pytorch/pull/117475 on behalf of https://github.com/izaitsevfb due to fails meta-internal tests ([comment](https://github.com/pytorch/pytorch/pull/117475#issuecomment-1977474686))
2024-03-04 21:20:57 +00:00
74b19fa8b9 fix fsdp device mesh depenency issue (#121061)
as reported in https://github.com/pytorch/torchtrain/pull/103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121061
Approved by: https://github.com/awgu
2024-03-04 21:20:09 +00:00
7a065e3b23 improve the constantLR doc (#120852)
Fixes #120716
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120852
Approved by: https://github.com/janeyx99
2024-03-04 21:15:27 +00:00
cb812c9832 Add windows constraint to mkl package in wheel (#121014)
Follow up on: https://github.com/pytorch/pytorch/pull/102604
Address this comment: https://github.com/pytorch/pytorch/pull/102604#discussion_r1419944305

Wheel metadata for all wheels published to PyPI must match, otherwise poetry install will fail; see this comment:
https://github.com/pytorch/pytorch/issues/88049#issuecomment-1302555269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121014
Approved by: https://github.com/malfet
2024-03-04 20:54:26 +00:00
4cdc2d7096 [dynamo] Remove expected dynamo test failures (#120836)
Fixes some of the tests in #120643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120836
Approved by: https://github.com/zou3519
2024-03-04 20:41:49 +00:00
a98c17edc7 Revert "add int8 packed gemm support on CPU device (#118056)"
This reverts commit f84375ca5db623a6a53cbce2864d27dfad626228.

Reverted https://github.com/pytorch/pytorch/pull/118056 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/118056#issuecomment-1977368720))
2024-03-04 20:09:40 +00:00
9ff65d56a5 Revert "delete useless cast_outputs call in unary_op_impl_float_out (#120486)"
This reverts commit d053dcfa69a52e6b9f9f2ba997b6bffbc9b29bb5.

Reverted https://github.com/pytorch/pytorch/pull/120486 on behalf of https://github.com/izaitsevfb due to Fails meta internal tests ([comment](https://github.com/pytorch/pytorch/pull/120486#issuecomment-1977343125))
2024-03-04 19:52:23 +00:00
26431db939 [ONNX] Perform implicit casting of constants for the onnx::where operator (#118733) (#120619)
This PR fixes the problem of having the `Where` operator bound to different types in cases where the dtype is not explicitly set. The PR extends the implicit casting to the onnx::Where operator to fix the issue, and includes the corresponding unit test.

Fixes #118733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120619
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2024-03-04 19:27:30 +00:00
58047205ed Delete unnecessary code (#120365)
Summary: Title

Test Plan: CI

Differential Revision: D53828357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120365
Approved by: https://github.com/Skylion007
2024-03-04 18:02:58 +00:00
2e6c08a14b Update flash_attention kernel from 2.3.6 to 2.5.5 (#118935)
# Summary
Updates FlashAttention kernel code from tag [2.3.6](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.3.6) to [2.5.5](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.5).

The usual changes were then re-rolled on top of the modified kernel, changing how dropout is saved for backward and removing the head_dim_pad, since the latter would make the kernel mutate in place, which has a bad interaction with functionalization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118935
Approved by: https://github.com/cpuhrsch
2024-03-04 17:36:22 +00:00
d49864f6a5 Update Triton (#119457)
Fix pytorch nightly compilation for cuda linking

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119457
Approved by: https://github.com/bertmaher
2024-03-04 17:04:59 +00:00
6566b3db67 Add an autotune cache for inductor generated kernels (#120963)
Summary: Inductor currently has a best-config cache for the kernels that it generates. This is a local cache implemented by writing to the file system. This diff makes the cache remote by reusing the existing Triton caching mechanism, built on Memcache internally and Redis externally.

Test Plan:
tested locally using `TORCH_INDUCTOR_AUTOTUNE_REMOTE_CACHE=1`

Look at scuba to verify the local testing: https://fburl.com/scuba/triton_remote_cache/z6pypznk

The plan is to land this diff with this turned off and gradually introduce this.

Differential Revision: D54398076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120963
Approved by: https://github.com/jansel
2024-03-04 16:58:37 +00:00
3ef0befdc9 Better error messages for impl_abstract_pystub (#120959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120959
Approved by: https://github.com/drisspg
2024-03-04 15:24:36 +00:00
ce2903080c Add sparse compressed fake tensor support (#120920)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120920
Approved by: https://github.com/ezyang
2024-03-04 14:38:45 +00:00
c06499981d Add a decomposition for torch.put, 2. (#120179)
As in the title. It is an updated copy of https://github.com/pytorch/pytorch/pull/115306 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120179
Approved by: https://github.com/lezcano, https://github.com/peterbell10, https://github.com/jgong5
2024-03-04 14:37:30 +00:00
8ba49d0e53 Fix compilation error: load_fp32_from_fp16’ was not declared in this scope for ppc64le (#120307)
This patch adds the missing implementation of `load_fp32_from_fp16` for half. Fixes the error `load_fp32_from_fp16 was not declared in this scope`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120307
Approved by: https://github.com/jgong5
2024-03-04 11:08:39 +00:00
27ac73073b Fix hipification issue (#121107)
Differential Revision: D54470055

```
buck-out/v2/gen/fbcode/713b128926d8b21f/caffe2/__ATen-hip__/buck-headers/ATen/native/hip/MemoryAccess.cuh:201:61: error: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'int' [-Werror,-Wsign-compare]
    return ((threadIdx.x  + thread_work_elem*num_threads()) < remaining);
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ^ ~~~~~~~~~
```

```
buck-out/v2/gen/fbcode/713b128926d8b21f/caffe2/__ATen-hip__/buck-headers/ATen/native/hip/MemoryAccess.cuh:223:15: error: unused variable 'to' [-Werror,-Wunused-variable]
    scalar_t *to = reinterpret_cast<scalar_t *>(data[0]) + block_work_size() * idx;
              ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121107
Approved by: https://github.com/chenyang78
2024-03-04 09:41:21 +00:00
2e50566722 [dtensor] change distribute_module input/output_fn to accept module (#120895)
This is a BC-breaking change to distribute_module. The underlying rationale for this change is that sometimes, in the input_fn/output_fn, the user would want access to the current module for some attributes. This might not be common, but in some cases it's worth having access to the module.

An outstanding use case we want to support is float8: if we want to make float8 work with the TP API, the input_fn/output_fn of the TP parallel styles would need access to the module, where the module might encapsulate a `dynamic_linear.emulate` attribute that is useful for input/output casting.

Since this is needed for fp8 and DTensor is still under prototype release, I feel it's worth the change, and it's better we make the change as early as possible.

Right now this is a soft BC break, which means we still maintain BC but throw deprecation messages.
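
A hedged sketch of the new module-aware hook signature described above (argument order and setup are assumptions based on this description; assumes a 2-rank distributed environment is already initialized):

```
import torch.nn as nn
from torch.distributed._tensor import distribute_module
from torch.distributed.device_mesh import init_device_mesh

def input_fn(module, inputs, device_mesh):
    # The hook can now read module attributes (e.g. a float8 emulate flag)
    # to decide how to cast the inputs.
    return inputs

def output_fn(module, outputs, device_mesh):
    return outputs

mesh = init_device_mesh("cuda", (2,))
dmod = distribute_module(nn.Linear(8, 8), mesh, input_fn=input_fn, output_fn=output_fn)
```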

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120895
Approved by: https://github.com/tianyu-l
2024-03-04 07:22:32 +00:00
3045b16488 Do not use warm_pool() if TorchTnT is used (#121047)
Summary: This diff is needed to avoid QPS drop when parallel compilation is used with TorchTNT.

Test Plan:
On TNT
* https://www.internalfb.com/mast/job/torchx-ldm_train-hxjhl0k1wjz93
On PyPer
* f537224855

Differential Revision: D54430900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121047
Approved by: https://github.com/yanboliang
2024-03-04 06:14:11 +00:00
cyy
4b494d0750 Fix comparison of integer expressions of different signedness (#121066)
Fixes these warnings
```
src/aten/src/ATen/native/cuda/ForeachReduceOp.cu:190:19: warning: comparison of integer expressions of different signedness: ‘int’ and ‘const size_t’ {aka ‘const long unsigned int’} [-Wsign-compare]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121066
Approved by: https://github.com/tringwald, https://github.com/Skylion007
2024-03-04 02:14:10 +00:00
c83dfc8854 [PT2][Inductor] Fix missing "example_value" for nodes introduced by group batch fusion (#120974)
Summary: Similar to D54140488, we fix more such bugs

Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```

Differential Revision: D54399360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120974
Approved by: https://github.com/jackiexu1992
2024-03-04 02:11:57 +00:00
cead0363a8 [jit][nested strided tensor] support nested tensor in check_trace (#121039)
Summary:
torch.testing.assert_equal doesn't support nested strided tensors because sizes is not implemented.

This adds special handling for nested tensors by checking for them and unbinding them if they are found.

Test Plan: test_trace_with_nested_strided_tensor_output

Differential Revision: D54430238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121039
Approved by: https://github.com/YuqingJ
2024-03-04 01:15:45 +00:00
089f4c0bd9 If data dependent, check if guard_size_oblivious would fix problem and report if so (#121011)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121011
Approved by: https://github.com/lezcano
2024-03-03 23:23:14 +00:00
cyy
13fadea888 [Clang-tidy header][21/N] Fix clang-tidy warnings in aten/src/ATEN/*.{cpp,h} (#120763)
This PR continues to fix clang-tidy warnings in aten/src/ATEN/*, following #120574.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120763
Approved by: https://github.com/Skylion007
2024-03-03 23:18:43 +00:00
4f0481e1d5 [inductor] add decompostition for mm in backward (#120933)
Summary:
1) As a follow-up to D53602514, found a new way to decompose mm in the backward pass: sum the permuted input and reduce along dim 0. Benchmark result: P1190140001 (30x speedup).
Some explanation of why the original mm decomposition is slow: for an m x k x n mm, when m is small and k is large, the stride for the lhs is [m, 1], hence it needs to access memory k times to load all the data. As a result, the decomposition becomes effective with a permute, since the stride becomes [k, 1].

2) Add another pattern for large k. Benchmark result: P1190596489 (28x speedup).

3) Fix the "value not found" error in ig ctr. f536115499

Test Plan:
pt2 decompose:

 {F1462894821}
decompose: f536159404
baseline: f536282578
705k vs 725k 4% for ig ctr

Differential Revision: D54294491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120933
Approved by: https://github.com/mengluy0125
2024-03-03 18:46:42 +00:00
b7f2522692 [dynamo][compile-time] Remove unnecessary tree_map_only (#121052)
Reduces the torch.compile(backend="eager") compile time for this code by 1-2 seconds.

~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)

    return x
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121052
Approved by: https://github.com/jansel
ghstack dependencies: #121053
2024-03-03 06:59:43 +00:00
368f242e37 Revert "[PT2D] Make the speedup benchmark works with DDP + CompiledAutograd (#120454)"
This reverts commit 8c2e569928a200893fe971e615b82a2f9ce32630.

Reverted https://github.com/pytorch/pytorch/pull/120454 on behalf of https://github.com/desertfire due to breaks nightly dashboard cudagraphs run ([comment](https://github.com/pytorch/pytorch/pull/120454#issuecomment-1975001824))
2024-03-03 02:58:47 +00:00
0e0a621e0c [dynamo] Minor refactors (#120966)
These are changes I pulled out of the above PRs due to not being
related.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120966
Approved by: https://github.com/yanboliang
2024-03-03 02:20:48 +00:00
8e4301077e [dynamo][comp-time] BuiltinVariableTracker - inspect signature only on failure (#121053)
Reduces the torch.compile(backend="eager") compile time for this code by 1-2 seconds.
~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)

    return x
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121053
Approved by: https://github.com/jansel
2024-03-02 23:03:00 +00:00
7aced61c46 [DCP] deletes legacy formatting test (#120127)
Should no longer be necessary

Differential Revision: [D53791345](https://our.internmc.facebook.com/intern/diff/D53791345/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120127
Approved by: https://github.com/fegin
ghstack dependencies: #119816
2024-03-02 22:04:39 +00:00
7f81563e5e [dynamo][guards-cpp-refactor] Skip type and length check guard for DictGuardManager (#120739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120739
Approved by: https://github.com/jansel
ghstack dependencies: #120673
2024-03-02 13:15:53 +00:00
82d1465d8d [dynamo][guards-cpp-refactor] DICT_CONTAINS guard (#120673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120673
Approved by: https://github.com/jansel
2024-03-02 13:15:53 +00:00
bab4b5a341 [dist][sharded_tensor] Fix ChunkShardingSpec metadata offsets for empty shards (#121002)
ChunkShardingSpec generated metadata where offsets exceed the tensor size.

Example:

Torchrec prepared ShardedTensorMetadata:
```
ShardedTensorMetadata(shards_metadata=[
ShardMetadata(shard_offsets=[0, 0], shard_sizes=[2, 512], placement=rank:0/cuda:0),
ShardMetadata(shard_offsets=[2, 0], shard_sizes=[2, 512], placement=rank:1/cuda:1),
ShardMetadata(shard_offsets=[4, 0], shard_sizes=[2, 512], placement=rank:2/cuda:2),
ShardMetadata(shard_offsets=[6, 0], shard_sizes=[2, 512], placement=rank:3/cuda:3),
ShardMetadata(shard_offsets=[8, 0], shard_sizes=[2, 512], placement=rank:4/cuda:4),
ShardMetadata(shard_offsets=[10, 0], shard_sizes=[0, 512], placement=rank:5/cuda:5),
ShardMetadata(shard_offsets=[10, 0], shard_sizes=[0, 512], placement=rank:6/cuda:6)
],
size=torch.Size([10, 512]
),
```
Calling ShardedTensor._init_from_local_shards_and_global_metadata(), the ShardedTensor ShardingSpec builds this metadata:

```
ShardedTensorMetadata(shards_metadata=[
ShardMetadata(shard_offsets=[0, 0], shard_sizes=[2, 512], placement=rank:0/cuda:0),
ShardMetadata(shard_offsets=[2, 0], shard_sizes=[2, 512], placement=rank:1/cuda:1),
ShardMetadata(shard_offsets=[4, 0], shard_sizes=[2, 512], placement=rank:2/cuda:2),
ShardMetadata(shard_offsets=[6, 0], shard_sizes=[2, 512], placement=rank:3/cuda:3),
ShardMetadata(shard_offsets=[8, 0], shard_sizes=[2, 512], placement=rank:4/cuda:4),
ShardMetadata(shard_offsets=[10, 0], shard_sizes=[0, 512], placement=rank:5/cuda:5),
ShardMetadata(shard_offsets=[12, 0], shard_sizes=[0, 512], placement=rank:6/cuda:6)
],
size=torch.Size([10, 512]), tensor_properties=TensorProperties(dtype=torch.float16, layout=torch.strided, requires_grad=False, memory_format=torch.contiguous_format, pin_memory=False))
```
The deduced ChunkShardingSpec:
```
ChunkShardingSpec(dim=0, placements=[rank:0/cuda:0, rank:1/cuda:1, rank:2/cuda:2, rank:3/cuda:3, rank:4/cuda:4, rank:5/cuda:5, rank:6/cuda:6])
```

The fix is to limit offsets by dim size.
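
An illustrative sketch of the capping behavior (function and variable names are assumed):

```
def chunk_offsets(shard_sizes, dim_size):
    # Cap the running offset at the dim size so trailing empty shards all
    # report offset == dim_size (10 in the example above) instead of overshooting (12).
    offsets, current = [], 0
    for size in shard_sizes:
        offsets.append(min(current, dim_size))
        current += size
    return offsets

print(chunk_offsets([2, 2, 2, 2, 2, 0, 0], 10))  # [0, 2, 4, 6, 8, 10, 10]
```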

Differential Revision: [D54419513](https://our.internmc.facebook.com/intern/diff/D54419513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121002
Approved by: https://github.com/wz337
2024-03-02 08:58:48 +00:00
suo
66b20b4297 [export][ez] minor variable rename (#121040)
since `_export()` now takes an `nn.Module` only (which is asserted against at an upper layer), we should change this variable name from `f` to `mod` and remove some unnecessary isinstance checks

Differential Revision: [D54430381](https://our.internmc.facebook.com/intern/diff/D54430381/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121040
Approved by: https://github.com/angelayi
ghstack dependencies: #121037
2024-03-02 08:49:06 +00:00
suo
505637198a [export] cleanup to rewrite steps (#121037)
1. Some underscores for consistency of private functions.
2. remove dead code in `_replace_param_buffer_names`

Differential Revision: [D54429206](https://our.internmc.facebook.com/intern/diff/D54429206/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121037
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2024-03-02 08:45:50 +00:00
b0cfa96e82 [Torchelastic][Logging] Pluggable logsspecs using python entrypoints and option to specify one by name. (#120942)
Summary:
Expose an option for users to specify the name of the LogsSpec implementation to use.
- It has to be defined in entrypoints under the `torchrun.logs_specs` group.
- It must implement LogsSpec, defined in the prior PR/diff.

Test Plan: unit test+local tests

Reviewed By: ezyang

Differential Revision: D54180838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120942
Approved by: https://github.com/ezyang
2024-03-02 08:07:52 +00:00
f351a71dbb remove constraints from capture_pre_autograd_graph (#120981)
Differential Revision: D54407296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120981
Approved by: https://github.com/zhxchen17
2024-03-02 07:00:51 +00:00
83d848e1c7 [Quant][Inductor] Enable lowering of dynamic qlinear for X86Inductor (#120605)
**description**
Enable lowering of dynamic qlinear for X86Inductor. The pattern is `choose_qparams -> getitem -> q -> dq -> linear`. We only fuse `dq -> linear` and get `choose_qparams -> getitem -> q -> onednn.qlinear_pointwise`. So, we treat it as dynamic quantization of activation + static quantized linear.
The previous implementation of `onednn.qlinear_pointwise` is for the case where `x_scale` and `x_zp` are scalars. Since `choose_qparams` returns tensors, we added a variation `onednn.qlinear_pointwise.tensor` to support the case.
This feature is targeting PyTorch 2.3 release.

**Test plan**
```
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_cpu
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_qat_cpu
python inductor/test_cpu_cpp_wrapper.py -k test_dynamic_qlinear
```

**Performance before and after lowering `choose_qparam` to Inductor**
Before
- latency for shape (32, 32) = 0.151 ms
  latency for shape (128, 128) = 0.153 ms
  latency for shape (1024, 1024) = 0.247 ms

After
- latency for shape (32, 32) = 0.049 ms
- latency for shape (128, 128) = 0.052 ms
- latency for shape (1024, 1024) = 0.133 ms

Test method: A module with a single Linear layer, dynamic-quantize, lower to X86Inductor
Test env & config: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, single instance, single core, using Intel OpenMP and Tcmalloc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120605
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-03-02 05:11:17 +00:00
af5376c444 [dtensor] add support for loss parallel (#119877)
Loss parallel is the last piece of sequence parallelism to enable. It enables efficient distributed cross entropy computation when the input is sharded on the class dimension (in a classification problem with many classes). The implementation is via a context manager `loss_parallel`, after enabling which users can directly use `torch.nn.functional.cross_entropy` or `torch.nn.CrossEntropyLoss` without modifying other parts of their code.
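
A minimal usage sketch based on this description (`logits` and `labels` are placeholders; `logits` is assumed to be a DTensor sharded on the class dimension in a tensor-parallel setup):

```
import torch.nn.functional as F
from torch.distributed.tensor.parallel import loss_parallel

with loss_parallel():
    loss = F.cross_entropy(logits, labels)  # logits sharded on the class dim
    loss.backward()
```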

Here are the underlying rationales why we are going through these op replacements:

1. `nn.functional.cross_entropy` is the common method OSS users use for things like transformer training; to avoid changing user code, we want users to still use this function for loss calculation if they are already using it.
2. `nn.functional.cross_entropy` boils down into `aten.log_softmax` and `aten.nll_loss_foward/backward`, and DTensor now supports those ops already (#117723 #119255 #118917 #119256). They are doing computation with input *replicated* on the class dimension.
3. However when the input of this loss calculation is **sharded on the class dimension**, to run sharded computation efficiently, we need to run both `aten.log_softmax` and `aten.nll_loss_foward` with multiple all-reduce collectives **in the middle of** those aten ops. This is not possible if we are just overriding these two ops, so we need to have some way to **decompose** these two ops into smaller ops to have collectives run in the middle of these two ops.
4. We explored the existing decompositions (#118950). It seems to work, except that `log_softmax_backward` and `nll_loss_backward` combined together in aten are implemented in an inefficient way, which would trigger an additional expensive collective. Recently some users also reported similar issues https://github.com/pytorch/pytorch/issues/119261.
5. Therefore, currently we are doing our own decomposition inside a context manager for sequence parallelism specifically. Once we have a better decomposition in core, we can possibly take that instead of reinventing the wheel here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119877
Approved by: https://github.com/wanchaol
2024-03-02 05:06:26 +00:00
c4ed456fc3 [inductor] fix accuracy failure for a few models under freezing (#121054)
Fixes https://github.com/pytorch/pytorch/issues/120545. The reason these models fail the accuracy test with freezing is the conv-batchnorm fusion, which causes relatively big numerical churn.

For the failed TIMM models, raising the tolerance to `8 * 1e-2` can make the test pass.

For the failed TB models, the numerical difference is too large. After a discussion with @eellison, we decided to skip them with freezing for now.

On the other hand, we should probably dig more into why the conv-bn fusion causes such a large numerical difference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121054
Approved by: https://github.com/eellison
2024-03-02 04:53:59 +00:00
f84375ca5d add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
ghstack dependencies: #117475
2024-03-02 04:35:49 +00:00
5258c3645d [ATen-vulkan][EZ] Bug fixes: only create the image view when memory has been bound, invalidate cmd on flush (#121027)
Summary:
## Context

Introduce some simple fixes for bugs in the Vulkan Compute API that were causing errors on Android.

1. When using deferred allocation for image textures, it is undefined behaviour to create a `vkImageView` for a `vkImage` that has not yet been bound to memory. Fix this by creating the image view only after the `vkImage` has been bound to memory.
2. When flushing the `api::Context`, the command pool is flushed but any current command buffers are not invalidated. This will cause a segmentation fault if the command buffer is not submitted prior to calling `flush()`, because subsequent calls to `submit_*_job()` will use the old command buffer which will have been freed when the command pool is flushed. To fix, invalidate any existing command buffers when calling `flush()`.

Test Plan:
Build the test binary for Android:

```
buck build --target-platforms=ovr_config//platform/android:arm64-fbsource -c ndk.custom_libcxx=false //xplat/caffe2:pt_vulkan_api_test_bin --show-output
```

Push and run the test binary on a local android phone.

Differential Revision: D54425370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121027
Approved by: https://github.com/mcr229, https://github.com/cbilgin
2024-03-02 04:35:46 +00:00
2d9efad38f Add the bound check for flatten with out_dim (#120894)
Fixes #120762

The bound is not valid in the example below, but it goes unchecked.
```
a = torch.tensor([1, 2, 3])
a.flatten(start_dim=0, end_dim=1, out_dim='a')
```

The same bound is already checked for this case:

```
a = torch.tensor([1, 2, 3])
a.flatten(start_dim=0, end_dim=1)
```

- Therefore, just apply the same check.

@malfet @janeyx99
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120894
Approved by: https://github.com/malfet, https://github.com/spzala
2024-03-02 03:56:55 +00:00
06fe6ed82b [dynamo bug burndown] update tensor creation to support sequences of tensors (#120872)
Fixes https://github.com/pytorch/pytorch/issues/120645

`_internal_new_from_data` calls `_recursive_build`, but we run into errors such as the following cases.
```
Failed running call_function <function tensor at 0xDEADBEEF>:
scalar_tensor(): argument (position 1) must be Number, not FakeTensor

# e.g. cases
1. [FakeTensor(..., size=(20, 1), dtype=torch.float64), ..., FakeTensor(..., size=(20, 1), dtype=torch.float64)]
- Here, we call _recursive_build(sizes=[4] ...) which hits the base case `if dim == ndim:` in the 2nd level of recursion.
- So, we try to return `scalar_tensor(FakeTensor)`
2. [[(FakeTensor(..., size=(1,), dtype=torch.int64), FakeTensor(..., size=(), dtype=torch.int64)]]

# side note: when can size = ()? Probably from scalar_tensor.
>>> torch.scalar_tensor(1).shape
torch.Size([])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120872
Approved by: https://github.com/ezyang
2024-03-02 02:22:59 +00:00
a3b81666b1 [Dynamo] Fix guards for code objects (#120909)
By comparing them only by id, and raising an assert if someone calls into `EQUALS_MATCH`, which renders the following example compilable:
```python
import torch

@torch.compile()
def foo(x, y):
    code = compile(y, "foo", "exec")
    exec(y)
    return x

print(foo(torch.rand(3), "print('Hello World')"))
```

Fixes https://github.com/pytorch/pytorch/issues/120647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120909
Approved by: https://github.com/jansel
2024-03-02 02:17:17 +00:00
f7a2bae0ac Change TestOpWaitiness to use MultiProcessTestCase (#121046)
The test has been failing sporadically recently in CI and the failures are not reproducible locally, likely due to some nasty race condition related to a combination of MultiThreadedTestCase, the use of global state and finalizers, and the recently introduced test decorator for native funcol migration.

Switching the test to use MultiProcessTestCase provides better isolation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121046
Approved by: https://github.com/weifengpy
2024-03-02 01:12:14 +00:00
4cf6d1172b [FSDP2] Used ReduceOp.AVG if fp32 reduce-scatter (#120919)
This PR uses the `ncclAvg` op (via `ReduceOp.AVG`) when doing fp32 reduce-scatter. This allows the division by world size to happen in the reduce-scatter kernel itself, which seems to save an extra memory read/write for the division. This yields ~1.5% speedup on the Llama-7B workload (and makes per-parameter FSDP faster than flat-parameter FSDP 😅 ).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120919
Approved by: https://github.com/yifuwang, https://github.com/wanchaol
ghstack dependencies: #120238, #120910
2024-03-02 00:39:16 +00:00
85157af784 Fix more xfails for scaled_dot_product_attention (#121032)
Follow-up to #120928; should fix #120921.

I missed one test in #120928 - test_dispatch_symbolic_meta_outplace_all_strides. This wasn't caught because #120921 was open at the time, disabling the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121032
Approved by: https://github.com/drisspg
2024-03-02 00:28:44 +00:00
7c71d7f32b [DTensor] Supported foreach=True for clip_grad_norm_ (#120910)
This PR adds support for `clip_grad_norm_(foreach=True)` by implementing `aten._foreach_norm.Scalar` and `aten._foreach_mul_.Tensor`. `foreach=True` is required to get competitive performance with `DTensor`.

`foreach=True` reduces CPU overhead for Llama-7B from 388 ms to 63 ms. Existing flat-parameter FSDP's `clip_grad_norm_` takes 3 ms on CPU 😢 .
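
A usage sketch (assuming `model` holds DTensor parameters whose gradients have already been computed):

```
import torch

total_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(), max_norm=1.0, foreach=True
)
```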

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120910
Approved by: https://github.com/wanchaol, https://github.com/janeyx99
ghstack dependencies: #120238
2024-03-02 00:28:09 +00:00
f0e8e7cf43 [DTensor] Supported foreach=False for clip_grad_norm_ (#120238)
This PR adds `DTensor` support for `aten.linalg_vector_norm.default` and `aten.stack.default` so that we can run `clip_grad_norm_` (with `foreach=False`).

To implement `linalg_vector_norm`, we introduce a `_NormPartial` placement since the reduction op for norm is the norm itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120238
Approved by: https://github.com/wanchaol
2024-03-02 00:25:16 +00:00
30befa592e add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU; both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-02 00:17:34 +00:00
c8e56b4965 [c10d] dump from one and only one thread (PG0's monitor thread) (#120893)
Summary:
When there are multiple PGs in a process and a hardware failure happens, we found that multiple PGs/threads in the same process compete to dump the same records at the same time. This affects the reliability of dumps.

In this PR, we make the change such that only one thread/PG can dump: PG0's monitor thread. We use a static variable to indicate that something (e.g., a collective timeout) has triggered the dump locally.

The monitor thread dumps debug info under any one of these 3 conditions:
1: the static variable is set to true by the watchdog thread when it detects a timeout or a pipe-dump signal
2: a timeout signal is received from other ranks through tcpstore
3: no heartbeat of the watchdog is observed
Test Plan:
python test/distributed/test_c10d_nccl.py -k
test_timeout_dumps_on_stuck_ranks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120893
Approved by: https://github.com/wconstab
2024-03-02 00:13:13 +00:00
3d7cf8f392 Revert "Limit loop unrolling (#120023)"
This reverts commit 6cc7f9a2e6bedff3109ea066278e9805713da4bb.

Reverted https://github.com/pytorch/pytorch/pull/120023 on behalf of https://github.com/anijain2305 due to breaks llms export ([comment](https://github.com/pytorch/pytorch/pull/120023#issuecomment-1974104633))
2024-03-02 00:04:08 +00:00
d8395830ea [ONNX][dynamo_export] Skip instance_norm decomp for export (#120866)
Otherwise, instance_norm is decomposed into batch_norm with training set to True.
The downstream exporter has no way to figure out that training is actually not needed.
ONNX does have an InstanceNormalization operator defined; however, due to the decomp,
it unnecessarily exports as batch norm plus glue code.

Depends on https://github.com/microsoft/onnxscript/pull/1284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120866
Approved by: https://github.com/thiagocrepaldi, https://github.com/titaiwangms
2024-03-01 23:51:16 +00:00
581fe26792 [C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120745
Approved by: https://github.com/zdevito
2024-03-01 23:45:43 +00:00
0a38a6ac80 [ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)
According to the [cuBLAS API Reference](https://docs.nvidia.com/cuda/cublas/index.html#cublassetworkspace) the recommended workspace size for Hopper is 32 MiB and for the rest architectures 4 MiB. This PR increases the workspace size accordingly. I am not aware of the recommended workspace size for HIP, that is why I am keeping it unchanged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120925
Approved by: https://github.com/eqy, https://github.com/malfet
2024-03-01 23:32:59 +00:00
06b52dd103 TD outside of test job (#118250)
Give TD its own job so that each shard can get the results from this one job artifact; the shards will always be in sync with each other and we no longer need to worry about consistency issues.

* Move test discovery to its own file that is not dependent on torch so it can be run without building torch
  * Cannot do cpp test discovery before building pytorch
* Move TD calculation to own file that will create a json file with the final results
* TD is now job/build env agnostic
* TD will rank all tests, including those that test jobs may not want to run (ex it will rank distributed tests along with default tests, even though these tests are never run on the same machine together)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118250
Approved by: https://github.com/huydhn
2024-03-01 23:08:10 +00:00
d08ce51881 [compiled autograd] refactor eager test loading and run custom ops tests (#120679)
TestCustomOp's tests use helper attributes and functions from a util parent class. To support arbitrary test classes, we need to refactor the current approach. Instead of allowlisting certain methods, we copy the whole class and only overwrite the "test_.*" methods.

Compiled autograd fails on ~10/90 of the newly added tests. test_autograd_function_backed_op is the example we discussed in PT-2D meeting about requiring c++ autograd::Function support. I'm addressing this in #120732

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120679
Approved by: https://github.com/jansel, https://github.com/zou3519
2024-03-01 22:48:17 +00:00
8cb4855d1e Release the GIL in serialization when it is safe to do so (#120818)
In particular this ensures we release the GIL when serializing:
- PyBytes objects (this is how we get the pickle object)
- Storage objects

Other string-like objects keep the GIL, which is fine because we only use this for very small strings today (for endianness), so releasing the GIL is not important there.
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120818
Approved by: https://github.com/colesbury
2024-03-01 22:37:26 +00:00
fd2ab1f613 [PT2][Inductor] Change the split cat log to debug (#120823)
Summary: Address the report in https://github.com/pytorch/pytorch/issues/120771.

Test Plan: see signal

Differential Revision: D54323475

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120823
Approved by: https://github.com/jackiexu1992
2024-03-01 22:34:23 +00:00
797d4fbdf4 [export] Log operator set. (#120951)
Summary: as title. We want to count the number of total operator calls, and the distinct set of operators in the exported graph.

Test Plan: CI

Differential Revision: D54390298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120951
Approved by: https://github.com/tugsbayasgalan
2024-03-01 20:58:31 +00:00
d3876f73e7 Preserve metadata for MutableMapping and MutableSequence in pin_memory and collate_fn (#120553)
For the user-defined `Mapping` type, it may contain some metadata (e.g., pytorch/tensordict#679, https://github.com/pytorch/pytorch/pull/120195#issue-2141716712). Simply using `type(mapping)({k: v for k, v in mapping.items()})` does not take this metadata into account. This PR uses `copy.copy(mapping)` to create a clone of the original collection and iteratively updates the elements in the cloned collection. This preserves the metadata in the original collection via `copy.copy(...)` rather than relying on the `__init__` method of the user-defined classes.
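A small self-contained illustration of the approach (the `MetaDict` class and the values are hypothetical, for demonstration only):

```python
import copy

class MetaDict(dict):
    """A user-defined mapping that carries extra metadata."""
    def __init__(self, *args, device=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.device = device

batch = MetaDict({"a": 1, "b": 2}, device="cpu")

# type(batch)({...}) would call __init__ without `device` and silently drop the metadata;
# copy.copy keeps the instance attributes, and the values are then overwritten in place.
out = copy.copy(batch)
for k, v in batch.items():
    out[k] = v + 1

print(out, out.device)  # {'a': 2, 'b': 3} cpu
```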

Reference:

- pytorch/tensordict#679
- #120195

Closes #120195

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120553
Approved by: https://github.com/vmoens
2024-03-01 20:43:42 +00:00
a7c799fb85 [executorch] Add support for method variants in aten executorch code gen (#121016)
Summary: Title.

Test Plan: The added unittest

Differential Revision: D54423028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121016
Approved by: https://github.com/larryliu0820
2024-03-01 20:33:02 +00:00
7a64eb65e4 Fix Dynamo tests failing with "Failed running call_function <built-in function linalg_norm" (#120993)
When iterating the ord value over an array, we share the same torchdynamo context. This makes dynamo treat the `ord` variable as a dynamic shape, causing problems.

In the `vector_norm` decomposition, casting the int type ord to float will fix this problem.
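A minimal, hypothetical repro of the pattern described above (not taken from the PR); the explicit `float(...)` cast mirrors what the decomposition now does internally:

```python
import torch

@torch.compile
def norm(x, ord):
    return torch.linalg.vector_norm(x, ord)

x = torch.randn(8)
for ord in (1, 2, 3):            # iterating int ords within a single dynamo context
    print(norm(x, float(ord)))   # the float cast avoids treating `ord` as a dynamic shape
```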

Fixes https://github.com/pytorch/pytorch/issues/119795
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120993
Approved by: https://github.com/lezcano
2024-03-01 20:27:45 +00:00
39e4d1a535 Make TestEmbeddingNNDeviceTypeCPU::test_EmbeddingBag_per_sample_weights_and_no_offsets_cpu_int32_float32 compatible with TorchDynamo (#120831)
Previously, the test case directly accessed the tensor data via tensor.data, which is not supported on FakeTensor, so we manually copy the tensor as a workaround.
Fixes: https://github.com/pytorch/pytorch/issues/119788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120831
Approved by: https://github.com/janeyx99
2024-03-01 20:27:41 +00:00
e02047add4 [BE][Ez]: Update ruff to 0.3.0 (#121003)
Update ruff to 0.3.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121003
Approved by: https://github.com/malfet
2024-03-01 20:20:55 +00:00
af93849a3a [pt2 export] small fix on non_persistent buffer unlift (#120715)
Summary: Change to get_buffer from the input plain_graph_module instead of the new stateful_gm when restoring non_persistent buffers, since the stateful_gm doesn't contain the buffer yet.

Test Plan:
Added test case.
`buck test caffe2/test:test_export -- test_unlift_nonpersistent_buffer`

Differential Revision: D54216772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120715
Approved by: https://github.com/zhxchen17
2024-03-01 20:20:00 +00:00
19fcf6de1a Add lowering for fraction_max_pool2d (#120460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120460
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-03-01 20:13:20 +00:00
cdb50d0380 remove constraints from aot_compile (#120979)
Differential Revision: D54405986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120979
Approved by: https://github.com/zhxchen17
2024-03-01 20:06:21 +00:00
55ae8fb1f6 Switched m1 runners to the lable macos-m1-stable (#120997)
Switched m1 runners to use  `macos-m1-stable` label, which points to exactly the same M1 running MacOS-13.2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120997
Approved by: https://github.com/malfet
2024-03-01 19:52:34 +00:00
de3202abea [EZ][BE] Remove Python-2 installation logic (#121015)
Not sure why it's still there in 2024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121015
Approved by: https://github.com/jeffdaily, https://github.com/atalman
2024-03-01 19:39:02 +00:00
b474a523c6 Ban passing in free function into capture_pre_autograd_graph (#120817)
Summary: Today we don't allow free functions to be the tracing callable in torch.export. As part of migrating capture_pre_autograd_graph usages to torch.export, we need to ban free functions in capture_pre_autograd_graph as well.

Test Plan: CI

Differential Revision: D54319597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120817
Approved by: https://github.com/zhxchen17, https://github.com/andrewor14
2024-03-01 19:38:58 +00:00
ce50db22c2 Handle transposition pattern seen in SDPA with unbacked SymInts (#121005)
Fixes https://github.com/pytorch/pytorch/issues/121000

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121005
Approved by: https://github.com/lezcano
2024-03-01 18:58:19 +00:00
11f2e8beac [Dynamo, Compiled] Save some python overhead when calling compiled function with many tangents (#118730)
When a dynamo backend captures the entire forward pass and the entire backward pass without a graph break, there can be many (from memory, hundreds or thousands for a big model) `contiguous` calls. Here we save that overhead by checking `is_contiguous` before the `contiguous` call.
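The idea, as a tiny hedged sketch (the helper name is mine, not from the PR):

```python
import torch

def maybe_contiguous(t: torch.Tensor) -> torch.Tensor:
    # Skip the extra Python-level call when the tensor is already contiguous;
    # repeated hundreds or thousands of times per step, that overhead adds up.
    return t if t.is_contiguous() else t.contiguous()

t = torch.randn(4, 4)
assert maybe_contiguous(t) is t                    # already contiguous: returned as-is
assert maybe_contiguous(t.t()).is_contiguous()     # transposed view gets materialized
```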

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118730
Approved by: https://github.com/thiagocrepaldi, https://github.com/ezyang
2024-03-01 18:57:18 +00:00
0b18ed1c47 [FSDP] Added warning about unsupported double backwards (#120926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120926
Approved by: https://github.com/Skylion007
2024-03-01 18:40:30 +00:00
f01a23d01b Don't aggressively rewrite asserts for symbolic expressions (#120564)
Fixes: https://github.com/pytorch/pytorch/issues/118417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120564
Approved by: https://github.com/ezyang
2024-03-01 17:46:36 +00:00
c844b377fa [dynamo] Reorder logs (#116106)
Currently when there is a print/warning in the graph, dynamo graph breaks causing export to fail. However export would like to just skip over these print/warning calls: https://github.com/pytorch/pytorch/issues/113792.

Additionally there's a torch.compile feature request to "reorder prints" so that instead of graph breaking when hitting prints/logging, we can skip over these prints to create larger compiled graphs, and then print the results out after those compiled graphs: https://github.com/pytorch/pytorch/issues/93739. This PR also adds the `reorderable_logging_functions` config for users to register logging functions to be reordered (like `print` or a custom logging function). Printout of the bytecode after reordering the prints looks like the following: P914736600
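(The bytecode printout referenced above is an internal paste.) A minimal, hypothetical usage sketch of the config, assuming `reorderable_logging_functions` is exposed as a set of callables on `torch._dynamo.config`:

```python
import torch
import torch._dynamo.config as dynamo_config

# Register print as reorderable so it no longer forces a graph break.
dynamo_config.reorderable_logging_functions.add(print)

@torch.compile
def f(x):
    y = x + 1
    print("y =", y)   # replayed after the compiled graph runs instead of breaking the graph
    return y * 2

f(torch.randn(4))
```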

There are some limitations to the printing right now:
* You can only register logging functions, not methods
* Inputs to the logging functions can only be tensors, constants, and format strings
* Inputs to the logging functions which will later be mutated in-place will not be printed correctly

TODO: Add the following tests
* print function with argument of nested data structure;
* print function with argument of nested data structure being updated inside of compile region (this would test if we handle side effect correctly);
* custom defined logging functions with nn.Module or nn.Module attribute arguments;
* custom defined logging functions with submodule input/output as arguments (we need to handle the mapping and fused-out value);
* custom defined logging functions with tensor argument and mutation inside of the function (TBD: this may increase memory usage);

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116106
Approved by: https://github.com/yanboliang
2024-03-01 17:04:24 +00:00
9fc56f8209 Exclude operators that produce unbacked symbols (#120917)
Unbacked symbols vary at runtime which means they are not CUDA
graphable.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120917
Approved by: https://github.com/eellison
2024-03-01 16:56:08 +00:00
ea7149aa22 Replace TTIR string parsing with structured MLIR walk in Triton kernel mutation analysis (#120476)
Summary: Previously, we relied on the `lark`-based parsing of the string TTIR representation dumped by the Triton compiler. However, this has proven to be brittle in the face of changes both in the user-written Triton kernel code and in the Triton compiler code.

In this PR, we add an alternative way of mining the function information from the TTIR based on walking the tree of structured MLIR entities. To this end, we rely on the MLIR bindings exposed by `libtriton` (related PR in Triton: https://github.com/openai/triton/pull/3191).

For now, we introduce gating based on whether `ttir_module.hasattr("walk")`. This will allow switching to the newly introduced TTIR analysis approach only when the new MLIR bindings (including that of `ModuleOp::walk`) become available in the Triton pin. Before then, we'll keep using the old string TTIR parsing-based approach.

Test Plan: The new functionality was tested locally with the latest Triton version compiled with the added new MLIR bindings: all Triton kernel mutation tests in `test_triton_kernels.py` are passing. Here we rely on the CI for regression testing, but it won't cover the new functionality due to gating.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120476
Approved by: https://github.com/oulgen
2024-03-01 16:20:24 +00:00
8861507ba3 Fix guard for SUPPORTED_NODES (#120798)
The special-case code for handling SUPPORTED_NODES was producing a guard that looked like:
```
"G['torch'].utils._pytree.SUPPORTED_NODES[<class '__main__.CausalLMOutputWithPast'>].type"
```
resulting in an eval error when trying to evaluate the guard.

This change adds a new source type (`ClassSource`) which is given a class type (in this case `CausalLMOutputWithPast`) and attempts to fetch it from its defining module.  It then uses that to build the `SUPPORTED_NODES` guards instead of referring to the type directly.

Also added a unit test which fails before this change and passes after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120798
Approved by: https://github.com/anijain2305
2024-03-01 16:03:21 +00:00
b8e6ca6f76 Add sparse compressed meta tensor support (#120707)
As in the title.

Replaces https://github.com/pytorch/pytorch/pull/120498 and https://github.com/pytorch/pytorch/pull/120562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120707
Approved by: https://github.com/ezyang
ghstack dependencies: #120703
2024-03-01 13:28:47 +00:00
70d4d109f2 Make SparseCsr a functionality dispatch key (#120703)
As in the title.

To enable meta and fake tensor support for sparse compressed tensors in compliance with the meta/fake tensor support for sparse COO tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120703
Approved by: https://github.com/ezyang
2024-03-01 13:28:46 +00:00
eee040c939 expose nested header to wheel (#120603)
Expose the nested tensor headers in the PyTorch wheel, so developers can reuse PyTorch's nested-tensor-related util headers from the wheel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120603
Approved by: https://github.com/jbschlosser, https://github.com/gujinghui
2024-03-01 09:59:45 +00:00
c646030cd2 Support higher order op functionalization in predispatch IR (#115314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115314
Approved by: https://github.com/bdhirsh
2024-03-01 09:13:47 +00:00
82b356193d Move VariableInfo into its own file to avoid circular dependency (#120732)
VariableInfo is used by both `custom_function.h` (in a templated class) and `compiled_autograd.h` (in a class with some templated methods). Another way could have been to make a `compiled_autograd.cpp` and forward declare VariableInfo, but this VariableInfo was also being used in other nodes like PyNode so it felt cleaner to do it this way.

Differential Revision: [D54287007](https://our.internmc.facebook.com/intern/diff/D54287007)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120732
Approved by: https://github.com/jansel
2024-03-01 08:48:13 +00:00
8c2e569928 [PT2D] Make the speedup benchmark works with DDP + CompiledAutograd (#120454)
With DDP + CompiledAutograd, we could not use the same parallelized model to do the test. This PR copies the model.

Differential Revision: [D54094257](https://our.internmc.facebook.com/intern/diff/D54094257/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120454
Approved by: https://github.com/yf225, https://github.com/xmfan
2024-03-01 08:35:22 +00:00
cyy
77ef9d4022 Add verbose parameter to torch.hub.list (#120717)
This PR adds ```verbose``` to ```torch.hub.list``` to let users disable extraneous output.
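A hedged usage example (requires network access to fetch the repo listing; `pytorch/vision` is just an illustrative hub repo):

```python
import torch

# verbose=False suppresses the extra progress/cache messages while listing entrypoints.
entrypoints = torch.hub.list("pytorch/vision", verbose=False)
print(entrypoints[:5])
```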
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120717
Approved by: https://github.com/ezyang
2024-03-01 07:39:48 +00:00
63b259492a Revert "[dynamo] Reorder logs (#116106)"
This reverts commit c5472628ff9dedff57722941ac1b2a50af880197.

Reverted https://github.com/pytorch/pytorch/pull/116106 on behalf of https://github.com/clee2000 due to landrace with 342e7929b804ec56121e82e92d6a199b549c38b1, which removed the import for warnings.  Should be an easy fix after rebase c5472628ff ([comment](https://github.com/pytorch/pytorch/pull/116106#issuecomment-1972586180))
2024-03-01 06:25:46 +00:00
eqy
86e6497c6f [Inductor][cuDNN] Disable tf32 in test_mutate_view_for_conv_output (#120953)
Another disablement of TF32 to unblock #120642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120953
Approved by: https://github.com/Skylion007
2024-03-01 05:51:29 +00:00
6ed26392b3 Update xfails for scaled_dot_product_attention (#120928)
Update xfails for test_dispatch_meta_outplace and test_dispatch_symbolic_meta_outplace.

These tests are sometimes expected to fail because we moved the registrations from meta_registrations.py to fake_impls.py. AFAIK this is okay: fake tensors will still work because we have special handling in fake_impls.py. The purpose of this PR is to update the xfails so they correctly xfail the failing tests.

Previously, I set these to xfail only for bfloat16, float16, and float32, but not float64; but this isn't really correct. Explanation below:

Scaled dot product attention (SDPA) has multiple implementations, including efficient_attention, flash_attention, and unfused attention. flash_attention supports fp16, bf16. efficient_attention supports fp16, bf16, fp32. unfused attention supports all dtypes.

efficient_attention and flash_attention implementations will fail the meta tests, but the unfused attention will not. Certain platforms may support none, both, or one of efficient_attention and flash_attention. Unfused attention will pass because it falls back to constituent ops which have registered meta kernels.

So: on CUDA, we have all 3 available: in bf16, fp16, fp32, we'll select one of the fused implementations (where this test will fail).
On ROCM, we don't have efficient_attention: so fp32 will use the unfused implementation, where the test will pass.

Fix in this PR:
* If any fused impl is available, then xfail float16 & bfloat16
* If efficient_attention is available, then also xfail float32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120928
Approved by: https://github.com/drisspg
2024-03-01 05:16:11 +00:00
2a08a51738 Add _assert_scalar and teach Inductor to codegen it (#114148)
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.

So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.
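A usage-level sketch of the kind of scalar runtime assert this enables; whether `torch._check` on an unbacked value lowers to exactly this operator is my assumption, and `capture_scalar_outputs` is only set so that `.item()` yields an unbacked SymInt:

```python
import torch

torch._dynamo.config.capture_scalar_outputs = True  # let .item() produce an unbacked SymInt

@torch.compile
def f(x):
    u = x.item()
    torch._check(u >= 2)   # a scalar-level runtime assert, no scalar_to_tensor round-trip
    return torch.zeros(u)

print(f(torch.tensor(5)).shape)
```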

I need to customize the codegen for this operator, so I decide to directly implement it in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
ghstack dependencies: #120800
2024-03-01 05:06:36 +00:00
77aea289ae Add test to check that COW inputs are not materialized (#119507)
Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119507
Approved by: https://github.com/ezyang
ghstack dependencies: #120455
2024-03-01 05:05:28 +00:00
13a54ce279 Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-03-01 05:05:28 +00:00
d053dcfa69 delete useless cast_outputs call in unary_op_impl_float_out (#120486)
The cast_outputs function is only used for the CPU device, and it is already called in the cpu_xxx_vec helpers, e.g. cpu_kernel_vec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120486
Approved by: https://github.com/ezyang
2024-03-01 04:54:11 +00:00
c5472628ff [dynamo] Reorder logs (#116106)
Currently when there is a print/warning in the graph, dynamo graph breaks causing export to fail. However export would like to just skip over these print/warning calls: https://github.com/pytorch/pytorch/issues/113792.

Additionally there's a torch.compile feature request to "reorder prints" so that instead of graph breaking when hitting prints/logging, we can skip over these prints to create larger compiled graphs, and then print the results out after those compiled graphs: https://github.com/pytorch/pytorch/issues/93739. This PR also adds the `reorderable_logging_functions` config for users to register logging functions to be reordered (like `print` or a custom logging function). Printout of the bytecode after reordering the prints looks like the following: P914736600

There are some limitations to the printing right now:
* You can only register logging functions, not methods
* Inputs to the logging functions can only be tensors, constants, and format strings
* Inputs to the logging functions which will later be mutated in-place will not be printed correctly

TODO: Add the following tests
* print function with argument of nested data structure;
* print function with argument of nested data structure being updated inside of compile region (this would test if we handle side effect correctly);
* custom defined logging functions with nn.Module or nn.Module attribute arguments;
* custom defined logging functions with submodule input/output as arguments (we need to handle the mapping and fused-out value);
* custom defined logging functions with tensor argument and mutation inside of the function (TBD: this may increase memory usage);

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116106
Approved by: https://github.com/yanboliang
2024-03-01 04:48:44 +00:00
02a410ee12 Enable TORCH_TRACE by default in all Tupperware like environments (#120915)
Summary:
This is a reimplemented version of the FB specific code in https://www.internalfb.com/diff/D54230697

The new strategy is that we unconditionally install an FB handler to trace_log logger (and always set level to DEBUG). When the first log message is emitted, we check the JK/filesystem to see if we should actually do logging. If we decide we don't do logging, we remove the handler from trace_log and are done.

build_only[github-export-checks,executorch,pytorch_benchmark,pytorch_quantization,pytorch_distributed,pytorch_distributed_gpu,pytorch_dynamo_inductor,pytorch_functorch,pytorch_fx2trt,pytorch_diff_train_tests_ads,glow_fb_pytorch_tests,training_platform,training_platform_compatibility,training_toolkit_applications,training_toolkit_examples,training_toolkit_model_optimization,dper3_pytorch,xplat_caffe2,pytorch_dev,android-pytorch-instrumentation-tests,smartpytorchgithub_first_try_merge,frl-target-determinator,f6-buck,training_platform_for_github,sigmoid_cpu,sigmoid_gpu,aiplatform_modelprocessing_for_github,accelerators_workloads_models_slimdsnn,ae_aotinductor_benchmark_test,aps_,aps_deterministic_ne_tests,dper_lib_silvertorch,torchrec,torchrec_fb,deeplearning_aot_inductor]

Test Plan:
sandcastle

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//torchrec/inference/tests:test_single_gpu_executor -- --exact 'torchrec/inference/tests:test_single_gpu_executor - TorchDeployGPUTest.NestedModelSingleGPU'
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/dynamic_stats/tests:accumulators_test -- --exact 'dper_lib/silvertorch/modules/dynamic_stats/tests:accumulators_test - test_global_fixed_interval_accumulator (dper_lib.silvertorch.modules.dynamic_stats.tests.accumulators_test.GlobalFixedIntervalUnivalentAcculumatorTest)'
```

Also running a test flow with/without JK enabled

Differential Revision: D54275086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120915
Approved by: https://github.com/yanboliang
2024-03-01 04:47:13 +00:00
518a23bb03 support bool as Scalar Type in TorchScript (#113835)
Fixes #112402
Fixes #75465

From the description in #75465, the bool type should be a subtype of int, and `register_prim_ops.cpp` already supports converting from bool to int or float.
So this patch fixes bool as a Scalar in TorchScript.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113835
Approved by: https://github.com/davidberard98
2024-03-01 04:20:15 +00:00
2e84d01d05 [executorch hash update] update the pinned executorch hash (#120747)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120747
Approved by: https://github.com/pytorchbot
2024-03-01 04:02:09 +00:00
65d568680c Revert "[Dynamo] Fix inspect.getattr_static doesn't work well for torch.utils._cxx_pytree.PyTreeSpec (#120812)"
This reverts commit 1104e0798c8206e0226f2d68f6bb065645e6276f.

Reverted https://github.com/pytorch/pytorch/pull/120812 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the XLA failure test_simple_model look legit 1104e0798c ([comment](https://github.com/pytorch/pytorch/pull/120812#issuecomment-1972460001))
2024-03-01 03:53:27 +00:00
e49f31ca02 [onnxrt, dynamo] Enable custom ONNX model transforms in onnxrt dynamo backend (#120854)
A global transform list is created. All backend instances call the transform functions in that list sequentially to modify the exported ONNX model before sending it to the ORT session. For example, `record_onnx_model_transform` below is a no-op transform that only records the ONNX graphs sent to ONNXRuntime.

```python
        recorded_models = []

        def record_onnx_model_transform(onnx_model):
            # Record the ONNX model seen by the transform.
            recorded_models.append(onnx_model)

        from torch.onnx import (
            register_backend_graph_transform,
            unregister_backend_graph_transform,
        )
        # Register so that `onnxrt` backend calls it to modify ONNX model.
        register_backend_graph_transform(record_onnx_model_transform)

        def example_model(x: torch.Tensor):
            y = torch.sigmoid(x)
            z = x + y
            return z

        # During the compilation, the exported ONNX model will be
        # modified by calling `record_onnx_model_transform` before
        # sending the model to `onnxruntime.InferenceSession`.
        compiled_model = torch.compile(
            example_model,
            backend="onnxrt",
            dynamic=True,
        )
        # Now, `recorded_models` should contain one `onnx.ModelProto` representing
        # `example_model(x: torch.Tensor)`.

        # Remove the pass when not needed. If `record_onnx_model_transform` is not
        # removed, it will be applied to all models compiled by `backend="onnxrt"`.
        unregister_backend_graph_transform(record_onnx_model_transform)
```

In the future, we plan to use this mechanism to register all graph transforms, such as graph fusion and general ONNX optimization, for `onnxrt`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120854
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2024-03-01 03:24:17 +00:00
67c97a9aad fix the scale dot attention doc (#120859)
Fixes #120810

The code verifies the broadcast behavior (from the issue),
```
import torch

B = 3
S = 5
L = 7
E = 16
EV = 32
additional_batches = [2, 4]

query_shape = [B] + additional_batches + [L, E]
key_shape = [B] + additional_batches + [S, E]
value_shape = [B] + additional_batches + [S, EV]

query = torch.rand(*query_shape)
key = torch.rand(*key_shape)
value = torch.rand(*value_shape)
mask = torch.zeros((1, 1, S), dtype=torch.bool)
mask[:, :, S // 2 :] = True

# query.to("cuda")
# key.to("cuda")
# value.to("cuda")
# mask.to("cuda")

attention = torch.nn.functional.scaled_dot_product_attention(query, key, value, mask)

print(f"query shape = {query.shape}")
print(f"key shape = {key.shape}")
print(f"value shape = {value.shape}")
print(f"mask shape = {mask.shape}")
print(f"attention shape = {attention.shape}")

#in both CPU and cuda, output shape is:
# query shape = torch.Size([3, 2, 4, 7, 16])
# key shape = torch.Size([3, 2, 4, 5, 16])
# value shape = torch.Size([3, 2, 4, 5, 32])
# mask shape = torch.Size([1, 1, 5])
# attention shape = torch.Size([3, 2, 4, 7, 32])

## test add is broadcasting mask to query@(key.mT)
res = query@(key.mT)
print(res.shape)
res2 = torch.add(res, mask)
print(res2.shape)
```

At code level, in the default backend,
ab38354887/aten/src/ATen/native/transformers/attention.cpp (L735)

the add operation is broadcasting the `attn_mask` to `auto attn = at::matmul(query, key.transpose(-2, -1) * scaling_factor);`

- Changed the doc in [torch/nn/functional.py](https://github.com/pytorch/pytorch/pull/120859/files#diff-c358c214f663ba0c8b9c6846fbe0042fa29494cf02fe4714a17dcd0d268b035b).
- Also fixed a few inconsistencies in the cpp comments.

@mikaylagawarecki

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120859
Approved by: https://github.com/drisspg
2024-03-01 02:54:08 +00:00
b35551f357 Ban reset_to_zero argument to triton.autotune in user defined kernels (#120938)
Fixes #120802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120938
Approved by: https://github.com/chenyang78, https://github.com/jansel
2024-03-01 02:37:24 +00:00
06f8af30fa Change FakeTensor serialization to consider only an _active_ FakeTensor mode (#120848)
Summary: https://github.com/pytorch/pytorch/pull/108186 made some changes related to FakeTensor serialization such that saving and loading a tensor will give us a meta tensor, even if FakeTensor mode is not enabled. This means we can't properly save and load Tensors as part of Fx graph caching. This PR changes the logic to check whether there's an _active_ FakeTensor mode.

Test Plan:
* New unit tests
* Validated unit tests introduced in https://github.com/pytorch/pytorch/pull/108186 still pass
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120848
Approved by: https://github.com/eellison, https://github.com/thiagocrepaldi
2024-03-01 02:37:21 +00:00
e3dbd194f4 [dynamo] Support module backwards hooks (#120685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120685
Approved by: https://github.com/yanboliang, https://github.com/xmfan
2024-03-01 02:24:26 +00:00
9b2c35b4fe [dynamo] Fix convolution meta kernel when input channel is 0 (#120944)
Addresses https://github.com/pytorch/pytorch/issues/118797

Adding in special channel handling logic from eager (set output channels to 0 when input channels are 0):
67d3e4f2a2/aten/src/ATen/native/Convolution.cpp (L1400-L1403)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120944
Approved by: https://github.com/zou3519
2024-03-01 01:18:21 +00:00
d534a49767 Reinplace auto_functionalized (#120829)
Fixes https://github.com/pytorch/pytorch/issues/120441

We follow how triton_kernel_wrapper_functional gets re-inplaced:
- If we see auto_functionalized, then first we compute what inputs we
  actually need to clone ("tensors_to_clone") and fixup the graph. This happens in
  `reinplace_and_refine_tensors_to_clone`, which I have refactored out
  of the triton_kernel_wrapper_functional reinplacing code.
- Later on, after the reinplacing pass, we have a decomposition pass for
  auto_functionalized. In that decomposition pass, we make use of the
  "tensor_to_clone" info and only clone those inputs in the
  decomposition.
- We shepherd "tensor_to_clone" from the first step to the second step
  by setting the .meta field on the auto_functionalized node.

Test Plan:
- existing tests
- tested locally by reading the output of TORCH_LOGS="post_grad_graphs"
- added assertExpectedInline tests for the post_grad_graphs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120829
Approved by: https://github.com/oulgen
2024-03-01 00:55:19 +00:00
791f8ef350 [Composable APIs] Add composable API fully_shard deprecation warning (#120929)
`fully_shard`(https://github.com/pytorch/pytorch/blob/main/torch/distributed/_composable/fsdp/fully_shard.py) will be used by new FSDP2 and we want to add a deprecation warning to the existing composable API's `fully_shard`(https://github.com/pytorch/pytorch/blob/main/torch/distributed/_composable/fully_shard.py#L40).

Planned release schedule is as follows https://dev-discuss.pytorch.org/t/release-cadence-for-year-2023-2024/1557:

Minor Version | Release branch cut | Release date | First patch release date | Second patch release date
-- | -- | -- | -- | --
2.3 | Mar 2024 | Apr 2024 | May 2024 | Jun 2024
2.4 | May 2024 | Jul 2024 | Aug 2024 | Sep 2024
2.5 | Aug 2024 | Oct 2024 | Nov 2024 | Dec 2024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120929
Approved by: https://github.com/awgu
2024-03-01 00:55:16 +00:00
fd35aafc26 Teach dynamo about vjp (#119405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119405
Approved by: https://github.com/zou3519
ghstack dependencies: #118407
2024-03-01 00:21:10 +00:00
9d5dea7812 [DCP] Adds storage reader and planner classes for online loading/sharding of models in torch.save format (#119816)
as title

Differential Revision: [D53718041](https://our.internmc.facebook.com/intern/diff/D53718041/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119816
Approved by: https://github.com/fegin
2024-03-01 00:21:05 +00:00
33da8d5c12 Revert "Fix guard for SUPPORTED_NODES (#120798)"
This reverts commit 1b8bb027f676aa8c4260a3f6b9a5c98c37d25dc7.

Reverted https://github.com/pytorch/pytorch/pull/120798 on behalf of https://github.com/kit1980 due to the new test fails internally, see D54343456 ([comment](https://github.com/pytorch/pytorch/pull/120798#issuecomment-1972134227))
2024-02-29 23:19:22 +00:00
7ebfe21724 Fix nll loss dynamo failure (#120805)
Fix for https://github.com/pytorch/pytorch/issues/119791 Part of dynamo bug bash
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120805
Approved by: https://github.com/Skylion007, https://github.com/zou3519, https://github.com/malfet
2024-02-29 22:34:49 +00:00
d03b11ad5b Pass inductor strides forward in ddp optimizer (#120523)
Note: Returning Fake Tensors on First AOT Autograd Call

Inductor will optimize strides of outputs when it deems it profitable, for instance converting to channels last. When we split the graph here into multiple inductor compilations, we need to make sure that the output strides of one compilation are appropriately passed to the subsequent compilations. However, the mapping from inductor output to dynamo output is non-trivial due to aot_autograd's deduping, de-aliasing, mutation, re-writing, subclass handling, etc. In order to replay all this logic we set a flag such that the first invocation of inductor in aot_autograd will return fake tensors with appropriate strides. Then, all of aot autograd's runtime logic is replayed. This gives us the appropriately strided outputs here, which reflect the runtime strides.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120523
Approved by: https://github.com/yf225, https://github.com/bdhirsh
2024-02-29 22:25:00 +00:00
772db2a3ae Fix handling of torch.return_types in dynamo (#120826)
Handle quasi-namedtuples as a special case in dynamo
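For reference, a small example of the structured return type in question (my illustration, not from the PR):

```python
import torch

@torch.compile
def f(x):
    # torch.sort returns a torch.return_types.sort object, a namedtuple-like structure
    result = torch.sort(x)
    return result.values + result.indices

print(f(torch.randn(4)))
```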

Fixes #120651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120826
Approved by: https://github.com/anijain2305
2024-02-29 22:11:35 +00:00
da559c98e3 Fix isin decomp and add python meta registration (#120821)
Fixes #119792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120821
Approved by: https://github.com/malfet, https://github.com/peterbell10
2024-02-29 22:08:50 +00:00
76d3a6bb4a Revert "[C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)"
This reverts commit 381a7ad3f1cd38bf8e814ae9d275f101a2136139.

Reverted https://github.com/pytorch/pytorch/pull/120745 on behalf of https://github.com/kit1980 due to The new test fails internally, see D54343421 ([comment](https://github.com/pytorch/pytorch/pull/120745#issuecomment-1972047106))
2024-02-29 22:06:13 +00:00
e7039e3a0b [dynamo][easy] Dynamo test changes (#120927)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120927
Approved by: https://github.com/yanboliang
ghstack dependencies: #120864, #120730
2024-02-29 22:05:41 +00:00
39c092d242 Skip semi-structured-sparse on windows (#120807)
# Summary

We can see that in this job on the other PR: https://github.com/pytorch/pytorch/actions/runs/8086597674/job/22096699337?pr=120641#step:11:11272

building the SemiStructuredSparse kernel is erroring on Windows machines, so I think we should land this.

### Details

Introduced in here:  https://github.com/pytorch/pytorch/pull/120434

We don't compile it for Windows, so we should have skipped this test.

There is another PR: https://github.com/pytorch/pytorch/pull/120641
which removes this skip for Windows; if that is green we should do that, otherwise we should skip the Windows tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120807
Approved by: https://github.com/alexsamardzic, https://github.com/jcaip
2024-02-29 21:48:52 +00:00
1a1f58ffbe [rocm][cmake] retrieve rocm location from ROCM_SOURCE_DIR env if specified (#120898)
This PR allows us to build PyTorch with a rocm that is not installed
to the default location, i.e. /opt/rocm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120898
Approved by: https://github.com/jianyuh
2024-02-29 21:32:45 +00:00
b2dddcfe27 [FSDP2][DCP][DSD] Add test to ensure FSDP2 model/optim state_dict work after a full training loop (#120871)
This PR adds tests to test distributed state dict work properly for FSDP2's model and optimizer state_dict after a full training loop.

We test the combinations of these options on an evenly sharded model.
```
{
    "reshard_after_forward": [True, False],
    "optimizer_class": [torch.optim.Adam],
    "compile_model": [True, False],
},
```

Followup: 1. Add test for unevenly sharded model. 2. Add test to include `torch.optim.AdamW` (seems to have some gaps currently, still investigating)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120871
Approved by: https://github.com/fegin
2024-02-29 21:24:00 +00:00
67d3e4f2a2 [TorchElastic] Refactoring to support non-default logging strategy (#120691)
Summary:
Pulling out logging parameters into logging specs that can be overridden (follow-up changes cover a possible override mechanism)

Why?
Right now the logging approach is quite rigid:
- Requires the log directory to exist and not be empty
- Creates a tempdir otherwise
- Creates a subdir for a run
- Creates a subdir for each attempt
- Creates files named stdout.log, stderr.log, error.json

In some instances users would like to customize the behavior, including file names, based on context. We do already have a mechanism to template the multiplexed teed output prefix.

With current changes, users can create custom log spec that can use env variables to change the behavior.

Notes:
Made `LaunchConf.logs_specs` an optional field that will be bound to a `DefaultLogsSpecs` instance. There are a large number of clients (code) that use the API directly without going through the torchrun API. For those cases, we have to explicitly pass a LogsSpecs implementation if we would like to override it. For regular torchrun users, we can use the pluggable approach proposed in the follow-up change.

Test Plan: CI + unit tests

Differential Revision: D54176265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691
Approved by: https://github.com/ezyang
2024-02-29 20:59:17 +00:00
277bc97709 [FSDP2][ez] Combined communication test files (#120904)
This just combines the unit tests for the collective ops (copy-in/all-gather/copy-out and copy-in/reduce-scatter/view-out) with the unit tests for the communication schedule, mainly to avoid having too many test files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120904
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
ghstack dependencies: #120659
2024-02-29 20:36:04 +00:00
0b924d7cde Revert "[inductor] Optimize welford reduction (#120330)"
This reverts commit 7eb7ac815f0247a62b621897cea95ec4ca56d52e.

Reverted https://github.com/pytorch/pytorch/pull/120330 on behalf of https://github.com/kit1980 due to Broke internal tests, see D54230858 ([comment](https://github.com/pytorch/pytorch/pull/120330#issuecomment-1971878323))
2024-02-29 20:12:50 +00:00
0a7666801d SymIntify prod_backward (#120776)
Fixes https://github.com/pytorch/pytorch/issues/120608

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120776
Approved by: https://github.com/albanD
2024-02-29 20:05:22 +00:00
313abcdba2 [c10d] fix the unwanted reason (#120863)
Summary:
Addressing #120849. Currently c10d treats a reason as a failure, hence gives some unwanted false positive errors. This is a quick fix, but we need to revisit the error handling logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120863
Approved by: https://github.com/kwen2501
2024-02-29 19:58:11 +00:00
f94933ed42 Refine value ranges on inequalities (#120800)
This is basically done the obvious way. For better or worse, I jammed this into what used to be `_maybe_guard_eq` but now is `_maybe_guard_rel`. I was careful to test all the off by one conditions, and each permutation. Let me know if you think I missed anything. Importantly, this now works for unbacked SymInts.

While testing, I noticed we are silently duck sizing all symbolic variables in `test_dynamic_shapes.py`. This may or may not be covering up bugs.

Along the way, I had to fix a bug in export constraints, where we weren't checking that the final var_to_range was consistent with what the user requested at top level.

After I implemented all this, I realized that applying this to non-unbacked SymInts was duplicative with @ysiraichi's previous work on https://github.com/pytorch/pytorch/pull/97963 . The upside is I now understand what Yukio was trying to do in the original PR, and I think my new logic is simpler and less error prone. In Yukio's earlier diff, Yukio tried very hard to avoid changing what guards we actually issue (since this would cause tests to wobble). Thus, when he refined a range, he also saved the guard that actually caused the range to refine. In this PR, I don't bother saving these guards; instead I just tighten var_to_range directly and rely on generating guards on this to be correct. The key insight is that if I assert `x < y`, it's always safe to emit (potentially) more restrictive range guards, because this won't invalidate our guards, it will just make them a little too strong (but actually, I think we are precise along the way.) If these guards make it unnecessary to test `x < y`, because now the ranges for x and y are disjoint, this is fine, we've subsumed the x < y guard and can just not bother testing it. If I've gotten it right, TV will agree with me.

In fact, I had a bug in this PR which TV didn't catch, which is that when we have a recorded var_to_guards for a symbol, we unconditionally never generate the range guard for it, even if the var_to_guards is potentially inconsistent with var_to_range (because var_to_range was updated separately). With var_to_guards removed, I don't have to worry about this inconsistency.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120800
Approved by: https://github.com/Skylion007, https://github.com/avikchaudhuri, https://github.com/ysiraichi
2024-02-29 19:41:51 +00:00
81c4c0dda2 [functional collective] don't import torchdynamo when running torchdeploy (#120900)
Summary: Importing torchdynamo in `functional_collective_impl.py` seems to break loading of torchdeploy models.

Test Plan: CI

Differential Revision: D54355011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120900
Approved by: https://github.com/fegin
2024-02-29 19:20:54 +00:00
f7a809c96a fix dupe deprecated warning in dynamo export (#120896)
Summary:
When we convert `dynamic_shapes` to `constraints` and pass them to `_dynamo.export`, we shouldn't give a deprecation warning. Such a conversion happens when calling `torch.export.export`, for example, but it can also happen when calling `capture_pre_autograd_graph` (which itself has this deprecation warning when `constraints` are passed directly as well).

Since `_log_export_usage` is an indicator of a top-level call (it is `True` by default but set to `False`, or at least passed through, by callers), we can (ab)use it to indicate when to give this deprecation warning.

Test Plan: none

Differential Revision: D54350172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120896
Approved by: https://github.com/BoyuanFeng, https://github.com/zhxchen17
2024-02-29 18:57:42 +00:00
0290fe65bd Test TD (test removal) on crossref (#119426)
Current threshold is to cut the bottom 75% of test files, which results in 13 min of tests getting cut.
test_ops, functorch/test_ops, test_decomp, and other really long-running test files are not getting cut and make the top 25% take really long (still 90+ min).

The original plan was to test on rocm but I'm worried about queuing given that cutting 75% of test files only cuts off 13 min, and crossref is rarely referenced by others and people keep talking about getting rid of it, so it's a good alternative

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119426
Approved by: https://github.com/huydhn
2024-02-29 18:53:43 +00:00
1458f1de66 Revert "Update flash_attention kernel from 2.3.6 to 2.5.5 (#118935)"
This reverts commit 4b7a521856ca5fb0fc28edd18591f77fff5a6ba1.

Reverted https://github.com/pytorch/pytorch/pull/118935 on behalf of https://github.com/atalman due to Significantly increases build time. Optimization is needed ([comment](https://github.com/pytorch/pytorch/pull/118935#issuecomment-1971723284))
2024-02-29 18:42:21 +00:00
96eff4ef70 [inductor max autotune] Detailed autotuning result logs ( machine-readable ) (#119004)
This diff introduces a new separate logging of autotuning results,
with the intention of making the results analyzable, specifically
those for the new experimental Cutlass backend.

Results are logged as text files with one JSON document corresponding to a single benchmark result per line.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119004
Approved by: https://github.com/jansel
ghstack dependencies: #120620
2024-02-29 18:24:13 +00:00
a911eb74ae [dynamo] Graph break when faking named tensors (#120779)
Fixes #120644
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120779
Approved by: https://github.com/zou3519
2024-02-29 18:22:15 +00:00
1104e0798c [Dynamo] Fix inspect.getattr_static doesn't work well for torch.utils._cxx_pytree.PyTreeSpec (#120812)
Fixes #118793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120812
Approved by: https://github.com/zou3519
2024-02-29 18:19:14 +00:00
ca679384c2 [rocm][cmake] correctly check the ROCM_SOURCE_DIR environment (#120858)
The existing use of "if(NOT ENV{ROCM_SOURCE_DIR})" seems to be
not working correctly, e.g.

```
$ cmake --version
cmake version 3.26.4

$ cat CMakeList.txt
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(FOO)

if(NOT ENV{ROCM_SOURCE_DIR})
  message(INFO ": not defined 1")
else()
  message(INFO ": defined 1: $ENV{ROCM_SOURCE_DIR}")
endif()

if("$ENV{ROCM_SOURCE_DIR}" STREQUAL "")
  message(INFO ": not defined 2")
else()
  message(INFO ": defined 2: $ENV{ROCM_SOURCE_DIR}")
endif()
$ ROCM_SOURCE_DIR=/tmp cmake .
INFO: not defined 1
INFO: defined 2: /tmp
-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /home/yangche/tmp/tmp
```

This PR replaces it with a STREQUAL check. Note that the choice
of STREQUAL is to avoid cases like:

```
$ ROCM_SOURCE_DIR= cmake .
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120858
Approved by: https://github.com/jianyuh, https://github.com/jeffdaily
2024-02-29 17:49:00 +00:00
9e016debeb [dynamo] Fix inference_mode context variable (#120830)
<idk what im doing>
Fixes #120646

The module for torch.inference_mode should be torch

The input to `create` is a bool (mode?) and `_enter_inference_mode` expects a bool but [BlockStackEntry](50073248ed/torch/_dynamo/symbolic_convert.py (L206)) expects `target_values` to be a list?
[inference_mode](50073248ed/torch/autograd/grad_mode.py (L205))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120830
Approved by: https://github.com/zou3519, https://github.com/anijain2305, https://github.com/tugsbayasgalan
2024-02-29 17:10:06 +00:00
98c4ba683e [EZ][BE] Fix ResourceWarning (#120886)
By closing the file handle

Fixes
```
/Users/nshulga/git/pytorch/pytorch/test/quantization/core/test_docs.py:132: ResourceWarning: unclosed file <_io.TextIOWrapper name='/Users/nshulga/git/pytorch/pytorch/docs/source/quantization.rst' mode='r' encoding='UTF-8'>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120886
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/Skylion007
2024-02-29 17:07:39 +00:00
664dd61b29 Add some more symbolic shapes related files to ciflow/inductor (#120887)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120887
Approved by: https://github.com/janeyx99, https://github.com/malfet
2024-02-29 16:59:32 +00:00
558316b5f4 Emit grid wrapper inlined with the user defined triton kernel (#120824)
Fixes #120801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120824
Approved by: https://github.com/chenyang78, https://github.com/jansel
ghstack dependencies: #120809
2024-02-29 16:17:45 +00:00
84e2accd6c Make triton_meta be part of user defined triton kernel cache (#120809)
Tensors with different shapes will generate different triton meta (divisibility rules), so we need this to be part of the cache key.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120809
Approved by: https://github.com/chenyang78, https://github.com/jansel
2024-02-29 16:17:45 +00:00
342e7929b8 [export] kill deprecated constraints API (#120860)
Summary:
Previously `export` would take `constraints` built with `dynamic_dim(...)`s. This has been deprecated for a while; one can now pass in a `dynamic_shapes` spec built with `Dim(...)`s.
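For reference, a minimal example of the `Dim`-based `dynamic_shapes` spec that replaces the old `dynamic_dim` constraints (the module and shapes are illustrative):

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x.sum(dim=-1)

batch = Dim("batch")  # replaces dynamic_dim(x, 0)-style constraints
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: batch}})
print(ep.graph)
```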

Here we kill this deprecated API. Eventually this will lead to simplification of the underlying implementation, since the new `Dim`-based specs can map 1-1 with symbolic shapes concepts without going through indirect machinery of `dynamic_dim`-based constraints. It is expected that internal APIs like `_dynamo.export` and `_trace._export_to_torch_ir` will change when that happens.

Leaving `aot_compile` and `capture_pre_autograd_graph` entry points alone for now. This will eventually be updated anyway.

Test Plan: updated tests

Differential Revision: D54339703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120860
Approved by: https://github.com/suo, https://github.com/tugsbayasgalan
2024-02-29 16:15:50 +00:00
3cfed01228 [AOTI] Store OpOverload in ir.ExternKernel (#120629)
Summary: Currently the logic for filling in the default value for optional arguments is scattered in several places. By storing OpOverload in the base ExternKernel class, we can simplify codegen_kwargs; this is a preparation step for enabling the torchgen-ed C shim. The default value filling logic for FallbackKernel can also be simplified, but that can come later.

Differential Revision: [D54258089](https://our.internmc.facebook.com/intern/diff/D54258089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120629
Approved by: https://github.com/chenyang78
ghstack dependencies: #119987, #120592
2024-02-29 15:51:33 +00:00
fa7241ed79 [AOTI] Change the cpp wrapper codegen for sdpa (#120592)
Summary: Switch codegen for sdpa to always point to v2 in the C shim. Since aoti_torch__scaled_dot_product_flash_attention_v2 has been introduced for a while, there shouldn't be any FC issue in production.

Differential Revision: [D54258090](https://our.internmc.facebook.com/intern/diff/D54258090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120592
Approved by: https://github.com/chenyang78
ghstack dependencies: #119987
2024-02-29 15:49:23 +00:00
52e3c78a43 [AOTI][refactor] Move a few util functions in aoti_torch (#119987)
Summary: Move these util functions from an anonymous namespace to a common header so that later torchgen-ed files can use them.

Differential Revision: [D54258088](https://our.internmc.facebook.com/intern/diff/D54258088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119987
Approved by: https://github.com/chenyang78
2024-02-29 15:46:47 +00:00
5b9e5f854b [profiler] Log process group id instead of backend id (#120475)
Summary:
https://github.com/pytorch/pytorch/pull/104373 introduced backend_id
> an unique ID for the actual backend object, this is also exposed in record_param_comms, so we can correlate these collectives with the right backend object.

However, it is inconvenient to correlate collectives with backend id. Instead, using pg id(uid) to correlate directly is a better solution.
This PR changes the ID information exposed in record_param_comms from backend_id to pg_id.

Differential Revision: D53558257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120475
Approved by: https://github.com/aaronenyeshi
2024-02-29 15:04:33 +00:00
576c0482a5 Remove hard numpy dependency from guards.py (#119519)
I'm not sure if this is the ideal behavior / best fix for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119519
Approved by: https://github.com/albanD
2024-02-29 14:37:33 +00:00
5db5049b34 Move TRITON_CONSTRAINT setting to common binary_populate_env.sh, BE - Cleanup unused build scripts (#120744)
1. This moves TRITON_CONSTRAINT to the common binary_populate_env.sh so that it is set for all wheels.
Test in CI via the ``ciflow/binaries`` label. Please note we only set this constraint when PYTORCH_EXTRA_INSTALL_REQUIREMENTS is set, and that variable is set for all the wheels that get uploaded to PyPI. Hence the Triton constraint needs to be set in the same place.
This is done for regular wheels and ROCm wheels separately, since ROCm wheels use a different Triton package.

3. Cleanup legacy unused code
Test:
``
git grep setup_linux_system_environment.sh
``

Needs: https://github.com/pytorch/builder/pull/1712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120744
Approved by: https://github.com/huydhn
2024-02-29 14:25:34 +00:00
f988f649be [IntraNodeComm] accept P2P buffer size as constructor argument (#120856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120856
Approved by: https://github.com/wanchaol
ghstack dependencies: #120855
2024-02-29 11:43:52 +00:00
22b5548f5d [IntraNodeComm] refactor all_reduce variants as private methods (#120855)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120855
Approved by: https://github.com/Chillee, https://github.com/wanchaol
2024-02-29 11:43:52 +00:00
96793e0f10 [ROCm] enable scaled_gemm (#117822)
scaled_gemm for ROCm using hipblaslt.  As of ROCm 6.0, HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER is not supported.  A work-around is provided, performing the absmax operation on the output buffer, but this results in some loss of accuracy for the absmax result.  For this reason the feature should be considered beta/preview.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117822
Approved by: https://github.com/jianyuh, https://github.com/xw285cornell
2024-02-29 10:20:48 +00:00
09aefe1502 Fix ouput typos (#120870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120870
Approved by: https://github.com/clee2000
2024-02-29 08:29:14 +00:00
14c5ebc8a1 [Dynamo] Do not attempt to make nditer spawned arrays writable (#120868)
As they are not writable, converting `numpy.nditer`-spawned arrays to writable is too expensive, and the tensor values are copied anyway

Minimal reproducer:
```python
import numpy as np
import torch

@torch.compile
def f(x):
    return x + 1.0

for x in np.nditer(np.arange(3)):
    print(f(x))
```

Fixes https://github.com/pytorch/pytorch/issues/119787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120868
Approved by: https://github.com/jansel
2024-02-29 07:49:59 +00:00
169c220bf8 [torch.compile] Provide capability to register callback on compile start/stop (#120764)
This is a requirement from Meta internal cases, where people want to register a callback function to detect if a job is stuck during compilation.
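
A hedged usage sketch; the decorator names below (`torch._dynamo.on_compile_start` / `on_compile_end`) are assumed from this PR's feature description and may differ from the final API:

```python
import torch

@torch._dynamo.on_compile_start
def _on_start():
    print("dynamo compilation started")

@torch._dynamo.on_compile_end
def _on_end():
    print("dynamo compilation finished")

torch.compile(lambda x: x + 1)(torch.randn(4))
```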

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120764
Approved by: https://github.com/jansel
2024-02-29 07:37:52 +00:00
82cbd9b131 [dynamo][guards-cpp-refactor] PythonLambdaGuardAccessor (#120730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120730
Approved by: https://github.com/jansel
ghstack dependencies: #120864
2024-02-29 07:25:13 +00:00
66d05a8900 [dynamo] Fix source for default dict default_factory (#120864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120864
Approved by: https://github.com/yanboliang, https://github.com/Skylion007, https://github.com/jansel
2024-02-29 07:25:13 +00:00
df1e855313 [fake_impls] fix max_seqlen return values in efficient_attention_forward (#120842)
To match the actual implementation, we should return the max_seqlen_q/k, not M, N, when in the sparse case

7e185277cd/aten/src/ATen/native/transformers/cuda/attention.cu (L981-L996)

Note that although the .cu file sets max_seqlen_k = 0 in the sparse case, it actually returns max_seqlen_k or N:

7e185277cd/aten/src/ATen/native/transformers/cuda/attention.cu (L1224-L1231)

Tests - added in the next PR (#102839, which also fixes other parts of the test_fake tests so that we can un-xfail them and actually run the tests)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120842
Approved by: https://github.com/YuqingJ
ghstack dependencies: #120682
2024-02-29 07:12:27 +00:00
eqy
d1d50d2e4c [Inductor][cuDNN] Disable tf32 in test_mutate_base_for_conv_output (#120867)
Looks like there is a sum? comparison where TF32 may not provide the necessary accuracy, leading to failures on sm86.

CC @Skylion007 , hopefully this unblocks #120642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120867
Approved by: https://github.com/Skylion007
2024-02-29 06:59:32 +00:00
cyy
8a42cff7b1 [DeviceIndex][7/N] Use DeviceIndex in XPU (#120576)
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120576
Approved by: https://github.com/guangyey, https://github.com/Skylion007
2024-02-29 05:54:23 +00:00
4b18ab869f [torch.export] Support is_compiling() flag for non-strict mode (#119602)
Summary: In non-strict mode of torch.export() we didn't set those `is_compiling()` flags to `True`, which is needed by some models.
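
A minimal sketch of the behavior being fixed; whether the flag is queried via `torch.compiler.is_compiling()` or one of the internal `is_compiling()` helpers is an assumption here:

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        # Should report True while torch.export traces the model, even in
        # non-strict mode after this change.
        if torch.compiler.is_compiling():
            return x + 1
        return x - 1

ep = torch.export.export(M(), (torch.randn(3),), strict=False)
```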

Test Plan: Unit tests and manual testing.

Differential Revision: D53624452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119602
Approved by: https://github.com/suo
2024-02-29 05:52:51 +00:00
0a46102b37 Add equal_to_1 to triton_meta for user-written Triton kernels (#120579)
Summary: Previously, we omitted `equal_to_1` from the `triton_meta` part of the `@user_autotune` decorator. For user-written Triton kernels, this could lead to perf regressions, as the kernel in the Inductor codegen is compiled without `equal_to_1` specialization.
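
To make the scenario concrete, a hedged sketch (not the repro from the issue) of a user-written Triton kernel where an integer argument happens to be 1, which is exactly what Triton specializes via `equal_to_1`:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def copy_strided(x_ptr, out_ptr, stride, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs * stride, mask=mask)
    tl.store(out_ptr + offs, x, mask=mask)

def copy(x):
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    # For a contiguous 1D tensor stride(0) == 1, so Triton normally specializes
    # this argument; the fix keeps that specialization when the kernel is
    # launched from Inductor-generated code.
    copy_strided[grid](x, out, x.stride(0), x.numel(), BLOCK=1024)
    return out
```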

Fixes #120478. The repro from the issue, on A100:

Before this PR:

```
Triton matmul:           0.0167 seconds
Triton matmul compiled:  0.0751 seconds
```

After this PR:

```
Triton matmul:           0.0168 seconds
Triton matmul compiled:  0.0072 seconds
```

Test Plan:

```
$ python test/dynamo/test_triton_kernels.py -k  test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 3 tests in 3.545s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120579
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/chenyang78
2024-02-29 05:19:39 +00:00
4407138bf6 [inductor][eazy] fix a typo in test (#120832)
In theory we can test anything, but the test name mentions attention, so we should multiply by the inv_scale rather than divide by it. I guess that was the initial intention of the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120832
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-02-29 05:04:04 +00:00
2d17230212 [inductor] Do not reuse buffers across scopes in mem planning (#120777)
Summary: Previously, in `memory_plan_reuse` we assumed that the generated code is flat, in the sense that it can't have nested scopes. However, with nested control flow codegen-ing, this is no longer the case. This caused bugs where buffers were reused across the visibility boundaries of different nested scopes.

In this PR, we add nested planning states in `memory_plan_reuse` on entering and exiting a scope in the codegen. This restricts buffer reusability to the currently active (peak) scope / planning state.

Test Plan:

```
python test/inductor/test_control_flow.py -k test_subgraphs_with_parameters
...
----------------------------------------------------------------------
Ran 27 tests in 149.413s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120777
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #120665
2024-02-29 03:52:02 +00:00
f5b99976ad [C10D] Make _set_pg_timeout work with DeviceMesh PG (#120850)
Fixes #120847

Makes _set_pg_timeout work on nccl and/or gloo backends instead of working only on one backend (gloo) in cases where both backends exist for the group.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120850
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2024-02-29 03:41:15 +00:00
26d6ddc232 [bug burndown]Fix #119784 (#120846)
Addresses https://github.com/pytorch/pytorch/issues/119784. Interestingly, the tests seem to just pass (yay!). Tested locally that the failing set of tests passes using `PYTORCH_TEST_WITH_DYNAMO=1 pytest functorch/test_vmap.py -v`

Will wait for CI to pass first before bugging people for reviews.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120846
Approved by: https://github.com/Skylion007
2024-02-29 03:30:40 +00:00
fad228c7cc Fix a potential race condition in the test decorators for enabling/disabling native funcol (#120833)
Previously, we parametrized some tests to run with both native and py funcol by flipping a global variable. However, some of these tests are multi-threaded tests, and the parametrization mechanism could lead to race conditions.

This PR changes the mechanism to use `mock.patch`, which is applied on a per-thread basis.
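
A minimal, self-contained sketch of the pattern; the flag holder and test body below are made up for illustration:

```python
from unittest import mock
import types

# Stand-in for the real module-level flag that used to be flipped globally.
funcol_config = types.SimpleNamespace(USE_NATIVE_FUNCOL=False)

def run_test_with_native_funcol(test_fn):
    # Scoped via mock.patch where the test runs, instead of mutating a shared
    # global that racing threads could observe mid-change.
    with mock.patch.object(funcol_config, "USE_NATIVE_FUNCOL", True):
        return test_fn()

print(run_test_with_native_funcol(lambda: funcol_config.USE_NATIVE_FUNCOL))  # True
print(funcol_config.USE_NATIVE_FUNCOL)  # False again outside the patch
```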

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120833
Approved by: https://github.com/wconstab
2024-02-29 03:19:44 +00:00
2c0c70f763 [Dynamo] enumerate imported names for eval_frame.py (#120778)
Fixes https://github.com/pytorch/pytorch/issues/120699 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120778
Approved by: https://github.com/Skylion007
2024-02-29 03:08:43 +00:00
ef9e89984c [pytorch] Support output types that are non tensors (#120804)
Summary:
per title
This is needed because some modules return None and non tensors as output

Test Plan: sandcastle?

Reviewed By: zhxchen17

Differential Revision: D54311609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120804
Approved by: https://github.com/zhxchen17
2024-02-29 02:49:10 +00:00
0dbef1618f [inductor] Apply fx passes recursively to nested subgraphs (#120665)
Summary: The current machinery of Inductor's `compile_fx` assumes that the incoming fx graph is flat. As a result, everything before `graph.run` is applied to the outermost graph. This assumption was valid before #119759, but now there is control flow bringing (arbitrarily deeply) nested fx subgraphs to `compile_fx`.

In this PR, we start extending the `compile_fx` machinery to deal with nested fx subgraphs. Namely, we recursively apply Inductor's `pre_grad`, `joint_graph`, and `post_grad` passes to the nested subgraphs in the incoming fx graph.

For the recursive application of the `pre_grad` passes (which require example inputs per subgraph), we don't pass example inputs for the nested subgraphs. A few different attempts to infer the latter via fake tensor prop have led to different side effects in the model. Therefore, to the nested subgraphs, we only apply a subset of `pre_grad` passes that doesn't require example inputs.
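
For context, a hedged sketch of the kind of user code that produces nested fx subgraphs (using `torch.cond`; the exact spelling of the control-flow op is an assumption here):

```python
import torch

def inner(x):
    return torch.cond(x.sum() > 0, lambda x: x + 1, lambda x: x - 1, (x,))

def outer(x):
    # Nested control flow: the true branch itself contains a cond, so Inductor
    # sees subgraphs within subgraphs and must apply its fx passes recursively.
    return torch.cond(x.mean() > 0, inner, lambda x: x * 2, (x,))

compiled = torch.compile(outer)
print(compiled(torch.randn(8)))
```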

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 26 tests in 59.252s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120665
Approved by: https://github.com/eellison
2024-02-29 02:34:54 +00:00
db1cc781db Revert "[dynamo] Function => FunctionCtx for placeholder obj (#120577)"
This reverts commit ee01d0807b924874a329be78c6ee880f556645db.

Reverted https://github.com/pytorch/pytorch/pull/120577 on behalf of https://github.com/jansel due to Causing breakages internally ([comment](https://github.com/pytorch/pytorch/pull/120577#issuecomment-1970254363))
2024-02-29 01:56:09 +00:00
b2e4b621cc Reduce create_env log level to DEBUG (#120772)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120772
Approved by: https://github.com/albanD
2024-02-29 01:33:16 +00:00
9e0631cc8a get CommsDebugMode to work with DTensor (#118769)
Tested with Wanchao's repro:
```
from typing import Tuple, List, Dict, cast
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, DTensor, Shard, Placement, Replicate

mesh = init_device_mesh(device_type="cuda", mesh_shape=(2,))
x = torch.randn(4, 8, requires_grad=True)
y = torch.randn(4, 32, requires_grad=True)
x_dtensor = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)
y_dtensor = DTensor.from_local(y, mesh, [Shard(0)], run_check=False)
from torch.distributed._tensor.debug import CommDebugMode
comm_mode = CommDebugMode()
with comm_mode:
    z = torch.mm(x_dtensor, y_dtensor)
print(comm_mode.get_comm_counts())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118769
Approved by: https://github.com/wanchaol
2024-02-29 01:11:05 +00:00
381a7ad3f1 [C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120745
Approved by: https://github.com/zdevito
ghstack dependencies: #120724, #120270
2024-02-29 01:03:31 +00:00
f85d3a022c [C10D] Fix pointToPoint op Flight Recording (#120270)
Fix and test issues with both coalesced and individual send/recv ops

Considered an alternate approach and then ditched it
 - alternate approach: #119757
 - reason ditched: prefer recording individual collective events inside
   coalescing region instead of just the event at the end of the region,
   which also would not have tensor sizes or opnames without additional
   state variables added

Another approach also ditched
- record events on workEnqueue instead of initWork
- reason ditched: too messy to get input/output shapes tagged on
  recording when recording in workEnqueue.  Adding the info onto the
  Work obj would be possible, but adds to overhead of copying Works
  which we do on every collective. We can get info off the input/output
  tensors directly in initWork, but we don't want to keep refs to those
  tensors alive while the work is Enqueued, so we'd have to specifically
  copy size lists or something.

This PR instead avoids creating a work inside pointToPoint when
coalescing is active. Instead, only at endCoalescing() is a work finally
initialized and enqueued.  But it adds a record() call inside
pointToPoint() instead of creating a work, during coalescing. This
record() call picks up tensor shapes and op names.

It ALSO changes initWork to accept a 'record' argument. This defaults to
false, and should only be set to true if the caller ensures the work
will be enqueued by workEnqueue, ensuring its cuda events are live when
used by flight recorder's update_state().

The testing uncovers some odd pre-existing behavior and leaves them
alone for now. We could change some of these
- seq starts off at 1, not 0 for the first op (but this is inconsistent)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120270
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #120724
2024-02-29 01:03:31 +00:00
7f4d673885 [C10D] Add record_id to flight recorder (#120724)
In cases where sequence number is shared between events (e.g. coalesced
collectives) we want to ensure a unique (and ordered) ID per record.

Note: the records are already in a list, so their ID could be implicitly
observed.  But (1) it's a ring buffer, so absolute ID is lost once the
buffer rolls over once, (2) users may sort or process or filter their
flight records, so having the ID be an explicit member of an entry is
still useful

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120724
Approved by: https://github.com/zdevito
2024-02-29 01:03:31 +00:00
950b484356 skip three pyhpc models with dynamic shape test (#120599)
As reported in https://github.com/pytorch/pytorch/issues/119434, `pyhpc_isoneutral_mixing`, `pyhpc_equation_of_state` and `pyhpc_turbulent_kinetic_energy` failed with dynamic shape testing; we propose to skip the dynamic batch size testing of these 3 models in this PR.

* Error msg is
```
  File "/localdisk/leslie/torch_inductor_community/pytorch/benchmarks/dynamo/common.py", line 3879, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 1048576
```

* Root Cause is
  *  The benchmark code will only annotate an input's dim as dynamic when its size equals the batch size c617e7b407/benchmarks/dynamo/common.py (L3867-L3871). If it fails to find any dim equal to the batch size, the above error is thrown.
  * However, for these 3 models, none of the inputs' dims will equal the input batch size, given the [relationship of dim sizes](26b85eadde/torchbenchmark/models/pyhpc_equation_of_state/__init__.py (L12-L16))
  ```
    shape = (
        math.ceil(2 * size ** (1/3)),
        math.ceil(2 * size ** (1/3)),
        math.ceil(0.25 * size ** (1/3)),
    )
  ```
  * Another thing is that `pyhpc_isoneutral_mixing` and `pyhpc_equation_of_state` can pass the dynamic batch size accuracy testing, because the batch size has been set to 4 in accuracy testing (c617e7b407/benchmarks/dynamo/common.py (L3456)) and `math.ceil(2 * size ** (1/3))` happens to equal 4.

* Since the input dim sizes have the above relationship, running these models with dynamic shapes would require annotating `dim[0](s0) = dim[2](s1) * 8`; per the discussion in https://github.com/pytorch/pytorch/issues/117477#issuecomment-1897108756 @avikchaudhuri, it looks like this is not currently expressible. So, I think we need to skip the dynamic batch size testing for these 3 models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120599
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-02-29 00:38:06 +00:00
3179107629 [DDP][PT2D] Ignore gradient sync if the gradient is not defined (#120419)
From the test, accum_grad_hook can still be fired even if the gradient is None. We need to ignore the gradient sync for this case.

Differential Revision: [D54076485](https://our.internmc.facebook.com/intern/diff/D54076485/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120419
Approved by: https://github.com/yf225, https://github.com/XilunWu
2024-02-29 00:27:54 +00:00
ab38354887 Allow str inputs in non-strict tracing (#120536)
Previously, torch.export in non-strict mode was failing on str inputs while creating fake inputs for tracing (fakify()), and using graph nodes to create constraints. This fixes those 2 stages to allow strs to pass.

Failing test case:
```
class Foo(torch.nn.Module):
            def forward(self, a, b, mode):
                return torch.div(a, b, rounding_mode=mode)

        foo = Foo()
        inps = (torch.randn(4, 4), torch.randn(4), "trunc")
        exported = export(foo, inps)
        with self.assertRaisesRegex(
            RuntimeError, "to be equal to trunc, but got floor"
        ):
            _ = exported.module()(torch.randn(4, 4), torch.randn(4), "floor")
        self.assertTrue(torch.allclose(exported.module()(*inps), foo(*inps)))
```

Before:
```
(pytorch-local) pianpwk@pianpwk-mbp pytorch % python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str
E
======================================================================
ERROR: test_runtime_assert_for_prm_str_non_strict (__main__.NonStrictExportTestExport.test_runtime_assert_for_prm_str_non_strict)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pianpwk/Documents/pytorch/torch/testing/_internal/common_utils.py", line 2744, in wrapper
    method(*args, **kwargs)
  File "/Users/pianpwk/Documents/pytorch/test/export/testing.py", line 40, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/test/export/test_export.py", line 1588, in test_runtime_assert_for_prm_str
    exported = export(foo, inps)
               ^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/test/export/test_export_nonstrict.py", line 16, in mocked_non_strict_export
    return export(*args, **kwargs, strict=False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/__init__.py", line 186, in export
    return _export(
           ^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 541, in wrapper
    raise e
  File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 527, in wrapper
    ep = fn(*args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/exported_program.py", line 83, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 707, in _export
    ) = make_fake_inputs(f, args, kwargs, constraints)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 133, in make_fake_inputs
    fake_args, fake_kwargs = tree_map_with_path(
                             ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 1519, in tree_map_with_path
    return treespec.unflatten(func(*xs) for xs in zip(*all_keypath_leaves))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 734, in unflatten
    leaves = list(leaves)
             ^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 1519, in <genexpr>
    return treespec.unflatten(func(*xs) for xs in zip(*all_keypath_leaves))
                              ^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 134, in <lambda>
    lambda kp, val: fakify(fake_mode, kp, val, t_constraints, sources),
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 68, in fakify
    raise ValueError("Only tensors allowed as input")
ValueError: Only tensors allowed as input

To execute this test, run the following from the base repo dir:
     python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str_non_strict

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.008s

FAILED (errors=1)
```

After:
```
(pytorch-local) pianpwk@pianpwk-mbp pytorch % python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str
.
----------------------------------------------------------------------
Ran 1 test in 0.237s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120536
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17, https://github.com/avikchaudhuri, https://github.com/gmagogsfm
2024-02-28 23:56:30 +00:00
1b8bb027f6 Fix guard for SUPPORTED_NODES (#120798)
The special-case code for handling SUPPORTED_NODES was producing a guard that looked like:
```
"G['torch'].utils._pytree.SUPPORTED_NODES[<class '__main__.CausalLMOutputWithPast'>].type"
```
resulting in an eval error when trying to evaluate the guard.

This change adds a new source type (`ClassSource`) which is given a class type (in this case `CausalLMOutputWithPast`) and attempts to fetch it from its defining module.  It then uses that to build the `SUPPORTED_NODES` guards instead of referring to the type directly.
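
For illustration, a hedged sketch of how a user-defined class ends up in `SUPPORTED_NODES` in the first place (pytree registration), which is what the generated guard refers to; the class here is a stand-in for the one from the repro:

```python
import torch
import torch.utils._pytree as pytree
from dataclasses import dataclass

@dataclass
class CausalLMOutputWithPast:  # stand-in for the class from the issue
    logits: torch.Tensor

pytree.register_pytree_node(
    CausalLMOutputWithPast,
    lambda out: ((out.logits,), None),                 # flatten: children, context
    lambda children, _: CausalLMOutputWithPast(*children),  # unflatten
)

@torch.compile
def f(out: CausalLMOutputWithPast):
    return out.logits + 1

print(f(CausalLMOutputWithPast(torch.randn(3))))
```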

Also added a unit test which fails before this change and passes after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120798
Approved by: https://github.com/anijain2305
2024-02-28 23:34:17 +00:00
aa36821615 [Memory Snapshot] Stop clearing history when changing context (#120436)
Summary:
This change will avoid clearing the memory event history, when changing the context from `record_memory_history(context=None)` to `record_memory_history(context="python")`.

Now it will continue recording memory events with changing context on the fly. Only `record_memory_history(enabled=None)` will clear the history.
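
A hedged sketch of the workflow this enables, using the underscore-prefixed CUDA memory APIs (the wrapper names in the test plan below are internal helpers):

```python
import torch

# Start recording allocation events without call-stack context.
torch.cuda.memory._record_memory_history(context=None, stacks="python")
# ... run a few warm-up iterations ...

# Switch context on the fly; previously this cleared the history, now the
# earlier events are kept and later ones gain python call stacks.
torch.cuda.memory._record_memory_history(context="all", stacks="python")
# ... run more iterations ...

torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
```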

Test Plan:
# Ran on the following local Resnet50 example:

- At iteration=0, record_memory_history(context=None, stacks="python")
- At iteration=3, record_memory_history(context="all", stacks="python")
- After iteration=4, export_memory_snapshot()

## Before:
 - Only collects the last 2 iterations with python call stacks.
![image](https://github.com/pytorch/pytorch/assets/17602366/86154532-9f73-4d10-9194-19e8c96ee4f3)

## After:
 - Collects all 5 iterations, where first 3 iterations have no call stacks, and last 2 iterations have python call stacks.
![image](https://github.com/pytorch/pytorch/assets/17602366/c2c277d6-b400-4da2-85c8-a7f119d409f8)
![image](https://github.com/pytorch/pytorch/assets/17602366/dc9da2f8-41cc-44b0-9c32-ec3cbe79d2c4)

Differential Revision: D54084017

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120436
Approved by: https://github.com/zdevito, https://github.com/leitian
2024-02-28 22:46:26 +00:00
86ff31c4a0 Revert "Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)"
This reverts commit cabc09a5f259f1cc1e3bad1d80b5e5274838bced.

Reverted https://github.com/pytorch/pytorch/pull/120455 on behalf of https://github.com/izaitsevfb due to breaks xla jobs ([comment](https://github.com/pytorch/pytorch/pull/120455#issuecomment-1970026100))
2024-02-28 22:30:18 +00:00
dbe0967a0a Revert "Add test to check that COW inputs are not materialized (#119507)"
This reverts commit 2ebf2c88baa4667d55eda92f4c8424db505af781.

Reverted https://github.com/pytorch/pytorch/pull/119507 on behalf of https://github.com/izaitsevfb due to breaks xla jobs ([comment](https://github.com/pytorch/pytorch/pull/119507#issuecomment-1970022840))
2024-02-28 22:26:59 +00:00
7e185277cd [cuDNN] bump cuDNN-frontend submodule to 1.1.2 (#120761)
Hopefully resolves additional `CUDNN_STATUS_SUCCESS` failures that we have been seeing on H100 (though curiously not on upstream CI, perhaps due to the different hardware being tested)

Need to confirm the fix on our end before merging

CC @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120761
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia
2024-02-28 22:15:43 +00:00
9c9bde515c Factor out Submod compilers (#120527)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120527
Approved by: https://github.com/kadeng
2024-02-28 22:11:47 +00:00
5b5bcf0470 Test that tlparse understands the structured logs we output (#120658)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120658
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #120712, #120289
2024-02-28 21:58:39 +00:00
d6c202975c Move attention kernels from meta_registrations to fake_impls (#120682)
This PR is mostly just code movement to make the code review easier - AFAIK it should not change any functionality. The final goal is to remove the xfails for some of the test_fake opinfos for these ops. The opinfos are failing because the outputs can have mixed devices - we need to move them to fake_impls first before we can support mixed device returns.

This PR:
* Move the `_meta_registrations.py` implementations to `fake_impls.py`
* Change the function signature from taking explicit named variables to taking `{args, kwargs}` and normalizing them
* Wrap all the returned tensors in FakeTensors

Tests: relying on opinfos. I also checked `test_fake_*` for these tests (by removing x-fails and patching things until they passed) to verify general correctness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120682
Approved by: https://github.com/drisspg
2024-02-28 21:49:13 +00:00
50073248ed add a note wrt torch.nn.functional.scaled_dot_product_attention (#120668)
Follow-up change to https://github.com/pytorch/pytorch/pull/120565

- Added a note in the transformer class pointing out that the mask definition is opposite to that of :attr:`attn_mask` in torch.nn.functional.scaled_dot_product_attention.
@mikaylagawarecki
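
A small sketch of the convention difference the note documents (boolean masks only): for scaled_dot_product_attention, True means "may attend", while nn.Transformer-style masks use True to mean "masked out".

```python
import torch
import torch.nn.functional as F

L = 4
keep = torch.ones(L, L, dtype=torch.bool).tril()   # SDPA convention: True = attend
transformer_mask = ~keep                            # nn.Transformer convention: True = masked out

q = k = v = torch.randn(1, 1, L, 8)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=keep)
```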

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120668
Approved by: https://github.com/mikaylagawarecki
2024-02-28 21:16:34 +00:00
e2ee87d48b Fix segfault on mac when running vulkan tests (#120337)
Summary: Vulkan gtests were segfaulting on mac because the memory for barriers can get destroyed after the local function (CommandBuffer::insert_barrier) where it is created exits. Since we provide this barrier pointer to the Vulkan library, it needs to stay alive even after the function exits, else we get crashes.

Test Plan:
See that there is no segfault on mac with the fix and the tests can run:

Compile gtests:
buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"

Crash w/o diff
bash-3.2$ buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 85 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 85 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.uniform_buffer_copy
[       OK ] VulkanAPITest.uniform_buffer_copy (88 ms)
[ RUN      ] VulkanAPITest.copy_to_buffer
Segmentation fault: 11

With diff there is no crash:
bash-3.2$ buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 85 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 85 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.uniform_buffer_copy
[       OK ] VulkanAPITest.uniform_buffer_copy (296 ms)
.....
[  FAILED  ] VulkanAPITest.gelu_quint8_self (23 ms)
[----------] 85 tests from VulkanAPITest (1494 ms total)

[----------] Global test environment tear-down
[==========] 85 tests from 1 test suite ran. (1494 ms total)
[  PASSED  ] 72 tests.
[  FAILED  ] 13 tests, listed below:
[  FAILED  ] VulkanAPITest.linear_2d_flat
[  FAILED  ] VulkanAPITest.linear_2d_small
[  FAILED  ] VulkanAPITest.linear_2d_large
[  FAILED  ] VulkanAPITest.linear_3d_flat
[  FAILED  ] VulkanAPITest.linear_3d_small
[  FAILED  ] VulkanAPITest.linear_3d_large
[  FAILED  ] VulkanAPITest.linear_4d_flat
[  FAILED  ] VulkanAPITest.linear_4d_small
[  FAILED  ] VulkanAPITest.linear_4d_large
[  FAILED  ] VulkanAPITest.gelu_qint8
[  FAILED  ] VulkanAPITest.gelu_qint8_self
[  FAILED  ] VulkanAPITest.gelu_quint8
[  FAILED  ] VulkanAPITest.gelu_quint8_self

The above failing tests were failing before as well and are being worked on.

Differential Revision: D54023146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120337
Approved by: https://github.com/SS-JIA
2024-02-28 20:55:47 +00:00
e317e39a02 Fix nonlinearity arg issue in RNN (#120234)
Fixes #114617

This PR fixes the issue with `nonlinearity`, so that it can be passed as an arg or a kwarg.
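
A quick illustration of what the fix allows (both spellings construct the same module):

```python
import torch

# nonlinearity is the 4th positional parameter of torch.nn.RNN
rnn_pos = torch.nn.RNN(10, 20, 2, "relu")
rnn_kw = torch.nn.RNN(10, 20, 2, nonlinearity="relu")
```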

Alternatively, if making `nonlinearity` kwarg-only is preferred, I can revert to another commit. cc @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120234
Approved by: https://github.com/mikaylagawarecki
2024-02-28 20:53:18 +00:00
8b22fe9594 [FX passes] Set group/batch fusion log to DEBUG level (#120780)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120780
Approved by: https://github.com/jackiexu1992
2024-02-28 20:48:11 +00:00
4903e33e19 Revert "Capture non tensor arguments in record_function (#120017)"
This reverts commit 5c5b71b6eebae76d744261715231093e62f0d090.

Reverted https://github.com/pytorch/pytorch/pull/120017 on behalf of https://github.com/soulitzer due to regresses perf on autograd Function when using profiler ([comment](https://github.com/pytorch/pytorch/pull/120017#issuecomment-1969883792))
2024-02-28 20:43:33 +00:00
01ec8df6d8 [Compiled Autograd] Introduce BackwardState capture (#120382)
This adds support for backwards hooks that are *both*:
1) Interior to the graph; and
2) Dynamically generated (e.g. lambdas)

We do this by creating a BackwardState object that is used to register the hooks in the forward, then populated by dynamo *after* the forwards runs.
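
A hedged sketch of the pattern that becomes supported: an interior, dynamically created hook registered inside a compiled region. The backend choice and the need to enable compiled autograd are assumptions here.

```python
import torch

torch._dynamo.config.compiled_autograd = True  # assumed required for capture

@torch.compile(backend="aot_eager")
def f(x, scale):
    y = x.sin()
    # Interior to the graph and dynamically generated (a lambda closing over scale).
    y.register_hook(lambda g: g * scale)
    return y.cos()

x = torch.randn(4, requires_grad=True)
f(x, 2.0).sum().backward()
```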

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382
Approved by: https://github.com/xmfan
2024-02-28 20:36:47 +00:00
c016ffed5b [C10D] Fix logic for default group=None in _set_pg_timeout (#120686)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120686
Approved by: https://github.com/yifuwang
2024-02-28 20:31:14 +00:00
11de40f82f [flight recorder] record process group configuration (#120262)
Summary: Record process group configuration (i.e. ranks involved in a process group) to facilitate NCCL related debugging.

Differential Revision: D53792087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120262
Approved by: https://github.com/shuqiangzhang
2024-02-28 20:31:08 +00:00
5aa7f8646f [inductor][Gemm] Autotune with matrix_instr_nonkdim for AMDGPU (#120742)
Relanding https://github.com/pytorch/pytorch/pull/120639 + a fix to drop `matrix_instr_nonkdim` that does not align with `BLOCK_M` or `BLOCK_N`

Matrix multiplication with Triton is usually done in a tiled way. For a large tile size that a single hardware instruction cannot handle, e.g. a 128x128 matmul tile, it has to be broken down into a sequence of smaller hardware mma instructions. On AMDGPU, matrix_instr_nonkdim controls the shape of the mma instructions, and its default value is 0 in Triton. This means by default Triton will decompose a large tiled matmul operation into a sequence of 32x32x8 mma instructions. There are other mma instructions available, such as 16x16x16, which requires matrix_instr_nonkdim=16. This change enables tuning the value for Gemm, which seems to improve its performance by 20%-2x.

Before:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0410 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0487 ms 84.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0544 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0633 ms 64.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0687 ms 59.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0716 ms 57.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0748 ms 54.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0788 ms 52.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.1014 ms 40.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.1069 ms 38.4%
  SingleProcess AUTOTUNE takes 8.1153 seconds
```

After:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0417 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0470 ms 88.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0488 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0490 ms 85.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0525 ms 79.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0553 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0574 ms 72.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.0634 ms 65.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=2 0.0655 ms 63.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0681 ms 61.2%
  SingleProcess AUTOTUNE takes 11.4076 seconds
```

Before:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2094 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2452 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2763 ms 75.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2836 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2854 ms 73.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2951 ms 71.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2970 ms 70.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.4184 ms 50.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.5097 ms 41.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.5570 ms 37.6%
  SingleProcess AUTOTUNE takes 3.4052 seconds
```

After:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2117 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2429 ms 87.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2485 ms 85.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2526 ms 83.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2537 ms 83.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2554 ms 82.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2623 ms 80.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2695 ms 78.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2758 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.2792 ms 75.8%
  SingleProcess AUTOTUNE takes 11.3538 seconds

```

Before:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5901 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 1.9380 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 1.9943 ms 79.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.0640 ms 77.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.0941 ms 75.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.1272 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.1554 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.2931 ms 69.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 3.7016 ms 43.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 4.6021 ms 34.6%
  SingleProcess AUTOTUNE takes 9.0523 seconds
```

After:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5862 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.6924 ms 93.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.7616 ms 90.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.8159 ms 87.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.9340 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.9352 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0378 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0983 ms 75.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1138 ms 75.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1657 ms 73.2%
  SingleProcess AUTOTUNE takes 8.2225 seconds
```

Before:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 12.0134 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 14.8082 ms 81.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 15.4242 ms 77.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 16.6869 ms 72.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 16.7751 ms 71.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 17.0145 ms 70.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 17.1363 ms 70.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 18.2159 ms 66.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 29.4726 ms 40.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 37.9039 ms 31.7%
  SingleProcess AUTOTUNE takes 11.0074 seconds
```

After:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 11.9554 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 12.9953 ms 92.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 13.7726 ms 86.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 13.9647 ms 85.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 14.9728 ms 79.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 15.3729 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 15.3955 ms 77.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 15.5647 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.0037 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.7432 ms 71.4%
  SingleProcess AUTOTUNE takes 14.9839 seconds
```

Reviewed By: xw285cornell, nmacchioni

Differential Revision: D54203170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120742
Approved by: https://github.com/xw285cornell
2024-02-28 20:27:14 +00:00
b020ee5b05 [PyTorch Use MaybeOwned when promoting indices/offsets in embedding_bag (#120755)
We're currently doing two unnecessary reference count
operations in the case where promotion doesn't need to happen.

Differential Revision: [D54285999](https://our.internmc.facebook.com/intern/diff/D54285999/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120755
Approved by: https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #120752
2024-02-28 20:13:30 +00:00
98d1529474 [PyTorch] fix mixed int32/int64 indices/offsets for embedding_bag_out (#120752)
This was an oversight in D27482738 (#55189) -- it only patched the regular embedding_bag operator, but static runtime uses the out variant.

Differential Revision: [D54285460](https://our.internmc.facebook.com/intern/diff/D54285460/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120752
Approved by: https://github.com/houseroad
2024-02-28 20:13:30 +00:00
db92558229 [codemod][lowrisk] Fix deprecated use of 0/NULL (#120740)
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.

This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.

Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D54163060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120740
Approved by: https://github.com/Skylion007
2024-02-28 20:13:13 +00:00
491c2b4665 Let torch dynamo inline torch.func.grad (#118407)
When dynamo sees torch.func.grad, it tries to inline all frames related to it.
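
A minimal illustration (fullgraph=True just to assert that dynamo traces through the grad call rather than graph-breaking):

```python
import torch

def f(x):
    return (x ** 2).sum()

@torch.compile(fullgraph=True)
def grad_f(x):
    return torch.func.grad(f)(x)

print(grad_f(torch.randn(3)))  # expected: 2 * x
```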

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118407
Approved by: https://github.com/zou3519
2024-02-28 20:05:00 +00:00
5472923998 derived dim (#118729)
With the current `Dim`-based dynamic shapes API for export, one can express that shapes of different input shapes must be equal by reusing the same `Dim`. However, non-trivial relationships between such input shapes cannot be expressed.

Recently we are seeing more and more examples of code that require this additional expressibility, e.g., where a pair of shapes might differ by one, or a shape might be double another (or simply even).

This PR introduces the concept of a "derived" `Dim`, i.e., a linear arithmetic expression over a `Dim`. By using a combination of `Dim`s and derived `Dim`s to specify input shapes, the desired relationships can be expressed naturally. E.g., a pair of shapes might be `dim` and `dim + 1`, or `dim` and `2*dim`, or even `2*dim` and `dim + 1`.
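
A minimal sketch (illustrative model, not from the PR) of expressing such a relationship with a derived Dim:

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        # Only valid when y.shape[0] == 2 * x.shape[0]
        return torch.cat([x, x], dim=0) + y

dim = Dim("dim")
ep = export(
    M(),
    (torch.randn(3, 4), torch.randn(6, 4)),
    dynamic_shapes={"x": {0: dim}, "y": {0: 2 * dim}},
)
```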

We extend the current infrastructure that translates `Dim`s to deprecated `dynamic_dim`-based constraints to work with derived `Dim`s. As usual, we raise constraint violation errors when shape guards cannot be verified given a dynamic shapes spec; suggest fixes; and raise runtime errors when future inputs violate the spec.

Importantly, some guards that used to cause forced specializations in the constraint solver because they were deemed "too complex" now do not do so, because they can now be specified as constraints. Since this was what motivated the introduction of a `disable_constraint_solver` flag to some internal APIs, we may not need that flag any more.

Note that shapes of placeholders in exported programs can now contain symbolic expressions and not just symbols.

Differential Revision: D53254587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118729
Approved by: https://github.com/ezyang
2024-02-28 19:48:32 +00:00
9c55aa6ff6 TransformerEncoder/Decoder: add type hints (#120550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120550
Approved by: https://github.com/mikaylagawarecki
2024-02-28 19:36:08 +00:00
4b7a521856 Update flash_attention kernel from 2.3.6 to 2.5.5 (#118935)
# Summary
Updates FlashAttention kernel code from tag [2.3.6](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.3.6) to [2.5.5](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.5).

The usual changes were then re-rolled on top of the modified kernel, changing how dropout is saved for backward and removing the head_dim_pad, since that would make the kernel mutate in place, which has a bad interaction with functionalization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118935
Approved by: https://github.com/cpuhrsch
2024-02-28 19:31:15 +00:00
a9d9077f12 Revert "Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)"
This reverts commit 7c556428c74a79c6d9c272826344a0828d3f66f5.

Reverted https://github.com/pytorch/pytorch/pull/119639 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54286923 ([comment](https://github.com/pytorch/pytorch/pull/119639#issuecomment-1969634480))
2024-02-28 18:57:09 +00:00
1c67f6cb26 fix decomposition of aten.diag_embed (#120549)
Fixes #117019
Make inputs where one dim is negative and the other is nonnegative be solved correctly in the decomposition of `aten.diag_embed`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120549
Approved by: https://github.com/Dalian991, https://github.com/janeyx99
2024-02-28 18:48:01 +00:00
f422467ccb [BE]Delay the call to set_pytorch_distributed_envs_from_justknobs (#120625)
When the default process group is initialized twice, `init_process_group` will show an explicit message indicating that.

However, with `set_pytorch_distributed_envs_from_justknobs` being the very first line in `init_process_group`, the error message becomes implicit and it is hard to understand the root cause when testing with the FB code base.

Differential Revision: [D54206202](https://our.internmc.facebook.com/intern/diff/D54206202/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120625
Approved by: https://github.com/wconstab, https://github.com/yifuwang
2024-02-28 18:34:45 +00:00
91190d8087 [quant][pt2e] Relax model_is_exported input (#120720)
Summary: This commit relaxes the `model_is_exported` API to work for `torch.nn.Module`s in addition to just `torch.fx.GraphModule`s, simplifying downstream uses.
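
A hedged usage sketch; the exact import path of `model_is_exported` is an assumption here and may differ:

```python
import torch
# Import path assumed; adjust to wherever model_is_exported actually lives.
from torch.ao.quantization.pt2e.export_utils import model_is_exported

m = torch.nn.Linear(4, 4)
print(model_is_exported(m))  # False: a plain nn.Module is now accepted instead of erroring

gm = torch.export.export(m, (torch.randn(2, 4),)).module()
print(model_is_exported(gm))  # expected True for an exported module
```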

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_model_is_exported

Differential Revision: [D54263935](https://our.internmc.facebook.com/intern/diff/D54263935)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120720
Approved by: https://github.com/tugsbayasgalan
2024-02-28 18:32:03 +00:00
f67c77c497 Update engine.cpp (#120773)
Minor comment fix; `backward` and `grad` are flipped here. See https://pytorch.org/docs/stable/_modules/torch/autograd.html#backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120773
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/soulitzer
2024-02-28 18:23:35 +00:00
0ab2ec3738 [XPU][Profiler] Add Logic To The Profiler For Processing XPU-backend Data (#120185)
This pull request provides an update on the recent advancements made in the PyTorch profiler with regard to XPU backend support. Following the successful merge of a previous pull request #94502 that established a pathway for the XPU backend within PyTorch, we have now taken steps to enhance the profiler's capabilities for handling and displaying profile data directly related to the XPU backend.

# Motivation

The current pull request builds upon this foundation by refining the profiler's data processing scripts, particularly `profiler_util.py`, to accommodate XPU backend-specific profile data. The aim is to align the handling and presentation of this data with that of the CUDA backend, offering users a consistent experience across different device profiles. This includes generating outputs such as JSON files compatible with Chrome trace tooling, among other formats.

# Principles

1. Minimal Impact: The modifications introduced should support XPU backend data with minimal disruption to the existing profiling scripts.
2. Consistency: Changes should maintain stylistic and functional consistency with existing `CUDA` and `privateuse1` pathways, ensuring no adverse effects on other logic paths.
3. Exclusivity: Ensure that the new XPU pathway does not interfere with or impede other pathways.

# Solutions

### a. Pathway Identification:

Introduction of a `use_xpu` flag within `torch.autograd.profiler.profile` interfaces to distinguish XPU-specific profiling.

### b. `use_device` Logic Revision:

With the introduction of the XPU pathway, `use_device` no longer implies a binary relationship with `use_cuda`. Consequently, we have revised related logic to remove implicit assertions and establish independent device distinction.

### c. Kernel List Segregation:

To accommodate the non-binary nature of device pathways, we have enabled kernel lists to identify specific device affiliations through separate list objects.

### d. Formatted Output:

To ensure output consistency, we have employed code duplication and keyword substitution techniques to facilitate the formatting of XPU-related profile data.
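
Putting the pieces above together, a minimal usage sketch might look like the following (assuming a PyTorch build with XPU support and an available XPU device; the model and tensor names are illustrative):

```
import torch

model = torch.nn.Linear(128, 128).to("xpu")
x = torch.randn(32, 128, device="xpu")

# The new use_xpu flag selects the XPU pathway, mirroring use_cuda.
with torch.autograd.profiler.profile(use_xpu=True) as prof:
    model(x)

# Same consumer-facing outputs as the CUDA pathway: tables and Chrome traces.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
prof.export_chrome_trace("xpu_trace.json")
```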

# Additional Enhancements

### a. Enumerations in `.pyi` Files:

Added recognition items for `DeviceType` and `ProfilerActivity` specific to XPU.

### b. Correct DeviceType Returns:

Revised `deviceTypeFromActivity` logic to accurately differentiate between device backends, even when they share common flags such as `libkineto::ActivityType::GPU_MEMCPY`.

### c. Bug Fixes in `cuda_corr_map`:

Addressed a corner case where erroneous parent-child event relationships were formed due to shared function event identifiers. The solution involves refining `cuda_corr_map` processing to prevent a function event from being misidentified as both the linker and linkee.

# Further Abstraction

Looking forward, we acknowledge the potential for further abstraction in the codebase. The current changes necessitated by XPU support have highlighted opportunities for reducing redundancy by consolidating naming conventions and utilizing a singular `device` naming system that relies on `DeviceType` attributes or string flags for differentiation. This would involve significant refactoring to replace device-specific flags and variables. This topic needs further discussions about whether we could and when we should deprecate all those flags and variables named with `cuda`.

# Next Pull Request

The next pull request will be contingent on Kineto's adoption of Intel's forthcoming PTI-sdk library, which will enable direct usage of XPU-related tracers. Subsequent modifications to `libkineto_init()` will aim to endow PyTorch running on XPU backends with comprehensive profiling capabilities on XPU devices.

We appreciate your attention to these enhancements and welcome any feedback or questions you may have regarding these developments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120185
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
2024-02-28 17:50:32 +00:00
3e8b56d362 [Inductor] Track constant's original_fqn mapping (#120524)
When compiling a deserialized ExportedProgram, the constant's original_fqn is not populated. The highlighted line below is missing, and a later assertion breaks because original_fqn is absent.

```
        constants_info_[0].name = "L__self___w_pre";
	constants_info_[0].dtype = static_cast<int32_t>(cached_torch_dtype_float32);
	constants_info_[0].offset = 0;
	constants_info_[0].data_size = 64;
	constants_info_[0].from_folded = false;
	constants_info_[0].shape = {4, 4};
	constants_info_[0].stride = {4, 1};
	// constants_info_[0].original_fqn = "w_pre";   // this line is missing
```

Inductor relies on `dynamo_flat_name_to_original_fqn` to populate the original_fqn field. This field originates from `graph_module.meta["dynamo_flat_name_to_original_fqn"]` and is set during dynamo tracing. However, when compiling
a deserialized ExportedProgram, we don't do dynamo tracing, so this field is missing.

As a fix, I maintain AOTI's own mapping for constant tensors' FQNs.

Differential Revision: D54097073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120524
Approved by: https://github.com/chenyang78
2024-02-28 17:36:29 +00:00
702e82da28 [cuDNN][Flash Attention] Minor cleanup for cuDNN SDPA (#120750)
Cleaning up before hopefully starting work on backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120750
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-02-28 17:32:07 +00:00
364faafe75 [DCP] Asserts CPU backend for async_save (#120241)
If a CPU device is not present, collectives will hang in the threaded case due to: https://github.com/pytorch/pytorch/issues/115861

This PR asserts that a CPU device is enabled in the process group backend.

Differential Revision: [D53952864](https://our.internmc.facebook.com/intern/diff/D53952864/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120241
Approved by: https://github.com/fegin
2024-02-28 17:21:30 +00:00
c8a34a4013 [ez] Smaller weight for some TD heuristics (#120736)
Normalize to a different number for the fuzzier heuristics.

Could this be done as a weighting elsewhere? Yes, but I'm putting it here since I'm not sure which object would hold it best.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120736
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-02-28 17:07:45 +00:00
dfe7b9d471 Move user defined triton tests to inductor test folder (#120738)
Summary: FBCode CI does not compile torch with CUDA for tests in the dynamo folder. Instead of adding a special rule, let's move these tests to the inductor folder.

Test Plan:
```
buck run mode/opt //caffe2/test/inductor/:triton_kernels
```
now works instead of skipping tests

Differential Revision: D54280629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120738
Approved by: https://github.com/aakhundov
2024-02-28 17:03:41 +00:00
df40847486 Add xpu header to include/ATen/xpu (#120786)
# Motivation
Add XPU header files to `include/ATen/xpu` to make them public.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120786
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD
2024-02-28 16:22:14 +00:00
7881b95c73 Don't suppress error codes in lint job, properly activate conda (#120769)
Before:

```
2024-02-28T02:38:24.3757573Z + conda activate /opt/conda/envs/py_3.9
2024-02-28T02:38:24.3757872Z
2024-02-28T02:38:24.3758116Z CondaError: Run 'conda init' before 'conda activate'
```

Now, this would actually fail the job, and I also fix the bug.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120769
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/malfet
2024-02-28 15:17:31 +00:00
facfc0baaf Update _constrain_as_size docs (#120728)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120728
Approved by: https://github.com/Skylion007
2024-02-28 15:03:10 +00:00
82099ab87b [easy] Reword unexpected success error messages and generated github issues now that we have sentinel files (#120766)
It's a bit annoying to have to read through the test name in verbose mode just to see what the test's sentinel file is actually called when encountering an unexpected success. Now that we have sentinel files, we can directly list the file path from root in the error message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120766
Approved by: https://github.com/Skylion007
2024-02-28 11:15:29 +00:00
46e3f670b4 refactor code to share across different devices (#120602)
# Motivation
Refactor utils code to make it possible to share across CUDA, XPU, and other backends.

# Solution
Move `_dummy_type` and `_LazySeedTracker` to torch._utils;

# Additional Context
When upstreaming, these code changes are isolated into an additional PR to minimize their impact on the CUDA code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120602
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/EikanWang
2024-02-28 09:42:58 +00:00
a11a49af58 Add NCCL work sequence number to work info (#120596)
Summary: Expose the sequence number in the work info. The number can help applications identify an NCCL work more precisely.

Test Plan:
1. pytest test/distributed/test_c10d_nccl.py::WorkHookTest::test_on_completion_hook_seq
2. pytest test/distributed/test_c10d_nccl.py::WorkHookTest

Differential Revision: D54180050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120596
Approved by: https://github.com/kwen2501
2024-02-28 07:54:37 +00:00
be31e522ce [PT2][Inductor] Fix "example_value" absent for stack nodes (#120655)
Summary:
We observed that stack nodes are missing "example_value" in DPA+FIRST, which blocks further split-cat optimization. Full error log: P1187633689.

pre grad graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GPUFOBWniTeB6s8DAN8z9sHTadpxbr0LAAAz

We found that it was introduced by the new stack nodes in the group batch fusion, thus we fix the bug to enable further split cat optimization.

Test Plan:
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```
before fix: P1187633689
```
W0221 13:32:09.334000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: sigmoid_16
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_19
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_16
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_6
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_5
W0221 13:32:09.336000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_4
W0221 13:32:09.517000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_20
W0221 13:32:09.518000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_18
W0221 13:32:09.518000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_17
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_19
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_15
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_14
W0221 13:32:09.522000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_16
W0221 13:32:09.524000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_18
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_12
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_11
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_13
W0221 13:32:09.527000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_17
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_9
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_8
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_10
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_7
```

after fix:
P1189491364
```
W0226 13:19:56.542000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: sigmoid_16
W0226 13:19:56.543000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_16
W0226 13:19:56.703000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_20
W0226 13:19:56.707000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_19
W0226 13:19:56.711000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_18
W0226 13:19:56.713000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_17
```

Differential Revision: D54140488

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120655
Approved by: https://github.com/jackiexu1992
2024-02-28 05:35:36 +00:00
12995a5d9d [2/2] Intel GPU Runtime Upstreaming for Generator (#118613)
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Generator](https://github.com/pytorch/pytorch/pull/118528), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.

# Design
Currently, it primarily offers the following generator-related APIs (a minimal usage sketch follows the list):

- `torch.xpu.default_generators`
- `torch.xpu.get_rng_state`
- `torch.xpu.get_rng_state_all`
- `torch.xpu.initial_seed`
- `torch.xpu.manual_seed`
- `torch.xpu.manual_seed_all`
- `torch.xpu.seed`
- `torch.xpu.seed_all`
- `torch.xpu.set_rng_state`
- `torch.xpu.set_rng_state_all`
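
A minimal usage sketch of these APIs (assuming a PyTorch build with XPU support and at least one XPU device):

```
import torch

torch.xpu.manual_seed(42)           # seed the current XPU device
torch.xpu.manual_seed_all(42)       # seed all XPU devices

state = torch.xpu.get_rng_state()   # snapshot the current device's RNG state
# ... run some random ops on "xpu" ...
torch.xpu.set_rng_state(state)      # restore it for reproducibility

print(torch.xpu.initial_seed())     # the seed used to initialize the generator
```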

# Additional Context
The differences with CUDA:
The generator-related frontend Python APIs map 1:1 onto CUDA's.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118613
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD
2024-02-28 05:28:11 +00:00
8ba4cb451f Fix an import loop (#119820)
Summary:
We ran into the following import loop when testing aps:

```
Traceback (most recent call last):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/forkserver.py", line 274, in main
    code = _serve_one(child_r, fds,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/forkserver.py", line 313, in _serve_one
    code = spawn._main(child_r, parent_sentinel)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 234, in prepare
    _fixup_main_from_name(data['init_main_from_name'])
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 258, in _fixup_main_from_name
    main_content = runpy.run_module(mod_name,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 224, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/icvr/icvr_launcher.py", line 29, in <module>
    class ICVRConfig(AdsComboLauncherConfig):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/common/ads_launcher.py", line 249, in <module>
    class AdsComboLauncherConfig(AdsConfig):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/common/app_config.py", line 16, in <module>
    class AdsConfig(RecTrainAppConfig):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/apf/rec/config_def.py", line 47, in <module>
    class EmbeddingKernelConfig:
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/apf/rec/config_def.py", line 52, in EmbeddingKernelConfig
    cache_algorithm: CacheAlgorithm = CacheAlgorithm.LRU
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torchrec/distributed/types.py", line 501, in <module>
    class ParameterSharding:
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torchrec/distributed/types.py", line 527, in ParameterSharding
    sharding_spec: Optional[ShardingSpec] = None
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py", line 48, in <module>
    class ShardingSpec(ABC):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py", line 55, in ShardingSpec
    tensor_properties: sharded_tensor_meta.TensorProperties,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharded_tensor/__init__.py", line 21, in <module>
    def empty(sharding_spec: shard_spec.ShardingSpec,
ImportError: cannot import name 'ShardingSpec' from partially initialized module 'torch.distributed._shard.sharding_spec.api' (most likely due to a circular import) (/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py)
```

Using future annotations (`from __future__ import annotations`) to mitigate.
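
A minimal sketch of the pattern, combined here with a TYPE_CHECKING guard as a common companion (module and class names are illustrative, not the actual files involved):

```
# shard_types.py (illustrative)
from __future__ import annotations  # annotations are no longer evaluated at class-definition time

from dataclasses import dataclass
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Only imported for static type checkers, so no runtime import cycle.
    from sharding_spec_api import ShardingSpec

@dataclass
class ParameterSharding:
    # Without future annotations, evaluating this annotation would require
    # importing sharding_spec_api while it is still partially initialized.
    sharding_spec: Optional[ShardingSpec] = None
```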

Test Plan:
```
hg update 1b1b3154616b70fd3325c467db1f7e0f70182a74
CUDA_VISIBLE_DEVICES=1,2 buck2 run @//mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_rep
```

Differential Revision: D53685582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119820
Approved by: https://github.com/fegin
2024-02-28 05:09:16 +00:00
e9a961f66a [dynamo][refactor] Use originating_source for HASATTR (#120723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120723
Approved by: https://github.com/jansel
ghstack dependencies: #120520, #120590, #120721
2024-02-28 05:00:59 +00:00
a774baa501 [audio hash update] update the pinned audio hash (#120748)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120748
Approved by: https://github.com/pytorchbot
2024-02-28 04:47:38 +00:00
184e815c74 Add TORCH_LOGS_FORMAT=short alias (#120757)
Shorthand for `"%(levelname)s:%(name)s:%(message)s"` which is hard to
remember.

I find the default formatter annoying since just the metadata fills up
most of the width of my terminal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120757
Approved by: https://github.com/ezyang
2024-02-28 04:40:48 +00:00
bd5f290505 [vision hash update] update the pinned vision hash (#120749)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120749
Approved by: https://github.com/pytorchbot
2024-02-28 04:36:16 +00:00
bfa71b523d add complex32 to v3_dtypes (#120388)
Fixes [#120290](https://github.com/pytorch/pytorch/issues/120290)
Fixes https://github.com/pytorch/pytorch/issues/73502

Use `v3_dtypes` and `torch._utils._rebuild_tensor_v3` to handle `torch.save` of complex32 tensors.
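
A minimal round-trip exercising this path might look like the following sketch (torch.complex32 is also known as torch.chalf):

```
import io
import torch

t = torch.zeros(4, dtype=torch.complex32)
buf = io.BytesIO()
torch.save(t, buf)      # serialized via the v3 tensor rebuild path
buf.seek(0)
loaded = torch.load(buf)
assert loaded.dtype == torch.complex32
```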

result:
![image](https://github.com/pytorch/pytorch/assets/37650440/18b6cbb3-fb3f-4855-9d48-374014647988)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120388
Approved by: https://github.com/albanD
2024-02-28 02:32:29 +00:00
5a53c0ff23 [dynamo][refactor] Rename LIST_LENGTH to SEQUENCE_LENGTH, separate DICT_LENGTH (#120721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120721
Approved by: https://github.com/jansel
ghstack dependencies: #120520, #120590
2024-02-28 02:19:10 +00:00
1627d9e06d [aot_inductor] added a utility function aoti_torch_print_tensor_handle (#120660)
Added a function to print tensor values for a tensor handle.
It can be injected into the cpp wrapper code to help debug
numerical issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120660
Approved by: https://github.com/desertfire
2024-02-28 02:08:34 +00:00
d21c6eb215 Do not wrap output with input device inside _to_copy (#119868)
Fixing https://github.com/pytorch/pytorch/issues/118790

This diff reverts a small part of the code that was introduced in https://github.com/pytorch/pytorch/pull/104689

The PR above added a comment that "In case of dtype promotion, fake tensor converted into tensor",
but it's not always the case that a dtype conversion causes a fake tensor to become a plain tensor.

When such a conversion does not happen, we get the following error:
```
Creating a new Tensor subclass FakeTensor but the raw Tensor object is already associated to
 a python object of type FakeTensor
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119868
Approved by: https://github.com/ezyang, https://github.com/thiagocrepaldi
2024-02-28 01:51:43 +00:00
33499ec41b [FSDP2][DCP][DSD] Add FSDP2 model state dict unit test with distributed state dict (#120680)
This adds some initial unit tests for FSDP2 model state dict only.

This PR adds two tests:

1. Add a unit test checking parity between FSDP2 `model.state_dict()` and distributed_state_dict's `get_model_state_dict`.
2. Add a unit test to make sure `StateDictOptions(full_state_dict=True, cpu_offload=True)` in distributed_state_dict works for FSDP2 model state_dict.

Optimizer state dict will be in follow up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120680
Approved by: https://github.com/awgu
2024-02-28 01:40:04 +00:00
1aa9099839 [CLANGTIDY] Enable clang-tidy in torch/csrc/xpu (#120616)
# Motivation
refer to [#118504](https://github.com/pytorch/pytorch/pull/118504), enabling clang-tidy in `torch/csrc/xpu`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120616
Approved by: https://github.com/albanD
2024-02-28 01:35:25 +00:00
1a1fc1047d Add structured trace logs (#120289)
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit

How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implemented as a logger named torch.__trace, which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). `trace_structured` is the main way to emit a trace log. Unusually, there are separate "metadata" and "payload" fields. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py: the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not to be very stable.

https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
ghstack dependencies: #120712
2024-02-28 01:01:41 +00:00
677e67c399 Update nn.Module._apply to not gate on should_use_set_data when swap_tensors is set (#120659)
This updates the nesting of if statements in `nn.Module._apply` such that if

`torch.__future__.set_swap_module_params_on_conversion(True)`, we always try to swap regardless of whether
- `torch._has_compatible_shallow_copy_type(param, fn(param))`
- `torch.__future__.set_overwrite_module_params_on_conversion` is set

This means that `meta_module.to_empty(device=...)` can now use the swap_tensors path, as sketched below. cc @awgu
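
A minimal sketch of the now-supported flow (the module choice and target device are illustrative):

```
import torch
import torch.nn as nn

torch.__future__.set_swap_module_params_on_conversion(True)

# Build the module on the meta device, then materialize it via to_empty(),
# which can now take the swap_tensors path.
with torch.device("meta"):
    meta_module = nn.Linear(4, 4)

meta_module.to_empty(device="cpu")
print(meta_module.weight.device)  # cpu (parameters are allocated but uninitialized)
```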

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120659
Approved by: https://github.com/albanD
2024-02-28 00:59:34 +00:00
213b3ac3f2 [BE] fail_* variables don't need to be shared across restarts, they're set only once (#120712)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120712
Approved by: https://github.com/yanboliang
2024-02-28 00:48:11 +00:00
2ebf2c88ba Add test to check that COW inputs are not materialized (#119507)
Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119507
Approved by: https://github.com/ezyang
ghstack dependencies: #120455
2024-02-28 00:37:33 +00:00
cabc09a5f2 Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-02-28 00:37:33 +00:00
cyy
1e9fafc160 [Clang-tidy header][20/N] Fix clang-tidy warnings in aten/src/ATEN/*.{cpp,h} (#120574)
This PR fixes some clang-tidy warnings in aten/src/ATEN/*.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120574
Approved by: https://github.com/Skylion007
2024-02-28 00:13:05 +00:00
9c597ff137 use condition_variable and wait_until in nccl dump on timeout (#120544)
Fixes test_c10d_nccl.py -k test_timeout_dumps_timing_enabled_True.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120544
Approved by: https://github.com/atalman
2024-02-28 00:06:08 +00:00
14b258b5bc Fix broken link in README (#120698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120698
Approved by: https://github.com/janeyx99
2024-02-27 23:55:06 +00:00
5929d4e830 [CUDA][cuBLAS] Check if a context is present when grabbing a cuBLAS handle (#120131)
cuBLAS has indicated that certain kernels will transition to using the driver API over the CUDA runtime API, which we've observed to break existing tests (e.g., DataParallel) that use multithreading and may not eagerly grab a context via `cudaSetDevice`.

CC @Aidyn-A @ptrblck

Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120131
Approved by: https://github.com/atalman
2024-02-27 22:45:16 +00:00
f36e00b8ce Revert "[inductor][Gemm] Autotune with matrix_instr_nonkdim for AMDGPU (#120639)"
This reverts commit 78f53a3f731ee67dcffd308519ed48a745640dde.

Reverted https://github.com/pytorch/pytorch/pull/120639 on behalf of https://github.com/izaitsevfb due to breaking ROCm ([comment](https://github.com/pytorch/pytorch/pull/120639#issuecomment-1967585568))
2024-02-27 21:05:57 +00:00
6cc7f9a2e6 Limit loop unrolling (#120023)
Tacotron2 causes massive loop unrolling resulting in very large graphs (26k nodes) which was causing inductor (and tracing itself) to choke.

The unrolling size is controlled by the environment variable TORCHDYNAMO_MAX_LOOP_UNROLL_NODES which defaults to the arbitrary value 5000.

This updates the tacotron2 timings as follows:
eager timing: 3m:23s -> 35s
aot_eager timing: 4m:12s -> 39s
inductor timing: 22m:24s ->1m

For reference the big loop in tacotron2 was this one (model.py[405]):
```
        while len(mel_outputs) < decoder_inputs.size(0) - 1:
            decoder_input = decoder_inputs[len(mel_outputs)]
            mel_output, gate_output, attention_weights = self.decode(decoder_input)
            mel_outputs += [mel_output.squeeze(1)]
            gate_outputs += [gate_output.squeeze(1)]
            alignments += [attention_weights]
```
which gets unrolled and inlined adding about 36 nodes to the graph per iteration.

Fixes #98467
Relates to #102839 which hopefully will result in a better fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120023
Approved by: https://github.com/yanboliang
2024-02-27 20:44:21 +00:00
f3dd2a544c Revert "Add structured trace logs (#120289)"
This reverts commit 9dfaef962cda5f65eec53e5fd6f07b5226ea65cb.

Reverted https://github.com/pytorch/pytorch/pull/120289 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54230697 ([comment](https://github.com/pytorch/pytorch/pull/120289#issuecomment-1967477120))
2024-02-27 19:49:05 +00:00
eqy
65efece3a4 [CUDA][cuBLAS] Bump test_cublas_baddbmm_large_input tolerances (#117889)
Unfortunate that the current `rtol=1e-5` hits a literal 1 / 1000000 mismatch (`rtol=1.04e-5`) on L40.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117889
Approved by: https://github.com/atalman
2024-02-27 19:05:20 +00:00
5b5c167adc [dynamo] Add some helpers to PyCodegen (#120684)
These are used in later PRs in the stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120684
Approved by: https://github.com/yanboliang
2024-02-27 18:46:51 +00:00
0c8bb6f70c [dtensor] standardize tuple strategy handling for foreach ops (#120695)
This PR refactors the tuple strategy handling logic and allows
TupleStrategy to have both input and output specs for each OpStrategy child,
so that we can further enable operators like foreach norm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120695
Approved by: https://github.com/awgu
2024-02-27 18:23:11 +00:00
440a9b212d [profiler] log process group config information in distributedInfo field (#119443)
Summary: Process group config is essential for analyzing collective patterns. We have already added this to the Execution Trace. Now expose the information in Kineto as well.

Differential Revision: D53557965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119443
Approved by: https://github.com/kwen2501
2024-02-27 18:21:54 +00:00
78f53a3f73 [inductor][Gemm] Autotune with matrix_instr_nonkdim for AMDGPU (#120639)
Summary:
Matrix multiplication with Triton is usually performed in a tiled fashion. A tile size too large for a single hardware instruction to handle, e.g. a 128x128 matmul, has to be broken down into a sequence of smaller hardware mma instructions. On AMDGPU, matrix_instr_nonkdim controls the shape of these mma instructions; its default value in Triton is 32. This means that by default Triton will decompose a large tiled matmul operation into a sequence of 32x32x8 mma instructions. Other mma instructions are available, such as 16x16x16, which requires matrix_instr_nonkdim=16. This change enables tuning this value for Gemm, which seems to improve its performance by 20% - 2x.

Similar changes have been made to the HSTU ragged attention kernel in D53386525.

Test Plan:

Before:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0410 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0487 ms 84.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0544 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0633 ms 64.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0687 ms 59.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0716 ms 57.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0748 ms 54.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0788 ms 52.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.1014 ms 40.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.1069 ms 38.4%
  SingleProcess AUTOTUNE takes 8.1153 seconds
```

After:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0417 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0470 ms 88.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0488 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0490 ms 85.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0525 ms 79.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0553 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0574 ms 72.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.0634 ms 65.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=2 0.0655 ms 63.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0681 ms 61.2%
  SingleProcess AUTOTUNE takes 11.4076 seconds
```

Before:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2094 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2452 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2763 ms 75.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2836 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2854 ms 73.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2951 ms 71.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2970 ms 70.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.4184 ms 50.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.5097 ms 41.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.5570 ms 37.6%
  SingleProcess AUTOTUNE takes 3.4052 seconds
```

After:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2117 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2429 ms 87.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2485 ms 85.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2526 ms 83.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2537 ms 83.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2554 ms 82.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2623 ms 80.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2695 ms 78.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2758 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.2792 ms 75.8%
  SingleProcess AUTOTUNE takes 11.3538 seconds

```

Before:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5901 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 1.9380 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 1.9943 ms 79.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.0640 ms 77.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.0941 ms 75.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.1272 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.1554 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.2931 ms 69.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 3.7016 ms 43.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 4.6021 ms 34.6%
  SingleProcess AUTOTUNE takes 9.0523 seconds
```

After:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5862 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.6924 ms 93.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.7616 ms 90.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.8159 ms 87.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.9340 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.9352 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0378 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0983 ms 75.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1138 ms 75.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1657 ms 73.2%
  SingleProcess AUTOTUNE takes 8.2225 seconds
```

Before:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 12.0134 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 14.8082 ms 81.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 15.4242 ms 77.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 16.6869 ms 72.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 16.7751 ms 71.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 17.0145 ms 70.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 17.1363 ms 70.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 18.2159 ms 66.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 29.4726 ms 40.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 37.9039 ms 31.7%
  SingleProcess AUTOTUNE takes 11.0074 seconds
```

After:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 11.9554 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 12.9953 ms 92.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 13.7726 ms 86.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 13.9647 ms 85.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 14.9728 ms 79.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 15.3729 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 15.3955 ms 77.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 15.5647 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.0037 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.7432 ms 71.4%
  SingleProcess AUTOTUNE takes 14.9839 seconds
```

Reviewed By: xw285cornell, nmacchioni

Differential Revision: D54203170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120639
Approved by: https://github.com/xw285cornell, https://github.com/jansel
2024-02-27 18:16:33 +00:00
3f62b05d31 [export] Use forward hooks to capture module signatures. (#120468)
Summary:
When we export in strict mode and turn on preserve_module_call_signature, the following assertion error occurs today:
```
child_split[: len(parent_split)] == parent_split
```
This is because we're monkey-patching the forward call directly, which breaks attribute propagation in the tracer. It's better to implement this with a forward hook, because then we don't have to alter the original module structure at all during export.
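
As a generic illustration (not the actual export code), a forward hook can observe a submodule's calls without touching its structure:

```
import torch
import torch.nn as nn

captured = []

def record_call(module, args, output):
    # args is the tuple of positional inputs; record shapes as a stand-in
    # for the kind of call-signature information export wants to preserve.
    captured.append((type(module).__name__,
                     [tuple(a.shape) for a in args],
                     tuple(output.shape)))

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
handle = model[0].register_forward_hook(record_call)
model(torch.randn(2, 4))
handle.remove()
print(captured)  # [('Linear', [(2, 4)], (2, 8))]
```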

Test Plan: CI

Differential Revision: D54102714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120468
Approved by: https://github.com/ydwu4
2024-02-27 17:44:06 +00:00
ed3c256b61 Add lowering for adaptive_max_pool2d (#120254)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120254
Approved by: https://github.com/lezcano
2024-02-27 16:32:18 +00:00
27bb73fe46 [AOTI] Fix a strict-aliasing warning (#120628)
Summary: This gets rid of an annoying compile time warning, "dereferencing type-punned pointer will break strict-aliasing rules"

Differential Revision: [D54207229](https://our.internmc.facebook.com/intern/diff/D54207229)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120628
Approved by: https://github.com/Skylion007
2024-02-27 15:09:13 +00:00
c29ac05ac0 [inductor] correctly retrieve the "shared" attribute from a Triton binary (#120666)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120666
Approved by: https://github.com/jansel
2024-02-27 13:10:09 +00:00
435063aa89 Decomposition for upsample_linear{1d, 3d} (#114774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114774
Approved by: https://github.com/lezcano, https://github.com/vfdev-5, https://github.com/peterbell10
2024-02-27 11:57:45 +00:00
2ad66e6bf0 Fix test failure: Add torch.cuda._get_device_properties to dynamo trace rules (#120620)
In this PR stack, there were unrelated test failures within test_trace_rules.py. It turned out that torch.cuda._get_device_properties should be registered in _dynamo/trace_rules.py; a test failed because it was not.

This is a small fix which tries to get rid of the test failure by manually registering that function.

Note:
I am not sure whether this is the best way to fix this, as I am neither familiar with the trace rules nor with the introduction of torch.cuda._get_device_properties.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120620
Approved by: https://github.com/Skylion007
2024-02-27 10:46:01 +00:00
e3d64c4d5d [dynamo] Desugar accumulate_grad, fix .grad handling (#120590)
Fixes https://github.com/pytorch/pytorch/issues/118435
Fixes https://github.com/pytorch/pytorch/issues/119906

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120590
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #120520
2024-02-27 10:12:26 +00:00
9db6a849ed [FSDP] Clean missing and unexpected keys (#120600)
Currently, when loading with strict=False, or with strict=True and looking at the
error message, FQNs are garbled with FSDP details such as "_fsdp_wrapped_module".
This makes it tricky for upstream applications to validate which
keys are expected to be missing / unexpected (for example with PEFT, where state_dict is loaded
non-strict), and makes the error message more complicated with FSDP details.

This PR cleans those prefixes by using `clean_tensor_name` in FSDP's existing
post-load_state_dict hooks. Currently, only the full_state_dict impl is tested; the rest of the impls can be tested as follow-up work.
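
Illustratively, the cleanup amounts to stripping FSDP wrapper components from each FQN, e.g. (a sketch, not the actual `clean_tensor_name` implementation):

```
# Sketch only: strip the FSDP wrapper prefix from a fully qualified name.
FSDP_WRAPPED_MODULE = "_fsdp_wrapped_module."

def clean_fqn(fqn: str) -> str:
    return fqn.replace(FSDP_WRAPPED_MODULE, "")

print(clean_fqn("model._fsdp_wrapped_module.layer1.weight"))  # model.layer1.weight
```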

Differential Revision: [D54182472](https://our.internmc.facebook.com/intern/diff/D54182472/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120600
Approved by: https://github.com/XilunWu, https://github.com/fegin
2024-02-27 07:43:45 +00:00
b2a318d856 [PyTorch][ExportedProgram] add 'is_lifted_tensor_constant' and 'get_lifted_tensor_constant' utils (#120546)
as title

Differential Revision: [D54149274](https://our.internmc.facebook.com/intern/diff/D54149274/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120546
Approved by: https://github.com/kirklandsign
2024-02-27 07:16:55 +00:00
7c556428c7 Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)
Fixes #115331.

This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary:

- `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`.
- Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1` (a brief sketch follows this list).
- Updated the `ArgumentInfo` struct as it hardcodes the device index as 8 bit field [^1]. Might be a breaking change, not sure if users rely on this.
- Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS`

[^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1.
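
A sketch of the intended Python-frontend behavior after this change (the exact exception type raised for out-of-range indices is an assumption):

```
import torch

# Device objects can be constructed without the device being present,
# so large indices are representable up to MAX_NUM_DEVICES - 1.
d = torch.device("cuda", 511)
print(d.index)  # 511

# Out-of-range indices are now rejected instead of silently wrapping around.
try:
    torch.device("cpu", 600)
except Exception as e:  # assumed to raise; exact error type not specified here
    print("rejected:", e)
```
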
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/huydhn
2024-02-27 07:05:48 +00:00
cbbc309cae [pytree][reland] Require pytree serialized_type_name (#120636)
Relanding https://github.com/pytorch/pytorch/pull/119718 as the diff which prevents breakages of torchrec [D53857843](https://www.internalfb.com/diff/D53857843) has landed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120636
Approved by: https://github.com/avikchaudhuri
2024-02-27 06:53:33 +00:00
12f724c779 [export] preserve constant fqn (#120664)
Summary:
Previously we were renaming constants to `lifted_constant_tensor0` or equivalent. This PR changes things so that the constants retain the same FQN as in the original eager module.

Actually, `symbolic_trace` already is supposed to do this, but the code path is not triggered when used from `make_fx`, since we don't pass an actual `nn.Module` instance to `trace()`, but rather a multiply-wrapped-functionalized-lambda-thing.

So, I reproduced the essential logic outside of make_fx, at the export layer.

Test Plan: added a unit test

Differential Revision: D54221616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120664
Approved by: https://github.com/SherlockNoMad
2024-02-27 06:35:51 +00:00
a358b23a6a Keep test order due to rename_privateuse1_backend is disposable (#120464)
With the change in https://github.com/pytorch/pytorch/pull/120399,
since rename_privateuse1_backend is disposable, running test_external_module_register with a renamed backend may cause problems. Change the test case name to keep the right (ASCII) order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120464
Approved by: https://github.com/albanD
2024-02-27 05:38:43 +00:00
5a5b654481 [BE]: Enable ruff LOG checks (#120674)
Enable LOG error codes in ruff to find bad usages of the logger: https://docs.astral.sh/ruff/rules/#flake8-logging-log
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120674
Approved by: https://github.com/ezyang
2024-02-27 04:37:20 +00:00
b6139b1e57 [PyTorch][CUDA Caching Allocator] Export sync-stream-and-free-HBM counter in memory_stats for performance debugging (#120050)
Differential Revision: D53734057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120050
Approved by: https://github.com/xw285cornell
2024-02-27 04:34:53 +00:00
a1c641f118 [executorch hash update] update the pinned executorch hash (#120675)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120675
Approved by: https://github.com/pytorchbot
2024-02-27 03:59:16 +00:00
237773132d Restore artifact name in log messages (#120671)
Yuzhen Huang was complaining to me that searching for `__recompile`
no longer works. This is because the glog format shows the filename, not the
logger name, so we lost the artifact name. Add it back.

Looks like:

```
V0226 15:56:04.142000 139828992779264 torch/_dynamo/guards.py:1084] [0/2] __guards: ___check_type_id(L['inputs'], 7626144)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120671
Approved by: https://github.com/Skylion007
2024-02-27 03:37:11 +00:00
ac28571742 [vision hash update] update the pinned vision hash (#119944)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119944
Approved by: https://github.com/pytorchbot
2024-02-27 03:25:51 +00:00
9d423f0e91 [audio hash update] update the pinned audio hash (#120135)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120135
Approved by: https://github.com/pytorchbot
2024-02-27 03:20:00 +00:00
63f874b476 [dynamo][guards-cpp-refactor] DictGetItemGuardAccessor for f_locals (#120593)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120593
Approved by: https://github.com/jansel
2024-02-27 03:13:55 +00:00
27990045ff docker: Only match tags that start with v* (#120670)
To avoid issues where version could be confused with a ciflow tag.

Example:

```
❯ git describe --tags --always
ciflow/periodic/c3496d50f0bb437c70f27085f71155209277bfd4-47-g4ca24959d1a
❯ git describe --tags --always --match "v[1-9]*.*"
v1.8.0-rc1-36500-g4ca24959d1a
```

Resolves https://github.com/pytorch/pytorch/issues/120392

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120670
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-02-27 02:55:33 +00:00
cf6df886a0 Remove hard numpy dependency from experimental_ops.py (#119520)
Based on similar code in the codebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119520
Approved by: https://github.com/albanD
2024-02-27 02:46:13 +00:00
2de7468d2b Switch to native functional collective by default (#120370)
This enables native functional collectives by default. After this PR:
- The Python APIs remain backward compatible. Users will receive a deprecation warning if they use `(rank, tags)` as process group identifier.
- Collectives will be captured as `_c10d_functional` ops in post-grad fx graphs. The change will not affect end-users, but it will impact `torch-xla` which has implemented an all-reduce backend based on the existing `c10d_functional` IR. This excludes the migration for `torch-xla` use cases, which will be coordinated separately (see communications in #93173).
- Collectives will be lowered to and codegen'd by new Inductor collective IRs (`ir._CollectiveKernel` and `ir._WaitKernel`). This change will not affect end-users.

Testing performed:
- We have been running a set of representative unit tests with both the new native funcol and the old py funcol in CI. These tests will continue to run with the old py funcol after this PR, so they remain covered until they are removed.
- Manually verified with e2e llama model training with DTensor + functional collectives (https://github.com/fairinternal/xlformers/tree/pt2_llm/pt2d#create-your-local-development-env).

Fallback mechanism:
- Introduced a temporary environment variable `TORCH_DISABLE_NATIVE_FUNCOL` that allows users to fall back to the previous implementation (a sketch of using it follows below). We don't expect the migration to break anything; the mechanism is a safety measure to reduce potential disruption in case the PR causes unforeseen breakages.

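A minimal sketch of using this escape hatch; whether the variable is read at import time or lazily is an assumption, so it is set before importing torch here:
```
import os

# Hedged sketch: opt back into the previous (py) funcol implementation.
os.environ["TORCH_DISABLE_NATIVE_FUNCOL"] = "1"

import torch  # noqa: E402
```
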
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120370
Approved by: https://github.com/wconstab, https://github.com/yf225
2024-02-27 01:53:56 +00:00
8a59f49da2 [dynamo][compile-time] Collect guard debug stack info only with logs enabled (#120520)
Reduces backend=eager compile time from 33 to 19 seconds for `MobileBertForQuestionAnswering`. This also helps an internal model where the guards.add function takes 124 seconds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120520
Approved by: https://github.com/mlazos
2024-02-27 01:51:16 +00:00
2e0e545759 [EZ][BE] Use nested namespace in functorch (#120663)
I should really enable this clang-tidy check rather than doing it by hand
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120663
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2024-02-27 01:45:32 +00:00
b3fe53e1ad [1/2] Intel GPU Runtime Upstreaming for Generator (#118528)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the last runtime component we would like to upstream is `Generator` which is responsible for the pseudo-random number generation. To facilitate the code review, we split the code changes into 2 PRs. This is one of the 2 PRs and covers the changes under `aten`.

# Design
Following the previous design, `c10::GeneratorImpl` is the device-agnostic abstraction of a random number generator. So we introduce an XPU generator, `XPUGeneratorImpl`, inheriting from `c10::GeneratorImpl`, to manage random states on an Intel GPU device. The Intel GPU runtime `Generator` adopts the same algorithm as the CPU one. The corresponding C++ files are placed in the aten/src/ATen/xpu/ folder and built into `libtorch_xpu.so`.
This PR provides the following APIs:
- `getDefaultXPUGenerator`
- `createXPUGenerator`

# Additional Context
The 2nd PR will cover `python frontend`.

The differences from CUDA:
The generator-related ATen C++ APIs map 1:1 with CUDA.
XPUGeneratorImpl's member functions differ slightly from CUDA's.
The following CUDA counterpart APIs are not provided:
- capture_prologue
- capture_epilogue
- philox_cuda_state
- reset_rnn_state

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118528
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-02-27 01:39:40 +00:00
f064dec7e0 Add torch.ops.aten.print (#120295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120295
Approved by: https://github.com/zou3519
2024-02-27 01:34:59 +00:00
ef9b6d6816 Replace individual detaches with overall torch.no_grad decorator (#120638)
Fixes https://github.com/pytorch/pytorch/issues/120611.

At first, I thought there were too many detaches, but @awgu and I concluded that both `clip_grad_norm_` and `clip_grad_value_` should run under torch.no_grad, similar to the optimizer step. One option is to keep calling `detach`, but doing that on many tensors is slower than entering no_grad mode (I think?), and Andrew had noticed: "the 1st round of detaches takes 10 ms for FSDP2, whereas existing FSDP's clip_grad_norm_ only takes 3 ms total" since there are more tensors in FSDP2.

This change also disables grad mode for the foreach path of `clip_grad_value_`; the first attempt omitted this by oversight. Not sure how to add a test case for this, since grad mode is turned back on after the call.

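A minimal sketch of the pattern this change adopts, shown for the value-clipping case; it is illustrative only and not the actual torch/nn/utils/clip_grad.py implementation:
```
import torch

# Hedged sketch: run the whole clipping routine under no_grad (like an optimizer
# step) instead of detaching every gradient tensor individually.
@torch.no_grad()
def clip_grad_value_sketch(parameters, clip_value: float) -> None:
    grads = [p.grad for p in parameters if p.grad is not None]
    if grads:
        torch._foreach_clamp_min_(grads, -clip_value)
        torch._foreach_clamp_max_(grads, clip_value)
```
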
The new profile is not much different from the one at the bottom of this stack, but the number of detaches is 0 :D:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (c71bcceb)]$ python playground2.py
STAGE:2024-02-26 13:07:15 211224:211224 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-26 13:07:16 211224:211224 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-26 13:07:16 211224:211224 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       cudaLaunchKernel        70.63%     110.415ms        70.63%     110.415ms       5.811ms       0.000us         0.00%       0.000us       0.000us            19
                               aten::linalg_vector_norm         0.18%     284.000us        26.00%      40.636ms      40.636ms       3.000us         0.99%       3.000us       3.000us             1
                                            aten::clamp         0.09%     148.000us        14.88%      23.261ms      23.261ms       1.000us         0.33%       1.000us       1.000us             1
                                               aten::to         0.75%       1.170ms        14.05%      21.970ms      84.826us       0.000us         0.00%     258.000us       0.996us           259
                                         aten::_to_copy         2.28%       3.562ms        13.31%      20.800ms     161.240us       0.000us         0.00%     258.000us       2.000us           129
                                    aten::_foreach_norm         4.44%       6.935ms        12.72%      19.878ms       9.939ms      19.000us         6.29%      21.000us      10.500us             2
                                              aten::add         0.11%     173.000us        10.97%      17.153ms      17.153ms       1.000us         0.33%       1.000us       1.000us             1
                                            aten::stack         2.99%       4.673ms         9.15%      14.300ms      14.300ms       0.000us         0.00%       6.000us       6.000us             1
                                            aten::copy_         5.49%       8.586ms         8.96%      14.001ms     108.535us     258.000us        85.43%     258.000us       2.000us           129
                                       aten::reciprocal         0.11%     179.000us         8.35%      13.051ms      13.051ms       1.000us         0.33%       1.000us       1.000us             1
                                              aten::cat         0.64%     993.000us         4.42%       6.902ms       6.902ms       6.000us         1.99%       6.000us       6.000us             1
                                            aten::zeros         0.04%      69.000us         4.28%       6.698ms       3.349ms       0.000us         0.00%       2.000us       1.000us             2
                                            aten::zero_         0.04%      66.000us         4.13%       6.462ms       3.231ms       0.000us         0.00%       2.000us       1.000us             2
                                            aten::fill_         0.06%      98.000us         4.09%       6.396ms       3.198ms       2.000us         0.66%       2.000us       1.000us             2
                                    aten::_foreach_mul_         1.50%       2.342ms         3.79%       5.924ms       2.962ms      10.000us         3.31%      10.000us       5.000us             2
                                            aten::empty         3.27%       5.115ms         3.27%       5.115ms      19.826us       0.000us         0.00%       0.000us       0.000us           258
                                    aten::empty_strided         2.07%       3.237ms         2.07%       3.237ms      25.093us       0.000us         0.00%       0.000us       0.000us           129
                             cudaDeviceEnablePeerAccess         1.93%       3.023ms         1.93%       3.023ms       1.512ms       0.000us         0.00%       0.000us       0.000us             2
                                        aten::unsqueeze         1.21%       1.896ms         1.74%       2.725ms      10.645us       0.000us         0.00%       0.000us       0.000us           256
                                        cudaMemcpyAsync         1.01%       1.572ms         1.01%       1.572ms      12.186us       0.000us         0.00%       0.000us       0.000us           129
                                       aten::as_strided         0.54%     839.000us         0.54%     839.000us       3.265us       0.000us         0.00%       0.000us       0.000us           257
                                    cudaStreamWaitEvent         0.34%     539.000us         0.34%     539.000us       2.089us       0.000us         0.00%       0.000us       0.000us           258
                                        cudaEventRecord         0.18%     274.000us         0.18%     274.000us       1.062us       0.000us         0.00%       0.000us       0.000us           258
                                              aten::mul         0.07%     107.000us         0.08%     132.000us     132.000us       1.000us         0.33%       1.000us       1.000us             1
                                  cudaDeviceSynchronize         0.01%      17.000us         0.01%      17.000us       8.500us       0.000us         0.00%       0.000us       0.000us             2
                                cudaDeviceCanAccessPeer         0.00%       7.000us         0.00%       7.000us       3.500us       0.000us         0.00%       0.000us       0.000us             2
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us         0.66%       2.000us       1.000us             2
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      13.000us         4.30%      13.000us       3.250us             4
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         1.99%       6.000us       3.000us             2
                         Memcpy PtoP (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us     258.000us        85.43%     258.000us       2.000us           129
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         1.99%       6.000us       3.000us             2
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us         0.99%       3.000us       3.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      10.000us         3.31%      10.000us       2.500us             4
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 156.319ms
Self CUDA time total: 302.000us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120638
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #120623
2024-02-27 01:27:05 +00:00
df72819f91 clip_grad_norm can use fast foreach path for inf norm (#120623)
Now that foreach_norm supports inf, we should not special case it.

For a mere 256 parameters, we get a 30ms win in CPU time and a ~800us -> 300us decrease in CUDA time. The win only grows with more parameters.

New profile:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (bf1c0490|REBASE-i|detached HEAD)]$ python playground2.py
STAGE:2024-02-26 13:14:10 395517:395517 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-26 13:14:11 395517:395517 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-26 13:14:11 395517:395517 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       cudaLaunchKernel        67.01%     102.262ms        67.01%     102.262ms       5.382ms       2.000us         0.66%       2.000us       0.105us            19
                               aten::linalg_vector_norm         0.20%     311.000us        23.44%      35.776ms      35.776ms       3.000us         0.99%       3.000us       3.000us             1
                                               aten::to         0.79%       1.208ms        14.62%      22.311ms      86.143us       0.000us         0.00%     263.000us       1.015us           259
                                            aten::clamp         0.12%     182.000us        13.96%      21.303ms      21.303ms       1.000us         0.33%       1.000us       1.000us             1
                                         aten::_to_copy         2.38%       3.628ms        13.83%      21.103ms     163.589us       0.000us         0.00%     263.000us       2.039us           129
                                    aten::_foreach_norm         4.71%       7.185ms        13.54%      20.659ms      10.329ms      19.000us         6.29%      23.000us      11.500us             2
                                              aten::add         0.14%     211.000us        10.86%      16.580ms      16.580ms       1.000us         0.33%       1.000us       1.000us             1
                                            aten::stack         3.11%       4.744ms         9.59%      14.642ms      14.642ms       0.000us         0.00%       6.000us       6.000us             1
                                            aten::copy_         5.71%       8.721ms         9.27%      14.152ms     109.705us     258.000us        85.43%     263.000us       2.039us           129
                                       aten::reciprocal         0.13%     193.000us         7.93%      12.100ms      12.100ms       1.000us         0.33%       1.000us       1.000us             1
                                              aten::cat         0.67%       1.017ms         4.67%       7.129ms       7.129ms       6.000us         1.99%       6.000us       6.000us             1
                                            aten::zeros         0.05%      79.000us         4.46%       6.800ms       3.400ms       0.000us         0.00%       2.000us       1.000us             2
                                            aten::zero_         0.05%      79.000us         4.28%       6.537ms       3.268ms       0.000us         0.00%       2.000us       1.000us             2
                                            aten::fill_         0.09%     131.000us         4.23%       6.458ms       3.229ms       2.000us         0.66%       2.000us       1.000us             2
                                    aten::_foreach_mul_         1.56%       2.377ms         3.86%       5.896ms       2.948ms      10.000us         3.31%      10.000us       5.000us             2
                                            aten::empty         3.55%       5.414ms         3.55%       5.414ms      20.984us       0.000us         0.00%       0.000us       0.000us           258
                                    aten::empty_strided         2.18%       3.323ms         2.18%       3.323ms      25.760us       0.000us         0.00%       0.000us       0.000us           129
                                           aten::detach         0.85%       1.302ms         2.10%       3.199ms      12.496us       0.000us         0.00%       0.000us       0.000us           256
                             cudaDeviceEnablePeerAccess         2.01%       3.069ms         2.01%       3.069ms       1.534ms       0.000us         0.00%       0.000us       0.000us             2
                                        aten::unsqueeze         1.24%       1.899ms         1.81%       2.769ms      10.816us       0.000us         0.00%       0.000us       0.000us           256
                                                 detach         1.24%       1.897ms         1.24%       1.897ms       7.410us       0.000us         0.00%       0.000us       0.000us           256
                                        cudaMemcpyAsync         1.01%       1.539ms         1.01%       1.539ms      11.930us       0.000us         0.00%       0.000us       0.000us           129
                                       aten::as_strided         0.58%     881.000us         0.58%     881.000us       3.428us       0.000us         0.00%       0.000us       0.000us           257
                                    cudaStreamWaitEvent         0.35%     540.000us         0.35%     540.000us       2.093us       0.000us         0.00%       0.000us       0.000us           258
                                        cudaEventRecord         0.18%     278.000us         0.18%     278.000us       1.078us       5.000us         1.66%       5.000us       0.019us           258
                                              aten::mul         0.08%     125.000us         0.09%     138.000us     138.000us       1.000us         0.33%       1.000us       1.000us             1
                                  cudaDeviceSynchronize         0.01%      13.000us         0.01%      13.000us       6.500us       0.000us         0.00%       0.000us       0.000us             2
                                cudaDeviceCanAccessPeer         0.00%       5.000us         0.00%       5.000us       2.500us       0.000us         0.00%       0.000us       0.000us             2
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us         0.66%       2.000us       1.000us             2
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      13.000us         4.30%      13.000us       3.250us             4
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         1.99%       6.000us       3.000us             2
                         Memcpy PtoP (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us     258.000us        85.43%     258.000us       2.000us           129
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         1.99%       6.000us       3.000us             2
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us         0.99%       3.000us       3.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      10.000us         3.31%      10.000us       2.500us             4
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 152.613ms
Self CUDA time total: 302.000us
```

Compared to on main:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5a0a9644)]$ python playground2.py
STAGE:2024-02-26 13:09:56 285045:285045 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-26 13:09:57 285045:285045 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-26 13:09:57 285045:285045 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       cudaLaunchKernel        61.42%     113.375ms        61.42%     113.375ms     424.625us      45.000us         5.66%      45.000us       0.169us           267
                               aten::linalg_vector_norm        14.04%      25.909ms        37.67%      69.534ms     271.617us     514.000us        64.65%     559.000us       2.184us           256
                                               aten::to         0.78%       1.433ms        12.87%      23.751ms      91.703us       0.000us         0.00%     278.000us       1.073us           259
                                         aten::_to_copy         2.02%       3.730ms        12.09%      22.318ms     173.008us       0.000us         0.00%     278.000us       2.155us           129
                                            aten::clamp         0.09%     174.000us        11.43%      21.103ms      21.103ms       1.000us         0.13%       1.000us       1.000us             1
                                              aten::add         0.11%     205.000us         9.08%      16.768ms      16.768ms       1.000us         0.13%       1.000us       1.000us             1
                                            aten::copy_         4.94%       9.112ms         8.15%      15.043ms     116.612us     258.000us        32.45%     278.000us       2.155us           129
                                            aten::stack         2.76%       5.091ms         7.97%      14.719ms      14.719ms       0.000us         0.00%       6.000us       6.000us             1
                                       aten::reciprocal         0.11%     194.000us         7.01%      12.933ms      12.933ms       1.000us         0.13%       1.000us       1.000us             1
                                              aten::max         0.09%     165.000us         6.43%      11.868ms      11.868ms       3.000us         0.38%       3.000us       3.000us             1
                                           aten::detach         1.58%       2.911ms         4.12%       7.596ms      14.836us       0.000us         0.00%       0.000us       0.000us           512
                                              aten::cat         0.56%       1.042ms         3.73%       6.882ms       6.882ms       6.000us         0.75%       6.000us       6.000us             1
                                    aten::_foreach_mul_         1.36%       2.503ms         3.33%       6.145ms       3.072ms      10.000us         1.26%      10.000us       5.000us             2
                                                 detach         2.54%       4.685ms         2.54%       4.685ms       9.150us       0.000us         0.00%       0.000us       0.000us           512
                                    aten::empty_strided         1.92%       3.545ms         1.92%       3.545ms      27.481us       0.000us         0.00%       0.000us       0.000us           129
                             cudaDeviceEnablePeerAccess         1.64%       3.022ms         1.64%       3.022ms       1.511ms       0.000us         0.00%       0.000us       0.000us             2
                                        aten::unsqueeze         1.03%       1.892ms         1.49%       2.746ms      10.727us       0.000us         0.00%       0.000us       0.000us           256
                                       aten::as_strided         1.35%       2.494ms         1.35%       2.494ms       4.862us       0.000us         0.00%       0.000us       0.000us           513
                                        cudaMemcpyAsync         1.01%       1.868ms         1.01%       1.868ms      14.481us       4.000us         0.50%       4.000us       0.031us           129
                                    cudaStreamWaitEvent         0.41%     760.000us         0.41%     760.000us       2.946us       8.000us         1.01%       8.000us       0.031us           258
                                        cudaEventRecord         0.15%     276.000us         0.15%     276.000us       1.070us       8.000us         1.01%       8.000us       0.031us           258
                                              aten::mul         0.08%     139.000us         0.08%     153.000us     153.000us       1.000us         0.13%       1.000us       1.000us             1
                                            aten::empty         0.02%      35.000us         0.02%      35.000us      35.000us       0.000us         0.00%       0.000us       0.000us             1
                                  cudaDeviceSynchronize         0.01%      14.000us         0.01%      14.000us       7.000us       0.000us         0.00%       0.000us       0.000us             2
                                cudaDeviceCanAccessPeer         0.00%       5.000us         0.00%       5.000us       2.500us       0.000us         0.00%       0.000us       0.000us             2
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us     514.000us        64.65%     514.000us       2.008us           256
                         Memcpy PtoP (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us     258.000us        32.45%     258.000us       2.000us           129
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.75%       6.000us       3.000us             2
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us         0.38%       3.000us       3.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.13%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.13%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.13%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.13%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      10.000us         1.26%      10.000us       2.500us             4
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 184.579ms
Self CUDA time total: 795.000us
```

For script:
```
import torch
from math import inf
from torch.nn.utils import clip_grad_norm_

params = [torch.rand(32, 16, device="cuda:3")*5 for _ in range(128)] + [torch.rand(32, 16, device="cuda:4")*-7 for _ in range(128)]
for p in params:
    p.grad = torch.rand_like(p)

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    total_norm = clip_grad_norm_(params, 10.0, norm_type=inf)
    torch.cuda.synchronize()

print(p.key_averages().table(sort_by="cpu_time_total"))
print(total_norm)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120623
Approved by: https://github.com/Skylion007, https://github.com/mikaylagawarecki
2024-02-27 01:27:05 +00:00
b01bd1f7a1 Revert "Add torch.ops.aten.print (#120295)"
This reverts commit 3b944113c837e1111510487f4525aa07039462fe.

Reverted https://github.com/pytorch/pytorch/pull/120295 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54123688 ([comment](https://github.com/pytorch/pytorch/pull/120295#issuecomment-1965618191))
2024-02-27 01:18:48 +00:00
17560eb472 Revert "[Dynamo] Remove deadcode: unwrapping script_if_tracing (#120444)"
This reverts commit 4d2073bc3faa7f2014c4fb2f568e68fe195b6f99.

Reverted https://github.com/pytorch/pytorch/pull/120444 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54192376 ([comment](https://github.com/pytorch/pytorch/pull/120444#issuecomment-1965600268))
2024-02-27 00:58:00 +00:00
e874376f6a Mark test_reference_numerics_extremal__refs_frexp_cuda as xfail on windows (#120640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120640
Approved by: https://github.com/clee2000
2024-02-27 00:35:55 +00:00
d341b66e96 Revert [dynamo] support group=None when rewriting collectives (#12018) (#120677)
This reverts commit 298c686d3f7bc26399481b8830e71c4f02ce629c.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120677
Approved by: https://github.com/yifuwang, https://github.com/huydhn
2024-02-27 00:33:35 +00:00
fdae9363b3 [meta registration] efficient_attention_forward fix for NT inputs (#120594)
When cu_seqlens_q is provided, we should use the user-specified max_seqlen_q instead of inferring it as query.size(1):

1c7b0e7cd1/aten/src/ATen/native/transformers/cuda/attention.cu (L989)

This wasn't caught because the value is taken as ceil(max_seqlen / 32) * 32 in the opinfos, and the opinfo inputs were small enough that this value was 32 in either case.

Differential Revision: [D54179733](https://our.internmc.facebook.com/intern/diff/D54179733)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120594
Approved by: https://github.com/drisspg
2024-02-27 00:10:37 +00:00
9dfaef962c Add structured trace logs (#120289)
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit

How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py contains some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implemented as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). There's a teensy bit of FB-specific code to automatically enable trace logging if a /logs directory exists. `trace_structured` is the main way to emit a trace log (see the sketch after this list). Unusually, there are separate "metadata" and "payload" fields. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long, is emitted after the metadata log line, and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py: the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not to be very stable.
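
A hedged sketch of emitting a trace log with `trace_structured`, based on the description above; the exact argument names (`metadata_fn`, `payload_fn`) and the "artifact" record name are assumptions about the API, not guaranteed:
```
from torch._logging._internal import trace_structured

# Hedged sketch: metadata is kept small (one JSON line); the payload may be
# long and is emitted after the metadata line.
trace_structured(
    "artifact",
    metadata_fn=lambda: {"name": "my_debug_artifact", "encoding": "string"},
    payload_fn=lambda: "a long payload\nthat can span multiple lines",
)
```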

https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.

Testing that the fbcode detection works at https://www.internalfb.com/mlhub/pipelines/runs/fblearner/534553450 (Meta-only)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
2024-02-27 00:04:23 +00:00
ecb3f33a1a [dynamo] fix segfault in _debug_get_cache_entry_list (#120635)
Fix https://github.com/pytorch/pytorch/issues/120607.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120635
Approved by: https://github.com/jansel
2024-02-26 23:31:09 +00:00
64660b51f6 Add the hyperlink of the transfomer doc (#120565)
Fixes #120488

- The shape for forward pass is clearly stated in the main [transformer class](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)

- Boolean mask for _key_padding_mask is also explained in the main transformer class.

Therefore, add the hyperlink to the transformer class explicitly so the user can refer back to the main class. Also, correct several symbols in the transformer doc from normal text style to math style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120565
Approved by: https://github.com/mikaylagawarecki
2024-02-26 23:11:58 +00:00
Kai
c59b14163b Implement aten::upsample_linear1d on mps (#115031)
Related to #77764

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115031
Approved by: https://github.com/malfet
2024-02-26 23:04:52 +00:00
30625ae582 Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we have both pytest rerunning the failing test twice via 81abc2b249/test/run_test.py (L966) and our own logic retrying it as well.

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-26 22:21:14 +00:00
41adec3c59 Revert "Switch to native functional collective by default (#120370)"
This reverts commit 1f1bc0e6acc3613339b1001a7c9fcd1dfe7b6580.

Reverted https://github.com/pytorch/pytorch/pull/120370 on behalf of https://github.com/yifuwang due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/120370#issuecomment-1965362938))
2024-02-26 21:55:13 +00:00
7b1cc140aa Use lxml in scripts/compile_tests when it is available (#120633)
It's around 30x (300s -> 10s) faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120633
Approved by: https://github.com/oulgen
2024-02-26 21:35:22 +00:00
5a0a964444 [Dynamo] Fix guards for script_if_tracing or lru_cache fn with default args (#120390)
Fixes #120387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120390
Approved by: https://github.com/anijain2305
2024-02-26 19:40:14 +00:00
55b5908427 [PT2][Inductor]Add unbind node normalization (#120253)
Summary: Normalize unbind nodes for the followup split_cat pattern detection and node removals

Test Plan:
```
buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/f42297c2-2595-40a2-b270-5cec026f2fe4
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17451448578242323
Network: Up: 132KiB  Down: 88KiB  (reSessionID-fc725143-317a-42a9-bc7e-0bbab6ef9e5c)
Jobs completed: 27. Time elapsed: 3:09.2s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 11. Fail 0. Fatal 0. Skip 0. Build failure 0

```
buck2 test mode/opt mode/inplace caffe2/test/inductor/fb:test_split_cat_fx_passes_aten_fb
```

Differential Revision: D53964593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120253
Approved by: https://github.com/jackiexu1992
2024-02-26 19:13:26 +00:00
274b362442 [FSDP] Removed .detach in clip_grad_norm_ (#120612)
This seems unnecessary under `no_grad()` context. The unit tests still pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120612
Approved by: https://github.com/Skylion007
ghstack dependencies: #120231
2024-02-26 19:03:00 +00:00
fd3cf88f27 Rewrite docs about why we guard on dynamic dims (#120566)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120566
Approved by: https://github.com/desertfire
2024-02-26 18:58:30 +00:00
759204253f [export] Change runtime asserts to using assert_scalar (#119608)
By changing runtime symbolic asserts to use assert_scalar, the asserts can call into `expect_true` and modify the shape env so that we can run through the traced graph module with fake tensors. With assert_async, the asserts only get hit at runtime, which means that if we run the graph module with fake tensors, the asserts will not affect the shape env, so later data-dependent calls on the fake tensors may result in GuardOnDataDependentSymNode errors.

https://github.com/pytorch/pytorch/issues/119587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119608
Approved by: https://github.com/ezyang
2024-02-26 17:56:12 +00:00
2fb32a5f07 Enable fake tensor caching in fbcode by default (#118555)
Summary: Enabled by default in OSS; this switches the default to "on" in fbcode too.

Test Plan: Ran torchbench benchmarks in fbcode

Differential Revision: [D53771626](https://our.internmc.facebook.com/intern/diff/D53771626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118555
Approved by: https://github.com/eellison
2024-02-26 17:35:23 +00:00
ee01d0807b [dynamo] Function => FunctionCtx for placeholder obj (#120577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120577
Approved by: https://github.com/yanboliang
2024-02-26 17:16:31 +00:00
7eb7ac815f [inductor] Optimize welford reduction (#120330)
This does two things:
1) Short circuit `welford_reduce` on the first iteration to ignore the accumulator (big win for small `rnumel`)
2) Replace division with multiplication by reciprocal

Currently this is not enough to match the two-pass reduction with bfloat16, but it is still a significant improvement.
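
For reference, a hedged sketch in plain Python of a Welford update with both tweaks; the real change lives in the Inductor codegen, so this only illustrates the arithmetic:
```
# Hedged sketch of a single-pass Welford mean/variance accumulator:
# (1) the first iteration just seeds the accumulator, skipping the combine math;
# (2) the division by the running count becomes a multiply by its reciprocal.
def welford_reduce(values):
    mean, m2, weight = 0.0, 0.0, 0
    for i, x in enumerate(values):
        if i == 0:                  # short-circuit: nothing to combine with yet
            mean, m2, weight = x, 0.0, 1
            continue
        weight += 1
        rcp = 1.0 / weight          # reciprocal computed once, then multiplied
        delta = x - mean
        mean += delta * rcp
        m2 += delta * (x - mean)
    return mean, m2, weight

mean, m2, n = welford_reduce([1.0, 2.0, 4.0, 8.0])
print(mean, m2 / n)                 # running mean and (population) variance
```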

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120330
Approved by: https://github.com/lezcano
2024-02-26 17:01:47 +00:00
c39bbd6def Numbers based TD (#119901)
Convert from a list/bucket based TD system to just a numbers based TD system.  Looks like a massive change but a decent amount of it is tests and removing code.

The main file of interest is interface.py, which GitHub collapses by default due to its size.

The test files pretty much got rewritten entirely since a lot of the old tests are no longer relevant.

Other notable changes:
* Use Frozenset to make TestRun hashable
* Adds tools/test/heuristics/__init__.py to ensure that unittest can discover the tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119901
Approved by: https://github.com/osalpekar, https://github.com/huydhn
2024-02-26 17:01:19 +00:00
86063b4d03 Add torch._print to dynamo trace_rules (#120533)
Fixes #114831

Before:
```
(pytorch10) angelayi@devgpu022 ~/local/pytorch [main] $  python test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated
F
======================================================================
FAIL: test_torch_name_rule_map_updated (__main__.TraceRuleTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 2739, in wrapper
    method(*args, **kwargs)
  File "/data/users/angelayi/pytorch/test/dynamo/test_trace_rules.py", line 328, in test_torch_name_rule_map_updated
    self._check_set_equality(
  File "/data/users/angelayi/pytorch/test/dynamo/test_trace_rules.py", line 302, in _check_set_equality
    self.assertTrue(len(x) == 0, msg1)
AssertionError: False is not true : New torch objects: {<built-in method _print of type object at 0x7ff477e40ee0>} were not added to `trace_rules.torch_c_binding_in_graph_functions` or `test_trace_rules.ignored_c_binding_in_graph_function_names`. Refer the instruction in `torch/_dynamo/trace_rules.py` for more details.

To execute this test, run the following from the base repo dir:
     python test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.184s

FAILED (failures=1)
```
After change:
```
(pytorch10) angelayi@devgpu022 ~/local/pytorch [main] $  python test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated
.
----------------------------------------------------------------------
Ran 1 test in 0.209s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120533
Approved by: https://github.com/clee2000, https://github.com/yanboliang, https://github.com/huydhn, https://github.com/Skylion007
2024-02-26 16:52:59 +00:00
8a32a07856 Revert "Add meta device support to sparse compressed tensors (#120498)"
This reverts commit 5d71ba688563ef491bb28d47c493ec6fc7791da2.

Reverted https://github.com/pytorch/pytorch/pull/120498 on behalf of https://github.com/zou3519 due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/120498#issuecomment-1964491999))
2024-02-26 15:59:36 +00:00
b381a4372b make GPT2ForSequenceClassification pass inference accuracy check (#120537)
We need a higher tolerance for GPT2ForSequenceClassification since if I change --bfloat16 in
```
time python benchmarks/dynamo/huggingface.py --accuracy --inference --bfloat16 --backend inductor --disable-cudagraphs --only GPT2ForSequenceClassification
```
to --float16 or --float32 it will pass the accuracy check.

Adding --freezing can also make the test pass for this model. I think that may be due to a different fusion output being generated (depending on whether constant propagation happens, which is controlled by freezing), causing a small numerical difference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120537
Approved by: https://github.com/jansel
2024-02-26 11:02:57 +00:00
f4cf25bb24 Fix a bug where nn.functional._AllGather.backward produces wrong gradients (#120582)
Summary:
Fixes #120386

`_AllGather.backward` assumes that `_ReduceScatter` always updates the output buffer in place. However, when the output buffer is non-contiguous, `_ReduceScatter` allocates and returns a different buffer, causing the gradient to be thrown away.

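A hedged sketch of the guard this fix amounts to; `reduce_into` and its arguments are illustrative stand-ins, not the actual `_ReduceScatter` call:
```
import torch

# Hedged sketch: a collective may return a fresh contiguous buffer instead of
# writing into a non-contiguous `out` in place; copying back preserves the result.
def reduce_into(out: torch.Tensor, op) -> torch.Tensor:
    result = op(out)
    if result is not out:
        out.copy_(result)
    return out

buf = torch.zeros(4, 4).t()                          # non-contiguous output buffer
reduce_into(buf, lambda o: o.contiguous().add_(1))   # toy stand-in for the collective
print(buf.sum())                                     # 16.0: the result was copied back
```
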
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120582
Approved by: https://github.com/XilunWu
2024-02-26 09:58:27 +00:00
c617e7b407 Add resnet50/mobilenet_v2_quantized_qat into deterministic_algorithms exclusive list (#120384)
After PR https://github.com/pytorch/pytorch/pull/120026, two `Torchbench` test cases, `resnet50_quantized_qat` and `mobilenet_v2_quantized_qat`, pass the performance test but fail the accuracy test. The failure message is: `mobilenet_v2_quantized_qat, RuntimeError: quantized_resize_cpu_ does not have a deterministic implementation but you set 'torch.use_deterministic_algorithms(True)'.`

- `torch.use_deterministic_algorithms(True)` is only set for the accuracy test. fff9d98e58/benchmarks/dynamo/common.py (L3480)
- However, `quantized_resize_cpu_` only supports nondeterministic algorithms because the resized output memory may be uninitialized. fff9d98e58/aten/src/ATen/native/quantized/cpu/TensorOperators.cpp (L85-L87)

This PR adds these two models to the deterministic_algorithms exclusion list.

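For context, a small hedged example of the global switch involved; the exclusion itself lives in the benchmark harness, so this only shows the toggle:
```
import torch

# The accuracy harness turns this on globally; any op without a deterministic
# implementation (such as quantized_resize_cpu_) then raises a RuntimeError.
torch.use_deterministic_algorithms(True)

x = torch.randn(3)
print(x.cumsum(0))  # fine: cumsum has a deterministic CPU implementation

torch.use_deterministic_algorithms(False)  # performance runs leave this off
```
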
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120384
Approved by: https://github.com/desertfire, https://github.com/jgong5
2024-02-26 05:05:43 +00:00
a299db2983 [dynamo][guards-cpp-refactor] NO_HASATTR guard (#120469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120469
Approved by: https://github.com/jansel
2024-02-26 04:37:40 +00:00
1c7b0e7cd1 [inductor][cpp] disable masked load for non-fp data types (#120558)
Fix https://github.com/pytorch/pytorch/issues/120377. We disable the masked load for non-fp data types for now. The complete support of masks will be added in https://github.com/pytorch/pytorch/pull/119654.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120558
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-02-26 04:12:22 +00:00
ea20885d95 [executorch hash update] update the pinned executorch hash (#120264)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120264
Approved by: https://github.com/pytorchbot
2024-02-26 03:55:32 +00:00
c18623b7ed [dynamo] Reland 120147 - - Use EQUALS_MATCH guard for mod.training (#120578)
To fix the memory leak discovered in https://github.com/pytorch/pytorch/issues/112090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120578
Approved by: https://github.com/jansel
2024-02-26 03:49:47 +00:00
685d862c45 Add SparsePrivateUse1 in backend_to_string, layout_from_backend and check_base_legacy_new. (#119263)
1) Use items stored in torch._tensor_classes to check items passed from the Python side;
2) Add SparsePrivateUse1 to backend_to_string, layout_from_backend and check_base_legacy_new;
3) Use a more general API to get the Python module name in the get_storage_obj and get_name functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119263
Approved by: https://github.com/ezyang
2024-02-26 01:54:30 +00:00
4328e772bf [dynamo][guards-cpp-refactor] DICT_VERSION guard (#120416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120416
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096, #120342, #120344, #120359
2024-02-25 23:24:24 +00:00
c269e48af0 [dynamo][guards-cpp-refactor] DictGuardManager (#120359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120359
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096, #120342, #120344
2024-02-25 23:24:24 +00:00
775a4388d9 [dynamo][guards-cpp-refactor] WEAKREF_ALIVE guard (#120344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120344
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096, #120342
2024-02-25 23:24:04 +00:00
5d71ba6885 Add meta device support to sparse compressed tensors (#120498)
As in the title.

Unblocks https://github.com/pytorch/pytorch/pull/117907#discussion_r1499251745

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120498
Approved by: https://github.com/ezyang
2024-02-25 16:50:17 +00:00
834c7a1d3e [dynamo][refactor] Move some helper functions to global scope (#120426)
This is to prepare for guard C++ refactor work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120426
Approved by: https://github.com/ezyang
2024-02-25 04:38:20 +00:00
5c7b761f8e Fix default world_size when running on 1 or 0 GPU (#119372)
The mentioned distributed tests would fail if the number of available GPUs isn't sufficient; we need to correct the default world size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119372
Approved by: https://github.com/eqy, https://github.com/fegin
2024-02-25 04:14:34 +00:00
cyy
81f0b2c14e [Clang-tidy header][19/N] Enable clang-tidy on torch/csrc/autograd/profiler_legacy.* (#120552)
This PR enables clang-tidy on torch/csrc/autograd/profiler_legacy.* and cleans some path rules of clang-tidy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120552
Approved by: https://github.com/Skylion007
2024-02-25 03:29:40 +00:00
298c686d3f [dynamo] support group=None when rewriting collectives (#120118)
Resolves case 2 in #120082.

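For illustration, a hedged sketch of the call pattern this makes traceable; it assumes a default process group has already been initialized and is not the exact repro from the issue:
```
import torch
import torch.distributed as dist

@torch.compile
def allreduce_mean(t: torch.Tensor) -> torch.Tensor:
    # group=None (or omitting group) now resolves to the default process
    # group when dynamo rewrites the collective, instead of failing.
    dist.all_reduce(t, group=None)
    return t / dist.get_world_size()
```
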
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120118
Approved by: https://github.com/wconstab
ghstack dependencies: #120370
2024-02-25 03:12:10 +00:00
3e382456c1 Fix compiler check (#120492)
Fixes #119304

1. Add a try/except around the compiler version check.
2. Retry querying the compiler version info.
3. Return False if the compiler info can't be obtained after two attempts (see the sketch below).

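A minimal sketch of that retry logic, assuming a subprocess-based version query; the actual implementation in the codebase may differ in names and details:
```
import subprocess

# Hedged sketch: query the compiler version, retrying once, and report
# "not usable" instead of raising if both attempts fail.
def compiler_is_usable(compiler: str = "g++", attempts: int = 2) -> bool:
    for _ in range(attempts):
        try:
            out = subprocess.check_output(
                [compiler, "--version"], stderr=subprocess.STDOUT, text=True
            )
            if out:
                return True
        except (subprocess.SubprocessError, OSError):
            continue
    return False

print(compiler_is_usable())
```
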
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120492
Approved by: https://github.com/ezyang
2024-02-25 02:41:20 +00:00
0f20cc1e0e Don't use size on TensorVariable when doing out resize test (#120567)
Fixes https://github.com/pytorch/pytorch/issues/120482
Fixes https://github.com/pytorch/pytorch/issues/120511

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120567
Approved by: https://github.com/Skylion007
2024-02-25 02:21:34 +00:00
54c1cf8d8a add distributed checkpoint support for custom device (#120201)
Fixes #120200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120201
Approved by: https://github.com/fegin, https://github.com/wz337
2024-02-24 19:14:29 +00:00
56203fc407 Add profiling for backward (#120540)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120540
Approved by: https://github.com/anijain2305
2024-02-24 16:53:28 +00:00
a17979faa6 [dynamo] add stronger test for dynamo memory leaks (#120459)
This issue was raised by a regression of https://github.com/pytorch/pytorch/issues/112090 caused by https://github.com/pytorch/pytorch/pull/120147.

Make the memory leak test stronger by using weakref to check for model deletion instead of measuring CUDA memory allocation.

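A hedged sketch of the weakref-based check described above; the real test in test/dynamo is more involved, and the backend and model here are arbitrary:
```
import gc
import weakref

import torch

def model_is_released_after_compile() -> bool:
    # Hold only a weak reference to the model; once the strong references are
    # dropped, the weakref should be dead unless dynamo keeps the module alive
    # (e.g. via guards or caches), which would indicate a leak.
    model = torch.nn.Linear(8, 8)
    ref = weakref.ref(model)
    opt = torch.compile(model, backend="eager")
    opt(torch.randn(2, 8))
    del model, opt
    gc.collect()
    return ref() is None
```
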
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120459
Approved by: https://github.com/jansel
2024-02-24 16:30:20 +00:00
a62d9184d5 [ET-VK] Move graph runtime from PT directory to ET directory (#120528)
Summary:
## Context

Move Vulkan graph runtime from PyTorch directory to ExecuTorch directory to improve development logistics:

* ExecuTorch delegate changes will no longer require export to PyTorch directory
* Makes it much easier to enable OSS build for Vulkan delegate

Test Plan:
```
LD_LIBRARY_PATH=/home/ssjia/fbsource/third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/executorch/backends/vulkan/test:vulkan_compute_api_test_bin

buck2 run fbcode//executorch/backends/vulkan/test:test_vulkan_delegate
```

Differential Revision: D54133350

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120528
Approved by: https://github.com/manuelcandales
2024-02-24 15:00:21 +00:00
1f1bc0e6ac Switch to native functional collective by default (#120370)
This enables native functional collectives by default. After this PR:
- The Python APIs remain backward compatible. Users will receive a deprecation warning if they use `(rank, tags)` as process group identifier.
- Collectives will be captured as `_c10d_functional` ops in post-grad fx graphs. The change will not affect end-users, but it will impact `torch-xla` which has implemented an all-reduce backend based on the existing `c10d_functional` IR. This excludes the migration for `torch-xla` use cases, which will be coordinated separately (see communications in #93173).
- Collectives will be lowered to and codegen'd by new Inductor collective IRs (`ir._CollectiveKernel` and `ir._WaitKernel`). This change will not affect end-users.

Testing performed:
- We have been running a set of representative unit tests with both the new native funcol and the old py funcol in CI. These tests will continue to run with the old py funcol after this PR, so they remain covered until they are removed.
- Manually verified with e2e llama model training with DTensor + functional collectives (https://github.com/fairinternal/xlformers/tree/pt2_llm/pt2d#create-your-local-development-env).

Fallback mechanism:
- Introduced a temporary environment variable `TORCH_DISABLE_NATIVE_FUNCOL` that allows users to fall back to the previous implementation. We don't expect the migration to break anything; the mechanism is a safety measure to reduce potential disruption in case the PR causes unforeseen breakages.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120370
Approved by: https://github.com/wconstab, https://github.com/yf225
2024-02-24 09:38:26 +00:00
33938cfddd [BE][Ez] Update ruff to 0.2.2 (#120517)
Updates ruff to 0.2.2. This updates the config and handles some of the new rules that have come out of preview.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120517
Approved by: https://github.com/albanD
2024-02-24 07:13:53 +00:00
79f059987e Update find_test_dir() to check for skip files relative to the local path first. (#120521)
The search code to find the dynamo skip files wasn't working properly when used with pytest and multiple files:
```
pytest a.py b.py
```
because pytest would point `__main__` at itself instead of the individual file. (This worked fine when only running a single file test)

Change the scanning code to look for the skip directory relative to its own file first.

While in there add/update some comments and log a warning when the directory wasn't found (instead of a hard crash).
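A rough sketch of the lookup order described above (the directory name and helper are assumptions, not the exact code):

```python
import logging
import os

log = logging.getLogger(__name__)

def find_test_dir():
    # Look next to this file first, so `pytest a.py b.py` still finds the
    # skip files even though __main__ points at pytest itself.
    here = os.path.dirname(os.path.abspath(__file__))
    candidate = os.path.join(here, "dynamo_expected_failures")
    if os.path.isdir(candidate):
        return candidate
    # Warn instead of hard-crashing when the directory isn't found.
    log.warning("dynamo skip directory not found next to %s", __file__)
    return None
```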

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120521
Approved by: https://github.com/oulgen
2024-02-24 03:29:25 +00:00
15add24bf2 fix: set codegen in _SplitterBase partitioner (#120361)
For graphs with a single output, the expected output type of a torch.export / torch.compile graph module is a single torch.Tensor rather than a tuple.
However, after applying the `_SplitterBase` partitioner to such a graph module (obtained from torch.export/torch.compile), the resulting graph module returns a tuple of tensors, in this case `(output,)`.

This PR adds codegen to the graphs produced by the `_SplitterBase` partitioner. Setting this ensures pytree unflatten nodes are added automatically to handle unflattening of the output, so single outputs are returned directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120361
Approved by: https://github.com/angelayi
2024-02-24 02:27:20 +00:00
3eefe96297 Update scripts/compile_tests/update_failures.py (#120529)
In order to unbreak this script, I have only tested with
```
./scripts/compile_tests/update_failures.py 97918e8c37e649dc8782bb1822ae954bca904d0f
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120529
Approved by: https://github.com/zou3519
2024-02-23 22:15:44 +00:00
b7df3bba62 add decomposition for frexp (#119217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119217
Approved by: https://github.com/peterbell10
ghstack dependencies: #119284, #120027
2024-02-23 21:52:42 +00:00
f7e79299c7 register torch.return_types in torch.fx._pytree (#120027)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120027
Approved by: https://github.com/lezcano, https://github.com/zou3519, https://github.com/XuehaiPan
ghstack dependencies: #119284
2024-02-23 21:52:42 +00:00
c3496d50f0 Fix torch.return_types init signature (#119284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119284
Approved by: https://github.com/peterbell10, https://github.com/XuehaiPan
2024-02-23 21:52:34 +00:00
623632a401 More informative stacklevel for autograd function warning (#120512)
Internal xref:
https://fb.workplace.com/groups/1405155842844877/posts/8064897663537295

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120512
Approved by: https://github.com/albanD
2024-02-23 21:48:55 +00:00
4d2073bc3f [Dynamo] Remove deadcode: unwrapping script_if_tracing (#120444)
After the consolidation of ```trace_rules.lookup```, we already unwrap at
2240018c03/torch/_dynamo/variables/builder.py (L712)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120444
Approved by: https://github.com/anijain2305
2024-02-23 21:22:09 +00:00
8e20385447 [c10d] fix the macro definition of NCCL_COMM_DUMP (#120502)
Summary:
We should dump the comm state only if both macros are defined;
otherwise, use the original definition.

The previous implementation missed the function definition when IS_NCCL_EXP is defined but NCCL_COMM_DUMP is not.

Test Plan:
Build and unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120502
Approved by: https://github.com/dsjohns2, https://github.com/Skylion007
2024-02-23 20:59:39 +00:00
7cd623aa89 Remove monkey-patch for torch.utils._rebuild_tensor (#120446)
Not needed after #108186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120446
Approved by: https://github.com/titaiwangms, https://github.com/BowenBao
2024-02-23 20:42:50 +00:00
ed0ea2f30b add export to torch.jit.__all__ (#120432)
I use Pyright in VS Code. When I use `@torch.jit.export`, I always see an annoying error saying `export` is not exported.

![image](https://github.com/pytorch/pytorch/assets/9496702/f7b0e17f-6497-4f9a-87dd-55dc627156c3)

Adding it to `__all__` should fix it.

I have seen #92240 and #101678, and I am not sure why `export` is not there. cc @ringohoffman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120432
Approved by: https://github.com/eellison
2024-02-23 20:37:09 +00:00
e29eb39e04 [EZ] Fix typo in gcc version detection (#120489)
It should be `FATAL_ERROR` rather than `FATAL`

I wish cmakelint would have detected it

Also, downgrade this check to 9.3, as all our binary builds are using 9.3 at the moment (will update in a followup PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120489
Approved by: https://github.com/DanilBaibak, https://github.com/Skylion007
2024-02-23 20:31:21 +00:00
007606e520 [dynamo][guards-cpp-refactor] TENSOR_MATCH guard (#120342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120342
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096
2024-02-23 20:10:09 +00:00
4b65d192f0 [dynamo][guards-cpp-refactor] DYNAMIC_INDICES guard (#120096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120096
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093
2024-02-23 20:10:09 +00:00
a92ce46dc3 [dynamo][guards-cpp-refactor] GlobalWeakRefGuardAccessor (#120093)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120093
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123
2024-02-23 20:10:01 +00:00
bb331b1eb5 [dynamo][guards-cpp-refactor] LENGTH_CHECK guard (#120123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120123
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119
2024-02-23 20:09:52 +00:00
2eac593ffd [dynamo][guards-cpp-refactor] TUPLE_ITERATOR_LEN guard (#120119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120119
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091
2024-02-23 20:09:43 +00:00
da95421f05 [dynamo][guards-cpp-refactor] TupleIteratorGetItemAccessor (#120091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120091
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089
2024-02-23 20:09:34 +00:00
39f0a5ecc9 [c10d] simplify the dump timeout logic and unify the async call (#120331)
Summary:
The current dump timeout logic is a bit cumbersome as it needs two times: 1. the
timeout, 2. the wake-up time. In theory the caller just needs to wait
at most the timeout value for the dump and then declare the dump
either successful or not. We also unify the async calls using std::async
instead of a customized async launch function for each operation.
Test Plan:
Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120331
Approved by: https://github.com/wconstab
2024-02-23 19:46:40 +00:00
8646872ff7 Make balance_gradient preserved in export (#120332)
Summary: We can only not-decompose CompositeImplicit functional custom ops. From the looks of the implementation, this op looks functional. So the fix is just fixing the schema.

Test Plan: CI

Differential Revision: D54019265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120332
Approved by: https://github.com/zhxchen17
2024-02-23 19:14:08 +00:00
182ed1e32c Use a dtype property in torch inductor nodes (#119227)
I usually forget to do `x.get_dtype()` and I type `x.dtype`. Similarly for `layout, device, sizes`. What do you think about making them properties?
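A minimal sketch of the idea (hypothetical class, not the actual Inductor IR node):

```python
class IRNodeSketch:
    def __init__(self, dtype):
        self._dtype = dtype

    def get_dtype(self):
        return self._dtype

    # Expose the accessor as a read-only property so `x.dtype` works
    # the same as `x.get_dtype()`.
    @property
    def dtype(self):
        return self.get_dtype()
```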

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119227
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-02-23 18:40:03 +00:00
d54121d13f Increase bazel CUDA tests timeout to 480s (#120443)
One of the bazel CUDA tests, `//:modules_test`, frequently times out in trunk, so I'm increasing the timeout value to 480s https://bazel.build/reference/test-encyclopedia to see if it helps fix the issue. Bazel CPU tests already use this value.

Here is an example timeout https://github.com/pytorch/pytorch/actions/runs/8009308009/job/21877698886#step:13:3316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120443
Approved by: https://github.com/clee2000
2024-02-23 18:32:35 +00:00
6b35415a54 Create a sentinel file for each dynamo test skips (Part 2) (#120501)
[no ci]

tested on https://github.com/pytorch/pytorch/pull/120451
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120501
Approved by: https://github.com/clee2000
ghstack dependencies: #120500
2024-02-23 18:25:30 +00:00
cffdd642a9 Create a sentinel file for each dynamo test skips (Part 1) (#120500)
[no ci]

tested on https://github.com/pytorch/pytorch/pull/120451
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120500
Approved by: https://github.com/clee2000
2024-02-23 18:25:30 +00:00
2120f65174 [AT-VK][EZ] Move ops to dedicated folder (#120364)
These ops are at the level of the OperatorRegistry from the previous change. All ExecuTorch ops will go here.
```
ATen/native/vulkan/graph/ops
```
They are not to be confused with the general ATen ops from `native_functions.yaml` that will continue to exist. All PyTorch ops are here.
```
ATen/native/vulkan/ops
```

To help think around this split, note that we can actually implement the latter ATen ops with the former OperatorRegistry ops, since it's currently a subset.

Differential Revision: [D54030933](https://our.internmc.facebook.com/intern/diff/D54030933/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120364
Approved by: https://github.com/SS-JIA
ghstack dependencies: #120362, #120363
2024-02-23 18:11:09 +00:00
6d920dd3c6 [ET-VK][Op Redesign][2/n] Introduce OperatorRegistry (#120363)
TSIA

Differential Revision: [D53982439](https://our.internmc.facebook.com/intern/diff/D53982439/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120363
Approved by: https://github.com/SS-JIA
ghstack dependencies: #120362
2024-02-23 18:07:59 +00:00
3e2ac1f094 [AT-VK][EZ] Define OpNode constructor (#120362)
Define an `OpNode` constructor instead of using `emplace_back()`. This will be useful throughout the rest of the stack.

Differential Revision: [D53982443](https://our.internmc.facebook.com/intern/diff/D53982443/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120362
Approved by: https://github.com/SS-JIA
2024-02-23 18:05:17 +00:00
232f09e0ea Add copy of scripts for setting up s390x workers (#120417)
This PR contains scripts used to produce self-hosted s390x worker.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120417
Approved by: https://github.com/malfet
2024-02-23 17:01:44 +00:00
3b944113c8 Add torch.ops.aten.print (#120295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120295
Approved by: https://github.com/zou3519
2024-02-23 17:01:22 +00:00
cyy
97918e8c37 [Clang-tidy header][18/N] Enable clang-tidy on headers in torch/csrc/cuda (#118504)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118504
Approved by: https://github.com/albanD
2024-02-23 16:47:33 +00:00
2892d2f31b Revert "[inductor] Optimize welford reduction (#120330)"
This reverts commit 4c6ba16f825ca7b99133efca95da0b7112add66b.

Reverted https://github.com/pytorch/pytorch/pull/120330 on behalf of https://github.com/jeffdaily due to broke ROCm CI while ROCm was in unstable status ([comment](https://github.com/pytorch/pytorch/pull/120330#issuecomment-1961623739))
2024-02-23 16:24:52 +00:00
2c85c9e77e [Memory Snapshot] Add Total memory used after allocation in Trace View (#120339)
Summary: Being able to see max allocated helps improve user experience with memory snapshots.

Test Plan:
Before:
![image](https://github.com/pytorch/pytorch/assets/17602366/534001fa-2fbe-4fc5-bd48-cd82f3277941)

After:
![image](https://github.com/pytorch/pytorch/assets/17602366/f8b9a7bc-3a34-4e38-82cb-f766e54b3fd2)

Reviewed By: zdevito

Differential Revision: D53953648

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120339
Approved by: https://github.com/zdevito
2024-02-23 16:17:14 +00:00
d9db9e62e3 Describe special case in avgpool (#120335)
Fixes #116420

AvgPool1d, AvgPool2d and AvgPool3d now include in their descriptions the special case where `ceil_mode` is True and the last window starts outside the tensor.
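For example, the effect of `ceil_mode` on the output length (illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 10)
# ceil_mode=True uses ceil instead of floor for the output length:
# (10 - 3) / 2 + 1 = 4.5 -> 5 rather than 4. Windows that would start
# entirely inside the right padding are skipped.
print(nn.AvgPool1d(kernel_size=3, stride=2, ceil_mode=True)(x).shape)   # torch.Size([1, 1, 5])
print(nn.AvgPool1d(kernel_size=3, stride=2, ceil_mode=False)(x).shape)  # torch.Size([1, 1, 4])
```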
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120335
Approved by: https://github.com/mikaylagawarecki
2024-02-23 15:29:54 +00:00
cef9f70f4b Move torchbench model configuration into a YAML file. (#120299)
This PR moves other aspects of torchbench's model configuration (e.g. batch size,
tolerance requirements, etc.) into a new YAML file: `torchbench.yaml`. It also merges the
recently added `torchbench_skip_models.yaml` file inside the `skip` key.

This is an effort so that external consumers are able to easily replicate the performance
results and coverage results from the PyTorch HUD.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120299
Approved by: https://github.com/jansel
2024-02-23 14:00:14 +00:00
54bac042e7 Fix error in examples of torch.linalg.lu_factor (#120484)
Found an error in the doc of `torch.linalg.lu_factor` related to `torch.linalg.lu_solve`. Also fix a sphinx issue by the way.
```Python traceback
TypeError: linalg_lu_solve(): argument 'LU' (position 1) must be Tensor, not torch.return_types.linalg_lu_factor
```
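The corrected usage unpacks the named tuple before calling `lu_solve` (illustrative example):

```python
import torch

A = torch.randn(3, 3)
B = torch.randn(3, 2)
# lu_factor returns a named tuple; lu_solve expects the LU tensor and
# the pivots as separate arguments.
LU, pivots = torch.linalg.lu_factor(A)
X = torch.linalg.lu_solve(LU, pivots, B)
print(torch.allclose(A @ X, B, atol=1e-5))
```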
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120484
Approved by: https://github.com/lezcano
2024-02-23 13:19:04 +00:00
b96ea097ee [aotinductor] rename CppWrapperCodeGen and CudaWrapperCodeGen (#120391)
make WrapperCodeGen subclass names consistent with the
file names:

CppWrapperCodeGen -> CppWrapperCpu
CudaWrapperCodeGen -> CppWrapperCuda

Differential Revision: [D54074938](https://our.internmc.facebook.com/intern/diff/D54074938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120391
Approved by: https://github.com/aakhundov
2024-02-23 10:41:50 +00:00
72fec96e59 fix no shard state dict loading (#120367)
Summary: fix no shard state dict loading

Test Plan: CI tests

Differential Revision: D51058607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120367
Approved by: https://github.com/fegin
2024-02-23 07:25:43 +00:00
9e9eaf0032 [CUDA] Workaround register spilling issue in mem-efficient SDP kernels on sm60 (#120445)
We're seeing that a newer version of CUDA introduces register spilling behavior for a few kernels on Pascal---this PR works around them for this specific version.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120445
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-02-23 06:06:37 +00:00
edf1c4e552 [Dynamo] Handle guard_size_oblivious in user code (#120379)
Fixes https://github.com/pytorch/pytorch/issues/120083

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120379
Approved by: https://github.com/yanboliang
2024-02-23 05:38:57 +00:00
a5548c6886 Create a sentinel file for each dynamo test failure (#120355)
Created via
```
import os
current_dir = os.path.dirname(os.path.abspath(__file__))
directory = os.path.join(current_dir, 'dynamo_expected_failures')
for name in dynamo_expected_failures:
    path = os.path.join(directory, name)
    with open(path, 'w') as fp:
        pass
```

Differential Revision: [D54036062](https://our.internmc.facebook.com/intern/diff/D54036062)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120355
Approved by: https://github.com/aorenste, https://github.com/yanboliang
2024-02-23 05:22:11 +00:00
2240018c03 Construct c10::Half from float16_t on ARMv8 (#120425)
This is done by hiding float32 constructors and exposing float16 ones. This allows the compiler to do implicit conversions as needed, and in safe cases optimize out unneeded upcasts to fp32; see the example [below](https://godbolt.org/z/5TKnY4cos)
```cpp
#include <arm_neon.h>

#ifndef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
#error Ieeee
#endif

float16_t sum1(float16_t x, float16_t y) {
    return x + y;
}

float16_t sum2(float16_t x, float16_t y) {
    return static_cast<float>(x) + static_cast<float>(y);
}
```
both sum variants are compiled to a scalar fp16 add, if built for a platform that supports fp16 arithmetic
```
sum1(half, half):                            // @sum1(half, half)
        fadd    h0, h0, h1
        ret
sum2(half, half):                            // @sum2(half, half)
        fadd    h0, h0, h1
        ret
```

Fixes a build error after #119483 in some aarch64 configurations that are defined as supporting FP16 but don't define _Float16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120425
Approved by: https://github.com/mikekgfb, https://github.com/atalman, https://github.com/snadampal
2024-02-23 04:22:45 +00:00
eqy
3f6be7696b [cuDNN][cuDNN RNNv8 API] Fix math type behavior in cuDNN RNN (#120277)
Adds back `CUDNN_TENSOR_OP_MATH` which was erroneously dropped by #115719

CC @malfet @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120277
Approved by: https://github.com/drisspg
2024-02-23 04:11:14 +00:00
36c1cc962a Update cutlass from 3.3.0 to 3.4.1 (#120434)
### COPY OF https://github.com/pytorch/pytorch/pull/120010

### Update
I have rolled the two blocking changes into this PR. I also imported this to fbcode to verify that nothing is breaking:
D53870253

This copy was generated by merging in all the internal only changes into one merged atomic commit and re-exporting to github

### Current Status
- [PR](https://github.com/pytorch/pytorch/pull/118935) aims to update the flash attention kernels to a more recent version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120434
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-02-23 03:57:26 +00:00
cyy
f609f2050f [structural binding][6/N] Replace std::tie with structural binding (#120353)
This PR follows https://github.com/pytorch/pytorch/pull/119774, it is a continued work to clean up std::tie.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120353
Approved by: https://github.com/albanD
2024-02-23 03:38:40 +00:00
3426c6f559 update the tensor.scatter_ doc (#120169)
Fixes #119543

- fixed the doc so that `reduce` is documented as a kwarg (see below for details)
- added another documented interface `(int dim, Tensor index, Number value, *, str reduce)`, where
the full signature in the pyi file after the build is
```
def scatter_(self, dim: _int, index: Tensor, value: Union[Number, _complex], *, reduce: str) -> Tensor:
```
This can be further verified in
02fb043522/aten/src/ATen/native/native_functions.yaml (L8014)

Therefore, the value can be int, bool, float, or complex type.

Besides the issue mentioned in #119543, `reduce` should be a kwarg, as shown below
```
 * (int dim, Tensor index, Tensor src)
 * (int dim, Tensor index, Tensor src, *, str reduce)
 * (int dim, Tensor index, Number value)
 * (int dim, Tensor index, Number value, *, str reduce)
 ```

The test case for the scalar value is already implemented in

70bc3b3be4/test/test_scatter_gather_ops.py (L86)

so no additional test case required.
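For illustration, the scalar-value overload with `reduce` passed as a kwarg might be used like this (a sketch; newer code may prefer `scatter_reduce_`):

```python
import torch

x = torch.zeros(3, 5)
index = torch.tensor([[0, 1, 2]])
# Matches the (int dim, Tensor index, Number value, *, str reduce) overload above.
x.scatter_(0, index, 2.0, reduce="add")
print(x)
```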

@mikaylagawarecki  @janeyx99

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120169
Approved by: https://github.com/mikaylagawarecki
2024-02-23 02:51:55 +00:00
bb6f50929b Fix lint after https://github.com/pytorch/pytorch/pull/105590 (#120461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120461
Approved by: https://github.com/Skylion007
2024-02-23 02:45:23 +00:00
2b0168aeb0 [c10d] update the work progress of PG periodically (#120438)
Summary:
Previously, I added lastEnqueuedSeq_ and lastCompletedSeq_ to store the states of PG progress,
but we logged them only when a timeout was detected.

We found this is not enough, since the 'straggler' itself might not detect
the timeout, and hence there is no log from the 'straggler'.

In this PR, we log these states periodically so that it is
much easier for us to identify the straggler by checking which rank
has the smallest lastEnqueuedSeq_.
Test Plan:
Log adding, build success

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120438
Approved by: https://github.com/wconstab, https://github.com/XilunWu, https://github.com/kwen2501
2024-02-23 01:40:43 +00:00
8f4ffd3d8a [HigherOrderOp] makes control flow operators respect global decomp table (#120412)
A follow up of @zou3519 's comment on https://github.com/pytorch/pytorch/pull/120366. We create a helper method for this purpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120412
Approved by: https://github.com/zou3519
2024-02-23 00:10:20 +00:00
156954d6a2 [Inductor] Add support for NEON ISA in the Inductor C++ backend (#105590)
Fixes #104729

As suggested in the [blog](https://dev-discuss.pytorch.org/t/torchinductor-update-5-cpu-backend-backend-performance-update-and-deep-dive-on-key-optimizations/1117#:~:text=It%20can%20be,sub%2Dclasses.), I subclassed the `VecISA` class and implemented a NEON version of the `vec_reduce_all()` function, to go along with the existing AVX2 and AVX512 versions. Any operation that calls `vec_reduce_all()` will also take the NEON path and benefit from its vectorization.

`vec_reduce_all()` is invoked by Softmax and other operations such as norms. Using the fast path results in 30% time savings for Softmax compared to the previously taken slow path.

  | Slow path | Fast path (NEON intrinsics)
-- | -- | --
Softmax (100 passes, 1024 dimension) | 623.706ms | 452.011ms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105590
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-02-22 23:55:35 +00:00
4c6ba16f82 [inductor] Optimize welford reduction (#120330)
This does two things (sketched below):
1) Short circuit `welford_reduce` on the first iteration to ignore the accumulator (big win for small `rnumel`)
2) Replace division with multiplication by reciprocal

Currently this is not enough to match two pass reduction with bfloat16 but it is still a significant improvement.
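A minimal Python sketch of the two tweaks (illustrative only; Inductor emits Triton code, not this):

```python
def welford_reduce(values):
    mean, m2, count = 0.0, 0.0, 0
    for i, x in enumerate(values):
        if i == 0:
            # Short-circuit: on the first element the accumulator is bypassed.
            mean, m2, count = x, 0.0, 1
            continue
        count += 1
        rcp = 1.0 / count          # compute the reciprocal once...
        delta = x - mean
        mean += delta * rcp        # ...and multiply instead of dividing
        m2 += delta * (x - mean)
    return mean, m2, count
```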

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120330
Approved by: https://github.com/lezcano
2024-02-22 23:54:24 +00:00
722afe6171 Revert "[dynamo] Use EQUALS_MATCH guard for mod.training (#120147)"
This reverts commit b642a18e8056287b0e5768f631dd03e0326a8b11.

Reverted https://github.com/pytorch/pytorch/pull/120147 on behalf of https://github.com/williamwen42 due to memory leak, see https://github.com/pytorch/pytorch/issues/112090 ([comment](https://github.com/pytorch/pytorch/pull/120147#issuecomment-1960522018))
2024-02-22 23:46:55 +00:00
3588e7f265 Ignore .numpy() under FakeTensorMode() (#120261)
Fixes #120259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120261
Approved by: https://github.com/jansel
2024-02-22 22:49:20 +00:00
f9eb66e16d [BE][EZ] Flatten preprocessor hierarchy (#120422)
Instead of
```cpp
#if defined(foo)
#else
#if defined(bar)
#else
#endif
#endif
```
use
```cpp
#if defined(foo)
#elif defined(bar)
#else
#endif
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120422
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/Skylion007
2024-02-22 22:38:08 +00:00
1c7ba330b2 [BE][optim] Simplify _init_group. (#120055)
This version is more concise and avoids second lookup in case `momentum_buffer` is in the `state`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120055
Approved by: https://github.com/janeyx99
2024-02-22 22:15:01 +00:00
5603d95375 [DeviceMesh] Ensure mesh tensor is a cpu tensor (#120046)
More discussion in the last comment in https://github.com/pytorch/pytorch/pull/118614

In general, users won't pass a CUDA tensor to DeviceMesh, as the mesh tensor is just a way to construct a mesh and doesn't require CUDA compute. Following @awgu's suggestion, we enforce that the tensor is a CPU tensor if it is not already, so that we can prevent a device sync.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120046
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2024-02-22 22:03:13 +00:00
c11bd724fe [ROCm] replace ROCmLoops.cuh with hipified CUDALoops.cuh (#120101)
The intent of this change was to minimize code differences between CUDA and ROCm while maintaining or improving performance.

Verified new performance using pytorch/benchmarks/operator_benchmark.

```
python -u -m pt.unary_test --tag-filter all --device cuda
python -u -m pt.binary_test --tag-filter all --device cuda
```

On MI200 this improved performance on average 3%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120101
Approved by: https://github.com/albanD
2024-02-22 21:57:36 +00:00
77692736d1 Use privateuseone during external module register test (#120399)
Fixes #120397

Use privateuseone instead of xpu in test_external_module_register.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120399
Approved by: https://github.com/albanD, https://github.com/malfet
2024-02-22 21:32:59 +00:00
edd03f975f highlight readme code block (#120228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120228
Approved by: https://github.com/mikaylagawarecki
2024-02-22 21:23:08 +00:00
1eae8950b9 [Dynamic] Fix dynamic shape size inspection bug (#120341)
Fixes #120198

Differential Revision: [D54035984](https://our.internmc.facebook.com/intern/diff/D54035984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120341
Approved by: https://github.com/ezyang
2024-02-22 21:08:28 +00:00
11e4a9266d Temporarily support ranks + tag as pg identifier in native funcol (#120226)
As communicated in https://github.com/pytorch/pytorch/issues/93173#issuecomment-1907095208, although we are dropping `(ranks, tag)` as group identifier in funcols, there will be a grace period for migration. This PR adds temporary `(ranks, tag)` support in native funcols. It also helps us decouple the py funcol -> native funcol transition from the API change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120226
Approved by: https://github.com/wanchaol, https://github.com/wconstab
ghstack dependencies: #120042, #120043, #120070
2024-02-22 20:24:16 +00:00
5a3e19578f Make tests using CommDebugMode work for both legacy and native funcol (#120070)
We have many tests that use CommDebugMode to verify the occurrence of collectives. These tests do so by querying comm_counts with legacy funcol ops as key. For the purpose of native funcol migration, we need these tests to work for both legacy and native funcol. To avoid the need to modify all tests to accommodate the two implementations, we make CommDebugMode translate native funcol ops into legacy funcol ops until the migration finishes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120070
Approved by: https://github.com/wconstab, https://github.com/wanchaol
ghstack dependencies: #120042, #120043
2024-02-22 20:24:15 +00:00
a4c5f48b11 Prepare test_dtensor.py for native funcol migration (#120043)
This file contains representative tests that we would like to run with both funcol impls during the migration period. Marking them as `@run_with_both_funcol_impls`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120043
Approved by: https://github.com/wanchaol
ghstack dependencies: #120042
2024-02-22 20:24:15 +00:00
1c9fc720ae Change the .clone() in native funcol's all_reduce to use at::MemoryFormat::Contiguous (#120042)
Summary:
While I think it probably makes more sense to only require `all_reduce` input to be non-overlapping and dense, today `ProcessGroupNCCL` requires it to be contiguous. This is also what the `all_reduce` in non-native funcol does.

Also marking a test affected by this with `@run_with_both_funcol_impls`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120042
Approved by: https://github.com/wanchaol
2024-02-22 20:24:15 +00:00
7b8f6736d1 [cond] make sure subgraphs in cond are decomposed according to current decomp table (#120366)
Fixes https://github.com/pytorch/pytorch/issues/120160. The issue is that previously cond didn't pass the global decomposition table in ProxyMode. This PR adds the current_decomposition_table to the recursive make_fx call.

Test Plan:
see added tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120366
Approved by: https://github.com/aakhundov, https://github.com/jansel
2024-02-22 20:06:46 +00:00
680cfec295 Fix the default value of side in torch.searchsorted (#120066)
Fixes #119999. Currently the [doc](https://pytorch.org/docs/stable/generated/torch.searchsorted.html#torch.searchsorted) shows the default value of `side = "left"`
<img width="600" alt="Screenshot 2024-02-16 at 10 36 08 AM" src="https://github.com/pytorch/pytorch/assets/7495155/e7d159aa-4985-4f50-9d81-6e71c3116c0d">
while the [implementation ](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml#L11247) gives the default value of `side = c10::nullopt`.

- fix the [torch doc](https://github.com/pytorch/pytorch/blob/main/torch/_torch_docs.py#L13782) such that the default value of side is None.

- fix the [comment in cpp](4dc75f9084/aten/src/ATen/native/Bucketization.cpp (L19)) such that the default value of side is None.
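For reference, the left/right behavior being documented (illustrative snippet):

```python
import torch

sorted_seq = torch.tensor([1, 3, 5, 7])
# Leaving side unset (None) behaves like "left": the first suitable index is returned.
print(torch.searchsorted(sorted_seq, torch.tensor([3])))                # tensor([1])
print(torch.searchsorted(sorted_seq, torch.tensor([3]), side="right"))  # tensor([2])
```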

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120066
Approved by: https://github.com/malfet
2024-02-22 19:35:17 +00:00
c37d07a1bc [FSDP2] Removed super().__setattr__ call (#120340)
`nn.Module.__setattr__` does not actually call `super().__setattr__()`. If we make this call in our fast path, then we will inadvertently set the parameter as an actual attribute on the module, not just as an entry in the `_parameters` dict. This can lead to a bug where after replacing the parameters on the module (e.g. via `to_empty()` from meta device), we now have both an actual attribute (old) and a new entry in `_parameters` (new). Trying to access the parameter would give the old one since Python only resolves `__getattr__` if normal attribute lookup fails.
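A hypothetical repro of the attribute-shadowing pitfall described above (illustrative only, not FSDP2 code):

```python
import torch
import torch.nn as nn

m = nn.Linear(2, 2)
old = m.weight
object.__setattr__(m, "weight", old)                        # stale plain attribute
m._parameters["weight"] = nn.Parameter(torch.zeros(2, 2))   # new entry in _parameters
# Normal attribute lookup wins, so the stale attribute shadows the new parameter;
# nn.Module.__getattr__ (which reads _parameters) is never reached.
print(m.weight is old)  # True
```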

The bug was exercised in the following PR. I wanted to land this bug fix separately.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120340
Approved by: https://github.com/yifuwang
ghstack dependencies: #120231
2024-02-22 19:33:57 +00:00
2ba798df60 [inductor] decompose memory bound mm (#120047)
Summary:
Decompose memory bound mm/bmm.
Linear decomposition result:  D53502768
BMM decomposition result: D53148650
We should only decompose when
1) bmm: b is large, and m, n, k are relatively small;
2) mm/addmm: m is large, and n and k are relatively small. E.g. the mm for the input gradient in linear backward should not be decomposed, since m is small and n is large.
Need to conduct more experiments to see if we can find a better strategy for decomposition. I have tried a linear regression model (see the bento results), which does not fit well. For the short term, we use heuristics to determine when to decompose.
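A purely illustrative sketch of such a heuristic (the thresholds here are invented and are not the values used in the PR):

```python
def should_decompose(m, n, k, is_bmm=False, b=1):
    if is_bmm:
        # bmm: the batch dimension is large while m, n, k are all small.
        return b >= 256 and max(m, n, k) <= 32
    # mm/addmm: m is large while n and k are small; e.g. the input-gradient mm
    # in linear backward (small m, large n) should not be decomposed.
    return m >= 1024 and n <= 32 and k <= 32
```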

Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm
```

COFFEE APS mc0:
baseline: aps-lsf-0124-bf16-267ccb7a0d
decompose: aps-lsf-0124-bf16-4e3824db40

FIRST AFOC pyper mc1

Differential Revision: D53602514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120047
Approved by: https://github.com/mengluy0125
2024-02-22 19:29:51 +00:00
ce807c17c0 modify comment of SparseTensor coalesce (#120221)
Fixes #ISSUE_NUMBER
Found that the comment for coalesce was incorrect; modify it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120221
Approved by: https://github.com/mikaylagawarecki
2024-02-22 19:24:53 +00:00
bb72bfe2ac Add code example for torch.stack() (#120304)
Fixes #120303
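
For reference, the kind of example added to the docs might look like this (illustrative, not necessarily the exact snippet from the PR):

```python
import torch

a = torch.tensor([1, 2])
b = torch.tensor([3, 4])
print(torch.stack([a, b]))         # tensor([[1, 2], [3, 4]])
print(torch.stack([a, b], dim=1))  # tensor([[1, 3], [2, 4]])
```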

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120304
Approved by: https://github.com/albanD
2024-02-22 18:30:30 +00:00
ca64f7cbb8 Fix rendering in the doc of PackedSequence (#120385)
Correct a typo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120385
Approved by: https://github.com/albanD
2024-02-22 18:29:12 +00:00
a77226aa49 [inductor] improve kernel metadata logging (#120274)
Log a few more fields
- num_atomic_add: the perf of kernels using atomic_add is usually data dependent. Our benchmarking code generates all indices as 0, which will result in worse perf than in reality.
- kernel_args_num_gb: estimate the amount of reads/writes for kernel args. In-place args will be double counted. If we have a good estimate, this should be a lower bound on the memory access that the GPU performs. Sometimes the GPU will do more memory accesses, since a single buffer may be accessed multiple times (e.g. for softmax when the input tensor is quite large; the cache only helps a bit here). With this logged, and if we augment the metadata with the amount of memory the GPU actually accessed, it would be nice to dig into kernels where the GPU accesses more memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120274
Approved by: https://github.com/jansel
ghstack dependencies: #120266
2024-02-22 18:28:05 +00:00
b88621040a [profiler] Add kineto init delay when used in daemon mode (#120276)
Fixes #112389

## About

PyTorch (Kineto) profiler registers with the profiling daemon Dynolog to enable on-demand profiling. The user should only need to set the env variable `KINETO_USE_DAEMON`. To enable this we need to initialize kineto library early rather than lazily on a PyTorch profiler call. This initialization happens in a static initializer.
- Kineto init function basically registers a callback using the CUDA CUPTI library https://github.com/pytorch/kineto/blob/main/libkineto/src/init.cpp#L130-L148
- However, the above needs the dynamic linking to libcupti.so to have taken place.
- I understand now that static initializations of compilation units will be called before the dynamic linking, leading to the segfault in #112389

![image](https://github.com/pytorch/pytorch/assets/6922212/29c9e79b-8080-4198-aaae-8a5696dccaec)

## Workaround
We add a delay to the initialization that can be configured using the env variable 'KINETO_DAEMON_INIT_DELAY_S'. This may not be the best solution, but it could help resolve the issue.

## Testing
Tested this out with [linear_model_example.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py)
First export the daemon env variable

### Without any delay
```
>$ python3 linear_model_example.py

INFO:2024-02-21 19:34:50 2366287:2366287 init.cpp:131] Registering daemon config loader, cpuOnly =  1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclientb8f91363-d8d6-47a7-9103-197661e28397 status = initialized
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
99 1385.468505859375
```

### With 5 seconds delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=5 python3 linear_model_example.py

cpu
99 284.82305908203125
10099 8.817167282104492
INFO:2024-02-21 19:34:26 2359155:2359214 init.cpp:131] Registering daemon config loader, cpuOnly =  1
ERROR: External init callback must run in same thread as registerClient (1782580992 != -1922169024)
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient49270a3f-e913-4ea6-b9e0-cc90a853a869 status = initialized
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
20099 8.817167282104492
```

### With an invalid delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=abc python3 linear_model_example.py

INFO:2024-02-21 19:35:02 2369647:2369647 init.cpp:131] Registering daemon config loader, cpuOnly =  1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0e12a349-af7b-4322-901d-1ff22f91fd4c status = initialized
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
```

### Unit test updated as well.

## Impact
This should not impact any general user. The initialization only occurs if `KINETO_USE_DAEMON` is set in the environment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120276
Approved by: https://github.com/anupambhatnagar, https://github.com/aaronenyeshi
2024-02-22 18:17:33 +00:00
be0ee93467 [pytree] support X | Y union type in tree_map_only (#120389)
Follow-up PR for #119974 with some small tweaks.

1. Support `X | Y` union type for Python 3.10+
2. Enable predicate function in `tree_map_only` in CXX pytree.
3. Remove unnecessary function definition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120389
Approved by: https://github.com/zou3519
2024-02-22 18:17:13 +00:00
65627cfd6a [dtensor] implement scaled dot product attention (flash-attention) (#120298)
As titled, this PR implements the SDPA flash attention op in DTensor.

Flash attention is added first, but efficient attention and other attention
ops should be similar.

fixes https://github.com/pytorch/pytorch/issues/120333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120298
Approved by: https://github.com/XilunWu
ghstack dependencies: #120297
2024-02-22 17:53:47 +00:00
f2452e98a6 Revert "Native Half on ARM (#119483)"
This reverts commit 8f3fd79b23d483e846537b62f49111696d117870.

Reverted https://github.com/pytorch/pytorch/pull/119483 on behalf of https://github.com/malfet due to Broke nightly arm builds (and will be breaking runtime), as F16 arithmetic is ARMv8.2 only, see https://github.com/pytorch/pytorch/actions/runs/8000968963/job/21851281141 ([comment](https://github.com/pytorch/pytorch/pull/119483#issuecomment-1959944948))
2024-02-22 17:41:55 +00:00
c7328602ed [ROCm] enable tests test_sampled_addmm_autograd_cuda_*, test_sample… (#117501)
These tests PASS on ROCM 5.6+ now:

- test_sampled_addmm_autograd_cuda_complex128
- test_sampled_addmm_autograd_cuda_complex64
- test_sampled_addmm_autograd_cuda_float32
- test_sampled_addmm_autograd_cuda_float64
- test_sampled_addmm_cuda_complex128
- test_sampled_addmm_cuda_complex64
- test_sampled_addmm_cuda_float32
- test_sampled_addmm_cuda_float64
- test_autograd_dense_output_addmm_cuda_float64
- test_autograd_dense_output_addmv_cuda_float64
- test_autograd_dense_output_mv_cuda_float64

@pruthvistony @jithunnair-amd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117501
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
2024-02-22 17:24:25 +00:00
1c1028ac49 [DCP] Adds utility for converting torch save to dcp (#119815)
as title

Differential Revision: [D53718040](https://our.internmc.facebook.com/intern/diff/D53718040/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119815
Approved by: https://github.com/fegin
ghstack dependencies: #119813, #119814
2024-02-22 17:22:11 +00:00
aae7ccd2d5 [FSDP2] disable compile in broken unit tests (#120358)
The following unit tests are broken in the original commit; revert to keep trunk healthy. Will add them back after figuring out the root cause.
```
python test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_param_registration
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120358
Approved by: https://github.com/awgu, https://github.com/Skylion007
2024-02-22 17:17:23 +00:00
1ab441a7dd [DCP] Adds utility for converting dcp to torch save format (#119814)
as title

Differential Revision: [D53718042](https://our.internmc.facebook.com/intern/diff/D53718042/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119814
Approved by: https://github.com/fegin
ghstack dependencies: #119813
2024-02-22 16:55:58 +00:00
e0a7b024b0 [ROCm] Skip test_parity* unit tests in test_foreach only if ROCm version < 6.0 (#117301)
Skip test_parity* unit tests in test_foreach.py on ROCm only if ROCm version < 6.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117301
Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang
2024-02-22 16:21:09 +00:00
de60050801 [inductor] Colorization improvements for bandwidth profiler (#120343)
A couple things:
* Don't colorize output to the log file
* Don't repeatedly warn if colorama isn't installed

Differential Revision: [D54027075](https://our.internmc.facebook.com/intern/diff/D54027075/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120343
Approved by: https://github.com/Chillee, https://github.com/shunting314
2024-02-22 15:25:46 +00:00
03f7235caa [Dynamo] Fix dynamo trace rules (#120371)
```test_trace_rules.py``` is still failing due to this.

Fixes https://github.com/pytorch/pytorch/issues/114831
(Having this here will run the disabled test on the PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120371
Approved by: https://github.com/drisspg, https://github.com/huydhn
2024-02-22 14:32:00 +00:00
0e4bd25a33 [inductor] When generating debug logs don't fail if nvcc not found (#120346)
If nvcc isn't found, subprocess throws a CalledProcessError

Differential Revision: [D54028438](https://our.internmc.facebook.com/intern/diff/D54028438/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120346
Approved by: https://github.com/Skylion007, https://github.com/shunting314
2024-02-22 14:25:34 +00:00
c2b2e57032 Intel GPU Runtime Upstreaming for Guard (#118523)
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the 5th runtime component we would like to upstream is `Guard`. We will cover device guard and stream guard in this PR.

# Design
Device guard is used mainly for op dispatcher in PyTorch. Currently, PyTorch already has a device guard abstraction `c10::impl::DeviceGuardImplInterface`. In our design, we will introduce an `XPUGuardImpl` class inherits from `c10::impl::DeviceGuardImplInterface`. Register `XPUGuardImpl` to PyTorch after we implement the device switch management mechanism in `XPUGuardImpl`. Besides, we will introduce `XPUGuard`, `OptionalXPUGuard`, `XPUStreamGuard`, and `OptionalXPUStreamGuard`. They are all following the design of CUDA's counterpart. The corresponding C++ file should be placed in c10/xpu/ folder.

# Additional Context
It is unnecessary to add `Guard` code to PyTorch frontend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118523
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #120315
2024-02-22 14:07:21 +00:00
dcfe463600 fix xpu build failure (#120315)
# Motivation
Fix the build failure introduced by [[DeviceIndex][6/N] Use DeviceIndex in more places](https://github.com/pytorch/pytorch/pull/120133); the parameter `total` is undefined at line 100. See https://github.com/pytorch/pytorch/pull/120133/files#diff-00eb8a6f5dfbc341ee9ab9aff0e3dbece8ad73483d4f41a005b1f453cb78221cR91-L102
[PR120133](https://github.com/pytorch/pytorch/pull/120133) forgot to add the label `ciflow/xpu`, so the XPU CI flow was not triggered.

# Solution
Refer to [Why is std::cout not printing the correct value for my int8_t number?](https://stackoverflow.com/questions/7587782): statically cast the int8_t to int16_t, and the condition `device >= 0 && device < total` is enough.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120315
Approved by: https://github.com/Skylion007, https://github.com/cyyever, https://github.com/malfet, https://github.com/EikanWang, https://github.com/gujinghui
2024-02-22 13:43:56 +00:00
faad8ecb26 Use opmath for sinc on CPU (#120311)
This aligns the implementation with CUDA and `torch.compile`

Fixes https://github.com/pytorch/pytorch/issues/118176 https://github.com/pytorch/pytorch/issues/49133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120311
Approved by: https://github.com/jgong5, https://github.com/Chillee
2024-02-22 12:37:50 +00:00
5c5b71b6ee Capture non tensor arguments in record_function (#120017)
Summary: RECORD_FUNCTION only captures an argument when it is a Tensor. However, it is very common for users to pass arguments with primitive data types (int, float, index, bool). This diff adds support for non-tensor arguments in RECORD_FUNCTION.

Test Plan:
unit test
    buck test  mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2 test_execution_trace_alone test_execution_trace_with_kineto test_execution_trace_start_stop test_execution_trace_repeat_in_loop test_execution_trace_no_capture

Differential Revision: D53674768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120017
Approved by: https://github.com/soulitzer
2024-02-22 09:40:08 +00:00
7e6bce9684 [amd] fix unused variable device_flags (#120369)
Summary:
We get a build error due to D53986297 (https://github.com/pytorch/pytorch/pull/119996)

```
caffe2/c10/cuda/__fb_c10_hipify_gen__/out/c10/hip/HIPStream.cpp:40:23: error: unused variable 'device_flags' [-Werror,-Wunused-variable]
static c10::once_flag device_flags[C10_COMPILE_TIME_MAX_GPUS];
```

Reviewed By: jianyuh, xw285cornell

Differential Revision: D54027737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120369
Approved by: https://github.com/xw285cornell, https://github.com/jianyuh
2024-02-22 09:36:59 +00:00
5210a22b39 Add basic shampoo test (#120293)
Fixes [T175418669](https://www.internalfb.com/intern/tasks/?t=175418669)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120293
Approved by: https://github.com/bdhirsh
2024-02-22 08:39:55 +00:00
354a436d96 Remove device assert in Gradscaler (#119362)
Fixes #119358

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Co-authored-by: ydwu4 <ydwu2014@gmail.com>
Co-authored-by: PyTorch UpdateBot <pytorchupdatebot@users.noreply.github.com>
Co-authored-by: Bin Bao <binbao@meta.com>
Co-authored-by: Shuqiang Zhang <sqzhang@meta.com>
Co-authored-by: Adnan Akhundov <aakhundov@meta.com>
Co-authored-by: Ting Lu <tingl@nvidia.com>
Co-authored-by: Yang Chen <yangche@fb.com>
Co-authored-by: cyy <cyyever@outlook.com>
Co-authored-by: Animesh Jain <anijain@umich.edu>
Co-authored-by: Jason Ansel <jansel@meta.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: wz337 <wz337@cornell.edu>
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Co-authored-by: Anthony Alayo <anthony.alayo@applovin.com>
Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Co-authored-by: Yifu Wang <yifu@fb.com>
Co-authored-by: Yukio Siraichi <yukio.siraichi@gmail.com>
Co-authored-by: atalman <atalman@fb.com>
Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: haozhe.zhu <haozhe.zhu@intel.com>
Co-authored-by: lezcano <lezcano-93@hotmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119362
Approved by: https://github.com/ezyang
2024-02-22 08:02:18 +00:00
fff9d98e58 Revert "Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)"
This reverts commit e0268821dd2ea0e8a51b81c0ef3b18e77f68a33d.

Reverted https://github.com/pytorch/pytorch/pull/119639 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think the Window failures are legit as they are failing now in trunk, i.e. 450339ab2d ([comment](https://github.com/pytorch/pytorch/pull/119639#issuecomment-1958428416))
2024-02-22 00:12:54 +00:00
8fa6340701 Revert "Ignore .numpy() under FakeTensorMode() (#120261)"
This reverts commit 952b37145b7bb526ea5907ac574e324d274b02ee.

Reverted https://github.com/pytorch/pytorch/pull/120261 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems breaking trunk on Python 3.12 952b37145b ([comment](https://github.com/pytorch/pytorch/pull/120261#issuecomment-1958267417))
2024-02-21 23:09:27 +00:00
cyy
1aad5c98b4 [structural binding][5/N] Replace std::tie with structural binding (#120142)
This PR follows https://github.com/pytorch/pytorch/pull/119774, it is a continued work to clean up std::tie.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120142
Approved by: https://github.com/albanD
2024-02-21 22:32:55 +00:00
d514df63ea Reenable triton tests and clean extra clones after the pin update (#120324)
Test Plan: just tests

Differential Revision: D54008642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120324
Approved by: https://github.com/aakhundov, https://github.com/sijiac
2024-02-21 22:25:33 +00:00
952b37145b Ignore .numpy() under FakeTensorMode() (#120261)
Fixes #120259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120261
Approved by: https://github.com/jansel
2024-02-21 22:06:29 +00:00
450339ab2d Test for fatal signal in test_pynode_destruction_deadlock (#120279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120279
Approved by: https://github.com/albanD
2024-02-21 21:53:51 +00:00
306642b66d [export] fix test_passes on ci (#120322)
We put the test case generation in unittest.setUp to avoid running export on machines that run Python 3.12, where dynamo is not supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120322
Approved by: https://github.com/angelayi, https://github.com/huydhn, https://github.com/malfet
2024-02-21 21:23:40 +00:00
e0268821dd Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)
Fixes #115331.

This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary:

- `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`.
- Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1`.
- Updated the `ArgumentInfo` struct as it hardcodes the device index as 8 bit field [^1]. Might be a breaking change, not sure if users rely on this.
- Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS`

[^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639
Approved by: https://github.com/cyyever, https://github.com/albanD
2024-02-21 21:10:49 +00:00
27c5bbe5cb Add is_nested_int() (#119975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119975
Approved by: https://github.com/jbschlosser
ghstack dependencies: #119661, #119974
2024-02-21 21:10:02 +00:00
2e77629b9f [pytrees] Allow tree_map_only to support predicate function as filter (#119974)
In many places in the code we use `tree_map_only((SymInt, SymBool, SymFloat), foo)` but with nested ints, it is possible to have SymInts that are non-symbolic, so we may want to do something like `tree_map_only(is_symbolic, foo)` instead.

Alternative: wrap nested int SymNodes with something other than SymInt.
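
A short sketch of the predicate-style filter (assuming the `torch.utils._pytree` API after this change):

```python
from torch.utils._pytree import tree_map_only

tree = {"a": 1, "b": "keep", "c": [2, 3]}
# The first argument is a callable deciding which leaves get mapped.
out = tree_map_only(lambda leaf: isinstance(leaf, int), lambda leaf: leaf * 10, tree)
print(out)  # {'a': 10, 'b': 'keep', 'c': [20, 30]}
```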

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119974
Approved by: https://github.com/zou3519
ghstack dependencies: #119661
2024-02-21 21:10:02 +00:00
722e87899a [Memory Snapshot] Clean up elem text (#120245)
Summary:
These UI changes were added:
- Prefix address with Addr: and size with Size:
- Add comma between addr and size
- Remove duplicate (${elem.size} bytes) print out

Test Plan:
Before:
![image](https://github.com/pytorch/pytorch/assets/17602366/2d9867d6-9cdb-405b-aa92-f0daf44f2ba7)
After:
![image](https://github.com/pytorch/pytorch/assets/17602366/c7bd97d3-fdc6-4832-ae35-97a02ea73907)

Differential Revision: D53953187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120245
Approved by: https://github.com/zdevito
2024-02-21 20:59:04 +00:00
a5893926f2 [dtensor] simplify outputs wrapping handling (#120297)
This PR simplifies the output wrapping handling in op dispatch, to make
it simpler and easier to understand.

It also enables a new case: if the output DTensorSpec for the result is
None and the result is a scalar tensor, we just return the scalar
tensor instead of wrapping it with a DTensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120297
Approved by: https://github.com/wz337
2024-02-21 20:28:20 +00:00
e06978be4b [CI] Add initial inductor cpu smoketest for performance (#116456)
Co-authored-by: chuanqiw <chuanqi.wang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116456
Approved by: https://github.com/jgong5, https://github.com/atalman
2024-02-21 20:04:50 +00:00
9630bcbd49 [execution trace/chakra] remove backend_id from pg_info (#120038)
Summary:
PR 104373 (https://github.com/pytorch/pytorch/pull/104373) logs the backend, which involves an unsafe dict lookup that might fail.
We decided to deprecate backend_id and use the pg id/name directly.

Differential Revision: D53676181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120038
Approved by: https://github.com/aaronenyeshi
2024-02-21 19:37:18 +00:00
e7eab2f07e Fix to keep stride in return_and_correct_aliasing() (#117860)
Fixes #117794

Fix tripped the assert here: 86dedebeaf/torch/utils/_python_dispatch.py (L216)

From investigation: I found that functionalization of an in-place op (`mul_` in this test case) results in the strides of `TwoTensor`'s `a` / `b` components being mutated to be contiguous. This is not reflected in the outer tensor, causing the assert to be tripped.

After discussion with Brian, I address this in this PR by disallowing input mutations on non-contiguous tensor subclass inputs for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117860
Approved by: https://github.com/bdhirsh
2024-02-21 19:15:27 +00:00
fa77829126 Remove bc linter label triggers after test-infra #4956 (#120148)
After https://github.com/pytorch/test-infra/pull/4956, mergebot will not block merge for a bc linter failure that has been suppressed.  The failure will be ignored instead.

This should help mitigate https://github.com/pytorch/test-infra/issues/4938 because the workflow will not be triggered multiple times when labels are attached.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120148
Approved by: https://github.com/clee2000
2024-02-21 18:36:38 +00:00
e87deb8004 fix: conversion of max memory allocated and reserved to GB (#120172)
Fixes #120171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120172
Approved by: https://github.com/soulitzer, https://github.com/aaronenyeshi
2024-02-21 18:04:47 +00:00
d336be2942 Update torch.mean() description about dtype restriction. (#120208)
Fixes #120173

Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120208
Approved by: https://github.com/soulitzer
2024-02-21 18:04:11 +00:00
9c64068ef8 [dynamo][guards-cpp-refactor] TypeGuardAccessor (#120089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120089
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068
2024-02-21 17:56:48 +00:00
ec6783990a [dynamo][guards-cpp-refactor] GlobalsGuardAccessor (#120068)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120068
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067
2024-02-21 17:56:48 +00:00
66c52d678f [dynamo][guards-cpp-refactor] GetItemGuardAccessor (#120067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120067
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065
2024-02-21 17:56:36 +00:00
7a0c2a9d0a [dynamo][guards-cpp-refactor] NO_TENSOR_ALIASING guard (#120065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120065
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064
2024-02-21 17:56:18 +00:00
8d5ae8c0b3 [dynamo][guards-cpp-refactor] TENSOR_ALIASING guard (#120064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120064
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062
2024-02-21 17:56:05 +00:00
034955b2fc [dynamo][guards-cpp-refactor] DATA_PTR_MATCH guard (#120062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120062
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061
2024-02-21 17:55:46 +00:00
cc6cf89c30 [dynamo][guards-cpp-refactor] GLOBAL_STATE guard (#120061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120061
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060
2024-02-21 17:55:32 +00:00
5066bec743 [dynamo][guards-cpp-refactor] DEFAULT_DEVICE guard (#120060)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120060
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833
2024-02-21 17:55:17 +00:00
8f3fd79b23 Native Half on ARM (#119483)
Summary: Native Half on ARM

Test Plan: sandcastle

Differential Revision: D53585776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119483
Approved by: https://github.com/ezyang, https://github.com/jgong5
2024-02-21 17:46:16 +00:00
29b2131c62 [Inductor] Fix bug around out of order constexprs in inductor (#120287)
Inductor signature/config generation code assumes that all constexprs come as the last arguments of the function. This is not always true for user-defined kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120287
Approved by: https://github.com/jansel
2024-02-21 17:39:41 +00:00
cfddfce0d3 Alternate sharding (#119078)
Changes sharding to attempt to put all serial tests on as few shards as possible.  Parallel tests are then distributed across all shards, with most of them likely ending up on the non-serial shards.

Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests
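For concreteness, a small sketch of the arithmetic in the example above (the numbers are the assumed ones from the example, not taken from the actual sharding code):

```python
import math

serial_minutes = 8.0
parallel_minutes = 20.0
procs_per_machine = 2
machines = 6

effective_total = serial_minutes + parallel_minutes / procs_per_machine  # 8 + 10 = 18
per_machine = effective_total / machines                                 # 3 minutes per shard
serial_shards = math.ceil(serial_minutes / per_machine)                  # serial tests fit on 3 shards

print(effective_total, per_machine, serial_shards)  # 18.0 3.0 3
```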

Move serial tests to run first

If I want to move to a purely numbers-based sharding, this ensures that parallel tests run alongside parallel tests as much as possible instead of interleaving serial + parallel tests, which would decrease the effectiveness of parallelization, while also ensuring that test reordering is still mostly effective.

See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
2024-02-21 16:40:27 +00:00
a24cba35b0 [c10d][flight recorder] dump additinal NCCL debug info (#120063)
Summary:
This PR is mainly about the flight recorder side of the changes: it takes a
map of maps as input and dumps it as a picklable object. It also adds functions
that should be compiled only when NCCL_COMM_DUMP is defined.
Test Plan:
Integration tests with NCCL will be done later; here we only test the
c10d side of the dump, aka NCCLTraceTest.

Testing the dump function is a bit tricky as we don't have
existing C++ unit tests for it. So we still use the Python NCCLTraceTest with
the Python binding of _dump_nccl_trace(): we manually feed
dump_nccl_trace with a map of test info, assert on the pickled result, and
print the converted Python dict:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (main)]$  python
test/distributed/test_c10d_nccl.py NCCLTraceTest
NCCL version 2.19.3+cuda12.0
[rank0]:[E ProcessGroupNCCL.cpp:1200] [PG 0 Rank 0] ProcessGroupNCCL
preparing to dump debug info.
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.
----------------------------------------------------------------------
Ran 8 tests in 95.761s
OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120063
Approved by: https://github.com/wconstab
2024-02-21 16:35:23 +00:00
06bc203c7b Update dynamo_test_failures list (#120271)
This PR removes and adds some failures and successes that were hidden in the past week (ish).

https://github.com/pytorch/pytorch/pull/119408 (47182a8f4b5e36e280ca3595ba134f53499d2dc9) accidentally removed environment variables on rerun (see PR body of https://github.com/pytorch/pytorch/pull/120251 for slightly more details).

Enabling testing with dynamo is set using an env var, so if a test failed with dynamo, it would rerun without the dynamo env var set, making it pass on retry.  Normally, the flaky test bot would catch this and make an issue for the test, but the CI env var controls whether or not xml test reports get made, and that also got removed on rerun, so the xmls weren't made either.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120271
Approved by: https://github.com/DanilBaibak, https://github.com/zou3519
2024-02-21 16:34:34 +00:00
9199468401 Properly trace into mark_static (#120232)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120232
Approved by: https://github.com/yanboliang
2024-02-21 13:51:31 +00:00
d38a3627a5 Support privateUser1 key in RNN op. (#118182) (#118351)
Support privateUser1 key in RNN op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118351
Approved by: https://github.com/bdhirsh
2024-02-21 13:51:27 +00:00
eae025b1d7 Fix bug with block pointer multi dim args (#120263)
Summary:
Now we can parse statements like
```
%22 = tt.make_tensor_ptr %20, [%21, %c128_i64], [%c2048_i64, %c1_i64], [%1, %c0_i32]
```

Test Plan:
Added new test

```
buck2 test mode/opt //hammer/ops/tests/inductor:ragged_hstu_test
```
now passes again with optimizations

Differential Revision: D53975130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120263
Approved by: https://github.com/aakhundov, https://github.com/sijiac
2024-02-21 09:06:20 +00:00
cyy
3cd6a21e8f [DeviceIndex][6/N] Use DeviceIndex in more places (#120133)
This PR follows the series of patches beginning with #119142 and fixes various XPU and python related methods to use DeviceIndex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120133
Approved by: https://github.com/Skylion007
2024-02-21 06:24:23 +00:00
cyy
d5d13ab15e Remove C10_FALLTHROUGH (#120157)
Since [[fallthrough]] is supported in our C++17 compilers and no other repo is using it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120157
Approved by: https://github.com/Skylion007
2024-02-21 06:18:58 +00:00
d6801578c3 Update tracing rules for new cudnn functions (#120268)
# Summary
This updates the trace_rules with the new cudnn torch functions for sdpa

To repro:
`pytest test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120268
Approved by: https://github.com/shuqiangzhang, https://github.com/huydhn, https://github.com/yanboliang
2024-02-21 05:22:44 +00:00
65519d183b Remove old optimizer tests (#120257)
Removes old tests now that all configs are covered in test_compiled_optimizers.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120257
Approved by: https://github.com/eellison
2024-02-21 05:11:23 +00:00
b4cef25a1e add register_device_op_overrides (#119268)
Fixes #119267

Currently https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/common.py#L106 only supports built-in device functions; I'm going to add a register function to get the overrides class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119268
Approved by: https://github.com/jansel
2024-02-21 04:53:07 +00:00
3993771617 Expose recordSize in ChunkRecordIterator (#120239)
Summary: Add a public method to read recordSize in ChunkRecordIterator

Test Plan: ci

Differential Revision: D53931944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120239
Approved by: https://github.com/zoranzhao
2024-02-21 04:33:03 +00:00
26610175d2 pass device_str for async_compile.triton function (#120202)
Fixes #120203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120202
Approved by: https://github.com/jansel
2024-02-21 03:48:57 +00:00
800e9acd43 [inductor] fix bandwidth estimation for StarDep (#120266)
A lot of HF models fail when inductor_config.benchmark_kernel is enabled. The reason is that the bandwidth estimation code assumes every dependency has an index, but StarDep does not. An exception is raised when StarDep.index is accessed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120266
Approved by: https://github.com/eellison, https://github.com/jansel
2024-02-21 03:33:45 +00:00
20f7e5a719 Remove dependency of triton during inductor codegen (#120193)
Fixes #120192

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120193
Approved by: https://github.com/jansel
2024-02-21 01:09:48 +00:00
dd6b5e236e Prepare test_inductor_collectives.py for native funcol migration (#120025)
There are some tests in this file that are impl specific, e.g. verifying generated code via `FileCheck`. These tests are covered for native funcol in test_c10d_functional_native.py, therefore marking them with `@run_with_legacy_funcol`.

Other tests are marked with `@run_with_both_funcol_impls`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120025
Approved by: https://github.com/wanchaol
ghstack dependencies: #119982
2024-02-21 00:46:25 +00:00
af765dbdfd [ez] Explicit env for run_test (#120251)
env=None (which is the default) inherits the env from the calling process.  Explicitly set the env to the calling process env so that things can be added to it later

Tested in: e7b4d8ec88
Checked that test-reports (which depend on the CI env var) get made.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120251
Approved by: https://github.com/huydhn
2024-02-21 00:40:19 +00:00
a1fc29cd78 Revert "[pytree] add function tree_iter (#120155)"
This reverts commit 372d078f361e726bb4ac0884ac334b04c58179ef.

Reverted https://github.com/pytorch/pytorch/pull/120155 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/120155#issuecomment-1955479765))
2024-02-21 00:21:28 +00:00
701f651f9c Change the parameter type from int to float in torch.nn.Softplus (#120183)
Fixes #120175

1. The C++ API uses double:
f2cf0768d1/torch/csrc/api/include/torch/nn/options/activation.h (L501).

2. The type is also double in the test case:
f2cf0768d1/test/cpp/api/functional.cpp (L1788)

3. Passing a float parameter in Python works perfectly fine:
```
m = nn.Softplus(beta=0.1,threshold=1.2)
input = torch.randn(2)
output = m(input)

print(output)
tensor([7.3749, 7.6852])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120183
Approved by: https://github.com/mikaylagawarecki
2024-02-21 00:14:38 +00:00
35891e5007 Explicitly set nn.Module.set_extra_state return type to None (#120161)
Implicitly, the return type of `set_extra_state` is `NoReturn` since it always raises an error, and pyright will complain about mismatched return types if you override it with an implementation that doesn't also always raise an error. If we explicitly hint the return type as `None` (how we expect people to override it), we can avoid this error message.

```
Method "set_extra_state" overrides class "Module" in an incompatible manner
    Return type mismatch: base method returns type "NoReturn", override returns type "None"
```
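A minimal sketch of the override pattern this typing change is meant to support (the module and its extra state here are made up for illustration):

```python
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = 1.0

    def get_extra_state(self) -> dict:
        return {"scale": self.scale}

    def set_extra_state(self, state: dict) -> None:
        # With the base method annotated as -> None instead of -> NoReturn,
        # pyright no longer flags this override as incompatible.
        self.scale = state["scale"]

m = MyModule()
m.load_state_dict(m.state_dict())  # round-trips the extra state
```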
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120161
Approved by: https://github.com/mikaylagawarecki
2024-02-20 23:57:36 +00:00
e54c4e8659 [aot_autograd] handle subclass input mutations correctly in collect_metadata_analysis.py (#120136)
This PR fixes the issue in https://github.com/pytorch/pytorch/issues/120188.

In collect_metadata_analysis.py, handling of input/output mutations was different from handling in other locations. In other locations, MUTATED_OUT_GRAPH was used to indicate that mutation would require returning an output; in collect_metadata_analysis.py, any type of mutation was being handled as if it would require returning an output.

This PR changes collect_metadata_analysis to match other callsites and refactors computation of mutation types so that it is a property of the dataclass instead of something that needs to be computed manually when constructing an InputAliasInfo.

Differential Revision: [D53950998](https://our.internmc.facebook.com/intern/diff/D53950998)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120136
Approved by: https://github.com/bdhirsh
ghstack dependencies: #120141
2024-02-20 23:30:57 +00:00
b36404159d [aot_autograd] support inplace mutations for subclasses (#120141)
This PR removes the conditional logic depending on requires_subclass_dispatch for mutation handling.

Inputs are labeled with one of three labels: NOT_MUTATED, MUTATED_IN_GRAPH, or MUTATED_OUT_GRAPH. MUTATED_IN_GRAPH indicates mutation that is allowed in the aot autograd graph; MUTATED_OUT_GRAPH indicates mutation that is not allowed in the graph, so the result is computed, returned, and then assigned back to the input after the graph.

Previously, there was logic to handle subclasses differently, so that MUTATED_IN_GRAPH + subclasses would behave like MUTATED_OUT_GRAPH.

This PR simplifies aot_autograd's handling of mutations so that MUTATED_IN_GRAPH will always be handled in graph, even when subclasses are present. Note that there are still some cases where subclass support won't be handled correctly.
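A conceptual sketch (with invented names, not aot_autograd's real code) of the three labels and of how a MUTATED_OUT_GRAPH input gets its mutation applied after the graph runs:

```python
from enum import Enum, auto
import torch

class MutationType(Enum):
    NOT_MUTATED = auto()
    MUTATED_IN_GRAPH = auto()    # the mutation is kept inside the compiled graph
    MUTATED_OUT_GRAPH = auto()   # the graph returns the new value; we copy it back

def run_compiled(graph, inputs, mutation_types):
    outputs, updated_inputs = graph(*inputs)
    for inp, new_val, kind in zip(inputs, updated_inputs, mutation_types):
        if kind is MutationType.MUTATED_OUT_GRAPH:
            inp.copy_(new_val)   # replay the mutation outside the graph
    return outputs

# Toy "functionalized" graph for an add_: it returns the would-be mutated value.
def graph(x):
    return (x + 1).sum(), [x + 1]

x = torch.zeros(3)
run_compiled(graph, [x], [MutationType.MUTATED_OUT_GRAPH])
print(x)  # tensor([1., 1., 1.])
```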

Differential Revision: [D53950999](https://our.internmc.facebook.com/intern/diff/D53950999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120141
Approved by: https://github.com/bdhirsh
2024-02-20 23:30:57 +00:00
96092e1f55 Extend aot_graph_input_parser to sym shapes (#120246)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120246
Approved by: https://github.com/shunting314
2024-02-20 23:24:45 +00:00
7acdd08fcc [FSDP2] Used stream APIs for CUDA event handling (#120231)
If we already have Python `Stream` objects, then calling `stream1.wait_stream(stream2)` is syntactic sugar for creating an `event: Event`, recording it in `stream2`, and calling `stream1.wait_event(event)`.

~~Getting a Python `Stream` object incurs some CPU overhead, so we prefer to not change other callsites where we do not already have the `Stream` objects.~~
Update: Calling `event.record()` with no stream specified calls `torch.cuda.current_stream()`, so the overhead should be identical.
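A minimal sketch of the equivalence described above (assumes a CUDA device is available):

```python
import torch

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

# Sugar form:
s1.wait_stream(s2)

# What it is shorthand for:
event = torch.cuda.Event()
event.record(s2)      # record() with no argument would use the current stream
s1.wait_event(event)
```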

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120231
Approved by: https://github.com/yifuwang
ghstack dependencies: #118298, #119985
2024-02-20 21:35:46 +00:00
dfb83df889 Revert "Add cpp stack traces to our own reruns (#119408)"
This reverts commit 47182a8f4b5e36e280ca3595ba134f53499d2dc9.

Reverted https://github.com/pytorch/pytorch/pull/119408 on behalf of https://github.com/clee2000 due to iirc the default setting of env to None causes it to inherit the env of the calling process, I'll make a PR that makes it so that the old env vars don't disappear, and then re merge this on top of it.  Reverting this because I think some important env vars are disappearing (specifically CI) ([comment](https://github.com/pytorch/pytorch/pull/119408#issuecomment-1955128676))
2024-02-20 21:28:13 +00:00
2d6c0cc81b Run test_functional_api.py with both legacy and native funcol impls (#119982)
Additional changes: tests in test_functional_api.py use a multi-threaded pg which is implemented in Python. For the native ops to call into the Python pg implementation, glue code in PyProcessGroup is required for each collective. This PR also adds a few pieces of previously missing glue code, which are necessary for running test_functional_api.py with native funcol.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119982
Approved by: https://github.com/wanchaol
2024-02-20 21:15:37 +00:00
d42ede8ae4 [torch.compile] Log compilation start time for timeline view (#120220)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120220
Approved by: https://github.com/angelayi
2024-02-20 21:07:40 +00:00
be8ba5ef2d Revert "use two pass reduction for deterministic reduction order (#11… (#120243)
This reverts commit cc7ef43423fe36cf1778a9c9643454d62050a5b5.

Manual revert because of the conflict in: test/inductor/test_cpu_repro.py , conflict with this PR: https://github.com/pytorch/pytorch/pull/118365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120243
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-02-20 20:50:29 +00:00
4f0f25b7ce [Inductor][bugFix] fix a bug in merge_splits (#119956)
Summary: RecGPT hit a KeyError when running split_cat, caused by an unhandled corner case.

Test Plan: P1184947021

Differential Revision: D53791839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119956
Approved by: https://github.com/jackiexu1992
2024-02-20 20:38:34 +00:00
957f37686a Refactor instance_descriptor for new triton version (#119636)
Check https://github.com/pytorch/pytorch/pull/119457#issuecomment-1936764161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119636
Approved by: https://github.com/shunting314
2024-02-20 20:26:35 +00:00
8464654ae4 Add missing words to torch.utils.checkpoint doc (#120196)
This PR adds a couple of missing words to the Checkpointing documentation; it doesn't have a specific issue number related to it.

Changes are:
- "backward." -> "backward propagation."
- "to be advanced than" -> "to be more advanced than"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120196
Approved by: https://github.com/soulitzer
2024-02-20 20:18:42 +00:00
b33e8d3f6b [Inductor][fx pass] Add split cat pattern to remove cat nodes (#115004)
Summary: Titled

Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/8e4179db-363a-41b5-8bd7-cc445a512f6f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/15762598708548039
Network: Up: 91KiB  Down: 32KiB  (reSessionID-b0985d82-1919-49c5-b307-ee0ab49b4738)
Jobs completed: 28. Time elapsed: 1:27.1s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 11. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce (IG_CTR)
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```
P895047189

Differential Revision: D51777617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115004
Approved by: https://github.com/jackiexu1992
2024-02-20 19:35:20 +00:00
cccacf6c8e add a test that non_overlapping checks don't generate too many guards (#120106)
Pre-emptive test in OSS to ensure that models relying on the "non-overlapping guards" checks do not suffer drastically w.r.t. guard slowness. Current plan is to follow up on this with a "real" fix, to generate a linear number of these guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120106
Approved by: https://github.com/mlazos
2024-02-20 18:38:59 +00:00
6d82a7e9b0 Add pixel_shuffle to core aten decomps (#120092)
Summary:
https://github.com/pytorch/pytorch/pull/118239 added a decomposition
for pixel_shuffle, so pixel_shuffle no longer needs to be a Core ATen Op. We
have also fixed the internal use case so that it no longer special cases on
pixel_shuffle, allowing us to revert the changes in
https://github.com/pytorch/pytorch/pull/118921.

Test Plan: CI

Differential Revision: D53860966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120092
Approved by: https://github.com/ydwu4
2024-02-20 18:37:32 +00:00
53bfae2c06 [MPS] Add torch.fft. support (#119670)
Increase tolerance for `fft` ops; this warrants further investigation, as the error grows with larger matrix dimensions (see https://github.com/pytorch/pytorch/issues/120237 )

When compiling on MacOS13, implement `+[FakeMPSGraphFFTDescriptor descriptor]` as a redispatch to a real thing.

Fixes https://github.com/pytorch/pytorch/issues/78044
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119670
Approved by: https://github.com/kulinseth, https://github.com/albanD
2024-02-20 18:23:06 +00:00
5f3f8fd3c7 [Inductor] Setting kernel launch and exit callbacks for inductor generated triton kernels (#119450)
`CompiledKernel.launch_enter_hook` and `CompiledKernel.launch_exit_hook` are hooks that allow external tools to monitor the execution of Triton kernels and read each kernel's metadata. Initially, these hooks have a value of `None`.

Triton's kernel launcher passes hooks and kernel metadata by default, while Inductor's launcher doesn't. This PR unifies the parameters passed to both launchers so that tools can get information from both handwritten Triton kernels and Inductor-generated Triton kernels.
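A hedged sketch of how an external tool might attach these hooks; the import path and the assumption that each hook receives a single metadata argument are version-dependent and not guaranteed by this PR:

```python
import time
from triton.compiler import CompiledKernel  # assumed import path; varies by Triton version

_starts = []

def enter_hook(metadata):
    # metadata is assumed to describe the launching kernel (name, grid, ...)
    _starts.append(time.perf_counter())

def exit_hook(metadata):
    elapsed = time.perf_counter() - _starts.pop()
    print(f"kernel took {elapsed * 1e6:.1f} us")

CompiledKernel.launch_enter_hook = enter_hook
CompiledKernel.launch_exit_hook = exit_hook
```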

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119450
Approved by: https://github.com/soulitzer
2024-02-20 16:58:20 +00:00
d3839b624b [ROCm] HIP Lazy Streams (#119996)
For ROCm/HIP, each stream is lazily initialized rather than creating all streams when the first stream is requested. HIP streams are not as lightweight as CUDA streams; the pooling strategy can affect performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119996
Approved by: https://github.com/ezyang
2024-02-20 16:24:04 +00:00
26fbbc3e84 DTensor + dynamo: fix is_shard/replicate always inlining to False (#118668)
Fixes an internal enablement bug. When dynamo traces `is_sharded`/`is_replicate`, it would unconditionally assume the result was False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118668
Approved by: https://github.com/wconstab, https://github.com/wanchaol
ghstack dependencies: #117667, #117666, #118209, #118191, #118667
2024-02-20 15:23:48 +00:00
609cde94f9 DTensor: use memory_format in the hash for all aten ops that use that arg (e.g. aten.clone) (#118667)
This fixes an internal DTensor enablement bug (I don't have an OSS issue for it)

I finally root-caused this as follows:

(1) we were fakefying a DTensor graph input, that was an autograd non-leaf (it had a grad_fn)

(2) that caused it to go through this `clone()` call during fakeification: https://github.com/pytorch/pytorch/blob/main/torch/_subclasses/meta_utils.py#L549

(3) `clone(torch.preserve_format)` is supposed to return another DTensor with the same strides as the input, but I noticed we were returning a DTensor with contiguous strides incorrectly.

(4) It turns out that DTensor was hashing on the sharding strategy for `aten.clone`, regardless of the `memory_format` kwarg that was passed in.

I could have manually updated the `clone` sharding strategy registration to take `memory_format` into account. But instead, I figured that every aten op with a sharding strategy needs to handle the memory_format kwarg specially - so I tried to generically force DTensor to consider all ATen ops that take a `memory_format` kwarg during hashing.
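A toy illustration of the hashing fix being described (all names here are invented): include the memory_format kwarg in the dispatch cache key so that clone(preserve_format) and clone(contiguous_format) no longer collide.

```python
import torch

def sharding_cache_key(op_name, args_spec, kwargs):
    # fold memory_format into the key instead of ignoring it
    mem_fmt = kwargs.get("memory_format", torch.preserve_format)
    return (op_name, args_spec, mem_fmt)

key_a = sharding_cache_key("aten.clone", ("Shard(0)",), {"memory_format": torch.preserve_format})
key_b = sharding_cache_key("aten.clone", ("Shard(0)",), {"memory_format": torch.contiguous_format})
assert key_a != key_b  # previously these collided because memory_format was ignored in the hash
```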

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118667
Approved by: https://github.com/wanchaol
ghstack dependencies: #117667, #117666, #118209, #118191
2024-02-20 15:23:48 +00:00
6819452a08 fix multiple-fake-modes bug with compile + subclasses (#118191)
This should fix the "multiple fake modes" errors we've been seeing with both float8 tensor and DTensor.

Haven't added a test yet - will add one before landing.

I also have a separate PR that would have made the error significantly nicer (the bad error resulted from us returning a FakeTensor at runtime): https://github.com/pytorch/pytorch/pull/118644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118191
Approved by: https://github.com/drisspg
ghstack dependencies: #117667, #117666, #118209
2024-02-20 15:23:41 +00:00
b4b1480b06 remove redundant to_dtype in Fused Schedular Nodes (#118365)
Fix https://github.com/pytorch/pytorch/issues/115260.
This issue is triggered by `FusedSchedulerNode` cases.
We always store the `lowp buffer` to the `store_cache`, then load the `lowp buffer` from the `store_cache` and `convert it to float` before the `compute ops`.
Now we will generate a `{key: to(float32)_expr, value: the float32 cse var before to_dtype and store}` entry in `cse.cache`.
Then the `to_dtype(float32)` after `load` will hit this cache and not generate a new var with cast code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118365
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-02-20 13:35:03 +00:00
c28a43988e Fix typo under aten/src/ATen/native directory (#119686)
This PR fixes typo in comments and msgs under `aten/src/ATen/native` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119686
Approved by: https://github.com/lezcano, https://github.com/malfet
2024-02-20 06:31:10 +00:00
389b56b4c4 [dynamo][guards-cpp-refactor] GetAttrGuardAccessor (#119833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119833
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827
2024-02-20 05:33:08 +00:00
96f45d15d8 [dynamo][guards-c++-refactor] EQUALS_MATCH guard (#119827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119827
Approved by: https://github.com/jansel
ghstack dependencies: #119822
2024-02-20 05:33:08 +00:00
0802951081 [dynamo][guards-c++-refactor] Introduce LeafGuard, GuardManager and GuardAccessor classes (#119822)
The full blown implementation is in this stack - https://github.com/pytorch/pytorch/pull/110590 which is passing all the test cases on CI. That stack is hard to review. So, breaking apart.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119822
Approved by: https://github.com/jansel
2024-02-20 05:33:08 +00:00
0512ba43ab [executorch hash update] update the pinned executorch hash (#120214)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120214
Approved by: https://github.com/pytorchbot
2024-02-20 04:13:02 +00:00
a7e2b609d3 Skip less replacements (#119570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119570
Approved by: https://github.com/ezyang
2024-02-20 04:10:33 +00:00
cc7ef43423 use two pass reduction for deterministic reduction order (#115620)
## Motivation
Address the [non-deterministic reduction order](https://github.com/pytorch/pytorch/issues/93542#issuecomment-1411294181) issue for `omp parallel reduction`.

## Latest update on 1.15:
55d81901bc.
Do not reduce into the array inside the loop. Instead, reduce into a local scalar and write it to the array after the local reduction is done. This allows the compiler to keep the reduction variable in a register instead of reading/writing it from memory. If the `working set` of the `loop body` is quite large, the gap between `register` and `memory` accumulation will be significant.
```
vaddss (%xmm0, %xmm11, %xmm11) -> accumulate in register %xmm0
vaddssl ((%rdx, %rdi, 4), %xmm0, %xmm0) -> accumulate in memory address (%rdx, %rdi, 4)
```
Examples code:
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    #pragma omp for
    for(...){
        ....
        tmp0_acc_arr[tid] = tmp0_acc_arr[tid] + tmp_x;  // access array will always from memory
    }
}
```
will be changed to
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    **auto tmp0_acc_local = 0;**
    #pragma omp for
    for(...){
        ....
        **tmp0_acc_local**  = tmp0_acc_local + tmp_x;
    }
    **tmp0_acc_arr[tid] = tmp0_acc_local;**
}
```

## Description
Following ATen, use a `two pass reduction` with `omp parallel` for a deterministic reduction order.
9c3ae37fc4/aten/src/ATen/Parallel-inl.h (L39)
9c3ae37fc4/aten/src/ATen/native/TensorIteratorReduce.cpp (L24)
```
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            // init reduction buffer per thread
            float tmp_acc0_arr[64];
            at::vec::Vectorized<float> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_arr[tid] = 0;
                tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(3964928L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0));
                    auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0));
                    auto tmp2 = tmp0 - tmp1;
                    auto tmp3 = tmp2 * tmp2;
                    // reduce to per thread buffers
                    tmp_acc0_vec_arr[tid] = tmp_acc0_vec_arr[tid] + tmp3;
                }
            }
            // second pass reduce
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 + tmp_acc0_arr[tid];
                tmp_acc0_vec = tmp_acc0_vec + tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec);
            out_ptr0[static_cast<long>(0L)] = static_cast<float>(tmp_acc0);
```

## Test results
I tested this PR with the dynamo benchmarks on a 32-core ICX system.
Result (avg speedup):
| |  before this PR   | after this PR  |
| ---- |  ----  | ----  |
| torchbench | 1.303  | 1.301 |
| huggingface | 1.346  | 1.343 |
| timms | 1.971 | 1.970 |

```
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

multi_threads_test() {
    CORES=$(lscpu | grep Core | awk '{print $4}')
    export OMP_NUM_THREADS=$CORES
    end_core=$(expr $CORES - 1)
    numactl -C 0-${end_core} --membind=0 python benchmarks/dynamo/${SUITE}.py --${SCENARIO} --${DT} -dcpu -n50 --no-skip --dashboard --only "${MODEL}" ${Channels_extra} ${BS_extra} ${Shape_extra} ${Mode_extra} ${Wrapper_extra} ${Flag_extra} --timeout 9000 --backend=inductor --output=${LOG_BASE}/${SUITE}.csv
}

SCENARIO=performance
DT=float32
export TORCHINDUCTOR_FREEZING=1
Flag_extra="--freezing"
Mode_extra="--inference"

for suite in timm_models huggingface torchbench
do
  export SUITE=$suite
  echo $SUITE
  export LOG_BASE=`date +%m%d%H%M%S`
  mkdir $LOG_BASE
  multi_threads_test
done
```
System info
```
ubuntu@ip-172-31-18-205:~/hz/pytorch$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            6
    BogoMIPS:            5800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic mo
                         vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xs
                         aveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   1.5 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    40 MiB (32 instances)
  L3:                    54 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-63
Vulnerabilities:
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT Host state unknown
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115620
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-02-20 00:46:59 +00:00
ae7830051d [BE] Delete GCC-7 ICE workarounds (#120122)
As one needs gcc-9 to compile PyTorch, those workarounds are no longer relevant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120122
Approved by: https://github.com/huydhn, https://github.com/kit1980, https://github.com/Skylion007
2024-02-20 00:31:20 +00:00
0bdeaad936 Revert "add register_device_op_overrides (#119268)"
This reverts commit 2864a7e161cc107f7e4c00cccdf860a6089c73c3.

Reverted https://github.com/pytorch/pytorch/pull/119268 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/119268#issuecomment-1953231324))
2024-02-19 22:31:32 +00:00
3ad067fe2b [CPP] Update GCC minversion check to 9 or newer (#120126)
It's already a requirement for building PyTorch, but it should also be a
requirement for linking extensions with it, since mixing compilers can lead to
runtime crashes: the `std::optional` template layout is incompatible between
gcc-9 and older compilers.

Also, update minimum supported clang version to 9.x (used to build Android), as clang-5 is clearly not C++17 compliant.

Fixes https://github.com/pytorch/pytorch/issues/120020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120126
Approved by: https://github.com/Skylion007
2024-02-19 22:05:00 +00:00
48bdd0fb47 [ROCm] TunableOp bugfix filename handling (#120144)
Fixes nightly wheel seg fault during pytorch shutdown.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120144
Approved by: https://github.com/xw285cornell
2024-02-19 21:31:29 +00:00
f1fbba8f35 Revert "Fix lint after #119268 (#120207)"
This reverts commit d9d0f1dccc59ce6f0cb150ac236654c24a0d1118.

Reverted https://github.com/pytorch/pytorch/pull/120207 on behalf of https://github.com/atalman due to Broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/120207#issuecomment-1953170249))
2024-02-19 21:21:12 +00:00
a73a98c9ae Revert "Updating sleef submodule to eb3d97785 to fix export errors (#119953)"
This reverts commit fa9cbdce993601276765ad7701871f7e04a400c6.

Reverted https://github.com/pytorch/pytorch/pull/119953 on behalf of https://github.com/atalman due to Broke trunk linux-focal-cpu-py3.10-gcc9-bazel-test and linux-focal-cuda12.1-py3.10-gcc9-bazel-test. These are not flaky failures. ([comment](https://github.com/pytorch/pytorch/pull/119953#issuecomment-1953118780))
2024-02-19 20:26:33 +00:00
d9d0f1dccc Fix lint after #119268 (#120207)
Fixes lint after: https://github.com/pytorch/pytorch/issues/119268

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120207
Approved by: https://github.com/davidberard98
2024-02-19 20:01:45 +00:00
92bf2a4550 [torchbench] Update skipped models. (#120117)
This PR updates the list of benchmarks that should (not) be skipped. Here's a summary of
the changes:

- `detectron2_maskrcnn`: #120115
- `fambench_xlmr`: moved to canary models
- `hf_Bert` and `hf_Bert_large`: pass
- `maml`: pass
- `clip`: renamed to `hf_clip`
- `gat`, `gcn`, and `sage`: moved to canary models

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120117
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-02-19 18:08:32 +00:00
637cf4a3f2 Test parametrization utils for native funcol migration (#119950)
```
Between the time we switch to the native funcol by default and the time when
we are confident that we can remove the legacy implementation, we want to
ensure that the legacy funcol remains covered by unit tests. This is to
prepare for any potential (but unlikely) reverts. The following utilities
help achieve this goal.

run_with_{native,legacy}_funcol - mark a test to run with only
{native,legacy} funcol. These decorators are for impl specific tests (e.g.
verifying generated code with FileCheck).

run_with_both_funcol_impls - parametrize a test to run with both legacy and
native funcol.

run_with_both_funcol_impls_with_arg - same as run_with_both_funcol_impls, but
passes `enable_native_funcol` to the test so impl specific checks can be
carried out.
```

This PR also marks some tests we want to cover in this fashion. More tests will be marked in subsequent PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119950
Approved by: https://github.com/wanchaol
ghstack dependencies: #119881
2024-02-19 02:46:03 +00:00
40786ca509 Handle unwaited work objects on process termination (#119881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119881
Approved by: https://github.com/wconstab
2024-02-19 02:46:02 +00:00
84de851539 [Inductor] Enable the decomposition of quant/dequant per channel (#119177)
**Summary**
Part 2 of fixing https://github.com/pytorch/pytorch/issues/119141 which needs vectorized code generation of per channel quant and int8 data type.
Enable decomposition of quant/dequant per channel to make it vectorized code generation.

**TestPlan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_per_channel_fake_quant_uint8
python -u -m pytest -s -v test_cpu_repro.py -k test_per_channel_fake_quant_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_per_channel_fake_quant_uint8_bf16_input
python -u -m pytest -s -v test_cpu_repro.py -k test_per_channel_fake_quant_int8_bf16_input
```

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119177
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-02-19 01:30:44 +00:00
fa9cbdce99 Updating sleef submodule to eb3d97785 to fix export errors (#119953)
Fixes #119952 with submodule updates

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119953
Approved by: https://github.com/ezyang
2024-02-19 00:56:24 +00:00
f2cf0768d1 [dynamo][distributed] handle _rank_not_in_group, _get_or_create_default_group (#119628)
Copy of #117692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119628
Approved by: https://github.com/yanboliang
2024-02-18 22:34:35 +00:00
372d078f36 [pytree] add function tree_iter (#120155)
Fixes #119768

- #119768

This PR adds a new function `tree_iter` that lazily iterates over the tree leaves. It is different from the `tree_leaves` function, as the latter traverses the whole tree first to build a list of leaves.

```python
for leaf in tree_iter(tree):
    ...
```

is much more efficient than:

```python
for leaf in tree_leaves(tree):
    ...
```

where `tree_leaves(tree)` is `list(tree_iter(tree))`.
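A standalone sketch of the laziness argument, using a toy flatten rather than the real torch.utils._pytree implementation:

```python
def toy_tree_iter(tree):
    # generator: yields leaves one at a time, depth-first
    if isinstance(tree, (list, tuple)):
        for child in tree:
            yield from toy_tree_iter(child)
    elif isinstance(tree, dict):
        for child in tree.values():
            yield from toy_tree_iter(child)
    else:
        yield tree

def toy_tree_leaves(tree):
    # eager: materializes every leaf up front
    return list(toy_tree_iter(tree))

# Early exit only visits the leaves it needs instead of materializing all of them.
first = next(toy_tree_iter([[1, 2], {"a": 3}]))  # 1
```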

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120155
Approved by: https://github.com/vmoens
2024-02-18 09:16:50 +00:00
61a3a7628c [nit][DTensor][Test] Update test name to reflect the actual test (#118960)
test_name: test_partial_mul_failure -> test_partial_mul

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118960
Approved by: https://github.com/XilunWu
2024-02-18 08:23:06 +00:00
2864a7e161 add register_device_op_overrides (#119268)
Fixes #119267

Currently https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/common.py#L106 only supports built-in device functions; I'm going to add a register function to get the overrides class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119268
Approved by: https://github.com/jansel
2024-02-18 06:11:54 +00:00
70bc3b3be4 [executorch hash update] update the pinned executorch hash (#120165)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120165
Approved by: https://github.com/pytorchbot
2024-02-18 03:44:50 +00:00
d74bdd5042 [inductor] Always allow 64 bit in next_power_of_2 (#120164)
see #120153 #120152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120164
Approved by: https://github.com/yanboliang
2024-02-18 03:22:46 +00:00
de15781af0 [cuDNN] Bump cuDNN frontend submodule to 1.1.1 (#120137)
Hopefully addresses the failure seen when trying to bump to 1.1.0 (#119642) CC @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120137
Approved by: https://github.com/Skylion007
2024-02-18 02:57:02 +00:00
b642a18e80 [dynamo] Use EQUALS_MATCH guard for mod.training (#120147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120147
Approved by: https://github.com/jansel
ghstack dependencies: #120132, #120140, #120145
2024-02-18 00:31:36 +00:00
0b11b0edd6 [dynamo][refactor] Use existing helper functions for CLOSURE_MATCH (#120145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120145
Approved by: https://github.com/jansel, https://github.com/Fidget-Spinner
ghstack dependencies: #120132, #120140
2024-02-18 00:31:36 +00:00
0c972c7c4e enhance next_power_of_2 function (#120153)
Fixes #120152

cc  @ezyang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @jansel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120153
Approved by: https://github.com/jansel
2024-02-17 20:18:46 +00:00
2fea475215 [dynamo] Refactor reconstruct() not to return anything (#120150)
This simplifies things slightly and avoids some bugs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120150
Approved by: https://github.com/yanboliang
2024-02-17 17:13:41 +00:00
757fc663a8 [dynamo][refactor] Use TYPE_MATCH instead of manually constructing guard (#120140)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120140
Approved by: https://github.com/jansel, https://github.com/yanboliang
ghstack dependencies: #120132
2024-02-17 16:03:36 +00:00
48d96c08f2 [dynamo][guards] Use EQUALS_MATCH for NAME_MATCH (#120132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120132
Approved by: https://github.com/jansel, https://github.com/yanboliang
2024-02-17 16:03:36 +00:00
cyy
a9953a5ef3 Remove unused c10/util/C++17.h inclusion and outdated checks (#120149)
This continues the work of cleaning up pre-C++17 code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120149
Approved by: https://github.com/ezyang
2024-02-17 14:28:17 +00:00
fac598c4ae [inductor] allow padding mm/bmm/addmm in the presence of dynamic dims (#120073)
Previously, pad_mm skipped cases where any input tensor has a symbolic
dimension or stride. This is too constrained in practice.
This PR enables this pass to pad non-symbolic dimensions in
the presence of dynamic dims. For example, with this PR, we could
pad the K dimension (i.e. 1921) for torch.mm(A[s0, 1921], B[2048, 1921]).
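A conceptual sketch of the idea under assumed names (not the actual pad_mm heuristics): pad only the static K dimension to a convenient multiple and rely on the zero padding cancelling in the contraction, so no output slicing is needed.

```python
import torch

def pad_k_to_multiple(a, b, multiple=8):
    # a: [M, K] (M may be dynamic), b: [K, N]; K is static here
    k = a.shape[1]
    pad = (-k) % multiple
    if pad == 0:
        return a, b
    a_p = torch.nn.functional.pad(a, (0, pad))        # pad last dim of a with zeros
    b_p = torch.nn.functional.pad(b, (0, 0, 0, pad))  # pad first dim of b with zeros
    return a_p, b_p

a = torch.randn(13, 1921)   # 13 stands in for a dynamic M
b = torch.randn(1921, 2048)
a_p, b_p = pad_k_to_multiple(a, b)  # K: 1921 -> 1928
torch.testing.assert_close(a @ b, a_p @ b_p, rtol=1e-4, atol=1e-4)
```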

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120073
Approved by: https://github.com/jansel
2024-02-17 12:22:20 +00:00
2f8a80ecb2 Fix skip for test_set_nccl_pg_timeout (#120130)
The test is failing on our internal CI with the below error
```RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

The purpose of this test is NCCL-specific, so it doesn't make sense to run it in a 1-GPU setting either.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120130
Approved by: https://github.com/wconstab, https://github.com/eqy
2024-02-17 07:36:14 +00:00
badf84bd6b [inductor] Add torch.cond support to JIT Inductor (#119759)
Summary: `torch.cond` is already supported in Dynamo and Export: the `true_fn` and `false_fn` subgraphs are traced as child fx graphs of the main graph and passed to the `torch.cond` higher-order operator in the fx graph. However, this breaks in Inductor, as the latter doesn't have a way of dealing with child fx subgraphs and properly lowering and codegen-ing them.

In this PR, we add `torch.cond` support in Inductor. This is achieved by adding subgraph lowering and codegen-ing infrastructure as well as new `Conditional` IR node type weaving the parent graph with the true and false child subgraphs.

Here we only implement `torch.cond` support in JIT Inductor (Python wrapper codegen). The implementation in AOT Inductor (C++ wrapper codegen), including ABI-compatibility mode, will follow.
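A hedged usage sketch of the torch.cond flow this PR lowers (assuming the prototype signature `torch.cond(pred, true_fn, false_fn, operands)`; exact conventions may differ by version):

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile(backend="inductor")
def f(pred, x):
    # both branches must return tensors of the same shape/dtype
    return torch.cond(pred, true_fn, false_fn, (x,))

x = torch.randn(4)
print(f(torch.tensor(True), x))   # sin branch
print(f(torch.tensor(False), x))  # cos branch
```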

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 24 tests in 86.790s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119759
Approved by: https://github.com/jansel, https://github.com/eellison
2024-02-17 07:25:27 +00:00
30000aa3fd [c10d] remove one line of verbose log (#120138)
Summary:
I don't find existing DBG mode support in c10d. This line is flooding the log; removing it to unblock users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120138
Approved by: https://github.com/wconstab
2024-02-17 06:39:57 +00:00
fa0e39560c [AOTI] Fix a typo (#120094)
Differential Revision: [D53861810](https://our.internmc.facebook.com/intern/diff/D53861810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120094
Approved by: https://github.com/khabinov, https://github.com/sijiac
2024-02-17 05:28:58 +00:00
0a7471e0df [executorch hash update] update the pinned executorch hash (#120134)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120134
Approved by: https://github.com/pytorchbot
2024-02-17 05:00:35 +00:00
ac2ba7889d [export] turn on replace_set_grad_with_hop_pass in pre_dispatch (#119915)
This PR turns on replace_set_grad_with_hop_pass for pre_dispatch export. To do that, we need to propagate the metadata from the original submodule to the new higher-order op and fix the node names as required by the _sig_to_specs pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119915
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736, #119810, #119913, #119914
2024-02-17 02:18:35 +00:00
737630268c [export] manually create test cases for split and inline (#119914)
This PR makes the tests for inline and sequential_split stop relying on set_grad_enabled being in the graph, because those calls will be gone once we turn on the replace_set_grad_with_hop_pass in the following diff. Instead, we'll manually insert them into the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119914
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736, #119810, #119913
2024-02-17 02:18:35 +00:00
8d81e61fb6 [export] make node_inline_ also inline the get_item calls (#119913)
As titled. Before this PR, after we split and then inline_, there were getitem calls in the graph that the original graph module doesn't have. This PR removes the additional get_item calls by inlining them.

Test Plan:
Added new test cases for graphs that return multiple outputs and takes multiple inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119913
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736, #119810
2024-02-17 02:18:27 +00:00
812f05d731 [export] add replace_set_grad_with_hop_pass (#119810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119810
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736
2024-02-17 02:18:19 +00:00
4769e6916a [export] add node_inline_ to prepare replacing set_grad_enabled with hop (#119736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119736
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732
2024-02-17 02:18:11 +00:00
068659ddc2 [export] add sequential_split to prepare replacing set_grad_enabled with hop (#119732)
This PR is 1/N in a series transforming global-state-mutating ops, such as torch._C._set_grad_enabled calls, in the pre-dispatch graph into a higher-order op so that the graph becomes more functional. We make use of split_module to help us do the transformation.

This PR preserves node.name from the original module by adding a new kwarg `keep_original_node_name` to split_module.

For a graph looks like this:
```python
def forward(self, arg_0):
    arg0_1, = fx_pytree.tree_flatten_spec(([arg_0], {}), self._in_spec)
    add = torch.ops.aten.add.Tensor(arg0_1, 1);  arg0_1 = None
    sin = torch.ops.aten.sin.default(add);  add = None
    sum_1 = torch.ops.aten.sum.default(sin);  sin = None
    _set_grad_enabled = torch._C._set_grad_enabled(False)
    add_1 = torch.ops.aten.add.Tensor(sum_1, 1);  sum_1 = None
    _set_grad_enabled_1 = torch._C._set_grad_enabled(True)
    sub = torch.ops.aten.sub.Tensor(add_1, 1)
    return pytree.tree_unflatten((add_1, sub), self._out_spec)
```
Before the change, split graph returns the following graphs and subgraphs (notice the change from `add` -> `add_tensor`, `sin` -> `sin_default`):
```python
def forward(self, arg_0):
    arg0_1, = fx_pytree.tree_flatten_spec(([arg_0], {}), self._in_spec)
    submod_0 = self.submod_0(arg0_1);  arg0_1 = None
    submod_1 = self.submod_1(submod_0);  submod_0 = None
    submod_2 = self.submod_2(submod_1)
    return pytree.tree_unflatten((submod_1, submod_2), self._out_spec)

# submod_0
def forward(self, arg0_1):
    add_tensor = torch.ops.aten.add.Tensor(arg0_1, 1);  arg0_1 = None
    sin_default = torch.ops.aten.sin.default(add_tensor);  add_tensor = None
    sum_default = torch.ops.aten.sum.default(sin_default);  sin_default = None
    return sum_default

# submod_1
def forward(self, sum_1):
    _set_grad_enabled = torch._C._set_grad_enabled(False)
    add_tensor = torch.ops.aten.add.Tensor(sum_1, 1);  sum_1 = None
    return add_tensor

# submod_2
def forward(self, add_1):
    _set_grad_enabled = torch._C._set_grad_enabled(True)
    sub_tensor = torch.ops.aten.sub.Tensor(add_1, 1);  add_1 = None
    return sub_tensor
    """)

```

After the change, the test produce the following graph, all the node names in original graph module are preserved in sub_modules.
```python

def forward(self, arg_0):
    sub, = fx_pytree.tree_flatten_spec(([arg_0], {}), self._in_spec)
    submod_0 = self.submod_0(sub);  sub = None
    submod_1 = self.submod_1(submod_0);  submod_0 = None
    submod_2 = self.submod_2(submod_1)
    return pytree.tree_unflatten((submod_1, submod_2), self._out_spec)

# submod_0
def forward(self, arg0_1):
    add = torch.ops.aten.add.Tensor(arg0_1, 1);  arg0_1 = None
    sin = torch.ops.aten.sin.default(add);  add = None
    sum_1 = torch.ops.aten.sum.default(sin);  sin = None
    return sum_1

# submod_1
def forward(self, sum_1):
    _set_grad_enabled = torch._C._set_grad_enabled(False)
    add_1 = torch.ops.aten.add.Tensor(sum_1, 1);  sum_1 = None
    return add_1

# submod_2
def forward(self, add_1):
    _set_grad_enabled_1 = torch._C._set_grad_enabled(True)
    sub = torch.ops.aten.sub.Tensor(add_1, 1);  add_1 = None
    return sub

```

Note that currently, we call split_module on the graph after pre-dispatch aot. The difference is even larger if we `split_module` the graph module produced by dynamo, where all the original variable names in the user program are preserved after dynamo but lost after `split_module` without this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119732
Approved by: https://github.com/tugsbayasgalan
2024-02-17 02:18:04 +00:00
becfda005e tiny improvement to the cprofile wrapper (#120100)
1. Right now we double-increment the profile counter. The PR avoids that so we don't end up with profile_0, profile_2, profile_4, ...
2. Log the latency of running the passed-in function with profiling on, so we can easily skip those _compile calls which return quickly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120100
Approved by: https://github.com/eellison
2024-02-17 02:10:25 +00:00
36e118b810 [inductor] logging meta data for inductor generated triton kernel (#120048)
I want to log metadata for inductor-generated triton kernels for a couple of purposes:
1. With this metadata, it should be convenient to find unaligned reduction kernels and try the idea here https://github.com/pytorch/pytorch/issues/119929 . It's nice to try it on kernels that are used in real models.
2. Based on the collected kernel metadata, I can build a simple offline tool that benchmarks each kernel with ncu and augments each kernel's metadata with latency, theoretical membw (estimated memory accesses / latency), and actually achieved membw. Hopefully this can point us to some good optimization opportunities.

Command:
```
TORCHINDUCTOR_CACHE_DIR=`realpath ~/inductor-caches/kernel-metadata-log` TORCHINDUCTOR_ENABLED_METRIC_TABLES=kernel_metadata TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 time python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --training
```

The best practice here is to point the inductor cache to a folder outside of /tmp so that one can always re-run a kernel from the path stored in its metadata (folders under /tmp may get removed by the system).

Here are the first 1000 rows of collected metadata for huggingface: https://gist.github.com/shunting314/cf4ebdaaaa7e852efcaa93524c868e5f

And here are all 10K kernels collected for huggingface; the gist cannot be rendered as a CSV since it's too large: https://gist.github.com/shunting314/7f841528e2debdc2ae05dece4ac591be .
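A hedged sketch of how the collected table could be scanned for reduction kernels; the file name and column names here are assumptions about the metric-table output, not a documented schema:

```python
import csv

with open("kernel_metadata.csv") as f:
    for row in csv.DictReader(f):
        # Keep only rows that look like reduction kernels (column names are assumed).
        if row.get("reduction_hint", "").lower() not in ("", "none"):
            print(row.get("kernel_name"), row.get("reduction_hint"), row.get("kernel_path"))
```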

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120048
Approved by: https://github.com/jansel
2024-02-17 02:09:27 +00:00
24968ff042 Add quantized gelu (#119935)
Summary: Added quantized gelu for the Vulkan backend.

Test Plan:
**Tested it on "On Demand RL FBSource"**

LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_quantized_api_test_bin -c pt.vulkan_full_precision=1 -- --gtest_filter="VulkanAPITest.gelu_q*"

----------------------------------------------------------------------------------

Note: Google Test filter = VulkanAPITest.gelu_q*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.gelu_qint8
[       OK ] VulkanAPITest.gelu_qint8 (318 ms)
[ RUN      ] VulkanAPITest.gelu_qint8_self
[       OK ] VulkanAPITest.gelu_qint8_self (214 ms)
[ RUN      ] VulkanAPITest.gelu_quint8
[       OK ] VulkanAPITest.gelu_quint8 (152 ms)
[ RUN      ] VulkanAPITest.gelu_quint8_self
[       OK ] VulkanAPITest.gelu_quint8_self (142 ms)
[----------] 4 tests from VulkanAPITest (828 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (828 ms total)
[  PASSED  ] 4 tests.

Differential Revision: D52985437

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119935
Approved by: https://github.com/jorgep31415
2024-02-17 01:17:25 +00:00
7973ac586d [Memory Snapshot] Add CUDAAllocatorConfig details into snapshot metadata (#119404)
Summary:
Include the CUDAAllocatorConfig at the time of the snapshot in the snapshot file. This adds the variables:

```
  double garbage_collection_threshold;
  size_t max_split_size;
  size_t pinned_num_register_threads;
  bool expandable_segments;
  bool release_lock_on_cudamalloc;
  bool pinned_use_cuda_host_register;
  std::string last_allocator_settings;
  std::vector<size_t> roundup_power2_divisions;
```

Test Plan:
`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True ` produces
```
{'PYTORCH_CUDA_ALLOC_CONF': 'expandable_segments:True',
 'max_split_size': -1,
 'garbage_collection_threshold': 0.0,
 'expandable_segments': True,
 'pinned_num_register_threads': 1,
 'release_lock_on_cudamalloc': False,
 'pinned_use_cuda_host_register': False,
 'roundup_power2_divisions': {'1': 0,
  '2': 0,
  '4': 0,
  '8': 0,
  '16': 0,
  '32': 0,
  '64': 0,
  '128': 0,
  '256': 0,
  '512': 0,
  '1024': 0,
  '2048': 0,
  '4096': 0,
  '8192': 0,
  '16384': 0,
  '32768': 0}}
```
`PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]"` produces
```
{'PYTORCH_CUDA_ALLOC_CONF': 'max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]',
 'max_split_size': 2097152000,
 'garbage_collection_threshold': 0.0,
 'expandable_segments': False,
 'pinned_num_register_threads': 1,
 'release_lock_on_cudamalloc': False,
 'pinned_use_cuda_host_register': False,
 'roundup_power2_divisions': {'1': 1, '2': 1, '4': 1, '8': 1, '16': 1, '32': 1, '64': 1, '128': 1, '256': 1, '512': 2, '1024': 8, '2048': 8, '4096': 8, '8192': 8, '16384': 8, '32768': 8}
}
```
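A hedged sketch of pulling a snapshot that should now carry these allocator settings; the exact metadata keys are an assumption here:

```python
import torch

x = torch.randn(1024, 1024, device="cuda")           # make the allocator do some work
snap = torch.cuda.memory._snapshot()                  # dict with segments, traces, and (now) allocator config
torch.cuda.memory._dump_snapshot("snapshot.pickle")   # file consumed by the memory-viz tooling
```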

Differential Revision: D53536199

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119404
Approved by: https://github.com/zdevito
2024-02-17 01:16:37 +00:00
9aa8bbf7f2 [BE] Delete C10_IS_TRIVIALLY_COPYABLE (#120120)
It's not used anywhere in PyTorch now that the custom implementation of `c10::optional` is gone, and it's not used elsewhere in the org either; see https://github.com/search?type=code&q=C10_IS_TRIVIALLY_COPYABLE+org%3Apytorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120120
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/huydhn
2024-02-17 01:04:30 +00:00
79569d117d Add hpu device support in storage/resize (#119761)
Add hpu device support to:
 - the storage method resize_
 - is_supported_device for fsdp
 - storage in general

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119761
Approved by: https://github.com/mikaylagawarecki
2024-02-17 01:04:27 +00:00
6b63d3bac9 [ONNX][dynamo_export] Adjust to new symbolic shape name format in value_info (#119855)
Bump onnxscript in CI and adjust the test case expectation of the experimental exported shape naming format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119855
Approved by: https://github.com/thiagocrepaldi
2024-02-17 00:51:19 +00:00
cyy
e61c8ef3aa Simplify c10::is_pod implementation and remove unneeded inclusion of C++17.h (#118212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118212
Approved by: https://github.com/albanD
2024-02-17 00:14:09 +00:00
cyy
6952d6ddad [structural binding][4/N] Replace std::tie with structural binding (#120039)
This PR follows https://github.com/pytorch/pytorch/pull/119774; it continues the work of cleaning up std::tie.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120039
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-02-17 00:05:58 +00:00
761fa5d6ec Add FakeTensor support to torch._utils._rebuild_tensor (#108186)
There are two scenarios:

* Scenario 1: The checkpoint was saved with pytorch < 1.6
* Scenario 2: The checkpoint was saved with pytorch >= 1.6

Repro Scenario 1:

```python
from torch._subclasses import fake_tensor
import transformers

fake_mode = fake_tensor.FakeTensorMode()
with fake_mode:
    fake_model = transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2")
```

Error:

```bash
Some weights of the model checkpoint at sshleifer/tiny-gpt2 were not used when initializing GPT2Model: ['lm_head.weight']
- This IS expected if you are initializing GPT2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:463 in           │
│ load_state_dict                                                                                  │
│                                                                                                  │
│    460 │   │   │   )                                                                             │
│    461 │   │   return safe_load_file(checkpoint_file)                                            │
│    462 │   try:                                                                                  │
│ ❱  463 │   │   return torch.load(checkpoint_file, map_location="cpu")                            │
│    464 │   except Exception as e:                                                                │
│    465 │   │   try:                                                                              │
│    466 │   │   │   with open(checkpoint_file) as f:                                              │
│                                                                                                  │
│ /opt/pytorch/torch/serialization.py:1030 in load                                                 │
│                                                                                                  │
│   1027 │   │   │   │   return _legacy_load(opened_file, map_location, _weights_only_unpickler,   │
│   1028 │   │   │   except RuntimeError as e:                                                     │
│   1029 │   │   │   │   raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None           │
│ ❱ 1030 │   │   return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args  │
│   1031                                                                                           │
│   1032                                                                                           │
│   1033 # Register pickling support for layout instances such as                                  │
│                                                                                                  │
│ /opt/pytorch/torch/serialization.py:1258 in _legacy_load                                         │
│                                                                                                  │
│   1255 │   _sys_info = pickle_module.load(f, **pickle_load_args)                                 │
│   1256 │   unpickler = UnpicklerWrapper(f, **pickle_load_args)                                   │
│   1257 │   unpickler.persistent_load = persistent_load                                           │
│ ❱ 1258 │   result = unpickler.load()                                                             │
│   1259 │                                                                                         │
│   1260 │   deserialized_storage_keys = pickle_module.load(f, **pickle_load_args)                 │
│   1261                                                                                           │
│                                                                                                  │
│ /opt/pytorch/torch/_utils.py:201 in _rebuild_tensor_v2                                           │
│                                                                                                  │
│   198 def _rebuild_tensor_v2(                                                                    │
│   199 │   storage, storage_offset, size, stride, requires_grad, backward_hooks, metadata=None    │
│   200 ):                                                                                         │
│ ❱ 201 │   tensor = _rebuild_tensor(storage, storage_offset, size, stride)                        │
│   202 │   tensor.requires_grad = requires_grad                                                   │
│   203 │   if metadata:                                                                           │
│   204 │   │   set_tensor_metadata(tensor, metadata)                                              │
│                                                                                                  │
│ /opt/pytorch/torch/_utils.py:180 in _rebuild_tensor                                              │
│                                                                                                  │
│   177 def _rebuild_tensor(storage, storage_offset, size, stride):                                │
│   178 │   # first construct a tensor with the correct dtype/device                               │
│   179 │   t = torch.tensor([], dtype=storage.dtype, device=storage._untyped_storage.device)      │
│ ❱ 180 │   return t.set_(storage._untyped_storage, storage_offset, size, stride)                  │
│   181                                                                                            │
│   182                                                                                            │
│   183 def get_tensor_metadata(tensor):                                                           │
│                                                                                                  │
│ /opt/pytorch/torch/utils/_stats.py:20 in wrapper                                                 │
│                                                                                                  │
│   17 │   │   if fn.__qualname__ not in simple_call_counter:                                      │
│   18 │   │   │   simple_call_counter[fn.__qualname__] = 0                                        │
│   19 │   │   simple_call_counter[fn.__qualname__] = simple_call_counter[fn.__qualname__] + 1     │
│ ❱ 20 │   │   return fn(*args, **kwargs)                                                          │
│   21 │   return wrapper                                                                          │
│   22                                                                                             │
│                                                                                                  │
│ /opt/pytorch/torch/_subclasses/fake_tensor.py:1160 in __torch_dispatch__                         │
│                                                                                                  │
│   1157 │   def __torch_dispatch__(self, func, types, args=(), kwargs=None):                      │
│   1158 │   │   assert self not in _get_current_dispatch_mode_stack(), func                       │
│   1159 │   │   try:                                                                              │
│ ❱ 1160 │   │   │   return self.dispatch(func, types, args, kwargs)                               │
│   1161 │   │   except TypeError:                                                                 │
│   1162 │   │   │   log.exception("fake tensor raised TypeError")                                 │
│   1163 │   │   │   raise                                                                         │
│                                                                                                  │
│ /opt/pytorch/torch/_subclasses/fake_tensor.py:1318 in dispatch                                   │
│                                                                                                  │
│   1315 │   │                                                                                     │
│   1316 │   │   # we are falling through to running non constant tensors, any input constant tha  │
│   1317 │   │   # is written to must be invalidated                                               │
│ ❱ 1318 │   │   self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)   │
│   1319 │   │                                                                                     │
│   1320 │   │   # Try for fastpath                                                                │
│   1321 │   │   if has_symbolic_sizes:                                                            │
│                                                                                                  │
│ /opt/pytorch/torch/_subclasses/fake_tensor.py:1557 in invalidate_written_to_constants            │
│                                                                                                  │
│   1554 │   │   any_constant = any(e.constant is not None for e in flat_arg_fake_tensors)         │
│   1555 │   │   if any_constant and get_schema_info(func).is_mutable():                           │
│   1556 │   │   │   schema_info = get_schema_info(func)                                           │
│ ❱ 1557 │   │   │   _, new_kwargs = normalize_function(                                           │
│   1558 │   │   │   │   func, args=args, kwargs=kwargs, normalize_to_only_use_kwargs=True         │
│   1559 │   │   │   )                                                                             │
│   1560 │   │   │   for k, v in new_kwargs.items():                                               │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:297 in normalize_function                              │
│                                                                                                  │
│   294 │   │   new_args_and_kwargs = _args_kwargs_to_normalized_args_kwargs(sig, args, kwargs,    │
│   295 │   else:                                                                                  │
│   296 │   │   assert callable(target)                                                            │
│ ❱ 297 │   │   torch_op_schemas = get_signature_for_torch_op(target)                              │
│   298 │   │   matched_schemas = []                                                               │
│   299 │   │   if torch_op_schemas:                                                               │
│   300 │   │   │   # Iterate through all of the schema until we find one that matches             │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:167 in get_signature_for_torch_op                      │
│                                                                                                  │
│   164 │   │   │   return (None, None) if return_schemas else None                                │
│   165 │   │   schemas = torch._C._jit_get_schemas_for_operator(aten_fn)                          │
│   166 │                                                                                          │
│ ❱ 167 │   signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]          │
│   168 │   return (signatures, schemas) if return_schemas else signatures                         │
│   169                                                                                            │
│   170 @compatibility(is_backward_compatible=False)                                               │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:167 in <listcomp>                                      │
│                                                                                                  │
│   164 │   │   │   return (None, None) if return_schemas else None                                │
│   165 │   │   schemas = torch._C._jit_get_schemas_for_operator(aten_fn)                          │
│   166 │                                                                                          │
│ ❱ 167 │   signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]          │
│   168 │   return (signatures, schemas) if return_schemas else signatures                         │
│   169                                                                                            │
│   170 @compatibility(is_backward_compatible=False)                                               │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:70 in _torchscript_schema_to_signature                 │
│                                                                                                  │
│    67 │   from inspect import Parameter                                                          │
│    68 │   parameters : List[Parameter] = []                                                      │
│    69 │   for arg in ts_schema.arguments:                                                        │
│ ❱  70 │   │   arg_type = _torchscript_type_to_python_type(arg.type)                              │
│    71 │   │   default = arg.default_value if arg.has_default_value() else Parameter.empty        │
│    72 │   │   # TODO: Figure out if this is safe. It seems like when generating the type signa   │
│    73 │   │   # PythonArgParser, we emit signatures with `input` instead of `self` as the firs   │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:64 in _torchscript_type_to_python_type                 │
│                                                                                                  │
│    61 │   eval'ing the annotation_str. _type_eval_globals sets up expressions                    │
│    62 │   like "List" and "Future" to map to actual types (typing.List and jit.Future)           │
│    63 │   """                                                                                    │
│ ❱  64 │   return eval(ts_type.annotation_str, _type_eval_globals)                                │
│    65                                                                                            │
│    66 def _torchscript_schema_to_signature(ts_schema : torch._C.FunctionSchema) -> inspect.Sig   │
│    67 │   from inspect import Parameter                                                          │
│ <string>:1 in <module>                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
NameError: name 'Storage' is not defined

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:467 in           │
│ load_state_dict                                                                                  │
│                                                                                                  │
│    464 │   except Exception as e:                                                                │
│    465 │   │   try:                                                                              │
│    466 │   │   │   with open(checkpoint_file) as f:                                              │
│ ❱  467 │   │   │   │   if f.read(7) == "version":                                                │
│    468 │   │   │   │   │   raise OSError(                                                        │
│    469 │   │   │   │   │   │   "You seem to have cloned a repository without having git-lfs ins  │
│    470 │   │   │   │   │   │   "git-lfs and run `git lfs install` followed by `git lfs pull` in  │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/codecs.py:322 in decode                                       │
│                                                                                                  │
│    319 │   def decode(self, input, final=False):                                                 │
│    320 │   │   # decode input (taking the buffer into account)                                   │
│    321 │   │   data = self.buffer + input                                                        │
│ ❱  322 │   │   (result, consumed) = self._buffer_decode(data, self.errors, final)                │
│    323 │   │   # keep undecoded input until the next call                                        │
│    324 │   │   self.buffer = data[consumed:]                                                     │
│    325 │   │   return result                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/pytorch/bug_repro.py:16 in <module>                                                         │
│                                                                                                  │
│   13 fake_model = transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2")                  │
│   14 assert fake_model is not None                                                               │
│   15 with fake_mode:                                                                             │
│ ❱ 16 │   fake_model = transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2")  # raises    │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py:484 in │
│ from_pretrained                                                                                  │
│                                                                                                  │
│   481 │   │   │   )                                                                              │
│   482 │   │   elif type(config) in cls._model_mapping.keys():                                    │
│   483 │   │   │   model_class = _get_model_class(config, cls._model_mapping)                     │
│ ❱ 484 │   │   │   return model_class.from_pretrained(                                            │
│   485 │   │   │   │   pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs,   │
│   486 │   │   │   )                                                                              │
│   487 │   │   raise ValueError(                                                                  │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:2604 in          │
│ from_pretrained                                                                                  │
│                                                                                                  │
│   2601 │   │   if from_pt:                                                                       │
│   2602 │   │   │   if not is_sharded and state_dict is None:                                     │
│   2603 │   │   │   │   # Time to load the checkpoint                                             │
│ ❱ 2604 │   │   │   │   state_dict = load_state_dict(resolved_archive_file)                       │
│   2605 │   │   │                                                                                 │
│   2606 │   │   │   # set dtype to instantiate the model under:                                   │
│   2607 │   │   │   # 1. If torch_dtype is not None, we use that dtype                            │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:479 in           │
│ load_state_dict                                                                                  │
│                                                                                                  │
│    476 │   │   │   │   │   │   "model. Make sure you have saved the model properly."             │
│    477 │   │   │   │   │   ) from e                                                              │
│    478 │   │   except (UnicodeDecodeError, ValueError):                                          │
│ ❱  479 │   │   │   raise OSError(                                                                │
│    480 │   │   │   │   f"Unable to load weights from pytorch checkpoint file for '{checkpoint_f  │
│    481 │   │   │   │   f"at '{checkpoint_file}'. "                                               │
│    482 │   │   │   │   "If you tried to load a PyTorch model from a TF 2.0 checkpoint, please s  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OSError: Unable to load weights from pytorch checkpoint file for '/root/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be/pytorch_model.bin' at
'/root/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set
from_tf=True.
```

Repro scenario 2:

```python
import tempfile
import torch
from torch._subclasses import fake_tensor

class TheModelClass(torch.nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.fc1 = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.fc1(x)

with tempfile.NamedTemporaryFile() as state_dict_file:
    # Create state_dict to be loaded later
    model = TheModelClass()
    torch.save(model.state_dict(), state_dict_file.name)

    fake_mode = fake_tensor.FakeTensorMode()
    with fake_mode:
        # This is where the bug is triggered
        state_dict = torch.load(state_dict_file.name)
```

Error:

```bash
Traceback (most recent call last):
  File "issue_gh_torch_105077.py", line 22, in <module>
    state_dict = torch.load(state_dict_file.name)
  File "/opt/pytorch/torch/serialization.py", line 1014, in load
    return _load(opened_zipfile,
  File "/opt/pytorch/torch/serialization.py", line 1422, in _load
    result = unpickler.load()
  File "/opt/pytorch/torch/_utils.py", line 205, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/_utils.py", line 184, in _rebuild_tensor
    return t.set_(storage._untyped_storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1288, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1468, in dispatch
    self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1733, in invalidate_written_to_constants
    _, new_kwargs = normalize_function(
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 297, in normalize_function
    torch_op_schemas = get_signature_for_torch_op(target)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in get_signature_for_torch_op
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in <listcomp>
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 70, in _torchscript_schema_to_signature
    arg_type = _torchscript_type_to_python_type(arg.type)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 64, in _torchscript_type_to_python_type
    return eval(ts_type.annotation_str, _type_eval_globals)
  File "<string>", line 1, in <module>
NameError: name 'Storage' is not defined
```

This PR adds the ability to create fake tensors during torch.load (when fake mode is active) by changing the storage's device to 'meta'. A hedged usage sketch follows below.
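The sketch mirrors the scenario-2 repro above; the printed type is what the fix is expected to produce, not verified output:

```python
import tempfile
import torch
from torch._subclasses import fake_tensor

model = torch.nn.Linear(5, 10)
with tempfile.NamedTemporaryFile() as f:
    torch.save(model.state_dict(), f.name)
    with fake_tensor.FakeTensorMode():
        state_dict = torch.load(f.name)    # no longer raises; storages are redirected to 'meta'
print(type(state_dict["weight"]))           # expected: a FakeTensor
```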

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108186
Approved by: https://github.com/ezyang, https://github.com/atalman
2024-02-16 23:42:50 +00:00
7ad4ab4765 Remove unused import (#120004)
Summary: Title

Test Plan: CI

Differential Revision: D53820298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120004
Approved by: https://github.com/zhxchen17, https://github.com/Skylion007
2024-02-16 22:00:44 +00:00
7b1f5c874f [PT2][Optimus][Observability] Log the optimus graph transformation to the scuba (#119745)
Summary: The current everstore upload logging may cause excessive compilation time when the model has lots of graph breaks (post: https://fb.workplace.com/groups/257735836456307/permalink/633533465543207/). Here we log the transformation only when the graph has changed.

Test Plan:
timeout flows:
f528209775
f530084719

Differential Revision: D53692344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119745
Approved by: https://github.com/jackiexu1992
2024-02-16 21:32:04 +00:00
006eead7d2 [dynamo][functional_collectives] Add all_to_all_single, all_gather_list, reduce_scatter_list to dynamo remapping (#119683)
Differential Revision: [D53758434](https://our.internmc.facebook.com/intern/diff/D53758434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119683
Approved by: https://github.com/ezyang
2024-02-16 21:28:39 +00:00
4f4629d522 [Dynamo] Fix ListIteratorVariable repr to avoid log flooding (#120053)
This issue was found in a Meta-internal use case.
Before:
```
V0215 18:33:41.761000 140489262883968 torch/_dynamo/symbolic_convert.py:682 [0/0] TRACE starts_line /data/users/ybliang/debug/debug4.py:11 in <listcomp> (f) (inline depth: 1)
V0215 18:33:41.761000 140489262883968 torch/_dynamo/symbolic_convert.py:682 [0/0]         a = [sum(x) for x in result]
V0215 18:33:41.761000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE BUILD_LIST 0 []
V0215 18:33:41.761000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST .0 [ListVariable()]
V0215 18:33:41.762000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE FOR_ITER 18 [ListVariable(), ListIteratorVariable([LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=0)]
V0215 18:33:41.762000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE STORE_FAST x [ListVariable(), ListIteratorVariable([LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1), LazyVariableTracker()]
V0215 18:33:41.762000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_GLOBAL sum [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1)]
V0215 18:33:41.763000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST x [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1), BuiltinVariable(sum)]
V0215 18:33:41.763000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE CALL_FUNCTION 1 [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1), BuiltinVariable(sum), ListVariable()]
V0215 18:33:41.764000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LIST_APPEND 2 [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1), ConstantVariable(int: 50)]
V0215 18:33:41.765000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE JUMP_ABSOLUTE 4 [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1)]
V0215 18:33:41.765000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE FOR_ITER 18 [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1)]
V0215 18:33:41.765000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE STORE_FAST x [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2), LazyVariableTracker()]
V0215 18:33:41.765000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_GLOBAL sum [ListVariable(), ListIteratorVariable([ListVariable(), ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2)]
V0215 18:33:41.765000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST x [ListVariable(), ListIteratorVariable([ListVariable(), ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2), BuiltinVariable(sum)]
V0215 18:33:41.766000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE CALL_FUNCTION 1 [ListVariable(), ListIteratorVariable([ListVariable(), ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2), BuiltinVariable(sum), ListVariable()]
V0215 18:33:41.766000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LIST_APPEND 2 [ListVariable(), ListIteratorVariable([ListVariable(), ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2), ConstantVariable(int: 68)]
V0215 18:33:41.767000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE JUMP_ABSOLUTE 4 [ListVariable(), ListIteratorVariable([ListVariable(), ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2)]
```
After:
```
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:682 [0/0] TRACE starts_line /data/users/ybliang/debug/debug4.py:11 in <listcomp> (f) (inline depth: 1)
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:682 [0/0]         a = [sum(x) for x in result]
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE BUILD_LIST 0 []
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST .0 [ListVariable()]
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE FOR_ITER 18 [ListVariable(), ListIteratorVariable(length=10, index=0)]
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE STORE_FAST x [ListVariable(), ListIteratorVariable(length=10, index=1), LazyVariableTracker()]
V0215 18:27:57.902000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_GLOBAL sum [ListVariable(), ListIteratorVariable(length=10, index=1)]
V0215 18:27:57.902000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST x [ListVariable(), ListIteratorVariable(length=10, index=1), BuiltinVariable(sum)]
V0215 18:27:57.903000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE CALL_FUNCTION 1 [ListVariable(), ListIteratorVariable(length=10, index=1), BuiltinVariable(sum), ListVariable()]
V0215 18:27:57.903000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LIST_APPEND 2 [ListVariable(), ListIteratorVariable(length=10, index=1), ConstantVariable(int: 55)]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE JUMP_ABSOLUTE 4 [ListVariable(), ListIteratorVariable(length=10, index=1)]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE FOR_ITER 18 [ListVariable(), ListIteratorVariable(length=10, index=1)]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE STORE_FAST x [ListVariable(), ListIteratorVariable(length=10, index=2), LazyVariableTracker()]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_GLOBAL sum [ListVariable(), ListIteratorVariable(length=10, index=2)]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST x [ListVariable(), ListIteratorVariable(length=10, index=2), BuiltinVariable(sum)]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE CALL_FUNCTION 1 [ListVariable(), ListIteratorVariable(length=10, index=2), BuiltinVariable(sum), ListVariable()]
V0215 18:27:57.905000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LIST_APPEND 2 [ListVariable(), ListIteratorVariable(length=10, index=2), ConstantVariable(int: 64)]
V0215 18:27:57.905000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE JUMP_ABSOLUTE 4 [ListVariable(), ListIteratorVariable(length=10, index=2)]
V0215 18:27:57.905000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE FOR_ITER 18 [ListVariable(), ListIteratorVariable(length=10, index=2)]
V0215 18:27:57.905000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE STORE_FAST x [ListVariable(), ListIteratorVariable(length=10, index=3), LazyVariableTracker()]
V0215 18:27:57.906000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_GLOBAL sum [ListVariable(), ListIteratorVariable(length=10, index=3)]
V0215 18:27:57.906000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST x [ListVariable(), ListIteratorVariable(length=10, index=3), BuiltinVariable(sum)]
V0215 18:27:57.906000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE CALL_FUNCTION 1 [ListVariable(), ListIteratorVariable(length=10, index=3), BuiltinVariable(sum), ListVariable()]
V0215 18:27:57.907000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LIST_APPEND 2 [ListVariable(), ListIteratorVariable(length=10, index=3), ConstantVariable(int: 56)]
```
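A minimal sketch of the compact repr shown above, using a stand-in class rather than the actual dynamo ListIteratorVariable:

```python
class ListIteratorVariable:
    def __init__(self, items, index=0):
        self.items = items
        self.index = index

    def __repr__(self):
        # Log only length and position instead of every contained VariableTracker.
        return f"{type(self).__name__}(length={len(self.items)}, index={self.index})"

print(ListIteratorVariable(list(range(10)), index=1))
# ListIteratorVariable(length=10, index=1)
```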

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120053
Approved by: https://github.com/williamwen42
2024-02-16 21:19:37 +00:00
26343451be DTensor: make tensor_flatten more compatible for dynamo getattr (#118209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118209
Approved by: https://github.com/ezyang, https://github.com/wanchaol
ghstack dependencies: #117667, #117666
2024-02-16 21:16:07 +00:00
ee7bcf23db dynamo: support attribute access on tensor subclasses without sources (#117666)
Fixes https://github.com/pytorch/pytorch/issues/117596

This was needed for Float8Tensor. Before this PR, dynamo would sometimes handle attribute access on tensor subclasses correctly, but it would choke on tensor subclasses with no source (it would fall back to using a `GetAttrVariable` to represent the attribute access, which is a problem if the attribute is a tensor that we later want to call tensor methods on).

I supported two cases:

(1) the attribute is a tensor, which is part of the `attrs` returned by the subclass's `__tensor_flatten__`. This creates a `TensorVariable`
(2) the attribute is a constant, which is part of the constant metadata returned by `__tensor_flatten__`. As per the contract of tensor_flatten, this should be a `ConstantVariable`. It could be possible that we allow non-constant metadata in the future, but we don't support that today.
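A minimal subclass sketch illustrating the two cases (dispatch and error handling are omitted; the class and attribute names are illustrative, not Float8Tensor):

```python
import torch

class ScaledTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, elem, scale):
        t = torch.Tensor._make_wrapper_subclass(cls, elem.shape, dtype=elem.dtype, device=elem.device)
        t._elem = elem     # inner tensor -> case (1), becomes a TensorVariable
        t._scale = scale   # constant metadata -> case (2), becomes a ConstantVariable
        return t

    def __tensor_flatten__(self):
        return ["_elem"], {"scale": self._scale}

    @staticmethod
    def __tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride):
        return ScaledTensor(inner_tensors["_elem"], ctx["scale"])
```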

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117666
Approved by: https://github.com/zou3519
ghstack dependencies: #117667
2024-02-16 21:16:07 +00:00
67f6aca0d0 dynamo: respect autograd.Function + multiple save_for_backward calls (#117667)
Fixes https://github.com/pytorch/pytorch/issues/117652, a corner case that I hit while debugging some Float8 issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117667
Approved by: https://github.com/ezyang, https://github.com/zou3519
2024-02-16 21:16:07 +00:00
4ac857f94e Support broadcast in native funcol (#119229)
### Summary

@LucasLLC recently implemented `broadcast` in funcol, but it is not yet available in the native funcol ops. This PR adds broadcast support to native funcol.

- Added `_c10d_functional::broadcast` and `_c10d_functional::broadcast_`
- Integrated with python functol broadcast and `AsyncCollectiveTensor`
- Implemented Inductor lowering. Verified correctness and buffer reuse behavior
- Validated dynamo traceability
- Validated AOTInductor compile-ability
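A hedged usage sketch of the python funcol entry point this wires up (requires an initialized process group; the argument names are an assumption):

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

def broadcast_from_rank0(t: torch.Tensor) -> torch.Tensor:
    # Returns an AsyncCollectiveTensor that synchronizes on first real use.
    return funcol.broadcast(t, src=0, group=dist.group.WORLD)
```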

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119229
Approved by: https://github.com/wanchaol
ghstack dependencies: #119104
2024-02-16 21:01:34 +00:00
24d5caba6e [EZ] Fix argument parsing in build_with_debinfo (#120088)
`nargs="?"` accept 0 or 1 argument, but `nargs="*"` accepts 0 or any number of arguments, which is the intended behavior of the tool

Test plan: Run `python tools/build_with_debinfo.py aten/src/ATen/native/cpu/BlasKernel.cpp aten/src/ATen/native/BlasKernel.cpp` and observe that it generates torch_cpu with those two files containing debug information

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120088
Approved by: https://github.com/Skylion007
2024-02-16 20:06:52 +00:00
2d4aa91a10 Fix searchsorted function signature in docs (#120086)
Side should be an optional string, to match the definition in native_functions: fbe8e0f92d/aten/src/ATen/native/native_functions.yaml (L11246)
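For reference, `side` as a string keyword (the values chosen here are illustrative):

```python
import torch

boundaries = torch.tensor([1, 3, 5, 7, 9])
values = torch.tensor([3, 6, 9])
torch.searchsorted(boundaries, values)                # tensor([1, 3, 4])  (default side='left')
torch.searchsorted(boundaries, values, side="right")  # tensor([2, 3, 5])
```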

Fixes https://github.com/pytorch/pytorch/issues/119999

Test plan: https://docs-preview.pytorch.org/pytorch/pytorch/120086/generated/torch.searchsorted.html#torch-searchsorted

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120086
Approved by: https://github.com/lezcano
2024-02-16 20:00:04 +00:00
288d1f3698 [Optim][Rprop] Replace new().resize_as_() by torch.full_like() (#119978)
As titled.
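The pattern being replaced, shown as a hedged standalone illustration (not the actual Rprop code):

```python
import torch

p = torch.randn(3, 4)
old = p.new().resize_as_(p).fill_(0.01)   # old pattern: empty tensor, resize, fill
new = torch.full_like(p, 0.01)            # new pattern: single call, same dtype/device/shape
assert torch.equal(old, new)
```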
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119978
Approved by: https://github.com/janeyx99
2024-02-16 19:54:04 +00:00
6ea4480818 [quant][pt2e] Add model_is_exported util function (#119726)
Summary: This commit adds the `model_is_exported` util function
for users to be able to easily tell what APIs to call to move
their models between train and eval modes. This has the
additional advantage of hiding the implementation of how we
detect a model is exported, in case the metadata format changes
in the future.
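A hedged usage sketch; the import path below is an assumption about where the util lands:

```python
import torch
from torch.ao.quantization.pt2e.export_utils import model_is_exported  # path is an assumption

class M(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.dropout(x, training=self.training)

exported = torch.export.export(M(), (torch.randn(3),)).module()
if model_is_exported(exported):
    # Exported graphs need the dedicated move_exported_model_to_* APIs rather than .eval()/.train().
    print("use the exported-model train/eval helpers")
```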

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_model_is_exported

Differential Revision: [D53812972](https://our.internmc.facebook.com/intern/diff/D53812972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119726
Approved by: https://github.com/tugsbayasgalan, https://github.com/albanD
2024-02-16 19:29:36 +00:00
312ce35c1f Rename singleton int to nested int (#119661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119661
Approved by: https://github.com/ezyang
2024-02-16 19:21:17 +00:00
b97fa6ac30 Make roll a decomposition and remove its lowering (#119857)
We use the fact that we now propagate indexing properly to avoid having
to maintain two different implementations of the op. Doing this we also remove
a spurious guard on this op.

We move the ref into a decomp as we now use advanced indexing.
The only difference in the implementation is that we now use
advanced indexing rather than `torch.cat`.

We also remove it from core. Let's see how this goes.
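A hedged reference sketch of roll via advanced indexing (`index_select` stands in for the gather-style indexing; this is not the actual decomposition):

```python
import torch

def roll_ref(x: torch.Tensor, shift: int, dim: int) -> torch.Tensor:
    size = x.size(dim)
    if size == 0:
        return x.clone()
    idx = (torch.arange(size, device=x.device) - shift) % size   # wrap-around index math
    return x.index_select(dim, idx)

x = torch.arange(6).reshape(2, 3)
assert torch.equal(roll_ref(x, 1, 1), torch.roll(x, 1, 1))
```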

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119857
Approved by: https://github.com/peterbell10, https://github.com/larryliu0820
ghstack dependencies: #119863, #119864
2024-02-16 19:14:39 +00:00
8b02d64197 Correct index propagation for % (#119864)
The current index propagation transformed % into `fmod`, which was
incorrect. We now perform the index propagation only in the most common
case, where it is correct to do so.
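The distinction, for reference: Python-style modulo follows the divisor's sign while fmod follows the dividend's, so they disagree on negative operands.

```python
import math
import torch

print(-7 % 3)                                  # 2    (modulo: sign of the divisor)
print(math.fmod(-7, 3))                        # -1.0 (fmod: sign of the dividend)
print(torch.remainder(torch.tensor(-7), 3))    # tensor(2)
print(torch.fmod(torch.tensor(-7), 3))         # tensor(-1)
```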

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119864
Approved by: https://github.com/peterbell10
ghstack dependencies: #119863
2024-02-16 19:14:39 +00:00
00524970e8 Simplify indexing when doing ModularIndexing + index propagation. (#119863)
We now avoid creating an unnecessary ternary operator in a reasonably
common case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119863
Approved by: https://github.com/peterbell10
2024-02-16 19:14:39 +00:00
86dedebeaf Revert "Add pixel_shuffle to core aten decomps (#119899)"
This reverts commit 9201d7335a25d9a91e10c1914c399419af0bd7c3.

Reverted https://github.com/pytorch/pytorch/pull/119899 on behalf of https://github.com/huydhn due to Sorry for reverting your change but keep the diff D53766709 around while investigating the failed tests is not a good practice and could lead to out of sync issue, so it is better to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/119899#issuecomment-1948970686))
2024-02-16 17:44:59 +00:00
b10ae9e54c [pytree] Properly register immutable collections (#120036)
Summary:
Getting error like:
```
No registered serialization name for <class 'torch.fx.immutable_collections.immutable_dict'> found. Please update your _register_pytree_node call with a `serialized_type_name` kwarg.
```

Reviewed By: suo

Differential Revision: D53833323

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120036
Approved by: https://github.com/SherlockNoMad
2024-02-16 17:39:12 +00:00
124c251510 Guarantee init cuda before attaching hooks (#120052)
Summary: If cuda is not initialized before calling attachAllocatorTraceTracker, then the CudaCachingAllocator device_allocator is empty, which means that the registration hooks are not set up. As a result, a new segment_alloc will not be registered, causing an expensive dynamic registration each time the segment is used. The fix is to guarantee that cuda is initialized before attaching the hooks. If cuda is already initialized, then lazyInitCUDA is a no-op.

Test Plan:
Testing this on fsdp+tp example model where cuda is not initialized before init_process_group.

Job without the fix keeps dynamically registering:
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-j544j0vn7zqh4c?job_attempt=0&version=0&env=PRODUCTION
The following keeps looping:
[0]:2024-02-14T10:48:18.873079 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: registered buffer 0x7f6ebe000000 len 608124000, state 1
[0]:2024-02-14T10:48:18.873087 twshared0039:4836:6232 [0] NCCL INFO *dynamicRegist = true
[0]:2024-02-14T10:48:18.903234 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregister buffer 0x7f6ebe000000 len 608124000, state 1
[0]:2024-02-14T10:48:18.903240 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregiter buffer 0x7f6ebe000000 len 608124000

Job with the fix does not have this issue:
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-hzm5dwqncr7l7?version=0&env=PRODUCTION

Reviewed By: minsii, kwen2501, xw285cornell

Differential Revision: D53770989

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120052
Approved by: https://github.com/kwen2501
2024-02-16 17:36:53 +00:00
fbe8e0f92d Fix missing right square bracket to match glog format (#119966)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119966
Approved by: https://github.com/oulgen
ghstack dependencies: #119869
2024-02-16 15:14:00 +00:00
9726d7ca8e Add lowering for logcumsumexp (#118753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118753
Approved by: https://github.com/peterbell10
ghstack dependencies: #119809
2024-02-16 14:04:38 +00:00
3f4dd9bfa4 Back out "[pytree] Require serialized_type_name" (#120041)
Summary:
D53785493 breaks apf.rec.ir.tests.ir_export_deserialize_test.IRExportDeserializeTest: test_export_deserialize_ebc failed:

https://www.internalfb.com/sandcastle/workflow/3436246515685789584

Test Plan: buck2 test mode/opt apf/rec/ir/tests:ir_export_deserialize_test

Differential Revision: D53834881

Co-authored-by: Wilson Hong <wilsonhong@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120041
Approved by: https://github.com/ydwu4
2024-02-16 10:02:25 +00:00
4625ecb858 Add decomp for linalg.cross (#119809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119809
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-02-16 09:58:38 +00:00
3693d8f467 Do not convert UnsupportedFakeTensorException into RuntimeError in runNode for proper graph breaking. (#120026)
Fix: https://github.com/pytorch/pytorch/issues/119779 by properly graph breaking; a proper fix for a complete solution would be to handle quantized tensors.

If an UnsupportedFakeTensorException is thrown when generating a fake tensor, it is handled and converted into an
Unimplemented inside wrap_fake_exception, which is then translated to a graph break.

However, run_node used to convert UnsupportedFakeTensorException into a runtime error, creating runtime
errors instead of graph breaks whenever generating a fake tensor for a quantized tensor failed.
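A self-contained sketch of the control flow described above, with stand-in exception types (not the actual dynamo code):

```python
class UnsupportedFakeTensorException(Exception):  # stand-in for the real exception
    pass

class Unsupported(Exception):                     # stand-in for dynamo's graph-break signal
    pass

def wrap_fake_exception(fn):
    try:
        return fn()
    except UnsupportedFakeTensorException as e:
        # Handled here: converted into a graph break rather than propagated.
        raise Unsupported(f"fake tensor creation failed: {e}") from e

def run_node(make_fake):
    # After this PR: no blanket re-wrap into RuntimeError, so the graph break survives.
    return wrap_fake_exception(make_fake)
```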

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120026
Approved by: https://github.com/jansel
2024-02-16 09:21:58 +00:00
54025c01a7 [DCP][state_dict] Let distributed_state_dict filter out the compiler prefix (#119830)
Let distributed_state_dict filter out the compiler prefix
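A hedged illustration of what stripping the compiler prefix means for a `torch.compile`-wrapped module's keys; the dict below is made up, and `_orig_mod.` is the usual prefix:

```python
COMPILER_PREFIX = "_orig_mod."

raw = {"_orig_mod.fc.weight": "w", "_orig_mod.fc.bias": "b"}
clean = {k.removeprefix(COMPILER_PREFIX): v for k, v in raw.items()}
print(clean)  # {'fc.weight': 'w', 'fc.bias': 'b'}
```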

Differential Revision: [D53681864](https://our.internmc.facebook.com/intern/diff/D53681864/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119830
Approved by: https://github.com/wz337
2024-02-16 08:59:58 +00:00
bc7f3efb09 [aot_inductor] move CppWrapperCodeGen into a separate file (#119871)
This reverts commit d8e319a961bb872027f0abdc413d6beb7502ac9b.

Differential Revision: [D53817853](https://our.internmc.facebook.com/intern/diff/D53817853)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119871
Approved by: https://github.com/albanD, https://github.com/khabinov
ghstack dependencies: #119870
2024-02-16 08:14:20 +00:00
78c9b2948a [aot_inductor] move CudaWrapperCodeGen into a separate file (#119870)
This reverts commit 3ab08946d5052eaeda11d683d6a58e801a032755.

Differential Revision: [D53817852](https://our.internmc.facebook.com/intern/diff/D53817852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119870
Approved by: https://github.com/khabinov
2024-02-16 08:10:51 +00:00
8f9f12c068 Intel GPU Runtime Upstreaming for Device Allocator (#118091)
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream to PyTorch the key functionality of the device `Allocator` dedicated to XPU and, following our design, prepare to generalize `Allocator` in parallel.

# Design
In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below.
<p align="center">
<img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218">
</p>
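A hedged usage sketch of what this allocator serves once the XPU backend is built; the availability of `torch.xpu` is an assumption in this snippet:

```python
import torch

if hasattr(torch, "xpu") and torch.xpu.is_available():
    a = torch.empty(1024, 1024, device="xpu")  # allocation routed through XPUAllocator
    del a                                       # block returned to the per-device cache, not to the driver
```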

# Additional Context
We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU; the second PR covers the host `Allocator`.
Besides these PRs, we plan to make the device `Allocator` device-agnostic through another PR.
In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expandable segments and statistics. We will add these features back in the subsequent PR, which intends to generalize `Allocator`.

The differences with CUDA:
only the key functionality is included; it lacks AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshot, memory statistics, expandable segments...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #117611, #117619, #117734
2024-02-16 06:46:00 +00:00
b8be8b639f Add Runtime Constant-Folding function of AOTInductor for AOTInductorModels used internally. (#119823)
Summary:
1. Make sure folded constants generated internally don't get exposed.
2. Add runConstantFolding and related API calls

Test Plan:
```buck2 run mode/opt-split-dwarf -c fbcode.nvcc_arch=v100,a100 caffe2/caffe2/fb/predictor/tests_gpu:pytorch_predictor_container_gpu_test -- --gtest_filter=*PyTorchPredictorContainerTest.LoadAOTInductorModel*
```
The test triggers the added predictor tests `test_aot_inductor_merge_net_file_*.predictor_20240206`,
which would trigger runConstantFolding from predictor's module loading.

Reviewed By: SherlockNoMad

Differential Revision: D53718139

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119823
Approved by: https://github.com/chenyang78
2024-02-16 06:45:48 +00:00
4dc75f9084 Intel GPU Runtime Upstreaming for Event (#117734)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event`, which tracks the status of an operation that is being executed. In some circumstances, `Event` gives us fine-grained control over operation execution.

# Design
`XPUEvent` is a movable but not copyable wrapper around a SYCL event. It is created lazily on an XPU device when recording an `XPUStream`. Meanwhile, an `XPUEvent` can wait for another `XPUEvent` or for all the kernels submitted to an `XPUStream` to complete. Aligned with the other backends, the C++ files related to `Event` will be placed in the `aten/src/ATen/xpu` folder. For the frontend, the `XPUEvent` runtime API will be bound to Python as `torch.xpu.Event`. The corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and the Python code in `torch/xpu/streams.py`.
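
A hedged usage sketch of the intended Python surface (assumes an XPU-enabled build and the stream bindings from the earlier PRs in this stack; it mirrors the `torch.cuda.Event` pattern and omits `elapsed_time`):

```python
# Hedged sketch of recording and waiting on a torch.xpu.Event.
import torch

stream = torch.xpu.Stream()
event = torch.xpu.Event()

with torch.xpu.stream(stream):
    y = torch.randn(1024, device="xpu").sin()
    event.record(stream)       # the event is created lazily on first record

event.synchronize()            # host waits for the work captured by the event
print(event.query())           # True once the recorded work has completed
```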

# Additional Context
It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`; we will add support for it soon. `XPUEvent` also doesn't support IPC across processes. Other than that, we have an almost 1:1 mapping with CUDA.

It lacks the following APIs:
- `torch.cuda.Event.ipc_handle`
- `CUDAEvent`'s constructor with `IpcEventHandle`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117611, #117619
2024-02-16 06:28:26 +00:00
02fb043522 Change native funcol inductor tests to use fake pg (#119104)
Summary:
Previously these tests require more than 2 GPUs to run. Changing them to use fake pg so they can run more often.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119104
Approved by: https://github.com/wconstab
ghstack dependencies: #119103
2024-02-16 05:18:45 +00:00
62e5840b36 [Dynamo] Do not create TorchInGraphFunctionVariable for tags (#120005)
Fixes https://github.com/pytorch/pytorch/issues/119793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120005
Approved by: https://github.com/yanboliang
2024-02-16 03:37:32 +00:00
ddde1e4dee [executorch hash update] update the pinned executorch hash (#119943)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119943
Approved by: https://github.com/pytorchbot
2024-02-16 03:36:56 +00:00
4eefe7285a Use ARMV8 fconv insns to speed up scalar fp16<->fp32 (#120012)
Thanks to discussion with @mikekgfb I've realized that FP16_ARITH is the feature available by default on Apple Silicon, so let's use it to speed up the portable but slow bit-mashing algorithm implemented as `c10::detail::fp16_ieee_from_fp32_value` by using the following implicit conversion routine:
```cpp
float sve_fp16_to_fp32_value(uint16_t h) {
  union {
     uint16_t h;
     float16_t f16;
  } x = {h};
  return x.f16;
}
```
that according to the https://godbolt.org/z/8s14GvEjo is turned into [`fcvt s0,h0`](https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/FCVT--Floating-point-Convert-precision--scalar--?lang=en)

As a result, the very slow and naive [`torch.mm`](edd9ddf73f/aten/src/ATen/native/cpu/BlasKernel.cpp (L108)) runs 3x faster: from 85 msec to 27 msec (measured by running e41341df2d/benchmarks/benchmark_torch_mm.py )

This is a reland of https://github.com/pytorch/pytorch/pull/119895 that got reverted because it was not buildable using Jetson toolkit

"Fixed" the problem by guarding the fast conversions with `!defined(__CUDACC__)`  (for internal folks, tested it by running `buck build @arvr/mode/embedded/jetson/linux/opt-stripped //xplat/caffe2:caffe2_ops_cuda_ovrsource` )
I also extended the conversion to all AArch64 platforms, not just the ones that support the FP16 arithmetic extensions (i.e. ARMv8.2).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120012
Approved by: https://github.com/huydhn
2024-02-16 03:04:06 +00:00
3e5e8590f4 Account for inference mode in FakeTensor cache (#119963)
Summary: an fbcode test exposed a shortcoming where we serve a FakeTensor from the cache with the wrong inference_mode. Take the current mode into account in the cache key so we only serve entries created under the same mode we're currently in.
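
A minimal sketch of the idea, assuming a simplified key builder (the real cache key hashes far more state than shown here):

```python
# Hedged sketch: fold the current inference mode into the FakeTensor cache key
# so entries created under one mode are never served under the other.
import torch

def make_cache_key(op_name, args_fingerprint):
    return (op_name, args_fingerprint, torch.is_inference_mode_enabled())

with torch.inference_mode():
    key_a = make_cache_key("aten.add", ("f32[3]", "f32[3]"))
key_b = make_cache_key("aten.add", ("f32[3]", "f32[3]"))
assert key_a != key_b   # differing modes now map to distinct cache entries
```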

Test Plan: New unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119963
Approved by: https://github.com/eellison
2024-02-16 02:53:33 +00:00
8bfc87ce74 fixed flop counter formula for conv transposed backwards pass (#119874)
Fixes #119806
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119874
Approved by: https://github.com/zou3519
ghstack dependencies: #119521
2024-02-16 02:43:49 +00:00
17c345ebd9 [FSDP] compile compute and CI with @test_compiled_fsdp (#119933)
goal: all unit tests for eager. we want to test torch.compile by default

this PR adds ``@test_compiled_fsdp(compile_compute_on_module=None/TransformerBlock)`` to unit tests. now it's compiling compute-only as follows.

```
module.compile() # include user registered hooks if any
fully_shard(module)
```

torch.compile does not work following component yet
* compiling AC
* compiling reshard_after_forward=2
* delayed_all_gather, delayed_reduce_scatter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119933
Approved by: https://github.com/awgu, https://github.com/jansel
2024-02-16 01:48:51 +00:00
c802c50196 Setup Nvidia Runtime before Indexer (#119923)
Sets up Nvidia Runtime and runs indexer inside a docker container.

Verified this works by running the indexer jobs (all the setup is correct, it OOMs for an unrelated reason, for which a fix is on the way).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119923
Approved by: https://github.com/huydhn
2024-02-16 00:33:18 +00:00
4319735ace Add meta registration for _foreach_norm (2nd try) (#119927)
The first try reused TensorListMetadata, which caused illegal memory access issues when there were too many tensors in the list. Instead, we now launch multiple kernels with a simpler version of the struct (keeping the number of kernels launched to a minimum).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119927
Approved by: https://github.com/albanD
2024-02-16 00:23:23 +00:00
707cde9b31 [DTensor][Test] Improve math_ops test (#118956)
The DTensor fully_shard_tensor was created but not used in shard_math_ops test previously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118956
Approved by: https://github.com/wanchaol
2024-02-15 23:59:25 +00:00
cyy
94f19fe545 [3/N] Replace std::tie with structural binding (#119962)
This PR follows https://github.com/pytorch/pytorch/pull/119774, it is a continued work to clean up std::tie

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119962
Approved by: https://github.com/albanD
2024-02-15 23:48:28 +00:00
2a63dd8889 [Dynamo] Support lazy module with namedtuple/dict input (#119972)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119972
Approved by: https://github.com/jansel
2024-02-15 23:18:18 +00:00
f9f602fcb8 Clean up decorators (#119925)
as title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119925
Approved by: https://github.com/eellison
2024-02-15 22:51:53 +00:00
444c628e06 Include the scalar tensor auto-transfer in the doc (#119967)
Fixes #119609

@albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119967
Approved by: https://github.com/albanD
2024-02-15 22:37:39 +00:00
47300221c2 Revert "[export] Change runtime asserts to using assert_scalar (#119608)"
This reverts commit f4d641ba2fb11fca2ba47f0c425d8a4a1adbffb6.

Reverted https://github.com/pytorch/pytorch/pull/119608 on behalf of https://github.com/huydhn due to This break ONNX trunk job 65fd8b6730 ([comment](https://github.com/pytorch/pytorch/pull/119608#issuecomment-1947436402))
2024-02-15 22:25:24 +00:00
da1df5d7b8 [ROCm] Update triton wheels to ROCm 6.0 (#119765)
Upgrades nightly triton issues to ROCM 6.0 and adds bitcodes for gfx941 and gfx942.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119765
Approved by: https://github.com/jeffdaily, https://github.com/huydhn
2024-02-15 21:57:51 +00:00
3f4f91f2eb [inductor][eazy] fix profiler (#119959)
print_performance previously returned the total execution time for `times` runs, but now it returns the average execution time of a single run. Change the profiler to be consistent with that. I'm not sure there is a good way to add a test for this, though.
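
For illustration, a hypothetical helper showing the before/after semantics (not the actual inductor utility):

```python
# Hedged sketch: return the average per-run time rather than the total across
# `times` runs, matching the new print_performance behavior.
import time

def measure(fn, times=10):
    start = time.perf_counter()
    for _ in range(times):
        fn()
    total = time.perf_counter() - start
    return total / times   # previously the equivalent of `return total`
```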

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119959
Approved by: https://github.com/eellison
2024-02-15 21:47:09 +00:00
65fd8b6730 Revert "[export] Disable exported_program.__call__ (#119466)"
This reverts commit c26884f06345bf61e0843d13db84e76236ff6142.

Reverted https://github.com/pytorch/pytorch/pull/119466 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/119466#issuecomment-1947384298))
2024-02-15 21:42:32 +00:00
744898b311 Add doc page for environment variables that affect PyTorch Runtime (#119087)
# Summary

The goal of this PR is to add a doc page listing a number of environment variables that affect the PyTorch runtime. It will likely not be exhaustive, but hopefully it will be added to and updated to stay relevant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119087
Approved by: https://github.com/janeyx99, https://github.com/eqy
2024-02-15 21:41:38 +00:00
d707e3c9c6 Fix handling none source in build_torch_function_fn (#119724)
Fix https://github.com/pytorch/pytorch/issues/119580

When a UserDefinedObjectVariable is created it does not always have a source, e.g. when it is an intermediate value.
This diff fixes the handling of a None source in two locations during the inlining of a user torch function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119724
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
2024-02-15 21:21:47 +00:00
9548860b37 Fix typo in istft docstring (#119776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119776
Approved by: https://github.com/colesbury
2024-02-15 21:20:00 +00:00
a2f07bb317 Fix typo under docs directory (#119657)
This PR fixes typo under `docs` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119657
Approved by: https://github.com/colesbury
2024-02-15 21:14:34 +00:00
2d7a395c0f Fix typo in functional.py (#119775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119775
Approved by: https://github.com/colesbury
2024-02-15 21:14:29 +00:00
c3b4d78e17 [Dynamo][Easy] Fix a small bug in test_trace_rules.py (#119973)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119973
Approved by: https://github.com/zou3519
2024-02-15 20:44:32 +00:00
b4c7afe101 [pytree] Require serialized_type_name (#119718)
Differential Revision: [D53785493](https://our.internmc.facebook.com/intern/diff/D53785493)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119718
Approved by: https://github.com/suo
2024-02-15 20:32:44 +00:00
f32560c939 Remove Redundant Bullet Point (#120007)
Fast path explanation for scaled_dot_product_attention in nn.MultiHeadAttention mentioned inputs being batched with batch_first = True twice.  Removed the second mention of this requirement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120007
Approved by: https://github.com/mikaylagawarecki
2024-02-15 19:47:35 +00:00
605de946cf Clarify the patience in ReduceLROnPlateau (#119872)
Fixes #119763
@janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119872
Approved by: https://github.com/janeyx99
2024-02-15 19:43:06 +00:00
26b6de43e5 Revert "Use ARMV8.2 scalar fp16<->fp32 conversion (#119895)" (#120001)
This reverts commit d833e2f2364a01c6fdab689a8bb5bbf55a5b60f7.

This is failing some RL builds internally using clang 13 D53791577

https://github.com/pytorch/pytorch/pull/119895#issuecomment-1946859332.  The bot doesn't like a commit being merged into the stack base and fails to revert the PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120001
Approved by: https://github.com/malfet
2024-02-15 19:41:51 +00:00
9b6fae2d79 Tweak to pr#119719 - eager & fullgraph (#119921)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119921
Approved by: https://github.com/oulgen
2024-02-15 19:31:56 +00:00
01ee85c8ab [PyTorch][Vulkan]remove redundant test of log_softmax (#119964)
Summary: `vulkan_api_test.cpp` already has [a test for `log_softmax`](https://www.internalfb.com/code/fbsource/[c79b73bd7d5f661c81ff3cf999cfa1af664f0c48]/xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp?lines=4521), so we remove the redundant `DISABLED_log_softmax`. According to the comment the test was disabled because "the op is not working correctly. Add it back when it is fixed." Actually it's a simple typo mistake: the [CPU output should use `at::log_softmax` instead of `at::softmax`](https://www.internalfb.com/code/fbsource/[c79b73bd7d5f661c81ff3cf999cfa1af664f0c48]/xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp?lines=4548). Since we already have a test for `log_softmax`, the fix isn't necessary and we remove this disabled test.

Test Plan:
Full vulkan_api_test P1184744699:
```
LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan    //xplat/caffe2:pt_vulkan_api_test_bin
...
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 427 tests from VulkanAPITest (23633 ms total)

[----------] Global test environment tear-down
[==========] 427 tests from 1 test suite ran. (23634 ms total)
[  PASSED  ] 426 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```

Reviewed By: jorgep31415

Differential Revision: D53766200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119964
Approved by: https://github.com/jorgep31415
2024-02-15 19:16:56 +00:00
8835ff1b09 [AMD] Update hipify code to oss (#119958)
Summary: Syncing the hipify code to third party. Trunk was broken by multiple diffs D53716382 D53744795

Test Plan: sandcastle

Differential Revision: D53790854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119958
Approved by: https://github.com/jianyuh, https://github.com/drisspg
2024-02-15 19:14:34 +00:00
143b5f2745 Fix the missing device in _memory_profiler (#119751)
Fixes #119722.
1. Added the missing device argument in
```
max_memory_allocated = torch.cuda.max_memory_allocated()
max_memory_reserved = torch.cuda.max_memory_reserved()
```
2. Fixed the device parameter to be a device string (`device_str`). Based on these [lines](2bda6b4cb8/torch/profiler/profiler.py (L291)), the input device is a string (`device_str`) for
```
self.mem_tl.export_memory_timeline_html
self.mem_tl.export_memory_timeline_raw
self.mem_tl.export_memory_timeline
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119751
Approved by: https://github.com/aaronenyeshi
2024-02-15 19:11:15 +00:00
98fd23cccc [EASY] Move OpsHandler and MockHandler to their own file (#119851)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119851
Approved by: https://github.com/lezcano
ghstack dependencies: #119728
2024-02-15 18:54:41 +00:00
6f324e8776 [ATen] Tag isinf as a pointwise op (#119728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119728
Approved by: https://github.com/lezcano
2024-02-15 18:54:41 +00:00
eqy
e386bfa688 [CUDA][cuSPARSE] Work around IMA in cuSPARSE ALG1 on SM 8.9 devices (#119610)
Originally surfaced from the discuss forum:
https://discuss.pytorch.org/t/issue-with-torch-sparse-mm-while-running-on-gpu/188669

This has been forwarded to cuSPARSE but we have not yet received a commitment on their end to fix this issue directly.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119610
Approved by: https://github.com/jeffdaily, https://github.com/jcaip
2024-02-15 18:28:45 +00:00
2429495820 [FSDP2][ez] Made typing more strict to avoid cast (#119985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119985
Approved by: https://github.com/Skylion007, https://github.com/fegin
ghstack dependencies: #118298
2024-02-15 18:20:35 +00:00
840426e793 [export] Log export time. (#119960)
Summary: as title. we are logging the time to complete one export session.

Test Plan: CI

Differential Revision: D53737766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119960
Approved by: https://github.com/angelayi
2024-02-15 18:04:15 +00:00
9b38ee2343 Revert "Alternate sharding (#119078)"
This reverts commit 861acda20577739d52dd0bcf09e162192f25020f.

Reverted https://github.com/pytorch/pytorch/pull/119078 on behalf of https://github.com/clee2000 due to failing 861acda205 ([comment](https://github.com/pytorch/pytorch/pull/119078#issuecomment-1946583857))
2024-02-15 16:59:50 +00:00
a83a1bc43b Adding c10 device type to newly added DeviceAccelerator (#119961)
Follow up to https://github.com/pytorch/pytorch/pull/104364,

A new file got submitted yesterday that uses DeviceType without the c10 namespace. This fixes that. I haven't yet figured out a way to set up a test for this, but I will submit a follow-up PR once I do.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119961
Approved by: https://github.com/ezyang
2024-02-15 14:56:05 +00:00
e5bfdde7ba Fix the skip condition for test_c10d tests (#119938)
We are seeing this error for c10d tests when running on a single GPU. Add a skip when there are insufficient GPUs.

```
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
referring to https://github.com/pytorch/pytorch/pull/84980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119938
Approved by: https://github.com/eqy, https://github.com/fegin
2024-02-15 11:03:39 +00:00
c26884f063 [export] Disable exported_program.__call__ (#119466)
Summary: `ExportedProgram` is an artifact produced by torch.export, containing the graph that is exported, along with other attributes about the original program such as the graph signature, state dict, and constants. One slightly confusing thing that users run into is that they treat the `ExportedProgram` as a `torch.nn.Module`, since the object is callable. However, as we do not plan to support all features that `torch.nn.Module`s have, like hooks, we want to create a distinction between it and the `ExportedProgram` by removing the `__call__` method. Instead users can create a proper `torch.nn.Module` through `exported_program.module()` and use that as a callable.
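
A short usage sketch of the new flow (hedged; this is not code from the diff itself):

```python
# Hedged sketch: obtain a callable nn.Module from an ExportedProgram instead of
# calling the ExportedProgram directly.
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(M(), (torch.randn(2),))
# ep(torch.randn(2))          # no longer supported after this change
mod = ep.module()             # materialize a proper torch.nn.Module
print(mod(torch.randn(2)))
```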

Test Plan: CI

Differential Revision: D53075378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119466
Approved by: https://github.com/zhxchen17, https://github.com/thiagocrepaldi
2024-02-15 08:49:34 +00:00
f4d641ba2f [export] Change runtime asserts to using assert_scalar (#119608)
By changing runtime symbolic asserts to using assert_scalar, the asserts can call into `expect_true` and modify the shape env so that we can run through the traced graph module with fake tensors. With assert_async, the asserts only get hit during runtime, but that means if we run the graph module with fake tensors, the asserts will not affect the shape env, so later data dependent calls to the fake tensors may result in GuardOnDataDependentSymNode errors.

https://github.com/pytorch/pytorch/issues/119587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119608
Approved by: https://github.com/ezyang
2024-02-15 07:13:42 +00:00
c83af673bc Allow CUDA extension builds to skip generating cuda dependencies during compile time (#119936)
nvcc flag `--generate-dependencies-with-compile` doesn't seem to be supported by `sccache` for now. Builds with this flag enabled will not benefit from sccache.

This PR adds an environment variable that allows users to set this flag and skip generating those nvcc dependencies, to speed up their builds with compiler caches. If everything is a "fresh build" in CI, we don't care about unnecessary recompiles during incremental builds.

related: https://github.com/pytorch/pytorch/pull/49344

- [ ] todo: raise an issue to sccache

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119936
Approved by: https://github.com/ezyang
2024-02-15 07:03:59 +00:00
cyy
d4882e438a [DeviceIndex][5/N] Use DeviceIndex in more places (#119866)
This PR follows the series of patches beginning with #119142 and fixes various CUDA related methods to use DeviceIndex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119866
Approved by: https://github.com/Skylion007
2024-02-15 07:01:43 +00:00
cyy
68328ad394 Check existence of caffe2::mkl target (#119945)
Fixes #118862
If libtorch is included multiply times in different sub-folders, linking caffe2::mkl may incur errors like
```
  Cannot specify link libraries for target "caffe2::mkl" which is not built
  by this project.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119945
Approved by: https://github.com/ezyang
2024-02-15 06:28:17 +00:00
0898ead2d5 Timestamp Embedding Indices Generated for TD (#119955)
Timestamps the generated embedding indices. Moves the old indices to an `archived/` folder and then uploads the index to a `latest/` folder. There will be a short period in between these operations where there is no index in `latest/`. To handle this case, any workflow fetching the index (such as the retriever) should use a retry with backoff when copying from S3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119955
Approved by: https://github.com/huydhn
2024-02-15 04:48:40 +00:00
af346df6a0 [PyTorch][Vulkan]fix the issue of log 0 after softmax (#119898)
Summary: In some cases the output of `softmax` are so small that they are below the float16 precision. These values are represented as 0 in float16 and result in `-inf` when log is applied. According to [Wikipedia](https://en.wikipedia.org/wiki/Half-precision_floating-point_format#Exponent_encoding), the minimum strictly positive (subnormal) value is 2^−24 ≈ 5.9605 × 10^−8. Therefore, we add 6 x 10^-8 to the output of softmax to avoid the numerical issue.
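
A hedged numerical illustration of the underflow and the epsilon fix (plain PyTorch, not the Vulkan shader code):

```python
# In float16, softmax outputs below ~5.96e-8 (the smallest subnormal) flush to
# zero, so a following log() yields -inf; adding 6e-8 keeps the result finite.
import torch

x = torch.tensor([0.0, 30.0])
p = torch.softmax(x, dim=0).to(torch.float16)   # exp(-30) underflows to 0 in fp16
print(torch.log(p.float()))                     # log(0) -> -inf
print(torch.log(p.float() + 6e-8))              # finite after adding the epsilon
```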

Test Plan:
We add two tests:
- `log_softmax_underflow_exception` tests the log_softmax without adding epsilon to the output of softmax, so we expect to get nan or -inf. (**NOTE**: this test has passed on both devserver and on Android device, but failed on the `
fbsource//xplat/caffe2:vulkan_ops_testAndroid` test on CI. In this test, `log` of small numbers [even `log 0` shows output -88 instead of `-inf`](https://interncache-cco.fbcdn.net/v/t49.3276-7/379414752_342395058779076_6447867753374424757_n.txt?ccb=1-7&_nc_sid=ce8ad4&efg=eyJ1cmxnZW4iOiJwaHBfdXJsZ2VuX2NsaWVudC9pbnRlcm4vc2l0ZS94L3Rlc3RpbmZyYSJ9&_nc_ht=interncache-cco&oh=00_AfApTdId1WOHUqdoSTc66s6adnrQt1YS0NDT-LDppIvX0g&oe=65D0CC99). We cannot reproduce this error on device now, so we **DISABLE** this test for now to integrate into CI.)
- `log_softmax_underflow` tests the updated implementation of log_softmax, nan and -inf have been removed

## test on devserver

```
luwei@devbig984.prn1 /data/users/luwei/fbsource (9f6b78894)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan    //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*log_softmax_underflow*"
File changed: fbcode//caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
Buck UI: https://www.internalfb.com/buck2/baaaa683-60da-4dd8-95b9-6848fe1d7d74
Network: Up: 53KiB  Down: 1.4MiB  (reSessionID-9580ce4f-7e1e-4c65-8497-52443329b796)
Jobs completed: 6. Time elapsed: 24.2s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 1, local: 1)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *log_softmax_underflow*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ DISABLED ] VulkanAPITest.DISABLED_log_softmax_underflow_exception
[ RUN      ] VulkanAPITest.log_softmax_underflow
[       OK ] VulkanAPITest.log_softmax_underflow (169 ms)
[----------] 1 test from VulkanAPITest (169 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (169 ms total)
[  PASSED  ] 1 test.

  YOU HAVE 1 DISABLED TEST
```

full test results: P1184164670
```
[----------] 428 tests from VulkanAPITest (21974 ms total)

[----------] Global test environment tear-down
[==========] 428 tests from 1 test suite ran. (21974 ms total)
[  PASSED  ] 427 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 11 DISABLED TESTS
```

## test on device:
- build
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (82c91e8da)]$ buck2 build -c ndk.static_linking=true -c pt.enable_qpl=0  --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_api_test_binAndroid  --show-output
```
- push to device and run
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (82c91e8da)]$ adb shell /data/local/tmp/pt_vulkan_api_test_binAndroid --gtest_filter="*log_softmax_underflow*"
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *log_softmax_underflow*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ DISABLED ] VulkanAPITest.DISABLED_log_softmax_underflow_exception
[ RUN      ] VulkanAPITest.log_softmax_underflow
[       OK ] VulkanAPITest.log_softmax_underflow (292 ms)
[----------] 1 test from VulkanAPITest (293 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (294 ms total)
[  PASSED  ] 1 test.

  YOU HAVE 1 DISABLED TEST

```

Reviewed By: yipjustin

Differential Revision: D53694989

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119898
Approved by: https://github.com/jorgep31415
2024-02-15 03:59:44 +00:00
cd08dc37f8 Support tracing native functional collective via python APIs (#119103)
Summary:
- Inlined `torch.distributed.distributed_c10d._get_group_size_by_name`
- Updated all torch.compile tests in test_c10d_functional_native.py to use funcol python APIs (as opposed to the dispatcher ops)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119103
Approved by: https://github.com/wconstab, https://github.com/fegin, https://github.com/wanchaol
2024-02-15 03:33:49 +00:00
cyy
5f9b432494 [2/N] Replace std::tie with structural binding (#119879)
This PR follows #119774; Python-generated code was changed to use structural binding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119879
Approved by: https://github.com/albanD
2024-02-15 02:56:34 +00:00
9ff9798716 Fix a bug in kernel analysis with ttir defined args (#119934)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119934
Approved by: https://github.com/aakhundov
2024-02-15 02:49:11 +00:00
7f5b87c953 [torch.compile] Log more compilation time breakdown (#119865)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119865
Approved by: https://github.com/ezyang
2024-02-15 02:20:07 +00:00
516f38a144 [RelEng] Define BUILD_BUNDLE_PTXAS (#119750)
That would bundle PTXAS into a `bin` folder

When compiling for Triton, define `TRITON_PTXAS_PATH` if `ptxas` is bundled with PyTorch. This is needed to make PyTorch compiled against CUDA-11.8 usable with an 11.8 driver, as Triton is bundled with the latest ptxas (CUDA-12.3 at the time of the PyTorch-2.2 release).

Needs 5c814e2527 to produce valid binary builds

Test plan:
- Create dummy ptxas in `torch/bin` folder and observe `torch.compile` fail with backtrace in Triton module.
- Run following script (to be added to binary tests ) against CUDA-11.8 wheel:
```python
import torch
import triton

@torch.compile
def foo(x: torch.Tensor) -> torch.Tensor:
  return torch.sin(x) + torch.cos(x)

x=torch.rand(3, 3, device="cuda")
print(foo(x))
# And check that CUDA versions match
cuda_version = torch.version.cuda
ptxas_version = triton.backends.nvidia.compiler.get_ptxas_version().decode("ascii")
assert cuda_version in ptxas_version, f"CUDA version mismatch: torch build with {cuda_version}, but Triton uses ptxs {ptxas_version}"
```

Fixes https://github.com/pytorch/pytorch/issues/119054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119750
Approved by: https://github.com/jansel, https://github.com/atalman
2024-02-15 02:08:57 +00:00
a07fd51b6b [caffe2] Add an avx512 implementation of adagrad_update (#113289)
Summary: As per title

Test Plan: contbuilds

Differential Revision: D50947444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113289
Approved by: https://github.com/ezyang
2024-02-15 01:45:30 +00:00
861acda205 Alternate sharding (#119078)
Changes sharding to attempt to put all serial tests on as few shards as possible. Parallel tests are then distributed across all shards, with most of them likely ending up on the non-serial shards.

Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests
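
The same arithmetic as the example above, as a hedged sketch (not the actual run_test.py sharding code):

```python
# Back-of-the-envelope sharding math from the example above.
import math

serial_min, parallel_min = 8.0, 20.0
procs_per_machine, machines = 2, 6

effective_total = serial_min + parallel_min / procs_per_machine   # 18 minutes
per_machine = effective_total / machines                          # 3 minutes per shard
serial_shards = math.ceil(serial_min / per_machine)               # serial tests fit on 3 shards
print(effective_total, per_machine, serial_shards)                # 18.0 3.0 3
```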

Move serial tests to run first

If I want to move to purely numbers-based sharding, this ensures that parallel tests run alongside parallel tests as much as possible instead of interleaving serial + parallel tests (which decreases the effectiveness of parallelization), while also ensuring that test reordering is still mostly effective.

See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
2024-02-15 01:32:44 +00:00
b4252d73b1 Make pattern matcher more robust (#119876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119876
Approved by: https://github.com/cccclai
2024-02-15 00:48:16 +00:00
daf1050ae5 [dtensor] refactor sharding cost model to count for latency (#119897)
This PR refactors the sharding cost model to do a more accurate
estimation of redistribute cost, including both collective latency and
communication time.

The previous cost model did not rescale the latency and communication
time, so the latency factor was too small to be counted; in
the case of small tensors, multiple collectives were preferred over a
single collective, which is wrong.
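
A hedged sketch of the rescaled cost estimate (made-up constants; the real model lives in the DTensor sharding propagation code):

```python
# Count a fixed per-collective latency in addition to the bandwidth term, so
# many small collectives no longer look cheaper than one larger collective.
def redistribute_cost(bytes_moved, num_collectives,
                      latency_us=30.0, bandwidth_gb_s=100.0):
    latency = num_collectives * latency_us                    # startup cost per collective
    transfer = bytes_moved / (bandwidth_gb_s * 1e9) * 1e6     # transfer time in microseconds
    return latency + transfer

# For a small tensor, one collective now beats four once latency is counted.
print(redistribute_cost(4096, 1), redistribute_cost(4096, 4))
```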

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119897
Approved by: https://github.com/tianyu-l
2024-02-15 00:35:56 +00:00
99cb807e25 Skip test_wrap_bad if run under pytest (#115070)
Pytest replaces sys.stdout/stderr with `TextIOWrapper` instances which do not support `fileno()`.
Hence, skip that test in this case.
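
A hedged sketch of the skip condition (hypothetical helper and test names):

```python
# Skip tests that need real file descriptors on sys.stdout/sys.stderr when the
# suite is driven by pytest, whose captured streams lack fileno().
import sys
import unittest

def running_under_pytest() -> bool:
    return "pytest" in sys.modules

class TestWrap(unittest.TestCase):
    @unittest.skipIf(running_under_pytest(),
                     "pytest replaces stdout/stderr with objects lacking fileno()")
    def test_wrap_bad(self):
        sys.stdout.fileno()   # would fail under pytest's captured streams
```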

Fixes #115069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115070
Approved by: https://github.com/clee2000
2024-02-15 00:10:05 +00:00
d833e2f236 Use ARMV8.2 scalar fp16<->fp32 conversion (#119895)
Thanks to discussion with @mikekgfb I've realized that SVE is the
feature available by default on Apple Silicon, so let's use it to speed up the
portable but slow bit-mashing algorithm implemented as `c10::detail::fp16_ieee_from_fp32_value` by using the following implicit conversion routine:
```cpp
float sve_fp16_to_fp32_value(uint16_t h) {
  union {
     uint16_t h;
     float16_t f16;
  } x = {h};
  return x.f16;
}
```
that according to the https://godbolt.org/z/8s14GvEjo is turned into [`fcvt s0,h0`](https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/FCVT--Floating-point-convert-precision--predicated--)

As a result, the very slow and naive [`torch.mm`](edd9ddf73f/aten/src/ATen/native/cpu/BlasKernel.cpp (L108)) runs 3x faster: from 85 msec to 27 msec (measured by running e41341df2d/benchmarks/benchmark_torch_mm.py )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119895
Approved by: https://github.com/mikekgfb
ghstack dependencies: #119892
2024-02-14 23:42:53 +00:00
096ebcca73 [FSDP2] Added gradient accumulation w/o reduction (#118298)
This PR adds a way to do gradient accumulation without collectives (i.e. reduce-scatter for FSDP and reduce-scatter/all-reduce for HSDP, though HSDP is not yet implemented). Since the `no_sync()` context manager has received some feedback, we simply define a method on the module to set whether the module requires gradient synchronization or not, where this method can recurse or not.
```
# Before with `no_sync()`:
with fsdp_model.no_sync() if not is_last_microbatch else contextlib.nullcontext():
  # Forward/backward

# After with a setter:
fsdp_model.set_requires_gradient_sync(not is_last_microbatch)
# Forward/backward
```
Having the method be able to recurse or not also gives some flexibility. For example, some large modules can still reduce-scatter, while some smaller modules can avoid it to save communication bandwidth:
```
fsdp_modules_to_reduce_scatter: Set[nn.Module] = ...
for module in fsdp_model.modules():
  if isinstance(module, FSDP) and module not in fsdp_modules_to_reduce_scatter:
    module.set_requires_gradient_sync(not is_last_microbatch)
# Forward/backward
```

(Separately, we may expose a helper for `return [module for model.modules() if isinstance(module, FSDP)]`.)

---

To show the spirit of this API choice, I also included `set_requires_all_reduce` that would give us the ability to only reduce-scatter but not all-reduce for HSDP (originally from the MiCS paper). If we want to flexibly support heterogeneous sharding where FSDP is applied to some modules and HSDP to others in the same model, then having a module-level method that has the option to not recurse makes sense to me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118298
Approved by: https://github.com/wconstab, https://github.com/wanchaol
ghstack dependencies: #119550, #118136, #118223, #118755, #119825
2024-02-14 23:09:59 +00:00
8f27fde2f5 [export] Log private api uses. (#119848)
Summary:
as title.
The following APIs are logged:
- capture_preautograd_graph
- torch._export.aot_compile
- external usage of _export_to_torch_ir (AOTInductor, Pippy)
- constraints API
- public use of torch._dynamo.export

Test Plan: CI

Differential Revision: D53735599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119848
Approved by: https://github.com/suo
2024-02-14 22:58:23 +00:00
340b6fa972 Deduplicate docs between global and non-global full backward hooks (#119708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119708
Approved by: https://github.com/albanD
ghstack dependencies: #114970
2024-02-14 22:53:44 +00:00
3713103db4 Revert "[Inductor] Setting kernel launch and exit callbacks for inductor generated triton kernels (#119450)"
This reverts commit 4e93b00b692118b8531f3807ec95eb4c538ea419.

Reverted https://github.com/pytorch/pytorch/pull/119450 on behalf of https://github.com/soulitzer due to Regressed perf on the dashboard ([comment](https://github.com/pytorch/pytorch/pull/119450#issuecomment-1944876761))
2024-02-14 22:44:21 +00:00
756cf2913d Fix NJT stride access in SDPA dispatcher logic (#119846)
`._stride` -> `._strides`

Adds test to cover this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119846
Approved by: https://github.com/drisspg, https://github.com/ani300, https://github.com/soulitzer
ghstack dependencies: #119910
2024-02-14 22:37:52 +00:00
0560c193a6 Fix meta registration for _flash_attention_forward() [ROCm forward fix] (#119910)
Addresses ROCm failures from #119812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119910
Approved by: https://github.com/drisspg
2024-02-14 22:37:52 +00:00
734ae20f2e [C10] Expand half unittest (#119892)
So far it's been only testing legacy conversion, rather than the one actually used when `at::Half` is constructed
Test `fp16` to `fp32` for the whole range of its 65536 values, though skip NaN comparisons, as different algorithms are not guaranteed to yield identical NaN representations and they are different anyway.

Do a small code cleanup, remove extraneous semicolons as well as named namespace inside unnamed one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119892
Approved by: https://github.com/kit1980
2024-02-14 22:32:43 +00:00
3470ab42bb [DCP] Automatically set no_dist if distributed is unavailable (#119813)
[DCP] Automatically set `no_dist` if distributed is unavailable
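
Conceptually (a hedged sketch, not the exact DCP code):

```python
# Default no_dist to True when there is no usable process group.
import torch.distributed as dist

def resolve_no_dist(no_dist=None):
    if no_dist is None:
        no_dist = not (dist.is_available() and dist.is_initialized())
    return no_dist
```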

Differential Revision: [D53718043](https://our.internmc.facebook.com/intern/diff/D53718043/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119813
Approved by: https://github.com/fegin, https://github.com/wz337
2024-02-14 22:25:07 +00:00
cd380c794f [CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)
#113713

Going to clean up some of the checks and will remove draft status after.
Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`.

CC @drisspg @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663
Approved by: https://github.com/drisspg
2024-02-14 22:02:06 +00:00
9ec8dd2467 Reify view_func() closures as ViewFuncs (#118404)
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.

```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
  virtual ~ViewFunc() {}
  /// Returns any SymInts in the saved state.
  virtual std::vector<c10::SymInt> get_symints() const { return {}; }
  /// Returns the number of SymInts in the saved state.
  virtual size_t num_symints() const { return 0; }
  /// Returns any tensors in the saved state.
  virtual std::vector<at::Tensor> get_tensors() const { return {}; }
  /// Returns the number of tensors in the saved state.
  virtual size_t num_tensors() const { return 0; }
  /// Reapplies the view on the given base using the saved state.
  virtual at::Tensor operator()(const at::Tensor&) const = 0;
  /// Returns a clone of this ViewFunc, optionally with the specified saved state.
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;

protected:
  /// Sets the values of any SymInts in the saved state. The input vector size must
  /// match the number of SymInts in the saved state (i.e. the size of the list
  /// returned by get_symints()).
  virtual void set_symints(std::vector<c10::SymInt>) {}
  /// Sets the values of any Tensors in the saved state. The input vector size must
  /// match the number of Tensors in the saved state (i.e. the size of the list
  /// returned by get_tensors()).
  virtual void set_tensors(std::vector<at::Tensor>) {}
};
```

New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`

The templates for these also contains impls for `ChainedViewFunc` and `ErroringViewFunc` which are used in a few places within autograd.

Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
  SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
  {};
  virtual ~SliceTensorViewFunc() override {};
  virtual std::vector<c10::SymInt> get_symints() const override;
  virtual size_t num_symints() const override;
  virtual std::vector<at::Tensor> get_tensors() const override;
  virtual size_t num_tensors() const override;
  virtual at::Tensor operator()(const at::Tensor&) const override;
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;

protected:
  virtual void set_symints(std::vector<c10::SymInt>) override;
  virtual void set_tensors(std::vector<at::Tensor>) override;

private:
  int64_t dim;
  c10::optional<c10::SymInt> start;
  c10::optional<c10::SymInt> end;
  c10::SymInt step;
};
...

// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
  ::std::vector<c10::SymInt> symints;
  symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
  if(start.has_value()) symints.insert(symints.end(), *(start));
  if(end.has_value()) symints.insert(symints.end(), *(end));
  symints.push_back(step);
  return symints;
}

size_t SliceTensorViewFunc::num_symints() const {
  return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}

void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
  TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
  auto i = 0;
  if(start.has_value()) start = symints[i];
  i += (start.has_value() ? 1 : 0);
  if(end.has_value()) end = symints[i];
  i += (end.has_value() ? 1 : 0);
  step = symints[i];
}

std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
  ::std::vector<at::Tensor> tensors;
  return tensors;
}

size_t SliceTensorViewFunc::num_tensors() const {
  return static_cast<size_t>(0);
}

void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
  TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());

}

at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
  return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}

std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
    std::optional<std::vector<c10::SymInt>> symints,
    std::optional<std::vector<at::Tensor>> tensors) const {
  auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
  if (symints.has_value()) {
    output->set_symints(std::move(*(symints)));
  }
  if (tensors.has_value()) {
    output->set_tensors(std::move(*(tensors)));
  }
  return output;
}
```

The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.

For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
2024-02-14 22:00:43 +00:00
6b04251b87 [inductor][scheduler] Use set for origin (#119861)
xref - https://github.com/pytorch/pytorch/issues/119440

This avoids node > node comparison if the origin order is same in the origins tuple. However, I am unable to come up with a test case where this could happen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119861
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-02-14 22:00:38 +00:00
29235c7063 Handle aliases correctly in foreach (#119508)
Fixes https://github.com/pytorch/pytorch/issues/119436

<s>In essence we need to ensure aliases are run in separate foreach kernels so that they are ordered correctly. Previously, aliases could end up in the same kernel which creates weird scheduling dependencies.</s>

There was a bug in cycle detection/can_fuse which was creating cycles when more than two aliases were used in foreach nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119508
Approved by: https://github.com/jansel
2024-02-14 21:21:28 +00:00
e0f6fa6a7c Windows Dynamo Error Removal CI Check (#115969)
Rebase of #111313 onto `main`, for CI validation

Co-authored-by: Stella Laurenzo <stellaraccident@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115969
Approved by: https://github.com/PaliC, https://github.com/thiagocrepaldi
2024-02-14 21:14:36 +00:00
9201d7335a Add pixel_shuffle to core aten decomps (#119899)
Summary: https://github.com/pytorch/pytorch/pull/118239 added a decomposition for pixel_shuffle, so pixel_shuffle no longer needs to be a Core ATen Op. We have also fixed the internal use case so that it no longer special cases on pixel_shuffle, allowing us to revert the changes in https://github.com/pytorch/pytorch/pull/118921.

Test Plan: CI

Differential Revision: D53766709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119899
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-02-14 21:01:11 +00:00
244b124bb8 Add linux cpu test for 3.12 (#117853)
This is continuation of work: https://github.com/pytorch/pytorch/pull/113987

Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117853
Approved by: https://github.com/albanD
2024-02-14 20:52:23 +00:00
bb67a28738 [DTensor] Enable Adamax foreach optimizer (#119850)
Enable Adamax foreach optimizer and add DTensor unit test for Adamax.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119850
Approved by: https://github.com/wanchaol
2024-02-14 20:43:00 +00:00
2aad3f93f8 Fix guards for field access through properties (#119719)
When building guards that went through a property, we were analyzing the property using getattr_static, but the guard itself wasn't built using getattr_static. So if the property was "unusual", we generated misbehaving code which referenced a non-existent `__closure__` field.

Fixes #118786

Note that after this change some of the referenced tests are still failing with a different error - but getting further.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119719
Approved by: https://github.com/oulgen
2024-02-14 20:42:55 +00:00
7797a8c2cb [testing][inductor] Allow grad tolerance override (#119844)
Introduce `grad_atol` and `grad_rtol` kwargs, default behavior is
preserved by using `atol` and `rtol` values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119844
Approved by: https://github.com/peterbell10
2024-02-14 20:18:48 +00:00
15f1b9f1c4 Improve TORCHDYNAMO_EXTENDED_DEBUG for GuardOnDataDependentSymNode (#119412)
This PR substantially improves the error reporting for GuardOnDataDependentSymNode in the following ways:

* The GuardOnDataDependentSymNode error message is rewritten for clarity, and contains a link to a new doc on how to resolve these issues https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit#heading=h.44gwi83jepaj
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL`, which lets you specify a symbol name to get detailed debug information when it is logged (e.g., the full backtrace and user backtrace of the symbol creation). The exact symbols that you may be interested in are now explicitly spelled out in the error message.
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CPP` which enables reporting C++ backtraces whenever we would report a backtrace.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119412
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #117356
2024-02-14 20:01:07 +00:00
0e6eee3c89 [ROCm] TunableOp (#114894)
Some operations, such as GEMMs, could be implemented using more than one library or more than one technique. For example, a GEMM could be implemented for CUDA or ROCm using either the blas or blasLt libraries. Further, ROCm's rocblas and hipblaslt libraries allow the user to query for all possible algorithms and then choose one. How does one know which implementation is the fastest and should be chosen? That's what TunableOp provides.

See the README.md for additional details.
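
The core idea, as a hedged Python sketch (the real implementation is C++ under aten/src/ATen/cuda/tunable and benchmarks far more carefully):

```python
# Tune-then-cache: time each candidate implementation of an op once, remember
# the fastest, and use it for subsequent calls with the same signature.
import time

_best_impl = {}   # op signature -> chosen implementation

def tunable_call(signature, candidates, *args):
    if signature not in _best_impl:
        timings = []
        for impl in candidates:
            start = time.perf_counter()
            impl(*args)                                   # benchmark run
            timings.append((time.perf_counter() - start, impl))
        _best_impl[signature] = min(timings, key=lambda t: t[0])[1]
    return _best_impl[signature](*args)
```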

TunableOp was ported from onnxruntime starting from commit 08dce54266.  The content was significantly modified and reorganized for use within PyTorch.  The files copied and their approximate new names or source content location within aten/src/ATen/cuda/tunable include the following:

- onnxruntime/core/framework/tunable.h -> Tunable.h
- onnxruntime/core/framework/tuning_context.h -> Tunable.h
- onnxruntime/core/framework/tuning_context_impl.h -> Tunable.cpp
- onnxruntime/core/providers/rocm/tunable/gemm_common.h -> GemmCommon.h
- onnxruntime/core/providers/rocm/tunable/gemm_hipblaslt.h -> GemmHipblaslt.h
- onnxruntime/core/providers/rocm/tunable/gemm_rocblas.h -> GemmRocblas.h
- onnxruntime/core/providers/rocm/tunable/gemm_tunable.cuh -> TunableGemm.h
- onnxruntime/core/providers/rocm/tunable/rocm_tuning_context.cc -> Tunable.cpp
- onnxruntime/core/providers/rocm/tunable/util.h -> StreamTimer.h
- onnxruntime/core/providers/rocm/tunable/util.cc -> StreamTimer.cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114894
Approved by: https://github.com/xw285cornell, https://github.com/jianyuh
2024-02-14 19:03:49 +00:00
90f785dc34 Change default TORCH_LOGS format to match Meta/glog standard (#119869)
Before:

```
[2024-02-13 19:34:50,591] [0/0] torch._dynamo.guards.__guards: [DEBUG] GUARDS:
[2024-02-13 19:34:50,591] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['x'], 70049616)                            # assert x.shape[0] > 2  # b.py:5 in f
[2024-02-13 19:34:50,592] [0/0] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['x'], '_dynamo_dynamic_indices') == False           # assert x.shape[0] > 2  # b.py:5 in f
```

After this change, the logs look like this:

```
V0214 07:00:49.354000 139646045393920 torch/_dynamo/guards.py:1023 [0/0] GUARDS:
V0214 07:00:49.354000 139646045393920 torch/_dynamo/guards.py:1039 [0/0] ___check_type_id(L['x'], 70050096)                            # assert x.shape[0] > 2  # b.py:5 in f
V0214 07:00:49.355000 139646045393920 torch/_dynamo/guards.py:1039 [0/0] hasattr(L['x'], '_dynamo_dynamic_indices') == False           # assert x.shape[0] > 2  # b.py:5 in f
```

The main differences from what we had before:

* We don't print DEBUG/INFO/WARNING, instead, we only print a single character. DEBUG, somewhat oddly, maps to V, because it corresponds to glog VERBOSE
* The year is omitted, and a more compact representation for date/month is adopted. Somewhat perplexingly, six digits are allocated for the nanoseconds, even though Python typically doesn't have that level of resolution
* The thread ID is included (in a containerized environment, this thread id will be typically much lower)
* Instead of using the module name, we give a filepath, as well as the line the log message was emitted from. I think the line number is a nice touch and improvement over our old logs, but one downside is we do lose the artifact name in the log message, in case anyone was grepping for that.
* I chose to move the compile id prefix to the very end so as to keep a uniform layout before it, but I do think there are benefits to having it before the filename

Meta only: This format was reverse engineered off of 6b8bbe3b53/supervisor/logging.py and https://www.internalfb.com/code/fbsource/[e6728305a48540110f2bdba198aa74eee47290f9]/fbcode/tupperware/front_end/log_reader/filter/StreamingLogLineFilter.cpp?lines=105-114

Now, I think this may be slightly controversial, but I have chosen to apply this format *by default* in OSS. My reasoning is that many PT2 developers work with the logs in OSS, and keeping the format identical to what we run in prod will make it easier for these skills to transfer.

The non-negotiable portion of the new format is "V0213 19:28:32"; the date string is expected to be in exactly this form or Tupperware will fail to parse it as a date.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119869
Approved by: https://github.com/oulgen, https://github.com/mlazos, https://github.com/Skylion007
2024-02-14 18:56:35 +00:00
d999222fba [dtensor] add op support for nll_loss_backward (#119256)
As titled. This is a followup to PR #118917 on nll_loss_forward. It also fixes an issue in it: the forward function produces two return values, the loss `result` and the `total_weight`. The previous PR didn't explicitly deal with the `total_weight` part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119256
Approved by: https://github.com/wanchaol
2024-02-14 18:50:42 +00:00
47182a8f4b Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we have pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) before our own logic retries it as well.

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-14 18:40:23 +00:00
6cf48187c5 [export] Remove references to capture_pre_autograd_graph inside test_export (#119875)
Summary: Title

Test Plan: CI

Reviewed By: zhxchen17

Differential Revision: D53728889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119875
Approved by: https://github.com/angelayi
2024-02-14 17:59:10 +00:00
ee3a7bdc2d [export] Don't error if nn_module_stack doesn't contain a class (#119753)
Summary: When we deserialize nn_module_stack, sometimes the module no longer exists in the python environment so we cannot deserialize it back into the python type and instead it's kept as a string. This causes downstream failures when retracing due to one of our checks in export. This diff just bypasses the check.

Test Plan: CI

Reviewed By: chakriu

Differential Revision: D53527706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119753
Approved by: https://github.com/zhxchen17
2024-02-14 16:56:11 +00:00
3e21c785a4 [ROCm] Initial ir.Scan/aten.cumsum lowering support on ROCm (#119369)
It was noted in https://github.com/pytorch/pytorch/pull/117992 that ROCm is still falling back to eager for scans with inductor.

Initially as part of https://github.com/pytorch/pytorch/pull/106581 ROCm was disabled on this feature due to lack of triton support.

This PR will enable support for lowering scan operations on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119369
Approved by: https://github.com/peterbell10
2024-02-14 16:13:46 +00:00
fb492f7ca1 [inductor] Reorder if check to avoid more expensive check. (#119817)
If `mkldnn` is not enabled or not available there is no point in performing a relatively expensive `all` check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119817
Approved by: https://github.com/Skylion007
2024-02-14 16:04:31 +00:00
184605ae7d [inductor] Replace generators with map. (#119818)
It's more concise and efficient.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119818
Approved by: https://github.com/Skylion007, https://github.com/Neilblaze
2024-02-14 16:02:52 +00:00
edd9ddf73f Propagate allow_non_graph_fake between get_fake_values_from_nodes and get_fake_values (#119731)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119731
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #119314, #119435
2024-02-14 15:26:17 +00:00
cyy
87c6cd2f00 [1/N] Replace std::tie with structural binding (#119774)
This PR replaces some std::tie calls with structural binding from C++17.  This not only makes the code more compact, but also has some performance gain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119774
Approved by: https://github.com/albanD, https://github.com/malfet
2024-02-14 09:25:04 +00:00
a45c627f27 [c10d][flight recorder] store a copy of string in entry (#119837)
Summary:
Previously, we just stored the char pointer in the entry; the string is a
temporary object and will already be destructed by the time we want to dump/access it.

A quick fix is to store a copy of the string, but without changing the
upstream char*.

An alternative is to change every profilingTitle into std::string; however,
this would need a comprehensive overhaul of the code up to the
c10d::work layer above workNCCL, RecordFunction, etc.

We chose the first option for this change

Resolve #119808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119837
Approved by: https://github.com/zdevito, https://github.com/wconstab
2024-02-14 09:13:56 +00:00
4a50572c92 [inductor] Recursivly unwrap_storage_for_input when convert_to_reinterpret_view fails (#119867)
Summary:
When, during the `ExternKernel.realize_input` call, the underlying `ExternKernel.convert_to_reinterpret_view` fails, we currently fall back to `cls.copy_input` here:

31e59766e7/torch/_inductor/ir.py (L3805-L3816)

This creates a `TensorBox(StorageBox(...))` wrapped output, which causes a problem for this assertion:

31e59766e7/torch/_inductor/ir.py (L3479)

Here we add special-case handling to unwrap `x` recursively.
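
For illustration, a minimal sketch of the recursive unwrapping, assuming inductor's `TensorBox`/`StorageBox` wrappers expose the wrapped node via `.data` (this is a sketch, not the actual helper added in `ir.py`):

```python
from torch._inductor.ir import StorageBox, TensorBox

def unwrap_storage_boxes(x):
    # peel nested TensorBox(StorageBox(...)) layers until we reach the
    # underlying IR node, instead of tripping the assertion on wrapped outputs
    while isinstance(x, (TensorBox, StorageBox)):
        x = x.data
    return x
```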

Test Plan:
This local repro:

```
@torch.compile()
def f(a, b, mat1, mat2):
    bias = torch.bmm(a + 3.14, b).permute(0, 2, 1).reshape(3992, -1)
    return torch.addmm(bias, mat1, mat2)
f(
    torch.randn(3992, 20, 40).cuda(),
    torch.randn(3992, 40, 192).cuda(),
    torch.empty(3992, 1024).cuda(),
    torch.empty(1024, 3840).cuda(),
)
```

with this line:

690f54b0f5/torch/_inductor/fx_passes/post_grad.py (L650)

changed to `if cond(*args, **kwargs):` fails before and succeeds after this PR.

Differential Revision: D53743146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119867
Approved by: https://github.com/xw285cornell
2024-02-14 07:50:34 +00:00
9f44274373 Add tests to verify disabled optimizers (#118919)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118919
Approved by: https://github.com/janeyx99
2024-02-14 07:45:16 +00:00
ca55468416 Target Determinator Indexer Workflow (#118824)
As described in [this talk](https://www.youtube.com/watch?v=I95KmF6KSIA) and [this repo](https://github.com/osalpekar/llm-target-determinator),  we are experimenting with using CodeLlama-powered information retrieval for target determination.

The idea is that we create embeddings for PyTorch test functions, and store this index in S3. Then when a new PR comes in, we create embedding(s) for that PR, compare them to the index of test embeddings, and run only the most relevant tests.

This PR creates a workflow that does the indexing part (creating embeddings for functions and store in S3). All the logic for running the indexer is in [osalpekar/llm-target-determinator](https://github.com/osalpekar/llm-target-determinator). This workflow just checks out the relevant repos, installs the dependencies, runs the torchrun command to trigger indexing, and uploads the artifacts to S3.
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118824
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2024-02-14 06:21:18 +00:00
caf9d9d7c1 [executorch hash update] update the pinned executorch hash (#119733)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119733
Approved by: https://github.com/pytorchbot
2024-02-14 06:15:25 +00:00
54a30f6d4e [Dynamo] Update trace_rules.py and re-enable skipped tests (#119860)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119860
Approved by: https://github.com/angelayi
2024-02-14 05:22:55 +00:00
8ba2675488 Fix for-loop divisibility parsing (#119859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119859
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834, #119835, #119836, #119838
2024-02-14 05:09:59 +00:00
1f0e4ac146 Add support for while-loops in ttir analysis (#119838)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119838
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834, #119835, #119836
2024-02-14 05:09:59 +00:00
5ffac768f6 Add support for labels to ttir analysis (#119836)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119836
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834, #119835
2024-02-14 05:09:59 +00:00
3f09c5ee66 Add TTIR verification (#119835)
Make sure the TTIR generated is valid before attempting to analyze. Incorrectly written triton code would produce broken TTIR. Minor discussion on https://github.com/openai/triton/issues/3120
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119835
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834
2024-02-14 05:09:59 +00:00
b257ff80da Add test scf.for with multi return (#119834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119834
Approved by: https://github.com/aakhundov
2024-02-14 05:09:59 +00:00
72bbbab70a Add the missing test_dynamo_list_index from #119151 (D53392287) (#119854)
D53392287 botched the export somehow and the exported PR https://github.com/pytorch/pytorch/pull/119151 didn't contain the added test.  The discrepancy is showing up on the diff train patch-up diff D53694548.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119854
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-02-14 04:10:02 +00:00
563f1b9fef [inductor] Use torch.cuda.clock_rate instead of triton.testing.nvsmi (#118662)
`triton.testing.nvsmi` invokes `nvidia-smi` as a subprocess, and Meta
prod usually doesn't make nvidia-smi available.  Might as well just use
something that's native to torch.

Differential Revision: [D53235814](https://our.internmc.facebook.com/intern/diff/D53235814/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118662
Approved by: https://github.com/jansel
2024-02-14 03:23:49 +00:00
80379ef0aa [dynamo-must-fix] Use ID_MATCH for UserDefinedClass (#119853)
Fixes https://github.com/pytorch/pytorch/issues/119715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119853
Approved by: https://github.com/jansel
2024-02-14 03:14:42 +00:00
4240304da4 [TorchElastic] Handle SystemExit with code == 0 (#119697)
Summary:
Fix for a case where the --run-path option fails to exit if the script exits with a non-error status code.
When there is an error exit code, run-path correctly detects the error and fails when calling spawn.join(). However, for the non-error case, the current behavior is to check the return value of the operation; the fix is to return None so that our MP code detects an exit.

Test Plan:
cat /tmp/script.py
~~~
import sys
def main():
    exit_code = 1
    if len(sys.argv) > 1:
        exit_code = int(sys.argv[1])
    sys.exit(exit_code)

if __name__=="__main__":
    main()
~~~

Case of exit code with 0 (prior behavior - never exits):
torchrun --run-path /tmp/script.py 0

~~~
[2024-02-12 09:20:57,523] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:20:58,980] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
(conda:pytorch) ➜  workspace echo $?
0
~~~

Existing behavior for non-zero exit code still works:
torchrun --run-path /tmp/script.py
~~~
(conda:pytorch) ➜  workspace torchrun --run-path /tmp/script.py
[2024-02-12 09:16:20,667] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:22,197] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 64668) of fn: run_script_path (start_method: spawn)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] Traceback (most recent call last):
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/api.py", line 441, in _poll
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]     self._pc.join(-1)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/Users/kurman/workspace/pytorch/torch/multiprocessing/spawn.py", line 177, in join
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]     raise ProcessExitedException(
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
Traceback (most recent call last):
  File "/Users/kurman/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 812, in main
    run(args)
  File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-12_09:16:25
  host      : kurman-mbp.dhcp.thefacebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 64668)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
~~~

Differential Revision: D53653874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119697
Approved by: https://github.com/wconstab
2024-02-14 03:09:09 +00:00
5ce305270b Add a decomposition for isin() (#115390)
Co-authored-by: Peter Bell <peterbell10@live.co.uk>
Co-authored-by: Mario Lezcano Casado <3291265+lezcano@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115390
Approved by: https://github.com/peterbell10
2024-02-14 03:03:42 +00:00
75a6d6aef7 [inductor] Support storage resizing (#119749)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119749
Approved by: https://github.com/yf225
ghstack dependencies: #119647, #119671
2024-02-14 03:03:38 +00:00
31e59766e7 Fix meta registration for _flash_attention_forward() (#119812)
Meta registration wrongly assumes 4D inputs, while the underlying op allows 3D inputs for the `mha_varlen_fwd()` case.
Testing: I added `detach()`es so the NJT test `test_sdpa_compile()` won't fail for a view-related reason. It should pass now with this fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119812
Approved by: https://github.com/drisspg
2024-02-14 02:38:53 +00:00
179ecab7e7 Do full checkout in lint workflow to rebuild new Docker images (#119858)
From https://github.com/pytorch/pytorch/pull/119575, using `fetch-depth: 1` didn't work for `calculate-docker-image` when rebuilding a new one.  Specifically, doing a full checkout is needed for `git rev-parse HEAD~:.ci/docker` to get the Docker tag.

This shows up as a trunk failure after the recent Docker image update 507db17675
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119858
Approved by: https://github.com/PaliC, https://github.com/clee2000, https://github.com/malfet
2024-02-14 02:37:54 +00:00
690f54b0f5 [dynamo][nit] Cleanup analyze_kernel_mutations nits. (#119703)
Using `extend` is more efficient and other changes are stylistic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119703
Approved by: https://github.com/Skylion007
2024-02-14 02:04:13 +00:00
f9f0c67445 beef up non-overlapping checks for detecting false aliasing of graph inputs (#119826)
This extra check is needed for some more complicated parameter sizes/strides for an internal model

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119826
Approved by: https://github.com/albanD
2024-02-14 01:46:30 +00:00
c9459e7f55 Update atomicMaxFloat (#119577)
# Summary

Initially reported in https://github.com/pytorch/pytorch/issues/119320

I found that by updating this function the NaN values went away. I then created a godbolt to try and highlight the difference between the two versions:
https://godbolt.org/z/3sKqEqn4M

However, they appear to always produce the same value as the nvcc version is varied, except that for some versions -inf is chosen and for others the correct subnormal is chosen... I am having a hard time finding an isolated test case for this but will keep working on it.

### Update:
I added printf statements to the version and indeed some values/*addr contain -0.0f. Hence why this update fixes the reported issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119577
Approved by: https://github.com/yifuwang
2024-02-14 01:17:16 +00:00
suo
8e029dc616 [export] fix tuple return with symints (#119829)
as title.

Differential Revision: [D53726648](https://our.internmc.facebook.com/intern/diff/D53726648/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119829
Approved by: https://github.com/zhxchen17, https://github.com/khabinov
2024-02-14 01:16:38 +00:00
4a5b2cd6cb Revert "Windows Dynamo Error Removal CI Check (#115969)"
This reverts commit 45e7af5818f1d4ab1cf568390b3721b9be4251a9.

Reverted https://github.com/pytorch/pytorch/pull/115969 on behalf of https://github.com/PaliC due to this pr ended up breaking some of our periodic tests ([comment](https://github.com/pytorch/pytorch/pull/115969#issuecomment-1942934386))
2024-02-14 01:11:46 +00:00
16369816a2 [sparse] semi-structured sparse refactor (#117302)
Summary:

This PR is a refactor of semi-structured sparsity support.

**deprecation**:

Before, `torch.sparse.to_sparse_semi_structured` had a kwarg param
`transposed=False`, which has been removed. This kwarg was unused, and
passing it now throws a deprecation warning.

Namely, I've taken the subclassing implementation that xFormers has
created and brought it over to PyTorch, as part of our plan to upstream
runtime 2:4 sparsity.

I've also copied over all the op support that Daniel implemented that
did not depend on the fast sparsification routines into
`_sparse_semi_structured_ops.py`.

With this subclass, all of our internal tests pass, as well as those in
xFormers.

The main change is that we now define a base subclass,
`SparseSemiStructuredTensor` that is inherited from for each of the
specific backends.

We can also now arbitrarily override the sparse dispatch table with
`_load_dispatch_table()`, the idea being that this is still general enough
that users don't need to modify pytorch source code to get their model
working.

This also adds in padding support and stores alg_id and fuse_transpose
as flags on the tensor, instead of hardcoding them.

There still remains two components in xFormers that will need to be
ported over eventually:
- the autograd functions  (`Sparsify24`, `Sparsify24_like`)
- fast sparsification routines that they rely on

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117302
Approved by: https://github.com/alexsamardzic, https://github.com/HDCharles
2024-02-14 01:10:40 +00:00
2536c5186e [BE] Properly mark destructor overrides (Take 2) (#119656)
Otherwise, at least on MacOS builds are littered with:
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/DeviceAccelerator.h:6:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MTIAHooksInterface.h:23:11: warning: '~MTIAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~MTIAHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/CUDAHooksInterface.h:65:11: warning: '~CUDAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~CUDAHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MPSHooksInterface.h:21:11: warning: '~MPSHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~MPSHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
```

 Likely introduced by https://github.com/pytorch/pytorch/pull/119329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119656
Approved by: https://github.com/Skylion007
2024-02-14 01:05:58 +00:00
cyy
cb0886ecf2 [DeviceIndex][4/N] Use DeviceIndex in more places (#119741)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119741
Approved by: https://github.com/aaronenyeshi, https://github.com/ezyang
2024-02-14 00:29:10 +00:00
suo
b2e779868f make internal lintrunner mypy clean (#119840)
as title

Differential Revision: [D53732505](https://our.internmc.facebook.com/intern/diff/D53732505/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119840
Approved by: https://github.com/ezyang
2024-02-14 00:25:42 +00:00
507db17675 Update HF pin (#119717)
Sometime between now and the previous pin update, HF introduced a
ModelOutputs type, which was not pytree serializable, causing
aot_compile to fail on new HF models
(https://fb.workplace.com/groups/1075192433118967/permalink/1377977852840422/).
With https://github.com/huggingface/transformers/pull/27871, we
can now pytree serialize HF ModelOutputs types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119717
Approved by: https://github.com/desertfire
2024-02-14 00:17:16 +00:00
b51e0246b7 sccache version update (#119554)
Fixes #37928

`sccache` is updated to a newer version (`v0.7.4`) to fix the `multiple input files` non-cacheable calls for `CUDA` builds.

This should make `Cache hits (CUDA)`  work as expected and improve the speed dramatically.

---

Additional information:

- Modified the `install_sccache.bat` check structure due to the GitHub Action error `Process completed with exit code 255.`
    - The error occurs when the freshly downloaded `sccache` is called with the `--show-stats` or `--start-server` arguments within the script
    - Now, it checks for the file's existence and kills/deletes the executable before the download

- Removed `sccache-cl` since it is no longer needed with newer versions of `sccache`

---

`win-vs2019-cpu-py3 / build` - `16m 27s`

![image](https://github.com/pytorch/pytorch/assets/148207261/b5628e6c-64bb-4293-9d07-480f56df44f1)

`win-vs2019-cuda11.8-py3 / build` - `17m 4s` **(previously ~45 mins - 1h30mins)**

![image](https://github.com/pytorch/pytorch/assets/148207261/e4ab01cb-0f56-41e8-984f-110e643b9c09)

Now `Cache Hits (CUDA)` hits all `304` objects and the `Non-cacheable reasons` error is fixed.

![image](https://github.com/pytorch/pytorch/assets/148207261/c8c25d2e-3fc1-4edb-8982-99c1f490cb54)

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119554
Approved by: https://github.com/malfet
2024-02-13 23:50:40 +00:00
be35fc9ea7 Size oblivious test for slice optimization (#119625)
Fixes https://github.com/pytorch/pytorch/issues/119623

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119625
Approved by: https://github.com/albanD
2024-02-13 23:47:52 +00:00
d81d5f52d5 [FSDP2][ez] Replaced groupby with all for same-dtype check (#119825)
The `groupby` logic to check if all all-gather inputs have the same dtype is not so readable. Let us use `all` instead.
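
A minimal sketch of the simplified check (the helper name is hypothetical, not the actual FSDP2 code):

```python
def all_same_dtype(tensors):
    # `all` reads more directly than grouping the tensors by dtype with groupby
    first = tensors[0].dtype
    return all(t.dtype == first for t in tensors)
```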

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119825
Approved by: https://github.com/Skylion007
ghstack dependencies: #119550, #118136, #118223, #118755
2024-02-13 23:28:53 +00:00
cf117e37d5 Refactor THPStorage_resize_ (#119671)
Moving code around to allow it to be reused in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119671
Approved by: https://github.com/yf225
ghstack dependencies: #119647
2024-02-13 23:28:47 +00:00
ca777fbbb7 Add Accelerator device and shell hooks (#119329)
This adds a concept of Accelerator that points to one of our devices. See DeviceAccelerator.h in this PR for details https://github.com/pytorch/pytorch/pull/119329/files#diff-83cc748bed5df1a453c272cc5ecc7e572d4eb694c5125384d8fbd17a0b5f50c8
It also adds scaffolding for shared C++ API to allow generic feature implementation. This PR in particular updates the autograd engine to use this generic API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119329
Approved by: https://github.com/ezyang, https://github.com/huydhn
2024-02-13 23:15:24 +00:00
e9b78f2db0 Rewrite group_batch_fusion.find_independent_subset_greedy() to be iterative. (#118324)
Improve performance of inductor searching large graphs for potential fusions.
Also adds some direct unit tests of find_independent_subset_greedy() to ensure that the rewrite didn't break behavior.

Fixes #98467

Previously find_independent_subset_greedy() was recursive and the example from the issue would cause it to blow out the stack. This changes it to be iterative and also caches some of the computed dependencies (it can't cache all of them because the caller is allowed to change the graph during the iteration).
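
A minimal sketch of the recursion-to-iteration idea, under assumed names and a simplified dependency interface (not the actual `torch._inductor` implementation):

```python
def find_independent_subsets_greedy(nodes, transitive_deps):
    # `nodes` is assumed to be in topological order; `transitive_deps(n)` is
    # the set of nodes n (transitively) depends on.  Each yielded subset
    # contains only mutually independent nodes, and a plain while-loop
    # replaces the earlier recursion so huge graphs cannot overflow the stack.
    remaining = list(nodes)
    while remaining:
        chosen, deferred = [], []
        chosen_set = set()
        for node in remaining:
            if transitive_deps(node).isdisjoint(chosen_set):
                chosen.append(node)
                chosen_set.add(node)
            else:
                deferred.append(node)
        yield chosen
        remaining = deferred
```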

Fusion is still slow - but at least finishes.

After this change the example given in #98467 has the following backend timings (on one particular CPU):
eager timing: 3m:23s
aot_eager timing: 4m:12s
inductor timing: 22m:24s

Possible future work to improve this further:
1. In dynamo limit the amount of inlining allowed before falling back to a graph break. This test ends up tracing through 483k bytecodes generating the graph.
2. In inductor have a limit so we don't exhaustively search the graph for fusion possibilities.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118324
Approved by: https://github.com/oulgen
2024-02-13 22:54:53 +00:00
ba1eb0e27f [ROCm] upgrade CI to 6.0 (#119495)
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119495
Approved by: https://github.com/huydhn
2024-02-13 22:39:03 +00:00
df9b44436a [ROCm] Enable float16/complex32 fft tests on ROCm (#117296)
This PR is to enable float16/complex32 fft tests on ROCm.
Sample results are attached here:
[test_spectral_ops_results.log](https://github.com/pytorch/pytorch/files/13908533/test_spectral_ops_results.log)

test_decomp::TestDecompCUDA::test_comprehensive_fft*
test_decomp::TestDecompCUDA::test_quick_fft*
test_jit_fuser_te::TestNNCOpInfoCUDA::test_nnc_correctness_fft*
test_meta::TestMetaCUDA::test_dispatch_meta_inplace_fft*
test_meta::TestMetaCUDA::test_dispatch_meta_outplace_fft*
test_meta::TestMetaCUDA::test_dispatch_symbolic_meta_inplace_fft*
test_meta::TestMetaCUDA::test_dispatch_symbolic_meta_outplace_fft*
test_meta::TestMetaCUDA::test_meta_inplace_fft*
test_meta::TestMetaCUDA::test_meta_outplace_fft*
test_ops::TestCommonCUDA::test_complex_half_reference_testing_fft*
test_ops::TestCommonCUDA::test_python_ref__refs_fft*
test_ops::TestCommonCUDA::test_python_ref_executor__refs_fft*
test_ops::TestCommonCUDA::test_python_ref_meta__refs*
test_ops::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft*
test_schema_check::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft*
test_spectral_ops::TestFFTCUDA::test_empty_fft__refs_fft*
test_spectral_ops::TestFFTCUDA::test_empty_fft_fft*
test_spectral_ops::TestFFTCUDA::test_fft_half_and_chalf_not_power_of_two_error__refs_fft*
test_spectral_ops::TestFFTCUDA::test_fft_half_and_chalf_not_power_of_two_error_fft*
test_spectral_ops::TestFFTCUDA::test_fft_round_trip_cuda*
test_spectral_ops::TestFFTCUDA::test_fft_type_promotion_cuda*
test_spectral_ops::TestFFTCUDA::test_fftn_round_trip_cuda*
test_spectral_ops::TestFFTCUDA::test_hfftn_cuda_float16
test_spectral_ops::TestFFTCUDA::test_ihfftn_cuda_float16
test_utils::TestDeviceUtilsCUDA::test_device_mode_ops_fft

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117296
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-02-13 22:35:32 +00:00
63d64c8995 [MPS] Enable more bfloat16 ops (#119738)
Introduce a convenience inlinable `mps::supportedFloatingType` function
that returns true if the type is Float, Half or BFloat16

Test by running LLM inference using bfloat16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119738
Approved by: https://github.com/Skylion007
2024-02-13 22:11:00 +00:00
eb9a3383c2 [MPS] Add naive std_mean implementation (#119777)
By just calling `std_mps` and `mean` in sequence
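
Roughly, the naive composite amounts to the following sketch (not the actual `ReduceOps.mm` code):

```python
import torch

def naive_std_mean(x, dim, keepdim=False):
    # call std and mean in sequence, as described above
    return torch.std(x, dim=dim, keepdim=keepdim), torch.mean(x, dim=dim, keepdim=keepdim)
```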

Move `var_mean` decomp to `ReduceOps.mm`, as it should be faster to skip dispatching to Python, which one can validate by running the following script:
```python
from timeit import default_timer

import torch
from torch.utils.benchmark import Measurement, Timer

def bench_var_mean(
    m, n, k,
    dtype = torch.float32,
    device:str = "cpu",
) -> Measurement:
    setup = f"""
     x = torch.rand({m}, {n}, {k}, dtype={dtype}, device="{device}")
    """

    t = Timer(
        stmt="torch.var_mean(x, dim=1)", setup=setup, language="python", timer=default_timer
    )
    return t.blocked_autorange()

for x in [100, 1000]:
    rc = bench_var_mean(1000, x, 100, device="mps")
    print(f"{x:5} : {rc.mean*1e6:.2f} usec")
```
which before the change reports 681 and 1268 usec, and after, 668 and 684 usec (which probably means that the GPU is not saturated, but the overhead from switching between the native and interpreted runtimes is smaller).

Fixes https://github.com/pytorch/pytorch/issues/119663

TODOs:
 - Refactor the codebase and implement proper composite function (that must be faster)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119777
Approved by: https://github.com/albanD
2024-02-13 21:51:29 +00:00
ee5b59dd4b [ROCm] CatArrayBatchedCopy performance improvement (#118685)
Tune the grid and block sizes for ROCm.  Add a contig kernel separate from aligned+contig.

Verified new performance using pytorch/benchmarks/operator_benchmark.

`python -m pt.cat_test --device=cuda --tag-filter all`

On MI200 this improved performance on average 4%, and on MI300 14%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118685
Approved by: https://github.com/malfet
2024-02-13 21:51:20 +00:00
6665b96ebb Rewrite maybe_reduce more carefully for unbacked SymInt (#119562)
Fixes https://github.com/pytorch/pytorch/issues/119476

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119562
Approved by: https://github.com/albanD
ghstack dependencies: #119559
2024-02-13 21:40:06 +00:00
28f299a870 [c10d] Fix compilation of NCCL_EXP path (#119805)
Fixes issue pointed out in https://github.com/pytorch/pytorch/pull/119421#issuecomment-1941694621

When refactoring ProcessGroupNCCL, some code in the NCCL_EXP path wasn't done cleanly.

Cc: @kunalb @H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119805
Approved by: https://github.com/H-Huang
2024-02-13 21:26:59 +00:00
f9200c8608 [BE][Ez]: FURB129: remove unneeded readlines() (#119796)
Applies a refurb rule to remove any readlines() in a for loop iteration as it just creates a temporary list in memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119796
Approved by: https://github.com/ezyang
2024-02-13 21:21:22 +00:00
3319dbcd23 Update vmap guard to avoid recompilations (#119061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119061
Approved by: https://github.com/zou3519
2024-02-13 20:50:23 +00:00
abadbbc4b0 [c10d][flight recorder] remove unintended assignment of entry (#119748)
Summary:
auto& entry = entries_.at(*id % max_entries_);
entry = entries_.at(*id % max_entries_);
The above line of code has the unintended consequence of invoking copy
assignment of entry objects, as the reference itself cannot be re-assigned.

Also, what could cause the crash is that the entry reference could become
invalid if entries_ is resized by other threads, and this could result in a
'copy to a garbage location'. The fix is to use a pointer, which can be
re-assigned after re-acquiring the lock.

Tests: python test/distributed/test_c10d_nccl.py NCCLTraceTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119748
Approved by: https://github.com/wconstab, https://github.com/fegin
2024-02-13 20:18:58 +00:00
34638c82a6 [mergebot] No unique behavior for facebook bot re pending jobs (#119735)
if fb bot says merge without -f, do normal behavior and wait for pending checks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119735
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2024-02-13 20:07:24 +00:00
8ec3d8e35f Fixed FxGraphDrawer compat constructor (#119767)
Match FxGraphDrawer compat constructor signature to avoid the following failure when `pydot` is not installed:
```
  File "/pytorch/torch/_functorch/partitioners.py", line 933, in draw_graph
    g = graph_drawer.FxGraphDrawer(
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: __init__() got an unexpected keyword argument 'dot_graph_shape'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119767
Approved by: https://github.com/eellison
2024-02-13 19:36:01 +00:00
8ec8d78ef2 [quant][pt2e][be] Rename eval_utils -> export_utils (#119725)
It's not really eval_utils anymore, since we added some training
related utils. Instead it should be util functions that are
related to general export use cases.

Differential Revision: [D53711494](https://our.internmc.facebook.com/intern/diff/D53711494)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119725
Approved by: https://github.com/tugsbayasgalan
2024-02-13 19:10:06 +00:00
0a2e000edf [BE] Enabled mypy in common_fsdp.py (#118755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118755
Approved by: https://github.com/Skylion007, https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119550, #118136, #118223
2024-02-13 19:05:30 +00:00
8c1480f568 [FSDP2] Added mixed precision (#118223)
This PR adds mixed precision configured via `MixedPrecisionPolicy`.
- By default (`cast_forward_inputs=True`), each FSDP module will cast forward floating-point input tensors to `param_dtype` if specified. If the user wants to own the cast, then the user can disable it by passing `False`.
- Symmetrically, by default (`output_dtype=None`) each FSDP module will not cast the forward output. If the user wants to customize the output dtype, then the user can pass a `torch.dtype`.
- `param_dtype` configures the unsharded parameters' dtype for forward/backward computation and hence the all-gather dtype.
- `reduce_dtype` configures the gradient reduction dtype. If `reduce_dtype=None` and `param_dtype is not None`, then `reduce_dtype` inherits from `param_dtype` for simplicity.

We test against a manually implemented reference implementation instead of comparing against existing FSDP since the comparison is more direct to what we want to test.
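
A usage sketch of the policy described above; the import path and the `mp_policy` keyword are assumptions about the experimental per-parameter FSDP frontend at the time, not settled public API:

```python
import torch
# assumed import path for the per-parameter FSDP ("FSDP2") frontend
from torch.distributed._composable.fsdp import MixedPrecisionPolicy, fully_shard

# all-gather/compute in bf16, reduce gradients in fp32
policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
# applied when sharding a module, e.g. under an initialized process group / device mesh:
# fully_shard(module, mp_policy=policy)
```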

---

**Overhead benchmarks to inform design**
The dilemma is as follows:
- The common path for FSDP is bf16 parameter mixed precision, where we cast sharded parameters from fp32 to bf16 before all-gathering them.
- The baseline implementation is to `torch._foreach_copy_` the sharded parameters to the flat `all_gather_input`, which gets passed to `dist.all_gather_into_tensor`.
    - The baseline incurs 1 extra fp32 read and 1 extra bf16 write per parameter because `_foreach_copy` takes the slow path, calling `copy_` in a loop, and `copy_` calls `dst.copy_(src.to(bf16))` where `dst` is bf16 and `src` is fp32.
    - These `copy_` calls stay in C++ and do not require calling `at::as_strided`.
- The issue with this baseline implementation is that it requires knowing that all parameters in the group will be cast from fp32 to bf16 to do this `_foreach_copy_` from fp32 sources to a bf16 destination.
- We want per-parameter FSDP to support mixed dtype all-gathers, which would involve different parameters providing different dtype all-gather inputs and viewing them as uint8 for a combined flat all-gather input, where this viewing-as-uint8 step is only needed in the mixed dtype case.
- However, this incurs more CPU overhead, so we want to investigate this in more detail.

We consider 150 `nn.Parameter`s with shapes taken from an internal model (where the shapes only affect the copy bandwidth, not the CPU overhead). We focus on world size 128 first. We consider two experiments: (1) run the copy-in with no head start, allowing CPU boundedness affect GPU time, and (2) run the copy-in with a CPU head start, removing CPU overhead from affecting GPU time.

No head start:
- Baseline `torch._foreach_copy_`: 0.525 ms CPU; 0.528 ms GPU
- `.to(bf16)` before `torch._foreach_copy_`: 0.828 ms CPU; 0.836 ms GPU
- `.to(bf16).view(uint8)` before `torch._foreach_copy_`: 0.933 ms CPU; 0.937 ms GPU

Head start (removing CPU boundedness from GPU times):
- Baseline `torch._foreach_copy_`: 0.393 ms GPU
- `.to(bf16)` before `torch._foreach_copy_`: 0.403 ms GPU
- `.to(bf16).view(uint8)` before `torch._foreach_copy_`: 0.403 ms GPU

Some other interesting notes:
- Constructing a set of all all-gather input dtypes: ~0.015 ms -- this would be the overhead cost of checking whether we need to view as uint8 (i.e. whether we have mixed dtype); alternatively, we could always view as uint8 (but that loses the mixed precision policy info from the profiler trace)
- Changing from `[t.to(bf16).view(uint8) for t in ts]` to two list comprehensions like `[t.to(bf16) for t in ts]; [t.view(uint8) for t in ts]` actually reduces CPU overhead 🤔  (by ~0.04 ms)

We see that the main difference is just CPU overhead. The GPU times are almost the same. (Actually, sweeping over 8, 16, 32, 64 world size, we do see difference in GPU time inversely proportional to world size, as expected since smaller world sizes copy more data. However, even at world size 8, the difference is only 0.407 ms vs. 0.445 ms GPU time.) Note though that the CPU overhead differences are exacerbated when the PyTorch profiler is turned on, and how much so seems to depend on the CPU capability.

Seeing these numbers, I am inclined to prefer to just incur the CPU overhead, especially given that if we want to support the mixed dtype case for fp8 all-gather, we will need to incur this anyway. If the CPU overhead becomes a problem on a real workload, then we will need to figure out options then, one being using `torch.compile` possibly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118223
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119550, #118136
2024-02-13 19:05:30 +00:00
3956ce01e0 [FSDP2] Added autograd/memory/overlap/frozen/2D/AC tests (#118136)
This PR adds tests for autograd (mainly backward hooks), memory, overlap, and frozen parameters.
- Autograd: unused forward output, unused forward module, non-tensor activations (common in internal models)
- Memory: expected GPU memory usage after init, forward, backward, and optimizer step
- Overlap: communication/computation overlap in forward and backward
- Frozen: expected reduce-scatter size, training parity

This PR adds some initial 2D (FSDP + TP) training and model state dict tests. The only change required for model sharded state dict is to make sure parameters are sharded before save and load.

This PR adds tests that `fully_shard` can use `torch.utils.checkpoint`, `_composable.checkpoint`, and `CheckpointWrapper` on a transformer.

(I squashed all of these into one PR now to save CI cost.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118136
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119550
2024-02-13 19:05:30 +00:00
39c68efd85 [dynamo] Capture untyped_storage().resize_() (#119647)
This makes storage resizing work with `backend=eager`, the next two PRs make it work for inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119647
Approved by: https://github.com/yf225
2024-02-13 19:03:28 +00:00
c0e5cca4f8 [DDP] Change the --no-optimize-ddp flag to reflect the latest usage (#119437)
Compiled DDP now has 4 different optimization modes. This PR changes the Dynamo benchmark flag to reflect that change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119437
Approved by: https://github.com/wconstab, https://github.com/xmfan
2024-02-13 16:53:56 +00:00
c2522554dd Prevent DCE'ing unbacked SymInt for view outputs (#119552)
Fixes https://github.com/pytorch/pytorch/issues/119414

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119552
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-02-13 16:32:21 +00:00
52de407b6c Avoid performing replacements when it would unrefine ranges (#117356)
Fixes https://github.com/pytorch/pytorch/issues/117268; check this issue for background.

This PR does the following:

* Do not perform a replacement if the expression we're replacing the symbol with has a less refined value range than the original. There's a little bit of trickiness around the handling for values close to INT64_MAX; when checking if a range refines another, I *only* consider the range representable in 64-bit integers. This is enough to prevent us from doing a substitution like `i0 = 10 - i1`, but it appears to still let us do the other substitutions we like, such as `i0 = i1` or `i0 = 12 * i1`
* The test above is order dependent: if we assert an equality BEFORE we have refined a range, we might be willing to do the replacement because there isn't a meaningful range. This means that it's important to mark things as sizes, before you start doing other error checking. `split_with_sizes` is adjusted accordingly. It would be good to raise an error if you get the ordering wrong, but I leave this to future work.
* It turns out this is not enough to fix AOTAutograd, because we lose the size-ness of unbacked SymInts when AOTAutograd retraces the Dynamo graph. So update deferred runtime assert insertion to also insert size-ness and value ranges annotations. Note that, in principle, it shouldn't be necessary to explicitly do the latter; these should just show up as deferred runtime asserts. That's some extra refactoring for a later day.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117356
Approved by: https://github.com/lezcano
2024-02-13 15:56:59 +00:00
0fd371c868 fix torch.cumsum docs (#117944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117944
Approved by: https://github.com/zou3519
2024-02-13 15:29:06 +00:00
c2a835d710 [inductor] Refactor device guard Python codegen to allow nested indentation (#119673)
Summary: The codegen of `with torch.cuda._DeviceGuard` context manager in the Python wrapper code is implemented via `device_cm_stack: contextlib.ExitStack()`. As the context managers in the stack are `code.indent()`, this means that the whole stack is unindented at once on `device_cm_stack.close()`. This becomes problematic when attempting to codegen indented code (e.g., for control flow in Python and / or nested subgraph codegen-ing).

In this PR, we refactor the device guard codegen-ing in Python by replacing the `device_cm_stack` by explicit indent and unindent calls for entering and exiting the `with torch.cuda._DeviceGuard` context manager. This allows for nested device guard context managers and better aligns with other indented codegen-ing intertwined with it (e.g., for nested subgraph codegen-ing).

This is necessary for the upcoming support for `torch.cond` (and other control flow operators) in Inductor. Before that, the only change in the Python wrapper codegen is that the `return outputs` is now happening outside the `with torch.cuda._DeviceGuard` context manager.
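
As an illustration (not inductor's actual wrapper-codegen class), explicit `indent()`/`unindent()` calls compose with nested codegen in a way that a single `ExitStack` of indents, unwound all at once, cannot:

```python
import io

class IndentedWriter:
    def __init__(self):
        self.buf, self.level = io.StringIO(), 0
    def writeline(self, line):
        self.buf.write("    " * self.level + line + "\n")
    def indent(self):     # explicit calls can be nested arbitrarily,
        self.level += 1   # e.g. device guard -> subgraph -> device guard
    def unindent(self):
        self.level -= 1

w = IndentedWriter()
w.writeline("with torch.cuda._DeviceGuard(0):")
w.indent()
w.writeline("# ... kernel calls, possibly nested subgraph code ...")
w.unindent()
w.writeline("return outputs  # emitted outside the device guard, as described")
print(w.buf.getvalue())
```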

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119673
Approved by: https://github.com/peterbell10
2024-02-13 15:05:30 +00:00
f4b5f710e8 Fix typo in private attr of inference_mode (#119167)
This PR amends #102642.

`torch.inference_mode`'s attribute to store the actual context is inconsistent between `__init__` and `__enter__`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119167
Approved by: https://github.com/albanD
2024-02-13 14:59:59 +00:00
3629287151 Implement analysis for for-loops (#119730)
This PR adds support for for-loop parsing and analysis. While doing so, I ran into some constant value and function name problems so I fixed them as well. Technically, it should be possible to break this into multiple PRs but since these are small, I'm bundling them together.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119730
Approved by: https://github.com/aakhundov
2024-02-13 09:02:53 +00:00
2ae655b4f1 caffe2: remove support for specifically running "flaky tests" (#112007)
Summary:
In March 2019 D14468816 introduced some infra to mark tests as flaky
while still running them. In July 2019 D15797371 removed the last use of this
feature. Remove the related code as well.

Test Plan: ci

Reviewed By: mlogachev

Differential Revision: D50601204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112007
Approved by: https://github.com/malfet
2024-02-13 07:56:37 +00:00
60148f1761 [EZ] Set maximum supported version of Python as 3.12 (#119743)
Doesn't really affect anything other than metadata on PyPI website
Otherwise programming languages tab on https://pypi.org/project/torch/2.2.0/ shows supported version 3.8 to 3.10:
<img width="239" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/e17f9982-8833-4cd8-b8d8-b2f1cb538548">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119743
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2024-02-13 06:56:32 +00:00
beb0041384 improve cuda graph symint logging msg (#119739)
Users were confused by `recording cudagraph tree for None`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119739
Approved by: https://github.com/mlazos
2024-02-13 06:26:36 +00:00
bfb9ea1a43 fix compile DTensor.from_local in trace_rule_look up (#119659)
There's a bug when converting from TorchVariable to trace rule lookups:
in some corner cases the DTensor.from_local call does not match the trace-name
rule lookup, resulting in a None lookup and a fallback to the
UserFunctionVariable, which makes the tracing silently wrong by tracing
into the DTensor.from_local function. It is not exactly clear yet why the
lookup failed.

This PR fixes the DTensor.from_local tracing to make sure that in every case
we hit the InGraphFunctionVariable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119659
Approved by: https://github.com/yifuwang
2024-02-13 05:21:19 +00:00
379183a0dd Skip log line if no tensors were dedupped (#119742)
Skips log line if nothing was dedupped. Avoids unhelpful logs like:
```
2024-02-13 01:31:52,113 _dedup_tensors.py:46 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119742
Approved by: https://github.com/Skylion007
2024-02-13 05:18:16 +00:00
a4c476a081 [BE] Use more GTest primitives in XPU unit tests (#119527)
# Motivation
Use `EXPECT_EQ` to refine XPU's UT when relying on gtest.

# Solution
use `EXPECT_EQ` directly instead of `ASSERT_EQ_XPU`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119527
Approved by: https://github.com/malfet
2024-02-13 05:18:03 +00:00
cyy
47a2e6b6b8 Fix C++20 build (#112333)
Currently C++20 builds fail because of an incorrect template initialization order. This PR adjusts the order of these classes and a constructor to address the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112333
Approved by: https://github.com/albanD
2024-02-13 05:10:19 +00:00
2bda6b4cb8 [DTensor] Only wait on AsyncCollectiveTensor after DTensor-based state dict loading (#119716)
Summary:
This PR serves as a follow-up fix to address numerical correctness concerns identified in PR #118197, and we should only wait on `AsyncCollectiveTensor`.

Without the change, we occasionally ran into the exception: `AttributeError("'Tensor' object has no attribute 'wait'")`

Test Plan:
**CI**:
Wait for the CI test

**Test with prod model**:
- Tested with models and no-longer ran into the exception after checkpoint loading.

Differential Revision: D53680406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119716
Approved by: https://github.com/fegin, https://github.com/Skylion007, https://github.com/wz337
2024-02-13 04:30:45 +00:00
2502a01110 Linear-BN Fusion: add precondition check (#119264)
Fixes #118990

The root cause is that `out_features` of Linear does not match `num_features` of BatchNorm, resulting in a shape mismatch while computing `fused_w` and `fused_b`. This can happen for linear-bn folding because the linear layer operates over the last dim, `(*, H_in)`, while the bn layer operates over the channel dim, `(N, C_in, H, W)`.

To preserve the shapes of the original linear weight and bias in linear-bn folding, check that linear `out_features` matches bn `num_features`. If they don't match, bn `num_features` needs to be 1 to broadcast.
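
A minimal sketch of that precondition (hypothetical helper, not the actual fusion code):

```python
import torch.nn as nn

def can_fold_linear_bn(linear: nn.Linear, bn: nn.BatchNorm1d) -> bool:
    # BN statistics must broadcast over the linear output: either match
    # out_features exactly, or be of size 1
    return bn.num_features in (linear.out_features, 1)
```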

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119264
Approved by: https://github.com/eellison
2024-02-13 04:16:34 +00:00
15ef52a015 [MPS] Enable conj and conj_physical (#119669)
The former is only available on MacOS 14+, but at least on older MacOSes it would raise an exception rather than returning a non-conjugated tensor

Preliminary step for enabling FFT ops (without it `ifft` would never work)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119669
Approved by: https://github.com/albanD
ghstack dependencies: #119681
2024-02-13 02:27:51 +00:00
214f06ae3a Revert "Add Accelerator device and shell hooks (#119329)"
This reverts commit 4b9568a360c4a90220e78e43435be8c56bc33fb2.

Reverted https://github.com/pytorch/pytorch/pull/119329 on behalf of https://github.com/huydhn due to Breaks internal build and requires OSS file update to fix it ([comment](https://github.com/pytorch/pytorch/pull/119329#issuecomment-1940278598))
2024-02-13 02:23:45 +00:00
7d4b666870 Revert "[BE] Properly mark destructor overrides (#119656)"
This reverts commit 069581b3ca354c3b34079d23bc237442d6f28cc3.

Reverted https://github.com/pytorch/pytorch/pull/119656 on behalf of https://github.com/huydhn due to I need to revert this to unblock the revert of https://github.com/pytorch/pytorch/pull/119329#issuecomment-1939637967 and will reland this after resolving the conflicts ([comment](https://github.com/pytorch/pytorch/pull/119656#issuecomment-1940270997))
2024-02-13 02:20:45 +00:00
2921c2b3d9 [mypy] refactor mkldnn_fusion._is_valid_binary to avoid [union-attr] has no attribute (#119085)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119085
Approved by: https://github.com/Skylion007
2024-02-13 02:13:46 +00:00
db228f1efd [Lint] replace [assigment] with [method-assign] for methods (#119706)
started with TODO fix from here https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_utils.py#L746
using ignore[method-assign] instead of ignore[assignment]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119706
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kit1980
2024-02-13 02:06:04 +00:00
9f8c84a399 Revert "Add missing include for internal build (#119721)"
This reverts commit e0cabebad94f1cf35742f8ec14f9938be3a195ab.

Reverted https://github.com/pytorch/pytorch/pull/119721 on behalf of https://github.com/huydhn due to This fixes the build failures but there is still an issue with the missing libcaffe2_torch_fb_sparsenn_sparsenn_operators_gpu.so on D53686094 ([comment](https://github.com/pytorch/pytorch/pull/119721#issuecomment-1940191340))
2024-02-13 01:56:12 +00:00
ea8e4fd5ac Support FunctoolsPartialVariable::get_function, fix NamedTupleVariable::as_proxy and handle call_function in get_fake_values_from_nodes (#119435)
partially address https://github.com/pytorch/pytorch/issues/118785
This diff fixes three things:
1. Adds get_function to FunctoolsPartialVariable. Note that it will be available only if all args are constant;
otherwise it would throw unimplemented in the call to asPythonConstant.

2. NamedTupleVariable takes its args dispatched, not as a list (e.g. NamedTuple(a, b, c) vs NamedTuple([a, b, c]));
fix that by specializing asProxy.

3. A call to create_arg from within create_proxy changes a Python NamedTuple into a function call node without
associating an example value! Updated get_fake_values_from_nodes to handle such a case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119435
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #119314
2024-02-13 01:44:08 +00:00
74d55b0e63 [dynamo] Support torch.distributed.fsdp._flat_param._same_storage_size (#119627)
Replaces #117690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119627
Approved by: https://github.com/Skylion007
2024-02-13 01:27:37 +00:00
472500e32a Revert "Avoid performing replacements when it would unrefine ranges (#117356)"
This reverts commit 0e6b314fc2e7c965717e939a4e457a9b9d7e133e.

Reverted https://github.com/pytorch/pytorch/pull/117356 on behalf of https://github.com/huydhn due to Sorry for reverting the change but it looks like the forward fix still needs more work https://github.com/pytorch/pytorch/pull/119712, so it would be cleaner to reland them ([comment](https://github.com/pytorch/pytorch/pull/117356#issuecomment-1940032407))
2024-02-13 01:16:58 +00:00
2492f8748e Revert "Improve TORCHDYNAMO_EXTENDED_DEBUG for GuardOnDataDependentSymNode (#119412)"
This reverts commit f208795182a22ebaef84a284750669fa372157cb.

Reverted https://github.com/pytorch/pytorch/pull/119412 on behalf of https://github.com/huydhn due to Sorry for reverting the change but it looks like the forward fix still needs more work https://github.com/pytorch/pytorch/pull/119712, so it would be cleaner to reland them ([comment](https://github.com/pytorch/pytorch/pull/119412#issuecomment-1939937937))
2024-02-13 00:52:19 +00:00
830ed6d9b2 [quant][pt2] Fix _disallow_eval_train error message (#119694)
Fix the message to use the right function name.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119694
Approved by: https://github.com/tugsbayasgalan
2024-02-13 00:17:53 +00:00
55483fc2c9 Min-cut partitioner always saves tensors that are returned as-is in backward (#114970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114970
Approved by: https://github.com/Chillee
2024-02-13 00:04:41 +00:00
bd9db6a9c7 Update to TorchFix 0.4.0 (#119424)
`torch.library.Library` updated to `torch.library._scoped_library` in files with many tests where it seems obvious to do, otherwise `noqa: TOR901` added - see https://github.com/pytorch/pytorch/pull/118318 for more context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119424
Approved by: https://github.com/zou3519
2024-02-12 23:30:12 +00:00
5acd1f0f7d Add cherry-pick workflow (#119352)
After https://github.com/pytorch/test-infra/pull/4758, we can create a new workflow on PyTorch to receive `try-cherry-pick` dispatch event from the bot, and create the cherry pick PR.

* [x] Cherry pick a PR after it has been landed and create a cherry pick PR to the target release branch.
* [ ] The second part after this is to update the release tracker with the info.  This will be done in a subsequent PR.
* [ ] ghstack is not yet supported
* [ ] Cherry pick a reverted commit is not yet supported (from @kit1980 comment)

### Testing

The script can be used locally:

```
python cherry_pick.py --onto release/2.2 --classification release --github-actor huydhn 118907
The cherry pick PR is at https://github.com/pytorch/pytorch/pull/119351
```

The test cherry pick PR is created at https://github.com/pytorch/pytorch/pull/119351

Unit testing this on CI is tricky, so I test this out on canary instead.

* https://github.com/pytorch/pytorch-canary/pull/193#issuecomment-1933162707 creates the PR at https://github.com/pytorch/pytorch-canary/pull/201
  * One more test on canary with the new token https://github.com/pytorch/pytorch-canary/pull/193#issuecomment-1933229483.  The minimum required permission from what I see is `workflow`
* Cherry picking conflicts could happen and needs to be handled manually https://github.com/pytorch/pytorch-canary/pull/194#issuecomment-1933142975
* ~Require a linked issue when cherry picking regressions, critical fixes, or fixing new features https://github.com/pytorch/pytorch-canary/pull/193#issuecomment-1933174520~ Relax this requirement to a suggestion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119352
Approved by: https://github.com/atalman
2024-02-12 23:12:10 +00:00
suo
f15b517055 [export] suppress type error (#119720)
Differential Revision: [D53681243](https://our.internmc.facebook.com/intern/diff/D53681243/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119720
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-02-12 22:54:36 +00:00
b3df3e4e94 Restore OpInfo/ModuleInfo tests in Inductor-wrapped tests (#119693)
I accidentally disabled this without realizing it. It turns out that
PYTORCH_TEST_WITH_INDUCTOR=1 implies PYTORCH_TEST_WITH_DYNAMO=1, which
activates skipIfTorchDynamo decorators.

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119693
Approved by: https://github.com/bdhirsh
2024-02-12 22:44:45 +00:00
e0cabebad9 Add missing include for internal build (#119721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119721
Approved by: https://github.com/huydhn
2024-02-12 22:36:16 +00:00
70c93c6097 [inductor] Update JIT Inductor cpp wrapper entry function signature (#119280)
Summary: Change JIT Inductor cpp wrapper entry function to use similar signature as AOTInductor, i.e. using an array of AtenTensorHandle instead of a vector of at::Tensor as the inputs and return output through a pointer. This makes it easier to consolidate the ABI compatible and non-compatible modes.

Differential Revision: [D53478825](https://our.internmc.facebook.com/intern/diff/D53478825)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119280
Approved by: https://github.com/chenyang78
2024-02-12 22:24:35 +00:00
02b60e76c9 make flash_attn_bw impl correct w.r.t. meta when k and v have different strides (#119500)
`dv = at::empty_like(k)` and `dv = at::empty_like(v)` can be materially different, because `empty_like` tries to preserve the strides of the input when possible. So if `k` is contiguous but `v` is transposed, then before this PR, `dv` would be computed to be contiguous.
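A quick way to see the stride-preservation behavior of `empty_like` (an illustration, not part of the PR):

```python
import torch

k = torch.randn(2, 16, 513, 64)                  # contiguous
v = torch.randn(2, 513, 16, 64).transpose(1, 2)  # same shape, transposed strides

print(torch.empty_like(k).is_contiguous())  # True
print(torch.empty_like(v).is_contiguous())  # False, strides of v are preserved
```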

Alternatively, we could change the meta implementation of `aten._scaled_dot_product_flash_attention` to this:
```
    grad_q = torch.empty_like(query.transpose(1, 2)).transpose(1, 2)
    grad_k = torch.empty_like(key.transpose(1, 2)).transpose(1, 2)
    grad_v = torch.empty_like(key.transpose(1, 2)).transpose(1, 2)
    return grad_q, grad_k, grad_v
```

But (I think?) the logic in the sdpa backward impl was a typo.

I noticed this because changing the meta formula as above was enough to fix the issue with the `aot_eager` backend in this [link](https://github.com/pytorch/pytorch/issues/116935#issuecomment-1914310523).

A minimal repro that I made looks like this:
```
import torch

# in this repro, "grad_out" and "value" are transposed tensors,
# but "key" and "value" are contiguous
a = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
b = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
c = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
d = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
e = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
f = torch.randn(2, 16, 513, device='cuda')
g = None
h = None
i = 513
j = 513
k = 0.0
l = False
m = torch.tensor(1, dtype=torch.int64)
n = torch.tensor(1, dtype=torch.int64)

out1_ref, out2_ref, out3_ref = torch.ops.aten._scaled_dot_product_flash_attention_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)

from torch._meta_registrations import meta__scaled_dot_product_flash_backward
out1_test, out2_test, out3_test = meta__scaled_dot_product_flash_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)

# prints True True
print(out1_ref.is_contiguous())
print(out1_test.is_contiguous())

# prints True True
print(out2_ref.is_contiguous())
print(out2_test.is_contiguous())

# prints True False
print(out3_ref.is_contiguous())
print(out3_test.is_contiguous())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119500
Approved by: https://github.com/drisspg, https://github.com/ezyang, https://github.com/Skylion007
2024-02-12 22:12:29 +00:00
cyy
10789ccd83 Remove redundant CMake NUMA code (#119650)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119650
Approved by: https://github.com/ezyang
2024-02-12 21:53:44 +00:00
34a61c527b Revert "Enable x86 CPU vectorization on windows (#118980)"
This reverts commit 5f69d95b2b303382fe4cf301e73e36414c879c5c.

Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/huydhn due to This is breaking Window binary build https://github.com/pytorch/pytorch/actions/runs/7874475000/job/21484997298 where it failed to build sleef ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-1939619212))
2024-02-12 21:33:14 +00:00
cyy
10f3abc6b8 [DeviceIndex][3/N] Use DeviceIndex in more places (#119635)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119635
Approved by: https://github.com/ezyang
2024-02-12 21:31:27 +00:00
064b61009b Correctly formatting the example in get_state_dict (#119532)
This PR corrects the example formatting in https://pytorch.org/docs/stable/distributed.checkpoint.html. In the linked issue, @wz337 also commented that the return type was not showing up correctly. I didn't see any formatting issue there, but I could be wrong.

Fixes #118837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119532
Approved by: https://github.com/fegin
2024-02-12 21:28:22 +00:00
ad217d4266 [ez] Add try catch for deleting old branches (#119696)
I think some characters in branch names affect the API calls, so just assume those branches are protected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119696
Approved by: https://github.com/huydhn
2024-02-12 21:08:59 +00:00
7eecbf8a30 Remove unnecessary skipIfTorchDynamo from test_jit_fuser_te (#118728)
And add some expected failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118728
Approved by: https://github.com/bdhirsh
2024-02-12 20:55:29 +00:00
28c30f29be Update documentation for set_flush_denormal support on ARM (#119354)
**Documentation update for set_flush_denormal():**
-> set_flush_denormal() is now supported on ARM CPUs.
-> **PR:** https://github.com/pytorch/pytorch/pull/115184  (Already merged)

**Reference page:** https://pytorch.org/docs/stable/generated/torch.set_flush_denormal.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119354
Approved by: https://github.com/drisspg
2024-02-12 20:53:22 +00:00
7d780ff86f Revert "Enable fake tensor caching in fbcode by default (#118555)"
This reverts commit 0f2fbbff109cbc184a6a88247813dbcddaea2e5f.

Reverted https://github.com/pytorch/pytorch/pull/118555 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing one model test internally. Please take a look at the diff for more info D53189048 ([comment](https://github.com/pytorch/pytorch/pull/118555#issuecomment-1939550273))
2024-02-12 20:51:23 +00:00
110919c984 Check QNNPACK support for the platform before running test (#119139)
Do not run test ConstantPropagation.CustomClassesCanBePropagated on a platform where QNNPACK is not supported.

For example, this test fails on M1 Mac because QNNPACK is not supported there:
[----------] 1 test from ConstantPropagation
[ RUN      ] ConstantPropagation.CustomClassesCanBePropagated
unknown file: Failure
as described in more detail in issue #88613.

After the PR, the test passes successfully as below:
[----------] 1 test from ConstantPropagation
[ RUN      ] ConstantPropagation.CustomClassesCanBePropagated
[       OK ] ConstantPropagation.CustomClassesCanBePropagated (0 ms)
[----------] 1 test from ConstantPropagation (0 ms total)

Fixes #88613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119139
Approved by: https://github.com/jcaip
2024-02-12 20:21:07 +00:00
7adfeba47a Add Python 3.12 as experimental to release 2.2 (#119705)
Add 3.12 as experimental version to Release 2.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119705
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2024-02-12 20:13:54 +00:00
suo
82248f0b1c [export] improve FakeTensor serialization (#119531)
Recently we made it possible to serialize ExportedPrograms with fake parameters/buffers/etc.

The serialization regime was kind of whacky; basically we serialized a stub and reassembled the FakeTensor using metadata that we had stashed elsewhere in the Graph state.

This was bad for a few reasons:
- Storing the metadata separately from the actual serialized object caused situations where you could have one but not the other. An example case is if you had a FakeTensor contained inside a TorchBind object—there was no obvious place to store the metadata for this. This actually happens—TensorQueue in fbgemm does this.
- It created an annoying cycle: we had to deserialize the Graph's tensor metadata in order to deserialize (potentially faked) constants, but we need constants in order to deserialize the Graph.

This fixes all that. The basic idea is to patch the reducer function for FakeTensor at serialization time, and serialize a copy of the FakeTensor metadata. We already are policing BC for the TensorMeta schema struct so it's not a net increase in the BC surface.
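As a loose analogy only (the export serializer does this differently, and directly on FakeTensor), the "patch the reducer" idea can be expressed with Python's standard `copyreg` hook, using hypothetical names:

```python
import copyreg
import pickle

class FakeTensorStub:  # hypothetical stand-in for FakeTensor, for illustration
    def __init__(self, shape, dtype):
        self.shape, self.dtype = shape, dtype

def _reduce_fake(t):
    # Serialize the metadata together with the object instead of stashing it
    # elsewhere in the graph state.
    return (FakeTensorStub, (t.shape, t.dtype))

copyreg.pickle(FakeTensorStub, _reduce_fake)
blob = pickle.dumps(FakeTensorStub((2, 3), "float32"))
roundtripped = pickle.loads(blob)
```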

As a bonus, I fixed a weird bug with torchbind tracing where we were accidentally reinterpreting a torch.ScriptObject as a torch.ScriptModule (which was the root cause of some weird behavior @bahuang was seeing last week).

Differential Revision: [D53601251](https://our.internmc.facebook.com/intern/diff/D53601251/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119531
Approved by: https://github.com/zhxchen17
2024-02-12 19:28:08 +00:00
482345d747 Refactor out shape test into InputMetadata::maybe_reduce (#119559)
I'm going to gut this function shortly, and having it all on
InputMetadata is convenient for this purpose.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119559
Approved by: https://github.com/soulitzer
2024-02-12 19:27:48 +00:00
c24b74efc7 Revert "Optimize multi_tensor_apply (#119153)"
This reverts commit 24be7daf799ed94e1964e2ce440ccaad15962719.

Reverted https://github.com/pytorch/pytorch/pull/119153 on behalf of https://github.com/yifuwang due to This PR is breaking cuda graph for multi_tensor_apply ([comment](https://github.com/pytorch/pytorch/pull/119153#issuecomment-1939365823))
2024-02-12 19:11:29 +00:00
8d8fb9783c [MPS][EZ] Fix cfloat->chalf conversion on MacOS13 (#119681)
By using `view_as_real` when type casting between two complex types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119681
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-02-12 19:09:10 +00:00
eb0f9efd31 fix is_ and is_not (#118978)
Fix issue https://github.com/pytorch/pytorch/issues/118805

Note: this was a refresh PR of https://github.com/pytorch/pytorch/pull/118806
discussion there is relevant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118978
Approved by: https://github.com/lezcano
2024-02-12 19:04:40 +00:00
0e5b6594b7 [Dynamo] Minor cleanup of redundant function lookup logics (#119666)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119666
Approved by: https://github.com/angelayi
2024-02-12 19:00:39 +00:00
ed20e9118b Fixed hash issue in fx_graph_cse (#119567)
Description:
- Fixed issue with hash collision for `hash((primals_2, 1.0)) == hash((primals_2, 1))`
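The underlying Python behavior, for reference (not the PR's code): equal ints and floats hash and compare equal, so a table keyed on (node, literal) conflates `sub(x, 1)` with `sub(x, 1.0)` unless the literal's type is taken into account.

```python
assert hash(1) == hash(1.0) and 1 == 1.0

table = {("sub", 1): "sub"}
print(("sub", 1.0) in table)  # True: the float variant is wrongly treated as a duplicate
```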

Repro code:
```python
import torch
from torch._functorch.compile_utils import fx_graph_cse

def func(inpt, osize):
    size = inpt.shape[-1]
    s1 = size - 1
    s2 = size - 1.0
    scale = s2 / (osize - 1.0)
    inpt = torch.clamp(inpt, 0, s1)
    return scale * inpt

gms = []
def toy_backend(gm, _):
    gms.append(gm)
    return gm.forward

torch._dynamo.reset()
fn = torch.compile(backend=toy_backend, dynamic=True)(func)
t = torch.rand(3, 100)
out = fn(t, 50)
gm = gms[0]

print(gm.graph)
new_fx_g = fx_graph_cse(gm.graph)
print(str(new_fx_g))
```
Original graph
```
graph():
    %s0 : torch.SymInt [num_users=0] = placeholder[target=s0]
    %s1 : torch.SymInt [num_users=0] = placeholder[target=s1]
    %l_inpt_ : torch.Tensor [num_users=2] = placeholder[target=L_inpt_]
    %l_osize_ : torch.SymInt [num_users=1] = placeholder[target=L_osize_]
    %size : [num_users=1] = call_method[target=size](args = (%l_inpt_,), kwargs = {})
    %getitem_1 : [num_users=2] = call_function[target=operator.getitem](args = (%size, 1), kwargs = {})
    %sub : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1), kwargs = {})
    %sub_1 : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1.0), kwargs = {})
    %sub_2 : [num_users=1] = call_function[target=operator.sub](args = (%l_osize_, 1.0), kwargs = {})
    %truediv : [num_users=1] = call_function[target=operator.truediv](args = (%sub_1, %sub_2), kwargs = {})
    %inpt : [num_users=1] = call_function[target=torch.clamp](args = (%l_inpt_, 0, %sub), kwargs = {})
    %mul : [num_users=1] = call_function[target=operator.mul](args = (%truediv, %inpt), kwargs = {})
    return (mul,)
```
New wrong graph where `sub_2` is replaced incorrectly with `sub`:
```
graph():
    %s0 : torch.SymInt [num_users=0] = placeholder[target=s0]
    %s1 : torch.SymInt [num_users=0] = placeholder[target=s1]
    %l_inpt_ : torch.Tensor [num_users=2] = placeholder[target=L_inpt_]
    %l_osize_ : torch.SymInt [num_users=1] = placeholder[target=L_osize_]
    %size : [num_users=1] = call_method[target=size](args = (%l_inpt_,), kwargs = {})
    %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%size, 1), kwargs = {})
    %sub : [num_users=2] = call_function[target=operator.sub](args = (%getitem_1, 1), kwargs = {})
    %sub_2 : [num_users=1] = call_function[target=operator.sub](args = (%l_osize_, 1.0), kwargs = {})
    %truediv : [num_users=1] = call_function[target=operator.truediv](args = (%sub, %sub_2), kwargs = {})
    %inpt : [num_users=1] = call_function[target=torch.clamp](args = (%l_inpt_, 0, %sub), kwargs = {})
    %mul : [num_users=1] = call_function[target=operator.mul](args = (%truediv, %inpt), kwargs = {})
    return (mul,)
```
With this PR the new graph is the following:
```
graph():
    %s0 : torch.SymInt [num_users=0] = placeholder[target=s0]
    %s1 : torch.SymInt [num_users=0] = placeholder[target=s1]
    %l_inpt_ : torch.Tensor [num_users=2] = placeholder[target=L_inpt_]
    %l_osize_ : torch.SymInt [num_users=1] = placeholder[target=L_osize_]
    %size : [num_users=1] = call_method[target=size](args = (%l_inpt_,), kwargs = {})
    %getitem_1 : [num_users=2] = call_function[target=operator.getitem](args = (%size, 1), kwargs = {})
    %sub : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1), kwargs = {})
    %sub_1 : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1.0), kwargs = {})
    %sub_2 : [num_users=1] = call_function[target=operator.sub](args = (%l_osize_, 1.0), kwargs = {})
    %truediv : [num_users=1] = call_function[target=operator.truediv](args = (%sub_1, %sub_2), kwargs = {})
    %inpt : [num_users=1] = call_function[target=torch.clamp](args = (%l_inpt_, 0, %sub), kwargs = {})
    %mul : [num_users=1] = call_function[target=operator.mul](args = (%truediv, %inpt), kwargs = {})
    return (mul,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119567
Approved by: https://github.com/eellison
2024-02-12 18:52:11 +00:00
27ffede878 [reland] Fix estimate_nccl_collective_runtime (#118986)
`estimate_nccl_collective_runtime` has been broken and the errors have been silently swallowed by inductor. This PR:
- Fixes the issues described in https://github.com/pytorch/pytorch/issues/118497.
- Adds white-box testing so future issues can be surfaced in tests.
- Add support for native funcol IRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118986
Approved by: https://github.com/yf225
ghstack dependencies: #119102
2024-02-12 18:48:06 +00:00
b2043c0543 [c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)
Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421
Approved by: https://github.com/shuqiangzhang
2024-02-12 18:45:49 +00:00
893dcac068 [c10d] explicitly abort communicators in destroy_process_group call (#119250)
Summary:
This PR tries to resolve issue #119215.

Basically, process group shutdown (and hence ncclCommAbort) is called in
the destroy_process_group APIs for the corresponding PGs, and in the
destructor of ProcessGroup we avoid calling abort/ncclCommAbort.
Instead, the destructor just checks whether the user has already called destroy_process_group explicitly.
If not, it will log a warning and encourage/expect users to do so
to clean up the PG's resources.
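A minimal sketch of the cleanup pattern this expects from users (assumes the usual launcher environment variables such as RANK and WORLD_SIZE are set):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
t = torch.ones(1, device=f"cuda:{dist.get_rank()}")
dist.all_reduce(t)
# Explicitly shut the process group down so its NCCL communicators are aborted
# here, rather than relying on the ProcessGroup destructor.
dist.destroy_process_group()
```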

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119250
Approved by: https://github.com/minsii, https://github.com/kwen2501
2024-02-12 18:40:28 +00:00
31f00b0160 Clarify that legacy cat behavior only applies for 1-D tensor (#119684)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119684
Approved by: https://github.com/albanD
2024-02-12 18:13:04 +00:00
059bf1baa4 Separate clang lint? (#119575)
25 min -> 17 + 13 min, which is still not as fast as I want it to be but I'll take it
Lintrunner provides some parallelism by default, but it's not perfect

Reducing fetch-depth from all to 1 further reduces time by ~2-3 minutes

From non clang's logs:
```
2024-02-09T22:05:39.5297616Z Requirement already satisfied: PyYAML==6.0 in /opt/conda/lib/python3.11/site-packages (6.0)
2024-02-09T22:12:23.6164708Z Collecting black==23.12.1
```
I don't know why this part takes so long, maybe it's just buffering?  Clang version doesn't show this issue

See 5a750c8035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119575
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-02-12 17:46:31 +00:00
bc521f2ce3 In dynamo tracing for index() use None as the default indicator for end and not -1 (#119151)
Summary: In dynamo tracing, `index()`'s implementation currently has the default begin index as `0` and the default end index as `-1`, which means that by default we're dropping the last element. Rather, we should use `None`, which ensures that the last element is also checked.
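A plain-Python illustration of why `-1` is the wrong default for the end position (not the Dynamo code itself):

```python
xs = [1, 2, 3]
print(xs[0:-1])    # [1, 2]     an end of -1 silently excludes the last element
print(xs[0:None])  # [1, 2, 3]  None means "through the end", which index() needs
```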

Test Plan: CI

Differential Revision: D53392287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119151
Approved by: https://github.com/yanboliang
2024-02-12 17:45:05 +00:00
cf474a09f5 Decompose torch.ops.higher_order.auto_functionalized in Inductor (#118673)
We'd like to get auto_functionalized to work with AOTInductor. To get
there, we decompose `output = auto_functionalized(inplace_op, ...)` into its
corresponding aten ops (clones + inplace_op) before the Inductor lowering phase.
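Semantically, the decomposition does something like the following sketch, shown here with a hypothetical in-place op rather than the actual Inductor pass:

```python
import torch

def my_inplace_op_(x):  # hypothetical mutating op, for illustration only
    return x.relu_()

def decomposed(x):
    # A clone is inserted by the decomposition, then the original mutating op
    # runs on the clone, leaving the input untouched.
    x_clone = x.clone()
    my_inplace_op_(x_clone)
    return x_clone

print(decomposed(torch.tensor([-1.0, 2.0])))  # tensor([0., 2.])
```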

This decomposition must happen at the end of the Inductor FX passes
because it introduces in-place operations.

The pattern matcher's "replace this single node with multiple nodes" API
isn't robust enough here. The problem is that `auto_functionalized`
returns a single output (this output is a List), but the decomposition
ends up returning the unpacked List (e.g. it may return two tensors).
Previously, there was an assertion that this was not the case; I fixed
up `replace_with_graph` to handle this.

Future: Not all of the clones are necessary (e.g. if the input's last
usage is this operator, then we don't need to clone it). We can add this
logic later.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118673
Approved by: https://github.com/oulgen
2024-02-12 17:30:01 +00:00
8069b29603 [export] Implement logging for scuba. (#119585)
Summary: As we're growing the user surface of torch.export, we'd like to understand better how people are using our APIs. It's also possible to analyze the usages based on static analysis, but due to the fact that there could be many creative ways to call things in Python, I think just building some logging infra will benefit us in the short term and gain us some insights.

Test Plan:
buck test caffe2/test:test_export

Reviewed By: tugsbayasgalan

Differential Revision: D53618220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119585
Approved by: https://github.com/avikchaudhuri
2024-02-12 17:28:14 +00:00
757201c213 Refactor ExportedProgram to expose the functions for pre and postprocessing (#119513)
Reason:
Consumers of ExportedProgram might choose to further lower exported_program.graph_module
to something else.
They will then need to set up the calling convention to call it.

This refactor concentrates the calling-convention handling in one place so it can be reused.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119513
Approved by: https://github.com/zhxchen17
2024-02-12 17:22:27 +00:00
72d9a38118 add get_function to TorchInGraphFunctionVariable (#119314)
partially address https://github.com/pytorch/pytorch/issues/118785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119314
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-02-12 16:35:34 +00:00
1c1dc0e4e0 [sparse] Add in out_dtype support (i8i8->bf16, i32) for cusparselt (#119296)
Summary:

Adds in out_dtype support for (i8i8->bf16) and (i8i8->i32) matmul with
cuSPARSELt.

Test Plan:

```
python test/test_sparse_semi_structured.py -k mixed
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119296
Approved by: https://github.com/cpuhrsch, https://github.com/alexsamardzic
2024-02-12 16:02:36 +00:00
5f69d95b2b Enable x86 CPU vectorization on windows (#118980)
Enable VEC on Windows OS.
1. Fix some type definition gaps between Windows and Linux.
2. Fix some operators not supported on Windows, such as [] and /.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
2024-02-12 16:01:30 +00:00
52a3de6cbf [AOTI][refactor] Move ThreadLocalCachedOutputTensor into a separate header (#119392)
Summary: Move common functionality into a separate header so that later JIT and AOT Inductor can share it.

Test Plan: CI

Differential Revision: D53523452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119392
Approved by: https://github.com/khabinov
2024-02-12 15:56:16 +00:00
24bdd03d23 Revert "Reify view_func() closures as ViewFuncs (#118404)"
This reverts commit d5a6762263a98e5153bc057c8ba4f377542c7e55.

Reverted https://github.com/pytorch/pytorch/pull/118404 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/118404#issuecomment-1938600260))
2024-02-12 12:38:51 +00:00
79df897608 Fix some tests in test_c10d_functional_native.py (#119102)
Summary:
This PR fixes a few tests that were broken because `empty` became `empty_strided_cuda` in the generated code.

Also changed some _c10d_functional calls to funcol calls to add coverage to tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119102
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-02-12 09:28:18 +00:00
0342b227e5 Revert "[c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)"
This reverts commit f3e7d809936d9f1bf63102e8afe241e13ed8766a.

Reverted https://github.com/pytorch/pytorch/pull/119421 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119421#issuecomment-1938169747))
2024-02-12 07:34:20 +00:00
cyy
8a3c241094 Remove unused header inclusion (#119667)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119667
Approved by: https://github.com/Skylion007
2024-02-12 05:36:25 +00:00
dcb08a7044 Add CUDAEvent recording for constant folding to show up. (#119216)
Summary: Add a layer of calls so that CUDAEvent shows up for constant folding.

Test Plan: Existing tests

Differential Revision: D53437934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119216
Approved by: https://github.com/khabinov
2024-02-12 03:46:36 +00:00
bc4d0277cd [executorch hash update] update the pinned executorch hash (#119648)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119648
Approved by: https://github.com/pytorchbot
2024-02-12 03:42:07 +00:00
76fac69577 add a couple more cases to pointwise_cat perf tests (#119521)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119521
Approved by: https://github.com/ezyang, https://github.com/eellison
2024-02-12 03:41:08 +00:00
647564dbaa Implement conditional statements in kernel analysis (#119664)
This PR changes `ops` from a dict of RET => OP to RET => List[OP], since multiple OPs can now return the same RET. In real execution only one of these OPs will be executed, so there is no need to worry about renaming. For analysis, we pessimistically assume any one of them could be executed (which is the safest assumption for analysis purposes).

Example TTIRs that can now be handled:
```
    scf.if %13 {
      %14 = tt.get_program_id y : i32 loc(#loc13)
      %c0_i32_1 = arith.constant 0 : i32 loc(#loc14)
      %15 = arith.cmpi eq, %14, %c0_i32_1 : i32 loc(#loc14)
      scf.if %15 {
        %16 = arith.addf %8, %11 : tensor<4xf32> loc(#loc16)
        %17 = tt.splat %arg2 : (!tt.ptr<f32, 1>) -> tensor<4x!tt.ptr<f32, 1>> loc(#loc17)
        %18 = tt.addptr %17, %4 : tensor<4x!tt.ptr<f32, 1>>, tensor<4xi32> loc(#loc17)
        tt.store %18, %16, %5 {cache = 1 : i32, evict = 1 : i32} : tensor<4xf32> loc(#loc18)
      } else {
      } loc(#loc15)
    } else {
    } loc(#loc12)
```

and

```
    %14 = scf.if %13 -> (tensor<4xf32>) {
      %17 = arith.addf %8, %11 : tensor<4xf32> loc(#loc13)
      scf.yield %17 : tensor<4xf32> loc(#loc13)
    } else {
      %17 = arith.mulf %8, %11 : tensor<4xf32> loc(#loc14)
      scf.yield %17 : tensor<4xf32> loc(#loc14)
    } loc(#loc12)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119664
Approved by: https://github.com/aakhundov
2024-02-12 01:54:26 +00:00
663dd5d006 [inductor] Update the compile options for CppPythonBindingsCodeCache (#119415)
Differential Revision: [D53554681](https://our.internmc.facebook.com/intern/diff/D53554681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119415
Approved by: https://github.com/jansel, https://github.com/khabinov
2024-02-11 21:25:34 +00:00
069581b3ca [BE] Properly mark destructor overrides (#119656)
Otherwise, at least on MacOS builds are littered with:
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/DeviceAccelerator.h:6:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MTIAHooksInterface.h:23:11: warning: '~MTIAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~MTIAHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/CUDAHooksInterface.h:65:11: warning: '~CUDAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~CUDAHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MPSHooksInterface.h:21:11: warning: '~MPSHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~MPSHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
```

 Likely introduced by https://github.com/pytorch/pytorch/pull/119329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119656
Approved by: https://github.com/Skylion007
2024-02-11 21:07:16 +00:00
a4cc6b85dc [dynamo][eval][perf] Remove unnecessary dict copies. (#119305)
Both of these variables are already created using `dict(...)` so making yet another `dict` copy is pure overhead and boilerplate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119305
Approved by: https://github.com/Skylion007
2024-02-11 20:29:26 +00:00
e5f46a1d35 Check alignment of ReinterpretView args of custom Triton kernels (#119649)
Summary: Currently, when a custom (user-written) Triton kernel has a ReinterpretView argument in IR, we're always skipping the alignment checking for this argument when preparing the `signature_of` for the AOT compilation of the Triton kernel (via setting `TensorArg.check_alignment` to `False`). This is problematic for user-written kernels where, albeit reinterpreted, the argument of the Triton kernel (the data pointer) can still be aligned to 16. When we skip alignment checking, the performance of the AOT-compiled internal Triton kernels can degrade 2x--3x.

In this PR, we replace `TensorArg.check_alignment` by `TensorArg.offset`, in which we specify the offset of the `ReinterpretView.layout` relative to the underlying `ir.Buffer` (corresponding to the data pointer before reinterpretation). As the size and stride of the layout don't change the alignment properties, those can be skipped. Importantly, for `ReinterpretView` arguments of custom Triton kernels, we use `arg.data.get_name()` as the buffer name. That, together with the offset, is used to check the alignment.

Bonus: the namedtuples in `codegen/common.py` are refactored as `dataclass`es, with nicer type hints and default values (for the newly added `TensorArg.offset`).

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_reinterpret_view
...
----------------------------------------------------------------------
Ran 6 tests in 27.952s

OK (skipped=4)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119649
Approved by: https://github.com/oulgen
2024-02-11 20:21:17 +00:00
b8e4423278 [torch][cuda][perf] Avoid unnecessary dicts. (#118011)
It's unnecessary and inefficient to create a `dict` from list indices to list values just to check whether a particular `idx` exists there. That approach has `O(N)` time and space complexity, whereas checking against the `list` directly is `O(1)` in both.
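A plain-Python illustration of the pattern being removed (names are illustrative, not from the patch):

```python
xs = ["a", "b", "c"]
idx = 1

# Wasteful: materialize a dict of index -> value just to test membership.
exists = idx in dict(enumerate(xs))   # O(n) time and extra space

# Equivalent check against the list itself.
exists = 0 <= idx < len(xs)           # O(1)
```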

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118011
Approved by: https://github.com/Skylion007
2024-02-11 19:29:24 +00:00
95a8d5b1bc [random] Replace for loop with list comprehension. (#119143)
It's more idiomatic and efficient.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119143
Approved by: https://github.com/Skylion007
2024-02-11 19:29:19 +00:00
4394e0dc2c [inductor] Use list comprehension to initialize unused_views. (#119618)
It's more idiomatic and efficient.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119618
Approved by: https://github.com/Skylion007
2024-02-11 18:57:18 +00:00
24be7daf79 Optimize multi_tensor_apply (#119153)
### Summary

Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops.

Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.

This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.
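For reference, the op used in the benchmark below; a sketch that assumes a CUDA device is available:

```python
import torch

if torch.cuda.is_available():
    srcs = [torch.randn(n, device="cuda") for n in (1 << 10, 1 << 14, 1 << 18)]
    dsts = [torch.empty_like(s) for s in srcs]
    # One horizontally fused multi_tensor_apply launch instead of a Python loop of copies.
    torch._foreach_copy_(dsts, srcs)
```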

### Benchmark (WIP)

The only benchmark I've conducted so far on `_foreach_copy_` on a set of sizes that resembles internal workload. I need to benchmarks on more problem sizes. The speedup should vary among problem sizes. **However, I believe this PR should not be slower than the previous impl on any problem sizes.**

The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa).

**Baseline**

A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```

**This PR**

A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119153
Approved by: https://github.com/janeyx99
2024-02-11 18:12:22 +00:00
2c91e13afc Add lowerings to special functions (#119187)
As in the title.

In addition, the PR introduces infrastructure for lowerings of pointwise functions that have both cpp and triton implementations available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119187
Approved by: https://github.com/peterbell10
2024-02-11 16:35:40 +00:00
4ee8aac432 [MPS] Enable bfloat16 support on MacOS 14 (#119641)
Per [MPSDataType](https://developer.apple.com/documentation/metalperformanceshaders/mpsdatatype/mpsdatatypebfloat16?changes=_11&language=objc) documentation bfloat16 are supported in MacOS Sonoma or later

Added missing `MPSDataTypeBFloat16` and `MTLLanguageVersion3_1` enums to `MPSGraphSonomaOps.h`

TODO: Enable more testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119641
Approved by: https://github.com/Skylion007
2024-02-11 16:25:29 +00:00
68e009dd8f [BE][EZ] Use dispatch_sync_with_rethrow in searchsorted (#119646)
For proper exception handling; otherwise, raising a C++ exception inside a dispatch block will crash the app (discovered while enabling more BFloat16 ops).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119646
Approved by: https://github.com/Skylion007
2024-02-11 07:19:00 +00:00
6cd82253ae fix torch.set_float32_matmul_precision doc (#119620)
Fixes #119606; clarify the explicitly stored number of bits in the doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119620
Approved by: https://github.com/eqy, https://github.com/malfet
2024-02-11 06:41:37 +00:00
cyy
88183923d2 Remove unneeded linking of torch_shm_manager in CMake (#119540)
This PR aims to clean up torch_shm_manager dependency in CMake.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119540
Approved by: https://github.com/ezyang
2024-02-11 06:33:35 +00:00
0bed0501fa Don't skip register-spilling configs in custom Triton kernel auto-tuning (#119634)
Summary: There has been some empirical evidence that, for (non-trivial) custom (user-written) Triton kernels, a register-spilling config yields the best result in auto-tuning. For this reason, we don't skip register-spilling configs when auto-tuning custom Triton kernels.

<details>
<summary>An example of auto-tuning result with the register-spilling config outperforming others</summary>

```
BLOCK_M: 16, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.748896, nreg 255, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.723424, nreg 249, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 2.202656, nreg 190, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.748256, nreg 255, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.724896, nreg 249, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 2.201632, nreg 190, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.651664, nreg 255, nspill 56, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.846368, nreg 255, nspill 14, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.841792, nreg 243, nspill 0, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.651584, nreg 255, nspill 56, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.846432, nreg 255, nspill 14, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.841904, nreg 243, nspill 0, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.236448, nreg 255, nspill 254, #shared-mem 22528
BLOCK_M: 16, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.484384, nreg 255, nspill 174, #shared-mem 22528
BLOCK_M: 16, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.131168, nreg 255, nspill 6, #shared-mem 22528
BLOCK_M: 16, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.236544, nreg 255, nspill 254, #shared-mem 22528
BLOCK_M: 16, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.483648, nreg 255, nspill 174, #shared-mem 22528
BLOCK_M: 16, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.131408, nreg 255, nspill 6, #shared-mem 22528
BLOCK_M: 32, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.516112, nreg 255, nspill 28, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.737792, nreg 255, nspill 0, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.411632, nreg 193, nspill 0, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.515904, nreg 255, nspill 28, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.736608, nreg 255, nspill 0, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.409808, nreg 193, nspill 0, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.553536, nreg 255, nspill 130, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.569792, nreg 255, nspill 56, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.892448, nreg 255, nspill 4, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.553584, nreg 255, nspill 130, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.569568, nreg 255, nspill 56, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.892240, nreg 255, nspill 4, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.332928, nreg 255, nspill 366, #shared-mem 28672
BLOCK_M: 32, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.922256, nreg 255, nspill 228, #shared-mem 28672
BLOCK_M: 32, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.758400, nreg 255, nspill 26, #shared-mem 28672
BLOCK_M: 32, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.333440, nreg 255, nspill 366, #shared-mem 28672
BLOCK_M: 32, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.922336, nreg 255, nspill 228, #shared-mem 28672
BLOCK_M: 32, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.758496, nreg 255, nspill 26, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.231648, nreg 255, nspill 292, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.639424, nreg 255, nspill 90, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.917952, nreg 240, nspill 0, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.230624, nreg 255, nspill 292, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.639168, nreg 255, nspill 90, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.917440, nreg 240, nspill 0, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.838080, nreg 255, nspill 354, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.569184, nreg 255, nspill 178, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.614720, nreg 255, nspill 28, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.838048, nreg 255, nspill 354, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.569472, nreg 255, nspill 178, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.615104, nreg 255, nspill 28, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.012128, nreg 255, nspill 522, #shared-mem 40960
BLOCK_M: 64, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.861536, nreg 255, nspill 378, #shared-mem 40960
BLOCK_M: 64, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.771584, nreg 255, nspill 134, #shared-mem 40960
BLOCK_M: 64, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.012512, nreg 255, nspill 522, #shared-mem 40960
BLOCK_M: 64, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.861024, nreg 255, nspill 378, #shared-mem 40960
BLOCK_M: 64, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.771712, nreg 255, nspill 134, #shared-mem 40960
```

</details>

In the above, the winning config is `BLOCK_M: 32, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2`, although it has non-zero `nspill 28`. This is an example where we need to consider all configs, including the register-spilling ones, to obtain the best result from auto-tuning.

In the worst case, this will just make auto-tuning longer, but can't regress the results. And, as the number of custom Triton kernels in the model is normally much smaller than the number of Inductor-generated ones, this should be acceptable.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119634
Approved by: https://github.com/oulgen
2024-02-11 02:13:25 +00:00
3ab08946d5 Revert "[aot_inductor] move CudaWrapperCodeGen into a separate file (#119448)"
This reverts commit 0597dab523c0a341e136452a8f723f12700164c0.

Reverted https://github.com/pytorch/pytorch/pull/119448 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119448#issuecomment-1937345167))
2024-02-10 23:04:36 +00:00
d8e319a961 Revert "[aot_inductor] move CppWrapperCodeGen into a separate file (#119491)"
This reverts commit 760056bbdc552314e7e81adc45e11766ac0f333c.

Reverted https://github.com/pytorch/pytorch/pull/119491 on behalf of https://github.com/DanilBaibak due to Reverted as a dependency for #119448 ([comment](https://github.com/pytorch/pytorch/pull/119491#issuecomment-1937344548))
2024-02-10 23:02:05 +00:00
6db6a1b526 [aten] Use emplace instead of insert. (#119614)
This avoids pair construction when the inserted key is already present in the dict.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119614
Approved by: https://github.com/Skylion007
2024-02-10 22:35:00 +00:00
2c8722182e [dynamo][guards] Avoid unnecessary stack copies. (#119115)
There is no need to make a `frame_summary_stack` copy when it is not modified. The proposed change uses a copy-on-write functional approach that is easy to understand and is more efficient when `self.loc_in_frame` is `None`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119115
Approved by: https://github.com/Skylion007
2024-02-10 21:56:00 +00:00
cyy
568740f080 [DeviceIndex][2/N] Use DeviceIndex instead of int in allocators (#119545)
Follows #119142
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119545
Approved by: https://github.com/ezyang
2024-02-10 20:27:59 +00:00
57d8f67619 [Dynamo][17/N] Rename SkipFilesVariable to SkipFunctionVariable and move to functions.py (#119619)
This is follow-up-3 from https://github.com/pytorch/pytorch/pull/118971#issue-2114082018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119619
Approved by: https://github.com/jansel
2024-02-10 19:33:37 +00:00
dcce5327bb [core][perf] Use set comprehensions in _RecreateLookupTables. (#119617)
It's more idiomatic and much more efficient.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119617
Approved by: https://github.com/Skylion007
2024-02-10 18:53:25 +00:00
c5116d9e44 Fix optim.lr_scheduler examples in doc to use optimizer vs self.opt (#119563)
Fixes #119561

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119563
Approved by: https://github.com/janeyx99
2024-02-10 15:10:43 +00:00
34db6f1b13 Revert "make flash_attn_bw impl correct w.r.t. meta when k and v have different strides (#119500)"
This reverts commit 095f4713077639f0e48fa33d051c0de2eb1f8525.

Reverted https://github.com/pytorch/pytorch/pull/119500 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119500#issuecomment-1937003082))
2024-02-10 13:06:30 +00:00
c0f1183eb4 [inductor] Fix compile error on scan with no mask (#119555)
Fixes #119591

Currently this results in invalid syntax:
```python
tmp4 = tl.where(, tmp1, tmp2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119555
Approved by: https://github.com/lezcano
2024-02-10 12:38:40 +00:00
e71c202520 Use CUDA if cuda's macro is set for AOTI runner's pybind (#119616)
Summary:
Use CUDA if cuda's macro is set for AOTI runner's pybind
This is a duplicate of #119438 for landing issues

Test Plan:
Existing tests (D52303882)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119616
Approved by: https://github.com/khabinov
2024-02-10 11:00:47 +00:00
3581428ea0 Do not mark tt.load's arguments as mutated (#119631)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119631
Approved by: https://github.com/aakhundov
ghstack dependencies: #119581, #119615
2024-02-10 08:46:50 +00:00
6c5bf5a5ce Implement kernel analysis for functions with multiple return values (#119615)
This diff adds a few improvements:

* Parsing for multiple return value: `tt.return %1, %arg0`
* Parsing for assignment for multiple values: `%1:2` means %1 has two values
* Parsing for usage of a value with multiple values: `%1#0` means 0th index of %1
* Fixes a bug in memo-cycle detection when multiple tests are executed back to back

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119615
Approved by: https://github.com/aakhundov
ghstack dependencies: #119581
2024-02-10 08:46:50 +00:00
e693089c7a [Dynamo] Refactor tensor methods handling (#119581)
Fixes part of #119128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119581
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-02-10 08:46:50 +00:00
699ae72f51 [DCP][state_dict] Fix the issue that get_state_dict/set_state_dict ignore the buffer (#119573)
get_state_dict and set_state_dict currently do not handle buffers appropriately.
This PR fixes the issue.

Fixes https://github.com/pytorch/pytorch/issues/119535.

Differential Revision: [D53616762](https://our.internmc.facebook.com/intern/diff/D53616762/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119573
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-02-10 06:36:58 +00:00
a82c50793e [executorch hash update] update the pinned executorch hash (#119510)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119510
Approved by: https://github.com/pytorchbot
2024-02-10 03:40:34 +00:00
8fd11cb307 [2/2] Intel GPU Runtime Upstreaming for Stream (#117619)
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.

# Design
Currently, it primarily offers stream-related APIs, including
 - `torch.xpu.StreamContext`
 - `torch.xpu.current_stream`
 - `torch.xpu.set_stream`
 - `torch.xpu.synchronize`
 - `torch._C._xpu_getCurrentRawStream`
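A minimal sketch of how the APIs listed above compose (assumes an XPU-enabled build; `torch.xpu.is_available()` and `torch.xpu.Stream` are assumed to be available alongside them):

```python
import torch

if torch.xpu.is_available():
    s = torch.xpu.Stream()
    with torch.xpu.StreamContext(s):
        y = torch.ones(4, device="xpu") * 2   # work enqueued on s
    torch.xpu.set_stream(s)
    print(torch.xpu.current_stream())
    torch.xpu.synchronize()
```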

# Additional Context
We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR, which deals with `Event`.

The differences with CUDA:
XPU has no default or external stream, and the following APIs are missing:
- `torch.cuda.ExternalStream`
- `torch.cuda.default_stream`
- `torch.cuda.is_current_stream_capturing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #117611
2024-02-10 03:39:42 +00:00
f2778e3874 [vision hash update] update the pinned vision hash (#119511)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119511
Approved by: https://github.com/pytorchbot
2024-02-10 03:22:13 +00:00
42ca82dfb1 [audio hash update] update the pinned audio hash (#119612)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119612
Approved by: https://github.com/pytorchbot
2024-02-10 03:22:06 +00:00
3278b4c557 be more conservative until regression is debugged (#119583)
See internal regression: https://www.internalfb.com/diff/D53375778?transaction_fbid=953511712782168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119583
Approved by: https://github.com/Chillee
2024-02-10 03:06:58 +00:00
70a364d402 non-strict improvements: constant args and kwargs (#119529)
This PR makes a couple of improvements to non-strict to bring it closer to strict. (This lets us remove some expected failures from test_export.)

1. Support constant arguments (easy).
2. Support keyword arguments. This forces us to add kwargs to `aot_export_module`. Indeed there is no way to make this work otherwise, because some arguments in a function signature can be keyword-only and thus cannot be simulated by positional arguments alone. Adding kwargs to `aot_export_module` turns out to be fairly routine, but there is a bit of an unsatisfactory fork between how it is called by strict and non-strict: because strict calls it on a graph module, kwargs must be converted to positional arguments. So kwargs in `aot_export_module` really only come into play in non-strict.

Differential Revision: D53600977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119529
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2024-02-10 02:55:40 +00:00
760056bbdc [aot_inductor] move CppWrapperCodeGen into a separate file (#119491)
This PR moves the CppWrapperCodeGen class into a separate file,
cpp_wrapper.py, to simplify wrapper.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119491
Approved by: https://github.com/desertfire, https://github.com/albanD
2024-02-10 02:15:56 +00:00
095f471307 make flash_attn_bw impl correct w.r.t. meta when k and v have different strides (#119500)
`dv = at::empty_like(k)` and `dv = at::empty_like(v)` can be materially different, because `empty_like` tries to preserve the strides of the input when possible. So if `k` is contiguous but `v` is transposed, then before this PR, `dv` would be computed to be contiguous.

Alternatively, we could change the meta implementation of `aten._scaled_dot_product_flash_attention` to this:
```
    grad_q = torch.empty_like(query.transpose(1, 2)).transpose(1, 2)
    grad_k = torch.empty_like(key.transpose(1, 2)).transpose(1, 2)
    grad_v = torch.empty_like(key.transpose(1, 2)).transpose(1, 2)
    return grad_q, grad_k, grad_v
```

But (I think?) the logic in the sdpa backward impl was a typo.

I noticed this because changing the meta formula as above was enough to fix the issue with the `aot_eager` backend in this [link](https://github.com/pytorch/pytorch/issues/116935#issuecomment-1914310523).

A minimal repro that I made looks like this:
```
import torch

# in this repro, "grad_out" and "value" are transposed tensors,
# but "key" and "value" are contiguous
a = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
b = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
c = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
d = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
e = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
f = torch.randn(2, 16, 513, device='cuda')
g = None
h = None
i = 513
j = 513
k = 0.0
l = False
m = torch.tensor(1, dtype=torch.int64)
n = torch.tensor(1, dtype=torch.int64)

out1_ref, out2_ref, out3_ref = torch.ops.aten._scaled_dot_product_flash_attention_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)

from torch._meta_registrations import meta__scaled_dot_product_flash_backward
out1_test, out2_test, out3_test = meta__scaled_dot_product_flash_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)

# prints True True
print(out1_ref.is_contiguous())
print(out1_test.is_contiguous())

# prints True True
print(out2_ref.is_contiguous())
print(out2_test.is_contiguous())

# prints True False
print(out3_ref.is_contiguous())
print(out3_test.is_contiguous())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119500
Approved by: https://github.com/drisspg, https://github.com/ezyang, https://github.com/Skylion007
2024-02-10 02:04:56 +00:00
e1c1b8c2b2 [dynamo] Improve support for backwards hooks (#119525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119525
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-02-10 01:14:03 +00:00
cyy
05602915f5 Link torch_cpu to cudart only if CUPTI is enabled (#118232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118232
Approved by: https://github.com/ezyang
2024-02-10 00:53:51 +00:00
44796682d0 [torch][ao] Fix module name filter for pytorch2 quantization for underscores (#119344)
Summary:
There was a bug in the module name filter for modules that already had an underscore
in their name: the underscore was replaced with "dot" notation.
This happened because underscores were assumed to always be module separators,
but that isn't the case for modules whose names contain underscores.

Test Plan:
Added a unit test. Before this change, that test failed (due to applying the wrong
qscheme). Now it passes.

Differential Revision: D53502771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119344
Approved by: https://github.com/jerryzh168
2024-02-10 00:29:08 +00:00
34f7dc9eba [ONNX] Support op consistency error reproduction (#119512)
Fixes #119472

Introduce the debugging tool in onnxscript: https://github.com/microsoft/onnxscript/blob/main/onnxscript/tests/function_libs/torch_lib/error_reproduction.py

This tool can help us quickly find the inputs leading to mismatched errors.

NOTE: this produces an `error_reports` folder containing a different markdown report for each mismatched test case.

For example - CREATE_REPRODUCTION_REPORT=1 python -m pytest onnxscript/tests/function_libs/torch_lib/ops_test.py -k test_output_match_fft_fft_cpu_bool

### Summary

The output of ONNX Runtime does not match that of PyTorch when executing test
`test_fx_op_consistency.TestOnnxModelOutputConsistency_opset_version_18_model_type_TorchModelType.TORCH_NN_MODULECPU.test_output_match_fft_fft_cpu_bool`, `sample 3` in ONNX Script `TorchLib`.

To recreate this report, use

```bash
CREATE_REPRODUCTION_REPORT=1 python -m pytest onnxscript/tests/function_libs/torch_lib/ops_test.py -k test_output_match_fft_fft_cpu_bool
```

### ONNX Model

```
<
   ir_version: 8,
   opset_import: ["pkg.onnxscript.torch_lib" : 1, "" : 18, "pkg.onnxscript.torch_lib.common" : 1],
   producer_name: "pytorch",
   producer_version: "2.2.0"
>
main_graph (bool[31] l_args_0_) => (float[31,2] _fft_r2c)
   <bool[31] l_args_0_, float[31] _to_copy, float[31,2] _fft_r2c>
{
   _to_copy = Cast <to: int = 1> (l_args_0_)
   _val_2 = Constant <value: tensor = int64[1] {-1}> ()
   _val_3 = Unsqueeze (_to_copy, _val_2)
   _val_4 = Constant <value: tensor = int64[1] {0}> ()
   _val_5 = Unsqueeze (_val_3, _val_4)
   _val_6 = DFT <axis: int = 1, inverse: int = 0, onesided: int = 0> (_val_5)
   _val_7 = Constant <value: tensor = int64[1] {0}> ()
   _val_8 = Squeeze (_val_6, _val_7)
   _fft_r2c = pkg.onnxscript.torch_lib._fftn_onnx_normalization <dims: ints = [0], forward: int = 1, normalization: int = 0> (_val_3, _val_8)
}
<
  domain: "pkg.onnxscript.torch_lib",
  opset_import: ["" : 18]
>
_fftn_onnx_normalization <normalization,forward,dims>(self, transformed) => (result_15)
{
   self_shape = Shape (self)
   dims = Constant <value_ints: ints = @dims> ()
   self_shape_subscripted = Gather <axis: int = 0> (self_shape, dims)
   total_sample_count = ReduceProd <keepdims: int = 0> (self_shape_subscripted)
   total_sample_count_0 = CastLike (total_sample_count, transformed)
   normalization = Constant <value_int: int = @normalization> ()
   int64_1 = Constant <value: tensor = int64 int64_1 {1}> ()
   cond = Equal (normalization, int64_1)
   result_15 = If (cond) <then_branch: graph = thenGraph_21 () => ( result_3) {
      forward = Constant <value_int: int = @forward> ()
      forward_as_bool = Cast <to: int = 9> (forward)
      result_3 = If (forward_as_bool) <then_branch: graph = thenGraph_23 () => ( result) {
         tmp = Sqrt (total_sample_count_0)
         result = Div (transformed, tmp)
      }, else_branch: graph = elseGraph_23 () => ( result_2) {
         tmp_1 = Sqrt (total_sample_count_0)
         result_2 = Mul (transformed, tmp_1)
      }>
   }, else_branch: graph = elseGraph_21 () => ( result_14) {
      normalization_4 = Constant <value_int: int = @normalization> ()
      int64_2 = Constant <value: tensor = int64 int64_2 {2}> ()
      cond_5 = Equal (normalization_4, int64_2)
      result_14 = If (cond_5) <then_branch: graph = thenGraph_27 () => ( result_9) {
         forward_6 = Constant <value_int: int = @forward> ()
         forward_6_as_bool = Cast <to: int = 9> (forward_6)
         result_9 = If (forward_6_as_bool) <then_branch: graph = thenGraph_29 () => ( result_7) {
            result_7 = Div (transformed, total_sample_count_0)
         }, else_branch: graph = elseGraph_29 () => ( result_8) {
            result_8 = Identity (transformed)
         }>
      }, else_branch: graph = elseGraph_27 () => ( result_13) {
         forward_10 = Constant <value_int: int = @forward> ()
         forward_10_as_bool = Cast <to: int = 9> (forward_10)
         result_13 = If (forward_10_as_bool) <then_branch: graph = thenGraph_35 () => ( result_11) {
            result_11 = Identity (transformed)
         }, else_branch: graph = elseGraph_35 () => ( result_12) {
            result_12 = Mul (transformed, total_sample_count_0)
         }>
      }>
   }>
}
<
  domain: "pkg.onnxscript.torch_lib.common",
  opset_import: ["" : 18]
>
Rank (input) => (return_val)
{
   tmp = Shape (input)
   return_val = Size (tmp)
}
<
  domain: "pkg.onnxscript.torch_lib.common",
  opset_import: ["" : 18]
>
IsScalar (input) => (return_val)
{
   tmp = Shape (input)
   tmp_0 = Size (tmp)
   tmp_1 = Constant <value_int: int = 0> ()
   return_val = Equal (tmp_0, tmp_1)
}
```

### Inputs

Shapes: `['Tensor<torch.Size([31]), dtype=torch.bool>']`

<details><summary>Details</summary>
<p>

```python
kwargs = {}
inputs = (tensor([False, False,  True,  True, False,  True, False,  True, False, False,
         True, False, False, False, False, False,  True,  True,  True,  True,
         True,  True,  True,  True, False, False, False, False,  True,  True,
         True]),)
```

</p>
</details>

### Expected output

Shape: `torch.Size([31, 2])`

<details><summary>Details</summary>
<p>

```python
expected = tensor([[16.0000,  0.0000],
        [-0.2369,  2.6590],
        [ 0.7336, -4.9670],
        [ 2.2093,  2.9865],
        [-0.7166,  1.0928],
        [-3.0614,  3.0015],
        [-1.8945, -0.9677],
        [-2.1538,  0.2513],
        [-2.2432,  1.3978],
        [-0.3429,  1.9494],
        [-0.6495, -1.5423],
        [-0.6005,  2.2398],
        [ 2.2639,  2.6430],
        [ 1.7609,  0.2033],
        [-1.3829, -2.3365],
        [-1.6854, -0.0311],
        [-1.6854,  0.0311],
        [-1.3829,  2.3365],
        [ 1.7609, -0.2033],
        [ 2.2639, -2.6430],
        [-0.6005, -2.2398],
        [-0.6495,  1.5423],
        [-0.3429, -1.9494],
        [-2.2432, -1.3978],
        [-2.1538, -0.2513],
        [-1.8945,  0.9677],
        [-3.0614, -3.0015],
        [-0.7166, -1.0928],
        [ 2.2093, -2.9865],
        [ 0.7336,  4.9670],
        [-0.2369, -2.6590]])
```

</p>
</details>

### Actual output

Shape: `torch.Size([31, 2])`

<details><summary>Details</summary>
<p>

```python
actual = tensor([[ 1.6000e+01, -9.1791e-06],
        [-2.3695e-01,  2.6590e+00],
        [ 7.3355e-01, -4.9670e+00],
        [ 2.2093e+00,  2.9865e+00],
        [-7.1663e-01,  1.0928e+00],
        [-3.0614e+00,  3.0015e+00],
        [-1.8946e+00, -9.6773e-01],
        [-2.1538e+00,  2.5126e-01],
        [-2.2432e+00,  1.3978e+00],
        [-3.4294e-01,  1.9494e+00],
        [-6.4946e-01, -1.5423e+00],
        [-6.0044e-01,  2.2398e+00],
        [ 2.2639e+00,  2.6430e+00],
        [ 1.7609e+00,  2.0326e-01],
        [-1.3829e+00, -2.3365e+00],
        [-1.6854e+00, -3.1130e-02],
        [-1.6854e+00,  3.1161e-02],
        [-1.3829e+00,  2.3365e+00],
        [ 1.7609e+00, -2.0327e-01],
        [ 2.2639e+00, -2.6430e+00],
        [-6.0047e-01, -2.2398e+00],
        [-6.4945e-01,  1.5423e+00],
        [-3.4294e-01, -1.9494e+00],
        [-2.2432e+00, -1.3978e+00],
        [-2.1538e+00, -2.5129e-01],
        [-1.8945e+00,  9.6773e-01],
        [-3.0615e+00, -3.0015e+00],
        [-7.1663e-01, -1.0928e+00],
        [ 2.2093e+00, -2.9865e+00],
        [ 7.3354e-01,  4.9670e+00],
        [-2.3695e-01, -2.6589e+00]])
```

</p>
</details>

### Difference

<details><summary>Details</summary>
<p>

```diff
--- actual
+++ expected
@@ -1,31 +1,31 @@
-tensor([[ 1.6000e+01, -9.1791e-06],
-        [-2.3695e-01,  2.6590e+00],
-        [ 7.3355e-01, -4.9670e+00],
-        [ 2.2093e+00,  2.9865e+00],
-        [-7.1663e-01,  1.0928e+00],
-        [-3.0614e+00,  3.0015e+00],
-        [-1.8946e+00, -9.6773e-01],
-        [-2.1538e+00,  2.5126e-01],
-        [-2.2432e+00,  1.3978e+00],
-        [-3.4294e-01,  1.9494e+00],
-        [-6.4946e-01, -1.5423e+00],
-        [-6.0044e-01,  2.2398e+00],
-        [ 2.2639e+00,  2.6430e+00],
-        [ 1.7609e+00,  2.0326e-01],
-        [-1.3829e+00, -2.3365e+00],
-        [-1.6854e+00, -3.1130e-02],
-        [-1.6854e+00,  3.1161e-02],
-        [-1.3829e+00,  2.3365e+00],
-        [ 1.7609e+00, -2.0327e-01],
-        [ 2.2639e+00, -2.6430e+00],
-        [-6.0047e-01, -2.2398e+00],
-        [-6.4945e-01,  1.5423e+00],
-        [-3.4294e-01, -1.9494e+00],
-        [-2.2432e+00, -1.3978e+00],
-        [-2.1538e+00, -2.5129e-01],
-        [-1.8945e+00,  9.6773e-01],
-        [-3.0615e+00, -3.0015e+00],
-        [-7.1663e-01, -1.0928e+00],
-        [ 2.2093e+00, -2.9865e+00],
-        [ 7.3354e-01,  4.9670e+00],
-        [-2.3695e-01, -2.6589e+00]])
+tensor([[16.0000,  0.0000],
+        [-0.2369,  2.6590],
+        [ 0.7336, -4.9670],
+        [ 2.2093,  2.9865],
+        [-0.7166,  1.0928],
+        [-3.0614,  3.0015],
+        [-1.8945, -0.9677],
+        [-2.1538,  0.2513],
+        [-2.2432,  1.3978],
+        [-0.3429,  1.9494],
+        [-0.6495, -1.5423],
+        [-0.6005,  2.2398],
+        [ 2.2639,  2.6430],
+        [ 1.7609,  0.2033],
+        [-1.3829, -2.3365],
+        [-1.6854, -0.0311],
+        [-1.6854,  0.0311],
+        [-1.3829,  2.3365],
+        [ 1.7609, -0.2033],
+        [ 2.2639, -2.6430],
+        [-0.6005, -2.2398],
+        [-0.6495,  1.5423],
+        [-0.3429, -1.9494],
+        [-2.2432, -1.3978],
+        [-2.1538, -0.2513],
+        [-1.8945,  0.9677],
+        [-3.0614, -3.0015],
+        [-0.7166, -1.0928],
+        [ 2.2093, -2.9865],
+        [ 0.7336,  4.9670],
+        [-0.2369, -2.6590]])
```

</p>
</details>

### Full error stack

```
Tensor-likes are not close!

Mismatched elements: 21 / 62 (33.9%)
Greatest absolute difference: 3.719329833984375e-05 at index (26, 1) (up to 1e-05 allowed)
Greatest relative difference: 0.0005033136694692075 at index (15, 1) (up to 1.3e-06 allowed)
  File "/home/titaiwang/pytorch/test/onnx/test_fx_op_consistency.py", line 1763, in _compare_onnx_and_torch_exported_program
    torch.testing.assert_close(
  File "/home/titaiwang/pytorch/torch/testing/_comparison.py", line 1523, in assert_close
    raise error_metas[0].to_error(msg)

```

### Environment

```
OS: Linux-5.15.135.1-2.cm2-x86_64-with-glibc2.35
Python version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
onnx==1.15.0
onnxruntime==1.17.0
onnxscript==0.1.0.dev20240207
numpy==1.26.0
torch==2.2.0a0+git684ce1b
```
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119512
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2024-02-09 23:24:01 +00:00
bb287d73ec [ONNX] Apply modularizarion to exported program exporting (#119498)
Apply modularization pass to exported program exporting. The only two things that need to be taken care of are (1) the extra call stack generated by `torch.export.export` and (2) the fact that lifted placeholders have call stacks (unlike the original placeholders).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119498
Approved by: https://github.com/thiagocrepaldi
2024-02-09 22:57:42 +00:00
3372aa51b4 Integrate swap_tensors into nn.Module.load_state_dict (#117913)
Added a `torch.Tensor` method that defines how to transform `other`, a value in the state dictionary, to be loaded into `self`, a param/buffer in an `nn.Module` before swapping via `torch.utils.swap_tensors`
* `param.module_load(sd[key])`

This method can be overridden using `__torch_function__`.
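
A minimal sketch of such an override (the cast below is purely illustrative, not the behavior this PR prescribes):

```python
import torch

class CastOnLoad(torch.Tensor):
    # Customize how a state-dict value is transformed before it is swapped
    # into a param/buffer of this subclass during load_state_dict.
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.Tensor.module_load:
            dest, src = args[0], args[1]
            return src.to(dest.dtype)  # e.g. cast the incoming value to match
        return super().__torch_function__(func, types, args, kwargs)
```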

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117913
Approved by: https://github.com/albanD
2024-02-09 22:32:29 +00:00
a7f82b7d62 [fix] tmp fix for import issue in dtensor (#119582)
A temporary fix for S394053, which is likely caused by a backward-incompatible `import` introduced in D53437243. It's not yet understood why this causes an issue, but let's forward-"fix" it first and then draft a follow-up diff for the proper fix.

Differential Revision: [D53621345](https://our.internmc.facebook.com/intern/diff/D53621345/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119582
Approved by: https://github.com/tianyu-l
2024-02-09 20:50:27 +00:00
bf8db86a19 [FSDP] Added deprecation msg for NO_SHARD (#119553)
This only includes the warning for world size >1 since we clamp to `NO_SHARD` for world size 1. We mainly do not want `NO_SHARD` usage to proliferate further.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119553
Approved by: https://github.com/Skylion007
2024-02-09 20:32:03 +00:00
f3e7d80993 [c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)
Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421
Approved by: https://github.com/shuqiangzhang
2024-02-09 20:23:20 +00:00
0597dab523 [aot_inductor] move CudaWrapperCodeGen into a separate file (#119448)
wrapper.py is getting more complex. Let's first split it
into smaller pieces. Will have another PR to move CppWrapperCodeGen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119448
Approved by: https://github.com/desertfire
2024-02-09 20:18:04 +00:00
9a1df7cfd7 ReduceLROnPlateau init _last_lr (#119366) (#119556)
Fixes #119366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119556
Approved by: https://github.com/janeyx99
2024-02-09 19:35:02 +00:00
bf8a5a11be Fix Inductor CSE Across Separate Reductions (#119410)
We were CSE'ing a load across two separate reduction loop bodies. This is because we were examining an indirect indexing expression that did not have an explicit rindex in its load. I've commented with more details and other potential fixes.

Tried using the minifier unsuccessfully and hand-minified the repro somewhat, but it could be reduced further.

Fix for https://github.com/pytorch/pytorch/issues/119327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119410
Approved by: https://github.com/shunting314, https://github.com/jansel
2024-02-09 19:34:57 +00:00
f208795182 Improve TORCHDYNAMO_EXTENDED_DEBUG for GuardOnDataDependentSymNode (#119412)
This PR substantially improves the error reporting for GuardOnDataDependentSymNode in the following ways:

* The GuardOnDataDependentSymNode error message is rewritten for clarity, and contains a link to a new doc on how to resolve these issues https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit#heading=h.44gwi83jepaj
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL`, which lets you specify a symbol name to get detailed debug information when it is logged (e.g., the full backtrace and user backtrace of the symbol creation). The exact symbols you may be interested in are now explicitly spelled out in the error message (a minimal usage sketch follows after this list).
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CPP` which enables reporting C++ backtraces whenever we would report a backtrace.
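
A minimal usage sketch (the symbol name `u2`, `my_model`, and `example_inputs` are placeholders):

```python
import os

# Set before torch._dynamo reads its config.
os.environ["TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL"] = "u2"
os.environ["TORCHDYNAMO_EXTENDED_DEBUG_CPP"] = "1"

import torch

compiled = torch.compile(my_model)  # `my_model` is a placeholder nn.Module
compiled(*example_inputs)           # failures now carry extended debug info for u2
```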

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119412
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #117356
2024-02-09 19:15:28 +00:00
01e248d6f1 Fix FallbackKernel behavior on mutable ops (#118649)
FallbackKernel wasn't handling mutable ops correctly: it would not report
them in get_mutation_names or get_alias_names. This would lead to silent
incorrectness -- Inductor would incorrectly reorder the mutable op with other
mutable ops.

This PR fixes that:
- we only support mutable operations that are "auto_functionalizable".
  That is, they mutate inputs and do not return aliases of any inputs (see the sketch after this list).
- Following the Triton kernel work, any mutated inputs must be specified
  in get_alias_names and processed via mark_node_as_mutating
- We also do some minor cleanup by killing dead code (FallbackKernel no
  longer processes OpOverloadPacket) and adding some handling around
  HOPs.
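
A hypothetical sketch of an "auto_functionalizable" mutable op (the `mylib` namespace and the op itself are made up for illustration):

```python
import torch

lib = torch.library.Library("mylib", "FRAGMENT")
# Mutates its first input (Tensor(a!)) and returns nothing, so no output
# aliases any input: the shape of mutable op supported here.
lib.define("inplace_scale(Tensor(a!) x, float s) -> ()")

def inplace_scale(x, s):
    x.mul_(s)

lib.impl("inplace_scale", inplace_scale, "CompositeExplicitAutograd")
```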

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118649
Approved by: https://github.com/eellison, https://github.com/oulgen
2024-02-09 19:01:54 +00:00
25a0fa6d13 Revert "[dynamo] Improve support for backwards hooks (#119525)"
This reverts commit b1f4b2a63c038f0090886d7d213825f39c283ea5.

Reverted https://github.com/pytorch/pytorch/pull/119525 on behalf of https://github.com/clee2000 due to broke test_autograd.py::TestAutograd::test_post_accumulate_grad_hook_gets_cleaned_up on dynamo https://github.com/pytorch/pytorch/actions/runs/7847212828/job/21416215820 b1f4b2a63c.  The failure exists on the PR as well, but got masked by the other test.  Putting this as no signal? ([comment](https://github.com/pytorch/pytorch/pull/119525#issuecomment-1936447169))
2024-02-09 18:58:55 +00:00
4b9568a360 Add Accelerator device and shell hooks (#119329)
This adds a concept of Accelerator that points to one of our devices. See DeviceAccelerator.h in this PR for details https://github.com/pytorch/pytorch/pull/119329/files#diff-83cc748bed5df1a453c272cc5ecc7e572d4eb694c5125384d8fbd17a0b5f50c8
It also adds scaffolding for shared C++ API to allow generic feature implementation. This PR in particular updates the autograd engine to use this generic API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119329
Approved by: https://github.com/ezyang
2024-02-09 18:54:28 +00:00
d5a6762263 Reify view_func() closures as ViewFuncs (#118404)
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.

```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
  virtual ~ViewFunc() {}
  /// Returns any SymInts in the saved state.
  virtual std::vector<c10::SymInt> get_symints() const { return {}; }
  /// Returns the number of SymInts in the saved state.
  virtual size_t num_symints() const { return 0; }
  /// Returns any tensors in the saved state.
  virtual std::vector<at::Tensor> get_tensors() const { return {}; }
  /// Returns the number of tensors in the saved state.
  virtual size_t num_tensors() const { return 0; }
  /// Reapplies the view on the given base using the saved state.
  virtual at::Tensor operator()(const at::Tensor&) const = 0;
  /// Returns a clone of this ViewFunc, optionally with the specified saved state.
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;

protected:
  /// Sets the values of any SymInts in the saved state. The input vector size must
  /// match the number of SymInts in the saved state (i.e. the size of the list
  /// returned by get_symints()).
  virtual void set_symints(std::vector<c10::SymInt>) {}
  /// Sets the values of any Tensors in the saved state. The input vector size must
  /// match the number of Tensors in the saved state (i.e. the size of the list
  /// returned by get_tensors()).
  virtual void set_tensors(std::vector<at::Tensor>) {}
};
```

New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`

The templates for these also contain impls for `ChainedViewFunc` and `ErroringViewFunc`, which are used in a few places within autograd.

Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
  SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
  {};
  virtual ~SliceTensorViewFunc() override {};
  virtual std::vector<c10::SymInt> get_symints() const override;
  virtual size_t num_symints() const override;
  virtual std::vector<at::Tensor> get_tensors() const override;
  virtual size_t num_tensors() const override;
  virtual at::Tensor operator()(const at::Tensor&) const override;
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;

protected:
  virtual void set_symints(std::vector<c10::SymInt>) override;
  virtual void set_tensors(std::vector<at::Tensor>) override;

private:
  int64_t dim;
  c10::optional<c10::SymInt> start;
  c10::optional<c10::SymInt> end;
  c10::SymInt step;
};
...

// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
  ::std::vector<c10::SymInt> symints;
  symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
  if(start.has_value()) symints.insert(symints.end(), *(start));
  if(end.has_value()) symints.insert(symints.end(), *(end));
  symints.push_back(step);
  return symints;
}

size_t SliceTensorViewFunc::num_symints() const {
  return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}

void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
  TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
  auto i = 0;
  if(start.has_value()) start = symints[i];
  i += (start.has_value() ? 1 : 0);
  if(end.has_value()) end = symints[i];
  i += (end.has_value() ? 1 : 0);
  step = symints[i];
}

std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
  ::std::vector<at::Tensor> tensors;
  return tensors;
}

size_t SliceTensorViewFunc::num_tensors() const {
  return static_cast<size_t>(0);
}

void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
  TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());

}

at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
  return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}

std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
    std::optional<std::vector<c10::SymInt>> symints,
    std::optional<std::vector<at::Tensor>> tensors) const {
  auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
  if (symints.has_value()) {
    output->set_symints(std::move(*(symints)));
  }
  if (tensors.has_value()) {
    output->set_tensors(std::move(*(tensors)));
  }
  return output;
}
```

The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.
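
A minimal replay/hot-swap sketch, assuming the two visitor callables are accepted positionally after the new base (the pass-through lambdas are illustrative; real callers would return symbolic or fake values):

```python
import torch

base = torch.randn(4, 4, requires_grad=True)
view = base.narrow(0, 1, 2)  # the view records a ViewFunc with its saved state

new_base = torch.randn(4, 4, requires_grad=True)
replayed = view._view_func(
    new_base,
    lambda s: s,  # symint_visitor_fn: could return a swapped SymInt here
    lambda t: t,  # tensor_visitor_fn: could return e.g. a fake tensor here
)
assert replayed.shape == view.shape
```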

For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
2024-02-09 18:51:36 +00:00
261f0138a2 [easy] Fix pass_manager type annotation (#119499)
Summary: passes are str not callable here.

Test Plan: lint

Reviewed By: frank-wei

Differential Revision: D53592166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119499
Approved by: https://github.com/22quinn, https://github.com/Skylion007
2024-02-09 18:39:43 +00:00
suo
5747ec24b4 [export] fix canonicalization for input mutations (#119533)
The comparison was off: user_input_mutation and buffer_mutation had the same numeric value, which led the comparison to move to the next element of the tuple and try to compare `None` to `spec.buffer_mutation.buffer_name`, which doesn't work. So make them different numbers.
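
An illustrative Python-level reproduction of the failure mode (not the actual export spec code): when two kinds share the same numeric value, tuple comparison falls through to the next element, and `None` is not orderable against a string.

```python
# Two specs whose kind enums collapsed to the same number; the second tuple
# element is None for one of them.
specs = [(1, "buf_a"), (1, None)]
sorted(specs)  # TypeError: '<' not supported between instances of 'NoneType' and 'str'
```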

Differential Revision: [D53601300](https://our.internmc.facebook.com/intern/diff/D53601300/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119533
Approved by: https://github.com/zhxchen17
2024-02-09 18:30:39 +00:00
cf42dd09ca [FSDP2] Replaced version-ctx with no_grad; removed no_grad (#119550)
This PR replaces the `_unsafe_preserve_version_counters` context with a simple `torch.no_grad()` context instead. This decreases CPU overhead from (1 context enter/exit + an `N`-tensor loop) to just (1 context enter/exit).

This PR also removes a `torch.no_grad()` from `init_unsharded_param`, since removing it helps torch.compile while not affecting eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119550
Approved by: https://github.com/Skylion007
2024-02-09 18:24:19 +00:00
f3a2094065 [Dynamo][Export] Mitigate legacy issue that aten op as export entrance function (#119528)
This is going to fix a legacy issue like:
```
torch._dynamo.export(torch.ops.aten.scaled_dot_product_attention, ...)(*inputs,)
```
This is not supported anymore; the top-level ```torch.export``` now only supports ```nn.Module```, but some tests still use the internal APIs and hit the ```trace_rules.check``` assertion error. This PR mitigates such cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119528
Approved by: https://github.com/ydwu4
2024-02-09 18:24:09 +00:00
5356b5d1f0 [Dynamo][16/N] Move skipfiles to trace_rules.py (#119432)
This is follow-up-1 for https://github.com/pytorch/pytorch/pull/118971#issue-2114082018. Only code motion and doc update in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119432
Approved by: https://github.com/jansel
2024-02-09 18:18:23 +00:00
7082e24ce8 [quant][pt2e][bc-breaking] Set fold_quantize to True in convert_pt2e (#119425)
Summary: This is a follow up to https://github.com/pytorch/pytorch/pull/118605 to set `fold_quantize` flag to True in `convert_pt2e`

Test Plan: CI

Differential Revision: D53550237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119425
Approved by: https://github.com/andrewor14
2024-02-09 18:13:43 +00:00
3f82e435eb Fix delete branches (#119399)
Due to PR_WINDOW, if the magic string exists in the body but the PR was not updated recently, the query wouldn't find it and would delete the branch.  Instead, query separately for branches with the no-delete-branch label, which I created recently.

Might as well query for branches with open PRs while we're at it, so PRs with the stale label won't get their branches deleted either.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119399
Approved by: https://github.com/huydhn
2024-02-09 17:28:00 +00:00
c6f39740c7 Revert "Fix delete branches (#119399)"
This reverts commit e1fc7e1ebcf4b87d5c34bf276806212c38ca00f0.

Reverted https://github.com/pytorch/pytorch/pull/119399 on behalf of https://github.com/clee2000 due to has a bug ([comment](https://github.com/pytorch/pytorch/pull/119399#issuecomment-1936291560))
2024-02-09 17:14:23 +00:00
53a6ab3fda [BE] Update Pillow to 10.2.0 (#119517)
As older versions have arbitrary code execution vulnerabilities reported by Dependabot, documented in https://nvd.nist.gov/vuln/detail/CVE-2023-50447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119517
Approved by: https://github.com/kit1980, https://github.com/seemethere
2024-02-09 17:05:28 +00:00
b1f4b2a63c [dynamo] Improve support for backwards hooks (#119525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119525
Approved by: https://github.com/yanboliang
2024-02-09 17:02:40 +00:00
5d6e323549 No TD (test removal) option in CI (#118808)
It currently doesn't do anything, but I will want these env vars later.  Maybe I should start using ghstack

Intention: --enable-td actually gets rid of tests

I am open to better names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118808
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-02-09 16:42:27 +00:00
e1fc7e1ebc Fix delete branches (#119399)
Due to PR_WINDOW, if the magic string exists in the body but the PR was not updated recently, the query wouldn't find it and would delete the branch.  Instead, query separately for branches with the no-delete-branch label, which I created recently.

Might as well query for branches with open PRs while we're at it, so PRs with the stale label won't get their branches deleted either.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119399
Approved by: https://github.com/huydhn
2024-02-09 16:40:32 +00:00
5d81ade484 [Inductor max autotune] Multithreaded Precompilation (#119386)
When using the Cutlass backend, the compilation
of CUDA source files can totally dominate the runtime required for the benchmarking done
as part of Autotuning.

This change adds a multithreaded precompilation phase, which serves to pre-populate the compilation cache (both in-memory and a
possible on-disk sccache).

It also ensures that no unnecessary compilation
and benchmarking steps are performed, as was previously the case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119386
Approved by: https://github.com/aakhundov
2024-02-09 16:11:30 +00:00
173256424a Update setuptools to 68.2.2 (#119456)
Follow-up to the same PR: Anaconda does not have setuptools v65, but it does have v68.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119456
Approved by: https://github.com/Skylion007
2024-02-09 15:38:25 +00:00
eff93fbd86 Revert "[Dynamo][16/N] Move skipfiles to trace_rules.py (#119432)"
This reverts commit 56364124af8fe148ba8b0c935571ebae6500f33b.

Reverted https://github.com/pytorch/pytorch/pull/119432 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/119432#issuecomment-1936122795))
2024-02-09 15:25:25 +00:00
90dabff260 Avoid COW materialize in various operations (#119506)
Operations affected include dot, cross, scatter/gather, shape, sort,
triangular, unary, scalar, pad, complex, to_list, fft

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119506
Approved by: https://github.com/ezyang
ghstack dependencies: #119501, #119502, #119503, #119504
2024-02-09 14:47:19 +00:00
8a09f1320c Avoid COW materialize in index, reduce, compare, unique, and copy ops (#119504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119504
Approved by: https://github.com/ezyang
ghstack dependencies: #119501, #119502, #119503
2024-02-09 14:47:19 +00:00
0e6b314fc2 Avoid performing replacements when it would unrefine ranges (#117356)
Fixes https://github.com/pytorch/pytorch/issues/117268; check this issue for background.

This PR does the following:

* Do not perform a replacement if the expression we're replacing the symbol with has a less refined value range than the original. There's a little bit of trickiness around the handling for values close to INT64_MAX; when checking if a range refines another, I *only* consider the range representable in 64-bit integers. This is enough to prevent us from doing a substitution like `i0 = 10 - i1`, but it appears to still let us do the other substitutions we like, such as `i0 = i1` or `i0 = 12 * i1`
* The test above is order dependent: if we assert an equality BEFORE we have refined a range, we might be willing to do the replacement because there isn't a meaningful range. This means that it's important to mark things as sizes, before you start doing other error checking. `split_with_sizes` is adjusted accordingly. It would be good to raise an error if you get the ordering wrong, but I leave this to future work.
* It turns out this is not enough to fix AOTAutograd, because we lose the size-ness of unbacked SymInts when AOTAutograd retraces the Dynamo graph. So update deferred runtime assert insertion to also insert size-ness and value ranges annotations. Note that, in principle, it shouldn't be necessary to explicitly do the latter; these should just show up as deferred runtime asserts. That's some extra refactoring for a later day.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117356
Approved by: https://github.com/lezcano
2024-02-09 14:43:58 +00:00
064610d8ac Don't guard if there are unbacked SymInts (#119312)
Fixes https://github.com/pytorch/pytorch/issues/119309

Not sure how to write the test.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119312
Approved by: https://github.com/lezcano
2024-02-09 11:02:47 +00:00
a13bb9f6a8 Add symbol_guard_limit_before_specialize (#119347)
Add a flag setting that controls a threshold of guards involving a symbol, after which we force a symbol to be specialized. The roll out plan is to enable this on OSS but not fbcode, and then roll out to fbcode after we get some telemetry from the previous PR.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119347
Approved by: https://github.com/lezcano
2024-02-09 08:44:37 +00:00
a050d146b7 [Inductor] Add Int8 data type into Inductor CPP backend vectorized code generation (#119179)
**Summary**
Part 1 of fixing https://github.com/pytorch/pytorch/issues/119141, which needs vectorized code generation for per-channel quant and the int8 data type.
In the current implementation for quantization, the vectorized code generation only supports the `uint8` data type. In this PR, we introduce support for the `int8` data type within the vectorized code generation.

**TestPlan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_decomposed_dequant_relu_quant_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_dequant_quant_lowering_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_dequant_maxpool2d_lowering_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_tile2d_load_decomposed_dequant_add_relu_quant_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_per_tensor_fake_quant_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_non_contiguous_load_buf_quant_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_tile2d_store_channel_shuffle_cl_quant_output_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_dequant_relu_quant_dequant_relu_quant_lowering_int8
```

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119179
Approved by: https://github.com/peterbell10, https://github.com/jgong5, https://github.com/jansel
2024-02-09 07:33:12 +00:00
5918622d72 Avoid COW materialize in pooling, batch linalg, upsample, softmax ops (#119503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119503
Approved by: https://github.com/ezyang
ghstack dependencies: #119501, #119502
2024-02-09 06:52:16 +00:00
53deddd66d Avoid COW materialization for TensorInfo with const type (#119502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119502
Approved by: https://github.com/ezyang
ghstack dependencies: #119501
2024-02-09 06:51:43 +00:00
fba5b7f7c8 Avoid COW materialization for TensorAccessors with const type (#119501)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119501
Approved by: https://github.com/ezyang
2024-02-09 06:46:00 +00:00
fa071a2e1b Clarifying windows cosine behaviour in the documentation (#119444)
After following the discussion, I've created a PR to update the documentation clarifying the function's behaviour (@tqbl solution 1).

Fixes #110541

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119444
Approved by: https://github.com/malfet
2024-02-09 05:57:44 +00:00
0f2fbbff10 Enable fake tensor caching in fbcode by default (#118555)
Summary: Enabled by default in OSS; this switches the default to "on" in fbcode too.

Test Plan: Ran torchbench benchmarks in fbcode

Differential Revision: [D53189048](https://our.internmc.facebook.com/intern/diff/D53189048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118555
Approved by: https://github.com/eellison
2024-02-09 05:42:16 +00:00
2cdf9b7674 [BE] Update requests to 2.31.0 (#119516)
Fixes a potential leak detected by Dependabot and reported in  https://nvd.nist.gov/vuln/detail/CVE-2023-32681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119516
Approved by: https://github.com/kit1980, https://github.com/seemethere
2024-02-09 05:10:16 +00:00
458e83b5b3 Revert "Add FakeTensor support to torch._utils._rebuild_tensor (#108186)"
This reverts commit 113506d2d4a0120e912c8f36e70a621f55378f81.

Reverted https://github.com/pytorch/pytorch/pull/108186 on behalf of https://github.com/atalman due to Reverted Internally ([comment](https://github.com/pytorch/pytorch/pull/108186#issuecomment-1935310344))
2024-02-09 04:19:20 +00:00
930b60f5aa Add Debug Utility To Generate Inputs for AOT Graphs (#119409)
```
    Takes in a function which has been printed with print_readable() and constructs kwargs to run it.
    Currently only handles Tensor inputs and a graph module which might have tensor constants.
    Example:
        Consider a function `forward` defined as follows:
        >>> def forward(self, primals_1: "f32[1001, 6]"):
        ...     _tensor_constant0: "i64[4190]" = self._tensor_constant0
        ...     # Further implementation
        >>> kwargs = aot_graph_input_parser(forward)
        >>> forward(**kwargs)
    """
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119409
Approved by: https://github.com/shunting314
2024-02-09 03:55:19 +00:00
2d474e17cb Don't log canonicalized expressions (#119471)
Fixes https://github.com/pytorch/pytorch/issues/119467
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119471
Approved by: https://github.com/ezyang
2024-02-09 02:46:11 +00:00
8994f2367d Revert "Fix jagged NT softmax semantics (#119459)"
This reverts commit 6adadbaf7943f760ea2375619b1783020b69d4e6.

Reverted https://github.com/pytorch/pytorch/pull/119459 on behalf of https://github.com/malfet due to broke dynamo, see https://github.com/pytorch/pytorch/actions/runs/7835402753/job/21386634602 ([comment](https://github.com/pytorch/pytorch/pull/119459#issuecomment-1935246413))
2024-02-09 02:31:49 +00:00
88429a8084 [inductor] Add split scan kernel (#117992)
This PR adds a new type of Triton kernel in which data is persistent but the
reduction dimension is split over multiple blocks (up to the entire kernel).
Though this is called a reduction dimension, in actuality we only support scans.
Because of this limitation, I have to be able to block fusions of split-scan
operations with reductions, so I chose to add a new `ir.SplitScan` node, which
is identical but allows for differentiation in the scheduler.

The split scan kernel is also the first to require an additional workspace buffer,
which is used to communicate between CUDA blocks. This is slightly tricky because
the exact scratch space requirement isn't known until the grid size is calculated.
Here I work around the issue by setting a minimum rblock size and always allocating
for the maximum possible grid size for a given input tensor.
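
Illustrative only: the kind of workload a split scan targets, namely a long cumulative scan whose reduction dimension is too large for a single persistent block (sizes are arbitrary).

```python
import torch

@torch.compile
def running_total(x):
    return torch.cumsum(x, dim=0)  # a scan, eligible for the split-scan kernel

if torch.cuda.is_available():
    out = running_total(torch.randn(1 << 22, device="cuda"))
```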

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117992
Approved by: https://github.com/jansel
ghstack dependencies: #117991
2024-02-09 01:56:00 +00:00
01edb8a559 [inductor] Refactor triton range_tree handling (#117991)
Currently the dimension handling in triton kernels has various special cases e.g.
- handling "r" for non-reduction vs persistent reduction vs non-persistent reduction.
- handling "x" when `no_x_dim` is set

This adds three new properties to the range tree objects which capture the
same information in a more generic way:
- `is_loop`: true for the "r" dimension of a non-persistent reduction
- `tensor_dim`: Optional index of the triton tensor dimension
- `grid_dim`: Optional index of the triton grid dimension

The motivation here is I want to add a new split scan kernel type which is:
- not a persistent reduction, yet has `is_loop=False` for the "r" dimension
- Has a `grid_dim` for the "r" dimension

These flags now only need to be set once in `initialize_range_trees`, instead of having
to infer them throughout the code based on the tree prefix and various other kernel flags.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117991
Approved by: https://github.com/lezcano
2024-02-09 01:56:00 +00:00
6efda849b5 Update chunk_dtensor to support HYBRID_SHARD (#119481)
Fixes https://github.com/pytorch/pytorch/issues/118639.

Adds support for replicating across HSDP dimensions instead of sharding, for the shard placement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119481
Approved by: https://github.com/Skylion007, https://github.com/wz337
2024-02-09 01:30:53 +00:00
454abb6b99 Disable tests that use bfloat 16 for SM < 80 (#118449)
```
`torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Internal Triton PTX codegen error:
ptxas /tmp/compile-ptx-src-83b319, line 51; error   : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 51; error   : Feature 'cvt with .f32.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 59; error   : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 59; error   : Feature 'cvt with .f32.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 65; error   : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 65; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
ptxas fatal   : Ptx assembly aborted due to errors
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

To execute this test, run the following from the base repo dir:
     python test/inductor/test_torchinductor.py -k test_bfloat16_to_int16_cuda`
```

Fixed a test failure that uses bfloat16 on pre-SM80 hardware (V100 is where the failure is seen for this test).

See also #113384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118449
Approved by: https://github.com/eqy, https://github.com/peterbell10
2024-02-09 01:27:22 +00:00
915f9db03c [Dynamo] Support kwargs for lazy module (#119445)
Summary:
Seems like `kwargs` is already supported in `_infer_argument`, so we don't need the extra assertion `len(kwargs) == 0`.

This optimization ensures compatibility with torch.compile() for LazyModules with kwargs inputs, preventing graph breaks.
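
A minimal sketch of a lazy module whose forward takes a keyword argument (the module itself is made up for illustration):

```python
import torch
import torch.nn as nn
from torch.nn.modules.lazy import LazyModuleMixin

class LazyScale(LazyModuleMixin, nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.UninitializedParameter()

    def initialize_parameters(self, x, *, bias=0.0):
        if self.has_uninitialized_params():
            with torch.no_grad():
                self.weight.materialize((x.shape[-1],))
                self.weight.fill_(1.0)

    def forward(self, x, *, bias=0.0):
        return x * self.weight + bias

m = torch.compile(LazyScale())
out = m(torch.randn(2, 4), bias=0.5)  # kwargs no longer trip the assertion
```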

Test Plan: Unit test and CI

Differential Revision: D53558778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119445
Approved by: https://github.com/yanboliang
2024-02-09 00:46:41 +00:00
45c4a0ce9d Update setup tools to 65.5.1 (#119456)
Should fix some Dependabot alerts by:
- Updating setuptools to 65.5.1
- Updating jinja2 to 3.3.1

TODO:
 - Update jinja2 and sphinx for the docs builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119456
Approved by: https://github.com/Skylion007
2024-02-08 23:34:41 +00:00
a8d1645f15 Revert "Add lowering for logcumsumexp (#118753)"
This reverts commit 5a77ee65879b58e99911fd53d92ddb55a1c234eb.

Reverted https://github.com/pytorch/pytorch/pull/118753 on behalf of https://github.com/jeffdaily due to broke ROCm CI, but not seen until trunk job ([comment](https://github.com/pytorch/pytorch/pull/118753#issuecomment-1935074235))
2024-02-08 23:10:33 +00:00
cyy
560c92c324 [DeviceIndex] Use DeviceIndex instead of int in CUDA wrappers (#119142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119142
Approved by: https://github.com/ezyang
2024-02-08 23:00:56 +00:00
e98dbae0a0 [ROCm] enable hipsolver backend for linalg.eigh (#115177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115177
Approved by: https://github.com/lezcano
2024-02-08 22:03:27 +00:00
suo
0f12c0af44 [export] allow user input mutation in aot_export (#119356)
This PR enables input mutation in aot_export by removing the guard and ensuring that the GraphSignature is properly wired up.

This allows us to undo the gross hack in torch.export where we lift user inputs to buffers in order to get around missing upstream aot_export support. It also makes input mutation work properly for non-strict mode.

Mutations on inputs that require_grad are still banned (I added a test for a non-parameter input as well, just to make sure).
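
A minimal sketch of the newly allowed pattern (module and shapes are illustrative):

```python
import torch

class AddInPlace(torch.nn.Module):
    def forward(self, x):
        x.add_(1)  # mutates a user input (which must not require grad)
        return x + 2

ep = torch.export.export(AddInPlace(), (torch.zeros(3),))
print(ep.graph_signature)  # records the user input mutation
```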

Differential Revision: [D53507440](https://our.internmc.facebook.com/intern/diff/D53507440/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119356
Approved by: https://github.com/bdhirsh, https://github.com/zhxchen17, https://github.com/titaiwangms
2024-02-08 22:02:24 +00:00
9f8ade04cc [aot_inductor] replace TORCH_CHECK with AOTI_CHECK in the generate cpp code (#119220)
In some cases where we have TORCH_CHECK in loops, it may cause the host
compiler to spend hours optimizing the run_impl function. This PR
mitigates the issue by replacing TORCH_CHECK with a custom AOTI_CHECK,
where we force the underlying assert function to be noinline.

If forcing noinline causes any serious perf regression, we could
either add an option to turn noinline on/off, or add another option
to just turn AOTI_CHECK into a no-op, similar
to the ```assert``` macro from cassert.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119220
Approved by: https://github.com/hl475, https://github.com/desertfire
2024-02-08 21:57:27 +00:00
71e772f827 Update logging.cpp for explicit chrono import (#119469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119469
Approved by: https://github.com/davidberard98
2024-02-08 21:57:23 +00:00
45e7af5818 Windows Dynamo Error Removal CI Check (#115969)
Rebase of #111313 onto `main`, for CI validation

Co-authored-by: Stella Laurenzo <stellaraccident@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115969
Approved by: https://github.com/ezyang
2024-02-08 21:23:45 +00:00
0827510fd3 [export] Remove torch._export.export (#119095)
XLA changes: https://github.com/pytorch/xla/pull/6486

Test Plan: CI

Differential Revision: D53316196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119095
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17, https://github.com/tugsbayasgalan, https://github.com/avikchaudhuri, https://github.com/jerryzh168
2024-02-08 21:22:04 +00:00
a7754b2b60 [dtensor] switch softmax backward ops to OpStrategy (#119255)
As titled. This is a followup to PR #117723 on softmax forward ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119255
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2024-02-08 21:18:39 +00:00
d9a1b25807 Fixed an issue where nn.Linear would cause an internal int underflow … (#119221)
…when trying to reshape a scalar input.

Fixes #119161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119221
Approved by: https://github.com/albanD
2024-02-08 21:06:34 +00:00
7fd6b1c558 s/print/warn in arch choice in cpp extension (#119463)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119463
Approved by: https://github.com/malfet
2024-02-08 20:38:51 +00:00
db1a4dcb5a [BE] Add dtypesIfMPS to ModuleInfo enabling float16 tests for MPS and remove all skipIfMPS for float64 (#119039)
Right now, `ModuleInfo.dtypes` defaults to `torch.testing._internal.common_dtype.floating_types()`, and almost no ModuleInfos override this (so only `float32` and `float64` are tested).

This is the first step to clean up/improve dtype testing for `ModuleInfos` and fix #116626.

Follow-up PRs will update `dtypes=` (and perhaps `dtypesIf{Device}`, if it makes sense) for each `ModuleInfo`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119039
Approved by: https://github.com/janeyx99
2024-02-08 20:35:32 +00:00
4e93b00b69 [Inductor] Setting kernel launch and exit callbacks for inductor generated triton kernels (#119450)
`CompiledKernel.launch_enter_hook` and `CompiledKernel.launch_exit_hook` are hooks that allow external tools to monitor the execution of Triton kernels and read each kernel's metadata. Initially, these hooks have a value of `None`.

Triton's kernel launcher passes hooks and kernel metadata by default, while Inductor's launcher doesn't. This PR unifies the parameters passed to both launchers so that tools can get information from both handwritten Triton kernels and Inductor-generated Triton kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119450
Approved by: https://github.com/jansel
2024-02-08 20:19:18 +00:00
6adadbaf79 Fix jagged NT softmax semantics (#119459)
Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong)
After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values`
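
A small sketch of the intended semantics (shapes are arbitrary): softmax over the last dim of a jagged nested tensor should normalize each row of each component, i.e. operate on `_values` with the dim shifted down by one.

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 5), torch.randn(4, 5)], layout=torch.jagged
)
out = torch.softmax(nt, dim=-1)  # rows of each component now sum to 1
```
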
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459
Approved by: https://github.com/soulitzer
2024-02-08 20:13:12 +00:00
278a0e1600 [NestedTensor] Support binary pointwise ops with >2 inputs (if inputs are non-tensors) (#119419)
It should usually be safe to run pointwise binary ops with >2 inputs, e.g. threshold_backward(tensor, tensor, scalar): we just operate on the values of the nested tensors and pass in the other args as-is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119419
Approved by: https://github.com/soulitzer
2024-02-08 20:06:40 +00:00
cd9a1934fb [ONNX] Bump to onnx1.15.0 and ort1.17.0 in CI (#119106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119106
Approved by: https://github.com/thiagocrepaldi, https://github.com/titaiwangms
2024-02-08 19:26:13 +00:00
91f038161a [FSDP2] Used split_with_sizes_copy for all-gather copy-out (#119451)
This switches to using @yifuwang's `split_with_sizes_copy.out` fast path!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119451
Approved by: https://github.com/yifuwang
ghstack dependencies: #118017, #118118
2024-02-08 19:04:30 +00:00
suo
def572929b [export/nonstrict] always create FakeTensorMode (#119446)
Previously in non-strict mode we would source a FakeTensorMode from existing tensors if available.

It turns out this is problematic, as it means we can't directly control the behavior of this FakeTensorMode. For example, if the user-provided FakeTensorMode does not set `allow_non_fake_inputs=True`, then we get into trouble with constant tensors, etc.

At the moment, we still have to explicitly re-fakify the module state. @ezyang has recommended against this, but it's necessary because `create_aot_dispatcher_function` calls `detect_fake_mode` on all the inputs, which will error if not all the FakeTensors are on the same mode. We should straighten this out, but we're leaving that for the future.

Differential Revision: [D53559043](https://our.internmc.facebook.com/intern/diff/D53559043/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119446
Approved by: https://github.com/ezyang, https://github.com/zhxchen17
2024-02-08 18:54:18 +00:00
7ec6ac89e8 Add lowering to special.modified_bessel_i0 (#118993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118993
Approved by: https://github.com/peterbell10
2024-02-08 18:42:40 +00:00
9242523ad5 [ET-Vulkan] aten.pow.Tensor_Tensor (#119423)
Summary:
This wires the eager-mode operation to the Vulkan shader. We only cover the case where both inputs are Tensor type, which is on par with the existing operators: add, sub, mul, div, floor_div.

It doesn't seem like we can cover [any other of the 8 cases](https://www.internalfb.com/code/fbsource/[e45c04564445b5e67ebb61e6ba53995729686526]/xplat/caffe2/torch/distributed/_tensor/ops/pointwise_ops.py?lines=310-317) right now. We categorize them below and explain what's missing for each.

## Category 1
The other two of the three "standard" cases require one of the values to be a scalar,
```
z = torch.pow(x, y)
```
```
aten.pow.Scalar,
aten.pow.Tensor_Scalar,
aten.pow.Tensor_Tensor,
```
which is not currently supported.
```
F 00:00:01.746228 executorch:aten_bridge.cpp:21] In function check_tensor_meta(), assert failed (b.sizes().data() != nullptr): ETensor must have valid sizes array
```

## Category 2
IIUC, these operators require an out argument in the declaration. However, when they are traced they collapse into Category 1, e.g., we obtain `aten.pow.Tensor_Tensor`, not `aten.pow.Tensor_Tensor_out`.

This appears in line with current PT-Vulkan, which only [implements the other two categories](https://www.internalfb.com/code/fbsource/[f148c22604b8e409696fd64f814cda89d091fe7a]/xplat/caffe2/aten/src/ATen/native/vulkan/ops/BinaryOp.cpp?lines=533-558).
```
torch.pow(x, y, out=z)
```
```
aten.pow.Scalar_out,
aten.pow.Tensor_Scalar_out,
aten.pow.Tensor_Tensor_out,
```

## Category 3
IIUC, in-place operators are written like this:
```
x.pow_(y)
```
```
aten.pow_.Scalar,
aten.pow_.Tensor,
```
They are not currently supported.
```
  File "/data/users/jorgep31415/fbsource/buck-out/v2/gen/fbcode/b007eb344207ad7d/executorch/backends/vulkan/test/__test_vulkan_delegate__/test_vulkan_delegate#link-tree/torch/_export/verifier.py", line 188, in _check_valid_op
    raise SpecViolationError(
torch._export.verifier.SpecViolationError: operator 'aten.copy_.default' is not functional
```

Test Plan:
```
[jorgep31415@devvm15882.vll0 /data/users/jorgep31415/fbsource (fd1ed5f81)]$ buck2 test fbcode//executorch/backends/vulkan/test:test_vulkan_delegate -- test_vulkan_backend_pow
File changed: fbcode//executorch/backends/vulkan/vulkan_preprocess.py
Buck UI: https://www.internalfb.com/buck2/7f9ec9e5-cbac-4618-b8ad-d94d10bb50ff
Test UI: https://www.internalfb.com/intern/testinfra/testrun/562950306906309
Network: Up: 3.2KiB  Down: 0B  (reSessionID-ea5af789-c131-4170-ba20-5c5c9718276b)
Jobs completed: 7. Time elapsed: 48.5s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D53547865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119423
Approved by: https://github.com/SS-JIA, https://github.com/malfet
2024-02-08 18:31:33 +00:00
b51b27922b Add to_empty() suggestion in the error message (#119353)
Fixes #119293; the comprehensive documentation is [here](0f478d9d61/docs/source/meta.rst (id11)).
Just added the suggestion to the error message so it is more informative to the user.
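
A short sketch of the suggested pattern (sizes are arbitrary): materialize a meta-device module with `to_empty()` rather than `to()`.

```python
import torch

with torch.device("meta"):
    m = torch.nn.Linear(8, 8)

m = m.to_empty(device="cpu")  # allocates real but uninitialized storage
m.reset_parameters()          # then initialize or load real weights
```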

@albanD

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119353
Approved by: https://github.com/mikaylagawarecki
2024-02-08 18:30:02 +00:00
5a77ee6587 Add lowering for logcumsumexp (#118753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118753
Approved by: https://github.com/peterbell10
2024-02-08 18:29:34 +00:00
7315ec7505 Revert "Fix estimate_nccl_collective_runtime (#118986)"
This reverts commit 0dab6fb35284ed47d1c6339e9d71e4ca3b50dc51.

Reverted https://github.com/pytorch/pytorch/pull/118986 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/118986#issuecomment-1934680463))
2024-02-08 18:11:53 +00:00
1d61011c11 [MPS] Add support for complex scalars (#119318)
- Switch to native complex support if running on MacOS Monterey or newer for binary ops.
- Python complex scalars are always represented in PyTorch as ComplexDouble, but MPS does not yet support double precision types, so downcast them to floats
- Also add `cf`(for complex float)  and `ch`(for complex half) to MPSScalar value union
- Fix complex scalar-to-view promotion by introducing a `legacy_complex_as_view` helper function that converts non-float types to complex and promotes CPU complex scalars to MPS before turning them into a view.
- Add `test_tensor_scalar_binops`

Fixes https://github.com/pytorch/pytorch/issues/119088

Test plan: CI (have quite a lot of tests, see new unexpected successes) +  `python -c "import torch;x,y=torch.rand(2, 2, dtype=torch.cfloat, device='mps'),torch.tensor(2+3j,dtype=torch.chalf);print(y+x)"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119318
Approved by: https://github.com/albanD
2024-02-08 18:10:59 +00:00
2b9cba86cf Fix deadlock in ExecutionTraceObserver (#119242) (#119398)
Summary:

With a compiled PyTorch module, in execution_trace_observer.cpp the function convertIValue calls TensorImpl->storage_offset(). That call triggers a recursive call into recordOperatorStart, which causes a deadlock on ob.g_mutex.

This DIFF is to fix this deadlock by replacing std::mutex with std::recursive_mutex.

Since PyTorch only has one thread for FWD and one thread for BWD, contention is very low, so performance should NOT be a concern.
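
A minimal Python analogy of the re-entrant locking problem (the actual fix is in C++, swapping `std::mutex` for `std::recursive_mutex`; the function names below just mirror the ones mentioned above):

```python
import threading

# A plain threading.Lock would deadlock when the same thread re-acquires it,
# mirroring recordOperatorStart re-entering while ob.g_mutex is already held.
# An RLock (like std::recursive_mutex) lets the owning thread acquire it again.
lock = threading.RLock()

def convert_ivalue():
    # In the real observer, storage_offset() here triggers another
    # recordOperatorStart on the same thread.
    with lock:
        pass

def record_operator_start():
    with lock:
        convert_ivalue()

record_operator_start()  # completes instead of deadlocking
```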

Test Plan:
Unit Test
    buck test  mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2

Differential Revision: D53533253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119398
Approved by: https://github.com/aaronenyeshi
2024-02-08 18:00:51 +00:00
896cf9d1ce [inductor][cpp] vectorization support for int32/int64 (#119001)
This pull request aims to complete most of the support for vectorizing int32 and int64 data types except for indirect indexing and masks. The basic data type support for uint32 and uint64 is also added but without vectorization. More vectorized conversion functions are added between integer and float. In order to support int64 vectors, a new VectorizedN class is introduced to handle vectors of arbitrary length. Below are the details:
1. Complete most of the int32 and int64 vectorization support including load, store, reduction, constant and conversion. The indirect indexing and masks will be addressed in follow-up PRs, after which, the legality checking logic in `CppVecKernelChecker` can be further simplified.
2. Util functions for conversion between integer and float vectors (in cpp_prefix.h and ATen vec). Ideally, we'd better move them from cpp_prefix.h to ATen vec to simplify cpp_prefix.h, will be addressed in follow-up PRs.
3. Introduced a new template class VectorizedN, designed to handle vectors of arbitrary length by encapsulating multiple Vectorized<T> instances. This class supports most of the operations of `Vectorized<T>`. It makes the support of int64 vectorization simpler. I will also apply it to bf16/fp16/int8 in the follow-up PRs for better efficiency. For example, bf16 currently only uses half of the vector lanes. With `VectorizedN`, we can use all of the lanes and map a bf16 vector to `VectorizedN<float,2>` on conversion.
4. Basic data type support is added for uint32 and uint64 (in graph.py). Vectorization support will be added later but not of high priority due to fewer usages.

Next steps:

- [ ] Refactor the vector mask handling to support data types other than float. Currently vector masks are implemented with float vectors.
- [ ] Fully utilize vector lanes for bfloat16/float16/int8.
- [ ] Support indirect indexing with vectorized index via scalarization.
- [ ] Clean up `CppVecKernelChecker`.
- [ ] Simplify `cpp_prefix.h` including refactoring vector conversion logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119001
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-02-08 17:38:49 +00:00
8182fce769 Revert "Add cpp stack traces to our own reruns (#119408)"
This reverts commit fbe6f6236e25e27e5968715f824dc8bfb0e37213.

Reverted https://github.com/pytorch/pytorch/pull/119408 on behalf of https://github.com/malfet due to Looks like it introduced intermittent crashes see https://github.com/pytorch/pytorch/actions/runs/7823402867/job/21344456540 for example, testing the theory ([comment](https://github.com/pytorch/pytorch/pull/119408#issuecomment-1934589057))
2024-02-08 17:20:39 +00:00
8da2f81527 [export] Convert internal tests to using .module() (#119105)
Test Plan: CI

Differential Revision: D53091904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119105
Approved by: https://github.com/ydwu4
2024-02-08 17:19:07 +00:00
c3e0836084 [export] Remove CallSpec (#117671)
Summary: This is not really being used anywhere

Test Plan: CI

Differential Revision: D52842563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117671
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2024-02-08 17:19:03 +00:00
9436710afd Implement shallow copy functions for FunctionalTensorWrapper. (#118783)
Fix: #115792

This PR implements 2 virtual functions of `TensorImpl` that are called when setting the
`tensor.data`:

- `shallow_copy_from`: which calls `copy_tensor_metadata`; and

- `copy_tensor_metadata`: which copies all `FunctionalTensorWrapper` metadata and ~calls
`dest->value_.set_data(src->value_)`~ assigns `dest->value_ = src->value_`, so as to copy also the inner tensor using the same
method

Before this PR, the inner tensor of a `FunctionalTensorWrapper` was being ignored.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118783
Approved by: https://github.com/bdhirsh
2024-02-08 17:15:46 +00:00
6d8f192fd0 [DCP] Call os.sync if os.fsync does not work for fsspec (#119287)
Some fsspec storage may not support fileno(). In such a case, we fall back to os.sync()

It may not even be necessary to call `os.sync()` in such a case, since the storage may be a remote storage that requires a special sync API call.
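
A rough sketch of the fallback described above (the helper name is hypothetical; the real change lives in the fsspec storage writer):

```python
import os

def sync_to_disk(file_obj):
    try:
        # Preferred: flush this specific file's data to disk.
        os.fsync(file_obj.fileno())
    except (AttributeError, OSError, ValueError):
        # Some fsspec file objects do not support fileno(); fall back to
        # syncing all filesystem buffers.
        os.sync()
```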

Differential Revision: [D53433425](https://our.internmc.facebook.com/intern/diff/D53433425/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119287
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #118888
2024-02-08 17:10:38 +00:00
b251bca205 [dynamo] inlining into __iter__ of user defined object (#119243)
Fixes #119198.

This PR makes dynamo inline `__iter__` of a user-defined object instead of creating a graph break. Also added a new test, which shows:
1. the loop is unrolled
2. the length of the loop is guarded when inlining `__iter__`
```python
import torch

class Mod:
    def __init__(self):
        self.a = [torch.randn(2, 2), torch.randn(2, 2)]

    def __iter__(self):
        return iter(self.a)

def f(mod):
    ret = []
    for x in mod:
        ret.append(x + 1)
    return ret
```
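
For completeness, a hedged usage sketch reusing the snippet above (the backend choice is arbitrary and just for illustration):

```python
compiled_f = torch.compile(f, backend="eager", fullgraph=True)
print(compiled_f(Mod()))  # no graph break; the loop over mod.a is unrolled
```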

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119243
Approved by: https://github.com/jansel
2024-02-08 17:07:30 +00:00
b181e52a8f [export] Support non-tensor tuple hoo outputs (#119402)
There's an internal custom op which has a None output, so when it becomes auto_functionalized, the HOO's output is (None, Tensor, Tensor, ...). This PR adds support for the None output, and any int/bool outputs from HOOs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119402
Approved by: https://github.com/suo, https://github.com/avikchaudhuri
2024-02-08 16:54:40 +00:00
7f05c72864 [nccl flight recorder] record time we discover start and complete (#119249)
Some APIs like ncclCommAbort can cause nccl kernels to finish even if
they were previously stuck. Because we can gather the trace buffer after
those calls, we can end up seeing some collectives marked completed even though
that completion happened several minutes after they started and clearly after
the timeout. This changes how we record state so that we keep track of the time
we discover a state change, so even if the collective eventually gets marked complete,
we can observe that it happened minutes after it was scheduled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119249
Approved by: https://github.com/wconstab
2024-02-08 16:48:33 +00:00
3a8bf25fdd [SparseCsr] Remove triton sdpa skip after triton pin update (#109601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109601
Approved by: https://github.com/desertfire, https://github.com/amjames
2024-02-08 16:40:25 +00:00
d947534782 [DCP] Enable filesystem/fsspec auto detection (#118888)
This API enables the ability to automatically detect whether to use filesystem or fsspec based on the checkpoint_id.

Differential Revision: [D53318043](https://our.internmc.facebook.com/intern/diff/D53318043/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118888
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-02-08 16:38:04 +00:00
4f2bf7fa87 Print the value of constants in __str__ (#119276)
Not sure why we haven't been doing this really...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119276
Approved by: https://github.com/jansel
2024-02-08 16:23:36 +00:00
579999a731 [PyTorch] Back scalar value to pinned memory for .item() (#119202)
Summary: This diff optimizes the .item() call by backing the scalar value storage with pinned memory, so we don't create an implicit synchronization with the libcuda library.

Test Plan:
# Prod VDD model on H100
Vanguard runs
9.8k qps -> 10.1k qps (~3% improvement)

# .item() Benchmark
1 thread 50k iterations

consistent ~2-3% improvements

With pinned memory
item() took 1.627608060836792 seconds
item() took 1.635591983795166 seconds
item() took 1.6398141384124756 seconds
item() took 1.6378591060638428 seconds
item() took 1.618534803390503 seconds
item() took 1.6467158794403076 seconds
item() took 1.6278800964355469 seconds
item() took 1.6205573081970215 seconds
item() took 1.64951753616333 seconds
item() took 1.6286702156066895 seconds

w/o pinned memory
item() took 1.6783554553985596 seconds
item() took 1.6670520305633545 seconds
item() took 1.6748230457305908 seconds
item() took 1.6708712577819824 seconds
item() took 1.6836023330688477 seconds
item() took 1.6518056392669678 seconds
item() took 1.6769678592681885 seconds
item() took 1.661888837814331 seconds
item() took 1.6627326011657715 seconds
item() took 1.6908581256866455 seconds
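
A rough sketch of the micro-benchmark pattern described above (1 thread, 50k iterations as stated; the tensor shape and device handling are assumptions):

```python
import time
import torch

x = torch.zeros(1, device="cuda")  # any CUDA scalar tensor
torch.cuda.synchronize()
start = time.time()
for _ in range(50_000):
    x.item()  # copies a single scalar from device to host
torch.cuda.synchronize()
print(f"item() took {time.time() - start} seconds")
```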

Differential Revision: D53431148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119202
Approved by: https://github.com/xw285cornell
2024-02-08 16:23:15 +00:00
08657b82f5 Reduce scope of dispatching in logcumsumexp_backward (#119397)
Everything inside the `AT_DISPATCH` block is being compiled 5 times,
so it makes sense to limit it to the only line that uses `scalar_t` which is
the `numeric_limits` query.

Also a small optimization, instead of computing `grad.log()` and `(-grad).log()`
we can compute `grad.abs().log()` which is 2 pointwise ops instead of 3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119397
Approved by: https://github.com/lezcano, https://github.com/albanD
2024-02-08 15:09:22 +00:00
56364124af [Dynamo][16/N] Move skipfiles to trace_rules.py (#119432)
This is follow-up-1 for https://github.com/pytorch/pytorch/pull/118971#issue-2114082018. Only code motion and doc update in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119432
Approved by: https://github.com/jansel
2024-02-08 09:41:52 +00:00
0a41ac3cf3 [1/2] Intel GPU Runtime Upstreaming for Stream (#117611)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second runtime component we would like to upstream is `Stream` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 2 PRs. This is one of the 2 PRs and covers the changes under `c10`.

# Design
Intel GPU stream is a wrapper of a sycl queue, which schedules kernels on a sycl device. In our design, we maintain a sycl queue pool containing 32 queues per priority per device, and when a queue is requested, one of these queues is returned in round-robin order. The corresponding C++ files related to `Device` will be placed in the `c10/xpu` folder. We provide the `c10::xpu::XPUStream` APIs, like
 - `XPUStream getStreamFromPool`
 - `XPUStream getCurrentXPUStream`
 - `void setCurrentXPUStream`
 - `void device_synchronize`

# Additional Context
In our plan, 2 PRs should be submitted to PyTorch for `Stream`:
1. for c10
2. for python frontend.

The differences with CUDA:
There is no default or external stream in XPU, and the below APIs are missing:
- `getDefaultCUDAStream`
- `getStreamFromExternal`

For CUDA, `cuda::device_synchronize` can sync all streams on the device, but for XPU, `xpu::sync_streams_on_device` only syncs the reserved streams on the device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117611
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
2024-02-08 09:07:23 +00:00
cyy
7d516bbd5f Update MacOS deployment target to OS version 11.1 (#119373)
To avoid the following error:
```
2024-02-07T12:49:51.8306390Z ld: warning: dylib (/Users/runner/work/_temp/anaconda/envs/wheel_py38/lib/libomp.dylib) was built for newer macOS version (11.1) than being linked (11.0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119373
Approved by: https://github.com/huydhn
2024-02-08 08:19:42 +00:00
5f6b35915a [executorch hash update] update the pinned executorch hash (#119336)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119336
Approved by: https://github.com/pytorchbot
2024-02-08 03:38:53 +00:00
f579c65ef6 Release GIL for torch::autograd::clear_autocast_cache (#119416)
Fixes #119262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119416
Approved by: https://github.com/albanD
2024-02-08 03:22:48 +00:00
9d6bf20022 [FSDP2] Added backward prefetching (#118118)
This PR adds explicit backward prefetching to overlap communication and computation in backward (namely, needed for `reshard_after_forward=True` or `reshard_after_forward: int`). We do this by recording the post-forward order and using its reverse to approximate the backward order.

This works for the typical 1 forward / 1 backward training. However, for more complex schedules, this can run into some gaps:
- We need to know the _true end of backward_.
    - At the true end of backward, we can clear our recorded post-forward order and pre-backward hook state, and we should wait on gradient reductions.
    - There is no easy way to know whether the current backward marks the true end of backward. Therefore, we introduce an API for the user to set this: `fsdp_module.set_is_last_backward(bool)`. For example, for pipeline parallelism's DFS cooldown backward, we can call `fsdp_module.set_is_last_backward(is_last_microbatch)` (see the sketch after the collapsed section below).
- When the user runs backward through only part of the model, our reverse-post-forward-order heuristic risks _mistargeted prefetches_ for unused modules, which would mean the module's parameters are all-gathered and not freed until the end of backward.
    - To error on the side of less memory usage (but no overlap), this PR introduces logic to check whether a module will need its unshard in the current backward (by recording the module's `forward` outputs' `grad_fn`s and querying the autograd engine).
    - Note that there may be _no_ overlap in backward for some parts due to no prefetching.
    - Note further that when running multiple backwards, if the user does not use `set_is_last_backward`, we may not be able to provide a meaningful error message, as the pre-backward hook could be erroneously cleared on the 1st backward.
    - In the future, we may expose more APIs from the autograd engine (similar to `_current_graph_task_execution_order`) to make the prefetching exact. (Currently, `_current_graph_task_execution_order` requires the `with torch.autograd.set_multithreading_enabled(False)`, which is too hard of a constraint as we cannot easily modify users' training loops. We can replace the multi-threading check with a device check. Moreover, in the partial backward case in this PR's unit test, I still hit an [internal assertion](b816760a2f/torch/csrc/autograd/engine.cpp (L476)), so some follow-up is required.)

<details>
<summary> Old Discussion </summary>

For discussion:
- The PR includes a counter `expected_backward_unshard_count` to mitigate mistargeted prefetches in backward. However, it can be seen as a necessary but not sufficient solution.
    - If a module's outputs do not require gradient, then we certainly do not need to unshard the module in backward.
    - However, if a module's outputs do require gradient, then we still may not need to unshard the module for _this_ backward (e.g. if the module did not contribute to `loss` for the current `loss.backward()`).
    - This counter will only address the first case but not the second. If we want to address the second, then we may need more info from the autograd engine.
- For now, I did not include any unit test to cover these behaviors, as I do not have a good example yet.
</details>
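
As referenced above, a hedged sketch of how `set_is_last_backward` might be used in a microbatched (e.g. pipeline-parallel) training step; the FSDP module, microbatch list, and optimizer are passed in as assumptions rather than constructed here:

```python
def run_microbatched_backward(fsdp_module, microbatches, optimizer):
    # Only the backward of the final microbatch should clear the recorded
    # post-forward order / pre-backward hook state and wait on grad reductions.
    for i, batch in enumerate(microbatches):
        fsdp_module.set_is_last_backward(i == len(microbatches) - 1)
        loss = fsdp_module(batch).sum()
        loss.backward()
    optimizer.step()
```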

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118118
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #118017
2024-02-08 03:17:45 +00:00
1d2382f141 [DDP] Use compiled_autograd to trace DDP backward allreduce (#110662)
**Summary**
The reducer of `DistributedDataParallel`  is implemented with C++ and it is not easy to trace the allreduce launched in the reducer. This PR modifies `DistributedDataParallel` to launch one allreduce per gradient when `compiled_autograd` is enabled. The changes allow us to use `compiled_autograd` to trace the allreduce and later be optimized (fused) in the Inductor.

**Key Logic**
1. If `ddp_python_hook` is True, we assume `compiled_autograd` is used. `DistributedDataParallel` registers `compiled_accum_grad_hook` for all parameters.
2. In the first forward() call, if `DistributedDataParallel` is not compiled, all  `compiled_accum_grad_hook` are deregistered. If `DistributedDataParallel` is compiled, all `compiled_accum_grad_hook` will be compiled by `compiled_autograd`.
3.  `compiled_accum_grad_hook` launches an allreduce to reduce the gradient of the parameter.

**Bucketing**
The compiled backward is slow because there is no bucketing for the allreduces. We rely on Inductor to bucket the allreduces.

The bucketing is done in a separate PR.

Differential Revision: [D49428482](https://our.internmc.facebook.com/intern/diff/D49428482/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110662
Approved by: https://github.com/wconstab
2024-02-08 03:03:15 +00:00
113506d2d4 Add FakeTensor support to torch._utils._rebuild_tensor (#108186)
Partially fixes https://github.com/pytorch/pytorch/issues/105077

Repro:

```python
import tempfile
import torch
from torch._subclasses import fake_tensor

class TheModelClass(torch.nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.fc1 = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.fc1(x)

with tempfile.NamedTemporaryFile() as state_dict_file:
    # Create state_dict to be loaded later
    model = TheModelClass()
    torch.save(model.state_dict(), state_dict_file.name)

    fake_mode = fake_tensor.FakeTensorMode()
    with fake_mode:
        # This is where the bug is triggered
        state_dict = torch.load(state_dict_file.name)
```

Error:

```bash
Traceback (most recent call last):
  File "issue_gh_torch_105077.py", line 22, in <module>
    state_dict = torch.load(state_dict_file.name)
  File "/opt/pytorch/torch/serialization.py", line 1014, in load
    return _load(opened_zipfile,
  File "/opt/pytorch/torch/serialization.py", line 1422, in _load
    result = unpickler.load()
  File "/opt/pytorch/torch/_utils.py", line 205, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/_utils.py", line 184, in _rebuild_tensor
    return t.set_(storage._untyped_storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1288, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1468, in dispatch
    self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1733, in invalidate_written_to_constants
    _, new_kwargs = normalize_function(
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 297, in normalize_function
    torch_op_schemas = get_signature_for_torch_op(target)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in get_signature_for_torch_op
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in <listcomp>
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 70, in _torchscript_schema_to_signature
    arg_type = _torchscript_type_to_python_type(arg.type)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 64, in _torchscript_type_to_python_type
    return eval(ts_type.annotation_str, _type_eval_globals)
  File "<string>", line 1, in <module>
NameError: name 'Storage' is not defined
```

This PR adds the ability to create fake tensors during `torch.load` by wrapping the `torch.Tensor.set_` call in a `torch.utils._mode_utils.no_dispatch()` context to skip the fake mode dispatcher for it and thus create a real tensor. It later calls `fake_mode.from_tensor(t)` to finally create the fake tensor.
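
A simplified, hedged sketch of that mechanism (not the actual patch in `torch/_utils.py`; the tensor contents here are arbitrary):

```python
import torch
from torch._subclasses import fake_tensor
from torch.utils._mode_utils import no_dispatch

fake_mode = fake_tensor.FakeTensorMode()
with fake_mode:
    with no_dispatch():
        # With dispatch disabled, set_() runs on a real tensor even though a
        # fake mode is active, so rebuilding from a real storage succeeds.
        t = torch.empty(3)
        t.set_(torch.ones(3).untyped_storage(), 0, (3,), (1,))
    ft = fake_mode.from_tensor(t)  # finally wrap it as a fake tensor
print(type(ft))
```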

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108186
Approved by: https://github.com/ezyang
2024-02-08 03:01:34 +00:00
9a992b0918 [4/4] Intel GPU Runtime Upstreaming for Device (#116869)
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this last PR  covers the changes under lazy initialization.

# Design
This PR primarily offers the support of multi-processing via lazy initialization. We lazily initialize our runtime avoiding initializing XPU until the first time it is accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init` which is a device-agnostic API that can support any backend. And change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability.

# Additional Context
We adopt a similar design to CUDA. So we share some code with CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116869
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
ghstack dependencies: #119248
2024-02-08 03:01:21 +00:00
3cb7ec312c [PT-Vulkan] aten::conv1d - opt: width-pack weight tensor (>2x speedup) (#118835)
## This diff
This optimization reduces calls to `texelFetch(uKernel, ...)` by 4.

We borrow MatMul's work to do the re-packing:

https://www.internalfb.com/code/fbsource/[7e8ef1b8adeda224a736f8cc4bf870e0a659df95]/xplat/caffe2/aten/src/ATen/native/vulkan/ops/Mm.cpp?lines=20%2C50

## Future optimizations

We are already batching reads from input/weight tensors, and writes to output tensor.

Here are other ideas, which I won't pursue for now. (2) is the most doable.
1. **Batch reads/writes along the dimension that is most commonly > 1.** For weights, the length dimension is definitely correct here, but input/outputs could potentially leverage the length dimensions too. However, `stride != 1` would complicate this optimization.
2. **Batch an optimal number of reads/writes.** Instead of default-ing to 4 elements (since that corresponds to 1 texel), consider more elements such as MatMul's 4x4 texel tile.
3. **Obscure shader compiler optimizations.** Since MatMul seemed to benefit from several seemingly equivalent ways to write code.

Differential Revision: [D53204674](https://our.internmc.facebook.com/intern/diff/D53204674/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118835
Approved by: https://github.com/SS-JIA, https://github.com/liuk22
2024-02-08 02:23:51 +00:00
2349e473f1 Forward fix for same_shape oblivious guard (#119383)
Fixes internal test

```
buck2 test '@fbcode//mode/opt' fbcode//accelerators/workloads/models/slimdsnn:slimdsnn_test -- --exact 'accelerators/workloads/models/slimdsnn:slimdsnn_test - test_generate (accelerators.workloads.models.slimdsnn.test_slimdsnn.SlimDSNN)'
```

And I added an OSS test that approximates the internal situation.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D53544208](https://our.internmc.facebook.com/intern/diff/D53544208)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119383
Approved by: https://github.com/atalman, https://github.com/albanD
2024-02-08 02:11:46 +00:00
64aaa8f508 Fix typo on Contribution Guide (#119428)
Fixes #119427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119428
Approved by: https://github.com/awgu, https://github.com/kit1980
2024-02-08 01:07:27 +00:00
fbe6f6236e Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we have both pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) and our own logic retry it afterwards as well?

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-08 00:54:16 +00:00
33761969a4 Remove parent device mesh check (#118620)
Removes raising error if a device_mesh has a parent.

The comment says that HSDP + TP is not supported, but I'm able to do 2D parallelism + HSDP fine. The only issues are:
- this check
- https://github.com/pytorch/pytorch/pull/118618
- a series of PRs related to checkpointing with 3D meshes that I will open
We currently monkeypatch for the above which I am slowly upstreaming.

I imagine torch will have a better, native integration eventually, but this check seems too aggressive in the meantime given DTensor now lets users do some things themselves (which is amazing 🎉)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118620
Approved by: https://github.com/Skylion007
2024-02-08 00:49:28 +00:00
029a16c41f [c10d] PGNCCL refactor part 1: adds assert size==1 (#119099)
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119099
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-02-07 22:29:29 +00:00
6fe5a3adaf release GIL for cudaEventDestroy (#119393)
cudaEventDestroy can become blocking under some circumstances, and then holding GIL will lead to deadlocks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119393
Approved by: https://github.com/Skylion007
2024-02-07 22:16:18 +00:00
ad75d9e2ca [easy] Fix test_triton_kernel_reinterpret_view_mem_leak by cloning fwd input (#119219)
```

$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_reinterpret_view_mem_leak

# Before
RuntimeError:
Found following user inputs located at [0] are mutated. This is currently banned in the aot_export workflow.
If you need this functionality, please file a github issue.

fw_metadata=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=True, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutates_storage_metadata=False, requires_grad=False, mutation_type=<MutationType.MUTATED_OUT_GRAPH: 3>),...)

# Now
Ran 6 tests in 13.851s
OK (skipped=4)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119219
Approved by: https://github.com/oulgen
2024-02-07 21:30:16 +00:00
81abc2b249 Revert "[quant][pt2e][bc-breaking] Remove fold_quantize flag (#118701)"
This reverts commit 482d952e880cf78c103a06f2d483556ab0a89138.

Reverted https://github.com/pytorch/pytorch/pull/118701 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/118701#issuecomment-1932866964))
2024-02-07 20:56:16 +00:00
a6e16fe202 Fix global in header warning (#119380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119380
Approved by: https://github.com/janeyx99
2024-02-07 20:35:21 +00:00
35aa353c48 Change watchdog log from "NCCL" to "Process group" (#118121)
This PR changes the watchdog log.
To avoid the misconception that NCCL itself creates the watchdog thread and reports the error log, it is better to change "NCCL" to "Process group" to better indicate the source of the log.

@wconstab

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118121
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2024-02-07 20:14:49 +00:00
892a7bf674 [BE]: Add filelock typing to mypy stubs (#119390)
Realized we used filelock in some places, but didn't have a mypy type stub for it. Noticed it in this PR: https://github.com/pytorch/pytorch/pull/119386
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119390
Approved by: https://github.com/albanD, https://github.com/malfet
2024-02-07 20:14:28 +00:00
d0db80126e [EZ][CI] Fetch full history for MPS jobs (#119401)
Otherwise emitting TD stats will fail with the following warning:
```
Emiting td_test_failure_stats
/Users/ec2-user/runner/_work/pytorch/pytorch/tools/testing/target_determination/heuristics/edited_by_pr.py:37: UserWarning: Can't query changed test files due to Command '['git', 'merge-base', 'origin/main', 'HEAD']' returned non-zero exit status 1.
  warn(f"Can't query changed test files due to {e}")
```

Test plan: Observe that MPS jobs finishes without those warnings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119401
Approved by: https://github.com/atalman, https://github.com/huydhn
2024-02-07 19:29:30 +00:00
51fb99250b Fix missing MAST log when there is Unicode non-decodable text in logs (#119298)
Summary:
## Issue
When there is Unicode non-decodable text in logs, `tail_logger` will stop working afterwards, i.e. f527390102

In the example, the process stopped producing Python logs after 17:20:21 until the job finished
```
[0]:I0201 17:20:21.338000 3429 gen_ai/genie_projects/llm/metaformers/reward_model_score.py:335] Progress: 118 batches out of 512 total batches. 23.05 % | (gpu mem: 25.8GB, free CPU mem: 1387.8GB)
I0201 17:39:14 Stopping twtask-main.service with Service Result: [success] Exit Code: [exited] Exit Status: [0]
```
At the end, `UnicodeDecodeError` was thrown at the end with no call stack.

## Fix
Use `errors="replace"` to avoid throwing an exception when a `UnicodeDecodeError` would otherwise happen.
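
A minimal illustration of the idea (the real change is in the tail logger's stream decoding; the byte string here is made up):

```python
data = b"Progress: 118 batches \xff\xfe out of 512"  # contains invalid UTF-8 bytes

# data.decode("utf-8") would raise UnicodeDecodeError and stop the tailer;
# errors="replace" substitutes U+FFFD and keeps the log stream alive.
print(data.decode("utf-8", errors="replace"))
```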

Test Plan: f528854819

Differential Revision: D53483644

Co-authored-by: Jack Zhang <jackzh@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119298
Approved by: https://github.com/XilunWu
2024-02-07 19:25:43 +00:00
02c24b0b5e Add Python binding resizable to class {Untyped,Typed}Storage (#119286)
This PR exposes `resizable` method of `StorageImpl` to Python frontend to make it accessible for users.

Fixes #119233
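
A quick hedged example of the newly exposed method (the exact result may vary by storage type and device):

```python
import torch

s = torch.empty(4).untyped_storage()
print(s.resizable())  # expected True for a regular CPU storage that owns its memory
```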

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119286
Approved by: https://github.com/ezyang, https://github.com/mikaylagawarecki
2024-02-07 19:15:55 +00:00
d054cd3e44 [FSDP2] Added reshard_after_forward (#118017)
This PR adds the `reshard_after_forward: Union[bool, int]` arg and a `reshard()` method. The `reshard_after_forward` argument trades off communication and memory.
- `reshard_after_forward=True`: reshard parameters after forward; unshard (all-gather) in backward
- `reshard_after_forward=False`: no reshard of parameters after forward; no unshard (all-gather) in backward
- `reshard_after_forward: int`: reshard parameters to a smaller world size; unshard (all-gather) over small world size in backward

In comparison with DeepSpeed and existing FSDP:
- `reshard_after_forward=True` == `FULL_SHARD` == ZeRO-3
- `reshard_after_forward=False` == `SHARD_GRAD_OP` == ZeRO-2
- `reshard_after_forward=8` == ZeRO++

ZeRO-1 is `reshard_after_forward=False` without gradient reduction (implemented in a later PR). If we need gradient reduction on an iteration, then ZeRO-2 supersedes ZeRO-1.

We prefer a simple state transition between `SHARDED` / `SHARDED_POST_FORWARD` and `UNSHARDED`, where the state directly defines what tensors are registered to the module. In particular, we _do not_ have a state where the sharded parameters are registered but the unsharded parameters are still in GPU memory. This greatly simplifies our state transitions, but it means that parameters may be non-intuitively registered to the module (e.g. if only the root does not reshard after forward, then the root will be the only without sharded parameters registered). To address this, we introduce a simple `reshard()` method that can force-reshard the parameters. This makes sense to me because the typical case does not care about the registered parameters after forward (in fact, for existing FSDP with `use_orig_params=False`, the unsharded parameters are still registered and are dangling tensors without storage.)

I plan to expose a complementary `unshard(async_op: bool = True)` method in the future.
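
A hedged sketch of using the force-reshard method described above (the already-sharded module and input batch are passed in as assumptions; construction and process-group setup are elided):

```python
def forward_and_free(fsdp_model, batch):
    # With reshard_after_forward=False (ZeRO-2-like), unsharded parameters stay
    # registered after forward; reshard() forces them back to the sharded state
    # so only sharded parameters remain registered.
    out = fsdp_model(batch)
    fsdp_model.reshard()
    return out
```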

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118017
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-02-07 19:14:20 +00:00
482d952e88 [quant][pt2e][bc-breaking] Remove fold_quantize flag (#118701)
Summary:
This is a follow up to https://github.com/pytorch/pytorch/pull/118605 to remove `fold_quantize` flag from
`convert_pt2e`

Test Plan: CI

Differential Revision: D53247301

BC Breaking Note:

The flag `fold_quantize` now defaults to True in `convert_pt2e`, so we'll fold the quantize op into the weight by default and users will see a model size reduction by default after pt2e quantization.
2.2
```
folded_model = convert_pt2e(model, fold_quantize=True)

non_folded_model = convert_pt2e(model)
```

2.3
```
folded_model = convert_pt2e(model)

non_folded_model = convert_pt2e(model, fold_quantize=False)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118701
Approved by: https://github.com/andrewor14, https://github.com/leslie-fang-intel
2024-02-07 19:10:51 +00:00
0e2330d84c fix lint (#119395)
Summary: as title

Test Plan: lint

Differential Revision: D53532399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119395
Approved by: https://github.com/tugsbayasgalan, https://github.com/malfet
2024-02-07 19:06:41 +00:00
23b030a79c [easy] Add testing utilties for torch.nn.utils.set_swap_module_params_on_conversion (#118023)
For the above PR, to parametrize existing `load_state_dict` tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118023
Approved by: https://github.com/albanD
ghstack dependencies: #118028, #117167
2024-02-07 18:55:44 +00:00
d5a718d27b Add swap_tensors path to nn.Module._apply (#117167)
Added `torch.__future__.{get/set}_swap_module_params_on_conversion`, which defaults to `False` for now, but we probably want to override this and default to `True` in `nn.Module._apply` if the input is a tensor subclass.

From offline discussion, for now we are **not** allowing `swap_tensor` after the first module forward has been run*** if the autograd graph is still alive. The reason being that `torch.utils.swap_tensors(t1, t2)` requires the `use_count` of both `TensorImpl`s associated with `t1` and `t2` to be 1.  The first forward pass will install `AccumulateGrad` nodes on each param, which [bump the refcount of the associated TensorImpl](6cf1fc66e3/torch/csrc/autograd/variable.cpp (L307)). **Future work might be to swap the refs that the `AccumulateGrad` nodes hold if it is necessary.**

***From this, it might seem like we don't need to handle gradients. However, I still handle the grads for the edge case that the grads are set via `p.grad = grad` OR the autograd graph is no longer alive because the output has been garbage collected.

If any `swap_tensors` fails on any of the parameters in the `nn.Module` we raise an error.

**`RNNBase` overrides `nn.Module._apply()` and installs weakrefs on some parameters. As a result, all modules that inherit from `RNNBase` (`RNN`, `GRU` and `LSTM`) cannot use the`swap_tensors` path as of now**
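
A small hedged example of the new future flag taking effect through `nn.Module._apply` (module choice is arbitrary; `RNN`/`GRU`/`LSTM` are excluded as noted above):

```python
import torch
import torch.nn as nn

torch.__future__.set_swap_module_params_on_conversion(True)

m = nn.Linear(2, 2)
m.double()  # _apply now swaps parameter tensors via torch.utils.swap_tensors
print(m.weight.dtype)  # torch.float64
```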

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117167
Approved by: https://github.com/albanD
ghstack dependencies: #118028
2024-02-07 18:55:44 +00:00
91d1d2c421 Make MHA Query Scaling Behaviors Consistent (#119323)
The multi-head attention (MHA) query scaling behaviors are not consistent when [`need_weights`](8ac9b20d4b/torch/nn/modules/activation.py (L1073)) values are different.

On the current main, when `need_weights = True`, the query scaling was performed using a [division](8ac9b20d4b/torch/nn/functional.py (L5434)) and it will be exported as a `Div` operator in ONNX. When `need_weights = False`, the query scaling was performed using a [multiplication](422b4271ae/aten/src/ATen/native/transformers/attention.cpp (L711)) and it will be exported as a `Mul` operator in ONNX defined in the [PyTorch ONNX Symbolics](422b4271ae/torch/onnx/symbolic_opset14.py (L177)).

We should make the query scaling behaviors consistent. On most of the platforms, multiplication performs no worse than division. Therefore, we should use multiplication consistently for both `need_weights = True` and `need_weights = False`.
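
A tiny check of the claim that multiplying by the reciprocal of the scale matches dividing by it (up to floating-point rounding); the shapes are arbitrary:

```python
import math
import torch

q = torch.randn(2, 4, 8)
scale = math.sqrt(q.size(-1))
assert torch.allclose(q / scale, q * (1.0 / scale))
```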
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119323
Approved by: https://github.com/mikaylagawarecki, https://github.com/albanD
2024-02-07 18:42:57 +00:00
5eda355e54 [inductor, test] remove cast for test_pow2_cpu (#114912)
Verifies https://github.com/pytorch/pytorch/issues/94010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114912
Approved by: https://github.com/angelayi
2024-02-07 18:32:30 +00:00
0dab6fb352 Fix estimate_nccl_collective_runtime (#118986)
`estimate_nccl_collective_runtime` has been broken and the errors have been silently swallowed by inductor. This PR:
- Fixes the issues described in https://github.com/pytorch/pytorch/issues/118497.
- Adds white-box testing so future issues can be surfaced in tests.
- Add support for native funcol IRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118986
Approved by: https://github.com/yf225
ghstack dependencies: #118910, #118911, #118437
2024-02-07 18:02:51 +00:00
088d538a8d Revert "[Inductor] GEMM shape padding improvements (#118522)"
This reverts commit cc46829f96dba05b9b46bae31a1e6d2a053f667e.

Reverted https://github.com/pytorch/pytorch/pull/118522 on behalf of https://github.com/eellison due to regresses HF ~4/5% ([comment](https://github.com/pytorch/pytorch/pull/118522#issuecomment-1932557670))
2024-02-07 17:42:14 +00:00
f6bf7d26e1 Print full exception info in Graph break log (#119292)
So, this is a little awkward, and I wouldn't mind more thoughts on how best to do this.

Let's suppose that you have a graph break inside of an inlined function call. We are not actually going to print this graph break yet; instead, we are going to restart analysis so that we can run up until the inlined function call. When this happens, the only log message we ever get is the log to `graph_break` (seen here) reporting that a graph break has occurred.

In the current code, we don't print the fully formatted exception if you are only using `graph_breaks` logging. So the exception that induced the graph break has its traceback lost forever. For some classes of errors, esp., guard on data-dependent SymInt, this is quite bad.

With this change, we do print the traceback. On this sample program:

```
import torch
import torch._dynamo.config

torch._dynamo.config.capture_scalar_outputs = True

def g(x, y):
    y = x.item()
    if y < 3:
        return x + 2
    else:
        return x + 3

@torch.compile()
def f(x, y):
    y = y * y
    return g(x, y)

f(torch.tensor(4), torch.randn(4))
```

It looks like this:

```
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Graph break: Traceback (most recent call last):
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/tensor.py", line 878, in evaluate_expr
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return guard_scalar(self.sym_num)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 414, in guard_scalar
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return guard_bool(a)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 663, in guard_bool
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return a.node.guard_bool("", 0)  # NB: uses Python backtrace
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/sym_node.py", line 366, in guard_bool
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     r = self.shape_env.evaluate_expr(self.expr, self.hint, fx_node=self.fx_node)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/recording.py", line 227, in wrapper
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return fn(*args, **kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3670, in evaluate_expr
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     concrete_val = self.size_hint(orig_expr)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3403, in size_hint
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     raise self._make_data_dependent_error(result_expr, expr)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.)  The expression we were trying to evaluate is u0 < 3 (unhinted: u0 < 3).  For more information, run with TORCH_LOGS="+dynamic".
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] During handling of the above exception, another exception occurred:
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Traceback (most recent call last):
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return inner_fn(self, inst)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     self.call_function(fn, args, {})
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     self.push(fn.call_function(self, args, kwargs))
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/functions.py", line 279, in call_function
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return super().call_function(tx, args, kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/functions.py", line 87, in call_function
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return tx.inline_user_function_return(
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 657, in inline_user_function_return
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 2262, in inline_call
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return cls.inline_call_(parent, func, args, kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 2372, in inline_call_
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     tracer.run()
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     and self.step()
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     getattr(self, inst.opname)(inst)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 431, in inner
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     eval_result = value.evaluate_expr(self.output)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/tensor.py", line 880, in evaluate_expr
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     raise UserError(  # noqa: TRY200
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] torch._dynamo.exc.UserError: Consider annotating your code using torch._constrain_as_*(). It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.)  The expression we were trying to evaluate is u0 < 3 (unhinted: u0 < 3).  For more information, run with TORCH_LOGS="+dynamic".
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#constrain-as-size-example
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] From user code at:
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/b.py", line 16, in f
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return g(x, y)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/b.py", line 8, in g
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     if y < 3:
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
```

The end of the log at the restarted computation could maybe be improved too. Right now it looks like this:

```
[2024-02-06 10:32:24,338] [0/0_1] torch._dynamo.symbolic_convert: [DEBUG] TRACE CALL_FUNCTION 2 [UserFunctionVariable(), LazyVariableTracker(), TensorVariable()]
[2024-02-06 10:32:24,338] [0/0_1] torch._dynamo.output_graph: [DEBUG] COMPILING GRAPH due to GraphCompileReason(reason='Consider annotating your code using torch._constrain_as_*(). It appears that you\'re trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.)  The expression we were trying to evaluate is u0 < 3 (unhinted: u0 < 3).  For more information, run with TORCH_LOGS="+dynamic".\n\nFor more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#constrain-as-size-example', user_stack=[<FrameSummary file /data/users/ezyang/b/pytorch/b.py, line 16 in f>, <FrameSummary file /data/users/ezyang/b/pytorch/b.py, line 8 in g>], graph_break=True)
```

An alternative to doing it this way, is I can make symbolic shapes print a warning log when guard on unbacked SymInt itself, so we don't have to worry about Dynamo generating the backtrace well. If, for the most part, the backtrace for other graph breaks is irrelevant, then this would seem to be a more expedient solution.

PTAL and submit your opinions.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119292
Approved by: https://github.com/yanboliang
2024-02-07 17:20:31 +00:00
f79ae7599a [export] fakify module state in nonstrict (#119297)
Summary:
Previously, we were not fakifying module state explicitly in the nonstrict path.

This led to errors when modules were constructed under a fake mode, since the user-provided fake mode was clashing with the one that we had constructed internally to fakify the inputs.

This fixes things to use a single fake mode for everything.

As a side effect, this raised the question of how we ought to serialize state_dicts/constants that might be fake tensors. Naively calling torch.save understandably explodes—so this diff piggybacks on our infra for doing this on meta["val"]. Open to revising this, I'm low confidence that it's the best way to do it.

Test Plan: unit tests

Differential Revision: D53484942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119297
Approved by: https://github.com/tugsbayasgalan
2024-02-07 17:12:22 +00:00
40ec155e58 [AOTI][refactor] Split common aoti_runtime utils into a separate header (#119066)
Summary: Split common utils from aoti_runtime/model.h into a separate header file, because when turning on ABI-compatible mode for JIT Inductor we won't need AOTInductorModel, but we do need some common utils, e.g. RAIIAtenTensorHandle.

Differential Revision: [D53478809](https://our.internmc.facebook.com/intern/diff/D53478809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119066
Approved by: https://github.com/khabinov
2024-02-07 16:54:00 +00:00
059994d2b7 Migrate load_state_dict hook tests to OptimizerInfo (#119310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119310
Approved by: https://github.com/albanD
ghstack dependencies: #119283, #119288, #119299, #119308
2024-02-07 16:00:01 +00:00
0320e62255 Migrate test_state_dict hooks to OptimizerInfo (#119308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119308
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #119283, #119288, #119299
2024-02-07 16:00:01 +00:00
5c46600f84 [RELAND] refactor lazy init to device-agnostic (#119248)
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init` which is a device-agnostic API that can support any backend. And change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.

# Additional Context
No need more UTs.
This is a reland PR, the original PR is [refactor lazy init to device-agnostic](https://github.com/pytorch/pytorch/pull/118846).
This is a common PR, and does not trigger xpu ciflow.

Differential Revision: [D53478332](https://our.internmc.facebook.com/intern/diff/D53478332)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119248
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/atalman
2024-02-07 15:58:51 +00:00
3625ccfbea Move step global hooks test to OptimizerInfo (#119299)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119299
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #119283, #119288
2024-02-07 15:50:31 +00:00
7b3762e6bc Move step pre/post hook tests to OptimizerInfo (#119288)
Note that this increases coverage from 1 config (vanilla SGD) to all the configs (13 optimizers at around 6-7 each). The test time seems fine though!

With the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b6093c03)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 13.680s

OK
```

Excluding the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 1.038s

OK
```

The old tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_pre_hook -k test_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..
----------------------------------------------------------------------
Ran 2 tests in 0.518s

OK
```
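
For reference, a minimal sketch of the per-optimizer step hook API these tests exercise (standard `register_step_pre_hook`/`register_step_post_hook` usage, not code from this PR):
```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def pre_hook(optimizer, args, kwargs):
    print("before step")

def post_hook(optimizer, args, kwargs):
    print("after step")

opt.register_step_pre_hook(pre_hook)
opt.register_step_post_hook(post_hook)

model(torch.randn(3, 4)).sum().backward()
opt.step()  # prints "before step" then "after step"
```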

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119288
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #119283
2024-02-07 15:50:31 +00:00
99ddfaf572 Add symbol guard counts instrumentation (#119290)
This helps us understand if there are symbols which are extremely hot
(i.e., have a lot of guards mentioning them).  Extremely hot symbols are
candidates for being turned static.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119290
Approved by: https://github.com/bdhirsh
2024-02-07 14:35:14 +00:00
7c95cc5e03 Add basic reference documentation for symbolic_shapes.py (#118997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118997
Approved by: https://github.com/albanD
2024-02-07 14:33:42 +00:00
1435cfecfa Increase accumulate_grad_ gradient's expected refcount to account for pybind (#119068)
Account for the pybind wrapper of the op holding 1 ref when torch.ops.inductor.accumulate_grad_.default is called at runtime

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119068
Approved by: https://github.com/jansel
ghstack dependencies: #118817, #119334
2024-02-07 10:25:43 +00:00
326dcf9dc8 Never reuse accumulated gradients' buffers (#119334)
Since accumulate grad may steal the gradient's `c10::Storage`, we can't reuse its buffer, otherwise the gradient would get overwritten. From benchmarks, using the inductor's codegen'd _empty_strided_cpu/cuda and assigning to it has lower overhead than deep copying the gradient and reusing its buffer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119334
Approved by: https://github.com/jansel
ghstack dependencies: #118817
2024-02-07 10:25:42 +00:00
8e14e1d514 Fix gradient refcounts in pybind and compiled autograd (#118817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118817
Approved by: https://github.com/jansel
2024-02-07 10:25:42 +00:00
d85631b721 Revert "Fix deadlock in ExecutionTraceObserver (#119242)"
This reverts commit 6fc775ae13b675f8d02f7f85bc4348bba3ae3dd3.

Reverted https://github.com/pytorch/pytorch/pull/119242 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/119242#issuecomment-1931445631))
2024-02-07 07:37:22 +00:00
dfdbd73360 add Half support for flash attention (#119247)
Re-open for https://github.com/pytorch/pytorch/pull/118368.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119247
Approved by: https://github.com/drisspg, https://github.com/malfet
2024-02-07 05:57:41 +00:00
0f478d9d61 [Dynamo][15/N] Merge allow_in_graph/inline/skip trace rules check into trace_rule.lookup (#118971)
Finally we have this PR to merge allow_in_graph/inline/skip trace rules into ```trace_rules.lookup_inner```, where we can define and look up trace rules at both the function level and the file level. Going forward, this is the central place where we define and consult the Dynamo trace rule for any function.
* ```trace_rules.lookup``` is the API that can return allow_in_graph, inline or skip.
* ```skipfiles.check``` is the API that can return inline or skip, since we have multiple places that only do the inline/skip check.
  *  I'll move ```skipfiles.check``` to ```trace_rules.check``` as one of the follow-ups.
* Both functions consult ```trace_rules.lookup_inner``` to get the tracing rule.

To avoid a single big PR, I left a few items as the follow-ups:
* Remove ```skipfiles.py``` and merge the code into ```trace_rules.py```.
* We do a double check in ```symbolic_convert.check_inlineable```; I will refactor and simplify it. We should only do the inline/skip check before generating ```SkipFilesVariable``` and ```UserFunctionVariable```.
* Rename ```SkipFilesVariable``` to ```SkipFunctionVariable```, since we only handle functions.
* The inline/skip reasons are not logged in some cases, since the new lookup framework doesn't always return inline/skip reasons. I'll refactor the logging to record the inline/skip reason in the next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118971
Approved by: https://github.com/jansel
2024-02-07 05:15:39 +00:00
284b0b5f44 Add --local-ranks-filter to torchrun: allow logs filtering by rank (#118562)
Addresses issue https://github.com/pytorch/pytorch/issues/117383

The implementation exposes `--local-ranks-filter`, which selects by rank which files we pass to `TailLog` (used in torchrun to determine which logs to output to stdout/stderr)

## Behavior
### with --tee
Currently --tee is implemented as --redirect to a file, with the file streamed to the console using `tail`. When --tee is specified, file logs are unaffected and we only filter the output to the console.

### with --redirect
When --redirect is specified without --tee, nothing is logged to console, so we no-op.

### with neither
When neither --tee nor --redirect is specified, torchrun uses the empty string "" to indicate logging to the console. We intercept this empty string and redirect it to "/dev/null" so that nothing is printed to the console.

The api also allows a per-rank configuration for --tee and --redirect, and is also supported by this filter implementation.

## Usage
### without --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --local_rank_filter=0 t.py
hello from rank 0 python
DEBUG: TRACED GRAPH
 __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
-------------  ------  -----------------------  ---------  --------
placeholder    l_x_    L_x_                     ()         {}
call_function  mul     <built-in function mul>  (l_x_, 5)  {}
output         output  output                   ((mul,),)  {}
...
```
### with --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --tee 3 --local_rank_filter=0 t.py
[rank0]:hello from rank 0 python
[rank0]:DEBUG: TRACED GRAPH
[rank0]: __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
[rank0]:-------------  ------  -----------------------  ---------  --------
[rank0]:placeholder    l_x_    L_x_                     ()         {}
[rank0]:call_function  mul     <built-in function mul>  (l_x_, 5)  {}
[rank0]:output         output  output                   ((mul,),)  {}
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118562
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-02-07 04:29:54 +00:00
6c3600d008 Enable optional tensorList fallback to cpu. (#119273)
Add optional tensorList fallback to CPU.
Add test cases; the old PR is: https://github.com/pytorch/pytorch/pull/106449

@bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119273
Approved by: https://github.com/bdhirsh
2024-02-07 03:54:13 +00:00
53ee47ca32 [vision hash update] update the pinned vision hash (#119337)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119337
Approved by: https://github.com/pytorchbot
2024-02-07 03:43:26 +00:00
ee1c2449f7 [dynamo] delete dynamo cache entry when guard function is invalidated [attempt 2] (#119107)
Attempt #2 for https://github.com/pytorch/pytorch/pull/117875 to fix https://github.com/pytorch/pytorch/issues/112090.

Summary of changes:
- ~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~ (done by C++ refactor)
- Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated.
- ~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~ (done by C++ refactor)
- CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free.
- code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references.
- Added tests that check for memory leaks and cache deletion operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119107
Approved by: https://github.com/jansel
2024-02-07 03:32:42 +00:00
fcc36de9d6 [ONNX][dynamo_export] Turn off opmath type promotion for div (#119112)
Skip opmath promotion for `_prims_common.ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT` as well.
Fixes https://github.com/pytorch/pytorch/issues/118941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119112
Approved by: https://github.com/thiagocrepaldi
2024-02-07 03:27:00 +00:00
45a79323fe Add torch.dtype instances to the public API (#119307)
Fixes #91908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119307
Approved by: https://github.com/albanD
2024-02-07 02:57:49 +00:00
8c2fde1fcf [EZ][BE] [CMake] Remove checks for GCC-7 (#119306)
As PyTorch now uses C++17 and needs gcc-9.4+ to compile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119306
Approved by: https://github.com/Skylion007
2024-02-07 01:24:01 +00:00
e9907a3446 [PyTorch] Free up 8 bytes per intrusive_ptr_target (#117986)
We don't need 64-bit reference and weak counts. (We also probably don't need a full 32 bits, but we'll deal with that later.)

Differential Revision: [D52851891](https://our.internmc.facebook.com/intern/diff/D52851891/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117986
Approved by: https://github.com/ezyang
2024-02-07 00:48:00 +00:00
5f2ad407a9 Fix typo on torch.frombuffer() documentation (#119214)
Fixes #114345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119214
Approved by: https://github.com/albanD
2024-02-07 00:41:51 +00:00
5ae6f6cffe Test seo torch cuda (#119324)
Testing if this will help improve SEO of this page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119324
Approved by: https://github.com/albanD
2024-02-07 00:39:51 +00:00
728228a7c7 LazyGraphModule: improve the fix for the FakeTensorMode mismatch issue (#119311)
The previous fix https://github.com/pytorch/pytorch/pull/118981 misses some corner cases. It works when both LazyGraphModule and compiled-autograd are enabled, but it fails with a FakeTensorMode mismatch error again if LazyGraphModule+CompiledAutograd+DynamicShape are all enabled. Note that disabling any of the three avoids the issue.

The reason enabling DynamicShape breaks the previous fix is that we call the bw_compiler here, before running the backward pass, if there are symints saved for backward: 73f0fdea5b/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L382)

The bw_compiler may cause an extra GraphModule recompilation on the bw_module, which makes its forward method become the lazy one again. The fix is simply to delay applying the previous fix until after the potential extra call to the bw_compiler.

Repro on hf_Whisper:
```
CUDA_VISIBLE_DEVICES=1 time benchmarks/dynamo/torchbench.py -dcuda --training --backend=inductor --disable-cudagraphs --accuracy --only hf_Whisper --repeat 1 --compiled-autograd  --dynamic-batch-only
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119311
Approved by: https://github.com/xmfan, https://github.com/jansel
2024-02-07 00:35:39 +00:00
e868a7fedd [AOTI] Rename config.aot_inductor.abi_compatible (#119065)
Summary: Rename config.aot_inductor.abi_compatible to config.abi_compatible, since the cpp_wrapper mode in JIT Inductor will share the same flag.

Differential Revision: [D53478752](https://our.internmc.facebook.com/intern/diff/D53478752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119065
Approved by: https://github.com/khabinov
2024-02-07 00:14:33 +00:00
c814d8e5c2 Fix handling random() calls encountered inside inlined code. (#119218)
Fix https://github.com/pytorch/pytorch/issues/118787

In the compiled function, calls to random() are replaced with a single call to a function
that generates all the random variables.
The random calls encountered during compilation used to be tracked inside a variable
stored in the instruction translator, and when there are nested translators, the tracked
calls would get lost when the inner instruction translator popped out.

This diff fixes that by moving the tracked calls to the output graph, which is shared across translators that are generating the same function.

More details about the issue and why this solution is picked are in the github issue above.
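
A hypothetical toy pattern (not the original repro) showing random() encountered while Dynamo inlines a helper:
```python
import random
import torch

def helper():
    # random() call encountered while the inner instruction translator inlines `helper`
    return random.random()

@torch.compile(backend="eager")
def f(x):
    return x + helper()

print(f(torch.ones(2)))
```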

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119218
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-02-06 23:48:21 +00:00
5e78c4b0f4 [dynamo] Functools partial reconstruct (#118583)
Replaces #117721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118583
Approved by: https://github.com/yanboliang
ghstack dependencies: #118901, #118616
2024-02-06 23:42:43 +00:00
62cc1053d8 [dynamo] Fix missing guards in FunctoolsPartialVariable (#118616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118616
Approved by: https://github.com/yanboliang
ghstack dependencies: #118901
2024-02-06 23:42:43 +00:00
6fc775ae13 Fix deadlock in ExecutionTraceObserver (#119242)
Summary:
With the compiled PyTorch module, in execution_trace_observer.cpp, function convertIValue calls TensorImpl->storage_offset(). That function call will trigger a recursive call into recordOperatorStart. It will cause a deadlock on ob.g_mutex.

This diff fixes the deadlock by replacing std::mutex with std::recursive_mutex.

Since PyTorch only has one thread for FWD and one thread for BWD, contention is very low and performance should NOT be a concern.
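
A Python analogue of the fix (illustrative only; the actual change is in C++ and swaps std::mutex for std::recursive_mutex): a plain lock self-deadlocks on re-entry from the same thread, a recursive lock does not.
```python
import threading

lock = threading.RLock()  # recursive-mutex analogue; threading.Lock() would deadlock below

def record_operator_start():
    with lock:
        pass  # observer callback re-acquires the lock

def convert_ivalue():
    with lock:
        # reading storage_offset() on a compiled module re-enters the observer
        record_operator_start()

convert_ivalue()
print("no deadlock")
```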

Test Plan:
Unit Test
    buck test  mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2

Differential Revision: D53299183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119242
Approved by: https://github.com/aaronenyeshi
2024-02-06 23:36:22 +00:00
d0ca849fdf Refactor Symint Deduping to separate pass (#118938)
Previously, SymInt deduping was done during proxy tracing, which made it more difficult to reason about. This refactors the deduping into a separate pass.

We only dedupe symints which are resolvable from input symint nodes so as to avoid inducing a dependency on the backward in the forward.

potential fix for : https://github.com/pytorch/pytorch/issues/118224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118938
Approved by: https://github.com/ezyang
2024-02-06 23:07:31 +00:00
dea15c9fdc Revert "Add meta registration for _foreach_norm (#118604)"
This reverts commit b8bb12cd454b716da6a98db826fcc45fd7c0db05.

Reverted https://github.com/pytorch/pytorch/pull/118604 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/118604#issuecomment-1930849491))
2024-02-06 22:20:44 +00:00
6c1cca153e [quant][pt2e] Allow users to override train/eval behavior (#119091)
Summary: This commit adds a util for PT2E quantization users
to call `model.train()` and `model.eval()` without error.
Instead, these will automatically call the equivalent
`move_exported_model_to_train/eval` for the user, which only
switch behavior for special ops like dropout and batchnorm.
This enables users to onboard to the PT2E flow more easily.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_allow_exported_model_train_eval

Reviewers: jerryzh168, tugsbayasgalan, zhxchen17

Subscribers: jerryzh168, tugsbayasgalan, zhxchen17, supriyar

Differential Revision: [D53426636](https://our.internmc.facebook.com/intern/diff/D53426636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119091
Approved by: https://github.com/jerryzh168, https://github.com/tugsbayasgalan, https://github.com/zhxchen17
2024-02-06 22:19:58 +00:00
9d46fe603d Revert "[c10d] PGNCCL refactor part 1: adds assert size==1 (#119099)"
This reverts commit 4ab852b6c558a0b8e9fea0c863c782fe65f00be0.

Reverted https://github.com/pytorch/pytorch/pull/119099 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/119099#issuecomment-1930839754))
2024-02-06 22:14:36 +00:00
0f68bcaa5c Make filename optional in update_failures.py (#119289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119289
Approved by: https://github.com/zou3519
2024-02-06 21:56:09 +00:00
422b4271ae Change PrivateUse1's resize_bytes to PrivateUse1HooksInterface (#117839)
Reopen from https://github.com/pytorch/pytorch/pull/117211
Modify the logic for entering the registration branch so that existing UTs are not affected.
Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117839
Approved by: https://github.com/albanD
2024-02-06 20:51:56 +00:00
ae4e866bba [dynamo] refactor CacheEntry and ExtraState to eval_frame.c to C++ (#118438)
Part of implementing CacheEntry invalidation to fix https://github.com/pytorch/pytorch/issues/112090.

Changes:
- Move CacheEntry and ExtraState to C++
- Use pybind to control reference counting
- Use std::list instead of manually implementing a linked list

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118438
Approved by: https://github.com/jansel
2024-02-06 20:48:11 +00:00
73f0fdea5b [fix] accounting for dilation in pool padding assertion (#118897)
Fixes https://github.com/pytorch/pytorch/issues/7541

It is a copy of https://github.com/pytorch/pytorch/pull/111427, I have failed to fix all its issues in time, and it got closed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118897
Approved by: https://github.com/mikaylagawarecki
2024-02-06 20:32:58 +00:00
ec31d11580 [dynamo] Skip dynamo when inside a functorch context (#118901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118901
Approved by: https://github.com/zou3519
2024-02-06 20:22:24 +00:00
f3645fc38b Update auto_functionalize docs (#119228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119228
Approved by: https://github.com/zou3519
2024-02-06 19:50:54 +00:00
f85b0ea8bb Migrate last lbfgs test over to OptimizerInfo (#119283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119283
Approved by: https://github.com/Skylion007, https://github.com/mikaylagawarecki
2024-02-06 19:49:05 +00:00
3f0fd36835 Introduce size oblivious guards (#118579)
Fixes https://github.com/pytorch/pytorch/issues/117361

The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one.

This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds.

The infra pieces of this PR are:

* Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv
* When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`.
* Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way.

The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises.

As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.)

When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a guard on unbacked SymInt problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor. The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them where ever you need some handling for special1! As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!)
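
A hedged sketch of the Python API described above (treat the exact import path and the plain-bool handling as assumptions for illustration):
```python
import torch
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

def framework_style_check(t: torch.Tensor) -> torch.Tensor:
    # For a size-like unbacked SymInt u0, this guard is evaluated assuming u0 >= 2
    # at compile time, while 0 and 1 are still accepted at runtime.
    if guard_size_oblivious(t.shape[0] != 1):
        return t
    return t.squeeze(0)

print(framework_style_check(torch.ones(3, 2)).shape)  # torch.Size([3, 2])
```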

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579
Approved by: https://github.com/eellison, https://github.com/lezcano
2024-02-06 19:45:32 +00:00
5410385c42 [dynamo] support comparing stream with constant (#119199)
Before the pr, we have a graph break for:
```python
def f():
    if torch.cuda.current_stream() is not None:
        return torch.randn(2, 2)
torch.compile(f, backend="eager", fullgraph=True)()
```
This PR supports comparison ops between StreamVariable and ConstantVariable by returning a constant.

It's safe to return a constant in this case because the StreamVariable is guarded by ID_MATCH when created.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119199
Approved by: https://github.com/yifuwang, https://github.com/anijain2305, https://github.com/jansel
2024-02-06 19:26:03 +00:00
fa157af69c [mypy] declare type for DynamoTestCase._exit_stack (#119084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119084
Approved by: https://github.com/Skylion007
2024-02-06 18:26:07 +00:00
238d87f74d Add a short code snippet in the RNN doc (#119150)
Fixes #109443,
also remove a duplicated comment line `# Efficient implementation equivalent to the following:` in scaled_dot_product_attention doc.

@mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119150
Approved by: https://github.com/malfet
2024-02-06 17:41:51 +00:00
169c070076 Move catch_errors_wrapper to convert_frame (#119253)
With this change, we now have the invariant that eval_frame only
contains "hot" functions that are called at runtime, as opposed to
cold functions which are only called at compile time.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119253
Approved by: https://github.com/yanboliang
ghstack dependencies: #119251
2024-02-06 17:40:07 +00:00
790858afa9 Make start compiling stack trace omit framework frames (#119251)
Fixes https://github.com/pytorch/pytorch/issues/119238

Here's what it looks like now:

```
$ TORCH_LOGS=+torch._dynamo.convert_frame python a.py
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] torchdynamo start compiling f /data/users/ezyang/b/pytorch/a.py:3, stack (elided 5 frames):
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]   File "/data/users/ezyang/b/pytorch/a.py", line 7, in <module>
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]     f(torch.randn(2))
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]     return fn(*args, **kwargs)
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]
$ cat a.py
import torch

@torch.compile
def f(x):
    return x * 2

f(torch.randn(2))
```

The eval_frame frame is intentionally present, since what happens is you run the torch.compile wrapper, and then you actually hit the user frame to be compiled.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119251
Approved by: https://github.com/yanboliang, https://github.com/mlazos
2024-02-06 17:40:07 +00:00
22669843c2 Reserve sizes in c10::VaryingShape::concrete_sizes(), c10::TensorType::computeStrideProps() (#119189)
Summary: Costly reallocs.

Test Plan: CI

Reviewed By: efiks

Differential Revision: D53264908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119189
Approved by: https://github.com/Skylion007
2024-02-06 17:13:37 +00:00
8ee9f26ce8 [Dynamo] Remove build_checkpoint_variable from call_getattr (#119236)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119236
Approved by: https://github.com/jansel
2024-02-06 16:59:40 +00:00
2ad3599a71 Add torch.backends.mha.get_fastpath_enabled to FUNC_INLINELIST (#118979)
Summary: Add torch.backends.mha.get_fastpath_enabled to FUNC_INLINELIST

Test Plan: See the one in D53154041
Reviewed By: yjhao, yanboliang, Yuzhen11

Differential Revision: D53154041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118979
Approved by: https://github.com/yanboliang
2024-02-06 16:25:33 +00:00
a77be631e0 Bugfix to MixtureSameFamily's _pad_mixture_dimension (#118947)
Fixes Issue #73792

This is a duplicate of pull request #73864. It's a small bugfix that should have happened a long time ago, but it didn't because I didn't actually follow up with the pull request after originally submitting it. That's my bad; trying to remedy the error.

This contains a fix to _pad_mixture_dimension, which intends to count the number of dimensions in its referent tensors, but accidentally counts the number of elements (and can thus end up creating tensors with potentially thousands of dimensions by mistake). Also contains a single test for the fixed behavior.
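
The distinction at the heart of the fix, illustrated (hypothetical tensor, not the original test):
```python
import torch

t = torch.zeros(10, 20)
print(t.dim())    # 2   -> the number of dimensions _pad_mixture_dimension should count
print(t.numel())  # 200 -> the number of elements the buggy code effectively counted
```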

Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118947
Approved by: https://github.com/soulitzer
2024-02-06 16:24:22 +00:00
499040ac32 Revert "Add FakeTensor support to torch._utils._rebuild_tensor (#108186)"
This reverts commit 426339e4de2efc0cbd501e2bff947ba890ec9817.

Reverted https://github.com/pytorch/pytorch/pull/108186 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/108186#issuecomment-1929978008))
2024-02-06 15:04:48 +00:00
1e4b408b02 [decomp] Add tests for different dtypes to SDPA decomposition (#119239)
Summary: As titled. Skipping torch.bfloat16 because for some reason the
difference is 0.01.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119239
Approved by: https://github.com/drisspg
2024-02-06 11:17:07 +00:00
85033759d6 Update scatter_reduce_ test with parallel backend check (#118708)
**Summary**
Follow-up of https://github.com/pytorch/pytorch/pull/118278, in which the newly added UT `test_scatter_using_atomic_add` failed with the `native parallel backend`, as reported in https://github.com/pytorch/pytorch/issues/118518.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118708
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/lezcano
2024-02-06 09:43:40 +00:00
7d7a3f0b37 [inductor] Support sympy.expr in user-defined Triton kernel grid fn (#119165)
## Problem

A user-defined Triton kernel grid may use a sympy magic method like `Max`. This comes in the form of a `sympy.Expr`, namely `sympy.core.function.FunctionClass`.

Handling this is not trivial since `user_defined_kernel_grid_fn_code` is used in Eager & Inductor. Eager usage below.

## Approach

Pass in wrapper when Inductor codegens grid with ints/sympy.Expr, so we can utilize wrapper functions, such as `codegen_shape_tuple()`.

Differential Revision: D53367012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119165
Approved by: https://github.com/aakhundov
2024-02-06 08:39:55 +00:00
8a8e70477e Fix type hints on nn.attention.sdpa_kernel (#119140)
Fixes #119133
Altered type hint and assert to include SDPBackend; disallowed None in assert.
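
A hedged usage sketch of the context manager whose type hints were fixed (assuming a list of SDPBackend values is accepted; a single SDPBackend is also allowed per this fix):
```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(2, 4, 8, 16)
with sdpa_kernel([SDPBackend.MATH]):  # restrict SDPA to the math backend
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
print(out.shape)
```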

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119140
Approved by: https://github.com/mikaylagawarecki, https://github.com/cpuhrsch, https://github.com/drisspg
2024-02-06 07:33:22 +00:00
720f781160 [CPU] Optimize softmax as flash attention v2 (#118957)
### Description
Following FlashAttention-2, optimize softmax by moving the division by the sum out of the KV inner loop.
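
An illustrative PyTorch sketch (not the actual C++ kernel) of the FlashAttention-2 style rescaling this relies on: the division by the running sum is deferred until after the KV loop instead of renormalizing inside it.
```python
import torch

def online_softmax_weighted_sum(score_blocks, value_blocks):
    m = torch.tensor(float("-inf"))                 # running max
    s = torch.tensor(0.0)                           # running sum of exponentials
    acc = torch.zeros(value_blocks[0].shape[-1])    # unnormalized output accumulator
    for scores, values in zip(score_blocks, value_blocks):
        m_new = torch.maximum(m, scores.max())
        scale = torch.exp(m - m_new)                # rescale previous accumulator
        p = torch.exp(scores - m_new)
        s = s * scale + p.sum()
        acc = acc * scale + p @ values
        m = m_new
    return acc / s                                  # single division outside the loop

# Sanity check against a direct softmax over the concatenated scores.
scores = [torch.randn(4) for _ in range(3)]
values = [torch.randn(4, 8) for _ in range(3)]
ref = torch.softmax(torch.cat(scores), dim=0) @ torch.cat(values)
print(torch.allclose(online_softmax_weighted_sum(scores, values), ref, atol=1e-5))
```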

### Performance
Stable Diffusion V2.1 on GNR

| Version | Kernel time (s) | Speedup |
|---------|----------------|----------------|
| BF16 Before | 28.67 | |
| BF16 After | 23.55 | 17.86% |
| FP32 Before | 54.20 | |
| FP32 After | 49.47 | 8.73% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118957
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-02-06 07:06:36 +00:00
4ab852b6c5 [c10d] PGNCCL refactor part 1: adds assert size==1 (#119099)
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119099
Approved by: https://github.com/wconstab
2024-02-06 06:59:47 +00:00
884b6d2a67 [inductor] Implementing missing magic methods on IR values. (#118933)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118933
Approved by: https://github.com/peterbell10
2024-02-06 05:50:26 +00:00
e47f571da7 Revert "Update scatter_reduce_ test with parallel backend check (#118708)"
This reverts commit d670dfb7ae0a88cf010455301eb1d0ef91950f1a.

Reverted https://github.com/pytorch/pytorch/pull/118708 on behalf of https://github.com/leslie-fang-intel due to Test Case still fail ([comment](https://github.com/pytorch/pytorch/pull/118708#issuecomment-1928767568))
2024-02-06 04:37:08 +00:00
12ac3ba383 [executorch hash update] update the pinned executorch hash (#118936)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118936
Approved by: https://github.com/pytorchbot
2024-02-06 03:41:33 +00:00
3497388b9f [export] Fix serialization for auto_functionalization (#118810)
- Added support for serializig the auto_functionalization op, which
  required adding the functions `serialize_arbitrary_inputs` and
  `serialize_arbitrary_outputs` which will serialize the inputs/outputs
  without needing a schema, since HOOs do not have a schema.
- Added support for serializing user input mutations
- Added support for serializing operator inputs. They just get turned
  into strings.

Differential Revision: [D53331039](https://our.internmc.facebook.com/intern/diff/D53331039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118810
Approved by: https://github.com/suo
2024-02-06 03:41:05 +00:00
03db96c248 [Dynamo] Enhance autograd.Function strict mode test (#119237)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119237
Approved by: https://github.com/zou3519
2024-02-06 02:54:19 +00:00
074f2bb5ce Fix dynamo benchmark runner for torchbench skip sets (#118615)
Fix the dynamo benchmark runner for the torchbench skip sets, which were introduced by PR #118032

This runner.py script is still used in the [Inductor CPU Performance Dashboard](https://github.com/pytorch/pytorch/issues/93531) regular test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118615
Approved by: https://github.com/jgong5, https://github.com/ysiraichi, https://github.com/ezyang
2024-02-06 02:06:54 +00:00
9250965f8b [ez] Lower windows timeout limit for trunk, set test step timeout (#119234)
Lower the Windows timeout to be the same as Linux.

Also set a test step timeout for Windows (the Linux version and details on why are in https://github.com/pytorch/pytorch/pull/93084)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119234
Approved by: https://github.com/huydhn
2024-02-06 01:54:31 +00:00
86d5d1650b [dynamo] support dict.clear() (#119197)
For code like following:
```python
import torch
def f():
    a = {"a": torch.randn(2, 2)}
    a.clear()
    return a
torch.compile(f, backend="eager", fullgraph=True)()
```

We have a graph break before the pr:
```
torch._dynamo.exc.Unsupported: call_method ConstDictVariable() clear [] {}
```

Test Plan:
Added new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119197
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-02-06 01:17:55 +00:00
c0164f2393 Revert "[BE] Add dtypesIfMPS to ModuleInfo enabling float16 tests for MPS and remove all skipIfMPS for float64 (#119039)"
This reverts commit 04d52d5399ad4abb8af9e8405be79e2a7f8b4c7a.

Reverted https://github.com/pytorch/pytorch/pull/119039 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing MPS test in trunk 04d52d5399,  may be a landrace ([comment](https://github.com/pytorch/pytorch/pull/119039#issuecomment-1928595240))
2024-02-06 01:13:28 +00:00
3829b55416 [inductor] Support ProxyExecutor argument codegen for sympy.Expr (#119166)
Differential Revision: D53398312

## Problem
Currently, if a sympy expression that uses a magic method like `Max` is passed as an argument to ProxyExecutor, then C++ compilation will fail. We need to use std::max method instead.

```
# What we see
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{Max(1025, u1)}.data(), ...);

# What we want
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{std::max(1025L, u1)}.data(), ...)
```

## Approach
Use C++ wrapper's expression printer to handle this conversion

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119166
Approved by: https://github.com/aakhundov
2024-02-06 00:33:25 +00:00
781f7c9080 [BE] Use OptimizerInfo step_requires_closure, only_supports_sparse_grads (#119230)
So I had planned ahead of time to use these but forgot to actually use them when migrating tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119230
Approved by: https://github.com/albanD
2024-02-06 00:13:43 +00:00
69344fe987 c10d: Don't add NCCL backend by default without CUDA (#119149)
The NCCL backend requires CUDA (including devices) to be available, so don't use that backend by default when it isn't, in order to avoid the following error when creating a CPU-only device mesh:
> RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
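
A hedged single-process sketch of the CPU-only scenario (the environment-variable setup is only for a local one-process run):
```python
import os
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# Previously this could fail with "ProcessGroupNCCL is only supported with GPUs"
# on a machine without CUDA devices; a CPU-only mesh should use gloo.
mesh = init_device_mesh("cpu", (1,))
print(mesh)
dist.destroy_process_group()
```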

Fixes #117746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119149
Approved by: https://github.com/kwen2501
2024-02-05 23:55:07 +00:00
fd0bf96c2b [inductor] make multi-kernel work with cpp-wrapper (#117813)
Make multi-kernel work with cpp-wrapper. Multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work together initially.

Thanks Jason for suggesting a neat way to integrate the two. cpp-wrapper currently does two codegen passes. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly, and multi-kernel Python code is not generated since it is no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
2024-02-05 23:35:41 +00:00
04d52d5399 [BE] Add dtypesIfMPS to ModuleInfo enabling float16 tests for MPS and remove all skipIfMPS for float64 (#119039)
Right now, `ModuleInfo.dtypes` defaults to `torch.testing._internal.common_dtype.floating_types()`, and almost no ModuleInfos override this (so only `float32` and `float64` are tested).

This is the first step to clean up/improve dtype testing for `ModuleInfos` and fix #116626.

Follow-up PRs will update `dtypes=` (and perhaps `dtypesIf{Device}`, if it makes sense) for each `ModuleInfo`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119039
Approved by: https://github.com/janeyx99
2024-02-05 23:19:01 +00:00
d9d8c2b79f Remove HSDP validation check (#112435)
Currently, HSDP validates that all intra/inter node PGs are the same. This makes sense if you are only using HSDP with no other forms of parallelism and is a nice but not necessary sanity check.

However, if you want to mix HSDP with other forms, say tensor parallelism on the FFN of a transformer block, the intra/inter node PGs will be different for that layer. This check raises errors in this scenario, so we need to remove this assumption.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112435
Approved by: https://github.com/wz337, https://github.com/Skylion007
2024-02-05 22:27:53 +00:00
966db82c9d Revert "Remove extra graph breaks (#118987)"
This reverts commit 9a8e3b07d75e3e9bb902f81b4b6e1042bbe06b58.

Reverted https://github.com/pytorch/pytorch/pull/118987 on behalf of https://github.com/eellison due to reverting because it causes regression ([comment](https://github.com/pytorch/pytorch/pull/118987#issuecomment-1928224447))
2024-02-05 22:19:37 +00:00
b8bb12cd45 Add meta registration for _foreach_norm (#118604)
This PR also fixes the discrepancy between the _foreach_norm fast path and slow path, where storage_offsets would differ between the lists of tensors. Here are some profile results showing that we aren't significantly slower. Do note that we're replacing N `as_strided`/`select` calls with N `empty` calls.

For script:
```
import torch

ts = [torch.rand(32, 16, device="cuda") for _ in range(128)]

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    res = torch._foreach_norm(ts)
print(p.key_averages().table(sort_by="cpu_time_total"))
```

OG baseline:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7cf98987)]$ python playground2.py
STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                    aten::_foreach_norm        25.36%       4.209ms        99.94%      16.586ms      16.586ms       8.000us        88.89%       9.000us       9.000us             1
                                       cudaLaunchKernel        61.21%      10.159ms        61.21%      10.159ms       2.540ms       0.000us         0.00%       0.000us       0.000us             4
                                            aten::zeros         0.43%      71.000us        58.35%       9.683ms       9.683ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::zero_         0.33%      55.000us        57.35%       9.517ms       9.517ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::fill_         0.42%      69.000us        57.01%       9.462ms       9.462ms       1.000us        11.11%       1.000us       1.000us             1
                                           aten::select         8.04%       1.335ms        11.29%       1.873ms      14.633us       0.000us         0.00%       0.000us       0.000us           128
                                       aten::as_strided         3.24%     538.000us         3.24%     538.000us       4.203us       0.000us         0.00%       0.000us       0.000us           128
                                            aten::empty         0.90%     150.000us         0.90%     150.000us      75.000us       0.000us         0.00%       0.000us       0.000us             2
                                  cudaDeviceSynchronize         0.06%      10.000us         0.06%      10.000us      10.000us       0.000us         0.00%       0.000us       0.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        11.11%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us        66.67%       6.000us       3.000us             2
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        22.22%       2.000us       2.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 16.596ms
Self CUDA time total: 9.000us
```

And here's after this PR:
```
STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                    aten::_foreach_norm        30.95%       4.653ms        99.95%      15.026ms      15.026ms       9.000us        90.00%      10.000us      10.000us             1
                                       cudaLaunchKernel        52.41%       7.879ms        52.41%       7.879ms       1.970ms       0.000us         0.00%       0.000us       0.000us             4
                                            aten::zeros         0.39%      58.000us        48.29%       7.260ms       7.260ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::zero_         0.35%      53.000us        47.25%       7.103ms       7.103ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::fill_         0.43%      65.000us        46.90%       7.050ms       7.050ms       1.000us        10.00%       1.000us       1.000us             1
                                            aten::empty        15.42%       2.318ms        15.42%       2.318ms      17.969us       0.000us         0.00%       0.000us       0.000us           129
                                  cudaDeviceSynchronize         0.05%       7.000us         0.05%       7.000us       7.000us       0.000us         0.00%       0.000us       0.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        10.00%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us        60.00%       6.000us       3.000us             2
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us        30.00%       3.000us       3.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 15.033ms
Self CUDA time total: 10.000us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118604
Approved by: https://github.com/albanD
2024-02-05 22:01:01 +00:00
51e096114b Increase recommended logging in DEFAULT_LOGGING (#119207)
For long running batch jobs, it is best to opt for logs that are too
spammy rather than not spammy enough.  This lines up DEFAULT_LOGGING
with our current internal guidance at Meta.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119207
Approved by: https://github.com/bdhirsh
2024-02-05 21:59:10 +00:00
5086e1cf3f Remove distributed/c10d/Functional.hpp (#119138)
This file is useless and was accidentally checked in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119138
Approved by: https://github.com/Skylion007
2024-02-05 21:58:08 +00:00
200108c6e6 Delete old branches (#117079)
Example https://github.com/pytorch/pytorch/actions/runs/7562281351/job/20592425611?pr=117079 (The code to delete branches isn't being run, it's just listing the branches it wants to delete)

Internal code: https://fburl.com/code/hdvvbfkj

The threshold for a branch with a PR is 30 days regardless of whether the PR is merged (compared to 3 days if merged and 30 days if closed). The threshold for a branch without a PR is 1.5 years (same internally).

Threshold of ~400 queries to github so it doesn't hit token usage limits.  Currently this leads to about 350 branches deleted per run.

Only query for the last 90 days of updated PRs to reduce token usage, so if a branch has a PR but it was updated 90+ days ago, it will think it doesn't have a PR and will wait for the 1.5 years branch update check instead, regardless of whether the PR is open or closed.

I tested that it could delete my own branch and it worked.

labeled with test-config/crossref because I just want the smallest test config possible to reduce CI usage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117079
Approved by: https://github.com/malfet
2024-02-05 20:50:05 +00:00
b816760a2f More progress on type checking ValueRanges (#118870)
Type checking Python is a pain. Here are my learnings:

* The types for heavily polymorphic code are going to be verbose; no way around it. I was originally hoping I could lean on polymorphism with a bounded TypeVar to compactly write signatures for many of the ValueRanges methods, but I ran into some unworkaroundable mypy bugs. Writing out all the types explicitly and using `@overload` liberally works pretty well, so I recommend people do that instead of trying to do fancy things (see the sketch after this list).
* Sympy is missing annotations for assumptions, because they are all metaprogrammed. I don't really relish maintaining a typeshed for sympy, so I wrote a small mypy plugin to add them in.
* GADT style refinement is... just not a good idea in practice. Mypy easily gets confused about whether a return value from a refined section is allowed for the outer return type. So many of these have been replaced with less informative implementation types and more informative external types via overloads. Hopefully this is good for use sites.
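
A generic illustration (not code from this PR) of the explicit-`@overload` recommendation, using a toy ValueRanges-like class:
```python
from dataclasses import dataclass
from typing import Union, overload

@dataclass
class VR:
    lower: Union[int, float]
    upper: Union[int, float]

@overload
def clamp(x: int, vr: VR) -> int: ...
@overload
def clamp(x: float, vr: VR) -> float: ...
def clamp(x: Union[int, float], vr: VR) -> Union[int, float]:
    # Explicit overloads let mypy track that an int input yields an int result,
    # without a bounded TypeVar or GADT-style refinement.
    return max(vr.lower, min(vr.upper, x))

print(clamp(7, VR(0, 5)), clamp(2.5, VR(0.0, 5.0)))
```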

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118870
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-02-05 20:29:25 +00:00
b92819a039 Move nn.Module.load_state_dict tests from test_nn.py to separate file (#118028)
Move these tests out so that in https://github.com/pytorch/pytorch/pull/117913 we can run these tests with both `torch.nn.utils.set_swap_module_params_on_conversion({True/False})` settings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118028
Approved by: https://github.com/albanD
2024-02-05 20:17:28 +00:00
71655bccbe Fix wrong mobile build Docker image (#119213)
It turns out that the Docker image name hasn't been updated and is referring to a non-existent image. Maybe we could update `calculate-docker-image` to fail in this case, if there is a way to distinguish a non-existent-name failure from a missing-tag failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119213
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/malfet
2024-02-05 19:48:10 +00:00
962fca6839 [storage][perf] Reduce _get_device_from_module overhead. (#119144)
Using `rsplit` with maxsplit=1 is more efficient since it 1) stops traversal as soon as the first `.` from the right side is encountered, and 2) creates no more than a 2-element list.

This change also reuses `last_part` to avoid unnecessary repetition of a split.
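
A quick illustration of the point:
```python
name = "model.layer3.block.conv.weight"
prefix, last_part = name.rsplit(".", 1)  # stops at the first '.' from the right, returns 2 parts
print(prefix)     # model.layer3.block.conv
print(last_part)  # weight
```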
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119144
Approved by: https://github.com/Skylion007, https://github.com/mikaylagawarecki
2024-02-05 19:33:18 +00:00
b964a1222c Revert "[inductor] make multi-kernel work with cpp-wrapper (#117813)"
This reverts commit c24ffc3f66b2270dfc65a404687b91b55ed580e9.

Reverted https://github.com/pytorch/pytorch/pull/117813 on behalf of https://github.com/atalman due to Failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/117813#issuecomment-1927877102))
2024-02-05 19:25:39 +00:00
b2e0f8d82d [mypy] added type annotations to codegen_nodes methods (#119080)
added correct type annotations to scheduler and backends'
codegen_nodes methods

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119080
Approved by: https://github.com/eellison
2024-02-05 18:33:52 +00:00
88e346680b Patch all_gather to support HSDP + TP (#118638)
Update all_gather to support HSDP + TP.

Currently, the `_all_gather_dtensor` function for dtensors only replaces the first dimension with replicate (the FSDP dimension) and does not touch the second dimension (which is assumed to be the TP dimension). With HSDP, we have two dimensions ahead of the TP dimension as opposed to one. This PR updates the function to replace all other dimensions with replicate in order to run the all-gather.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118638
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wz337
2024-02-05 18:29:23 +00:00
f481835115 Revert "add Half support for flash attention on CPU (#118368)" (#119204)
This reverts commit a5a63db3bf937a6eff993d1222fab18cc63f9cb2.

Fixes #ISSUE_NUMBER

Reverts #118368

Got reverted internally, but the branch got deleted so the automation didn't work

Mildly edited stack trace
```

...
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "torch/_dynamo/eval_frame.py", line 453, in _fn
    return fn(*args, **kwargs)
  File "torch/_dynamo/external_utils.py", line 25, in inner
    return fn(*args, **kwargs)
  File "torch/fx/experimental/proxy_tensor.py", line 635, in dispatch_trace
    graph = tracer.trace(root, concrete_args)
  File "torch/fx/experimental/proxy_tensor.py", line 995, in trace
    res = super().trace(root, concrete_args)
  File "torch/_dynamo/eval_frame.py", line 453, in _fn
    return fn(*args, **kwargs)
  File "torch/_dynamo/external_utils.py", line 25, in inner
    return fn(*args, **kwargs)
  File "torch/fx/_symbolic_trace.py", line 793, in trace
    (self.create_arg(fn(*args)),),
  File "torch/fx/experimental/proxy_tensor.py", line 665, in wrapped
    out = f(*tensors)
  File "<string>", line 1, in <lambda>
  File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 357, in _functionalized_f_helper
    f_outs = fn(*f_args)
  File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 68, in inner_fn
    outs = fn(*args)
  File "torch/_functorch/_aot_autograd/utils.py", line 161, in flat_fn
    tree_out = fn(*args, **kwargs)
  File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 618, in functional_call
    out = PropagateUnbackedSymInts(mod).run(
  File "torch/fx/interpreter.py", line 145, in run
    self.env[node] = self.run_node(node)
  File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 593, in run_node
    result = super().run_node(n)
  File "torch/fx/interpreter.py", line 202, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
  File "torch/fx/interpreter.py", line 274, in call_function
    return target(*args, **kwargs)
  File "torch/_ops.py", line 571, in __call__
    return self_._op(*args, **kwargs)
  File "torch/_subclasses/functional_tensor.py", line 380, in __torch_dispatch__
    outs_unwrapped = func._op_dk(
  File "torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "torch/fx/experimental/proxy_tensor.py", line 744, in __torch_dispatch__
    return self.inner_torch_dispatch(func, types, args, kwargs)
  File "torch/fx/experimental/proxy_tensor.py", line 779, in inner_torch_dispatch
    return proxy_call(self, func, self.pre_dispatch, args, kwargs)
  File "torch/fx/experimental/proxy_tensor.py", line 423, in proxy_call
    r = maybe_handle_decomp(proxy_mode, func, args, kwargs)
  File "torch/fx/experimental/proxy_tensor.py", line 1225, in maybe_handle_decomp
    return CURRENT_DECOMPOSITION_TABLE[op](*args, **kwargs)
  File "torch/_decomp/decompositions.py", line 4322, in scaled_dot_product_flash_attention_for_cpu
    torch._check(
  File "torch/__init__.py", line 1133, in _check
    _check_with(RuntimeError, cond, message)
  File "torch/__init__.py", line 1116, in _check_with
    raise error_type(message_evaluated)
RuntimeError: query must be FP32, FP64, BF16 but got torch.float16

While executing %_scaled_dot_product_flash_attention_for_cpu : [num_users=1] = call_function[target=torch.ops.aten._scaled_dot_product_flash_attention_for_cpu.default](args = (%l_q_, %l_k_, %l_v_), kwargs = {attn_mask: %l_attn_mask_})
Original traceback:
  File "executorch/backends/xnnpack/partition/graphs/sdpa.py", line 34, in forward
    return torch.nn.functional.scaled_dot_product_attention(
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119204
Approved by: https://github.com/kit1980
2024-02-05 18:24:53 +00:00
ab613a4019 Revert "refactor lazy init to device-agnostic (#118846)"
This reverts commit 520771d7b35034c96c5b4604ecf8960e6aab856f.

Reverted https://github.com/pytorch/pytorch/pull/118846 on behalf of https://github.com/atalman due to Failing, tests https://github.com/pytorch/torchdistx/blob/main/src/python/torchdistx/_C/fake.cc#L11  ([comment](https://github.com/pytorch/pytorch/pull/118846#issuecomment-1927651305))
2024-02-05 18:06:30 +00:00
124a54ef16 [jit][perf] Reduce lookupInModule overhead. (#119145)
It's inefficient to split the remaining parts of the module name by '.' just to join them back again. Instead it's more idiomatic and efficient to use `maxsplit=1` so that the remaining parts stay intact. This improves the best-case time and space complexity, since the scan can terminate at the first encountered `.` and only 2 parts are returned in the list.
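
A minimal Python sketch of the idea (illustrative only, not the exact `lookupInModule` change):

```python
name = "a.b.c.d"

# Before: split every component, then join the tail back together.
parts = name.split(".")                      # ['a', 'b', 'c', 'd']
head, rest = parts[0], ".".join(parts[1:])   # ('a', 'b.c.d'), builds an extra list + join

# After: stop at the first '.', so the tail stays intact.
head, rest = name.split(".", maxsplit=1)     # ('a', 'b.c.d'), at most 2 pieces
```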

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119145
Approved by: https://github.com/Skylion007
2024-02-05 18:01:00 +00:00
fa8d97776c [aotinductor] Migrate fuse_split_linear_add from dper_pass to AOTI based on predispatch IR (#118983)
Summary: As titled. Added support for fuse_split_linear_add in pregrad passes based on the predispatch IR

Test Plan: TORCH_LOGS=inductor,aot   buck2 run  mode/opt mode/inplace caffe2/test/inductor/fb:test_split_cat_fx_passes_aten_fb

Differential Revision: D53302168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118983
Approved by: https://github.com/kflu, https://github.com/chenyang78
2024-02-05 17:58:42 +00:00
5f9f771711 [DeviceMesh][Test] Remove test_raises_mesh_dim_less_than_2 (#119172)
The test is no longer applicable after we allow 1D slice from 1D mesh. https://github.com/pytorch/pytorch/pull/118895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119172
Approved by: https://github.com/awgu, https://github.com/atalman
2024-02-05 17:34:51 +00:00
d444a3b443 [MPS] fix float32 error on mps, in linalg.matrix_rank and linalg.pinv (#114771)
Fixes #114285

(However, this still hits a NotImplementedError:
```NotImplementedError: The operator 'aten::_linalg_svd.U' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.```)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114771
Approved by: https://github.com/lezcano
2024-02-05 15:36:55 +00:00
a72190fd51 make nanogpt work with both compiled autograd and _LazyGraphModule (#118981)
@xmfan and @fegin reported that _LazyGraphModule ( https://github.com/pytorch/pytorch/pull/117911 ) makes nanogpt training fail with compiled autograd.

We have a repro:  ``` python benchmarks/dynamo/torchbench.py --training --backend=inductor --disable-cudagraphs --accuracy --only nanogpt --repeat 1 --compiled-autograd ```
but it's still mysterious how to trigger the issue with a toy model.

The error message for the failure is https://gist.github.com/shunting314/6402a6388b3539956090b6bc098952fb . In compile_fx we will call `detect_fake_mode`. This function will look for an active FakeTensorMode from both TracingContext and example inputs. The error is triggered because we find different FakeTensorMode from these 2 sources.

Although I don't know what really causes the discrepancy of FakeTensorMode above, the fix here is to force _LazyGraphModule recompilation if compiled autograd is enabled. This does not hurt compilation time most of the time, because when compiled autograd is enabled we will call the graph module in the backward pass anyway: 855d5f144e/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L705)
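
A rough Python sketch of the workaround (names such as `real_recompile` are assumptions about the _LazyGraphModule interface, not a quote of the actual change):

```python
def prepare_graph_module(gm, compiled_autograd_enabled: bool):
    # Force the lazy graph module to materialize its compiled code eagerly, so that
    # compile_fx later sees one consistent FakeTensorMode instead of two conflicting ones.
    if compiled_autograd_enabled and hasattr(gm, "real_recompile"):
        gm.real_recompile()
    return gm
```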

Let me know if we can have a better fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118981
Approved by: https://github.com/jansel
2024-02-05 10:40:06 +00:00
d670dfb7ae Update scatter_reduce_ test with parallel backend check (#118708)
**Summary**
Follow up of https://github.com/pytorch/pytorch/pull/118278, in which the newly added UT `test_scatter_using_atomic_add` failed with the `native parallel backend` as reported in https://github.com/pytorch/pytorch/issues/118518.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118708
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-02-05 08:48:45 +00:00
0348975a87 Set up new logging artifact for SymNode (#119158)
Fixes #113876

Hi, I updated various logging configs and the SymNode module to use the new dedicated logging artifact. This is my first PyTorch PR; I mirrored my changes off of https://github.com/pytorch/pytorch/pull/111808.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119158
Approved by: https://github.com/ezyang
2024-02-05 07:34:54 +00:00
0245000be8 [DeviceMesh] Temporarily disable re-use subgroup (#118940)
Summary:
The re-use subgroup logic is causing GLOO to time out on two internal modelstore tests (relevant tests in the test plan).
We are temporarily disabling re-use subgroup while root-causing so that the internal tests can run again; they are currently omitted, as shown in T176426987.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118940
Approved by: https://github.com/wanchaol
2024-02-05 06:30:00 +00:00
0c3a1c893e [dynamo] Setup the globals for guard_fn without a reference to f_locals (#118447)
UPDATE - I changed the PR because from discussion with @jansel it was clear that someone else was holding on to a reference to f_locals. This PR now solves that problem first. I removed the eval_frame.c part because it was failing tests that use `exec` or `eval` with a weird error like `no no locals found when storing 'math'`. I will debug that in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118447
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #118975, #118420
2024-02-05 05:39:39 +00:00
b8307513e5 [torchelastic][rendezvous] Add option to enable libuv for TCPStore based rendezvous backend (#118944)
Summary:
Expose an option to enable libuv in the TCPStore-based rendezvous backend to allow better scaling.

Libuv support has been added recently and allows scaling for more than 2K nodes.

Test Plan: Unit tests

Differential Revision: D53335860

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118944
Approved by: https://github.com/wconstab
2024-02-04 23:11:32 +00:00
5ebed6f1c3 [torch] fix comment typo (#118656)
Summary: as title

Differential Revision: D49841787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118656
Approved by: https://github.com/Skylion007, https://github.com/zhxchen17
2024-02-04 22:20:09 +00:00
0d5f53a2f9 fix forward test_memory_planning.py (#119109)
Summary: fixes a broken test and also makes it run correctly in fbcode

Test Plan: test

Reviewed By: angelayi

Differential Revision: D53373709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119109
Approved by: https://github.com/angelayi
2024-02-04 21:45:07 +00:00
052e824467 improve CUDACachingAllocator lock contention (#118550)
Summary: NativeCachingAllocator has a global lock, which shows lock contention when one process uses multiple GPUs. The lock is required to look up a Block from a pointer. We can make the lock more fine-grained to reduce the contention.
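
A conceptual Python sketch of the locking change (the real allocator is C++; names and the sharding scheme here are illustrative):

```python
import threading

class CoarseGrainedAllocator:
    """One global lock: pointer lookups for different GPUs all contend on it."""
    def __init__(self):
        self.lock = threading.Lock()
        self.blocks = {}  # ptr -> block metadata

    def lookup(self, ptr):
        with self.lock:
            return self.blocks.get(ptr)

class FineGrainedAllocator:
    """One lock (and block table) per device: lookups on different GPUs no longer contend."""
    def __init__(self, num_devices):
        self.locks = [threading.Lock() for _ in range(num_devices)]
        self.blocks = [{} for _ in range(num_devices)]

    def lookup(self, device, ptr):
        with self.locks[device]:
            return self.blocks[device].get(ptr)
```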

Test Plan: existing unittests, verified on prod models using eight GPUs showing double-digit improvements

Differential Revision: D52493091

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118550
Approved by: https://github.com/albanD
2024-02-04 16:45:25 +00:00
b41f3e8df1 [AOTI] Make abi_compatible as default for OSS CI (#119126)
Summary: Introduce an environment variable AOT_INDUCTOR_ABI_COMPATIBLE to control the ABI-compatible mode, and turn it on for OSS CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119126
Approved by: https://github.com/chenyang78
ghstack dependencies: #119125
2024-02-04 15:48:58 +00:00
79b20aec76 [AOTI] Support copy_, _fft_c2c and view_as_real in C shim (#119125)
Summary: These ops exist in GoogleFnet. Also add a Complex fallback for convert_element_type. After this PR, we can enable ABI-compatible mode for the AOTInductor OSS CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119125
Approved by: https://github.com/chenyang78
2024-02-04 15:48:58 +00:00
cee16353db [Dynamo][autograd.Function] Should graph break on stride accesses in backward (#119137)
Fixes #118399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119137
Approved by: https://github.com/oulgen
2024-02-04 09:08:45 +00:00
8f82a44a5b Run device mesh tests with native funcol enabled (#118437)
### Summary

Run the relevant tests in `test/distributed/_tensor/test_dtensor_compile.py` and `test/distributed/test_device_mesh.py` with native funcol enabled, in addition to the existing runs with it disabled.

All tests except `test_tp_compile_comm_reordering` pass. This is expected because the native funcols have slightly different IRs, so the reordering pass needs to be adjusted. This test is disabled for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118437
Approved by: https://github.com/LucasLLC
ghstack dependencies: #118910, #118911
2024-02-04 04:11:11 +00:00
cyy
e3371ff739 Use correct type of indices in ForeachUtils.h (#119116)
Fix a type mismatch detected by MSVC:
```
C:\Program Files\Microsoft Visual Studio\2022\Preview\VC\Tools\MSVC\14.39.33519\include\xutility(255): warning C4267: 'initializing': conversion from 'size_t' to '_Ty', possible loss of data
        with
        [
            _Ty=int
        ]
C:\Program Files\Microsoft Visual Studio\2022\Preview\VC\Tools\MSVC\14.39.33519\include\xutility(255): note: the template instantiation context (the oldest one) is
pytorch/aten/src\ATen/native/ForeachUtils.h(363): note: see reference to function template instantiation '_Ty &std::vector<_Ty,std::allocator<_Ty>>::emplace_back<const I&>(const I &)' being compiled
        with
        [
            _Ty=int,
            I=size_t
        ]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119116
Approved by: https://github.com/Skylion007
2024-02-04 04:03:54 +00:00
6620176da7 Add documentation for meta device (#119119)
Fixes https://github.com/pytorch/pytorch/issues/119098

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119119
Approved by: https://github.com/bdhirsh
2024-02-04 01:05:22 +00:00
dab16b6b8e s/supress/suppress/ (#119132)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119132
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-02-04 00:54:14 +00:00
abc09b27b9 Some minor type stub improvements (#118529)
I was just playing around with improving the typing of symbolic_shapes. The PR is not "complete", but I particularly wanted feedback on whether people like making ValueRanges Generic; distinguishing whether you have an Expr ValueRange or a SympyBoolean ValueRange seems to be a lot of trouble for downstream code. Using TypeGuard, we can perform refinements on the generic parameter inside methods, although we still have to cast back to ValueRange[T] due to https://github.com/python/mypy/issues/14425#issuecomment-1914852707

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118529
Approved by: https://github.com/Skylion007
2024-02-04 00:19:00 +00:00
3ed9df36a9 Clean up some obsolete TODOs in run_test and several test files (#119113)
* The TODOs in `test/test_nestedtensor.py` have been mitigated; I keep the issue for reference.
* ~~The TODOs in `test/test_ops_fwd_gradients.py` doesn't apply anymore~~
* The TODOs in `run_test.py` to support disabling C++ tests are probably not going to happen.  I have never seen a flaky C++ test that needed to be disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119113
Approved by: https://github.com/kit1980
2024-02-03 23:54:30 +00:00
26a2743162 Fix placeholder tensor is empty for relu in mps (#118965)
Fixes #118845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118965
Approved by: https://github.com/malfet
2024-02-03 23:50:35 +00:00
0ddcb5c3ca Include the documentation on scale arg being a keyword only arg (#119129)
Fixes #117240
@drisspg
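
A small usage example of the documented behavior (`scale` is accepted only as a keyword argument):

```python
import torch
import torch.nn.functional as F

q = k = v = torch.randn(2, 4, 8, 16)  # (batch, heads, seq, head_dim)
out = F.scaled_dot_product_attention(q, k, v, scale=1.0 / 16 ** 0.5)  # scale passed by keyword
```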

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119129
Approved by: https://github.com/drisspg
2024-02-03 23:41:06 +00:00
ffae20e594 [BE][MPS] Add dictionaryFromPlaceholders (#119077)
These are convenience methods that create a dictionary from placeholders, making the code more compact.
Also added a `runMPSGraph` overload that takes a Placeholder instead of an output dictionary, as the majority of operators have just one output.
A typical change looks as follows
```patch
-    NSDictionary<MPSGraphTensor*, MPSGraphTensorData*>* feeds = @{
-      selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(),
-    };
-    NSDictionary<MPSGraphTensor*, MPSGraphTensorData*>* results =
-        @{outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData()};
-    runMPSGraph(stream, cachedGraph->graph(), feeds, results);
+    auto feeds = dictionaryFromPlaceholders(selfPlaceholder);
+    runMPSGraph(stream, cachedGraph->graph(), feeds, outputPlaceholder);
   }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119077
Approved by: https://github.com/kit1980, https://github.com/albanD
2024-02-03 22:07:02 +00:00
2d64fddd48 [dtensor] add op support for nll_loss_forward (#118917)
This is part of the work to support cross entropy in dtensor.

This PR doesn't support nll_loss computation with input sharded on the channel dimension yet. In that case, redistribution to Replicate is needed in sharding propagation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118917
Approved by: https://github.com/wanchaol
2024-02-03 20:08:10 +00:00
4c397e6ec6 [Dynamo] Add correct guards for tracable tensor subclasses (#119110)
Fixes #118896
```
(pt) [ybliang@devgpu002.ash8 ~/local/pytorch (subclass)]$ TORCH_LOGS="+guards" python test/dynamo/test_subclasses.py -k test_torch_dispatch_subclass_guard_recompile
/home/ybliang/local/miniconda3/envs/pt/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
[2024-02-02 16:43:02,186] [0/0] torch._dynamo.guards.__guards: [DEBUG] GUARDS:
[2024-02-02 16:43:02,186] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['w'], 110557008)                           # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['w'].a, '_dynamo_dynamic_indices') == False         # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['w'].b, '_dynamo_dynamic_indices') == False         # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] utils_device.CURRENT_DEVICE == None                           # _dynamo/output_graph.py:388 in init_ambient_guards
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_current_backend(139704947520224)                     # _dynamo/output_graph.py:394 in init_ambient_guards
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] check_tensor(L['w'].a, Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[2, 2], stride=[2, 1])  # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] check_tensor(L['w'].b, Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[2, 2], stride=[2, 1])  # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,206] [0/1] torch._dynamo.guards.__guards: [DEBUG] GUARDS:
[2024-02-02 16:43:02,207] [0/1] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['w'], '_dynamo_dynamic_indices') == False           # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,207] [0/1] torch._dynamo.guards.__guards: [DEBUG] utils_device.CURRENT_DEVICE == None                           # _dynamo/output_graph.py:388 in init_ambient_guards
[2024-02-02 16:43:02,207] [0/1] torch._dynamo.guards.__guards: [DEBUG] ___check_current_backend(139704947520224)                     # _dynamo/output_graph.py:394 in init_ambient_guards
[2024-02-02 16:43:02,207] [0/1] torch._dynamo.guards.__guards: [DEBUG] check_tensor(L['w'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[2, 2], stride=[2, 1])  # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119110
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh, https://github.com/yoyoyocmu
2024-02-03 18:12:51 +00:00
7a52455102 [dynamo] Refactor TensorVariable method handling (#119111)
This should slightly improve compile times and be easier to maintain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119111
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-02-03 17:18:19 +00:00
fcf22a853d Enable test_ellipsis_index_2 with Torch dynamo (#118773)
Fix issue #118819

test_ellipsis_index_2 specifically tests properties of torch._numpy.array()
and that a field tensor is being added, hence the overridden imports.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118773
Approved by: https://github.com/anijain2305, https://github.com/lezcano
2024-02-03 10:33:48 +00:00
1adedc3c86 [decomp] Remove pixel_shuffle from core aten decomps (#118921)
pixel_shuffle is a core aten op
(https://pytorch.org/docs/main/torch.compiler_ir.html#core-aten-ir) so we should not decompose it.

https://github.com/pytorch/pytorch/pull/118239 added a decomp for it which is causing an internal test failure
(https://www.internalfb.com/intern/test/281475090561210/) which cases on the pixel_shuffle operator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118921
Approved by: https://github.com/SherlockNoMad, https://github.com/lezcano
2024-02-03 08:21:32 +00:00
4dc53f777b Fix dynamo failure w/ astype (#117952)
The torch "fake" ndarray had some mismatches vs numpy.ndarray which caused test_sparse_to_sparse_compressed to fail under dynamo.

This also fixes (because the test now hits it) a problem where unpacking a sequence with the incorrect number of args would assert in dynamo instead of graph breaking (because it would throw an exception). Added a unit test for this condition.

Fixed:
- torch._numpy._ndarray.astype() (actually used by the test)
- torch._numpy._ndarray.put() (drive-by discovery)
- torch._numpy._ndarray.view() (drive-by discovery)

(burndown item 7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117952
Approved by: https://github.com/yanboliang
ghstack dependencies: #117951
2024-02-03 08:10:15 +00:00
c6c851102f Fix test_compressed_layout_conversions_coverage to check BSC format (#117951)
test_compressed_layout_conversions_coverage verifies torch's conversions between different memory layouts using numpy as a reference. Since numpy doesn't support the BSC format, the test just skipped it. Instead, fake it by using a transposed BSR format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117951
Approved by: https://github.com/zou3519
2024-02-03 08:10:15 +00:00
6c8faf4680 [executorch] Run llama in xplat (#118831)
Summary:
Error running llama in xplat, where the half type isn't part of the c10_mobile targets. See:  D53158320

This diff:
- Creates a `torch_mobile_all_ops_et` target, which is the same as `torch_mobile_all_ops`, except with a preprocessor flag (C10_MOBILE_HALF) to support Half type
- Check C10_MOBILE_HALF in LinearAlgebra.cpp and include it
- Use `torch_mobile_all_ops_et` for executorch, instead of `torch_mobile_all_ops`.

Considerations:
- Using `torch_mobile_all_ops_et` across executorch means that our runtime binary size for xplat aten increases (see test plan for increase amount, thanks tarun292 for the pointer). This may be okay, as aten mode isn't used in production.

Test Plan:
Run language llama in xplat:
```
buck2 run xplat/executorch/examples/models/llama2:main_aten -- --model_path llama-models/very_new_checkpoint_h.pte --tokenizer_path llama-models/flores200sacrebleuspm.bin --prompt 'fr Hello' --eos
```
And in fbcode:
```
buck2 run fbcode//executorch/examples/models/llama2:main_aten -- --model_path llama-models/very_new_checkpoint_h.pte --tokenizer_path llama-models/flores200sacrebleuspm.bin --prompt 'fr Hello' --eos
```

Test executor_runner size increase with:
```
buck2 build fbcode//executorch/sdk/fb/runners:executor_runner_aten
```
| |original|this diff (+half dtype)|diff|
|---|---|---|---|
|unstripped|214975784|214976472|+688|
|stripped|71373488|71373808|+320|

Differential Revision: D53292674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118831
Approved by: https://github.com/larryliu0820
2024-02-03 08:07:19 +00:00
a64b03a58e Move lr tensor to cuda if needed (#119073)
Fixes https://github.com/pytorch/pytorch/issues/119026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119073
Approved by: https://github.com/eellison
2024-02-03 07:34:33 +00:00
41b63b26c2 [dynamo] Fix incorrect docstring placements in _guards.py. (#119114)
Incorrectly placed docstrings are unavailable when using help() and other tools that access them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119114
Approved by: https://github.com/kit1980
2024-02-03 06:25:54 +00:00
9a8e3b07d7 Remove extra graph breaks (#118987)
Fixes https://github.com/pytorch/pytorch/issues/104053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118987
Approved by: https://github.com/janeyx99
2024-02-03 05:55:09 +00:00
ce40ee8ecd [FSDP] Fixed device_mesh and auto wrap (#119064)
If the user passes `device_mesh`, then we should not forward the process groups to the children during auto wrap and instead just rely on the `device_mesh` argument. This should fix https://github.com/pytorch/pytorch/issues/118906.
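
A hedged usage sketch of the intended pattern (assumes torch.distributed is already initialized and CUDA devices are available):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

mesh = init_device_mesh("cuda", (dist.get_world_size(),))
model = FSDP(
    torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)),
    device_mesh=mesh,                                      # auto-wrapped children rely on the mesh,
    auto_wrap_policy=ModuleWrapPolicy({torch.nn.Linear}),  # not on forwarded process groups
)
```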

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119064
Approved by: https://github.com/wz337
2024-02-03 03:57:29 +00:00
18fc1ca7d9 [MPS][BE] Add native lerp support (#119036)
By implementing `out = self + weight * (end-self)` as an MPS graph
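
For reference, a small CPU example of the identity used here (not MPS-specific):

```python
import torch

start = torch.tensor([1.0, 2.0, 3.0])
end = torch.tensor([5.0, 6.0, 7.0])
weight = 0.25
# torch.lerp computes start + weight * (end - start)
assert torch.allclose(torch.lerp(start, end, weight), start + weight * (end - start))
```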

LERP is tested by `test_output_match_lerp_cpu_float[32|16]` based on OpInfo and 10+ tests from `test_optim.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119036
Approved by: https://github.com/albanD
2024-02-03 02:58:50 +00:00
30d3ff1659 Inline gradcheck functions since they don't have C bindings (#119047)
Gradcheck functions are in Python, so they shouldn't be in `torch_c_binding_in_graph_functions`.
Fixes #118792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119047
Approved by: https://github.com/yanboliang, https://github.com/zou3519
2024-02-03 02:46:39 +00:00
372e9550bd ProcessGroupGloo::reduce_scatter_tensor_coalesced (#118911)
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues, but I'd rather just fix the coverage.

### This PR
We already have a fallback impl for `_reduce_scatter_base`, which is composed from all-reduce + scatter. The scatter was not necessary: it introduced extra communication and a sync point, and forced the impl to fail on `asyncOp=True`. This PR does the following:
- Simulate reduce-scatter with `allreduce(inp).chunk(world_size)[rank]` (see the sketch after this list). This is still 2x the communication of a real reduce-scatter (since all-reduce = reduce-scatter + all-gather), but it's strictly better than what we have now.
- By doing the above, the comm becomes async and we don't have to fail on `asyncOp=True`.
- The general logic is implemented in `reduce_scatter_tensor_coalesced`. `_reduce_scatter_base` just calls it with single input/output.
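
A minimal Python sketch of the fallback idea (illustrative, not the ProcessGroupGloo C++ code; assumes the process group is already initialized):

```python
import torch
import torch.distributed as dist

def reduce_scatter_tensor_fallback(output: torch.Tensor, input: torch.Tensor) -> None:
    # All-reduce the full input on every rank, then each rank keeps only its own chunk.
    dist.all_reduce(input, op=dist.ReduceOp.SUM)
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    output.copy_(input.chunk(world_size)[rank])
```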

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118911
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118910
2024-02-03 02:42:47 +00:00
65314a6129 [c10d] add an unit test for unordered destruction of PGs (#119045)
Summary:
We were suspecting that ncclCommsAbort was hung due to a NCCL 2.17 'bug'
triggered by different ranks calling the destructors of different PGs in
different orders. This can be reproed in a NCCL-level test for 2.17.
We need a test case in c10d to constantly check that PGs can be destructed
in different orders.
Test Plan:
Run the test and verify that the printed destruction orders are as expected
```
$  python test/distributed/test_c10d_nccl.py
ProcessGroupNCCLTest.test_close_multi_pg_unordered
NCCL version 2.19.3+cuda12.0
[rank0]:[W ProcessGroupNCCL.cpp:1128] [PG 2 Rank 0] ProcessGroupNCCL
destructor entered.
[rank0]:[W ProcessGroupNCCL.cpp:1147] [PG 2 Rank 0] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank1]:[W ProcessGroupNCCL.cpp:1128] [PG 1 Rank 1] ProcessGroupNCCL
destructor entered.
[rank1]:[W ProcessGroupNCCL.cpp:1147] [PG 1 Rank 1] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank0]:[W ProcessGroupNCCL.cpp:1151] [PG 2 Rank 0] ProcessGroupNCCL
abort finished.
[rank0]:[W ProcessGroupNCCL.cpp:1128] [PG 1 Rank 0] ProcessGroupNCCL
destructor entered.
[rank0]:[W ProcessGroupNCCL.cpp:1147] [PG 1 Rank 0] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank1]:[W ProcessGroupNCCL.cpp:1151] [PG 1 Rank 1] ProcessGroupNCCL
abort finished.
[rank1]:[W ProcessGroupNCCL.cpp:1128] [PG 2 Rank 1] ProcessGroupNCCL
destructor entered.
[rank1]:[W ProcessGroupNCCL.cpp:1147] [PG 2 Rank 1] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank0]:[W ProcessGroupNCCL.cpp:1151] [PG 1 Rank 0] ProcessGroupNCCL
abort finished.
[rank1]:[W ProcessGroupNCCL.cpp:1151] [PG 2 Rank 1] ProcessGroupNCCL
abort finished.
.
----------------------------------------------------------------------
Ran 1 test in 18.969s
OK

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119045
Approved by: https://github.com/yifuwang
2024-02-03 02:37:12 +00:00
857508fa36 Change the internal assert to torch_check in torch::nn::functional::InterpolateFuncOptions (#117831)
Fixes #117333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117831
Approved by: https://github.com/malfet
2024-02-03 02:15:11 +00:00
9ffed22391 Document file format returned by torch.save (#118719)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118719
Approved by: https://github.com/albanD
2024-02-03 02:11:44 +00:00
2eba82d122 [dynamo] decrease logging level for graph break in higher order op. (#119079)
Fixes https://github.com/pytorch/pytorch/issues/119059.

This hides both logs behind TORCH_LOGS=dynamo. Just logging the exception did not seem very informative, so I put both under log.info(). For the example in the issue, the log now looks like:
```
(pytorch-3.10) ~/local/pytorch$ python test.py
(pytorch-3.10) ~/local/pytorch$
```
```
(pytorch-3.10) ~/local/pytorch$ python test.py
(pytorch-3.10) ~/local/pytorch$ TORCH_LOGS=dynamo python test.py
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing linear /home/yidi/local/pytorch/test.py:267
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO] Stack (most recent call last):
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/test.py", line 272, in <module>
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     y = linear(x)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 615, in catch_errors
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     return callback(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 748, in _convert_frame
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     result = inner_convert(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     return _compile(
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     return func(*args, **kwds)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 650, in _compile
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     guarded_code = compile_inner(code, one_graph, hooks, transform)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     r = func(*args, **kwargs)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in compile_inner
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     out_code = transform_code_object(code, transform)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     transformations(instructions, code_options)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 155, in _fn
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 478, in transform
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     tracer = InstructionTranslator(
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2032, in __init__
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     _step_logger()(
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/logging.py", line 55, in log
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     logger.log(level, "Step %s: %s", step, msg, **kwargs)
[2024-02-02 14:08:19,001] [0/0] torch.fx.experimental.symbolic_shapes: [INFO] create_env
[2024-02-02 14:08:19,016] [0/0] torch._dynamo.variables.higher_order_ops: [INFO] speculate_subgraph: while introspecting autograd.Function, we were unable to trace function `backward` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[2024-02-02 14:08:19,016] [0/0] torch._dynamo.variables.higher_order_ops: [INFO] call_method GetAttrVariable(AutogradFunctionContextVariable(Function), needs_input_grad) __getitem__ (ConstantVariable(int),) {}
[2024-02-02 14:08:19,017] [0/0] torch._dynamo.convert_frame: [INFO] Restarting analysis due to _dynamo/symbolic_convert.py:141 in fail_and_restart_analysis
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing linear /home/yidi/local/pytorch/test.py:267
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO] Stack (most recent call last):
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/test.py", line 272, in <module>
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     y = linear(x)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 615, in catch_errors
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     return callback(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 748, in _convert_frame
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     result = inner_convert(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     return _compile(
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     return func(*args, **kwds)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 650, in _compile
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     guarded_code = compile_inner(code, one_graph, hooks, transform)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     r = func(*args, **kwargs)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in compile_inner
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     out_code = transform_code_object(code, transform)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     transformations(instructions, code_options)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 155, in _fn
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 478, in transform
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     tracer = InstructionTranslator(
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2032, in __init__
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     _step_logger()(
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/logging.py", line 55, in log
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     logger.log(level, "Step %s: %s", step, msg, **kwargs)
[2024-02-02 14:08:19,017] [0/0_1] torch.fx.experimental.symbolic_shapes: [INFO] create_env
[2024-02-02 14:08:19,021] [0/0_1] torch.fx.experimental.symbolic_shapes: [INFO] produce_guards
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing forward /home/yidi/local/pytorch/test.py:257
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO] Stack (most recent call last):
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/test.py", line 272, in <module>
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     y = linear(x)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/test.py", line 268, in linear
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return UseNeedsInputGradFunction.apply(x)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/autograd/function.py", line 572, in apply
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return super().apply(*args, **kwargs)  # type: ignore[misc]
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 615, in catch_errors
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return callback(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 748, in _convert_frame
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     result = inner_convert(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return _compile(
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return func(*args, **kwds)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 650, in _compile
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     guarded_code = compile_inner(code, one_graph, hooks, transform)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     r = func(*args, **kwargs)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in compile_inner
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     out_code = transform_code_object(code, transform)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     transformations(instructions, code_options)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 155, in _fn
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 478, in transform
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     tracer = InstructionTranslator(
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2032, in __init__
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     _step_logger()(
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/logging.py", line 55, in log
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     logger.log(level, "Step %s: %s", step, msg, **kwargs)
[2024-02-02 14:08:19,025] [1/0] torch.fx.experimental.symbolic_shapes: [INFO] create_env
[2024-02-02 14:08:19,097] torch._dynamo.utils: [INFO] TorchDynamo compilation metrics:
[2024-02-02 14:08:19,097] torch._dynamo.utils: [INFO] Function, Runtimes (s)
[2024-02-02 14:08:19,097] torch._dynamo.utils: [INFO] _compile.<locals>.compile_inner, 0.0283
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119079
Approved by: https://github.com/zou3519
2024-02-03 02:10:13 +00:00
d91d21fd6f [submodule kineto] Enable profiler connection to daemon during init for cpu only jobs (#118320)
Fixes #112389 and https://github.com/facebookincubator/dynolog/issues/208

This PR enables profiler initialization for CPU-only use cases. The main goal is to enable on-demand profiling with a daemon when using the CPU-only mode of PyTorch.
* When CUDA is available the profiler is initialized on first CUDA stream creation (or lazily when profiler is run).
* Since the CUDA stream creation callback does not exist in CPU-only PyTorch, the profiler is never initialized on its own.
* Thus the job does not register with Dynolog even when the "KINETO_USE_DAEMON" env variable is set.

Part of the fix is in Kineto https://github.com/pytorch/kineto/pull/861, we point to it in PyTorch.
The change in PyTorch is to correctly set the `cpuOnly` argument.

## TestPlan:

Build PyTorch from source with USE_CUDA=0 so we have a CPU-only build.  Git hash = `a40951defd87b9a5e582cf9112bf7a8bd0930c79`
(See instructions in PyTorch repo)

For the setup we run dynolog daemon in another terminal
```
buck2 run dynolog/src:dynolog  -- --enable_ipc_monitor &
```

Now run an example model in PyTorch - see [linear_model.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py) , and set the device to 'cpu' inside the code instead of 'cuda'.
```
export KINETO_USE_DAEMON=1
python linear_model_example.py
```
Output shows the profiler registration with dynolog
```
(pytorch) [bcoutinho@devgpu038.ftw6 ~/local/pytorch (main)]$ python linear_model_example.py
INFO:2024-01-25 11:08:53 1807792:1807792 init.cpp:122] Registering daemon config loader, cpuOnly =  1
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-01-25 11:08:53 1807792:1807792 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0dc36b8a-e14c-4260-958b-4b2e7d15e986 status = initialized
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
```

We can also collect a trace using
```
[bcoutinho@devgpu038.ftw6 ~/fbsource/fbcode (3bc85f968)]$ buck2 run dynolog/cli:dyno -- gputrace --log-file /tmp/test.json
Kineto config =
ACTIVITIES_LOG_FILE=/tmp/test.json
PROFILE_START_TIME=0
ACTIVITIES_DURATION_MSECS=500
PROFILE_REPORT_INPUT_SHAPES=false
PROFILE_PROFILE_MEMORY=false
PROFILE_WITH_STACK=false
PROFILE_WITH_FLOPS=false
PROFILE_WITH_MODULES=false
response length = 147
response = {"activityProfilersBusy":0,"activityProfilersTriggered":[1807792],"eventProfilersBusy":0,"eventProfilersTriggered":[],"processesMatched":[1807792]}
Matched 1 processes
Trace output files will be written to:
    /tmp/test_1807792.json
```
And trace file contains the trace correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118320
Approved by: https://github.com/aaronenyeshi
2024-02-03 01:40:56 +00:00
494c2ec054 [DCP][BE] Let FsspecWriter and FsspecReader inherit from FileSystemWriter and FileSystemReader (#118887)
No logic is changed. However, this PR dramatically reduces the effort to maintain filesystem-like storage backends. As we are going to enable fsspec, this is a must-do BE item.

Differential Revision: [D53318044](https://our.internmc.facebook.com/intern/diff/D53318044/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118887
Approved by: https://github.com/wz337
2024-02-03 01:14:13 +00:00
6b009aceea Enable scaled_mm on sm89 devices (#118881)
Fixes #118703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118881
Approved by: https://github.com/malfet
2024-02-03 00:44:03 +00:00
440b7d5279 [auto_functionalize] Remove mutated_args_name from args (#119050)
`auto_functionalize` currently takes a custom op, a list of mutated argument names, and inputs to the custom op as kwargs. The list of mutated argument names is computed from the schema, and gets created when we're tracing. However, it seems that having the list of mutated argument names is a little unnecessary since we can always recompute it from the schema during runtime.

This also prevents the case where users might incorrectly modify the inputs to this operator, as we will now just recompute it at runtime. This probably won't matter much in practice because inductor will decompose auto_functionalize.
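
An illustrative sketch of recomputing the mutated-argument names from the op schema at runtime (not the actual auto_functionalize implementation):

```python
import torch

def mutated_arg_names(op) -> list:
    # An argument is "mutated" if its schema alias info marks a write.
    return [
        arg.name
        for arg in op._schema.arguments
        if arg.alias_info is not None and arg.alias_info.is_write
    ]

print(mutated_arg_names(torch.ops.aten.index_put_.default))  # e.g. ['self']
```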

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119050
Approved by: https://github.com/zou3519
2024-02-03 00:27:14 +00:00
3aeaa21eb0 Revert "Remove parent device mesh check (#118620)"
This reverts commit 3f1f057adfcd4cef67fff9605a894cb075c02881.

Reverted https://github.com/pytorch/pytorch/pull/118620 on behalf of https://github.com/atalman due to broke periodic linux-focal-cuda11.8-py3.9-gcc9 ([comment](https://github.com/pytorch/pytorch/pull/118620#issuecomment-1924933878))
2024-02-03 00:22:56 +00:00
de6a906093 Expose aggressive_recomputation as an inductor config (#118943)
Summary:
As title.

We found that aggressive_recomputation shows memory savings (7% on the APS COFFEE model) with a 2% QPS loss.

It also gives a very promising signal in our auto-AC experiments: https://docs.google.com/document/d/1S2qgMg1CwAQ4U1Ffuk2epbEOx06ogZhioX2jKCwL7ZQ/edit

 {F1426175073}

Test Plan:
APS COFFEE from silverlakeli
- Zoom of baseline job: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=927380488801910&tab=overview
- Zoom of job with aggressive_recomputation: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=1126815608217470&tab=overview

APS 1100x shrunk version:
- baseline: https://www.internalfb.com/mast/job/aps-yuzhenhuang-afe049505a
- test: https://www.internalfb.com/mast/job/aps-yuzhenhuang-709e41bf0d
Memory from 42.98% -> 41.04%.

Reviewed By: yf225, yuxihu, silverlakeli, richqyz

Differential Revision: D53248057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118943
Approved by: https://github.com/anijain2305, https://github.com/yanboliang
2024-02-03 00:17:03 +00:00
7bbd9befed Improve example for `torch.mode()` (#115308)
Fixes #89820 and improves the documentation.
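
For context, a small example of what `torch.mode` returns:

```python
import torch

a = torch.tensor([[0, 1, 1, 2],
                  [3, 3, 3, 4]])
values, indices = torch.mode(a, dim=1)
print(values)   # tensor([1, 3]) -- the most frequent value in each row
print(indices)  # the index of an occurrence of that value in each row
```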

Co-authored-by: Sam Gross <colesbury@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115308
Approved by: https://github.com/colesbury
2024-02-03 00:13:26 +00:00
c24ffc3f66 [inductor] make multi-kernel work with cpp-wrapper (#117813)
Make multi-kernel work with cpp-wrapper. Multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work together at first.

Thanks Jason for suggesting a neat way to integrate the two. cpp-wrapper does two-pass codegen right now. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. Multi-kernel Python code is not generated for the second pass since it should not be needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
2024-02-03 00:06:21 +00:00
576383c2eb Add torch check for dtype within bilinear (#118900)
Fixes https://github.com/pytorch/pytorch/issues/117237
Short-term fix: when the dtype does not match, it is reported by the torch check.
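
A hedged repro sketch of the kind of mismatch the new check reports (exact error wording not quoted here):

```python
import torch

m = torch.nn.Bilinear(3, 4, 5)               # float32 weights by default
x1 = torch.randn(2, 3, dtype=torch.float64)  # input dtype does not match the weights
x2 = torch.randn(2, 4)
out = m(x1, x2)  # now fails with a readable error from the torch check instead of a cryptic one
```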

@ezyang a cpp test case is added
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118900
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-02-03 00:02:00 +00:00
a4355d6b9a Revert "Add --filter-rank to torchrun: allow logs filtering by rank (#118562)"
This reverts commit 73229b4f931f8cd1799b0905d61e3d8e85157bcd.

Reverted https://github.com/pytorch/pytorch/pull/118562 on behalf of https://github.com/xmfan due to breaks MAST precheck, flag naming conflict ([comment](https://github.com/pytorch/pytorch/pull/118562#issuecomment-1924916601))
2024-02-02 23:56:21 +00:00
63fd6883fd [c10d] logging utility for cpp-python stacktrace (#118924)
A user may not know which line of code called collectives in a big code base. When debugging, we can print a Python/C++ stacktrace in case the user calls ``ProcessGroup.reduce`` instead of ``torch.distributed.reduce``

```
LOG(INFO) << "ProcessGroupNCCL::_allgather_base stacktrace: "
                       << get_python_cpp_trace();
```

Output (using _allgather_base as an example): one example Python-side frame is ``all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838``
```
ProcessGroupNCCL::_allgather_base stacktrace: #0 torch::unwind::unwind() from ??:0
#1 torch::CapturedTraceback::gather(bool, bool, bool) from ??:0
#2 c10d::get_python_cpp_trace[abi:cxx11]() from :0
#3 c10d::ProcessGroupNCCL::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from ??:0
#4 c10d::ops::(anonymous namespace)::_allgather_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long) from Ops.cpp:0
#5 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
#6 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
#7 c10d::ProcessGroup::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from :0
#8 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0
#9 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#10 cfunction_call from /usr/local/src/conda/python-3.10.12/Objects/methodobject.c:543
#11 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215
#12 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112
#13 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#14 all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838
#15 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#16 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#17 wrapper from /data/users/weif/pytorch/torch/distributed/c10d_logger.py:75
#18 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#19 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#20 _all_gather_flat_param from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1399
#21 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#22 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#23 unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1308
#24 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#25 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#26 _unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:332
#27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#28 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#29 _pre_forward_unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:448
#30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#31 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#32 _pre_forward from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:413
#33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#35 forward from /data/users/weif/pytorch/torch/distributed/fsdp/fully_sharded_data_parallel.py:839
#36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#37 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#38 _call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1520
#39 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#40 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#41 _wrapped_call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1511
#42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#43 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.12/Objects/call.c:431
#44 slot_tp_call from /usr/local/src/conda/python-3.10.12/Objects/typeobject.c:7494
#45 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215
#46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112
#47 inner from /data/users/weif/pytorch/run_fsdp.py:72
#48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#49 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#50 run from /data/users/weif/pytorch/run_fsdp.py:76
#51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#52 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#53 main from /data/users/weif/pytorch/run_fsdp.py:133
#54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#55 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#56 <module> from /data/users/weif/pytorch/run_fsdp.py:137
#57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#58 PyEval_EvalCode from /usr/local/src/conda/python-3.10.12/Python/ceval.c:1134
#59 run_eval_code_obj from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1291
#60 run_mod from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1312
#61 pyrun_file from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1208
#62 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:456
#63 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:90
#64 pymain_run_file_obj from /usr/local/src/conda/python-3.10.12/Modules/main.c:357
#65 Py_BytesMain from /usr/local/src/conda/python-3.10.12/Modules/main.c:1090
#66 __libc_start_call_main from ??:0
#67 <unwind unsupported> from ??:0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118924
Approved by: https://github.com/kwen2501
2024-02-02 23:49:18 +00:00
a3cec6a7fa [ONNX] Eliminate redundant TODOs (#119060)
Remove TODOs created under the titaiwangms/AllenTiTaiWang/titaiwang handles:

1. Resolved TODOs
2. Turned TODOs into NOTEs when they are not actionable
3. Merged duplicated TODOs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119060
Approved by: https://github.com/kit1980, https://github.com/thiagocrepaldi
2024-02-02 23:37:52 +00:00
454e6b380c [export] Prevent specialization on backends (#118683)
Summary: https://github.com/pytorch/pytorch/issues/118289 shows that sometimes we will decompose into backend-specific operators, causing some specializations. We should probably avoid this by disabling these by default?

Test Plan: CI

Differential Revision: D53241300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118683
Approved by: https://github.com/zhxchen17
2024-02-02 23:33:59 +00:00
db2225da37 [export] fix forward test_lift_unlift (#119090)
Test Plan: fixes test

Reviewed By: zhxchen17

Differential Revision: D53367522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119090
Approved by: https://github.com/kit1980
2024-02-02 23:07:36 +00:00
9fe3693bbb [dynamo] bypass graph break due to masking if inference mode (#119056)
Relax the constraints in https://github.com/pytorch/pytorch/issues/114123 when we're in inference mode.

Test Plan:
See added tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119056
Approved by: https://github.com/ezyang, https://github.com/zou3519
2024-02-02 22:53:23 +00:00
4d45c68ca6 [fx] fix for subgraph rewriter (#119052)
the semantics of `try_get_attr` are to default to None if the attribute doesn't exist; but we were throwing an exception in `get_submodule`. Catch that exception and return None.
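A minimal sketch of the described fix, with assumed function and argument names (the actual change lives in the FX subgraph rewriter utilities):

```python
import torch.fx

def try_get_attr(gm: torch.fx.GraphModule, target: str):
    try:
        return gm.get_submodule(target)
    except AttributeError:
        # get_submodule raises AttributeError when the target doesn't exist;
        # try_get_attr's contract is to return None instead of propagating it.
        return None
```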

Differential Revision: [D53358747](https://our.internmc.facebook.com/intern/diff/D53358747/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119052
Approved by: https://github.com/angelayi
2024-02-02 22:47:53 +00:00
c908caf92b [DeviceMesh] Allow 1d slice from 1d mesh (#118895)
Fixes [#118851](https://github.com/pytorch/pytorch/issues/118851)

For example, with `mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("dp",))`, slicing via `dp_mesh = mesh["dp"]` should still work; it simply returns the 1-D mesh itself without recording a parent mesh.
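Illustrative usage based on the description above; it assumes 8 ranks, an initialized process group, and the `torch.distributed.device_mesh` import path:

```python
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("dp",))
dp_mesh = mesh["dp"]  # 1-D slice of a 1-D mesh: returns the mesh itself,
                      # without recording a parent mesh
```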

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118895
Approved by: https://github.com/wanchaol
2024-02-02 22:00:24 +00:00
6379010ebd [dynamo][higher order ops] Remove restore side effects logic (#118420)
The problem was exposed in https://github.com/pytorch/pytorch/pull/118071, where the control flow tests were always recompiling. The issue turned out to be that the same nonlocal variable used in `true_fn` and `false_fn` was getting lifted twice, thus creating two inputs in the main FX graph. Dynamo's tensor guards do not like this because they require all input tensors to be non-aliased.
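A hypothetical repro of this pattern (not the exact test from #118071; the `functorch.experimental.control_flow.cond` import path and all names are illustrative):

```python
import torch
from functorch.experimental.control_flow import cond

def make_fn():
    y = torch.randn(3)  # nonlocal tensor captured by both branches

    def true_fn(x):
        return x + y

    def false_fn(x):
        return x - y

    @torch.compile(backend="eager", fullgraph=True)
    def f(pred, x):
        # Before this PR, the shared closure variable `y` was lifted twice,
        # producing two aliased inputs in the main FX graph.
        return cond(pred, true_fn, false_fn, [x])

    return f

f = make_fn()
out = f(torch.tensor(True), torch.randn(3))
```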

We already have logic, via the side-effects infra, to check whether two different sources (the closure of true_fn and the closure of false_fn) point to the same tensor. But we were restoring side_effects after subtracing the true and false branches, and this is not needed anymore: side_effects tracks both read-only accesses and actual writes to variables. For higher-order ops, any mutation that is not read-only leads to a graph break and safely exits tracing; for read-only side effects, it doesn't matter.

This PR removes the restoring of side_effects, which turns on the logic for checking if two different sources point to the same tensor, and thus lifts the common non local tensor to just once in the main graph.

Related discussion at https://github.com/pytorch/pytorch/issues/113235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118420
Approved by: https://github.com/ydwu4, https://github.com/mlazos, https://github.com/zou3519
ghstack dependencies: #118975
2024-02-02 21:54:22 +00:00
113138aa55 add test cases for GradScaler on CPU (#109994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109994
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-02-02 21:49:07 +00:00
426339e4de Add FakeTensor support to torch._utils._rebuild_tensor (#108186)
Partially fixes https://github.com/pytorch/pytorch/issues/105077

Repro:

```python
import tempfile
import torch
from torch._subclasses import fake_tensor

class TheModelClass(torch.nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.fc1 = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.fc1(x)

with tempfile.NamedTemporaryFile() as state_dict_file:
    # Create state_dict to be loaded later
    model = TheModelClass()
    torch.save(model.state_dict(), state_dict_file.name)

    fake_mode = fake_tensor.FakeTensorMode()
    with fake_mode:
        # This is where the bug is triggered
        state_dict = torch.load(state_dict_file.name)
```

Error:

```bash
Traceback (most recent call last):
  File "issue_gh_torch_105077.py", line 22, in <module>
    state_dict = torch.load(state_dict_file.name)
  File "/opt/pytorch/torch/serialization.py", line 1014, in load
    return _load(opened_zipfile,
  File "/opt/pytorch/torch/serialization.py", line 1422, in _load
    result = unpickler.load()
  File "/opt/pytorch/torch/_utils.py", line 205, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/_utils.py", line 184, in _rebuild_tensor
    return t.set_(storage._untyped_storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1288, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1468, in dispatch
    self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1733, in invalidate_written_to_constants
    _, new_kwargs = normalize_function(
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 297, in normalize_function
    torch_op_schemas = get_signature_for_torch_op(target)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in get_signature_for_torch_op
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in <listcomp>
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 70, in _torchscript_schema_to_signature
    arg_type = _torchscript_type_to_python_type(arg.type)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 64, in _torchscript_type_to_python_type
    return eval(ts_type.annotation_str, _type_eval_globals)
  File "<string>", line 1, in <module>
NameError: name 'Storage' is not defined
```

This PR adds the ability to create fake tensors during `torch.load` by wrapping the `torch.Tensor.set_` call in a `torch.utils._mode_utils.no_dispatch()` context, which skips the fake-mode dispatcher and thus creates a real tensor. It then calls `fake_mode.from_tensor(t)` to produce the final fake tensor.
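A minimal sketch of that approach, as a hypothetical helper rather than the exact `torch/_utils.py` diff; `fake_mode` is assumed to be the active `FakeTensorMode`:

```python
import torch
from torch.utils._mode_utils import no_dispatch

def rebuild_tensor_under_fake_mode(storage, storage_offset, size, stride, fake_mode):
    with no_dispatch():
        # Bypass the fake-mode dispatcher so set_() builds a real tensor.
        t = torch.tensor([], dtype=storage.dtype, device=storage._untyped_storage.device)
        t.set_(storage._untyped_storage, storage_offset, size, stride)
    # Convert the real tensor into a fake one for the active FakeTensorMode.
    return fake_mode.from_tensor(t)
```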

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108186
Approved by: https://github.com/ezyang
2024-02-02 20:35:38 +00:00
3b41793412 Purge redundant module init tests (#119028)
Fixes #118784

This test file is old and redundant; coverage is maintained in `test_modules.py` via the `test_factory_kwargs` set of tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119028
Approved by: https://github.com/zou3519
2024-02-02 20:17:00 +00:00
a69016a741 Add lowering to special.bessel_j1 (#118992)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118992
Approved by: https://github.com/peterbell10
2024-02-02 20:16:08 +00:00
c7ba5f6c6f [AOTI] Fix a cpp kernel missing arg type issue (#119021)
Summary: The current way of fetching the kernel arg types only works for tensors, not symbols.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119021
Approved by: https://github.com/aakhundov, https://github.com/hl475, https://github.com/khabinov
2024-02-02 20:11:58 +00:00
debc3b3254 Download reports only if they're necessary (#119027)
Previously we were downloading all of (eager311, dynamo38, dynamo311).
Now we just download what's necessary. This is useful for
update_failures.py because the dynamo tests finish much faster than the
eager tests and it only needs the result from the dynamo tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119027
Approved by: https://github.com/jamesjwu
ghstack dependencies: #118874, #118882, #118931
2024-02-02 20:11:01 +00:00
a68cf3ef7d update_failures.py: add option to also remove "skipped" tests (#118931)
Previously, you could run update_failures.py (with a commit hash) and it
would add new expected failures and skips for newly failing tests and
remove expected failures for newly passing tests.

This PR teaches update_failures.py to also remove skips for tests that
are now passing without them.

The way we do this is:
- dynamo_test_failures.py doesn't actually skip tests -- it runs the
  test and then suppresses the signal.
- if the test actually passed, then the test gets skipped with a special
  skip message
- we teach update_failures.py to look for the presence of that skip
  message.

Test Plan:
- Used this to generate https://github.com/pytorch/pytorch/pull/118928
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118931
Approved by: https://github.com/yanboliang
ghstack dependencies: #118874, #118882
2024-02-02 20:11:01 +00:00
1de50f8654 [HigherOrderOp] fix stack trace to report user stack (#118826)
Fixes https://github.com/pytorch/pytorch/issues/111020

For the following code:
```python
import torch
import torch._higher_order_ops.wrap

glob = []

def f(x):
    glob.append(x)
    return x.clone()

@torch.compile(backend='eager', fullgraph=True)
def g(x):
    return torch.ops.higher_order.wrap(f, x)

x = torch.randn(3)
g(x)
```

The stacktrace now becomes:
```
[2024-02-01 15:23:34,691] [0/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting wrap, we were unable to trace function `f` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] Traceback (most recent call last):
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 381, in speculate_subgraph
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     output = f.call_function(tx, args, sub_kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 278, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return super().call_function(tx, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 86, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return tx.inline_user_function_return(
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 657, in inline_user_function_return
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2261, in inline_call
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return cls.inline_call_(parent, func, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2370, in inline_call_
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     tracer.run()
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     and self.step()
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     getattr(self, inst.opname)(inst)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return inner_fn(self, inst)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     self.call_function(fn, args, {})
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     self.push(fn.call_function(self, args, kwargs))
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/misc.py", line 583, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return self.obj.call_method(tx, self.name, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 330, in call_method
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return super().call_method(tx, name, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 241, in call_method
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     tx.output.side_effects.mutation(self)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 325, in mutation
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     self.check_allowed_side_effect(var)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 157, in check_allowed_side_effect
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     unimplemented(
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 190, in unimplemented
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     raise Unsupported(msg)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] torch._dynamo.exc.Unsupported: HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)
Traceback (most recent call last):
  File "/home/yidi/local/pytorch/test.py", line 219, in <module>
    g(x)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 615, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
    return _compile(
  File "/home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 650, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
    transformations(instructions, code_options)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 155, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 496, in transform
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2125, in run
    super().run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 1227, in call_function
    p_args, p_kwargs, example_value, body_r, treespec, _ = self.create_wrapped_node(
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 1190, in create_wrapped_node
    ) = speculate_subgraph(
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 453, in speculate_subgraph
    raise ex
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 381, in speculate_subgraph
    output = f.call_function(tx, args, sub_kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 278, in call_function
    return super().call_function(tx, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 86, in call_function
    return tx.inline_user_function_return(
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 657, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2261, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2370, in inline_call_
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/misc.py", line 583, in call_function
    return self.obj.call_method(tx, self.name, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 330, in call_method
    return super().call_method(tx, name, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 241, in call_method
    tx.output.side_effects.mutation(self)
  File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 325, in mutation
    self.check_allowed_side_effect(var)
  File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 157, in check_allowed_side_effect
    unimplemented(
  File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 190, in unimplemented
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)

from user code:
   File "/home/yidi/local/pytorch/test.py", line 216, in g
    return torch.ops.higher_order.wrap(f, x)
  File "/home/yidi/local/pytorch/test.py", line 211, in f
    glob.append(x)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118826
Approved by: https://github.com/yanboliang, https://github.com/zou3519
2024-02-02 20:08:01 +00:00
3c0c387429 Support symbolic min/max on unbacked SymInt (#118953)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118953
Approved by: https://github.com/ColinPeppler, https://github.com/aakhundov
2024-02-02 20:01:46 +00:00
f641c55c9b Make torch._dynamo.mark_static work inside graph (#118962)
I livecoded the entire PR authoring process, you can watch it at https://youtu.be/06HuwNR9-uI

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118962
Approved by: https://github.com/yanboliang
2024-02-02 20:01:27 +00:00
29f99a3365 Update XLA commit pin (#118871)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118871
Approved by: https://github.com/albanD
2024-02-02 19:55:04 +00:00
bd8c91efc0 Remove some now-succeeding tests from dynamo_test_failures.py (#118928)
Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118928
Approved by: https://github.com/aorenste, https://github.com/anijain2305, https://github.com/yanboliang
2024-02-02 19:49:26 +00:00
bf4e171539 [export] support non-persistent buffers (#118969)
Summary:
X-link: https://github.com/pytorch/executorch/pull/1817

Basic support for non-persistent buffers, which are buffers that do not show up in the state dict.

One weird twist is that most of our other systems (FX, aot_export, dynamo) have completely buggy handling of non-persistent buffers. I tried to go on a wild goose chase to fix them all, but it got to be too much. So I introduced some sad rewrite passes in `_export` to make the final state dict correctly align with the original module's state dict.

This exposed some bugs/ambiguous handling of parameters/buffers in existing test code. For example, `TestSaveLoad.test_save_buffer` traced over a module that was not in the root module hierarchy and caused some weird behavior. I think we should error explicitly on use cases like this: https://github.com/pytorch/pytorch/issues/118410. For now I just rewrote the tests or skipped them.

As a side effect, this diff tightened up quite a few sloppy behaviors around state dict handling:
- Tensor attributes were getting promoted to be buffers—bad!
- Tracing through a module not in the children of the root module would add its parameters/buffers to the state dict—bad!

This behavior is unlikely to show up in user code since the model would be totally broken, but did show up in a bunch of tests.

#buildmore

Test Plan:
unit tests
sandcastle

Differential Revision: D53340041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118969
Approved by: https://github.com/guangy10, https://github.com/huydhn, https://github.com/titaiwangms
2024-02-02 19:16:08 +00:00
b5ba80828f [optim] Rectify capturable testing and fix bugs! (#118326)
This PR fixes several bugs, listed in priority:
1. `load_state_dict` with a nontensor step was incorrect for capturable and fused implementations since we don't create the tensors on the right device in `__setstate__`. This has been fixed.
2. The most recently added capturable implementations forgot the check that all tensors should be on CUDA for eager. We've now added those checks.
3. The most recent change in Adamax only adds capturable for foreach but will silently be incorrect for forloop/single-tensor. I've added erroring and modified testing with many, many skips for that. Honestly, this PR has only further cemented my preference that we should implement the single-tensor and multi-tensor capturable variants together in the future. @mlazos
4. The conditional for adding cuda-supported configs for the optimizer infos was incorrect! So we hadn't been testing capturable! This also stands rectified and was the trigger for this PR in the first place.
5. In a similar way, the conditional for `_get_optim_inputs_including_global_cliquey_kwargs` was incorrect sometimes as well. This has also been corrected.

The following is not a bug, but is just something to make life simpler by not needing to handle Nones: `optim_input_funcs` must now mandatorily take in a `device`, which could be a string or a torch.device.

Details for posterity:
4. Running the test_foreach_matches_forloop test and printing the generated configs shows that capturable is now included, which is correct.
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5d50138f)]$ python test/test_optim.py -k test_foreach_matches_forloop_AdamW_cuda
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={}, desc=default
params=None, kwargs={'lr': 0.01}, desc=non-default lr
params=None, kwargs={'weight_decay': 0.1}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'maximize': True}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True}, desc=amsgrad
params=None, kwargs={'capturable': True}, desc=capturable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True}, desc=capturable, amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True}, desc=Tensor lr with capturable and amsgrad
.
----------------------------------------------------------------------
Ran 1 test in 19.229s

OK
```
5. Running the test_optimizer_can_be_printed test (which calls `_get_optim_inputs_including_global_cliquey_kwargs`) and printing what gets run shows that it is also now correct.
```
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={'differentiable': False}, desc=default
params=None, kwargs={'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.1, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': True}, desc=amsgrad & differentiable
.params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable
params=None, kwargs={'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable & foreach
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable & differentiable
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable, amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable, amsgrad & fused
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad & foreach
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=Tensor lr with capturable and amsgrad & differentiable
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=Tensor lr with capturable and amsgrad & fused
.
----------------------------------------------------------------------
Ran 2 tests in 11.112s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118326
Approved by: https://github.com/mlazos
2024-02-02 19:13:00 +00:00
8b00e5aa12 [FSDP2] Added pre/post-backward (#118004)
This PR adds the pre- and post-backward logic:
- **Pre-backward hook:** `FSDPState` and `FSDPParamGroup` define this, and `FSDPState` is responsible for registering since its pre-backward should run even if the `FSDPState` does not manage any parameters (in case it is the root).
- **Post-backward hook:** Only `FSDParamGroup` defines this since the post-backward hook reshards parameters and reduce-scatters gradients (functionality only needed with managed parameters). The `FSDPParamGroup` is responsible for registering this.
- **Post-backward final callback:** `FSDPState` defines this, and each `FSDPParamGroup` defines a `finalize_backward()` to call in the final callback.

### Pre-Backward

The pre-backward hook is registered on the module outputs (those that require gradient), and it should run when the first such output has its gradient computed. The hook may run multiple times per backward, once per module forward. Specifically, there will be one `(pre-backward, post-backward)` interval for each of the module's `forward()` calls. This is in contrast with the existing FSDP semantics, which only define a single `(pre-backward, post-backward)` interval equivalent to the union of this FSDP's `(pre-backward, post-backward)` intervals. This avoids spiking memory from having multiple modules left unresharded and avoids some autograd edge cases.

We implement the pre-backward hook by having a flag that is set upon the first call to disable subsequent calls. This flag could be maintained by FSDP, but for a cleaner design, we augment `register_multi_grad_hook` with a `mode="any"` option and use that instead.
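An illustrative sketch of the fire-once registration with `mode="any"`; the hook body and the `module_outputs` tensors are placeholders, not FSDP2's actual code:

```python
import torch
from torch.autograd.graph import register_multi_grad_hook

def pre_backward(grad):
    # With mode="any", this fires once, when the first registered tensor
    # has its gradient computed; FSDP would unshard parameters here.
    pass

module_outputs = [torch.randn(4, requires_grad=True) ** 2 for _ in range(2)]
tensors = [t for t in module_outputs if t.requires_grad]
handle = register_multi_grad_hook(tensors, pre_backward, mode="any")

sum(t.sum() for t in module_outputs).backward()
handle.remove()
```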

### Post-Backward

The post-backward hook is equivalent to a module full backward hook (`nn.Module.register_full_backward_hook`) except it adds pytree logic to work with data structures other than just flat `Tensor` args passed to `nn.Module.forward`. If we were to use `register_full_backward_hook`, then the hook could fire early (before all gradients for the module have been computed). Most internal models use custom data structures as `forward` inputs, and they find that unifying under pytree is an acceptable solution.

Unlike existing FSDP, we are able to reshard the parameters in the post-backward hook _before_ 'concatenating' the autograd-computed gradients, achieving a lower peak memory usage. (Existing FSDP has `SplitWithSizesBackward` that calls a `CatArrayBatched`, and here we have the reduce-scatter copy-in.)

### Final Callback
The final callback runs as a queued callback to the autograd engine, meaning that it runs at the end of backward.

In the future, if we do not want to wait for the reduce-scatter (or similar for CPU offloading), we can augment the final callback. The code is written such that each reduce-scatter can be waited on separately (via CUDA event).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118004
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #117950, #117955, #117973, #117975
2024-02-02 19:10:11 +00:00
a688b4b397 Update pointwise concat heuristics (#118453)
This PR updates the heuristics for lowering to pointwise cat to trigger when we have either a small number of arbitrary pointwise inputs (8) or up to 128 pointwise inputs when they correspond to simple pointwise kernels or data movement.
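A hypothetical sketch of the updated heuristic, using the thresholds from the paragraph above; the function and helper names are assumed and do not correspond to Inductor's actual code:

```python
def is_simple_pointwise_or_data_movement(inp) -> bool:
    # Placeholder predicate; the real check inspects how the input is produced.
    return getattr(inp, "is_simple", False)

def should_lower_to_pointwise_cat(inputs) -> bool:
    if len(inputs) <= 8:
        # A small number of arbitrary pointwise inputs is always allowed.
        return True
    # Up to 128 inputs are allowed when each is a simple pointwise kernel
    # or plain data movement.
    return len(inputs) <= 128 and all(
        is_simple_pointwise_or_data_movement(inp) for inp in inputs
    )
```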

This originally came from an internal use case which noticed poor codegen: https://fb.workplace.com/groups/1075192433118967/posts/1365770660727808.

In our initial heuristics for lowering to a masked-loads pointwise concat kernel, we were conservative with the number of inputs we would allow, setting a maximum of 4.

However, I've noticed that we can fuse to pointwise_cat codegen much more aggressively while staying performant.

In the following benchmark I compare foreach and pointwise_cat codegen : https://gist.github.com/eellison/2bf83231f2940d9b9b33eb4721d35e15.

Here is the [csv output](https://gist.github.com/eellison/529da68b326e1d832c26c1dcdb42c313). When `gelu` is applied on neither the prologue nor the epilogue, pointwise concat is faster (this is just the data-movement case); applying gelu on the epilogue does not change this result. When you apply gelu on the prologue, then as the number of inputs increases you start getting register spills with pointwise concat and it gets slower.

![image](https://github.com/pytorch/pytorch/assets/11477974/0d6612b8-d60f-4984-99eb-9b518cd4af74)

![image](https://github.com/pytorch/pytorch/assets/11477974/4dda3341-68f9-4d1d-8334-67d7196371fb)

When I benchmarked with relu instead of gelu, only as inputs got up to 256 did the pointwise and foreach even out.

![image](https://github.com/pytorch/pytorch/assets/11477974/985418f8-ddb8-47c1-baea-ccd9de72cd7f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118453
Approved by: https://github.com/jansel, https://github.com/shunting314, https://github.com/mlazos
ghstack dependencies: #118452
2024-02-02 18:31:37 +00:00
3a1ae86a93 Fix internal failure D53291154 (#118907)
Fix internal failure D53291154

From alban: the change is breaking because the alpha argument is now keyword-only (via the `*` marker), while it was previously allowed to be positional for the rsub.Scalar overload.
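A conceptual illustration of that breakage, using a plain Python function rather than the actual rsub.Scalar overload:

```python
def rsub_old(self, other, alpha=1):
    return other - alpha * self        # alpha may be passed positionally

def rsub_new(self, other, *, alpha=1):
    return other - alpha * self        # '*' makes alpha keyword-only

rsub_old(1.0, 2.0, 0.5)                # worked before the change
rsub_new(1.0, 2.0, alpha=0.5)          # still works after the change
# rsub_new(1.0, 2.0, 0.5)              # TypeError: too many positional arguments
```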

```
 _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "torch/_dynamo/eval_frame.py", line 453, in _fn
    return fn(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "torch/_dynamo/eval_frame.py", line 615, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
    return _compile(
  File "python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "torch/_dynamo/convert_frame.py", line 650, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "torch/_dynamo/utils.py", line 248, in time_wrapper
    r = func(*args, **kwargs)
  File "torch/_dynamo/convert_frame.py", line 531, in compile_inner
    out_code = transform_code_object(code, transform)
  File "torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
    transformations(instructions, code_options)
  File "torch/_dynamo/convert_frame.py", line 155, in _fn
    return fn(*args, **kwargs)
  File "torch/_dynamo/convert_frame.py", line 496, in transform
    tracer.run()
  File "torch/_dynamo/symbolic_convert.py", line 2125, in run
    super().run()
  File "torch/_dynamo/symbolic_convert.py", line 787, in run
    and self.step()
  File "torch/_dynamo/symbolic_convert.py", line 750, in step
    getattr(self, inst.opname)(inst)
  File "torch/_dynamo/symbolic_convert.py", line 469, in wrapper
    return inner_fn(self, inst)
  File "torch/_dynamo/symbolic_convert.py", line 1249, in CALL_FUNCTION_KW
    self.call_function(fn, args, kwargs)
  File "torch/_dynamo/symbolic_convert.py", line 651, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "torch/_dynamo/variables/torch.py", line 614, in call_function
    tensor_variable = wrap_fx_proxy(
  File "torch/_dynamo/variables/builder.py", line 1285, in wrap_fx_proxy
    return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
  File "torch/_dynamo/variables/builder.py", line 1370, in wrap_fx_proxy_cls
    example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
  File "torch/_dynamo/utils.py", line 1653, in get_fake_value
    raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
  File "torch/_dynamo/utils.py", line 1599, in get_fake_value
    ret_val = wrap_fake_exception(
  File "torch/_dynamo/utils.py", line 1140, in wrap_fake_exception
    return fn()
  File "torch/_dynamo/utils.py", line 1600, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "torch/_dynamo/utils.py", line 1720, in run_node
    raise RuntimeError(fn_str + str(e)).with_traceback(e.__traceback__) from e
  File "torch/_dynamo/utils.py", line 1699, in run_node
    return node.target(*args, **kwargs)
  File "torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1637, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1975, in dispatch
    return self._dispatch_impl(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 2190, in _dispatch_impl
    r = func(*args, **kwargs)
  File "torch/_ops.py", line 571, in __call__
    return self_._op(*args, **kwargs)
  File "torch/_prims_common/wrappers.py", line 252, in _fn
    result = fn(*args, **kwargs)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118907
Approved by: https://github.com/lezcano
2024-02-02 18:17:34 +00:00
fd000340fd ProcessGroupGloo::allgather_into_tensor_coalesced (#118910)
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues, but I'd rather just fix the coverage.

**I think it's reasonable to think of this as a fix rather than adding new features. This is orthogonal to the potential reduction of gloo usage**.

### This PR

This PR adds `ProcessGroupGloo::allgather_into_tensor_coalesced`.  This is very straightforward - `ProcessGroupGloo` already supports `allgather_coalesced`, to which we can funnel `allgather_into_tensor_coalesced`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118910
Approved by: https://github.com/shuqiangzhang
2024-02-02 17:53:28 +00:00
70605d150b [quant][pt2] Add move_exported_model_to_train (#113492)
Summary: This is the equivalent API to `model.train()` for
exported models, analogous to `move_exported_model_to_eval`.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_dropout
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_dropout_inplace
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_dropout_bn

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113492
Approved by: https://github.com/jerryzh168, https://github.com/tugsbayasgalan
2024-02-02 17:39:47 +00:00
52b679d415 [BE] Cleanup CircleCI README (#118927)
All of the information there is out-of-date as CI/CD has long migrated to the GitHub Actions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118927
Approved by: https://github.com/kit1980
2024-02-02 17:08:20 +00:00
0e5fe4b3ae [AOTI] Fix a RAIIAtenTensorHandle premature deallocation bug (#118963)
Summary: generate_index_put_fallback currently generates something like the following,

```
AtenTensorHandle tensor_handle_array_1[] = {nullptr, nullptr, arg1_1, wrap_with_raii_handle_if_needed(tmp_tensor_handle_0)};
```

The problem is that wrap_with_raii_handle_if_needed creates a RAIIAtenTensorHandle which only lives during this tmp array initialization. After the initialization is done, the RAIIAtenTensorHandle dies and releases the underlying Tensor, so when tensor_handle_array_1 is later passed to aoti_torch_index_put_out, some of its AtenTensorHandle elements are invalid, causing a segfault.

Differential Revision: [D53339348](https://our.internmc.facebook.com/intern/diff/D53339348)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118963
Approved by: https://github.com/aakhundov
2024-02-02 16:49:45 +00:00
53da422582 [export] Move _create_graph_module_for_export to torch/export (#118893)
Summary: I have to keep the torch/_export one to not break executorch...

Test Plan: CI

Reviewed By: avikchaudhuri

Differential Revision: D52842750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118893
Approved by: https://github.com/zhxchen17
2024-02-02 16:40:01 +00:00
b374f8987d [ROCm] Hipify trie re-engineering and adding unit tests (#118433)
Fixes [#117504](https://github.com/pytorch/pytorch/issues/117504)

Re-engineering the Hipify Trie:
(1) Re-engineered the Trie.
(2) Added more documentation and comments for easier understanding.
(3) Created a set of unit tests (class `TestHipifyTrie`) to test the Trie data structure and APIs.

Test:
```
root@xxx:/development/pytorch# pytest test/test_utils.py -k TestHipifyTrie
==================================================================================================== test session starts ====================================================================================================
platform linux -- Python 3.9.18, pytest-7.3.2, pluggy-1.3.0
rootdir: /dockerx/development/pytorch
configfile: pytest.ini
plugins: flakefinder-1.1.0, rerunfailures-13.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, shard-0.1.2, hypothesis-5.35.1
collected 11453 items / 11445 deselected / 8 selected
Running 8 items in this shard

test/test_utils.py ........                                                                                                                                                                                           [100%]

============================================================================================ 8 passed, 11445 deselected in 3.84s ============================================================================================
root@xxx:/development/pytorch#
```
Also diffed the contents modified and generated by this tool between the original code and the new hipify_python.py script, and verified there is no difference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118433
Approved by: https://github.com/malfet, https://github.com/jeffdaily
2024-02-02 16:04:59 +00:00
65efbf078c Optimize dict keys guard when all the keys are constant (#118855)
We also rename ODICT_KEYS and make it use a list rather than a string.

Split from https://github.com/pytorch/pytorch/pull/118630.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118855
Approved by: https://github.com/peterbell10
ghstack dependencies: #117982, #118098, #117983, #117625, #118194, #118003, #118208, #118199, #118535
2024-02-02 14:42:56 +00:00
cdbc29e91a [dynamo,optim] Use the actual sources from the parameters when tracing "params" in an optimizer (#118535)
Fixes the unnecessary guards described at https://github.com/pytorch/pytorch/pull/117983#discussion_r1467622149

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118535
Approved by: https://github.com/mlazos
ghstack dependencies: #117982, #118098, #117983, #117625, #118194, #118003, #118208, #118199
2024-02-02 14:42:56 +00:00
a3770bcf10 Add functools.partial and UserDefinedFunction to dict keys (#118199)
This is tested by `fullgraph=True` in the `test_getattr_dict` test.
I can write a one-off test for both if that's needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118199
Approved by: https://github.com/peterbell10, https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #117982, #118098, #117983, #117625, #118194, #118003, #118208
2024-02-02 14:42:35 +00:00
9d592c14eb Don't assume all subclasses of BaseUserFunctionVariable have a fn attribute (#118208)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118208
Approved by: https://github.com/anijain2305
ghstack dependencies: #117982, #118098, #117983, #117625, #118194, #118003
2024-02-02 14:42:06 +00:00
188628d99e [dynamo,easy] Add Typing variable to possible dict keys (#118003)
With this one, the only keys we are not tracing properly in the
(non-skipped) test suite are `OutDtypeHigherOrderVariable()`, and a
couple `UserDefinedObjectVariables`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118003
Approved by: https://github.com/anijain2305, https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #117982, #118098, #117983, #117625, #118194
2024-02-02 14:40:21 +00:00
ecf7d0e8ac Make dict guards amenable to the CSE pass (#118194)
Supersedes https://github.com/pytorch/pytorch/pull/118096 as a much cleaner and simpler solution.

It is difficult to write a test for this one without exposing too much
of the internals. You can see empirically that it works by running
```
TORCHDYNAMO_PRINT_GUARDS=1 TORCH_LOGS=+guards  python test/test_optim.py -k test_can_load_older_state_dict_ASGD_cpu_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118194
Approved by: https://github.com/jansel, https://github.com/peterbell10
ghstack dependencies: #117982, #118098, #117983, #117625
2024-02-02 14:38:48 +00:00
eb2bdfae88 Make variables in dict LazyTrackers (not lazily guarded yet) and avoid using DICT_KEYS guard (#117625)
Make variables in dict lazy and remove DICT_KEYS guard.

We build the keys of a dict depth-first and we rely on the guards of
each element in the dict to create the correct guards. This allows us to
remove the rather buggy DICT_KEYS guard and make the guard lazy.
The guards are not completely lazy yet, as we instantiate them in
`_HashableTracker._eq_impl` but it should be possible to make them
truly lazy.

Also, adding new types to the supported types within keys should be less
error prone.

This is marginally less efficient when we graph break, but in turn we
should graph break much less. It also makes the dicts code easier to maintain
(removes `is_hashable_python_var`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117625
Approved by: https://github.com/jansel, https://github.com/peterbell10, https://github.com/anijain2305
ghstack dependencies: #117982, #118098, #117983
2024-02-02 14:38:08 +00:00
75a5c41921 [dynamo,optim] Place guards on the args before assuming they exist (#117983)
This enables the new way of writing guards for dicts. Before we were
doing things like
```
  L['self'].param_groups[0][___dict_keys_getitem(L['self'].param_groups[0], 0)][3] is L['self'].param_groups[0]['params'][3]
```
without knowing whether `L['self'].param_groups[0][___dict_keys_getitem(L['self'].param_groups[0], 0)]` was a list.

On a different note, I'll probably write a pass to recover the previous
way to place guards on dicts via something like `DICT_KEYS`  as an
optimisation, as it seems relevant for optimisers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117983
Approved by: https://github.com/mlazos
ghstack dependencies: #117982, #118098
2024-02-02 14:37:46 +00:00
b1da929df9 Use SourcelesBuilder in BuiltinVariable (#118098)
This was failing when fetching a dictionary from a module

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118098
Approved by: https://github.com/peterbell10, https://github.com/anijain2305
ghstack dependencies: #117982
2024-02-02 14:37:23 +00:00
0f3e20a1b6 Print the malformed guard when there's a guard error. (#117982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117982
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-02-02 14:37:05 +00:00
292243d1aa Automatically pull test reports from CI (#118882)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118882
Approved by: https://github.com/jamesjwu, https://github.com/yanboliang
ghstack dependencies: #118874
2024-02-02 14:18:56 +00:00
0f7954107a Add ability to print histogram as a github issue (#118874)
Adds the ability to print the failures histogram into lines that can be
copy-pasted into a github issue.

I used this to generate https://github.com/orgs/pytorch/projects/43

Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118874
Approved by: https://github.com/jamesjwu
2024-02-02 14:18:56 +00:00
520771d7b3 refactor lazy init to device-agnostic (#118846)
# Motivation
This PR extends `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and changes `maybe_initialize_cuda` to `maybe_initialize_device` so lazy initialization still works for CUDA while remaining extensible to other backends.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.
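
A minimal Python sketch of that per-backend flag, assuming illustrative names (not the actual torch internals):

```
# Hedged sketch only; names are illustrative, not the real torch internals.
_initialized = {}   # one lazy-init flag per backend, e.g. "cuda", "xpu"

def _backend_init(device_type: str) -> None:
    # placeholder for the real backend initialization
    print(f"initializing {device_type}")

def maybe_initialize_device(device_type: str) -> None:
    if not _initialized.get(device_type, False):
        _backend_init(device_type)
        _initialized[device_type] = True

maybe_initialize_device("cuda")   # initializes once
maybe_initialize_device("cuda")   # no-op on the second call
```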

# Additional Context
No additional UTs are needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118846
Approved by: https://github.com/malfet
2024-02-02 12:10:39 +00:00
2de327cedc Fixed an illegal memory access in cross entropy loss when using an index that is not a valid class (#117561)

Fixes #117532.

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117561
Approved by: https://github.com/mikaylagawarecki
2024-02-02 11:03:16 +00:00
05ac295177 [export] Fix bug with user input mutations (#118942)
We hit an edge case: when the exported graph contains placeholder nodes whose names conflict with names from aot_export, we don't update the user_inputs_to_mutate in the graph signature correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118942
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17
2024-02-02 09:02:04 +00:00
cc46829f96 [Inductor] GEMM shape padding improvements (#118522)
Improvements to shape padding logic in torch/_inductor/pad_mm.py

These changes could lead up to 14% perf improvement for certain Meta internal models in experiments.

Most notably:
  * 1.) Use the aten.const_pad_nd operation to pad tensors in a single op instead of using multiple steps involving intermediate buffers (see the sketch after this list). This appears to be more performant than the previous logic, confirmed by profiling & benchmarking results (Meta internal).
  * 2.) Make many paddings unnecessary by using an explicitly transposed GEMM when either the M or N dimension is properly aligned but the other is not, configurable via config.shape_pad_use_transpose (default: True).
  * 3.) Enable shape padding for the Inductor CUDA / Cutlass backend for all GEMM ops where Cutlass would be enabled, without benchmarking in that case.
  * Add a config flag to always pad shapes (without benchmarking first), configurable via config.force_shape_pad (default: False).
  * Added several new unit tests to ensure tensors are padded such that they meet all alignment requirements after padding.
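
A hedged sketch of idea (1): padding one dimension up to an alignment boundary in a single `constant_pad_nd` call (the alignment value and shapes are illustrative):

```
import torch

def pad_dim_to_multiple(t: torch.Tensor, dim: int, align: int = 8) -> torch.Tensor:
    """Pad `dim` of `t` with zeros up to the next multiple of `align` in one op."""
    pad = (-t.size(dim)) % align
    if pad == 0:
        return t
    # constant_pad_nd takes (left, right) pairs starting from the last dimension
    pads = [0, 0] * (t.dim() - 1 - dim) + [0, pad]
    return torch.constant_pad_nd(t, pads)

a = torch.randn(127, 253)
print(pad_dim_to_multiple(a, dim=1).shape)   # torch.Size([127, 256])
```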

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118522
Approved by: https://github.com/jansel, https://github.com/eellison
2024-02-02 08:50:06 +00:00
cyy
855d5f144e Relax MKL_INT assumption to int64_t (#118946)
When I built PyTorch on Windows with the latest MKL, it reported:
```
sources\pytorch\aten\src\ATen/cpu/vml.h(106): error C2338: static_assert failed: 'MKL_INT is assumed to be int32_t'
```
It should be safe to relax the restriction to int64_t.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118946
Approved by: https://github.com/ezyang
2024-02-02 07:11:47 +00:00
2964170f3a Revert "[optim] Rectify capturable testing and fix bugs! (#118326)"
This reverts commit d947b9d50011ebd75db2e90d86644a19c4fe6234.

Reverted https://github.com/pytorch/pytorch/pull/118326 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it looks like there are some relevant failures in trunk d947b9d500, may be a land race ([comment](https://github.com/pytorch/pytorch/pull/118326#issuecomment-1923125676))
2024-02-02 07:08:14 +00:00
4a5a2c6571 Update auto_functionalize schema (#118809)
- Moved the dictionary arguments to the node's kwargs as dicts are not
  valid inputs.
- Inlined the mutated arguments into the output. Originally, the output of
  auto_functionalize was the operator output plus a list of mutated arguments
  (e.g. [op_out1, op_out2, [mutated_arg1, mutated_arg2]]). However, this is not
  easily exportable. Now it will just be [op_out1, op_out2, mutated_arg1, mutated_arg2].

Differential Revision: [D53331040](https://our.internmc.facebook.com/intern/diff/D53331040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118809
Approved by: https://github.com/zou3519
2024-02-02 06:21:43 +00:00
89b7ab671e Protect against modules without __file__ (#117445)
The __file__ special variable is optional, so it should be treated as such.
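
A minimal sketch of the defensive pattern, with an illustrative helper name (not the exact dynamo change):

```
import sys
import types

def module_file(mod: types.ModuleType):
    # builtins, namespace packages, and frozen modules may not define __file__
    return getattr(mod, "__file__", None)

print(module_file(sys))     # builtin module: None
print(module_file(types))   # regular module: path to types.py
```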

Fixes #117109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117445
Approved by: https://github.com/oulgen, https://github.com/yanboliang
2024-02-02 06:06:50 +00:00
3d8c36786b Add device for distributed examples (#118867)
## 🐛 Describe the bug

The following example (`all_reduce`) misses the `device` allocation:
a205e7bf56/torch/distributed/distributed_c10d.py (L2080-L2087)

## Solution

A better example would look like this:
a205e7bf56/torch/distributed/distributed_c10d.py (L3212-L3222)
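
For illustration, a hedged sketch of what the improved example conveys, assuming an already-initialized process group and one GPU per rank:

```
import torch
import torch.distributed as dist

# assumes dist.init_process_group(...) has already been called (e.g. under torchrun)
rank = dist.get_rank()
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

tensor = torch.ones(2, device=device)          # allocate on this rank's device, not on CPU
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # sums across all ranks in-place
```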

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118867
Approved by: https://github.com/soulitzer
2024-02-02 05:51:59 +00:00
da5cbb1269 [export] fix for duplicate constant lifting (#118776)
Summary:
Whenever we access a constant, we emit a `get_attr` node for it.

The `lift_constants_pass` was lifting every `get_attr` node unconditionally, even if the same target was already lifted. This diff fixes that.

I also took the liberty of adding some infra to make it easier to unit test passes. GraphBuilder lets you declaratively construct graphs with the right metadata, it's pretty useful for directly inducing the pattern you want to test against.

Test Plan: added unit test

Differential Revision: D53278161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118776
Approved by: https://github.com/angelayi, https://github.com/titaiwangms
2024-02-02 05:51:31 +00:00
32f48e917d [minimizer] Defined traverse (#118889)
Summary:
Add a defined traverse mode for the minimizer.
It takes user-provided start_idx and end_idx, forms a subgraph, and compares results from accelerators vs. CPU.

Differential Revision: D53318292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118889
Approved by: https://github.com/jfix71
2024-02-02 05:50:17 +00:00
3f1f057adf Remove parent device mesh check (#118620)
Removes raising error if a device_mesh has a parent.

The comment says that HSDP + TP is not supported, but I'm able to do 2D parallelism + HSDP fine. The only issues are:
- this check
- https://github.com/pytorch/pytorch/pull/118618
- a series of PRs related to checkpointing with 3D meshes that I will open
We currently monkeypatch around the above, which I am slowly upstreaming.

I imagine torch will have a better, native integration eventually, but this check seems too aggressive in the meantime given DTensor now lets users do some things themselves (which is amazing 🎉)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118620
Approved by: https://github.com/wz337, https://github.com/wanchaol
2024-02-02 05:29:49 +00:00
9cc6422ab6 Revert "[executorch hash update] update the pinned executorch hash (#118936)"
This reverts commit 8cc8cf75f31f7e430ab2918db4a2fb9c7b951024.

Reverted https://github.com/pytorch/pytorch/pull/118936 on behalf of https://github.com/suo due to conflicts with human change ([comment](https://github.com/pytorch/pytorch/pull/118936#issuecomment-1922824471))
2024-02-02 05:05:44 +00:00
8cc8cf75f3 [executorch hash update] update the pinned executorch hash (#118936)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118936
Approved by: https://github.com/pytorchbot
2024-02-02 04:10:53 +00:00
497ea17684 Limit reductions into pointwise cat fusion (#118452)
@Chillee observed a regression when fusing the following:
```
        def f(a, b):
            return torch.cat([torch.softmax(a, dim=-1), torch.softmax(b, dim=-1)])
```

This PR limits pointwise concat/masked fusion in this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118452
Approved by: https://github.com/jansel
2024-02-02 03:34:50 +00:00
babd6c776d [inductor] skip launching kernels with zero grid in AOTInductor when using backed symints (#118654)
Like #110312 but we also run this check when backed symints are in the grid (e.g. s1 / 512)

### Why?

Let's say we lower a model and generate a GPU kernel grid with symbolic shapes, e.g. `s1 / 512`. If at some point later we run the lowered model with inputs such that `s1 = 0`, then we'll launch the kernel with a `0`-sized grid. This surfaces as `CUDA driver error: invalid argument`.

To avoid this, we check for a `0`-sized grid whenever there are symbolic shapes, which includes both backed and unbacked symints.

This adds non-zero overhead to the CPU. However, in return, we get better reliability when encountering this scenario. This scenario happened when serving an internal model.
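
A toy Python sketch of the guard (the grid expression and launcher are illustrative, not the generated AOTInductor code):

```
class _FakeKernel:
    """Stand-in for a compiled Triton kernel, just to make the sketch runnable."""
    def __getitem__(self, grid):
        return lambda *args: print(f"launch grid={grid} args={args}")

def launch_with_symbolic_grid(kernel, s1, *args):
    grid_x = (s1 + 511) // 512          # e.g. ceil(s1 / 512) from the symbolic shape
    if grid_x == 0:                     # a 0-sized grid is an invalid CUDA launch argument
        return                          # skip the launch entirely
    kernel[(grid_x, 1, 1)](*args)

launch_with_symbolic_grid(_FakeKernel(), 0)      # skipped
launch_with_symbolic_grid(_FakeKernel(), 1024)   # launches with grid (2, 1, 1)
```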

### Test

```
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_unbacked_symbols
OK (skipped=3)

$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_backed_symbols

# Before
Error: CUDA driver error: invalid argument
FAILED (errors=2, skipped=3)

# Now
OK (skipped=3)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118654
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-02-02 03:19:52 +00:00
946ea47a4f [inductor] Fix an internal test issue (#118903)
Summary: test_add_complex4, introduced in https://github.com/pytorch/pytorch/pull/117929, fails internally because of a cpp compilation issue for CPU. Specify the right device in the test instead.

Differential Revision: [D53333919](https://our.internmc.facebook.com/intern/diff/D53333919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118903
Approved by: https://github.com/clee2000
2024-02-02 03:18:12 +00:00
8b729fb826 [ez] Fix CI log file piping error (#118807)
Fixes https://github.com/pytorch/pytorch/issues/118764

Example log https://github.com/pytorch/pytorch/actions/runs/7737363970/job/21097159160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118807
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere
2024-02-02 03:07:56 +00:00
d947b9d500 [optim] Rectify capturable testing and fix bugs! (#118326)
This PR fixes several bugs, listed in priority:
1. `load_state_dict` with a nontensor step was incorrect for capturable and fused implementations since we don't create the tensors on the right device in `__setstate__`. This has been fixed.
2. The most recently added capturable implementations forgot the check that all tensors should be on CUDA for eager. We've now added those checks
3. The most recent change in Adamax only adds capturable for foreach but will silently be incorrect for forloop/single-tensor. I've added erroring and modified testing with many many many skips for that. Honestly, this PR has only further cemented my preference that we should just do the single-tensor and multi-tensor capturable implementations together in the future. @mlazos
4. The conditional for adding cuda-supported configs for the optimizer infos was incorrect! So we hadn't been testing capturable! This also stands rectified and was the trigger for this PR in the first place.
5. In a similar way, the conditional for `_get_optim_inputs_including_global_cliquey_kwargs` was incorrect sometimes as well. This has also been corrected.

The following is not a bug, but is just something to make life simpler by not needing to handle Nones: `optim_input_funcs` must now mandatorily take in a `device`, which could be a string or a torch.device.

Details for posterity:
4. Running the test_foreach_matches_forloop test and printing the configs shows that capturable is included, which is correct.
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5d50138f)]$ python test/test_optim.py -k test_foreach_matches_forloop_AdamW_cuda
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={}, desc=default
params=None, kwargs={'lr': 0.01}, desc=non-default lr
params=None, kwargs={'weight_decay': 0.1}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'maximize': True}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True}, desc=amsgrad
params=None, kwargs={'capturable': True}, desc=capturable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True}, desc=capturable, amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True}, desc=Tensor lr with capturable and amsgrad
.
----------------------------------------------------------------------
Ran 1 test in 19.229s

OK
```
5. Running the test_optimizer_can_be_printed test (which calls `_get_optim_inputs_including_global_cliquey_kwargs`) and printing what gets run is also now correct.
```
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={'differentiable': False}, desc=default
params=None, kwargs={'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.1, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': True}, desc=amsgrad & differentiable
.params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable
params=None, kwargs={'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable & foreach
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable & differentiable
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable, amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable, amsgrad & fused
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad & foreach
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=Tensor lr with capturable and amsgrad & differentiable
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=Tensor lr with capturable and amsgrad & fused
.
----------------------------------------------------------------------
Ran 2 tests in 11.112s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118326
Approved by: https://github.com/mlazos
2024-02-02 02:02:58 +00:00
08472a4fd5 [dtensor] add op support for aten.gather.default (#118513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118513
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2024-02-02 01:48:21 +00:00
8ca8729321 [PT-Vulkan][EZ] Adjust string-report width (#118914)
## Before: P1148506541

Some of the shader names are now too long.
```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4322188
vulkan.nchw_to_image     {500, 500, 1}                    4322240
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1189240
vulkan.zero              {1, 1, 1}                           3744
vulkan.convert_channels_to_width_packed{125, 500, 1}                    1265680
```

## After: P1148506671

Now it's just right; `convert_channels_to_height_packed` is the longest shader name in the codebase.
```
Kernel Name                             Workgroup Size             Duration (ns)
===========                             ==============               ===========
vulkan.nchw_to_image                    {500, 500, 1}                    4327232
vulkan.nchw_to_image                    {500, 500, 1}                    4327960
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1190540
vulkan.zero                             {1, 1, 1}                           3744
vulkan.convert_channels_to_width_packed {125, 500, 1}                    1287468
```

Differential Revision: [D53293924](https://our.internmc.facebook.com/intern/diff/D53293924/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118914
Approved by: https://github.com/liuk22
2024-02-02 01:43:48 +00:00
7e1ac59016 [pytorch][vulkan] add 1d tensor support for linear (#118690)
Summary: Vulkan Linear op doesn't support 1d tensors. We can unsqueeze 1d tensors to 2d to unblock the functionality.
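
A tiny Python sketch of the unsqueeze idea (conceptual only; the actual change is in the Vulkan C++ op):

```
import torch
import torch.nn.functional as F

def linear_1d(x: torch.Tensor, weight: torch.Tensor, bias=None) -> torch.Tensor:
    assert x.dim() == 1
    out = F.linear(x.unsqueeze(0), weight, bias)  # (in,) -> (1, in) -> (1, out)
    return out.squeeze(0)                         # back to (out,)

x = torch.randn(8)
w = torch.randn(4, 8)
print(linear_1d(x, w).shape)   # torch.Size([4])
```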

Test Plan:
`LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*linear_*"`
```
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *linear_*
[==========] Running 11 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 11 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.linear_1d_small
[       OK ] VulkanAPITest.linear_1d_small (319 ms)
[ RUN      ] VulkanAPITest.linear_1d_large
[       OK ] VulkanAPITest.linear_1d_large (64 ms)
[ RUN      ] VulkanAPITest.linear_2d_flat
[       OK ] VulkanAPITest.linear_2d_flat (0 ms)
[ RUN      ] VulkanAPITest.linear_2d_small
[       OK ] VulkanAPITest.linear_2d_small (0 ms)
[ RUN      ] VulkanAPITest.linear_2d_large
[       OK ] VulkanAPITest.linear_2d_large (129 ms)
[ RUN      ] VulkanAPITest.linear_3d_flat
[       OK ] VulkanAPITest.linear_3d_flat (0 ms)
[ RUN      ] VulkanAPITest.linear_3d_small
[       OK ] VulkanAPITest.linear_3d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_3d_large
[       OK ] VulkanAPITest.linear_3d_large (51 ms)
[ RUN      ] VulkanAPITest.linear_4d_flat
[       OK ] VulkanAPITest.linear_4d_flat (0 ms)
[ RUN      ] VulkanAPITest.linear_4d_small
[       OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_large
[       OK ] VulkanAPITest.linear_4d_large (6 ms)
[----------] 11 tests from VulkanAPITest (578 ms total)

[----------] Global test environment tear-down
[==========] 11 tests from 1 test suite ran. (578 ms total)
[  PASSED  ] 11 tests.
```

Differential Revision: D53243201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118690
Approved by: https://github.com/jorgep31415, https://github.com/liuk22
2024-02-02 01:35:45 +00:00
796278b57e Revert "[inductor] make multi-kernel work with cpp-wrapper (#117813)"
This reverts commit 20484a193626ef72e0b3f35914f17deb2a89b8fc.

Reverted https://github.com/pytorch/pytorch/pull/117813 on behalf of https://github.com/atalman due to broke linux-focal-rocm5.7-py3.8 tests ([comment](https://github.com/pytorch/pytorch/pull/117813#issuecomment-1922613135))
2024-02-02 01:19:19 +00:00
9153174cd1 [pt-vulkan] Introduce SharedObject class to ComputeGraph (#118756)
## Context

This changeset is part of a stack that enables memory planning (i.e. sharing memory between intermediate tensors) in the PyTorch Vulkan Compute API. Note that Memory Planning can only be used via the ExecuTorch delegate (currently a WIP) and not Lite Interpreter (which does not collect metadata regarding tensor lifetimes).

This changeset builds upon the [previous PR enabling resource aliasing](https://github.com/pytorch/pytorch/pull/118436) and introduces the `SharedObject` class to `ComputeGraph`, which manages resource aliasing in graph execution mode. `SharedObject` tracks which `vTensor` values in a `ComputeGraph` share the same backing memory, and provides functionality to aggregate memory requirements and bind users to the same memory allocation.

## Notes for Reviewers

The `SharedObject` class is introduced in `Graph.h`. It's fairly simple and provides three functions:

* `add_user()` which adds a `ValueRef` to the list of users of the `SharedObject`, and updates the aggregate memory requirements with the memory requirements of the new user
* `allocate_memory()` creates a `VmaAllocation` with the aggregated memory requirements
* `bind_users()` iterates over the `users` of the `SharedObject` and binds each `vTensor`'s underlying resource to the memory associated with the `SharedObject`.

As for how `SharedObject` is used in `ComputeGraph`:

* `add_tensor()` now has an additional argument `shared_object_idx` which, if `>0`, will construct a `vTensor` without any backing memory and add the new `vTensor` to the `SharedObject` at `shared_object_idx`
* `encode_execute()` will first iterate through the `SharedObject`s of the graph and allocate + bind users before recording the command buffer.

Differential Revision: [D53271486](https://our.internmc.facebook.com/intern/diff/D53271486/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118756
Approved by: https://github.com/jorgep31415, https://github.com/yipjustin
2024-02-02 01:19:00 +00:00
a5a63db3bf add Half support for flash attention on CPU (#118368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118368
Approved by: https://github.com/jgong5, https://github.com/Valentine233, https://github.com/drisspg
ghstack dependencies: #118367
2024-02-02 01:08:39 +00:00
838c1c553e Add back recompile test (#118905)
Adds back a test that was skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118905
Approved by: https://github.com/janeyx99
2024-02-02 00:51:01 +00:00
4b59bfe8e5 [CI] Filter should not fail if pr_body is empty (#118934)
Otherwise it will fail with `TypeError: argument of type 'NoneType' is not iterable` (see https://github.com/pytorch/pytorch/actions/runs/7748725174/job/21131915226 for example)

```
% gh api /repos/pytorch/pytorch/issues/118927|
{
  "url": "https://api.github.com/repos/pytorch/pytorch/issues/118927",
  ...
  "body": null,
  ...
  "state_reason": null
}
```
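
A hedged sketch of the kind of guard the title describes (names are illustrative; this is not the actual CI filter script):

```
def has_keyword(pr: dict, keyword: str) -> bool:
    body = pr.get("body") or ""     # GitHub returns null when the PR description is empty
    return keyword in body          # safe; `keyword in None` raises TypeError

print(has_keyword({"body": None}, "ciflow"))            # False instead of crashing
print(has_keyword({"body": "ciflow/trunk"}, "ciflow"))  # True
```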

TODO: Can we add a test for it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118934
Approved by: https://github.com/clee2000, https://github.com/seemethere, https://github.com/huydhn
2024-02-02 00:49:20 +00:00
08d90a1ea9 Workaround for super() calls in test_torchinductor_dynamic_shapes (#118586)
Info about super in dynamic classes:
https://stackoverflow.com/questions/71879642/how-to-pass-function-with-super-when-creating-class-dynamically
https://stackoverflow.com/questions/43782944/super-does-not-work-together-with-type-supertype-obj-obj-must-be-an-i

Calling super(TestCase) actually calls TestCase's parent's functions, bypassing TestCase's own functions.
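
A tiny self-contained illustration of that pitfall (class names are made up):

```
class Base:
    def setUp(self):
        print("Base.setUp")

class TestCase(Base):
    def setUp(self):
        print("TestCase.setUp")

# dynamically created subclass, as the test file does with type(...)
Derived = type("Derived", (TestCase,), {})

d = Derived()
super(TestCase, d).setUp()   # prints "Base.setUp": lookup starts *after* TestCase,
                             # so TestCase's own override is bypassed
```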

Mainly doing this because it's making disable bot spam

Test: checked locally and verified that https://github.com/pytorch/pytorch/issues/117954 actually got skipped

Logs for `inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda`
https://ossci-raw-job-status.s3.amazonaws.com/log/21083466405
AFAIK this PR doesn't actually cause the test to fail; it just surfaces the error, since the mem leak check wasn't running previously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118586
Approved by: https://github.com/huydhn
2024-02-02 00:40:37 +00:00
7c609f01ff [PT-Vulkan] aten::conv1d - support any batch size (#118834)
Completes `aten::conv1d` implementation.

See D53204673 for full context.

Differential Revision: [D53253625](https://our.internmc.facebook.com/intern/diff/D53253625/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118834
Approved by: https://github.com/yipjustin
ghstack dependencies: #118833
2024-02-01 23:53:00 +00:00
dc4779b010 Split out fake_impls from fake_tensor (#118878)
The motivation is that fake_tensor is marked as an uninteresting file for the purposes of backtraces, but operator implementations in fake tensor are interesting and I do want them reported.

How did I decide whether or not to move helper functions? It was kind of random, but if they weren't used in fake tensor generally, I moved them over.

There are no functional code changes, so you only need to review the import changes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118878
Approved by: https://github.com/eellison
2024-02-01 23:50:56 +00:00
844a76ebe8 [MPS][BE] Remove stale TODO (#118902)
And use convenient methods

TODO was added by an accidental copy-n-paste of code from https://github.com/pytorch/pytorch/pull/82315 into  https://github.com/pytorch/pytorch/pull/88532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118902
Approved by: https://github.com/kit1980
2024-02-01 23:43:23 +00:00
a16df1d85f [Dynamo] graph break on isinstance calls if we don't know the type (#118778)
If we can't figure out the python type of a VariableTracker, then the
isinstance call should graph break (instead of raising an error).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118778
Approved by: https://github.com/ydwu4
ghstack dependencies: #118768
2024-02-01 23:18:10 +00:00
39aab55c1c Add myself to CODEOWNERS for serialization-related files (#118892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118892
Approved by: https://github.com/albanD
2024-02-01 23:14:04 +00:00
46ef73505d Clarify how to get extra link flags when building CUDA/C++ extension (#118743)
Make it a bit more explicit how one passes linker arguments to the build, and point to the superclass documentation.
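
For illustration, a hedged `setup.py` sketch showing where extra linker flags can be passed (the flag values and file names are made up):

```
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="my_ext",
    ext_modules=[
        CUDAExtension(
            name="my_ext",
            sources=["my_ext.cpp", "my_ext_kernel.cu"],
            extra_link_args=["-Wl,--no-as-needed", "-lcublas"],  # forwarded to the linker
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```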
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118743
Approved by: https://github.com/ezyang
2024-02-01 22:35:25 +00:00
dbba1d4bf5 Revert "Some minor type stub improvements (#118529)"
This reverts commit c978f38bd4aedeff4ee9ae693349217daea01412.

Reverted https://github.com/pytorch/pytorch/pull/118529 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/118529#issuecomment-1922362331))
2024-02-01 22:18:36 +00:00
d4a94ad041 [ONNX] Fix upsample_bilinear2d decomp skip with output shape (#118823)
The previous output size missed the first two dimensions.
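A hedged sketch of the shape fix being described (the helper name is illustrative):

```
def full_output_size(input_shape, output_hw):
    # upsample_bilinear2d outputs (N, C, H_out, W_out); keep batch and channel dims
    n, c = input_shape[:2]
    return [n, c, *output_hw]

print(full_output_size((2, 3, 16, 16), (32, 32)))   # [2, 3, 32, 32]
```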
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118823
Approved by: https://github.com/titaiwangms
2024-02-01 22:04:35 +00:00
6692f2c91e [no ci] Add myself to MPS codeowners (#118904)
I got pinged on every other PR anyway, so just a means to automate the process

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118904
Approved by: https://github.com/albanD
2024-02-01 21:52:15 +00:00
6929322a28 [PT-Vulkan] aten::conv1d - support any channel-group combo (#118833)
## Main

Part of completing `aten::conv1d`'s implementation. See D53204673 for full context.

This diff relaxes the constraint
```
c_in = c_out = groups
```
to support any legal combination of c_in, c_out, groups.

From the [PyTorch docs](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html), both c_in and c_out must be divisible by groups. Apart from that, any combo is now fair game.

## Additional

Improved GLSL comments and variable names, since more indices yield more headaches.

Differential Revision: [D53248767](https://our.internmc.facebook.com/intern/diff/D53248767/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118833
Approved by: https://github.com/yipjustin
2024-02-01 21:46:01 +00:00
61b572ed56 [inductor] more accurate throughput calculations for kernel benchmarks (#118858)
Our current throughput calculations for kernel benchmarks have some issues,
particularly when we slice inputs in the kernel. In such cases, we count
the original inputs as part of the memory traffic passed across the kernel.
This is incorrect because it may result in a much larger throughput
calculation, which can even exceed the theoretical bandwidth.

Instead, we should only count the size of the "slices" that contribute to
the actual memory traffic.
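
A toy illustration of the point (shapes are arbitrary):

```
import torch

x = torch.randn(1024, 1024)
view = x[:, :128]                                    # the kernel only reads this slice

full_bytes = x.numel() * x.element_size()            # what the old calculation counted
actual_bytes = view.numel() * view.element_size()    # what should be counted

print(full_bytes, actual_bytes)   # 4194304 vs 524288; throughput = actual_bytes / runtime
```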

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118858
Approved by: https://github.com/jansel
2024-02-01 21:42:14 +00:00
20484a1936 [inductor] make multi-kernel work with cpp-wrapper (#117813)
Make multi-kernel work with cpp-wrapper. multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work together at first.

Thanks Jason for suggesting a neat way to integrate these two. cpp-wrapper does two codegen passes right now. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. The multi-kernel Python code is not generated for the second pass since it should not be needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
2024-02-01 21:29:02 +00:00
54668ad6dc Cleanup max cuda device (#118779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118779
Approved by: https://github.com/ezyang
2024-02-01 21:11:28 +00:00
f63dc9a21d s/DIRECLTY/DIRECTLY/ (#118877)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118877
Approved by: https://github.com/albanD
2024-02-01 20:25:58 +00:00
923a7c7572 add test elipsis to dynamo test functions (#118754)
Add tests to ensure the bug reported in #117563 does not regress.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118754
Approved by: https://github.com/anijain2305
2024-02-01 19:05:01 +00:00
318e6ff40e Fix __name__ on a reconstructed NestedUserFunctionVariable (#118768)
```
def f():
    def g():
        return ()

    print(g.__name__)

f()
```

The following script should print `g` (with or without torch.compile),
but prints `f.<locals>.g` with torch.compile.

The problem looks like we use the co_qualname when reconstructing the
NestedUserFunctionVariable. I switched this over to use the co_name.
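
For reference, the distinction between the two code-object attributes (`co_qualname` is available on Python 3.11+):

```
def f():
    def g():
        return ()
    return g

code = f().__code__
print(code.co_name)        # 'g'
print(code.co_qualname)    # 'f.<locals>.g'  (Python 3.11+)
```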

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118768
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-02-01 18:59:01 +00:00
b0e65dd1b4 Fix TCP Store Windows (#118860)
In https://github.com/pytorch/pytorch/pull/107607 a new Validate flow was added; however, on Windows it was not calling addMiscellaneousSocket.
Added missing call to addMiscellaneousSocket on Windows.

Fixes #118737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118860
Approved by: https://github.com/awgu, https://github.com/malfet
2024-02-01 18:46:18 +00:00
df048f4da4 Revert "[RELAND] Remove deprecated fbgemm operators (#112153)"
This reverts commit 19e8ba95e535cd73d3eb37849f383ca8bab58603.

Reverted https://github.com/pytorch/pytorch/pull/112153 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/112153#issuecomment-1921965780))
2024-02-01 18:35:19 +00:00
0f7e63620f CUDA fast path for split_with_sizes_copy.out (#117203)
### Motivation
In per-parameter sharding FSDP, each rank holds one shard of every parameter. Before a bucket of parameters is used, FSDP performs all-gather to reconstruct the full parameters. The following example demonstrates the process for `world_size=2`, `num_params=3` (`A`, `B`, `C` stand for values in params `A`, `B`, `C`):

All-gather output:
```
AAAABBBCCAAAABBBCC
```

After all-gather-copy-out:
```
AAAAAAAA  BBBBBB  CCCC
```

The performance of all-gather-copy-out is crucial for the viability of per-parameter sharding FSDP. After thorough experiments, we believe that acceptable performance for this op is not achievable via composing existing ATen ops today.

We have proven that ideal performance is achievable with a [custom kernel](https://github.com/pytorch/pytorch/pull/115515). This PR aims to incorporate the optimizations to appropriate ATen ops (as suggested by @albanD).

### all-gather-copy-out via Composing ATen Ops

Carrying out the op out via composing ATen ops involves a combination of view ops and copy ops. After thorough experiments, we found that the most natural/performant way to express the op is via `split_with_sizes` + `_foreach_copy_`, which works as follows:

Reshape all-gather output as (world_size, -1):
```
AAAABBBCC
AAAABBBCC
```

`split_with_sizes` + `_foreach_copy_`:
```
AAAA BBB CC
AAAA BBB CC
```
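
In code, the baseline composition described above looks roughly like this (a hedged sketch; the sizes are illustrative and the real inputs live on GPU):

```
import torch

world_size, sizes = 2, [4, 3, 2]                   # per-rank numels of params A, B, C
allgather_out = torch.randn(world_size * sum(sizes))
outs = [torch.empty(world_size * s) for s in sizes]

# view as (world_size, -1), split per parameter, then batch-copy into the outputs
srcs = allgather_out.view(world_size, -1).split_with_sizes(sizes, dim=1)
torch._foreach_copy_([o.view(world_size, -1) for o in outs], list(srcs))
```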

However, the performance of this approach is still far below that of the custom kernel. We've identified the following reasons:
- The approach requires materializing `O(num_params)` intermediate views, which induces a large amount of CPU overhead when `num_params` is high.
- `_foreach_copy_` uses the same block size for all tensors, leading to waste for small tensors and an insufficient thread count for large tensors. This means low effective occupancy.
- `_foreach_copy_` dispatches multiple kernels for typical all-gather-copy-out problem sizes. This further lowers the effective occupancy.
- Due to the nature of the workload, the underlying copies are unaligned. `_foreach_copy_` isn't aggressive enough in exploiting vectorization opportunities in such workloads.

### PR
Introduces a CUDA backend for `split_with_sizes_copy.out` that addresses the above inefficiencies. See code for details.

### Benchmarks
The benchmarks are conducted on a set of representative problems sizes on an A100. CPU overhead and GPU execution time is measured separately, as reasonable CPU overhead doesn't directly affect e2e throughput. The reported copy bandwidth is calculated with GPU execution time.

Compared to the baseline, we observe 3x-10x higher throughput depending on the problem size. We also observe lower CPU overhead across the board.

Baseline:
```
num_params=150   world_size=8     mixed=True    Param size: 0.059 GB    Copy bandwidth: 67.564 GB/s (gpu ms/iter: 0.869, cpu ms/iter 10.460)
num_params=54    world_size=8     mixed=True    Param size: 1.453 GB    Copy bandwidth: 260.373 GB/s (gpu ms/iter: 5.582, cpu ms/iter 0.572)
num_params=54    world_size=8     mixed=True    Param size: 0.512 GB    Copy bandwidth: 239.585 GB/s (gpu ms/iter: 2.135, cpu ms/iter 0.587)
num_params=50    world_size=8     mixed=True    Param size: 0.200 GB    Copy bandwidth: 205.361 GB/s (gpu ms/iter: 0.976, cpu ms/iter 0.534)
num_params=3     world_size=8     mixed=True    Param size: 0.983 GB    Copy bandwidth: 268.397 GB/s (gpu ms/iter: 3.663, cpu ms/iter 0.084)
num_params=9     world_size=8     mixed=True    Param size: 0.802 GB    Copy bandwidth: 265.240 GB/s (gpu ms/iter: 3.024, cpu ms/iter 0.154)
num_params=3     world_size=8     mixed=True    Param size: 1.573 GB    Copy bandwidth: 268.918 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.087)
num_params=9     world_size=8     mixed=True    Param size: 2.248 GB    Copy bandwidth: 268.141 GB/s (gpu ms/iter: 8.384, cpu ms/iter 0.151)
num_params=150   world_size=128   mixed=True    Param size: 0.064 GB    Copy bandwidth: 73.237 GB/s (gpu ms/iter: 0.874, cpu ms/iter 10.664)
num_params=54    world_size=128   mixed=True    Param size: 1.458 GB    Copy bandwidth: 259.902 GB/s (gpu ms/iter: 5.609, cpu ms/iter 0.584)
num_params=54    world_size=128   mixed=True    Param size: 0.515 GB    Copy bandwidth: 238.703 GB/s (gpu ms/iter: 2.158, cpu ms/iter 0.612)
num_params=50    world_size=128   mixed=True    Param size: 0.203 GB    Copy bandwidth: 205.144 GB/s (gpu ms/iter: 0.987, cpu ms/iter 0.559)
num_params=3     world_size=128   mixed=True    Param size: 0.983 GB    Copy bandwidth: 270.467 GB/s (gpu ms/iter: 3.635, cpu ms/iter 0.073)
num_params=9     world_size=128   mixed=True    Param size: 0.802 GB    Copy bandwidth: 267.700 GB/s (gpu ms/iter: 2.997, cpu ms/iter 0.133)
num_params=3     world_size=128   mixed=True    Param size: 1.573 GB    Copy bandwidth: 268.913 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.093)
num_params=9     world_size=128   mixed=True    Param size: 2.248 GB    Copy bandwidth: 266.589 GB/s (gpu ms/iter: 8.433, cpu ms/iter 0.207)
num_params=150   world_size=1024  mixed=True    Param size: 0.202 GB    Copy bandwidth: 135.107 GB/s (gpu ms/iter: 1.495, cpu ms/iter 10.904)
num_params=54    world_size=1024  mixed=True    Param size: 1.524 GB    Copy bandwidth: 258.675 GB/s (gpu ms/iter: 5.890, cpu ms/iter 0.996)
num_params=54    world_size=1024  mixed=True    Param size: 0.575 GB    Copy bandwidth: 238.919 GB/s (gpu ms/iter: 2.408, cpu ms/iter 0.765)
num_params=50    world_size=1024  mixed=True    Param size: 0.246 GB    Copy bandwidth: 209.836 GB/s (gpu ms/iter: 1.172, cpu ms/iter 0.611)
num_params=3     world_size=1024  mixed=True    Param size: 1.007 GB    Copy bandwidth: 270.607 GB/s (gpu ms/iter: 3.720, cpu ms/iter 0.100)
num_params=9     world_size=1024  mixed=True    Param size: 0.818 GB    Copy bandwidth: 266.375 GB/s (gpu ms/iter: 3.071, cpu ms/iter 0.176)
num_params=3     world_size=1024  mixed=True    Param size: 1.611 GB    Copy bandwidth: 270.601 GB/s (gpu ms/iter: 5.952, cpu ms/iter 0.099)
num_params=9     world_size=1024  mixed=True    Param size: 2.248 GB    Copy bandwidth: 268.558 GB/s (gpu ms/iter: 8.371, cpu ms/iter 0.207)
num_params=150   world_size=8     mixed=False   Param size: 0.035 GB    Copy bandwidth: 43.749 GB/s (gpu ms/iter: 0.797, cpu ms/iter 10.531)
num_params=54    world_size=8     mixed=False   Param size: 0.961 GB    Copy bandwidth: 254.084 GB/s (gpu ms/iter: 3.781, cpu ms/iter 0.752)
num_params=54    world_size=8     mixed=False   Param size: 0.282 GB    Copy bandwidth: 216.792 GB/s (gpu ms/iter: 1.299, cpu ms/iter 0.717)
num_params=50    world_size=8     mixed=False   Param size: 0.149 GB    Copy bandwidth: 188.025 GB/s (gpu ms/iter: 0.793, cpu ms/iter 0.633)
num_params=3     world_size=8     mixed=False   Param size: 0.655 GB    Copy bandwidth: 267.793 GB/s (gpu ms/iter: 2.447, cpu ms/iter 0.107)
num_params=9     world_size=8     mixed=False   Param size: 0.634 GB    Copy bandwidth: 264.232 GB/s (gpu ms/iter: 2.401, cpu ms/iter 0.182)
num_params=3     world_size=8     mixed=False   Param size: 1.049 GB    Copy bandwidth: 268.455 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.089)
num_params=9     world_size=8     mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.633 GB/s (gpu ms/iter: 6.394, cpu ms/iter 0.177)
num_params=150   world_size=128   mixed=False   Param size: 0.038 GB    Copy bandwidth: 46.698 GB/s (gpu ms/iter: 0.807, cpu ms/iter 10.488)
num_params=54    world_size=128   mixed=False   Param size: 0.963 GB    Copy bandwidth: 253.450 GB/s (gpu ms/iter: 3.799, cpu ms/iter 0.655)
num_params=54    world_size=128   mixed=False   Param size: 0.283 GB    Copy bandwidth: 216.857 GB/s (gpu ms/iter: 1.307, cpu ms/iter 0.671)
num_params=50    world_size=128   mixed=False   Param size: 0.151 GB    Copy bandwidth: 189.059 GB/s (gpu ms/iter: 0.799, cpu ms/iter 0.572)
num_params=3     world_size=128   mixed=False   Param size: 0.655 GB    Copy bandwidth: 269.849 GB/s (gpu ms/iter: 2.429, cpu ms/iter 0.078)
num_params=9     world_size=128   mixed=False   Param size: 0.634 GB    Copy bandwidth: 264.501 GB/s (gpu ms/iter: 2.399, cpu ms/iter 0.149)
num_params=3     world_size=128   mixed=False   Param size: 1.049 GB    Copy bandwidth: 268.426 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.086)
num_params=9     world_size=128   mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.495 GB/s (gpu ms/iter: 6.398, cpu ms/iter 0.170)
num_params=150   world_size=1024  mixed=False   Param size: 0.122 GB    Copy bandwidth: 101.151 GB/s (gpu ms/iter: 1.211, cpu ms/iter 10.476)
num_params=54    world_size=1024  mixed=False   Param size: 1.000 GB    Copy bandwidth: 252.323 GB/s (gpu ms/iter: 3.963, cpu ms/iter 0.633)
num_params=54    world_size=1024  mixed=False   Param size: 0.318 GB    Copy bandwidth: 218.322 GB/s (gpu ms/iter: 1.455, cpu ms/iter 0.622)
num_params=50    world_size=1024  mixed=False   Param size: 0.185 GB    Copy bandwidth: 196.369 GB/s (gpu ms/iter: 0.944, cpu ms/iter 0.576)
num_params=3     world_size=1024  mixed=False   Param size: 0.671 GB    Copy bandwidth: 269.369 GB/s (gpu ms/iter: 2.491, cpu ms/iter 0.076)
num_params=9     world_size=1024  mixed=False   Param size: 0.645 GB    Copy bandwidth: 264.441 GB/s (gpu ms/iter: 2.439, cpu ms/iter 0.140)
num_params=3     world_size=1024  mixed=False   Param size: 1.074 GB    Copy bandwidth: 269.955 GB/s (gpu ms/iter: 3.978, cpu ms/iter 0.073)
num_params=9     world_size=1024  mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.168 GB/s (gpu ms/iter: 6.405, cpu ms/iter 0.147)
```
New kernel:
```
num_params=150   world_size=8     mixed=True    Param size: 0.059 GB    Copy bandwidth: 560.946 GB/s (gpu ms/iter: 0.105, cpu ms/iter 1.066)
num_params=54    world_size=8     mixed=True    Param size: 1.453 GB    Copy bandwidth: 732.657 GB/s (gpu ms/iter: 1.984, cpu ms/iter 0.417)
num_params=54    world_size=8     mixed=True    Param size: 0.512 GB    Copy bandwidth: 753.514 GB/s (gpu ms/iter: 0.679, cpu ms/iter 0.419)
num_params=50    world_size=8     mixed=True    Param size: 0.200 GB    Copy bandwidth: 719.400 GB/s (gpu ms/iter: 0.279, cpu ms/iter 0.410)
num_params=3     world_size=8     mixed=True    Param size: 0.983 GB    Copy bandwidth: 782.121 GB/s (gpu ms/iter: 1.257, cpu ms/iter 0.098)
num_params=9     world_size=8     mixed=True    Param size: 0.802 GB    Copy bandwidth: 766.458 GB/s (gpu ms/iter: 1.047, cpu ms/iter 0.134)
num_params=3     world_size=8     mixed=True    Param size: 1.573 GB    Copy bandwidth: 790.611 GB/s (gpu ms/iter: 1.989, cpu ms/iter 0.099)
num_params=9     world_size=8     mixed=True    Param size: 2.248 GB    Copy bandwidth: 789.754 GB/s (gpu ms/iter: 2.847, cpu ms/iter 0.138)
num_params=150   world_size=128   mixed=True    Param size: 0.064 GB    Copy bandwidth: 565.667 GB/s (gpu ms/iter: 0.113, cpu ms/iter 0.996)
num_params=54    world_size=128   mixed=True    Param size: 1.458 GB    Copy bandwidth: 670.681 GB/s (gpu ms/iter: 2.174, cpu ms/iter 0.289)
num_params=54    world_size=128   mixed=True    Param size: 0.515 GB    Copy bandwidth: 676.135 GB/s (gpu ms/iter: 0.762, cpu ms/iter 0.264)
num_params=50    world_size=128   mixed=True    Param size: 0.203 GB    Copy bandwidth: 662.603 GB/s (gpu ms/iter: 0.306, cpu ms/iter 0.249)
num_params=3     world_size=128   mixed=True    Param size: 0.983 GB    Copy bandwidth: 769.283 GB/s (gpu ms/iter: 1.278, cpu ms/iter 0.078)
num_params=9     world_size=128   mixed=True    Param size: 0.802 GB    Copy bandwidth: 761.057 GB/s (gpu ms/iter: 1.054, cpu ms/iter 0.104)
num_params=3     world_size=128   mixed=True    Param size: 1.573 GB    Copy bandwidth: 774.325 GB/s (gpu ms/iter: 2.031, cpu ms/iter 0.075)
num_params=9     world_size=128   mixed=True    Param size: 2.248 GB    Copy bandwidth: 773.048 GB/s (gpu ms/iter: 2.908, cpu ms/iter 0.099)
num_params=150   world_size=1024  mixed=True    Param size: 0.202 GB    Copy bandwidth: 641.405 GB/s (gpu ms/iter: 0.315, cpu ms/iter 0.616)
num_params=54    world_size=1024  mixed=True    Param size: 1.524 GB    Copy bandwidth: 646.772 GB/s (gpu ms/iter: 2.356, cpu ms/iter 0.276)
num_params=54    world_size=1024  mixed=True    Param size: 0.575 GB    Copy bandwidth: 658.157 GB/s (gpu ms/iter: 0.874, cpu ms/iter 0.278)
num_params=50    world_size=1024  mixed=True    Param size: 0.246 GB    Copy bandwidth: 642.032 GB/s (gpu ms/iter: 0.383, cpu ms/iter 0.245)
num_params=3     world_size=1024  mixed=True    Param size: 1.007 GB    Copy bandwidth: 728.990 GB/s (gpu ms/iter: 1.381, cpu ms/iter 0.080)
num_params=9     world_size=1024  mixed=True    Param size: 0.818 GB    Copy bandwidth: 689.763 GB/s (gpu ms/iter: 1.186, cpu ms/iter 0.102)
num_params=3     world_size=1024  mixed=True    Param size: 1.611 GB    Copy bandwidth: 765.507 GB/s (gpu ms/iter: 2.104, cpu ms/iter 0.078)
num_params=9     world_size=1024  mixed=True    Param size: 2.248 GB    Copy bandwidth: 757.626 GB/s (gpu ms/iter: 2.967, cpu ms/iter 0.106)
num_params=150   world_size=8     mixed=False   Param size: 0.035 GB    Copy bandwidth: 584.272 GB/s (gpu ms/iter: 0.060, cpu ms/iter 0.656)
num_params=54    world_size=8     mixed=False   Param size: 0.961 GB    Copy bandwidth: 728.234 GB/s (gpu ms/iter: 1.319, cpu ms/iter 0.264)
num_params=54    world_size=8     mixed=False   Param size: 0.282 GB    Copy bandwidth: 730.059 GB/s (gpu ms/iter: 0.386, cpu ms/iter 0.279)
num_params=50    world_size=8     mixed=False   Param size: 0.149 GB    Copy bandwidth: 670.899 GB/s (gpu ms/iter: 0.222, cpu ms/iter 0.274)
num_params=3     world_size=8     mixed=False   Param size: 0.655 GB    Copy bandwidth: 775.699 GB/s (gpu ms/iter: 0.845, cpu ms/iter 0.077)
num_params=9     world_size=8     mixed=False   Param size: 0.634 GB    Copy bandwidth: 773.612 GB/s (gpu ms/iter: 0.820, cpu ms/iter 0.112)
num_params=3     world_size=8     mixed=False   Param size: 1.049 GB    Copy bandwidth: 781.395 GB/s (gpu ms/iter: 1.342, cpu ms/iter 0.081)
num_params=9     world_size=8     mixed=False   Param size: 1.711 GB    Copy bandwidth: 789.156 GB/s (gpu ms/iter: 2.169, cpu ms/iter 0.116)
num_params=150   world_size=128   mixed=False   Param size: 0.038 GB    Copy bandwidth: 517.056 GB/s (gpu ms/iter: 0.073, cpu ms/iter 0.632)
num_params=54    world_size=128   mixed=False   Param size: 0.963 GB    Copy bandwidth: 684.246 GB/s (gpu ms/iter: 1.407, cpu ms/iter 0.294)
num_params=54    world_size=128   mixed=False   Param size: 0.283 GB    Copy bandwidth: 680.593 GB/s (gpu ms/iter: 0.416, cpu ms/iter 0.286)
num_params=50    world_size=128   mixed=False   Param size: 0.151 GB    Copy bandwidth: 682.197 GB/s (gpu ms/iter: 0.221, cpu ms/iter 0.255)
num_params=3     world_size=128   mixed=False   Param size: 0.655 GB    Copy bandwidth: 759.470 GB/s (gpu ms/iter: 0.863, cpu ms/iter 0.074)
num_params=9     world_size=128   mixed=False   Param size: 0.634 GB    Copy bandwidth: 765.694 GB/s (gpu ms/iter: 0.829, cpu ms/iter 0.094)
num_params=3     world_size=128   mixed=False   Param size: 1.049 GB    Copy bandwidth: 766.535 GB/s (gpu ms/iter: 1.368, cpu ms/iter 0.075)
num_params=9     world_size=128   mixed=False   Param size: 1.711 GB    Copy bandwidth: 787.608 GB/s (gpu ms/iter: 2.173, cpu ms/iter 0.105)
num_params=150   world_size=1024  mixed=False   Param size: 0.122 GB    Copy bandwidth: 640.203 GB/s (gpu ms/iter: 0.191, cpu ms/iter 0.668)
num_params=54    world_size=1024  mixed=False   Param size: 1.000 GB    Copy bandwidth: 713.947 GB/s (gpu ms/iter: 1.401, cpu ms/iter 0.274)
num_params=54    world_size=1024  mixed=False   Param size: 0.318 GB    Copy bandwidth: 642.855 GB/s (gpu ms/iter: 0.494, cpu ms/iter 0.276)
num_params=50    world_size=1024  mixed=False   Param size: 0.185 GB    Copy bandwidth: 643.297 GB/s (gpu ms/iter: 0.288, cpu ms/iter 0.262)
num_params=3     world_size=1024  mixed=False   Param size: 0.671 GB    Copy bandwidth: 690.626 GB/s (gpu ms/iter: 0.972, cpu ms/iter 0.078)
num_params=9     world_size=1024  mixed=False   Param size: 0.645 GB    Copy bandwidth: 754.431 GB/s (gpu ms/iter: 0.855, cpu ms/iter 0.109)
num_params=3     world_size=1024  mixed=False   Param size: 1.074 GB    Copy bandwidth: 769.985 GB/s (gpu ms/iter: 1.395, cpu ms/iter 0.080)
num_params=9     world_size=1024  mixed=False   Param size: 1.711 GB    Copy bandwidth: 766.337 GB/s (gpu ms/iter: 2.233, cpu ms/iter 0.103)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117203
Approved by: https://github.com/albanD, https://github.com/awgu
ghstack dependencies: #118512
2024-02-01 18:23:01 +00:00
68f9c28e00 Don't make default arguments dynamic (#118772)
Noticed this while working on
https://github.com/pytorch/pytorch/issues/114590

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118772
Approved by: https://github.com/anijain2305
2024-02-01 18:11:57 +00:00
24dd9f42ce [MPS] Fix use_metal_mm condition (#118830)
One should look not only at the stride sizes but at the dimensions as well, as the strides of `torch.rand(65536, 1)` are `(1, 1)`.

Extend test to account for this situation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118830
Approved by: https://github.com/huydhn
2024-02-01 17:53:42 +00:00
3e79ef6db8 Complete decomposition for aten.round (#118635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118635
Approved by: https://github.com/peterbell10
2024-02-01 17:14:44 +00:00
0010b6145e Reduce register usage of fused adam(w) (#118361)
Part of #117872

| branch | cpu time avg (ms) | cuda time avg (ms) |
|--------|--------------|---------------|
| [main](eebe7e1d37f1baa995c694d540cc2fc98884fa18) | 13.430 | 144.117 |
| pr                                               | 13.371 | 49.655  |

Used the torch profiler to measure the avg perf of 20 iterations.
Model is openlm-research/open_llama_7b_v2 (script is [here](https://gist.github.com/crcrpar/ca951d4e7f3e1c771d502135b798f0d1)).

---

PR
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          ProfilerStep*         0.00%       0.000us         0.00%       0.000us       0.000us        5.789s        46.42%        5.789s     289.456ms           0 b           0 b           0 b           0 b            20
                                          ProfilerStep*        36.02%        3.119s        67.19%        5.819s     290.958ms       0.000us         0.00%        2.586s     129.276ms      48.00 Kb      -1.47 Mb           0 b    -504.23 Gb            20
                                               aten::mm         2.57%     222.681ms         8.80%     762.415ms      56.475us        2.501s        20.05%        2.501s     185.255us           0 b           0 b     441.39 Gb     441.39 Gb         13500
       autograd::engine::evaluate_function: MmBackward0         0.10%       8.600ms         8.17%     707.935ms     157.319us       0.000us         0.00%        1.625s     361.098us           0 b           0 b     198.65 Gb    -135.03 Gb          4500
                                            MmBackward0         0.39%      33.896ms         7.99%     692.035ms     153.786us       0.000us         0.00%        1.601s     355.710us           0 b           0 b     330.84 Gb    -248.00 Mb          4500
                              Optimizer.step#AdamW.step         0.00%       0.000us         0.00%       0.000us       0.000us        1.007s         8.07%        1.007s      50.329ms           0 b           0 b           0 b           0 b            20
                                             AdamW.step         0.01%     837.000us         3.36%     290.610ms      14.530ms       0.000us         0.00%     993.235ms      49.662ms           0 b           0 b           0 b           0 b            20
                              Optimizer.step#AdamW.step         0.22%      18.825ms         3.35%     289.773ms      14.489ms       0.000us         0.00%     993.235ms      49.662ms           0 b           0 b           0 b           0 b            20
                                    aten::_fused_adamw_         0.12%      10.823ms         3.09%     267.428ms      13.371ms     993.095ms         7.96%     993.095ms      49.655ms           0 b           0 b           0 b           0 b            20
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us     993.095ms         7.96%     993.095ms     154.207us           0 b           0 b           0 b           0 b          6440
                                           aten::matmul         0.19%      16.140ms         1.73%     149.869ms      33.304us       0.000us         0.00%     876.000ms     194.667us           0 b           0 b     107.46 Gb           0 b          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nt_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     835.374ms         6.70%     835.374ms     185.639us           0 b           0 b           0 b           0 b          4500
                                           aten::linear         0.27%      23.268ms         1.97%     170.227ms      37.828us       0.000us         0.00%     776.278ms     172.506us           0 b           0 b     107.46 Gb      12.17 Gb          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     707.074ms         5.67%     707.074ms     183.180us           0 b           0 b           0 b           0 b          3860
                                              aten::mul         1.31%     113.614ms         5.14%     445.405ms      22.125us     552.421ms         4.43%     552.780ms      27.459us     256.32 Kb     256.21 Kb     420.38 Gb     419.88 Gb         20131
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     442.209ms         3.55%     442.209ms     138.190us           0 b           0 b           0 b           0 b          3200
      autograd::engine::evaluate_function: MulBackward0         0.25%      21.336ms         5.00%     432.976ms      74.651us       0.000us         0.00%     398.627ms      68.729us           0 b           0 b     -45.71 Gb    -252.76 Gb          5800
                                             aten::add_         0.37%      31.975ms         7.19%     622.433ms      53.658us     391.957ms         3.14%     391.957ms      33.789us           0 b           0 b      -4.35 Gb      -4.35 Gb         11600
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize256...         0.00%       0.000us         0.00%       0.000us       0.000us     345.037ms         2.77%     345.037ms     265.413us           0 b           0 b           0 b           0 b          1300
                                            aten::copy_         0.41%      35.727ms        20.62%        1.786s     146.503us     342.386ms         2.75%     342.386ms      28.092us      48.00 Kb      48.00 Kb     -56.00 Mb     -56.00 Mb         12188
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 8.661s
Self CUDA time total: 12.472s
```

main
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          ProfilerStep*         0.00%       0.000us         0.00%       0.000us       0.000us        7.671s        42.31%        7.671s     383.529ms           0 b           0 b           0 b           0 b            20
                                          ProfilerStep*        28.85%        3.050s        72.83%        7.700s     385.009ms       0.000us         0.00%        4.474s     223.678ms      48.00 Kb      -1.48 Mb           0 b    -504.45 Gb            20
                              Optimizer.step#AdamW.step         0.00%       0.000us         0.00%       0.000us       0.000us        2.896s        15.97%        2.896s     144.787ms           0 b           0 b           0 b           0 b            20
                                             AdamW.step         0.01%     819.000us         2.75%     291.024ms      14.551ms       0.000us         0.00%        2.882s     144.125ms           0 b           0 b           0 b           0 b            20
                              Optimizer.step#AdamW.step         0.17%      18.291ms         2.74%     290.205ms      14.510ms       0.000us         0.00%        2.882s     144.125ms           0 b           0 b           0 b           0 b            20
                                    aten::_fused_adamw_         0.10%      10.893ms         2.54%     268.602ms      13.430ms        2.882s        15.90%        2.882s     144.117ms           0 b           0 b           0 b           0 b            20
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us        2.882s        15.90%        2.882s     447.570us           0 b           0 b           0 b           0 b          6440
                                               aten::mm         2.05%     217.136ms         7.21%     762.211ms      56.460us        2.499s        13.78%        2.499s     185.075us           0 b           0 b     441.37 Gb     441.37 Gb         13500
       autograd::engine::evaluate_function: MmBackward0         0.07%       7.179ms         6.77%     715.673ms     159.038us       0.000us         0.00%        1.624s     360.812us           0 b           0 b     198.65 Gb    -134.64 Gb          4500
                                            MmBackward0         0.32%      34.257ms         6.62%     700.088ms     155.575us       0.000us         0.00%        1.600s     355.460us           0 b           0 b     330.59 Gb    -628.00 Mb          4500
                                           aten::matmul         0.15%      15.892ms         1.32%     139.597ms      31.022us       0.000us         0.00%     874.861ms     194.414us           0 b           0 b     107.46 Gb           0 b          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nt_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     834.631ms         4.60%     834.631ms     185.474us           0 b           0 b           0 b           0 b          4500
                                           aten::linear         0.21%      22.460ms         1.51%     159.620ms      35.471us       0.000us         0.00%     774.772ms     172.172us           0 b           0 b     107.46 Gb      11.88 Gb          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     705.996ms         3.89%     705.996ms     182.901us           0 b           0 b           0 b           0 b          3860
                                              aten::mul         1.06%     112.529ms         4.28%     452.473ms      22.488us     552.242ms         3.05%     552.266ms      27.447us     255.90 Kb     255.88 Kb     413.93 Gb     413.90 Gb         20121
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     441.514ms         2.44%     441.514ms     137.973us           0 b           0 b           0 b           0 b          3200
      autograd::engine::evaluate_function: MulBackward0         0.19%      20.517ms         4.18%     442.189ms      76.239us       0.000us         0.00%     398.552ms      68.716us           0 b           0 b     -45.57 Gb    -251.17 Gb          5800
                                             aten::add_         0.30%      31.703ms         6.01%     635.030ms      54.744us     391.897ms         2.16%     391.897ms      33.784us           0 b           0 b      -5.71 Gb      -5.71 Gb         11600
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize256...         0.00%       0.000us         0.00%       0.000us       0.000us     344.972ms         1.90%     344.972ms     265.363us           0 b           0 b           0 b           0 b          1300
                                            aten::copy_         0.33%      34.415ms        34.75%        3.674s     301.437us     342.661ms         1.89%     342.661ms      28.115us      80.00 Kb      80.00 Kb    -240.00 Mb    -240.00 Mb         12188
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 10.574s
Self CUDA time total: 18.129s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118361
Approved by: https://github.com/janeyx99
2024-02-01 17:04:10 +00:00
b73a2b7795 [ait] inspect get_attr nodes for _decline_if_input_dtype (#118760)
Summary:
previously get_attr nodes were skipped, but for example:

%mul_240 : [num_users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.mul](args = (), kwargs = {input: %_fx_const_folded_attrs_13, other: %add_143})

where %_fx_const_folded_attrs_13 is int64 but %add_143 is float, which causes issues if the node is skipped, e.g. "unsupported dtype='int64' for alignments"

Differential Revision: D53273467

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118760
Approved by: https://github.com/khabinov
2024-02-01 15:56:15 +00:00
ff9ce94489 Create empty host tensor for privateuseone (#118854)
For the H2D copy of local_used_map_ on the privateuseone device, reuse the CUDA logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118854
Approved by: https://github.com/ezyang
2024-02-01 15:32:55 +00:00
d790c1dca6 [CUDA][cuDNN][TF32] Misc TF32 updates (#118781)
Twiddle some thresholds that don't seem to play nice with sm90.

CC @tinglvv @nWEIdia @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118781
Approved by: https://github.com/ezyang
2024-02-01 15:32:50 +00:00
687946eea1 [FSDP2] Added reduce-scatter (#117975)
This PR adds the FSDP reduce-scatter (the copy-in/reduce-scatter collective/view-out).
- We use gradient pre- and post-divide factors like existing FSDP (mainly for fp16 reduction); a minimal sketch follows this list.
- We use a separate CUDA stream for the reduce-scatter to conveniently handle additional kernels surrounding the collective as a separate 'thread of execution' (e.g. pre/post-divide and later the D2H gradient offload).
- ~~The implementation in this PR is more complicated to _try_ to reduce CPU overhead by using `torch.split` instead of a Python for-loop. The challenge comes from the fact that the autograd-computed unsharded gradients do not have padding. We prefer to not do an intermediate padding step and instead directly copying to the big reduce-scatter input.~~ For simplicity, I changed the implementation to include intermediate padding steps, as it can still achieve ~250 GB/s, and it avoids any `O(NP)` tensor materialization for world size `N` and `P` `nn.Parameter`s.
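
A minimal sketch of the pre/post-divide idea from the first bullet (function name and factor choices are illustrative assumptions, not the FSDP2 internals):

```python
import torch
import torch.distributed as dist

def reduce_scatter_grad(rs_input: torch.Tensor, rs_output: torch.Tensor, world_size: int, group=None):
    # Split the overall 1/world_size averaging factor into two smaller divisions so that
    # intermediate fp16 values stay in range during the reduction.
    predivide = float(world_size) ** 0.5
    postdivide = world_size / predivide
    rs_input.div_(predivide)
    dist.reduce_scatter_tensor(rs_output, rs_input, op=dist.ReduceOp.SUM, group=group)
    rs_output.div_(postdivide)
```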

<details>
<summary> Recall: Copy-in/All-Gather/Copy-Out Example </summary>

Suppose we have 2 parameters with shapes `(3, 3)` (denoted with `A`s) and `(2, 2)` (denoted with `B`s) and 2 ranks, where `P` represents padding and `E` represents empty:
```
Given:
(3, 3): AAAAAAAAA
(2, 2): BBBB

Sharded parameters/all-gather inputs:
Rank 0: AAAAAA, BB
Rank 1: AAAPPP, BB

Each rank allocate group's all-gather output:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: AAAAAABBEEEEEEEE
Rank 1: EEEEEEEEAAAPPPBB

Each rank all-gather:
Rank 0: AAAAAABBAAAPPPBB
Rank 1: AAAAAABBAAAPPPBB

Each rank copy-out:
Rank 0: AAAAAAAAAPPP, BBBB
Rank 1: AAAAAAAAAPPP, BBBB
```
</details>

<details>
<summary> Copy-in/Reduce-Scatter/View-Out Example </summary>

Suppose we have 2 gradients with shapes `(3, 3)` (denoted with `a`s when not-yet-reduced and `A`s after reduced) and `(2, 2)` (denoted with `b`s and `B`s similarly) and 2 ranks, where `E` represents empty:
```
Given from autograd:
(3, 3): aaaaaaaaa
(2, 2): bbbb

Unsharded gradients/reduce-scatter inputs (no padding!):
Rank 0: aaaaaaaaa, bbbb
Rank 1: aaaaaaaaa, bbbb

Each rank allocate group's reduce-scatter input:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: aaaaaabbaaaEEEbb
Rank 1: aaaaaabbaaaEEEbb

Each rank reduce-scatter:
Rank 0: AAAAAABBAAAEEEBB
Rank 1: AAAAAABBAAAEEEBB

Each rank view-out:
Rank 0: AAAAAA, BB
Rank 1: AAA, BB
```
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117975
Approved by: https://github.com/weifengpy, https://github.com/yifuwang
ghstack dependencies: #117950, #117955, #117973
2024-02-01 15:21:37 +00:00
9c2b43cc50 [inductor] Handle special values correctly in ir.Scan codegen (#118788)
Special values (`NaN`/`+/-Inf`) are not handled correctly during codegen for `ir.Scan` nodes. This
is a fairly minor bugfix that has not come up so far, since the only two scan
ops with lowerings use "normal" values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118788
Approved by: https://github.com/peterbell10
2024-02-01 14:54:20 +00:00
221747507d Revert "[export] support non-persistent buffers (#118612) (#118722)"
This reverts commit a43c28368c184ba1bf964f4fb99bec300917e2f4.

Reverted https://github.com/pytorch/pytorch/pull/118722 on behalf of https://github.com/atalman due to broke linux-jammy-py3-clang12-executorch ([comment](https://github.com/pytorch/pytorch/pull/118722#issuecomment-1921484565))
2024-02-01 14:39:29 +00:00
4a5a3bcc89 Revert "fused adam(w): Reduce register usage (#117872)"
This reverts commit b8e71cf3022e701604ea1f0c381c0b9ccf8743be.

Reverted https://github.com/pytorch/pytorch/pull/117872 on behalf of https://github.com/janeyx99 due to This was not intended to be merged ([comment](https://github.com/pytorch/pytorch/pull/117872#issuecomment-1921425677))
2024-02-01 14:15:00 +00:00
a1dd367716 Fixed error in bicubic upsampling aa=false for uint8 input (#118389)
Description:
- Fixed error in bicubic upsampling aa=false for uint8 input. This is seen in the test suite:
```diff
- self.assertLess(diff.max(), 15)
+ self.assertLess(diff.max(), 5)
```
While reducing the input range, we do not fully remove the clipping effect, which is why the threshold is 5 and not around 1.

- Renamed methods
- The error is mostly visible for upsampling (smaller -> larger) mode on the boundary values

More details on the bug:
For uint8 input and antialiasing=False we use a separable algorithm (using temp buffers and interpolating dimensions one by one) where interpolation weights and input indices are computed and stored using index ranges: `index_min` and `index_size`; weights outside of `index_size` are zero. For example, an output point can have index_min=10, index_size=4 and 4 non-zero weights, so the output value is computed as
```
out_value = sum([src[i + index_min] * w for i, w in zip(range(4), weights) ])
```
When computing index ranges and weights for output points near the boundaries, we clamp `index_min` between 0 and input_size, so `index_size` becomes smaller than 4. This approach is OK for antialiasing=True but is not correct for antialiasing=False, where the weights end up computed incorrectly:
```
-- output index i= 0
regular float32 approach:
source indices: [-2, -1, 0, 1] -> outbounded values are clamped to boundaries -> [0, 0, 0, 1]
interp weights: [-0.07200000000000006, 0.4600000000000001, 0.72, -0.1080000000000001]

separable uint8 approach:
source indices coming from index ranges (min, size): [0, 1]
incorrect interp weights computed with current implementation : [1.1764705882352944, -0.17647058823529432, 0.0, 0.0]
fixed interp weights in the PR: [1.108, -0.1080000000000001, 0.0, 0.0]
Note: the weight corresponding to source index 0 is 1.108 = -0.07200000000000006 + 0.4600000000000001 + 0.72, and the weight corresponding to source index 1, -0.1080000000000001, is the same as in the f32 approach.
```
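
A standalone toy check of the weight folding described above (pure Python illustration, not the C++ implementation): clamping out-of-bound source indices to the boundary means their float32 weights should be accumulated onto the boundary index.

```python
f32_indices = [-2, -1, 0, 1]
f32_weights = [-0.07200000000000006, 0.4600000000000001, 0.72, -0.1080000000000001]

folded = {}
for i, w in zip(f32_indices, f32_weights):
    # indices below 0 are clamped to the boundary index 0, so their weights fold onto it
    folded[max(i, 0)] = folded.get(max(i, 0), 0.0) + w

print(folded)  # {0: ~1.108, 1: -0.1080000000000001}, matching the fixed weights above
```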

Quick benchmark to ensure perfs no regression:

```
[------------------------------------------------------------------------------------ Resize ------------------------------------------------------------------------------------]
                                                                               |  torch (2.3.0a0+gitfda85a6) PR  |  torch (2.3.0a0+git0d1e705) Nightly  |  Speed-up: PR vs Nightly
1 threads: -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_first bilinear (400, 400) -> (224, 224) aa=False  |        440.996 (+-2.044)        |          470.824 (+-5.927)           |      1.068 (+-0.000)
      3 torch.uint8 channels_first bicubic (400, 400) -> (224, 224) aa=False   |        463.565 (+-1.519)        |          497.231 (+-10.825)          |      1.073 (+-0.000)
      3 torch.uint8 channels_first bilinear (400, 400) -> (700, 700) aa=False  |       1717.000 (+-28.589)       |         1915.570 (+-43.397)          |      1.116 (+-0.000)
      3 torch.uint8 channels_first bicubic (400, 400) -> (700, 700) aa=False   |       1801.954 (+-22.391)       |         1981.501 (+-37.034)          |      1.100 (+-0.000)
      3 torch.uint8 channels_last bilinear (400, 400) -> (224, 224) aa=False   |        199.599 (+-0.851)        |          196.535 (+-3.788)           |      0.985 (+-0.000)
      3 torch.uint8 channels_last bicubic (400, 400) -> (224, 224) aa=False    |        243.126 (+-0.681)        |          240.695 (+-2.306)           |      0.990 (+-0.000)
      3 torch.uint8 channels_last bilinear (400, 400) -> (700, 700) aa=False   |        686.270 (+-2.870)        |          687.769 (+-17.863)          |      1.002 (+-0.000)
      3 torch.uint8 channels_last bicubic (400, 400) -> (700, 700) aa=False    |        899.509 (+-5.377)        |          899.063 (+-9.001)           |      1.000 (+-0.000)

Times are in microseconds (us).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118389
Approved by: https://github.com/NicolasHug
ghstack dependencies: #118388
2024-02-01 14:14:32 +00:00
cyy
8b140da804 Use MKL_INT in MKL wrapper interfaces (#118734)
I encountered the following error when building PyTorch with MKL on Windows:

```
pytorch\aten\src\ATen\native\mkl\LinearAlgebra.cpp(74): error C2664: “void cblas_sgemm_batch(const CBLAS_LAYOUT,const CBLAS_TRANSPOSE *,const CBLAS_TRANSPOSE *,const __int64 *,const __int64 *,const __int64 *,const float *,const float **,const __int64 *,const float **,const __int64 *,const float *,float **,const __int64 *,const __int64,const __int64 *) noexcept”: cannot convert argument 4 from "const int *" to "const __int64 *"
pytorch\aten\src\ATen\native\mkl\LinearAlgebra.cpp(74): note: the pointed-to types are unrelated; the conversion requires a reinterpret_cast, C-style cast or parenthesized function-style cast
C:\Program Files (x86)\Intel\oneAPI\2024.0\include\mkl_cblas.h(550): note: see declaration of 'cblas_sgemm_batch'
```
This was because MKL_INT was defined as int64_t on this build while the wrapper interfaces used int. This PR switches the wrapper interfaces to MKL_INT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118734
Approved by: https://github.com/ezyang
2024-02-01 13:32:28 +00:00
a205e7bf56 [3/4] Intel GPU Runtime Upstreaming for Device (#116850)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and following [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), this third PR covers the changes under `libtorch_python`.

# Design
This PR primarily offers device-related APIs in the Python frontend (a brief usage sketch follows the list), including
- `torch.xpu.is_available`
- `torch.xpu.device_count`
- `torch.xpu.current_device`
- `torch.xpu.set_device`
- `torch.xpu.device`
- `torch.xpu.device_of`
- `torch.xpu.get_device_name`
- `torch.xpu.get_device_capability`
- `torch.xpu.get_device_properties`
- ====================
- `torch.xpu._DeviceGuard`
- `torch.xpu._is_compiled`
- `torch.xpu._get_device`
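
A brief usage sketch of a few of the public APIs above (assuming a build with XPU support compiled in; otherwise `is_available()` returns False):

```python
import torch

if torch.xpu.is_available():
    print(torch.xpu.device_count())
    print(torch.xpu.get_device_name(0))
    torch.xpu.set_device(0)
    print(torch.xpu.current_device())
```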

# Additional Context
We will implement support for lazy initialization in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
2024-02-01 12:31:26 +00:00
eaa45f47f8 [sigmoid] fix for torchbind serialization (#118791)
Summary:
There is an annoying inconsistency in how we pickle custom objs.
`torch.save` will invoke regular pickle, for which we have bound `__setstate__`/`__getstate__` methods on `torch.ScriptObject`: https://fburl.com/code/4howyl4u.

This serializes in a different format than TorchScript does, which uses the TS C++ pickler.

The issue we were facing was using the Python pickler to save, and the C++ pickler to load. If we use the C++ pickler to both save and load (plus some plumbing to get type/object resolution to work correctly), then things should work.

Test Plan:
ran SherlockNoMad's repro
```
buck2 run 'fbcode//mode/dev-nosan' scripts/bahuang:export_torchbind -- --logging DBG
```

Got to a new error, which has to do with how we're initializing the graph, but will leave that for future diffs.

Reviewed By: SherlockNoMad

Differential Revision: D53248454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118791
Approved by: https://github.com/qxy11, https://github.com/SherlockNoMad, https://github.com/khabinov
2024-02-01 10:09:07 +00:00
0dc15ff674 [reland][export] Fix graph signature for primitive outputs (#118818)
Summary: Reland of D53233649/https://github.com/pytorch/pytorch/pull/118655. Previously I didn't realize there was a use-case of a torchbind object as an input to the graph, so I didn't mark `CustomObjArgument` as a valid input, which broke [this test](a43c28368c/test/export/test_torchbind.py (L81)). Somehow the initial CI did not catch it, but hud was sad so that PR was reverted. So now I added `CustomObjArgument` as valid input [here](https://github.com/pytorch/pytorch/pull/118818/files#diff-92420f977c3a02b2deadf6752ce4a9ee601c20612a1a13cc365252eb09410edbR298).

Test Plan: CI

Reviewed By: tarun292

Differential Revision: D53288445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118818
Approved by: https://github.com/ydwu4
2024-02-01 09:59:05 +00:00
b8e71cf302 fused adam(w): Reduce register usage (#117872)
As per title, reducing register usage for better occupancy.

Changes are:
- use 32bit indexing if possible
- convert some arguments of fused adam(w) functor to its template parameters
- give `const` to some arguments

Tables below are before/after of adamw for sm90 with / without amsgrad enabled.

### without amsgrad
| dtype | main | this PR |
|-------|------|---------|
| bf16  | 79   | 64      |
| fp16  | 82   | 64      |
| fp32  | 126  | 64      |
| fp64  | 128  | 109     |

### with amsgrad
| dtype | main | this PR |
|-------|------|---------|
| bf16  | 124  | 74      |
| fp16  | 124  | 74      |
| fp32  | 123  | 76      |
| fp64  | 128  | 121     |

---

`AdamW(..., fused=True)` with llama-2 bf16 on H100 improved to a CUDA avg time of 49.935ms from 126.648ms according to torch profiler.

This PR:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          ProfilerStep*         0.00%       0.000us         0.00%       0.000us       0.000us        5.878s        46.47%        5.878s     293.918ms           0 b           0 b           0 b           0 b            20
                                               aten::mm         2.57%     224.777ms         8.50%     741.993ms      54.962us        2.591s        20.48%        2.591s     191.910us           0 b           0 b     441.39 Gb     441.39 Gb         13500
                                          ProfilerStep*        31.64%        2.763s        67.67%        5.910s     295.485ms       0.000us         0.00%        2.551s     127.547ms      48.00 Kb      -1.44 Mb           0 b    -506.38 Gb            20
       autograd::engine::evaluate_function: MmBackward0         0.13%      11.349ms         7.90%     690.160ms     153.369us       0.000us         0.00%        1.726s     383.544us           0 b           0 b     198.65 Gb    -137.53 Gb          4500
                                            MmBackward0         0.45%      38.959ms         7.68%     670.399ms     148.978us       0.000us         0.00%        1.693s     376.326us           0 b           0 b     332.81 Gb       2.26 Gb          4500
                              Optimizer.step#AdamW.step         0.00%       0.000us         0.00%       0.000us       0.000us        1.012s         8.00%        1.012s      50.617ms           0 b           0 b           0 b           0 b            20
                                             AdamW.step         0.01%     846.000us         3.39%     296.240ms      14.812ms       0.000us         0.00%     998.876ms      49.944ms           0 b           0 b           0 b           0 b            20
                              Optimizer.step#AdamW.step         0.26%      23.113ms         3.38%     295.394ms      14.770ms       0.000us         0.00%     998.876ms      49.944ms           0 b           0 b           0 b           0 b            20
                                    aten::_fused_adamw_         0.13%      11.000ms         3.08%     268.545ms      13.427ms     998.705ms         7.89%     998.705ms      49.935ms           0 b           0 b           0 b           0 b            20
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us     998.705ms         7.89%     998.705ms     155.078us           0 b           0 b           0 b           0 b          6440
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nt_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     872.287ms         6.90%     872.287ms     193.842us           0 b           0 b           0 b           0 b          4500
                                           aten::matmul         0.19%      16.721ms         1.82%     159.130ms      35.362us       0.000us         0.00%     864.840ms     192.187us           0 b           0 b     107.46 Gb           0 b          4500
                                           aten::linear         0.28%      24.641ms         2.09%     182.129ms      40.473us       0.000us         0.00%     765.554ms     170.123us           0 b           0 b     107.46 Gb      12.46 Gb          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     690.729ms         5.46%     690.729ms     178.945us           0 b           0 b           0 b           0 b          3860
                                              aten::mul         1.36%     118.465ms         4.89%     427.071ms      21.225us     549.580ms         4.34%     549.697ms      27.320us     224.03 Kb     223.96 Kb     413.51 Gb     413.36 Gb         20121
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     484.455ms         3.83%     484.455ms     151.392us           0 b           0 b           0 b           0 b          3200
      autograd::engine::evaluate_function: MulBackward0         0.27%      23.176ms         4.63%     404.534ms      69.747us       0.000us         0.00%     406.155ms      70.027us           0 b           0 b     -46.01 Gb    -257.12 Gb          5800
                                             aten::add_         0.39%      34.186ms         7.22%     630.849ms      54.384us     394.402ms         3.12%     394.402ms      34.000us           0 b           0 b      -6.68 Gb      -6.68 Gb         11600
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize256...         0.00%       0.000us         0.00%       0.000us       0.000us     366.653ms         2.90%     366.653ms     282.041us           0 b           0 b           0 b           0 b          1300
                                            aten::copy_         0.41%      35.934ms        20.61%        1.800s     147.691us     341.572ms         2.70%     341.572ms      28.025us      48.00 Kb      48.00 Kb     -40.00 Mb     -40.00 Mb         12188
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 8.733s
Self CUDA time total: 12.651s

AdamW.step <FunctionEventAvg key=AdamW.step self_cpu_time=846.000us cpu_time=14.812ms  self_cuda_time=0.000us cuda_time=49.944ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
Optimizer.step#AdamW.step <FunctionEventAvg key=Optimizer.step#AdamW.step self_cpu_time=23.113ms cpu_time=14.770ms  self_cuda_time=0.000us cuda_time=49.944ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
Optimizer.step#AdamW.step <FunctionEventAvg key=Optimizer.step#AdamW.step self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=1.012s cuda_time=50.617ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>

```

Main
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          ProfilerStep*         0.00%       0.000us         0.00%       0.000us       0.000us        7.354s        42.89%        7.354s     367.698ms           0 b           0 b           0 b           0 b            20
                                          ProfilerStep*        28.22%        2.875s        72.48%        7.384s     369.184ms       0.000us         0.00%        4.067s     203.325ms      48.00 Kb      -1.48 Mb           0 b    -508.04 Gb            20
                                               aten::mm         2.24%     228.499ms         7.13%     726.223ms      53.794us        2.563s        14.95%        2.563s     189.873us           0 b           0 b     441.39 Gb     441.39 Gb         13500
                              Optimizer.step#AdamW.step         0.00%       0.000us         0.00%       0.000us       0.000us        2.546s        14.85%        2.546s     127.304ms           0 b           0 b           0 b           0 b            20
                                             AdamW.step         0.01%     821.000us         2.87%     292.871ms      14.644ms       0.000us         0.00%        2.533s     126.654ms           0 b           0 b           0 b           0 b            20
                              Optimizer.step#AdamW.step         0.22%      22.801ms         2.87%     292.050ms      14.602ms       0.000us         0.00%        2.533s     126.654ms           0 b           0 b           0 b           0 b            20
                                    aten::_fused_adamw_         0.11%      11.332ms         2.61%     265.853ms      13.293ms        2.533s        14.77%        2.533s     126.648ms           0 b           0 b           0 b           0 b            20
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us        2.533s        14.77%        2.533s     393.315us           0 b           0 b           0 b           0 b          6440
       autograd::engine::evaluate_function: MmBackward0         0.13%      13.342ms         6.73%     685.250ms     152.278us       0.000us         0.00%        1.706s     379.209us           0 b           0 b     198.65 Gb    -138.02 Gb          4500
                                            MmBackward0         0.38%      38.974ms         6.52%     664.652ms     147.700us       0.000us         0.00%        1.675s     372.113us           0 b           0 b     333.59 Gb       2.75 Gb          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nt_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     859.515ms         5.01%     859.515ms     191.003us           0 b           0 b           0 b           0 b          4500
                                           aten::matmul         0.16%      16.431ms         1.49%     152.052ms      33.789us       0.000us         0.00%     856.839ms     190.409us           0 b           0 b     107.46 Gb           0 b          4500
                                           aten::linear         0.23%      23.703ms         1.72%     174.862ms      38.858us       0.000us         0.00%     758.995ms     168.666us           0 b           0 b     107.46 Gb      12.21 Gb          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     682.302ms         3.98%     682.302ms     176.762us           0 b           0 b           0 b           0 b          3860
                                              aten::mul         1.16%     117.854ms         4.12%     420.100ms      20.892us     544.045ms         3.17%     544.157ms      27.062us     240.38 Kb     240.34 Kb     419.45 Gb     419.29 Gb         20108
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     479.767ms         2.80%     479.767ms     149.927us           0 b           0 b           0 b           0 b          3200
      autograd::engine::evaluate_function: MulBackward0         0.27%      27.303ms         3.95%     402.627ms      69.418us       0.000us         0.00%     403.020ms      69.486us           0 b           0 b     -45.56 Gb    -257.26 Gb          5800
                                             aten::add_         0.32%      32.543ms         6.08%     619.248ms      53.383us     393.242ms         2.29%     393.242ms      33.900us           0 b           0 b      -6.21 Gb      -6.21 Gb         11600
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize256...         0.00%       0.000us         0.00%       0.000us       0.000us     363.245ms         2.12%     363.245ms     279.419us           0 b           0 b           0 b           0 b          1300
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us     338.460ms         1.97%     338.460ms      29.228us           0 b           0 b           0 b           0 b         11580
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 10.187s
Self CUDA time total: 17.145s

AdamW.step <FunctionEventAvg key=AdamW.step self_cpu_time=821.000us cpu_time=14.644ms  self_cuda_time=0.000us cuda_time=126.654ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
Optimizer.step#AdamW.step <FunctionEventAvg key=Optimizer.step#AdamW.step self_cpu_time=22.801ms cpu_time=14.602ms  self_cuda_time=0.000us cuda_time=126.654ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
Optimizer.step#AdamW.step <FunctionEventAvg key=Optimizer.step#AdamW.step self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=2.546s cuda_time=127.304ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
```

Script I used: https://gist.github.com/crcrpar/ca951d4e7f3e1c771d502135b798f0d1

<!--

## adamw

### This PR

```console
$ cuobjdump ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/fused_adamw_impl.cu.o -xelf all
Extracting ELF file    1: fused_adamw_impl.sm_70.cubin
Extracting ELF file    2: fused_adamw_impl.sm_80.cubin
Extracting ELF file    3: fused_adamw_impl.sm_90.cubin
$ cuobjdump -res-usage fused_adamw_impl.sm_90.cubin | cu++filt

Resource usage:
 Common:
  GLOBAL:3
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<c10::BFloat16, (int)4, (at::native::ADAM_MODE)1, (bool)0>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:64 STACK:8 SHARED:0 LOCAL:0 CONSTANT[0]:3952 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<c10::Half, (int)4, (at::native::ADAM_MODE)1, (bool)0>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:64 STACK:8 SHARED:0 LOCAL:0 CONSTANT[0]:3952 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<float, (int)4, (at::native::ADAM_MODE)1, (bool)0>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:64 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3952 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<double, (int)4, (at::native::ADAM_MODE)1, (bool)0>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:109 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3952 TEXTURE:0 SURFACE:0 SAMPLER:0
```

### Main

```console
$ cuobjdump ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/fused_adamw_impl.cu.o -xelf all
Extracting ELF file    1: fused_adamw_impl.cu.1.sm_70.cubin
Extracting ELF file    2: fused_adamw_impl.cu.2.sm_80.cubin
Extracting ELF file    3: fused_adamw_impl.cu.3.sm_90.cubin
$ cuobjdump -res-usage fused_adamw_impl.cu.3.sm_90.cubin | cu++filt

Resource usage:
 Common:
  GLOBAL:3
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<c10::BFloat16, (int)4>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:79 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3945 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<c10::Half, (int)4>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:82 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3945 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<float, (int)4>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:126 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3945 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<double, (int)4>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:128 STACK:40 SHARED:0 LOCAL:0 CONSTANT[0]:3945 TEXTURE:0 SURFACE:0 SAMPLER:0
```

## adamw & amsgrad
### This PR
```console
root@1a5180b041f7:/opt/pytorch/pytorch# cuobjdump ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/fused_adamw_amsgrad_impl.cu.o -xelf all
Extracting ELF file    1: fused_adamw_amsgrad_impl.sm_70.cubin
Extracting ELF file    2: fused_adamw_amsgrad_impl.sm_80.cubin
Extracting ELF file    3: fused_adamw_amsgrad_impl.sm_90.cubin
root@1a5180b041f7:/opt/pytorch/pytorch# cuobjdump -res-usage fused_adamw_amsgrad_impl.sm_90.cubin

Resource usage:
 Common:
  GLOBAL:3
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<c10::BFloat16, (int)5, (at::native::ADAM_MODE)1, (bool)1>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:74 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3904 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<c10::Half, (int)5, (at::native::ADAM_MODE)1, (bool)1>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:74 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3904 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<float, (int)5, (at::native::ADAM_MODE)1, (bool)1>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:76 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3904 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<double, (int)5, (at::native::ADAM_MODE)1, (bool)1>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:121 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3904 TEXTURE:0 SURFACE:0 SAMPLER:0
```

### Main
```console
root@7c40321796bc:/opt/pytorch/pytorch# cuobjdump ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/fused_adamw_amsgrad_impl.cu.o -xelf all
Extracting ELF file    1: fused_adamw_amsgrad_impl.cu.1.sm_70.cubin
Extracting ELF file    2: fused_adamw_amsgrad_impl.cu.2.sm_80.cubin
Extracting ELF file    3: fused_adamw_amsgrad_impl.cu.3.sm_90.cubin
root@7c40321796bc:/opt/pytorch/pytorch# cuobjdump -res-usave fused_adamw_amsgrad_impl.cu.3.sm_90.cubin
cuobjdump fatal   : Unknown option 'res-usave'
root@7c40321796bc:/opt/pytorch/pytorch# cuobjdump -res-usage fused_adamw_amsgrad_impl.cu.3.sm_90.cubin

Resource usage:
 Common:
  GLOBAL:3
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<c10::BFloat16, (int)5>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:124 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3897 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<c10::Half, (int)5>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:124 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3897 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<float, (int)5>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:123 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3897 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<double, (int)5>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:128 STACK:40 SHARED:0 LOCAL:0 CONSTANT[0]:3897 TEXTURE:0 SURFACE:0 SAMPLER:0
```

-->
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117872
Approved by: https://github.com/janeyx99
2024-02-01 09:34:50 +00:00
eba4bd6b86 Updated test_upsamplingBiMode2d_consistency (#118388)
Description:
- Lowered error thresholds and added an input range for bicubic to expose the inconsistency in the implementation of bicubic aa=false upsampling (smaller -> larger) for uint8 input dtype
- Updated outdated comments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118388
Approved by: https://github.com/NicolasHug
2024-02-01 09:22:23 +00:00
7e0ea0d5df [export] Only deepcopy graph in unlift (#118821)
Summary: We only need to deepcopy the graph, because we're modifying it by unlifting its parameter/buffer inputs; we don't need to deepcopy the graph module's state/contents. Deepcopying the whole graph module causes an error when it contains an ExecuTorch LoweredModule which stores tensors.

Test Plan: Fixes the following diff

Differential Revision: D53290077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118821
Approved by: https://github.com/tugsbayasgalan
2024-02-01 09:00:22 +00:00
4fc4f5eb06 [Dynamo] Support tensor is not tensor (#118840)
Fixes Meta internal use case.
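
A minimal illustration of the now-supported pattern (the internal model is not public, so this is an assumed shape of the use case):

```python
import torch

@torch.compile
def f(x, y):
    if x is not y:        # identity comparison between tensors inside compiled code
        return x + 1
    return x * 2

a = torch.ones(2)
print(f(a, a))                # aliasing inputs take the `else` branch
print(f(a, torch.ones(2)))    # distinct tensors take the `is not` branch
```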

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118840
Approved by: https://github.com/yf225
2024-02-01 07:32:43 +00:00
a1280f0cc6 Add an OpInfo test for split_with_sizes_copy (#118512)
Adding an `OpInfo` test for `split_with_sizes_copy` so we can use it to test [CUDA fast path for split_with_sizes_copy.out](https://github.com/pytorch/pytorch/pull/117203). Since the `OpInfo` test doesn't exist yet and introducing it requires modifications to the `CompositeExplicitAutograd` impl, we are adding the `OpInfo` test in a separate PR to establish a healthy baseline.
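
For reference, a small standalone usage sketch of the op under test (not part of the PR itself):

```python
import torch

x = torch.arange(10.)
outs = torch.split_with_sizes_copy(x, [3, 3, 4], dim=0)
print([o.tolist() for o in outs])  # [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
```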

Changes made:
- Registered a batching rule for `split_with_sizes_copy`.
- Registered a decomposition for `split_with_sizes_copy`.
- Registered a DTensor prop rule for `split_with_sizes_copy`.
- Added required dtype and device checks to the composite impl.
- Added output resize to the composite impl.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118512
Approved by: https://github.com/albanD
2024-02-01 07:09:27 +00:00
2b48891e62 [AOTInductor] Add Runtime Constant-folding for AOTInductor (#118765)
Summary:
Add Runtime Constant-folding for AOTInductor.
This also includes the invocation of constant folding at load time.

The constant folding lowering is a 2-step process.
First, we split the graph into 2 modules; one of them is the constant module, which doesn't depend on any input, so the whole module can be inferred (constant-folded) once and reused. The constant module is lowered and codegen-ed as usual and cached (let's call this the constant code). The constant code reuses the whole lowering/profiling/etc. process; the only difference is that we do not generate any headers or initialization for the constant code.
Second, after handling the constant module, we take care of the main module (the part that depends on the user input). For the main module, compared with a normal lowering, we take in one additional component: the constant code. The additional step here is that we inject the constant code into the codegen-ed main module and create the caller for the main module to consume the result of the constant module.
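
A conceptual toy sketch of the split (illustrative only, not the AOTInductor pass): the constant module is evaluated once, e.g. at load time, and the main module consumes its result on every call.

```python
import torch

class ConstantModule(torch.nn.Module):        # input-independent subgraph
    def __init__(self, w: torch.Tensor):
        super().__init__()
        self.w = torch.nn.Parameter(w)
    def forward(self):
        return self.w @ self.w.t() + 1.0

class MainModule(torch.nn.Module):            # depends on user input + folded constant
    def forward(self, x, folded_const):
        return x @ folded_const

const_mod, main_mod = ConstantModule(torch.randn(4, 4)), MainModule()
folded = const_mod()                          # constant folding, done once
for x in (torch.randn(2, 4), torch.randn(3, 4)):
    print(main_mod(x, folded).shape)          # the folded result is reused across calls
```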

Test Plan: Unit tests included in commit.

Differential Revision: D53274382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765
Approved by: https://github.com/chenyang78
2024-02-01 04:54:25 +00:00
b97ab47619 [pytorch][ao] Update PerChannelMinMaxObserver default _load_from_state_dict (#118659)
Summary:
When `version` is missing in the metadata, use `min_val/max_val` as keys instead of `max_vals/min_vals`

## Reasons
1. It's been almost 2 years since this change D30003700, which means now most checkpoints are using the `max_val/min_val` keys

2. Most checkpoint dumps produced via `model.state_dict()` don't have version info, which leads to a spurious `missing keys` error when loading the state_dict
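
A small sketch of the situation in reason 2 (a standalone illustration; the exact set of extra keys is an assumption): current checkpoints store `min_val`/`max_val`, so the fallback should expect those keys when no version is recorded.

```python
import torch
from torch.ao.quantization.observer import PerChannelMinMaxObserver

obs = PerChannelMinMaxObserver()
obs(torch.randn(4, 8))                 # observe a batch so min/max get populated
print(sorted(obs.state_dict().keys())) # includes 'min_val' and 'max_val' (plus e.g. 'eps')
```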

Test Plan: CI

Differential Revision: D53233012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118659
Approved by: https://github.com/jerryzh168
2024-02-01 04:39:31 +00:00
526701cfb7 [executorch hash update] update the pinned executorch hash (#118698)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118698
Approved by: https://github.com/pytorchbot
2024-02-01 03:39:50 +00:00
45d2dff844 [easy] Enable test_neg_view for 5D SampleInput for torch.nn.functional.linear (#118815)
Fixes #117854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118815
Approved by: https://github.com/malfet
2024-02-01 03:26:45 +00:00
adff335095 [vision hash update] update the pinned vision hash (#118825)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118825
Approved by: https://github.com/pytorchbot
2024-02-01 03:14:16 +00:00
9b28621369 [FSDP2] Added forward unshard/wait for unshard/reshard (#117973)
This PR adds the all-gather and free logic required for forward.
- We define the logical all-gather as two ops: (1) unshard and (2) wait for unshard. This abstraction allows capturing both implicit forward prefetching (using multiple streams and `async_op=False`) and explicit forward prefetching (using `async_op=True`).
- Symmetrically, we define the reshard op to free the unsharded parameters.

Some other notes:
- The `FSDPParamGroup` and its `FSDPParam`s transition their sharded states together. This invariant allows us to reason about the parameters by group rather than individually with respect to whether they are sharded or unsharded.

---

### How Does the Overlap Work for All-Gather?

For context, the all-gather consists of three steps: (1) copy-in, (2) all-gather collective, and (3) copy-out.

<details>
<summary> Example </summary>

Suppose we have 2 parameters with shapes `(3, 3)` (denoted with `A`s) and `(2, 2)` (denoted with `B`s) and 2 ranks, where `P` represents padding and `E` represents empty:
```
Given:
(3, 3): AAAAAAAAA
(2, 2): BBBB

Sharded parameters/all-gather inputs:
Rank 0: AAAAAA, BB
Rank 1: AAAPPP, BB

Each rank allocate group's all-gather output:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: AAAAAABBEEEEEEEE
Rank 1: EEEEEEEEAAAPPPBB

Each rank all-gather:
Rank 0: AAAAAABBAAAPPPBB
Rank 1: AAAAAABBAAAPPPBB

Each rank copy-out:
Rank 0: AAAAAAAAAPPP, BBBB
Rank 1: AAAAAAAAAPPP, BBBB
```
</details>

`dist.all_gather_into_tensor()` always has the PG's NCCL stream wait for the current stream before running the collective. `async_op=False` means that the function waits on the work, having the current stream wait for the NCCL stream before returning. `async_op=True` means it returns the `Work` object, which the user can wait on later.

#### Implicit Prefetching
Implicit prefetching achieves communication/computation overlap without changing the CPU issue order:
- We use separate streams for copy-in and for issuing the `dist.all_gather_into_tensor()`. The copy-in stream allows us to overlap the copy-in with all-gather/reduce-scatter in backward, and the all-gather stream allows us to overlap the all-gather with forward compute (issued before it).
     - Because `dist.all_gather_into_tensor()` always has the PG's NCCL stream wait for the current stream, we need this "dummy" all-gather stream to prevent the all-gather from waiting on the forward compute with which it should overlap.
     - Without the separate copy-in stream, we cannot overlap all-gather copy-in with all-gather in forward.
- We copy-out in the default stream after having the default stream wait for the all-gather. This means that the autograd leaves are allocated in the default stream and autograd will not call `recordStream`.

Implicit prefetching does not require knowing the execution order ahead of time. However, when overlapping the next all-gather with the current compute, there may be a gap from the CPU thread issuing the current compute. If the CPU thread can run ahead, then this is not an issue.
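
A rough stream-level sketch of the implicit path (stream and function names are illustrative assumptions, not the FSDP2 internals):

```python
import torch
import torch.distributed as dist

copy_in_stream = torch.cuda.Stream()
all_gather_stream = torch.cuda.Stream()

def unshard(ag_input: torch.Tensor, ag_output: torch.Tensor, group) -> torch.cuda.Stream:
    with torch.cuda.stream(copy_in_stream):
        staged = ag_input.clone()                      # stand-in for the real copy-in
    all_gather_stream.wait_stream(copy_in_stream)
    with torch.cuda.stream(all_gather_stream):
        # The NCCL stream waits on all_gather_stream here, not on the default compute
        # stream, so previously issued forward compute can overlap with the collective.
        dist.all_gather_into_tensor(ag_output, staged, group=group)
    return all_gather_stream

def wait_for_unshard(stream: torch.cuda.Stream) -> None:
    torch.cuda.current_stream().wait_stream(stream)    # then copy-out in the default stream
```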

#### Explicit Prefetching
Explicit prefetching achieves communication/computation by changing the CPU issue order, namely by reordering the all-gather to be before the compute with which it should overlap.
- Because we reorder, we do not need any separate streams, and we can use `async_op=True` for overlap.
- We can expose this explicit prefetching as a module-level `unshard()` op (e.g. `module.unshard(async_op: bool)`, and we can use it as a primitive for implementing the explicit forward prefetching in existing FSDP.

Explicit prefetching requires knowing the execution order.

---

Disclaimer: The testing is relatively lighter in this PR. I did not want to spend too much time writing new forward-only tests. The stream usage will be exercised thoroughly once we have backward too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117973
Approved by: https://github.com/weifengpy, https://github.com/yifuwang
ghstack dependencies: #117950, #117955
2024-02-01 03:08:13 +00:00
8d6e34b21b Add verbose option to failures histogram (#118757)
Sample output: https://gist.github.com/jamesjwu/cc80d7da305add0a69c5e39aae09a077
Using directories from https://hud.pytorch.org/pr/118597:
eager_tests: [linux-focal-py3.11-clang10 / test (default, 1, 3, linux.2xlarge)](https://github.com/pytorch/pytorch/actions/runs/7716582714/job/21034340833)
dynamo_tests: [linux-focal-py3.11-clang10 / test (dynamo, 1, 3, linux.2xlarge)](https://github.com/pytorch/pytorch/actions/runs/7716582714/job/21034342747)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118757
Approved by: https://github.com/zou3519
2024-02-01 02:46:36 +00:00
499f31d40b [dynamo] use par_style = "xar" in minifier targets file (#118603)
For internal usage, par_style="xar" is needed in order for certain build
modes to work with triton.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118603
Approved by: https://github.com/williamwen42
2024-02-01 02:42:26 +00:00
a43c28368c [export] support non-persistent buffers (#118612) (#118722)
Summary:
X-link: https://github.com/pytorch/executorch/pull/1769

Basic support for non-persistent buffers, which are buffers that do not show up in the state dict.

One weird twist is that most of our other systems (FX, aot_export, dynamo) have completely buggy handling of non-persistent buffers. I tried to go on a wild goose chase to fix them all, but it got to be too much. So I introduced some sad rewrite passes in `_export` to make the final state dict correctly align with the original module's state dict.

This exposed some bugs/ambiguous handling of parameters/buffers in existing test code. For example, `TestSaveLoad.test_save_buffer` traced over a module that was not in the root module hierarchy and caused some weird behavior. I think we should error explicitly on use cases like this: https://github.com/pytorch/pytorch/issues/118410. For now I just rewrote the tests or skipped them.

Test Plan: added a unit test

Differential Revision: D53253905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118722
Approved by: https://github.com/SherlockNoMad, https://github.com/angelayi
2024-02-01 00:36:09 +00:00
4cba1dd0c3 [submodule] Update cudnn_frontend to v1.0.3 (#118782)
# Summary
Updates cudnn frontend to the tagged 1.0.3 version

submodule

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118782
Approved by: https://github.com/malfet
2024-02-01 00:35:03 +00:00
suo
2f79a7bf9e [export] make spec comparison indifferent to fx collections (#118718)
Treat immutable_dict as dict and immutable_list as list. Some executorch tests were tripped up by the previous behavior.

Differential Revision: [D53252679](https://our.internmc.facebook.com/intern/diff/D53252679/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118718
Approved by: https://github.com/zhxchen17
2024-02-01 00:10:49 +00:00
6c67f3333a [Inductor] Skip triton templates for mixedmm on SM70- (#118591)
As it results in numerical errors, see https://github.com/pytorch/pytorch/issues/117144

Fixes https://github.com/pytorch/pytorch/issues/117144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118591
Approved by: https://github.com/jansel
2024-01-31 23:30:45 +00:00
da4b4d961e Support printing storage while FakeTensorMode is enabled (#118780)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118780
Approved by: https://github.com/thiagocrepaldi, https://github.com/eellison
2024-01-31 23:10:47 +00:00
30f43e3d89 [ONNX][bench] Deepcopy model to another device before export to avoid OOM (#118710)
Prior to ONNX export, the model is deepcopied to avoid modifications that may affect later performance profiling. However, this increases the memory requirement on the device.
This PR modifies the script to deepcopy and export the model on another device when possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118710
Approved by: https://github.com/thiagocrepaldi
2024-01-31 23:03:39 +00:00
21ce53b9c5 Add inf norm support for _foreach_norm (#118441)
Fixes #117803
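
Usage sketch of the new support (assuming `float('inf')` is accepted as the `ord` argument after this change):
```
import torch

tensors = [torch.randn(3, 3) for _ in range(4)]
inf_norms = torch._foreach_norm(tensors, float("inf"))  # one norm per tensor
expected = [t.abs().max() for t in tensors]              # inf-norm == max abs
torch.testing.assert_close(list(inf_norms), expected)
```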

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118441
Approved by: https://github.com/mlazos
2024-01-31 22:58:51 +00:00
e87ac82c98 Fix missing default dim param in weight norm interface decomp (#118762)
Fix for https://github.com/pytorch/pytorch/issues/118742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118762
Approved by: https://github.com/ezyang, https://github.com/shunting314
2024-01-31 22:10:10 +00:00
e426924c19 Change classification to beta for TORCH_LOGS (#118682)
Changes classification of TORCH_LOGS to beta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118682
Approved by: https://github.com/svekars
2024-01-31 21:50:55 +00:00
fb391a016d Test that optimizers are running cudagraphs (#118716)
Updates compiled optimizer tests to ensure that cudagraphs is running when on cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118716
Approved by: https://github.com/eellison
2024-01-31 21:34:23 +00:00
8dee7b7a16 Add TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED (#118750)
This allows us to request extended (including C++ backtrace) information
whenever a specific guard occurs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118750
Approved by: https://github.com/aakhundov
2024-01-31 21:16:27 +00:00
c978f38bd4 Some minor type stub improvements (#118529)
I was just playing around with improving the typing of symbolic_shapes. The PR is not "complete" but I in particular wanted to get feedback on whether or not people liked making ValueRanges Generic; it seems that distinguishing if you have an Expr ValueRange or a SympyBoolean ValueRange is a lot of trouble for downstream. Using TypeGuard, we can perform refinements on the generic parameter inside methods, although we still have to cast back to ValueRange[T] due to https://github.com/python/mypy/issues/14425#issuecomment-1914852707
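
A small illustration of the TypeGuard-based refinement described here, shown with a free-function guard for simplicity (the actual class in torch has more machinery; `typing.TypeGuard` needs Python 3.10+ or `typing_extensions`):
```
from typing import Generic, TypeGuard, TypeVar, Union

import sympy
from sympy.logic.boolalg import Boolean

T = TypeVar("T", bound=Union[sympy.Expr, Boolean])

class ValueRanges(Generic[T]):
    def __init__(self, lower: T, upper: T) -> None:
        self.lower, self.upper = lower, upper

def is_bool_range(vr: ValueRanges) -> TypeGuard[ValueRanges[Boolean]]:
    # After a call site checks `if is_bool_range(vr): ...`, the type checker
    # treats `vr` as ValueRanges[Boolean] inside that branch.
    return isinstance(vr.lower, Boolean)
```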

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118529
Approved by: https://github.com/Skylion007
2024-01-31 20:56:56 +00:00
5ced432a0d Revert "[export] Fix graph signature for primitive outputs (#118655)"
This reverts commit 680cc6b17ab3f318c0da6177646afe6700152327.

Reverted https://github.com/pytorch/pytorch/pull/118655 on behalf of https://github.com/atalman due to broke TestExportTorchbind.test_input test ([comment](https://github.com/pytorch/pytorch/pull/118655#issuecomment-1919940598))
2024-01-31 20:55:46 +00:00
a768a50a55 Re-enable test_nan_to_num (#118711)
Resolve TODO and re-enable as https://github.com/pytorch/pytorch/issues/82763 is resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118711
Approved by: https://github.com/peterbell10
2024-01-31 20:01:10 +00:00
9391af9796 Merging heuristics (#118029)
Every day I move closer and closer to just using numbers

* number of heuristics that marked it as high, probable, low, none etc
* order of heuristics in the `__init__` file as well as how the heuristic ordered the tests
* put heuristics historical edited files and profiling as not trial mode
* briefly sanity checked that all shards of the larger test files (ex test_ops) exist and there are no dups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118029
Approved by: https://github.com/huydhn
2024-01-31 20:00:10 +00:00
3280fdb883 [FSDP2] Added _to_kwargs root forward input cast (#117955)
This PR adds a `_to_kwargs()` call on the FSDP root module's forward inputs to move them to `device` similar to DDP.
39df084001/torch/nn/parallel/distributed.py (L1426-L1427)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117955
Approved by: https://github.com/weifengpy
ghstack dependencies: #117950
2024-01-31 19:51:32 +00:00
d33f9dcefe [FSDP2] Added all-gather and unsharded parameter (#117950)
This PR adds the FSDP all-gather (the copy-in/all-gather collective and the copy-out) and the unsharded parameter concept to `FSDPParam`. This is to prepare for being able to run the forward pass.
- We implement all-gather as two functions: `foreach_all_gather` (copy-in/all-gather collective) and `foreach_all_gather_copy_out` (copy-out).
    - In the future, there will be two paths: `async_op=True` in the default stream for explicit prefetching and `async_op=False` in separate streams for implicit prefetching.
    - In the future, we will use `torch.split_with_sizes_copy` in the copy-out when it has the CUDA fast path.
    - We have the functions operate on `List[FSDPParam]` instead of passing the `torch.Tensor` and metadata mainly so that the `all_gather_input` can be computed under the `all_gather_copy_in_stream`. Since the two functions are specific to FSDP, I did not see motivation for avoiding this at the cost of entering/exiting the `all_gather_copy_in_stream` context twice (which incurs some CPU overhead).
- The `init_all_gather_output()` and `init_unsharded_parameter()` functions may seem unintuitive. The reason we initialize them once and write to them in-place thereafter is for autograd. See the note `[Note: FSDP and autograd]` in the code.
- We expand our 'FSDP tensors' definition to include the all-gather input and all-gather output in addition to the sharded and unsharded parameters. This distinction might seem unnecessary or pedantic, but it enables a language for describing pre- and post-all-gather transformations.
- We use the `_unsafe_preserve_version_counters` context when copying out because otherwise autograd will complain of a version mismatch in backward due to writing to the leaf tensors. (An alternative would be to use `.data`, but we are avoiding that 😄 .)

---

<details>
<summary> Copy-in/All-Gather/Copy-Out Example </summary>

Suppose we have 2 parameters with shapes `(3, 3)` (denoted with `A`s) and `(2, 2)` (denoted with `B`s) and 2 ranks, where `P` represents padding and `E` represents empty:
```
Given:
(3, 3): AAAAAAAAA
(2, 2): BBBB

Sharded parameters/all-gather inputs:
Rank 0: AAAAAA, BB
Rank 1: AAAPPP, BB

Each rank allocate group's all-gather output:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: AAAAAABBEEEEEEEE
Rank 1: EEEEEEEEAAAPPPBB

Each rank all-gather:
Rank 0: AAAAAABBAAAPPPBB
Rank 1: AAAAAABBAAAPPPBB

Each rank copy-out:
Rank 0: AAAAAAAAAPPP, BBBB
Rank 1: AAAAAAAAAPPP, BBBB
```
</details>
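
A single-process sketch that mirrors the example above (2 simulated ranks, shapes `(3, 3)` and `(2, 2)`); it is for illustration only and elides the real collective, padding metadata, and stream handling:
```
import torch

world_size = 2
params = [torch.arange(9.0).reshape(3, 3), torch.arange(4.0).reshape(2, 2)]

def all_gather_input(rank):
    # Flattened dim-0 shard of each parameter, padded to a uniform chunk size.
    pieces = []
    for p in params:
        chunk_rows = (p.size(0) + world_size - 1) // world_size
        shard = p[rank * chunk_rows : (rank + 1) * chunk_rows].reshape(-1)
        pad = chunk_rows * p[0].numel() - shard.numel()
        pieces.append(torch.cat([shard, shard.new_zeros(pad)]))
    return torch.cat(pieces)

# "All-gather": every rank ends up with the concatenation of all ranks' inputs.
all_gather_output = torch.cat([all_gather_input(r) for r in range(world_size)])

# Copy-out: split each rank's segment back into per-parameter (padded) chunks.
chunk_numels = [
    (p.size(0) + world_size - 1) // world_size * p[0].numel() for p in params
]
per_rank = all_gather_output.chunk(world_size)
per_param = [
    torch.cat([seg.split(chunk_numels)[i] for seg in per_rank])
    for i in range(len(params))
]
print([t.numel() for t in per_param])  # [12, 4]: AAAAAAAAAPPP and BBBB
```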

---

For context, we use the copy-in/all-gather/copy-out strategy instead of NCCL group coalescing for two reasons:
1. One large NCCL all-gather is still noticeably faster than several NCCL all-gathers using group coalescing of the same total bytes (even after NCCL 2.18.3). We prefer to trade off extra device-to-device copies (using GPU high-bandwidth memory) to save communication time, which does not improve as quickly from hardware generation to generation.
2. Copying out of the all-gather buffer tensor simplifies multi-stream memory handling because there is a constant number of such all-gather tensors alive at once. (The copy-out is done in the default/compute stream.) If we directly used the all-gather tensor memory for computation, then the number of such alive tensors is linear in the module depth and hence dependent on the particular model.

---

Disclaimer: This PR has some extraneous code, but I did not want to simplify too much since that code will be added back soon anyway (e.g. for overlapping, mixed precision, and ZeRO++). Hopefully it does not hinder code review too much.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117950
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-01-31 19:51:32 +00:00
483001e846 Revert "Workaround for super() calls in test_torchinductor_dynamic_shapes (#118586)"
This reverts commit f2682e75e6fd735c4a84afe59eafd541f7643f4a.

Reverted https://github.com/pytorch/pytorch/pull/118586 on behalf of https://github.com/atalman due to Broke slow tests ([comment](https://github.com/pytorch/pytorch/pull/118586#issuecomment-1919810802))
2024-01-31 19:44:29 +00:00
649f2e3000 Fix for out of bounds registers_ access in mobile TorchScript interpreter (#110300)
Summary:
The TorchScript interpreter had multiple opcodes whose logic had the potential to access the registers_ array out of bounds.

This change ensures that all registers_ accesses are in bounds or an exception will be thrown.

Test Plan: contbuild + OSS signals

Differential Revision: D49748737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110300
Approved by: https://github.com/malfet, https://github.com/kimishpatel
2024-01-31 19:40:02 +00:00
8026534a2f Add torch.complex128 and torch.complex32 to DTYPE_TO_ATEN dictionary. (#117929)
Fixes #117370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117929
Approved by: https://github.com/Skylion007, https://github.com/desertfire
2024-01-31 19:34:58 +00:00
82b6ee5a2a Fix build error in ppc64le (#118516)
...
from /home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/test/vec_test_all_types.cpp:1:
/home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h: In member function 'bool at::vec::DEFAULT::Vectorized::has_inf_nan() const':
/home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h:244:36: error: no matching function for call to 'at::vec::DEFAULT::Vectorized::_isinf(float&) const'
  244 |   if(_isnan(_vec0[i]) || _isinf(_vec0[i])) {
      |                          ~~~~~~^~~~~~~~~~
/home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h:237:21: note: candidate: 'at::vec::DEFAULT::Vectorized at::vec::DEFAULT::Vectorized::_isinf() const'
...

Started breaking from 29516bd2a0.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118516
Approved by: https://github.com/ezyang
2024-01-31 19:33:57 +00:00
aca41a3a74 [optim] lbfgs: handle complex params as independent real params (#118184)
Ref: #86340

Fixes #118148

This fixes LBFGS for complex parameters. Complex parameters are handled as R^2.
I also added a test, unfortunately, due to the closure required, I could not use the existing `_test_complex_optimizer` used for all other optimizers.
LBFGS is special, as it will call the objective function multiple times internally, so I felt a one-off test for LBFGS was justifiable.
We will test if each step taken internally by the optimizer is the same for R^2 and complex parameters.
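
A minimal sketch of the "complex parameters as R^2" idea (assumed handling; the optimizer step then only ever sees real-valued views):
```
import torch

p = torch.randn(4, dtype=torch.complex64, requires_grad=True)

# view_as_real exposes a complex (n,) tensor as a real (n, 2) view, so each
# complex parameter behaves like two independent real parameters.
real_view = torch.view_as_real(p)
print(real_view.shape)  # torch.Size([4, 2])
print(real_view.dtype)  # torch.float32
```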

Let me know if the approach is ok, thanks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118184
Approved by: https://github.com/janeyx99
2024-01-31 19:24:16 +00:00
82b0341af3 s/verison/version/ (#118749)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118749
Approved by: https://github.com/malfet, https://github.com/albanD
2024-01-31 19:23:55 +00:00
41dfd0e063 Update Dynamo passrate/histogram scripts (#118752)
Changelog:
- Don't count running PYTORCH_TEST_WITH_DYNAMO=1 on dynamo/ tests in the pass
rate. This was a bug (we were counting all of these as failing, but in
reality, most of these pass). The net effect is that the passrate is (artificially)
6% higher.
- Have the histogram script filter out skips based on the passrate metric.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118752
Approved by: https://github.com/jamesjwu
2024-01-31 19:15:17 +00:00
99b69e1ffb add PrivateUse1 device support in function options_from_string. (#118627)
add PrivateUse1 device support in function options_from_string.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118627
Approved by: https://github.com/soulitzer
2024-01-31 18:52:58 +00:00
7aff92c838 [torch] Expose dynamic_shapes api at multiple levels (#118695)
Summary: Exposes `dynamic_shapes` api at multiple levels so it's easier to replace the old API `dynamic_dim()` with the new API `Dim()`.

Test Plan: CI

Differential Revision: D53246409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118695
Approved by: https://github.com/ydwu4
2024-01-31 18:50:01 +00:00
6bd1807ae9 enable mkl_gemm_f16f16f32 in cpublas::gemm (#118367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118367
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-31 18:37:42 +00:00
81d12846dc Add decomp for pixel_shuffle/unshuffle (#118239)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118239
Approved by: https://github.com/peterbell10
2024-01-31 18:34:21 +00:00
81b55f58ce Matmul decide should_fold using has_out instead of grad_mode (#118617)
Fixes https://github.com/pytorch/pytorch/issues/118548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118617
Approved by: https://github.com/lezcano
2024-01-31 18:34:16 +00:00
a5a0fdcae9 Remove some unnecessary skipIfTorchDynamo (#118725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118725
Approved by: https://github.com/bdhirsh
2024-01-31 18:18:17 +00:00
680cc6b17a [export] Fix graph signature for primitive outputs (#118655)
Summary:
Now that we allow primitive outputs, we need to fix how the graph
signature outputs user_outputs

Test Plan: CI

Differential Revision: D53233649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118655
Approved by: https://github.com/tarun292
2024-01-31 18:00:02 +00:00
8455447972 Support builtin callable with object arguments in dynamo (#118678)
Fix issue #117556
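
A small repro-style sketch of what now works (assuming the fixed behavior; the object type and backend are illustrative):
```
import torch

class Widget:
    pass  # not callable: no __call__

@torch.compile(backend="eager", fullgraph=True)
def f(x, obj):
    # callable(obj) on an arbitrary object instance is now evaluated at
    # trace time instead of causing a graph break.
    return x + 1 if callable(obj) else x - 1

print(f(torch.ones(3), Widget()))  # tensor([0., 0., 0.])
```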

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118678
Approved by: https://github.com/anijain2305
2024-01-31 17:54:08 +00:00
68c3cb7594 s/fialure/failure/ (#118744)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118744
Approved by: https://github.com/peterbell10
2024-01-31 17:42:44 +00:00
suo
5586d7797e fix up batchnorm folding in pt2 quant (#118720)
Changes to how attributes are structured messed this pass up, fix it

Differential Revision: [D53253601](https://our.internmc.facebook.com/intern/diff/D53253601/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118720
Approved by: https://github.com/SherlockNoMad
2024-01-31 17:40:47 +00:00
4a677da36b Add more triton kernel mutation tracking tests (#118691)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118691
Approved by: https://github.com/aakhundov
ghstack dependencies: #118676, #118595
2024-01-31 17:38:17 +00:00
b4f4fd0c28 Parse and handle functions in TTIR (#118595)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118595
Approved by: https://github.com/aakhundov
ghstack dependencies: #118676
2024-01-31 17:38:17 +00:00
1bf9ddf130 add test_truth (#118597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118597
Approved by: https://github.com/anijain2305
2024-01-31 15:10:58 +00:00
1128cf96f0 [AOTI] Support _embedding_bag in C shim (#118706)
Summary: At some point I will stop manually adding ops to the C shim and use torchgen to generate that code instead. For the near term, I need to add a few more in order to switch the AOTInductor dashboard run.

Differential Revision: [D53249074](https://our.internmc.facebook.com/intern/diff/D53249074)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118706
Approved by: https://github.com/frank-wei, https://github.com/aakhundov
ghstack dependencies: #118704, #118705
2024-01-31 15:02:40 +00:00
8db8ff652c [AOTI] Add aoti_torch_view_dtype in C shim (#118705)
Summary: Support ir.ComplexView in the ABI-compatible codegen

Differential Revision: [D53249039](https://our.internmc.facebook.com/intern/diff/D53249039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118705
Approved by: https://github.com/frank-wei
ghstack dependencies: #118704
2024-01-31 14:42:29 +00:00
dd52939438 [inductor] Refactor ir.ComplexView (#118704)
Summary: Make ir.ComplexView a subclass of ir.FallbackKernel, to unify the codegen logic

Differential Revision: [D53248972](https://our.internmc.facebook.com/intern/diff/D53248972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118704
Approved by: https://github.com/frank-wei
2024-01-31 14:42:29 +00:00
35f3ccffd4 [Cutlass 3.3.0 submodule upgrade] (#118629)
Cutlass 3.3 offers the following improvements:

- Adds support for mixed-precision GEMMs on Hopper and Ampere
- Adds support for < 16B aligned GEMMs on Hopper
- Enhancements to EVT
- Enhancements to the Python interface
- Enhancements to sub-byte type handling in CuTe
- Several other bug fixes and performance improvements; minor doc update
Test Plan:

CI ( ciflow/trunk, ciflow/inductor )
pytest test/inductor/test_max_autotune.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118629
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/khabinov
2024-01-31 13:53:58 +00:00
c3a3e61bcb Resolve TODO in test_slice_mutation2 (#118712)
As https://github.com/pytorch/pytorch/issues/94693 has been resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118712
Approved by: https://github.com/peterbell10
2024-01-31 08:26:22 +00:00
9afd539075 [sigmoid] update serialization to include custom objs (#118684)
Summary: Update the serialization code to handle custom objs.

Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//sigmoid/frontend/test_gpu:serializer_test

Differential Revision: D53139356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118684
Approved by: https://github.com/angelayi, https://github.com/suo
2024-01-31 08:23:34 +00:00
56718cab8d Unskip test_complex_type_conversions (#118694)
Resolve TODO and unskip test_complex_type_conversions as real and imag have been implemented for complex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118694
Approved by: https://github.com/huydhn
2024-01-31 08:04:15 +00:00
73229b4f93 Add --filter-rank to torchrun: allow logs filtering by rank (#118562)
Addresses issue https://github.com/pytorch/pytorch/issues/117383

The implementation exposes `--filter-ranks`, which controls, by rank, which log files we pass to `TailLog` (used in torchrun to determine which logs to output to stdout/stderr).

## Behavior
### with --tee
Currently --tee is implemented as --redirect to a file, which is then streamed to the console using `tail`. When --tee is specified, file logs are unaffected and we only filter the output to the console.

### with --redirect
When --redirect is specified without --tee, nothing is logged to console, so we no-op.

### with neither
When neither --tee nor --redirect is specified, torchrun uses the empty string "" to indicate logging to the console. We intercept this empty string and redirect it to "/dev/null" so that nothing is printed to the console.

The api also allows a per-rank configuration for --tee and --redirect, and is also supported by this filter implementation.

## Usage
### without --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --filter_ranks=0 t.py
hello from rank 0 python
DEBUG: TRACED GRAPH
 __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
-------------  ------  -----------------------  ---------  --------
placeholder    l_x_    L_x_                     ()         {}
call_function  mul     <built-in function mul>  (l_x_, 5)  {}
output         output  output                   ((mul,),)  {}
...
```
### with --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --tee 3 --filter_ranks=0 t.py
[rank0]:hello from rank 0 python
[rank0]:DEBUG: TRACED GRAPH
[rank0]: __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
[rank0]:-------------  ------  -----------------------  ---------  --------
[rank0]:placeholder    l_x_    L_x_                     ()         {}
[rank0]:call_function  mul     <built-in function mul>  (l_x_, 5)  {}
[rank0]:output         output  output                   ((mul,),)  {}
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118562
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-01-31 07:40:01 +00:00
995f69623d Add Silu to Dtensor Pointwise ops (#118702)
# Summary
Adds silu to the supported list, needed for llama2 mlp support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118702
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
2024-01-31 06:17:36 +00:00
74f4947caf Fix addmm over empty tensors and broadcastable input (#118619)
Fixes https://github.com/pytorch/pytorch/issues/118131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118619
Approved by: https://github.com/albanD
2024-01-31 05:40:25 +00:00
2d37a046e7 [export] Enforce serialization BC/FC with updater script. (#118424)
Summary:
This diff implements a mechanism for safely updating the torch.export serialization schema, aka schema.py, which is the API surface with the strongest compatibility guarantee.

The diff is consist of 3 changes:
- Added a script to "build" or "materialize" schema.py into a platform-neutral format (yaml), which serves as the committed form of the serialization schema.
- Added unittest to compare against schema.py and schema.yaml, so that it forces developers to execute the updater script when there is mismatch between two files.
- Added a checker inside the updater script, so that every compatible change results in a minor version bump and every incompatible change results in a major version bump.

torch.export's serialization BC/FC policy is (tentatively) documented here: https://docs.google.com/document/d/1EN7JrHbOPDhbpLDtiYG4_BPUs7PttpXlbZ27FuwKhxg/edit#heading=h.pup7ir8rqjhx , we will update the

As noted in the code doc, people should be able to run the following command to update schema properly from now on:

```
    python scripts/export/update_schema.py --prefix <path_to_torch_development_diretory>
or
    buck run caffe2:export_update_schema -- --prefix /data/users/$USER/fbsource/fbcode/caffe2/
```

Test Plan:
buck test mode/opt caffe2/test:test_export -- -r test_schema
buck run caffe2:update_export_schema -- --prefix /data/users/$USER/fbsource/fbcode/caffe2/

Differential Revision: D52971020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118424
Approved by: https://github.com/angelayi
2024-01-31 05:37:58 +00:00
697ca4f292 Preliminary DeviceMesh + native c10d functional integration (#118423)
### Summary
- Added `group_name` as the third field in `dim_group_infos`.
- `DeviceMeshTest` now runs both w/ and w/o `_USE_NATIVE_C10D_FUNCTIONAL=1` in CI.

### Other fixes
- Convert `reduceOp` to lower case before passing it into c10d_functional ops.
- Added a finalizer to handle unwaited collectives (this mirrors the treatment for Python functional collective ops).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118423
Approved by: https://github.com/wanchaol, https://github.com/LucasLLC, https://github.com/wconstab
2024-01-31 04:36:12 +00:00
e3cde68534 [FSDP2] Added initial _lazy_init and FQNs for debugging (#117881)
This PR adds the initial `_lazy_init`. Lazy initialization marks the point when the FSDP structure is finalized and is typically the beginning of the first forward. This would be after any meta-device initialization.
- Lazy initialization is distinct from construction time because when processing `fully_shard(module)`, there is no way to know whether a parent of `module` will have `fully_shard` applied as well. This is a consequence of `fully_shard` having to be applied bottom-up.
- At lazy initialization, we now have the concept of a _root_. The root FSDP module is the one whose `forward` runs first and ends last (and hence similarly for its backward). Having a single root simplifies handling logic that should only run "once per forward/backward/iteration". We may consider relaxing this in the future, but it will add more complexity to the design.
- Once we have a root, we can define _fully-qualified names_ (FQNs) for both parameters and modules. To aid debugging, we store `_param_fqn` and `_module_fqn` on `FSDPParam` and `FSDPParamGroup`, respectively. Note that we can have a unique `_module_fqn` for `FSDPParamGroup` since we currently assume a 1:1 relationship.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117881
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #118525, #117814, #117867, #117877
2024-01-31 03:38:53 +00:00
f7ae454003 [vision hash update] update the pinned vision hash (#118700)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118700
Approved by: https://github.com/pytorchbot
2024-01-31 03:10:52 +00:00
6d7cfb5c3f [audio hash update] update the pinned audio hash (#118699)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118699
Approved by: https://github.com/pytorchbot
2024-01-31 03:10:48 +00:00
0a7e2ce0e1 [PT-Vulkan] aten::conv1d - support any stride, padding, dilation (#118660)
Summary:
This diff stack builds on yipjustin's initial special-case implementation: D50914117.

That special-case only covers
```
strides = 1
padding = 0
dilation = 1
in_channels = out_channels = groups
n = 1
```

Test Plan:
```
[jorgep31415@161342.od /data/sandcastle/boxes/fbsource (a0b8b9b7f)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*conv1d*"
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/conv1d.glsl
File changed: fbcode//caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/glsl/conv1d.glsl
3 additional file change events
Buck UI: https://www.internalfb.com/buck2/ebb61796-c71d-4e0c-8148-de1eb67b5d4c
Network: Up: 10KiB  Down: 53MiB  (reSessionID-5f852cf6-9bf1-4c73-a471-4c121b53ed62)
Jobs completed: 16. Time elapsed: 21.6s.
Cache hits: 43%. Commands: 7 (cached: 3, remote: 0, local: 4)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.conv1d_simple
[       OK ] VulkanAPITest.conv1d_simple (136 ms)
[ RUN      ] VulkanAPITest.conv1d
[       OK ] VulkanAPITest.conv1d (35 ms)
[----------] 2 tests from VulkanAPITest (172 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (172 ms total)
[  PASSED  ] 2 tests.
```

Reviewed By: yipjustin

Differential Revision: D53204673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118660
Approved by: https://github.com/yipjustin
2024-01-31 01:49:09 +00:00
suo
68a75d4539 [lint] remove merge_base_with from .lintrunner.toml (#118677)
This setting is problematic in fbcode, where the expected behavior is to match `arc lint`, which has a behavior much like running `lintrunner` without a `--merge-base-with` argument.

Let's try removing this. I also updated the CI message to encourage people to run with `-m origin/main`, which should hopefully cut down on confusion in the absence of defaulting to that behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118677
Approved by: https://github.com/PaliC
2024-01-31 00:53:58 +00:00
07a7feca74 [FSDP2] Sharded parameter in FSDPParam (#117877)
This PR adds logic to shard the managed parameters on dim-0. This is like `distribute_tensor()` with two differences:
1. `distribute_tensor()` today cannot accept a `DTensor` and reshard it to the parent mesh (https://github.com/pytorch/pytorch/issues/116101).
2. `DTensor` does not pad its local shard on any `Shard` dimensions (https://github.com/pytorch/pytorch/issues/113045).

As such, the `FSDPParam._init_sharded_param()` derives the global `DTensor` metadata itself and pads the local tensor on dim-0. The padding helps make the all-gather copy-in more efficient since the all-gather buffer will require padding.
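
A rough sketch of the dim-0 shard-and-pad step (illustrative helper only; the real code also constructs the DTensor and its global metadata):
```
import torch

def shard_dim0_with_padding(param: torch.Tensor, rank: int, world_size: int):
    dim0 = param.size(0)
    padded_dim0 = (dim0 + world_size - 1) // world_size * world_size
    chunk = padded_dim0 // world_size
    padded = param.new_zeros(padded_dim0, *param.shape[1:])
    padded[:dim0].copy_(param)
    # Every rank's local shard has the same shape, which keeps the
    # all-gather copy-in simple and avoids per-rank special cases.
    return padded[rank * chunk : (rank + 1) * chunk]

local = shard_dim0_with_padding(torch.randn(3, 3), rank=1, world_size=2)
print(local.shape)  # torch.Size([2, 3]); the last row is padding on rank 1
```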

---

Some details:
- We free the original parameter manually after constructing the sharded parameter. This lowers the peak memory during construction time slightly (since not _all_ parameters in the group must be sharded before the original parameters are freed) and is not strictly necessary.
- We bypass `nn.Module.__setattr__` because the checks are slow and unnecessary. The drawback is that we would ignore a user-defined override of `__setattr__`; however, since we have never encountered this in practice, I am okay with this. Notably, user calls to `setattr` would still use the override; FSDP only uses `setattr` as a mechanism for switching between sharded and unsharded parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117877
Approved by: https://github.com/wanchaol
ghstack dependencies: #118525, #117814, #117867
2024-01-31 00:44:19 +00:00
cyy
4a019047ad Enable nested namespace check in clang-tidy (#118506)
It is time to enable nested namespaces in the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118506
Approved by: https://github.com/albanD
2024-01-31 00:32:35 +00:00
1b03423526 [meta registration] fix _efficient_attention_forward for jagged inputs (#118657)
Fixes the meta registration for the logsumexp output, whose shape should
be defined by the size of the offsets tensor when it exists.

644f64f2d1/aten/src/ATen/native/transformers/cuda/attention.cu (L1045)

Differential Revision: [D53234217](https://our.internmc.facebook.com/intern/diff/D53234217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118657
Approved by: https://github.com/YuqingJ
2024-01-31 00:11:39 +00:00
6fa162e681 Reland: [aotinductor] Replicate split_cat from torch IR to predispatch IR" (#118590)
Summary:
This is part of the pass migration effort. The final target is removing the acc tracer in AOTI.
In this diff, I did a few things:
1. copy and modify the `fx_passes/split_cat.py` passes based on predispatch IR.
2. verify the correctness by copying the `test_split_cat_fx_passes.py` and create a new file `test_split_cat_fx_passes_aten_fb.py` which is executed in AOTI and checked the counters
3. create a util function to execute the pass and compare the before/after graphs, giving users more information such as the pass's effect and the time spent. It will create logs like
```
[2024-01-25 20:26:48,997] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 0, save before/after graph to /tmp/tmpvlpwrklp, graph before/after are the same = False, time elapsed = 0:00:00.001585
[2024-01-25 20:26:49,000] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 1, save before/after graph to /tmp/tmpz_onjfeu, graph before/after are the same = False, time elapsed = 0:00:00.001873
[2024-01-25 20:26:49,002] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 2, save before/after graph to /tmp/tmpgkck8yko, graph before/after are the same = True, time elapsed = 0:00:00.000269
[2024-01-25 20:26:49,007] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 3, save before/after graph to /tmp/tmpquenq06y, graph before/after are the same = False, time elapsed = 0:00:00.003621
[2024-01-25 20:26:49,009] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 4, save before/after graph to /tmp/tmpi8fia0dv, graph before/after are the same = True, time elapsed = 0:00:00.000190
```

Differential Revision: D53171027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118590
Approved by: https://github.com/kflu, https://github.com/khabinov, https://github.com/chenyang78
2024-01-31 00:09:46 +00:00
7761ceb6b3 Fix a bug with python lambda capture (#118676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118676
Approved by: https://github.com/jamesjwu, https://github.com/aakhundov
2024-01-30 23:59:07 +00:00
616e9dbed8 add torch.float64 precision support to the transformer test suite in TP/SP (#116436)
This PR (as a followup to #115530) resolves previous issues of not passing `assertEqual()` tests (with small error) when comparing outputs from the single-gpu model and the distributed model, under certain input/model sizes or when certain operations (e.g. weight-tying) are enabled. This is done by simply enabling higher precision computation using `dtype=torch.float64`.

What is not tested: whether or not distributed model training convergence rate is affected using just `torch.float32` precision.

Test plan:
TP: `python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_False`
TP+SP: `python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_True`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116436
Approved by: https://github.com/wanchaol
2024-01-30 23:50:29 +00:00
1f376b3b24 Fix lint after #117814 (#118689)
Forward fix after PR #117814 to make lint green again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118689
Approved by: https://github.com/awgu, https://github.com/huydhn
2024-01-30 23:46:27 +00:00
1e78dc95a4 Fix/Temporarily disable tests broken due to triton version mismatch (#118661)
Summary:
These tests were broken because internal Triton is 2.2 whereas external is 3.0.

Will update after internal version catches up.

Test Plan: CI

Differential Revision: D53231204

Co-authored-by: Oguz Ulgen <oulgen@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118661
Approved by: https://github.com/oulgen
2024-01-30 23:06:35 +00:00
2f7839e6db register decomposition for rsub in torch._refs (#118288)
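The decomposition itself is the simple identity `rsub(a, b, alpha) = b - alpha * a`; a hedged sketch in the style of a `torch._refs` decomposition (not the exact registered source):
```
import torch

def rsub(a, b, *, alpha=1):
    # torch.rsub(input, other, alpha) computes other - alpha * input
    return b - alpha * a

x, y = torch.randn(3), torch.randn(3)
torch.testing.assert_close(rsub(x, y, alpha=2.0), torch.rsub(x, y, alpha=2.0))
```
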
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118288
Approved by: https://github.com/lezcano
ghstack dependencies: #118398
2024-01-30 22:18:15 +00:00
04ded1399d Fix signatures of torch.{add, sub, mul} (#118398)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118398
Approved by: https://github.com/lezcano
2024-01-30 22:18:15 +00:00
6ea233a14c [FSDP2] Added initial FSDPParamGroup, FSDPParam, ParamModuleInfo (#117867)
This PR adds the initial `FSDPParamGroup` and `FSDPParam` classes, and it focuses on the `ParamModuleInfo` data structure.

- `ParamModuleInfo` has the info needed to `setattr` a managed parameter, where it must account for shared parameters and shared modules.
    ```
    # Shared parameter
    lin1.weight = lin2.weight

    # Shared module
    mlp.lin1 = mlp.lin2
    ```
- In order for FSDP to find shared modules' parameters, we must use `remove_duplicate=False`. See https://github.com/pytorch/pytorch/pull/99448/ for the original context. Finding shared modules' parameters is not necessary for the `setattr` logic, but in case we need it in the future (like for existing FSDP's state dict), we include that info for now.
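
A quick illustration of why `remove_duplicate=False` matters for shared parameters:
```
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
mlp[1].weight = mlp[0].weight  # shared parameter

# The default de-duplicates, so the shared weight shows up only once:
assert len(list(mlp.named_parameters())) == 3
# remove_duplicate=False surfaces it under both owning modules:
assert len(list(mlp.named_parameters(remove_duplicate=False))) == 4
```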

With this PR, we see the general system architecture:
- 1 `module` : 1 `fully_shard`
- 1 `fully_shard` : 1 `FSDPParamGroup`
- 1 `FSDPParamGroup` : k `FSDPParam`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117867
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #118525, #117814
2024-01-30 22:07:59 +00:00
ae6233ec47 [FSDP2] Added mesh arg, FSDPState, move to device (#117814)
Squashed to include https://github.com/pytorch/pytorch/pull/117861, https://github.com/pytorch/pytorch/pull/117852

---

This PR adds `_get_managed_modules()` to determine which modules a `fully_shard(module)` call manages. The rule is defined as:
> `fully_shard(module)` manages all modules in `module.modules()` except those already managed by a nested `fully_shard()` or a nested non-composable API (e.g. `replicate()` or TorchRec).

Practically, this can be implemented as a graph search from `module` that does not proceed into any module with `fully_shard` or a non-composable API applied. Because the non-composable APIs follow the same rule, this rule is correct inductively.

---

This PR adds `_get_managed_states(managed_modules)` to return the managed parameters and buffers given the managed modules.
- Without an extra mechanism to ignore specific parameters or buffers, the rule currently is simply to get the directly managed state (i.e. parameters/buffers) from each managed module while de-duplicating shared ones.
- However, we prefer this translation from managed modules to managed states to accommodate ignoring specific states in the future (which has appeared in various open-source use cases).

---

This PR adds the `mesh` argument to `fully_shard` and some helper data structures specific to FSDP/HSDP that pre-compute useful info like rank/world size for each mesh dim.
- The `mesh` defines the FSDP/HSDP algorithm. 1D mesh means FSDP, and 2D mesh means HSDP, where we assume sharding on the last dimension.
    - We can revisit the HSDP sharding-dim assumption if needed in the future.
- The default (if `mesh is None`) is that `fully_shard` calls `init_device_mesh` following the global process group.
- The helper data structures are the various `*MeshInfo`s. I included up to the `HSDPMeshInfo` even though it will not be immediately used to show the spirit of it. We want to tag both the shard and replicate dims.
- The `mesh_info` variable in `fully_shard` is not used for now. It will be passed downstream in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117814
Approved by: https://github.com/wanchaol, https://github.com/wconstab
ghstack dependencies: #118525
2024-01-30 22:05:16 +00:00
7aa4b35b75 [FSDP2][Reland] Introduced initial fully_shard frontend (#118525)
This PR introduces the initial `fully_shard` frontend without any distributed logic that will be built into per-parameter-sharding FSDP.
- We design `fully_shard` to be a _module-level_ API (taking in an `nn.Module`), e.g. as opposed to a tensor-level one.
- We define a `FSDP` class and use a dynamic class swap, setting `module.__class__` to a newly created class that subclasses `FSDP` and `type(module)`, to allow FSDP to override and add methods on the module (a minimal sketch of this swap follows the list below).
    - We name this class as `FSDP<type(module)>`, e.g. `FSDPLinear` for `Linear`.
    - We disable the `deepcopy` because the state object inserted on the module will not be trivially `deepcopy`-able.
- Calling `fully_shard(module)` inserts a state object on `module` but not any of its children. This state object will be used for any FSDP-specific state.
- We raise an error on `ModuleList` or `ModuleDict` since they do not implement `forward()`, and FSDP will rely on `forward()` to insert logic (https://github.com/pytorch/pytorch/issues/113794).
- In the future, we will deprecate the existing `fully_shard` that calls into the same backend logic as `FullyShardedDataParallel` as there is no adoption for that and we prefer to reuse that name.
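
A stripped-down sketch of the dynamic class swap mentioned above (illustrative only; the real `fully_shard` also installs the FSDP state object and caches the generated class per module type):
```
import torch.nn as nn

class FSDP:
    def unshard(self):
        ...  # example of a method FSDP adds onto the wrapped module

def swap_class(module: nn.Module) -> nn.Module:
    cls = type(f"FSDP{type(module).__name__}", (FSDP, type(module)), {})
    module.__class__ = cls  # e.g. Linear becomes FSDPLinear
    return module

lin = swap_class(nn.Linear(2, 2))
print(type(lin).__name__)          # FSDPLinear
print(isinstance(lin, nn.Linear))  # True
```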

**Reland details:** I removed `test/distributed/_composable/fsdp/_test_fully_shard_common.py` and moved its contents to the existing `torch/testing/_internal/common_fsdp.py`, which is already a target for internal tests.

Differential Revision: [D53187509](https://our.internmc.facebook.com/intern/diff/D53187509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118525
Approved by: https://github.com/wanchaol
2024-01-30 22:05:16 +00:00
48f876143a Fix missing permission in create release workflow (#118681)
Fixes https://github.com/pytorch/pytorch/actions/runs/7715417683/job/21029944543
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118681
Approved by: https://github.com/clee2000, https://github.com/seemethere, https://github.com/atalman, https://github.com/malfet
2024-01-30 22:02:30 +00:00
1aa836f502 Dont fuse write into read if indexing differs (#118210)
Fix for https://github.com/pytorch/pytorch/issues/101950, https://github.com/pytorch/pytorch/issues/94693

Similar to inplacing a kernel only fuse a write after a read of the same tensor if the write and read have same indexing formula. I did a perf test and it was neutral.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118210
Approved by: https://github.com/jansel
2024-01-30 21:55:27 +00:00
82a7460b67 [quant][bc-breaking] Turn on fold_quantize by default (#118605)
Summary:
Previously, by default we did not generate a quantized weight; that is, we had an fp32 weight and
`fp32 weight -> q -> dq -> linear -> ...` in the quantized model

After this PR, we'll produce a graph with int8 weight by default after convert_pt2e:
`int8 weight -> dq -> linear -> ...`

We'll remove the fold_quantize flag in the next PR

Test Plan: CI

Differential Revision: D51730862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118605
Approved by: https://github.com/andrewor14
2024-01-30 21:42:29 +00:00
ba1be17733 Remove voznesenskym from the list of autoreviewers (#118680)
Mitigates the failures of "Auto Request Review" workflow:
```
Requesting review to ezyang, albanD, miladm, voznesenskym, antoniojkim, SherlockNoMad
Error: HttpError: Reviews may only be requested from collaborators. One or more of the users or teams you specified is not a collaborator of the pytorch/pytorch repository.
```
https://github.com/pytorch/pytorch/actions/runs/7716852492/job/21034629665?pr=118669
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118680
Approved by: https://github.com/clee2000
2024-01-30 21:35:38 +00:00
f2682e75e6 Workaround for super() calls in test_torchinductor_dynamic_shapes (#118586)
Info about super in dynamic classes:
https://stackoverflow.com/questions/71879642/how-to-pass-function-with-super-when-creating-class-dynamically
https://stackoverflow.com/questions/43782944/super-does-not-work-together-with-type-supertype-obj-obj-must-be-an-i

Calling `super(TestCase, self)` actually dispatches to TestCase's parent's methods, bypassing TestCase's own overrides
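
A small illustration of the pitfall (standard Python behavior, shown with stand-in class names):
```
class Base:
    def ping(self):
        return "Base"

class TestCase(Base):
    def ping(self):
        return "TestCase"

def ping(self):
    # Zero-argument super() fails here (no __class__ cell outside a class
    # body), and super(TestCase, self) skips TestCase's own override.
    return super(TestCase, self).ping()

Dynamic = type("Dynamic", (TestCase,), {"ping": ping})
print(Dynamic().ping())  # "Base" -- TestCase.ping is bypassed
```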

Mainly doing this because it's making disable bot spam

Test: checked locally and check that https://github.com/pytorch/pytorch/issues/117954 actually got skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118586
Approved by: https://github.com/huydhn
2024-01-30 21:34:05 +00:00
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
e332653eb3 [inductor] Use at::detail::empty_strided_* in cpp_wraper mode (#118490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118490
Approved by: https://github.com/desertfire
2024-01-30 21:03:19 +00:00
1562dae62c [BE]: Apply RUF025 dict.fromkeys preview rule (#118637)
Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
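
For illustration, the pattern the rule rewrites and its main caveat:
```
keys = ["conv", "linear", "embedding"]

# dict.fromkeys is clearer (and faster) than a comprehension when every key
# gets the same static value.
flags = dict.fromkeys(keys, False)
assert flags == {k: False for k in keys}

# Caveat: a mutable default is shared by all keys.
shared = dict.fromkeys(keys, [])
shared["conv"].append(1)
assert shared["linear"] == [1]  # same list object everywhere
```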

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
2024-01-30 20:46:54 +00:00
e33e88e5bc Add separate logging target for cudagraphs (#118329)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118329
Approved by: https://github.com/mlazos
2024-01-30 20:16:51 +00:00
e180218949 [c10d] Log the last enqueued and completed collective (#118582)
Summary:
During debugging of some timeouted jobs, I found it difficult to
identify which rank is at fault eventhough we have logs of many ranks
reporting timeout on a specific collective seq.

If we can also report lastEqueuedSeq and lastCompletedSeq, it would be
much easier to identify,
1. whether a rank has not even join a collective call (not enqueued)
2. Or it has joined the collective call, but not completed.

For the 1st case, it is most likely a problem in the user's code;
for the 2nd case, it could be a lower-layer issue.

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118582
Approved by: https://github.com/wconstab
2024-01-30 20:13:55 +00:00
9247641f34 [PT-Vulkan] aten::unsqueeze - nit optimization (#118575)
Summary:
While learning Vulkan shaders, I realized one of the branches can be easily optimized.

The relevant branch is only taken when we unsqueeze along `dim == 1` for 3D tensors.
1. There's an unnecessary for-loop.
2. There's an unnecessary dependency on the output tensor's number of channels.

## CPU Tensor
```
3D->4D: (c, h, w) -> (c, 0, h, w)
```
## GPU Texture
```
3D->4D: (w, h, c/4)[c%4] -> (w, h, c)[0]
```

Note the GPU Texture's output is always at `[0]` and the output tensor's number of channels is always 1.

We are currently writing the same value `v[p]` to all elements of the texel `out_texel`, but we need only write it to `out_texel[0]`:

Test Plan:
```
[jorgep31415@161342.od /data/sandcastle/boxes/fbsource (ca3b566bc)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*unsqueeze*"
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
Buck UI: https://www.internalfb.com/buck2/2c7f1365-e004-41a0-9201-473929a2738a
Network: Up: 174B  Down: 0B  (reSessionID-c54d25da-f44b-49f7-8bfd-1db4eee50f6d)
Jobs completed: 6. Time elapsed: 14.4s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *unsqueeze*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.unsqueeze_0dto1d_dim0
[       OK ] VulkanAPITest.unsqueeze_0dto1d_dim0 (60 ms)
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim0
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim0 (0 ms)
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim1
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim1 (132 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim0
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim0 (20 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim1
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim1 (66 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim2
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim2 (3 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim0
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim0 (19 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim1
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim1 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim2
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim2 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim3
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim3 (1 ms)
[----------] 10 tests from VulkanAPITest (307 ms total)

[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (307 ms total)
[  PASSED  ] 10 tests.
[
```

Differential Revision: D53189637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118575
Approved by: https://github.com/yipjustin
2024-01-30 20:01:18 +00:00
suo
d0627cc2af [export] do not rewrite state dict when unlifting (#118611)
This is Very Bad; changing state dict keys violates one of the key contracts we have, which is "do not mess with the state dict".

Change unlift to use a similar `_assign_attr` approach that fx.GraphModule and unflatten do.

Also took the opportunity to improve the interface of `_assign_attr` to be more general.

Differential Revision: [D53139277](https://our.internmc.facebook.com/intern/diff/D53139277/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118611
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607, #118608, #118609, #118610
2024-01-30 19:14:19 +00:00
suo
be90ab7efd [export] do not unlift cond/map submodules (#118610)
I don't think we should be unlifting HOO submodules.

What is the contract of unlifting? It is: restore the original calling convention of the module, undoing the transformation in which we lift parameters, buffers, and constants to inputs in the graph.

Unlifting does *not* make any guarantees about what's going on inside the module. It's still a flat module. So why should we unlift the cond/map submodules? It doesn't have anything to do with the contract stated above; it's some internal stuff that doesn't affect how the module will be called.

Further, this code as written modifies the state dict; adding a new buffer that is actually duplicate of a previous buffer. Modifying the state dict from the original eager module is never correct.

Differential Revision: [D53160713](https://our.internmc.facebook.com/intern/diff/D53160713/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118610
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607, #118608, #118609
2024-01-30 19:14:18 +00:00
suo
4ee8aa6028 [export] adopt KeyPath API in nonstrict mode (#118609)
This PR rewrites two paths to use the newly-added keypaths API in pytree:
First: we were hand-rolling a tree_map during fakification because we wanted to track sources. This PR uses keypaths instead, which can do the same thing without needing custom code.

Second: our constraint error formatting was referencing placeholder names in error messages. These placeholder names are not otherwise user-visible, so they are super confusing to users (e.g. "which input does arg1_3 correspond to?"). This diff uses the `keystr` API to format the error message.

This necessitated some small refactors—generating the keystr is expensive so doing it in an f-string was very bad.

It can also be further improved—we can inspect the signature so that instead of `*args[0]` we can give people the actual argument name, which would be the ideal UX. But leaving that for later.
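
A tiny sketch of the keypath usage in question (assuming `tree_flatten_with_path` and `keystr` from `torch.utils._pytree`; the exact helpers touched here may differ):
```
from torch.utils import _pytree as pytree

inputs = {"x": 1, "nested": [2, 3]}
leaves_with_paths, _spec = pytree.tree_flatten_with_path(inputs)
for keypath, leaf in leaves_with_paths:
    # keystr turns the path into a readable string like "['nested'][0]",
    # which can replace opaque placeholder names in error messages.
    print(pytree.keystr(keypath), leaf)
```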

Differential Revision: [D53139358](https://our.internmc.facebook.com/intern/diff/D53139358/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118609
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607, #118608
2024-01-30 19:14:11 +00:00
suo
ca090b2c77 [export] do not use tree_flatten_spec (#118608)
tree_flatten_spec is bad; it isn't synced up with `register_pytree_node` so it will not handle arbitrary custom pytrees. It's also not really maintained.

We only use it for two purposes:
- To retain kwarg ordering stability, so that if the user passes in kwargs in a different order things will still work.
- To do "structural" checks that ignore types.

In both cases, tree_flatten_spec is probably *not* the ideal way to implement the desired behavior.

## kwargs ordering
- tree_flatten_spec overwrites the behavior of ALL dictionaries, not just kwargs. This is not correct, dictionary ordering is meaningful in Python, and it's pretty trivial to write a program that relies on dict ordering.
- For kwargs, we do sort of expect that the order in which arguments are passed shouldn't matter. BUT there is one exception: `**kwargs`. In fact, [PEP 468](https://peps.python.org/pep-0468/) was introduced specifically to clarify that ordering does matter when the function being called uses `**kwargs`.

In this diff I introduce a utility function that *only* reorders kwargs. This gets us most of the way to correct—dicts are no longer reordered, but kwargs can be passed in any order.
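
A hypothetical sketch of such a helper (the real utility and its error handling may differ):
```
def reorder_kwargs(kwargs: dict, recorded_order: list) -> dict:
    # Reorder only the top-level kwargs to the order recorded at export
    # time; values (including nested dicts) are left untouched.
    missing = [k for k in recorded_order if k not in kwargs]
    extra = [k for k in kwargs if k not in recorded_order]
    if missing or extra:
        raise TypeError(f"kwargs mismatch: missing={missing}, extra={extra}")
    return {k: kwargs[k] for k in recorded_order}

print(list(reorder_kwargs({"b": 2, "a": 1}, ["a", "b"])))  # ['a', 'b']
```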

A "fully correct" solution would need fix the corner case from PEP468. We could detect whether the top-level fn being traced uses `**kwargs` (via `inspect`), then serialize a flag for it. In ExportedProgram, we would check that flag and only re-order if `**kwargs` was unused; otherwise error if the key order doesn't match. This is a super corner case though, so I'll file it as a followup task.

## structural equivalence checking

This is another use case, where again `tree_flatten_spec` is too broad. Generally we want to treat a precise two types as the same, not override the behavior of comparison generally. So I introduce an `is_equivalent` util for this purpose.

Differential Revision: [D53168420](https://our.internmc.facebook.com/intern/diff/D53168420/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118608
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607
2024-01-30 19:14:04 +00:00
bc9642f578 Skip more tests under rocm (#118624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118624
Approved by: https://github.com/aakhundov
2024-01-30 19:06:06 +00:00
e6e7d7f26b [pt-vulkan] Introduce MemoryAllocation class and enable deferred allocation and resource aliasing (#118436)
## Context

This changeset is part of a stack that enables memory planning (i.e. sharing memory between intermediate tensors) in the PyTorch Vulkan Compute API. Note that Memory Planning can only be used via the ExecuTorch delegate (currently a WIP) and not Lite Interpreter (which does not collect metadata regarding tensor lifetimes).

This changeset enables [resource aliasing](https://gpuopen-librariesandsdks.github.io/VulkanMemoryAllocator/html/resource_aliasing.html), a technique that allows two resources (i.e. `VkImage`s or `VkBuffer`s) to bind to the same memory allocation. This is the core feature that allows memory planning to be implemented in PyTorch Vulkan.

## Notes for Reviewers

At a high level, this changeset introduces the `MemoryAllocation` struct which represents a raw `VmaAllocation`. `VulkanImage` and `VulkanBuffer` have been updated to store a `MemoryAllocation` member instead of the raw handle of a `VmaAllocation`.

`vTensor`, `VulkanImage`, and `VulkanBuffer` constructors now have an `allocate_memory` argument which controls whether memory should be allocated on construction. If `false`, then memory must be allocated separately and bound later using `bind_allocation()` before the resource can be used.

Internal:

## Notes for Internal Reviewers

Please refer to [this design doc](https://docs.google.com/document/d/1EspYYdkmzOrfd76mPH2_2BgTDt-sOeFnwTkV3ZsFZr0/edit?usp=sharing) to understand how memory planning will work end-to-end in the Vulkan Delegate.

Differential Revision: [D53136249](https://our.internmc.facebook.com/intern/diff/D53136249/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118436
Approved by: https://github.com/yipjustin
2024-01-30 19:03:55 +00:00
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45ef53747e2eefffd65d91ce840b431b.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
suo
6511811ebb [export] preserve metadata during nonstrict tracing (#118607)
Previously, nonstrict tracing would wipe the metadata of graph modules, because the wrapper class we're using was not detected as a GraphModule and thus metadata preservation was not turned on.

Differential Revision: [D53139354](https://our.internmc.facebook.com/intern/diff/D53139354/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118607
Approved by: https://github.com/zhxchen17
2024-01-30 18:27:52 +00:00
644f64f2d1 [c10d] added docstrings and tests for src / dst (#118593)
Follow-up to https://github.com/pytorch/pytorch/pull/118359: whether ``src`` and ``dst`` are based on the global pg or the sub pg
* update c10d docstrings: ``src`` / ``dst`` are based on the global pg regardless of the ``group`` argument (see the sketch after this list)
* communication ops with ``dst`` argument: ``reduce``, ``gather_object``, ``gather``, ``send``, ``isend``
* communication ops with ``src`` argument: ``irecv``, ``recv``, ``broadcast``, ``broadcast_object_list``, ``scatter``, ``scatter_object_list``
* ``pytest test/distributed/test_c10d_nccl.py -k subgroup``
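
For illustration, a minimal sketch of the clarified ``dst`` semantics, assuming an already-initialized default process group with 4 ranks:

```python
import torch
import torch.distributed as dist

subgroup = dist.new_group(ranks=[2, 3])  # every rank must call new_group
if dist.get_rank() in (2, 3):
    t = torch.ones(1)
    # dst is interpreted as a *global* rank even though a subgroup is passed,
    # i.e. this reduces onto global rank 2, not "rank 0 of the subgroup"
    dist.reduce(t, dst=2, group=subgroup)
```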

3 collectives are for picklable objects (``gather_object``, ``broadcast_object_list``, ``scatter_object_list``). There are 2 ways to set the device:
* use the ``device`` argument: it's implemented in ``broadcast_object_list``; maybe worth implementing in the other 2
* ``torch.cuda.set_device(global_rank)``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118593
Approved by: https://github.com/wconstab
2024-01-30 17:47:58 +00:00
19e8ba95e5 [RELAND] Remove deprecated fbgemm operators (#112153)
These operators are not used and have been deprecated since #72690
(Feb 2022).

BC-breaking message:

`TorchScript` models that were exported with the deprecated
`torch.jit.quantized` API will no longer be loadable, as the required
internal operators have been removed.
Please re-export your models using the newer `torch.ao.quantization` API
instead.
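
A rough sketch of the re-export path with the newer API, using dynamic quantization as an example (illustrative only; the right workflow depends on how the model was originally quantized):

```python
import torch
import torch.ao.quantization as aoq

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
qmodel = aoq.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
scripted = torch.jit.script(qmodel)  # TorchScript export of the torch.ao-quantized module
torch.jit.save(scripted, "model_quantized.pt")
```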
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112153
Approved by: https://github.com/jerryzh168
2024-01-30 16:32:37 +00:00
2327879fb6 Add lowering to special.bessel_j0 (2nd try) (#118565)
This PR is a copy of https://github.com/pytorch/pytorch/pull/118464 that was merged without using pytorchbot. Sorry for the noise!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118565
Approved by: https://github.com/peterbell10
2024-01-30 15:26:59 +00:00
fbf92500fb enable privateuseone to perform streaming backward (#117111)
Fixes #116957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117111
Approved by: https://github.com/soulitzer
2024-01-30 15:13:31 +00:00
15702a8027 Fix lnit after #118533 (#118633)
Fixes lint after https://github.com/pytorch/pytorch/pull/118533
Adds ignore ``possibly-undefined`` to more places

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118633
Approved by: https://github.com/DanilBaibak
2024-01-30 14:07:16 +00:00
827949cef2 accelerate binary_cross_entropy_with_logits by using log_sigmoid operator (#115539)
When I was reimplementing BCEWithLogits, I found that the `log_sigmoid` operator could accelerate the function.

Simple benchmark on AMD 3600 CPU Ubuntu 22.04:
|avg time (ms)|with `pos_weight`|no `pos_weight`|
|-|-|-|
|original|1986|1658|
|this PR|1295|995|

This is 35-40% faster, probably thanks to the `log_sigmoid` vectorization code.

A CUDA benchmark was not obtained, but I believe CUDA can also benefit from reducing kernel launches, as https://github.com/pytorch/pytorch/pull/11054#issuecomment-442233714 and https://github.com/pytorch/pytorch/pull/78267#issue-1248398454 mention.
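
For context, the identity being exploited, checked with a small snippet (assumes the default ``reduction='mean'`` and no ``pos_weight``):

```python
import torch
import torch.nn.functional as F

x, y = torch.randn(8), torch.rand(8)
# 1 - sigmoid(x) == sigmoid(-x), so the loss can be written entirely via logsigmoid
ref = F.binary_cross_entropy_with_logits(x, y)
via_logsigmoid = -(y * F.logsigmoid(x) + (1 - y) * F.logsigmoid(-x)).mean()
torch.testing.assert_close(ref, via_logsigmoid)
```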

The simple benchmark cpp file:
[demo.txt](https://github.com/pytorch/pytorch/files/13635355/demo.txt)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115539
Approved by: https://github.com/malfet
2024-01-30 13:24:13 +00:00
e5bb527d3e [inductor][cpp] support scalar value in vec reduction (#118511)
Fix https://github.com/pytorch/pytorch/issues/118379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118511
Approved by: https://github.com/leslie-fang-intel, https://github.com/lezcano, https://github.com/jansel
2024-01-30 13:07:43 +00:00
91690983ff [easy] Faster empty LIST_LENGTH guard (#118542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118542
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-01-30 13:02:18 +00:00
64efec9953 Port FakeProcessGroup to cpp (#118426)
### Summary
Native functional collective ops require the backend to be implemented in C++. Porting `FakeProcessGroup` to cpp so that it can also work for native functional collective ops.

The existing tests involving `FakeProcessGroup` all pass. In addition, `DeviceMeshTest::test_fake_pg_device_mesh` now passes with `_USE_NATIVE_C10D_FUNCTIONAL=1`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118426
Approved by: https://github.com/wanchaol
ghstack dependencies: #113057
2024-01-30 11:40:13 +00:00
da0635d17c Add pytorch-distributed justknobs helper (#118568)
Summary:
Sets up a helper that checks any JKs relevant to pytorch distributed,
and propagates their values to ENV.

Test Plan: Added unit test

Differential Revision: D53192406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118568
Approved by: https://github.com/zdevito
2024-01-30 08:13:52 +00:00
3ecc2f3a0d [PT2][Runtime Numeric Check] Fix compatibility issue (#118578)
Summary: Titled

Test Plan: WIP

Differential Revision: D53196722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118578
Approved by: https://github.com/jackiexu1992
2024-01-30 08:04:27 +00:00
b7c8485704 refactor mm_plus_mm check to pattern match (#118456)
Fixes #103101

replace #103253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118456
Approved by: https://github.com/jansel
2024-01-30 07:48:06 +00:00
c7af626a26 [c10d] allow nonblocking wrap of ncclCommInitRankConfig (#118256)
resolve #117749

Summary:
Updated the PR with the following intentions:

1. identify eagerMode init (as opposed to lazy init), in which case we will create NCCL comms without guarantees that they are fully initialized if NONBLOCKING mode is also enabled.
2. Python users can do other work (e.g., model init) between invoking init_process_group and their first collective call.
3. c10d would guarantee/wait for communicators to be initialized before issuing the first collective call.
4. For NCCL collective calls, the contract between Python users and c10d is not changed much from blocking calls (c10d would wait for the NCCL call to reach ncclSuccess, or time out, whichever happens first).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118256
Approved by: https://github.com/kwen2501
2024-01-30 06:23:20 +00:00
e632d0c0dc Break Triton MutationTests to one kernel per test (#118553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118553
Approved by: https://github.com/aakhundov
ghstack dependencies: #118588
2024-01-30 06:17:55 +00:00
eqy
4a48899b6e [CUDA][complex] Define LIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS in CUDA build (#117061)
An upcoming CUDA release will migrate to CCCL, and we need this define to preserve current complex behavior: https://nvidia.github.io/libcudacxx/standard_api/numerics_library/complex.html

CC @miscco @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117061
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-01-30 06:11:31 +00:00
c203d88795 Skip mutation tests on rocm (#118588)
Fixes #118585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118588
Approved by: https://github.com/aakhundov, https://github.com/jansel
2024-01-30 05:46:54 +00:00
fe07851173 [CUDA][TF32][functorch] Also disable TF32 for vjp and jvp tests (#118592)
CC @zou3519
Appears to be the same issue as https://github.com/pytorch/pytorch/issues/86798
Seen surfacing on >= sm80

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118592
Approved by: https://github.com/zou3519
2024-01-30 05:34:20 +00:00
8be6dee14b [inductor] Fix codegen bug with Native Triton kernels with ReinterpretView args (#118569)
Summary:
### Context

It's possible for the args of a user-defined Triton kernel to be codegen-ed twice. But this only happens if the arg is a `ReinterpretView`.
* First via `arg.codegen_reference()` in `define_user_defined_triton_kernel()`
* Second in `self.codegen_kwargs()`.

When using `abi_compatible=True`, the duplicate codegen will look like the code below. The issue in the code is that one of the Tensors, internal to the graph, isn't properly freed. This scenario was eventually exposed as a memory leak when we re-ran an AOTInductor model many times and observed `memory.used` increase after each iteration.
```
auto tmp_tensor_handle_0 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
auto tmp_tensor_handle_1 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
...
// There's no wrap_with_raii_handle_if_needed() for tmp_tensor_handle_0.
// And there's no reference to tmp_tensor_handle_0.
// Thus, tmp_tensor_handle_0 is left as an AtenTensorHandle which isn't
// automatically cleaned-up like RAIIAtenTensorHandle
CUdeviceptr var_6;
aoti_torch_get_data_ptr(wrap_with_raii_handle_if_needed(tmp_tensor_handle_1), reinterpret_cast<void**>(&var_6));
void* kernel_args_var_2[] = {..., &var_6, ...};
launchKernel(kernels.add_kernel_0, ..., kernel_args_var_2);
```

### Solution
We just need the arg's buffer name when creating the `TensorArg` in `define_user_defined_triton_kernel()`. Thus, just return the buffer's name and avoid any potential side-effects with `arg.codegen_reference()`.

Test Plan:
### Inspect device memory allocated
```
# Before diff
0 device memory 2048
1 device memory 2560
2 device memory 3072
3 device memory 3584
4 device memory 4096
5 device memory 4608

# With diff (memory usage doesn't grow)
0 device memory 1536
1 device memory 1536
2 device memory 1536
3 device memory 1536
4 device memory 1536
5 device memory 1536
```

Reviewed By: jingsh, tissue3

Differential Revision: D53190934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118569
Approved by: https://github.com/oulgen
2024-01-30 05:19:32 +00:00
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
5dfcf07449 Reland PR117393 [inductor/fb] log config dict when compilation finishes (#118552)
Summary: Reverted due to merge conflict

Differential Revision: D53188124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118552
Approved by: https://github.com/mengluy0125
2024-01-30 04:34:22 +00:00
dcc077eea2 [executorch hash update] update the pinned executorch hash (#118594)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118594
Approved by: https://github.com/pytorchbot
2024-01-30 03:49:49 +00:00
0d47f6a44f [ez][inductor] fix a typo in should_pad_bench (#118598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118598
Approved by: https://github.com/eellison
2024-01-30 03:49:44 +00:00
135f785d77 [audio hash update] update the pinned audio hash (#118338)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118338
Approved by: https://github.com/pytorchbot
2024-01-30 03:44:00 +00:00
ff0cb38693 [vision hash update] update the pinned vision hash (#118340)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118340
Approved by: https://github.com/pytorchbot
2024-01-30 03:15:16 +00:00
2eefbc02a0 [ez] Discover tests without importing torch (#118574)
Moves test discovery into a file that doesn't import torch, so test listing can be done without having torch installed.

Helpful when you don't have torch installed (aka me when I'm feeling lazy)
I want to move TD into its own job that doesn't need to wait for the build to finish, so this is part of that.

The first commit is nothing more than a copy-paste of the selected functions/vars into a new file; the second commit has various changes that should be checked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118574
Approved by: https://github.com/huydhn
2024-01-30 03:02:29 +00:00
eb9905be5d [export] Remove the branch for skipping verifier. (#118139)
Summary:
We used to skip the verifier when the signature object is not the "correct" one (usually from some deprecated frontend). This was very useful when we wanted to pay a small cost to enable the verifier path to be called everywhere for torch export.

Now I believe no tests are relying on this behavior so we should remove this weird branch.

Test Plan: CI

Differential Revision: D53024506

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118139
Approved by: https://github.com/suo
2024-01-30 02:58:03 +00:00
b778f44e97 Allow using native c10d_functional via _functional_collectives (#113057)
This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collective` to use native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification.

NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057
Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol
2024-01-30 02:34:25 +00:00
126c1621ce Add Support for CausalBias to torch compile (#116071)
Fixes #115363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116071
Approved by: https://github.com/mlazos
2024-01-30 02:22:48 +00:00
67d8db9252 Remove semicolon after return_from_mutable_noop_redispatch (#118538)
[`return_from_mutable_noop_redispatch`](65f8276bc6/torchgen/gen_functionalization_type.py (L477)) calls
[`return_str`](65f8276bc6/torchgen/gen_functionalization_type.py (L159-L166)). `return_str`'s output includes `;` so I think the semicolon after the callsite of `return_from_mutable_noop_redispatch` is not needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118538
Approved by: https://github.com/colesbury
2024-01-30 02:22:42 +00:00
0ed24cb1af [export] comments about runtime_var_to_range. (#118539)
Summary: Add some comments in case we forget what runtime_var_to_range means

Test Plan: eyes

Differential Revision: D53186114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118539
Approved by: https://github.com/suo
2024-01-30 02:07:34 +00:00
b1f8b6b8fc Forward Fix accidental removal of import (#118572)
Summary:
This Diff is a forward fix for this PR: https://github.com/pytorch/pytorch/pull/114689

Where I accidentally removed the old import from backends/cuda.

Test Plan: Verified on the failing revert diff and it did indeed fix the issue

Reviewed By: DanilBaibak

Differential Revision: D53193454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118572
Approved by: https://github.com/DanilBaibak
2024-01-30 02:07:19 +00:00
460950d3aa [Nested Tensor] Support ragged_idx != 1 on aten::is_same_size, aten::_to_copy (#118442)
is_same_size is needed internally; `_to_copy` should be easy because it doesn't support new layouts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118442
Approved by: https://github.com/cpuhrsch
2024-01-30 01:32:51 +00:00
6c9f72156e Fix constant folding bug with sym size tensor (#118411)
When a constant-folded SymInt was used to construct a tensor that was then constant folded, we had previously tried to use the sympy symbol, which would error (the call should take a SymInt, not a symbol).

Fix by recording the observed size during constant folding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118411
Approved by: https://github.com/ezyang
2024-01-30 01:26:51 +00:00
aef820926c Add some tests for 3d channels last (#118283)
Part of a multi-PR work to fix #59168.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118283
Approved by: https://github.com/albanD
2024-01-30 01:26:47 +00:00
bacbad5bc9 add GradScaler on CPU (#109993)
Step 2 of https://github.com/pytorch/pytorch/issues/111559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109993
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-29 23:42:35 +00:00
796d270392 [easy] Fix small typo in register_state_dict_pre_hook doc (#118571)
Fixed https://github.com/pytorch/pytorch/pull/112674#issuecomment-1912849827

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118571
Approved by: https://github.com/janeyx99, https://github.com/albanD
2024-01-29 23:18:12 +00:00
413a434846 [export] Convert all export tests to .module() (#118425)
Test Plan: CI

Differential Revision: D53075379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118425
Approved by: https://github.com/suo
2024-01-29 23:06:54 +00:00
ca7cbf1226 Add memory_format to typehints of Tensor.cpu and Tensor.cuda (#118392)
Fixes #118501

The issue is that mypy complains if users use memory_format with Tensor.cpu/Tensor.cuda in their code.

This adds the missing memory_format to the type hints of both functions.
I believe there is no test infrastructure for type hints....
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118392
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-01-29 22:56:34 +00:00
e1cbf6dff5 Use SEQUENTIAL posix_fadvise on mmapped files (#117805)
In theory this tells the system that we will access the file sequentially which allows prefetching future blocks. In practice it doubles the read-ahead size on Linux (which effectively doubles the read sizes).

Without this, CUDA uploads of files that aren't already in FS cache, using mmapped files (safetensors) as source, run at ~1 GB/s (from an SSD that has ~7 GB/s read speed...).

With this, they run at ~1.5 GB/s which is still bad but better than before!

It is possible to increase the read performance further by touching the pages from multiple threads; in fact, when the tensors loaded from the file are used by the CPU, we get fairly good load performance (~5 GB/s), which appears to be because multiple threads page fault and trigger more concurrent reads which improves SSD read throughput... however, this is not the case for CUDA uploads, and it is difficult to make that change in a generic way because it's unclear what the usage pattern of the input file is going to be.

All of the numbers above are taken on Samsung 990 Pro SSD, on Linux kernel 6.5 with FS cache cleared between every attempt to load a file. The file is loaded via `safetensors.safe_open` which uses UntypedTensor.from_file to load the file into memory, which in turn uses MapAllocator.cpp.

I felt safe doing this change unconditionally but please let me know if you'd like to see a separate allocator flag for this, propagated through to UntypedTensor. Note that the fadvise API is not available on macOS.
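
For reference, the same advisory call expressed with Python's ``os`` module (the PR itself makes the equivalent call in MapAllocator.cpp; the file name below is just an example, and the API is Unix-only):

```python
import os

fd = os.open("model.safetensors", os.O_RDONLY)
# len=0 means "to the end of the file"; SEQUENTIAL roughly doubles read-ahead on Linux
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
# ... mmap or read the file sequentially here ...
os.close(fd)
```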
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117805
Approved by: https://github.com/mikaylagawarecki
2024-01-29 22:38:00 +00:00
67c6152f4e [HigherOrderOp] support while_loop in dynamo (#116913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116913
Approved by: https://github.com/zou3519
2024-01-29 22:32:32 +00:00
e3d7a19f73 [CI] add wait for /orig branch in mergeability check (#118576)
---

Test runs:
* [happy path](https://github.com/pytorch/pytorch/actions/runs/7702614677/job/20991275431?pr=118576) (this PR)
* [waiting for the hardcoded branch name](https://github.com/izaitsevfb/pr-head-test/actions/runs/7702386966/job/20990584514#step:3:33) in a separate repo (step succeeded after the branch was manually pushed)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118576
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-29 22:10:50 +00:00
a40be5f4dc Autograd doc cleanup (#118500)
I don't think we'll realistically go through deprecation for these now since there are a couple of uses of each online. So document appropriately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118500
Approved by: https://github.com/soulitzer
2024-01-29 21:51:33 +00:00
fc5cde7579 [dynamo] constant fold torch.cuda.get_device_properties to avoid graph break (#118422)
Before this PR, we had a graph break for code like this:
```python
    def test_get_device_properties_tensor_device(a):
        x = a.to("cuda")
        prop = torch.cuda.get_device_properties(x.device)
        if prop.major == 8:
            return x + prop.multi_processor_count
        return x + prop.max_threads_per_multi_processor
```
This PR constant folds the torch.cuda.get_device_properties and we'll get a following dynamo graph:
```python
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]     def forward(self, L_a_ : torch.Tensor):
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         l_a_ = L_a_
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         # File: /home/yidi/local/pytorch/test/dynamo/test_functions.py:544 in test_get_device_properties_tensor_device, code: x = a.to("cuda")
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         x = l_a_.to('cuda');  l_a_ = None
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         # File: /home/yidi/local/pytorch/test/dynamo/test_functions.py:547 in test_get_device_properties_tensor_device, code: return x + prop.multi_processor_count
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         add = x + 108;  x = None
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         return (add,)
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
```

The signature of get_device_properties is:
```python
def get_device_properties(device: _device_t) -> _CudaDeviceProperties:
```
I think it's safe to constant fold get_device_properties():
1. torch.cuda.get_device_properties(tensor.device). In this case, tensor.device.index is guarded in _check_tensor
2. torch.cuda.get_device_properties(device_int_id). We don't expect the GPU properties for a particular index to change during a torch.compile run, and it makes sense to specialize the properties for a concrete device_int_id.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118422
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-01-29 20:26:40 +00:00
f99adbb4ec [inductor] Remove ROCm xfail on test_cum{sum,prod}_zero_dim (#118558)
Fixes #118540, fixes #118541

Since the zero-dim case reduces to a pointwise operation, we don't fall back on
ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118558
Approved by: https://github.com/malfet
2024-01-29 20:23:40 +00:00
6591741183 [dynamo] support inference_mode with no arguments (#118427)
Before this PR, we got an error for the following code:
```python
def k(x):
    with torch.inference_mode():
        x = x + 1
        return x

torch.compile(k, backend="eager", fullgraph=True)(x)
```
error message:
```
Traceback (most recent call last):
....
    return InferenceModeVariable.create(tx, args[0].as_python_constant())
torch._dynamo.exc.InternalTorchDynamoError: list index out of range
```

This PR supports the case where torch.inference_mode is not provided any argument (i.e. it defaults to True).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118427
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-01-29 20:20:26 +00:00
e0d04b7119 [Caffe2] Fix bug in str on wide types (#117531)
Summary:
The current implementation of `str` passes wide types (`wchar_t`, `wchar_t*`, `std::wstring`) directly to `std::ostringstream`. This has the following behavior:

 - C++17, `wchar_t` & `wchar_t *`: print the integer representation of the character or the pointer. This is unexpected and almost certainly a (runtime) bug.
 - C++17, `std::wstring`: compile-time error.
 - C++20, all of the above: compile-time error.

To fix the bug and to enable C++20 migration, this diff performs narrowing on these wide types (assuming UTF-16 encoding) before passing them to `std::ostringstream`. This fixes both the C++20 compile time errors and the C++17 runtime bugs.

This bug surfaced in enabling C++20 windows builds, because windows specific caffe2 code uses `TORCH_CHECK` with wide strings, which references `str` for generating error messages.

Test Plan: CI & https://godbolt.org/z/ecTGd8Ma9

Differential Revision: D52792393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117531
Approved by: https://github.com/malfet
2024-01-29 20:11:37 +00:00
68b18dc2a2 [DeviceMesh] Removed print of self._dim_group_infos (#118527)
This print seems to have accidentally been merged in. It is a bit verbose during unit tests, so this PR removes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118527
Approved by: https://github.com/wz337
2024-01-29 19:14:25 +00:00
bb55970e5b Revert "Add justknobs env helper for pytorch distributed (#118451)"
This reverts commit 4d1bb2175a49e9b4440085a3dc2e2b211e5cf99e.

Reverted https://github.com/pytorch/pytorch/pull/118451 on behalf of https://github.com/wconstab due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/118451#issuecomment-1915369013))
2024-01-29 19:01:05 +00:00
0288db3120 [DCP] Removes Checkpoint Wrapped Prefix from state dict fqns (#118119)
Fixes #117399

~~Soliciting some early feedback here.~~

~~Do we happen to know if there already some tests that cover this case or would it make sense to add? @fegin , @wz337~~

Edit: Added tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118119
Approved by: https://github.com/fegin
2024-01-29 18:18:52 +00:00
fb11354594 Revert "[c10d] relax the nccl error check for nonblocking mode (#118254)"
This reverts commit 993e4f3911856be3a93746f6ed6a13f25de6ff65.

Reverted https://github.com/pytorch/pytorch/pull/118254 on behalf of https://github.com/clee2000 due to has internal failures D53170606 ([comment](https://github.com/pytorch/pytorch/pull/118254#issuecomment-1915267786))
2024-01-29 17:56:40 +00:00
3011a4406f [BE][GHF] Do not hardcode default branch name (#118530)
Instead rely on `GitHubPR.default_branch()` which is the name of the repo's default branch.

Do not pass the branch name when `merge_changes` is called, as it is set to the default branch inside the function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118530
Approved by: https://github.com/clee2000
2024-01-29 17:18:23 +00:00
65f8276bc6 add an option to specify custom addr2line binary (#118328)
There is a need for users to pick their own addr2line binary in their deployment, for reasons like the default addr2line being too slow. This option allows users to quickly experiment with other alternatives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118328
Approved by: https://github.com/zdevito, https://github.com/aaronenyeshi
2024-01-29 16:36:38 +00:00
abe3c55a6a Update DDP dynamo debug docs (#118295)
Refreshes https://github.com/pytorch/pytorch/pull/114201 and updates it to include other log names that also include ddp_optimizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118295
Approved by: https://github.com/LucasLLC, https://github.com/wanchaol
2024-01-29 14:58:26 +00:00
f9971daaee Fix divergence between internal + external (#118509)
D53049807 and https://github.com/pytorch/pytorch/pull/118197 got out of sync somehow

Fixing externally since I'm pretty sure the internal version is correct

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118509
Approved by: https://github.com/malfet
2024-01-29 14:53:50 +00:00
04c1df651a [inductor][cpp] enable vectorization with constant bool (#118380)
Related model DebertaForQuestionAnswering etc. For DebertaForQuestionAnswering, single thread, measured on ICX:
Before: 0.990x, After: 1.043x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118380
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-01-29 13:31:22 +00:00
ee3dfbbe47 [Inductor] Fix Argmax codegen with Nan input (#118358)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/118266: currently `torch.argmax` and `torch.argmin` have different return values between eager and the Inductor cpp backend when inputs contain `NaN` values. Align the cpp backend results to eager by reusing the compare function.
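
For context, a tiny repro of the eager semantics the cpp backend is aligned to (in eager, NaN wins the comparison, so argmax reports the NaN position):

```python
import torch

x = torch.tensor([1.0, float("nan"), 3.0])
print(torch.argmax(x))  # tensor(1) in eager; the Inductor cpp backend now matches
```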

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_argmin_cpu_only
python -u -m pytest -s -v test_cpu_repro.py -k test_argmax_argmin_with_nan_value
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118358
Approved by: https://github.com/lezcano, https://github.com/jgong5, https://github.com/jansel
2024-01-29 09:09:46 +00:00
41dfdde9f5 Handle some numpy functions with out arguments correctly in dynamo (#118248)
Dynamo creates Tensors when tracing through numpy ufuncs like np.sin, np.minimum, etc., and under `torch.compile` np functions generally return Tensors at runtime. However, when normalizing `out` arguments we currently require that the input is an ndarray. This causes assertion errors when running torch.compile on any numpy function with an out argument:
```
    def test_numpy_ufunc_out(self):
        @torch.compile(backend="eager")
        def foo():
            x = np.arange(5)
            out = np.empty((x.shape[0], x.shape[0]))
            res_out = np.sin(x, out=out)
            assert res_out is out
        foo()
```
Failure with stack trace: https://gist.github.com/jamesjwu/68e217638d735678b3de968584dba23f

Instead, we can wrap tensors in an ndarray in normalize_outarray to handle the case correctly. Fixing this resolves ~220 tests under dynamo_test_failures, but also exposes a followup bug.

In the presence of a graph break, ndarrays don't preserve their id, which can affect assertions and `is` checks between numpy arrays:
```
     def test_x_and_out_broadcast(self, ufunc):
        x = self.get_x(ufunc)
        out = np.empty((x.shape[0], x.shape[0]))

        x_b = np.broadcast_to(x, out.shape)
        # ufunc is just np.sin here
        res_out = ufunc(x, out=out)
        res_bcast = ufunc(x_b)
        # passes
        assert res_out is out
        graph_break()
        # fails
        assert res_out is out
```
Regular tensors preserve their id because Dynamo caches their example tensor values across a graph break. However, with ndarrays, we only store their converted tensor values, and construct new ndarrays around those values:
eebe7e1d37/torch/_dynamo/variables/builder.py (L1083)
Added a test with expected failure to showcase this — we can then fix that issue separately.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118248
Approved by: https://github.com/lezcano
2024-01-29 09:09:21 +00:00
4d1bb2175a Add justknobs env helper for pytorch distributed (#118451)
Summary:
Adds a JK killswitch check and configures the env for enabling pytorch
nccl flight recorder.  Note- this only enables recording events in memory, not
dumping them.

Test Plan: CI test

Reviewed By: zdevito

Differential Revision: D52920092

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118451
Approved by: https://github.com/malfet
2024-01-29 08:57:16 +00:00
41902a6ebc [dynamo] Optimize is_tracing checks (#118474)
benchmarks/dynamo/microbenchmarks/overheads.py
- before: 10.4us
- after: 9.9us

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118474
Approved by: https://github.com/yanboliang
2024-01-29 08:31:26 +00:00
eba240afcb Revert "[FSDP2] Introduced initial fully_shard frontend (#117776)"
This reverts commit 316579e30ce820cb5f431e6bb816a882db918b38.

Reverted https://github.com/pytorch/pytorch/pull/117776 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/117776#issuecomment-1914121167))
2024-01-29 07:38:41 +00:00
e6f3a4746c include a print for _get_cuda_arch_flags (#118503)
Related to #118494, it is not clear to users that the default behavior is to include **all** feasible archs (if the 'TORCH_CUDA_ARCH_LIST' is not set).

In these scenarios, a user may experience a long build time. This adds a print statement to reflect that behavior. [A `verbose` arg is not available, and it does not feel necessary to add one to this function and all its parent functions...]

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118503
Approved by: https://github.com/ezyang
2024-01-29 07:03:56 +00:00
47b5a6b05d [Dynamo] Analyze triton kernels via tracing to determine mutations (#117300)
This PR adds TTIR lexing and parsing in order to analyze which of the user defined triton kernel inputs are mutated.

Differential Revision: [D53165999](https://our.internmc.facebook.com/intern/diff/D53165999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117300
Approved by: https://github.com/jansel
2024-01-29 06:37:08 +00:00
2951bbf0f7 Add some type annotations to torch._inductor.codegen.wrapper (#118491)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118491
Approved by: https://github.com/Skylion007
2024-01-29 06:17:27 +00:00
5f59d0c748 [C10D] Disarm PGNCCL Heartbeat Monitor to gather data (#118344)
Summary:
Leave monitoring thread 'running' in log-only mode. Use the kill logs to
correlate with actual job outcomes (e.g. does stuck job detector agree?)

Later, re-enable (using a justknobs knob this time)

Test Plan: CI

Differential Revision: D53108142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118344
Approved by: https://github.com/shuqiangzhang, https://github.com/yifuwang, https://github.com/malfet, https://github.com/kwen2501
2024-01-29 06:09:36 +00:00
890d8e6692 [executorch hash update] update the pinned executorch hash (#118502)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118502
Approved by: https://github.com/pytorchbot
2024-01-29 03:45:45 +00:00
0d9aff2523 Removed unused “device” argument in torch.frombuffer() #118273 (#118439)
Fixes #118273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118439
Approved by: https://github.com/albanD
2024-01-28 22:01:49 +00:00
acc700739e Upgrade mypy version to 1.8.0 (#118481)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118481
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469, #118475, #118479, #118480
2024-01-28 19:22:37 +00:00
338596dfbc Forbid follow_imports = skip from mypy.ini (#118480)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118480
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469, #118475, #118479
2024-01-28 19:22:37 +00:00
119b66ba16 Use strict to toggle strict options in MYPYSTRICT (#118479)
As we force a specific version of mypy, it's OK to use the agglomerated flag.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118479
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469, #118475
2024-01-28 19:22:22 +00:00
ecca533872 Use dmypy instead of mypy in lintrunner (#118475)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118475
Approved by: https://github.com/suo
ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469
2024-01-28 13:42:06 +00:00
cad79bd0bb Remove follow_imports = skip from sympy (#118469)
dmypy silently ignores follow_imports = skip, so to get parity between
dmypy and mypy we have to suck it up and type: ignore all of the sympy
typing problems.

The suppressions were added automatically with the following script generated by GPT-4:

```
import re

# Read the error file
with open("error_file.txt", "r") as f:
    errors = f.readlines()

# Parse the lines with errors and error types
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

# Insert ignore comments in the source files
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118469
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467, #118468
2024-01-28 13:38:38 +00:00
59b4d2cd40 [mypy] Remove colorama ignore_missing_imports (#118468)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118468
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467
2024-01-28 13:38:38 +00:00
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00
2ed0af2bde [executorch hash update] update the pinned executorch hash (#118477)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118477
Approved by: https://github.com/pytorchbot
2024-01-28 03:56:11 +00:00
9d5b950bdd [BE][Easy]: Update ruff to 0.1.14 (#118466)
Updates ruff to 0.1.14 which has some more autofixes, bugfixes, and fixes some false positives. Full changelog found here: https://github.com/astral-sh/ruff/releases/tag/v0.1.14
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118466
Approved by: https://github.com/ezyang
2024-01-27 23:44:25 +00:00
ca1d70632d [14/N][Dynamo] Make trace_rules.lookup only handle function + callable type (#118366)
Step by step changes to unblock #118264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118366
Approved by: https://github.com/angelayi
2024-01-27 23:02:44 +00:00
62c1e4a578 Added missing CircularPad*d references so the docs are actually built. (#118465)
Fixes #118429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118465
Approved by: https://github.com/Skylion007
2024-01-27 22:39:01 +00:00
2728c9137d [easy][AOT] Fix shortcut path for simple tuple/list spec (#118460)
`type(self.spec)` is always `TreeSpec` and the condition is always `False`. This PR changes it to `self.spec.type`, which is the type of tree that the spec represents.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118460
Approved by: https://github.com/Skylion007
2024-01-27 19:04:12 +00:00
1460334436 [quant] Remove deprecated torch.jit.quantized APIs (#118406)
The `torch.jit.quantized` interface has been deprecated since #40102 (June 2020).

BC-breaking message:

All functions and classes under `torch.jit.quantized` will now raise an error if
called/instantiated. This API has long been deprecated in favor of
`torch.ao.nn.quantized`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118406
Approved by: https://github.com/jerryzh168
2024-01-27 18:32:45 +00:00
d03173e88c Unify MYPYINDUCTOR and MYPY (#118432)
The original motivation for MYPYINDUCTOR was a faster type checking configuration that only checked a subset of files. With the removal of `follow_imports = ignore`, we are now able to use dmypy to do fast incremental typechecking, eliminating the need for this.

Perhaps erroneously, when I tee'ed up this PR I elected to delete the `follow_imports = skip` designations in the mypy-inductor.ini. This led to a number of extra type error suppressions that I manually edited. You will need to review.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118432
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418
2024-01-27 17:23:20 +00:00
42062e2622 [pytree][BE] update treespec is_leaf() access (#116371)
Change `isinstance(treespec, LeafSpec) -> treespec.is_leaf()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116371
Approved by: https://github.com/zou3519
2024-01-27 11:44:57 +00:00
26473460a4 [ET-Vulkan] ExecuTorch Vulkan floor_div (#118428)
Summary: Add a new operator "floor_div" to ET-Vulkan.

Test Plan:
```
[yipjustin@7777.od ~/fbcode (b32108c6c)]$ buck2 test fbcode//executorch/backends/vulkan/test:test_vulkan_delegate --
File changed: fbcode//executorch/backends/vulkan/test/test_vulkan_delegate.py
Buck UI: https://www.internalfb.com/buck2/90290e5b-d47e-4cac-bc63-9939cc210d1f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3940649890839142
Network: Up: 2.8KiB  Down: 0B  (reSessionID-e7425cc1-0987-46d8-a7bf-418a660bee5b)
Jobs completed: 19. Time elapsed: 42.6s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Reviewed By: SS-JIA

Differential Revision: D53072722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118428
Approved by: https://github.com/SS-JIA
2024-01-27 11:20:52 +00:00
eqy
8d790abab9 [NCCL][c10d] Log failing pointer if deregistration fails (#118455)
For debugging convenience

CC @minsii @Aidyn-A @syed-ahmed @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118455
Approved by: https://github.com/wconstab
2024-01-27 11:03:02 +00:00
dabb90f2a4 Revert "[Exception] [6/N] Remove use of torch::TypeError (#117964)"
This reverts commit 87335fabaeca41f9721ba5d5eb7eafcf70b7afad.

Reverted https://github.com/pytorch/pytorch/pull/117964 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/117964#issuecomment-1913079096))
2024-01-27 08:44:34 +00:00
suo
bb6eba189f [export][ez] remove unused argument from InterpreterModule (#118364)
small thing I noticed

Differential Revision: [D53113926](https://our.internmc.facebook.com/intern/diff/D53113926/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118364
Approved by: https://github.com/angelayi
2024-01-27 06:46:01 +00:00
89a1175e0e Upgrade mypy python_version to 3.11 (#118418)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118418
Approved by: https://github.com/albanD
ghstack dependencies: #118414
2024-01-27 06:10:46 +00:00
978faf1fa2 Use an op counter to decide when to realize a kernel (#117030)
Instead of checking the number of bytes in the string representation
of the kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117030
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-01-27 05:28:46 +00:00
800e2e823f Add compilable foreach RAdam support (#117912)
Fixes https://github.com/pytorch/pytorch/issues/117807

This brings the number of supported optimizers with `torch.compile` to 11/13 (!)
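
A minimal sketch of the pattern this enables, compiling the foreach RAdam step with ``torch.compile`` (illustrative only; real training loops will differ):

```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.RAdam(model.parameters(), lr=1e-3, foreach=True)

@torch.compile
def step():
    opt.step()

model(torch.randn(2, 4)).sum().backward()
step()
opt.zero_grad()
```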

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117912
Approved by: https://github.com/janeyx99
2024-01-27 04:32:27 +00:00
fe10b1800f LazyGraphModule (#117911)
I feel it's easier to open a new PR rather than iterating on the previous PR (https://github.com/pytorch/pytorch/pull/105257 ) since this is more like a rewrite.

In this PR, instead of changing GraphModule directly, which can easily cause BC issues, I create a LazyGraphModule class as Zachary & Jason suggested in comments on the previous PR.

The difference between LazyGraphModule and GraphModule is mainly in how recompilation of the graph module happens. In GraphModule the recompilation happens 'eagerly': constructing a GraphModule causes the recompilation. In LazyGraphModule, we just mark the module as needing recompilation; the real recompilation only happens when absolutely required (e.g. calling the forward method, accessing the code property, etc.). In a lot of cases in torch.compile, the real recompilation is eventually not triggered at all. This can save a few seconds of compilation time.

By default, GraphModule rather than LazyGraphModule is used. `use_lazy_graph_module(True)` context manager can be used to pick LazyGraphModule instead. This has been applied to the torch.compile stack.
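
A usage sketch based on the description above (the import path is an assumption; check where the context manager actually landed):

```python
import torch
from torch.fx._lazy_graph_module import _use_lazy_graph_module  # assumed location

def f(x):
    return torch.sin(x) + 1

with _use_lazy_graph_module(True):
    compiled = torch.compile(f, backend="aot_eager")
    compiled(torch.randn(4))  # intermediate GraphModules defer recompile() until needed
```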

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117911
Approved by: https://github.com/jansel
2024-01-27 04:10:18 +00:00
70699a6357 [C10D] Add tests for gather and gather_object with subgroup (#118359)
Addresses #118337 somewhat- we probably need to update docs. Let's first
confirm what behavior we want.

Identifies a couple of confusing things
1) 'dst' arg for many collectives is always in 'global' rank regardless
   of whether a subgroup is passed in.  This needs a doc update
2) gather_object has a strong dependency on setting the cuda device;
   could we make that smoother?
3) gather_object also should be happy with an empty list on the dst
   side, imo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118359
Approved by: https://github.com/weifengpy
2024-01-27 04:08:56 +00:00
28625d746f [executorch hash update] update the pinned executorch hash (#118443)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118443
Approved by: https://github.com/pytorchbot
2024-01-27 04:08:49 +00:00
993e4f3911 [c10d] relax the nccl error check for nonblocking mode (#118254)
resolve https://github.com/pytorch/pytorch/issues/117749

Summary:
This is the first step to enable NCCL nonblocking mode.

In NCCL nonblocking mode,  ncclInProgress is an expected return value
when checking communicators. Without this relaxation, the watchdog thread
would throw NCCL errors during work checking even though this result is expected.

Test Plan:
Set nonblocking mode in unit tests, and make sure all existing NCCL
tests pass
Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118254
Approved by: https://github.com/kwen2501
2024-01-27 03:49:00 +00:00
40c08795b0 [JIT] python IR bindings: consolidate tests, add short docs in OVERVIEW.md (#118319)
Document the existence of python IR bindings; quick comments about it; and consolidate tests in one file to serve as examples to users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118319
Approved by: https://github.com/eellison
2024-01-27 03:11:51 +00:00
9bce208dfb Replace follow_imports = silent with normal (#118414)
This is a lot of files changed! Don't panic! Here's how it works:

* Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file.
* When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded.
* The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors.
* Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list.
* Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves.
* torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state.
* There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many.

In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file.

The codemod was done with this script authored by GPT-4:

```
import glob

exclude_patterns = [
    ...
]

for pattern in exclude_patterns:
    for filepath in glob.glob(pattern, recursive=True):
        if filepath.endswith('.py'):
            with open(filepath, 'r+') as f:
                content = f.read()
                f.seek(0, 0)
                f.write('# mypy: ignore-errors\n\n' + content)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414
Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD
2024-01-27 02:44:11 +00:00
af1338bfbf fix escape nested comments in C++ (#117882)
Fixes #115243, as it is tricky to deal with the nested comment in doxygen + sphinx. Change 6 below is adopted as the fix. All other changes do not work.

After adopting change 6, I realized the original
`torch::optim::SGD sgd(0.9);` is not a correct call to the SGD constructor, and
modified it to the correct one:
`torch::optim::SGD sgd(model->parameters(), 0.9);`

- Original in [link](https://pytorch.org/cppdocs/api/function_namespacetorch_1ad98de93d4a74dd9a91161f64758f1a76.html#exhale-function-namespacetorch-1ad98de93d4a74dd9a91161f64758f1a76): `///   torch::optim::SGD sgd(/*lr=*/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/0054b355-4925-4112-93b4-9385fdc34bb9)

- Change 1, this solution is referenced from [here](https://stackoverflow.com/questions/24978463/doxygen-escape-nested-comments-in-c): `///   torch::optim::SGD sgd(/&zwj;* lr= *&zwj;/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/77ff2d18-3097-4265-8dcd-31d78acb9c6e)

- Change 2: `///   torch::optim::SGD sgd(/* lr= *//* 0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/b520f8de-ead7-4009-b0fb-f4517daba077)

- Change 3: `///   torch::optim::SGD sgd(/\*lr=\*/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/07e9e608-4640-43c0-994a-37983b803003)

- Change 4: `///   torch::optim::SGD sgd(/&lowast; lr= &lowast;/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/121e55c5-0802-4ff3-bbd7-3521e1299d94)

- Change 5:
```
/// \rst
/// .. code-block:: cpp
///
///   torch::nn::Linear model(3, 4);
///   torch::load(model, "model.pt");
///   \verbatim
///   torch::optim::SGD sgd(/*lr=*/0.9);
///   \endverbatim
///   std::istringstream stream("...");
///   torch::load(sgd, stream);
///
///   auto tensor = torch::ones({3, 4});
///   torch::load(tensor, "my_tensor.pt");
/// \endrst
```
![image](https://github.com/pytorch/pytorch/assets/7495155/e675f551-e939-4be8-b24a-e2e53377dd08)

- Change 6: `///   torch::optim::SGD sgd(0.9);  // 0.9 is the learning rate`
![image](https://github.com/pytorch/pytorch/assets/7495155/ecf0adc4-9b0b-4aef-b0bc-72d4b17c45fa)
![image](https://github.com/pytorch/pytorch/assets/7495155/01bf5d5b-8450-4599-8c9a-00204ab56119)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117882
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-01-27 02:37:23 +00:00
5b31516008 [dynamo] inline torch.jit._unwrap_optional (#118434)
Before this PR, torch.jit._unwrap_optional was in the skipfile list, thus causing a graph break. Checking its implementation, it's just a normal Python function [here](ff8e33556e/torch/jit/_script.py (L1681-L1683)):
```python
def _unwrap_optional(x):
    assert x is not None, "Unwrapping null optional"
    return x
```
We could safely inline it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118434
Approved by: https://github.com/yanboliang
2024-01-27 02:22:14 +00:00
4aa1f994be [dynamo][assume_constant_result] Dont put symbolic guards for assume_constant_result (#118430)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118430
Approved by: https://github.com/ydwu4
2024-01-27 01:56:14 +00:00
838d3620cd [NCCL PG] log NCCL comm at creation and abort (#118335)
Summary: It helps correlate NCCL PG with corresponding NCCL comm in separate logs.

Differential Revision: D53107647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118335
Approved by: https://github.com/wconstab
2024-01-27 01:43:53 +00:00
80cb6db90d [CUDA] [CI] Disable flash attention for sm87 architecture when the head dim > 192 (#117678)
Head dim > 192 requires A100/H100 (sm80 or sm90) per TORCH_CHECK [here](0c26565d5d/aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp (L760)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117678
Approved by: https://github.com/eqy, https://github.com/malfet
2024-01-27 01:22:47 +00:00
7cc7bf9dda [GHF] Add co-authors to PR (#118347)
Mention co-authors in PR body

Modify the `CommitAuthors` query to include the first two commit `authors`, which makes sure that authors from suggested commits are recognized.

Test plan: CI + check `get_authors()` on a few PRs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118347
Approved by: https://github.com/kit1980
2024-01-27 01:02:49 +00:00
4d771c56de [xnnpack] Move x86 flags to platform_compiler_flags (#117923)
Summary:
AVX extension flags are x86-specific, and clang-18 has started to error on them when building targets that are not x86. I couldn't find the upstream change that made these flags an error, but it's fairly clear that these flags do not apply to all architectures.

Most of the flags are already defined in `platform_compiler_flags`. The changes made are:
* Gate the flags under `compiler_flags` with `selects`
* If flags weren't defined in `platform_compiler_flags`, define them there as well
* Remove the `^` and `$` in the platform regex. Not all flavors start with the platform (e.g. `android-x86_64`).
* Some minor formatting changes were also included here.

Test Plan:
Atop D52741786,
```
buck2 build --flagfile 'arvr/mode/android/apk/linux/opt'  '//arvr/projects/mixedreality/android/ocean_passthrough_service:ocean_passthrough_mrservice_dev'
```

Differential Revision: D52856224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117923
Approved by: https://github.com/mcr229
2024-01-26 23:41:06 +00:00
ff8e33556e Enables load balancing duplicates in DCP (#116469)
Enables the deduplication of saved entries by load balancing duplicates across ranks.

Tested with existing and modified tests. Additionally tested with the following code snippet, which saves a 20GB DDP model in **~3 seconds on 8 ranks**. Before this PR, the same operation was measured at ~19 seconds.

```
# Note: Model, rank_0_print, _patch_model_state_dict and _FileSystemCheckpointer
# are helpers used in the author's benchmark and are not defined in this snippet.
import os
import time

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel


def run(local_rank, world_size, param_size, num_params, work_dir):

    os.environ["RANK"] = str(local_rank)
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    dist.init_process_group(backend="nccl", rank=local_rank, world_size=world_size)

    model = Model(param_size=param_size, num_params=num_params)
    model = DistributedDataParallel(model, gradient_as_bucket_view=True)
    _patch_model_state_dict(model)

    sz = sum(t.nelement() * t.element_size() for t in model.parameters())
    rank_0_print(f"Model size: {sz / 1_000_000_000.0} GB")
    rank_0_print("Saving the model with DCP...")

    checkpointer = _FileSystemCheckpointer(
        f"{args.work_dir}/dcp",
        sync_files=False,
        single_file_per_rank=False,
        thread_count=1
    )

    begin_ts = time.monotonic()
    checkpointer.save(state_dict={"model": model})
    end_ts = time.monotonic()
    rank_0_print(f"Took {end_ts - begin_ts} seconds with DCP")
```

Differential Revision: [D52435926](https://our.internmc.facebook.com/intern/diff/D52435926/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116469
Approved by: https://github.com/fegin, https://github.com/wz337
2024-01-26 22:34:14 +00:00
b95c45fbf7 add stack trace to device skip (#118112)
Log the stack trace of the offending CPU use if it causes cudagraphs to be disabled. Also refactor `disable_cudagraphs: bool` and `disable_cudagraphs_reason: str -> Optional[str]`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118112
Approved by: https://github.com/bdhirsh
2024-01-26 22:33:48 +00:00
b256b7b348 Add way to actually delete a torch.library.Library object (#118318)
Relying on object lifetimes in Python is a bad idea due to reference
cycles. Previously, when a torch.library.Library object gets destroyed,
it clears all the registrations associated with it, but it's unclear
when it actually gets destroyed due to the existence of refcycles.

This PR:
- adds torch::Library::clear(), which deterministically releases all of
  the RAII registration handles of the torch::Library object
- adds a new `torch.library._scoped_library` context manager, which creates
  a library and cleans it up at the end of the scope using the previous item.
  All tests (unless they already handle library lifetimes) should use
  this new API
- Rewrites some flaky tests to use `_scoped_library`.

In the future we'll probably migrate all of our torch.library tests to
use `_scoped_library`, but that's kind of annoying because we have
multiple thousands of LOC

I'm hoping this will deflake those tests; we'll see.
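
A minimal usage sketch of the new context manager (the namespace, schema, and assertions below are illustrative assumptions, not code from this PR): registrations made inside the scope are released deterministically when the scope exits, so they cannot leak into other tests.

```python
import torch

def test_my_op():
    # The library and all registrations made through it are cleared when the
    # with-block exits, regardless of Python object lifetimes.
    with torch.library._scoped_library("mylib", "FRAGMENT") as lib:
        lib.define("foo(Tensor x) -> Tensor")
        lib.impl("foo", lambda x: x.clone(), "CPU")
        x = torch.randn(3)
        assert torch.equal(torch.ops.mylib.foo(x), x)
    # Outside the scope, mylib::foo is no longer registered.
```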
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118318
Approved by: https://github.com/albanD
2024-01-26 22:30:51 +00:00
f129e3fe03 [inductor] Handle cum{sum,prod} on zero-dim tensors (#117990)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117990
Approved by: https://github.com/lezcano
2024-01-26 22:21:42 +00:00
074ac822d5 [ONNX] Skip empty input test case in aten_mm (#118413)
Fixes #117718
Fixes #117725

It's actually a known issue in https://github.com/microsoft/onnxscript/pull/586, and we already exclude the empty-input test cases in aten_matmul. This PR follows that skip and adds aten_mm as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118413
Approved by: https://github.com/thiagocrepaldi
2024-01-26 22:06:57 +00:00
eee63ac845 [dynamo] move torch._C._get_cublas_allow_tf32 to constant_fold_functions (#118342)
Previously, I created a value match for torch._C._get_cublas_allow_tf32; it should just be in constant_fold_functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118342
Approved by: https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #118236
2024-01-26 22:00:00 +00:00
d41cfc92e6 [CI] simplify mergeability check workflow (#118415)
Test run:
https://github.com/pytorch/pytorch/actions/runs/7673050632/job/20914851421?pr=118415
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118415
Approved by: https://github.com/PaliC, https://github.com/huydhn
2024-01-26 21:45:24 +00:00
84251d1d71 [ez] Windows log printing + save successful test logs (#118124)
When doing `print(f.read().decode(...))` it prints an extra newline, so manually splitlines and strip to see if that helps.

My guess is Windows line-ending differences.

Also always save log file regardless of success or failure

See 476b81a9bf for what it looks like now

Swapped to opening in text mode instead of binary, seems to be ok now.

42483193bf024983060a234dc0262f4840aef4b8 for example
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118124
Approved by: https://github.com/huydhn
2024-01-26 21:14:25 +00:00
5c56822be2 [export] Various fixes to .module() (#118272)
Summary: While turning on .module() for all the export tests, I uncovered some bugs with .module() and while fixing them I ended up rewriting some of the code... Some of the bugs were:

* bad kwargs support on the unlifted module
* no support for user input mutations
* (at the commit hash I was working off of) no support for custom objects
* there were no tests on unlifting weights from cond/map submodules

Test Plan: CI

Differential Revision: D53075380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118272
Approved by: https://github.com/suo
2024-01-26 21:05:07 +00:00
2ed1b1747a Fix Auto Functionalize to handle specified default values (#118331)
Summary: When there were optionals with specified default values, the code was improperly handling the number of parameters, causing `IndexError: tuple index out of range`.

Test Plan: New tests.

Reviewed By: zou3519

Differential Revision: D53095812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118331
Approved by: https://github.com/zou3519
2024-01-26 20:31:38 +00:00
07499074bb Increasing session duration for AWS credentials for _rocm-test.yml (#118412)
The workflow _rocm-test.yml needs a longer session duration for its AWS role keys.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118412
Approved by: https://github.com/jeffdaily, https://github.com/huydhn
2024-01-26 19:32:24 +00:00
939008a268 Fix RuntimeError: NYI: Named tensors are not supported with the tracer (#118393)
This PR relands #108238 that was closed as stale due to CLA issues and also because the CI check has marked the PR as not mergeable.

Repro 1:

```python
import torch

def f(x):
    return x[x > 0]

jf = torch.jit.trace(f, torch.tensor(2., device="cuda"))
```
Error:

```bash
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/pytorch/torch/jit/_trace.py", line 874, in trace
    traced = torch._C._create_function_from_trace(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<stdin>", line 2, in f
RuntimeError: NYI: Named tensors are not supported with the tracer
```

Repro2:

```python
import torch
import torch.nn.functional as F
from torch import nn
import copy

class Net(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inputs):
        x = copy.deepcopy(inputs) # RuntimeError: NYI: Named tensors are not supported with the tracer
        x = F.relu(x)
        return x

model = Net()
images = torch.randn(8, 28, 28)
torch.jit.trace(model, images)
```

Error 2:

```bash
Traceback (most recent call last):
  File "/opt/pytorch/test_deepcopy.py", line 18, in <module>
  File "/opt/pytorch/torch/jit/_trace.py", line 806, in trace
    return trace_module(
           ^^^^^^^^^^^^^
  File "/opt/pytorch/torch/jit/_trace.py", line 1074, in trace_module
    module._c._create_method_from_trace(
  File "/opt/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/torch/nn/modules/module.py", line 1501, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/test_deepcopy.py", line 12, in forward
    x = F.relu(x)
        ^^^^^^^^^^
  File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/opt/pytorch/torch/_tensor.py", line 122, in __deepcopy__
    new_storage = self._typed_storage()._deepcopy(memo)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/torch/storage.py", line 847, in _deepcopy
    return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/opt/pytorch/torch/storage.py", line 112, in __deepcopy__
    new_storage = self.clone()
                  ^^^^^^^^^^^^
  File "/opt/pytorch/torch/storage.py", line 126, in clone
    return type(self)(self.nbytes(), device=self.device).copy_(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NYI: Named tensors are not supported with the tracer
```

----
 #48054 RuntimeError: NYI: Named tensors are not supported with the tracer
 #49538 jit tracer doesn't work with unflatten layer
 #31591 when i try to export a pytorch model to ONNX, got RuntimeError: output of traced region did not have observable data dependence with trace inputs; this probably indicates your program cannot be understood by the tracer.
   - This bug was closed but still exists; multiple comments on it are still reporting the error. This is addressed here.

Likely fixes the following issues (but untested)

 #63297 Named tensor in tracer
 #2323 [Bug] torch.onnx.errors.UnsupportedOperatorError when convert mask2former to onnx

Fix zero-dimensional tensors when used with jit.trace. They are currently assigned an empty set for names `{}`; this is not the same as "no name", so jit.trace bails with
  "NYI: Named tensors are not supported with the tracer".
This happens when trying to save a non-trivial model as ONNX, but the simplest repro I have seen is #48054 above, which has been added as test/jit/test_zero_dim_tensor_trace.py.

Test plan:
  New unit test added
  Broken scenarios tested locally
  CI

Fixes #48054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118393
Approved by: https://github.com/zou3519
2024-01-26 19:31:23 +00:00
bfbb8d8220 Don't manually invoke atexit exit handlers in tests (#118409)
Fixes https://github.com/pytorch/pytorch/issues/104098

This is a bad idea because it runs all the exit handlers and messes with
global state that is necessary for other tests to run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118409
Approved by: https://github.com/ydwu4, https://github.com/yanboliang
ghstack dependencies: #118152, #118309
2024-01-26 19:11:19 +00:00
728789d850 Deflake stream tests, part 2 (#118391)
I missed these the first time around, some more streams need to be
synchronized.

Fixes https://github.com/pytorch/pytorch/issues/112694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118391
Approved by: https://github.com/ydwu4, https://github.com/yanboliang
2024-01-26 19:10:53 +00:00
e696fa1ee7 [tp] enable rowwise embedding sharding in RowwiseParallel (#118242)
As titled, this PR enables rowwise embedding sharding in the
RowwiseParallel style, and adds tests to ensure it's working as expected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118242
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079, #118080
2024-01-26 19:01:24 +00:00
dc8357b397 [dtensor] implement dim-0 (row) embedding sharding with MaskPartial (#118080)
This PR adds support for rowwise sharded embedding by adding a
MaskPartial placement that inherits from the default partial placement
and overrides the Partial contracts to construct the mask and release
the mask after the reduction.

The MaskPartial placement has the potential to support other ops'
sharding computation that requires a mask for semantic correctness.
It currently lives in the embedding ops, but we can move it to a
common place if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118080
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079
2024-01-26 19:01:24 +00:00
910b49c48b [dtensor] rewrite embedding ops using op strategy (#118079)
This PR rewrites the sharded embedding rule to use OpStrategy instead of the
rule, one step further toward getting rid of rules and consolidating the
embedding operator implementation, in preparation for the rowwise embedding
implementation, which will come in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118079
Approved by: https://github.com/tianyu-l
2024-01-26 19:01:15 +00:00
25f72194e8 Realize inputs to DynamicScalar before unwrapping storage (#118125)
Fixes https://github.com/pytorch/pytorch/issues/118102

Unfortunately, the test still fails due to an unrelated problem https://github.com/pytorch/pytorch/issues/117665

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118125
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #117862
2024-01-26 18:08:03 +00:00
96d94f574e Fix several bugs related to unbacked SymInt codegen in inductor (#117862)
Let me tell you, this was a *journey.*

* When we repropagate through the FX interpreter in AOTAutograd, this will reallocate unbacked SymInts. We can eliminate all of these fresh allocations by appropriately asserting equalities on them and setting up replacements. See also https://github.com/pytorch/pytorch/issues/111950
* The `inner_fn` of Loops can contain references to unbacked SymInts. We must collect them to prevent DCE.
* Export naughtily accessed `_expr` when it should have accessed `expr` on SymNode. Fixed two sites of this.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117862
Approved by: https://github.com/bdhirsh
2024-01-26 18:08:03 +00:00
89a0b1df51 fix lint for cudnn codes (#117091)
Fixes the lint issue described in https://github.com/pytorch/pytorch/pull/116759

@albanD Please have a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117091
Approved by: https://github.com/albanD
2024-01-26 17:53:22 +00:00
2842d3c9d3 [Nested Tensor] view: basic support for ragged_idx != 1 and _unsafe_view (#118317)
Use case: `_unsafe_view` is used in aot_autograd to create a view that doesn't register as a view:

eebe7e1d37/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L470-L476)

If a transposed nested tensor (i.e. NT with ragged_idx != 1) encounters this code path, it previously would fail for two reasons: 1) because `_unsafe_view` isn't registered, and 2) because ragged_idx != 1 is not supported. This PR adds support for `_unsafe_view` (completely reusing the implementation of `view`; this just registers `_unsafe_view` as another op using the same implementation). It also adds support for ragged_idx != 1, but only for trivial cases where inp._size == size (the use case used by aot_autograd).

Tests: verify that the result of `_unsafe_view` doesn't have a `_base`, and that simple views on transposed NTs work.

Differential Revision: [D53096814](https://our.internmc.facebook.com/intern/diff/D53096814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118317
Approved by: https://github.com/soulitzer
2024-01-26 17:29:37 +00:00
533637d9a3 Revert "Check if enable inside run call (#118101)"
This reverts commit 2abb812a78c0d3976e6eb10114716bcb163480ca.

Reverted https://github.com/pytorch/pytorch/pull/118101 on behalf of https://github.com/clee2000 due to broke periodic multigpu test some how 6fc015fedc ([comment](https://github.com/pytorch/pytorch/pull/118101#issuecomment-1912357321))
2024-01-26 16:41:56 +00:00
f1aef2c094 Don't check is_conj for _refs.linalg.svd (#117972)
The flag is not correctly set when PyTorch is compiled with GPU support resulting in failures in
`test_ops.py::test_python_ref_meta__refs_linalg_svd_cpu_complex`.

Use a similar approach to test_meta and skip the check for this function.

Workaround for #105068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117972
Approved by: https://github.com/lezcano
2024-01-26 15:24:29 +00:00
af8f37c2b6 Revert "Use SEQUENTIAL posix_fadvise on mmapped files (#117805)"
This reverts commit 401aa1a1deaee19909c957d7d56d91341018b4dc.

Reverted https://github.com/pytorch/pytorch/pull/117805 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/117805#issuecomment-1912204403))
2024-01-26 14:59:58 +00:00
cyy
6da0e7f84b [Clang-tidy header][17/N] Apply clang-tidy on headers in torch/csrc/cuda (#117829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117829
Approved by: https://github.com/albanD
2024-01-26 13:33:24 +00:00
8ff55c7e68 Clarified sampling process of torch.randn for complex dtypes. (#118315)
Fixes #118269.

Clarified the docs of `torch.randn` and `torch.randn_like`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118315
Approved by: https://github.com/lezcano
2024-01-26 13:05:19 +00:00
b66c4eda61 [Inductor] Add Thread Number Checker in scatter_reduce_ fallback for CPP backend (#118278)
**Summary**
Follow-up of https://github.com/pytorch/pytorch/pull/108220, which improves the performance of `basic_gnn_gin`, `basic_gnn_sage` and `basic_gnn_gcn` in the multi-thread test cases. However, it causes a performance regression for these 3 models in the single-thread test case, as reported in https://github.com/pytorch/pytorch/issues/117740. Fix the single-thread issue in this PR by adding a thread-number check to decide whether to fall back to `scatter_reduce_` or not.

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_scatter_using_atomic_add
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118278
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-01-26 12:43:25 +00:00
0857a3a753 [c10d_functional] fix an issue where mutation on views fails in inductor (#118333)
`_CollectiveKernel.create_inplace` expresses mutation with the newly introduced `MutationOutput` which requires the `layout` of the input. Currently, there's a bug where if the input is a view, `inp.layout` fails. This PR fixes the issue by unwrapping the input if it's a view.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118333
Approved by: https://github.com/wanchaol
2024-01-26 11:13:30 +00:00
4d0b471389 fix key error in pre_grad fx_passes_numeric_check (#118325)
Summary:
```
I0125 121749.865 pyper_config_utils.py:8225] torchdynamo pyper config = TorchDynamoConfig(backend='inductor', optimize_ddp=False, log_compile_graph=False, inductor_config=TorchInductorConfig(enable_cudagraph=False, max_autotune=False, max_autotune_pointwise=True, max_autotune_gemm=False, search_autotune_cache=False, autotune_in_subproc=False, aggressive_fusion=False, shape_padding=True, permute_fusion=False, epilogue_fusion_first=False, debug=True, triton=None, trace_enabled=False, log_kernel_source=False, split_cat_fx_passes=False, group_fusion=False, batch_fusion=False, coordinate_descent_tuning=False, coordinate_descent_check_all_directions=False, coordinate_descent_search_radius=1, layout_optimization=True, pre_grad_fusion_options={}, post_grad_fusion_options={}, max_pointwise_cat_inputs=4, fx_passes_numeric_check={}), automatic_dynamic_shapes=True)
```
In trainer
```
I0125 12:58:51.832000 4011.139732263132160 torchdynamo_wrapper.py:291  trainer:0:1 ] [pt2] creating torchdynamo backend wrapper with settings TorchDynamoConfig(backend='inductor', optimize_ddp=False, log_compile_graph=False, inductor_config=TorchInductorConfig(enable_cudagraph=False, max_autotune=False, max_autotune_pointwise=True, max_autotune_gemm=False, search_autotune_cache=False, autotune_in_subproc=False, aggressive_fusion=False, shape_padding=True, permute_fusion=False, epilogue_fusion_first=False, debug=True, triton=None, trace_enabled=False, log_kernel_source=False, split_cat_fx_passes=False, group_fusion=False, batch_fusion=False, coordinate_descent_tuning=False, coordinate_descent_check_all_directions=False, coordinate_descent_search_radius=1, layout_optimization=True, pre_grad_fusion_options={}, post_grad_fusion_options={}, max_pointwise_cat_inputs=4, fx_passes_numeric_check={}), automatic_dynamic_shapes=True) #ai_training_job_id="febe34d9-b2fb-493e-a5cc-6a0b1dc85ad4" #ai_training_local_rank="1" #ai_training_role_rank="1" #mast_job_attempt="2" #mast_job_name="f525072920-TrainingApplication"
...
if config.fx_passes_numeric_check["pre_grad"]:
```

https://www.internalfb.com/diff/D52826442?dst_version_fbid=1115735309429172&transaction_fbid=682438900759710

https://www.internalfb.com/diff/D51838043?dst_version_fbid=336373395892373&transaction_fbid=349901787874069

This diff first fixes the key error to restore broken tests.  Its pyper changes can be addressed later.

https://www.internalfb.com/code/fbsource/[72c19313ed73]/fbcode/caffe2/torch/_inductor/config.py?lines=142-147

Test Plan: buck2 run //caffe2/torch/fb/training_toolkit/integration_tests/training_lifecycle/cogwheel_tests/pyper_release_v2:cogwheel_smallworld_mimo_cmf_deterministic_ne_pt2_training_platform__canary_offline_training-launcher -- --build-fbpkg --run-disabled --tests test

Reviewed By: yusuo

Differential Revision: D53102344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118325
Approved by: https://github.com/mengluy0125
2024-01-26 11:02:12 +00:00
8dd1be49b7 [Inductor] Use sleef implementation for CPP backend acosh codegen (#118350)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/118267. The current cpp backend uses `f"({x} + ({x}*{x} - {vec_one}).sqrt()).log()"` to calculate `acosh`; the issue happens when the input is a large negative value like `-910685.8125`. In this case, `(x*x - 1).sqrt() + x` equals 0, and `0.log()` returns `-inf`. However, based on the documentation (https://pytorch.org/docs/stable/generated/torch.acosh.html), negative inputs should return `NaN`. Use the sleef acosh implementation to fix this issue.
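
A small numeric check of the cancellation described above (the input value is the one from the linked issue): the naive log-based formula collapses to `-inf` for large negative inputs, while `torch.acosh` correctly returns `NaN`.

```python
import torch

x = torch.tensor(-910685.8125)
# sqrt(x*x - 1) is approximately -x for large negative x, so the sum cancels to 0
naive = torch.log(x + torch.sqrt(x * x - 1.0))
print(naive)           # tensor(-inf)
print(torch.acosh(x))  # tensor(nan) -- acosh is undefined for inputs < 1
```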

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_acosh_with_negative_large_input
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118350
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-01-26 10:19:40 +00:00
2ea38498b0 [FSDP][BE] Only show state_dict log when the debug level is detail (#118196)
As title

Differential Revision: [D53038704](https://our.internmc.facebook.com/intern/diff/D53038704/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118196
Approved by: https://github.com/rohan-varma, https://github.com/wz337
ghstack dependencies: #118197, #118195
2024-01-26 09:52:36 +00:00
4f4e61bb75 [DCP] Add tests to demonstrate DCP checkpoint conversion (#117773)
As title

Differential Revision: [D52854759](https://our.internmc.facebook.com/intern/diff/D52854759/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117773
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #116248, #117772
2024-01-26 09:39:10 +00:00
644bc69530 [DCP] Allow users to save and load without creating storage reader and writer (#117772)
Right now the DCP API requires users to create a StorageWriter and StorageReader for every API call. This PR allows users to pass only the checkpointer_id (a path) and use it to read/write a checkpoint without creating a StorageReader and Writer.

Differential Revision: [D52740556](https://our.internmc.facebook.com/intern/diff/D52740556/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117772
Approved by: https://github.com/wz337
ghstack dependencies: #116248
2024-01-26 09:08:35 +00:00
fc30bd3b7b Revert "[dtensor] rewrite embedding ops using op strategy (#118079)"
This reverts commit e599a0879684abedec2a28b08b822fd4a4219105.

Reverted https://github.com/pytorch/pytorch/pull/118079 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/118079#issuecomment-1911681293))
2024-01-26 08:47:14 +00:00
bfb5e7642e Revert "[dtensor] implement dim-0 (row) embedding sharding with MaskPartial (#118080)"
This reverts commit 8cc02b46c33b5192289e4cf64fa55d685127bfb8.

Reverted https://github.com/pytorch/pytorch/pull/118080 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/118079#issuecomment-1911681293))
2024-01-26 08:47:14 +00:00
bc67f87559 Revert "[tp] enable rowwise embedding sharding in RowwiseParallel (#118242)"
This reverts commit 7a9012d7e847a6265e70873e9baab70838edd601.

Reverted https://github.com/pytorch/pytorch/pull/118242 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/118079#issuecomment-1911681293))
2024-01-26 08:47:14 +00:00
2c9a90cde6 [ROCm] backward compatible type enums (#118137)
Fixes builds of pytorch using unreleased ROCm packages that are missing type enums introduced in ROCm 6.0 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118137
Approved by: https://github.com/xw285cornell, https://github.com/anupambhatnagar
2024-01-26 08:40:13 +00:00
f8e14f3b46 [PyTorch][Vulkan] Clean up aten::stack (#118314)
Summary:
After D50347338, we already support zero-dim tensor input, which was my original task. As a result, this diff doesn't add or change functionality; it just cleans up the following:
1. Fix TORCH_CHECK to only allow `tensor.dim() <= 3`. Previously, it was a no-op since it didn't use `&&`.
2. Add `tensor.dim() == 0` tests.
3. Address `readability-container-size-empty` and `performance-unnecessary-copy-initialization` linter errors.

Test Plan:
Tested on OD.
```
[jorgep31415@29786.od /data/sandcastle/boxes/fbsource (1d0b920e0)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -c pt.vulkan_full_precision=1 -- --gtest_filter="*stack*"
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/ops/Unsqueeze.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
3 additional file change events
Buck UI: https://www.internalfb.com/buck2/98bb3bfa-a1d1-440e-8724-b4990c9cc7ca
Network: Up: 1.4MiB  Down: 377KiB  (reSessionID-6eccf420-3951-4942-9350-998803589b8d)
Jobs completed: 17. Time elapsed: 42.6s.
Cache hits: 38%. Commands: 8 (cached: 3, remote: 0, local: 5)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *stack*
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.stack_invalid_inputs
[       OK ] VulkanAPITest.stack_invalid_inputs (27 ms)
[ RUN      ] VulkanAPITest.stack_0d
[       OK ] VulkanAPITest.stack_0d (28 ms)
[ RUN      ] VulkanAPITest.stack_1d
[       OK ] VulkanAPITest.stack_1d (1 ms)
[ RUN      ] VulkanAPITest.stack_2d
[       OK ] VulkanAPITest.stack_2d (148 ms)
[ RUN      ] VulkanAPITest.stack_3d
[       OK ] VulkanAPITest.stack_3d (354 ms)
[----------] 5 tests from VulkanAPITest (561 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (561 ms total)
[  PASSED  ] 5 tests.
```

Reviewed By: copyrightly, liuk22

Differential Revision: D53071188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118314
Approved by: https://github.com/liuk22
2024-01-26 04:28:06 +00:00
2b1ee9be7a [executorch hash update] update the pinned executorch hash (#118339)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118339
Approved by: https://github.com/pytorchbot
2024-01-26 04:26:38 +00:00
0c5da6100f [PyTorch][Vulkan] Clean up aten::unsqueeze (#118311)
Summary:
After D50347338, we already support zero-dim tensor input, which was my original task. As a result, this diff doesn't add or change functionality; it just cleans up the following:
1. Fix TORCH_CHECK to only allow `tensor.dim() <= 3`. Previously, it was a no-op since it didn't use `&&`.
2. Add 0->1 `tensor.dim()` tests.
3. Remove `dim == 0` case from shader since that path is never executed. The `cpp` code sends the input to `submit_copy` instead.

Test Plan:
Tested on OD.
```
[jorgep31415@29786.od /data/sandcastle/boxes/fbsource (c66693c95)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -c pt.vulkan_full_precision=1 -- --gtest_filter="*unsqueeze*"
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
Buck UI: https://www.internalfb.com/buck2/16cf8f59-e535-493b-b123-5952ef8f1453
Network: Up: 21KiB  Down: 1.4MiB  (reSessionID-1219eefd-e78b-4bfd-aef8-8e4b38da82f8)
Jobs completed: 8. Time elapsed: 37.8s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 1, local: 2)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *unsqueeze*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.unsqueeze_0dto1d_dim0
[       OK ] VulkanAPITest.unsqueeze_0dto1d_dim0 (61 ms)
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim0
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim0 (0 ms)
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim1
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim1 (110 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim0
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim0 (16 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim1
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim1 (58 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim2
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim2 (2 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim0
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim0 (16 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim1
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim1 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim2
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim2 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim3
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim3 (1 ms)
[----------] 10 tests from VulkanAPITest (270 ms total)

[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (270 ms total)
[  PASSED  ] 10 tests.

```

Also, to improve my confidence in unit tests, I modified [force_flush.py](https://www.internalfb.com/code/fbsource/[6e606c6f62dafd2121e78ffe14ae12f1b6d8d405]/fbcode/wearables/camera/ml/pytorch_vulkan_native/demo/force_flush.py) to run several combinations of `aten::unsqueeze` on OD.

Verified these work as expected.
```
torch.zeros([])
torch.randn([])
torch.rand([])
torch.ones([])
torch.tensor(0, dtype=torch.float)
```

Found that Vulkan in general does not support the following. That's ok though since it's technically a 1d tensor which is not part of my task.
```
torch.tensor([])
```

Differential Revision: D53071189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118311
Approved by: https://github.com/liuk22
2024-01-26 04:22:54 +00:00
8467de4e97 Fix kaiser_window for lower precision data types on CPU (#117345)
Fixes #117230.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117345
Approved by: https://github.com/jgong5, https://github.com/soumith
2024-01-26 03:26:12 +00:00
eqy
ef29fe745f [CUDA] Add missing TF32 annotation to test_uint4x2_mixed_mm (#118143)
Addresses numerical mismatches seen on architectures with TF32.

CC @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118143
Approved by: https://github.com/nWEIdia, https://github.com/jansel
2024-01-26 03:23:22 +00:00
b599f5608c Fix mergeability check for ghstack PRs (#118258)
# Changes
* introduce `--check-mergeability` trymerge flag that attempts to merge PR locally, using the same merge logic as the mergebot, but requires just a read-only `GITHUB_TOKEN` and git repo.
* change mergeability workflow to utilize the new --check-mergeability logic

# Alternatives considered

1.
> Rewrite `https://github.com/pytorch/test-infra/actions/workflows/pr-dependencies-check.yml` to correctly support partially merged ghstacks.

That would be a slightly better approach, but ROI is lower, as it requires reimplementing trymerge logic and additional effort to consolidate the codebase (trymerge lives in pytorch repo).

`pr-dependencies-check.yml` still produces human-readable results for partially merged ghstack prs (even if it falsely reports them as non-mergeable).

2.

> Instead of introducing new trymerge flag, use existing flags, including `--dry-run`.

That didn't work, as no combination of existing flags skips the rule checks and ROCKSET lookups.

# Testing

1. Manual testing  `trymerge.py --check-mergeability`  on the regular and ghstack PRs:

```
export GITHUB_TOKEN=
export GIT_REPO_DIR=`pwd`
export GITHUB_REPOSITORY=pytorch/pytorch
export GIT_REMOTE_URL=https://github.com/pytorch/pytorch

# Test 1 (2 prs, 1 is closed)
python3 ../pytorch/.github/scripts/trymerge.py --check-mergeability  117862
Skipping 1 of 2 PR (#117859) as its already been merged

echo $?
0

# Test 2 (3 prs, 1 is closed)
python3 ../pytorch/.github/scripts/trymerge.py --check-mergeability  118125
Skipping 1 of 3 PR (#117859) as its already been merged

echo $?
0

# Test 3 (3 prs, intentional conflicts introduced into `main`):

python3 ../pytorch/.github/scripts/trymerge.py --check-mergeability  118125
Skipping 1 of 3 PR (#117859) as its already been merged
stdout:
Auto-merging torch/_inductor/ir.py
Auto-merging torch/_inductor/lowering.py
CONFLICT (content): Merge conflict in torch/_inductor/lowering.py
error: could not apply 66ba5b8792f... Realize inputs to DynamicScalar before unwrapping
...
RuntimeError: Command `git -C /Users/ivanzaitsev/pytorch2 cherry-pick -x 66ba5b8792fa076c4e512d920651e5b6b7e466f4` returned non-zero exit code 1
```

2.  Workflow run:
https://github.com/pytorch/pytorch/actions/runs/7660736172/job/20878651852?pr=118258

<img width="516" alt="image" src="https://github.com/pytorch/pytorch/assets/108101595/28fbf0d2-ac2a-4518-b41d-b32b41373747">
<img width="621" alt="image" src="https://github.com/pytorch/pytorch/assets/108101595/ddbf8566-a417-43ec-9d0e-f623f4a71313">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118258
Approved by: https://github.com/PaliC, https://github.com/huydhn
2024-01-26 03:15:56 +00:00
4e456fd95b [AOTI] Support scalar to tensor in the ABI-compatible mode (#118024)
Differential Revision: [D53019485](https://our.internmc.facebook.com/intern/diff/D53019485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118024
Approved by: https://github.com/ezyang
2024-01-26 03:15:05 +00:00
66c3152e36 [CI] Build docker on larger runners (#118167)
Otherwise it takes 1+h to build CUDA12.1 docker
- Limit UCC builds to just sm_52(M60) and sm_86(A10G), which I think has the biggest impact
- Replace hardcoded `-j6` build parallelism with more dynamic `-j$[$(nproc) - 2]`
- Remove redundant check about Ubuntu-14.04
- Added `DOCKER_BUILDKIT` to parallelize the builds

As a result, docker build time drops from over 1 hour to 35 min.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118167
Approved by: https://github.com/huydhn
2024-01-26 02:28:25 +00:00
385d8b32fc Update PocketFFT submodule (#118348)
Accidentally downgraded by force merge of https://github.com/pytorch/pytorch/pull/117804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118348
Approved by: https://github.com/kit1980
2024-01-26 02:01:06 +00:00
3cdd4e236e [inductor][easy] dump triton kernel names in the log (#118313)
This may help debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118313
Approved by: https://github.com/desertfire
2024-01-26 02:00:04 +00:00
7a9012d7e8 [tp] enable rowwise embedding sharding in RowwiseParallel (#118242)
As titled, this PR enables rowwise embedding sharding in the
RowwiseParallel style, and adds tests to ensure it's working as expected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118242
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079, #118080
2024-01-26 01:36:24 +00:00
8cc02b46c3 [dtensor] implement dim-0 (row) embedding sharding with MaskPartial (#118080)
This PR adds support for rowwise sharded embedding by adding a
MaskPartial placement that inherits from the default partial placement
and overrides the Partial contracts to construct the mask and release
the mask after the reduction.

The MaskPartial placement has the potential to support other ops'
sharding computation that requires a mask for semantic correctness.
It currently lives in the embedding ops, but we can move it to a
common place if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118080
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079
2024-01-26 01:36:24 +00:00
3d062f9abe Revert "[pytorch][kineto] log process group config in distributed info (#117774)"
This reverts commit 9c1348feb3de872f7cabd807abbc228e7192cd46.

Reverted https://github.com/pytorch/pytorch/pull/117774 on behalf of https://github.com/aaronenyeshi due to This diff is breaking internal jobs, but has been internally reverted ([comment](https://github.com/pytorch/pytorch/pull/117774#issuecomment-1911251092))
2024-01-26 01:10:31 +00:00
6596a3f23d [Export] Remove ScriptObjectMeta (#118241)
Summary: As title. Use CustomObjArgument as ScriptObjectMeta

Test Plan: CIs

Reviewed By: zhxchen17

Differential Revision: D53062230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118241
Approved by: https://github.com/zhxchen17
2024-01-26 00:37:19 +00:00
401aa1a1de Use SEQUENTIAL posix_fadvise on mmapped files (#117805)
In theory this tells the system that we will access the file sequentially, which allows prefetching of future blocks. In practice it doubles the read-ahead size on Linux (which effectively doubles the read sizes).

Without this, CUDA uploads of files that aren't already in FS cache, using mmapped files (safetensors) as source, run at ~1 GB/s (from an SSD that has ~7 GB/s read speed...).

With this, they run at ~1.5 GB/s which is still bad but better than before!

It is possible to increase the read performance further by touching the pages from multiple threads; in fact, when the tensors loaded from the file are used by the CPU, we get fairly good load performance (~5 GB/s), which appears to be because multiple threads page fault and trigger more concurrent reads which improves SSD read throughput... however, this is not the case for CUDA uploads, and it is difficult to make that change in a generic way because it's unclear what the usage pattern of the input file is going to be.

All of the numbers above are taken on Samsung 990 Pro SSD, on Linux kernel 6.5 with FS cache cleared between every attempt to load a file. The file is loaded via `safetensors.safe_open` which uses UntypedTensor.from_file to load the file into memory, which in turn uses MapAllocator.cpp.

I felt safe doing this change unconditionally but please let me know if you'd like to see a separate allocator flag for this, propagated through to UntypedTensor. Note that the fadvise API is not available on macOS.
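
For reference, a minimal illustration of the hint being described, using Python's `os` module, which exposes the same POSIX call (Linux/Unix only; the file name below is just an assumed example):

```python
import mmap
import os

fd = os.open("model.safetensors", os.O_RDONLY)  # assumed example file
# Tell the kernel we will read sequentially so it can use a larger read-ahead window.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
size = os.fstat(fd).st_size
buf = mmap.mmap(fd, size, prot=mmap.PROT_READ)  # subsequent page faults benefit from read-ahead
```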
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117805
Approved by: https://github.com/mikaylagawarecki
2024-01-26 00:26:57 +00:00
de9ddd19a5 Various CI settings (#117668)
Test [ci-verbose-test-logs] (this worked: the test logs print while running, are interleaved, and are really long).

Settings for no timeout (step timeout still applies, only gets rid of ~30 min timeout for shard of test file) and no piping logs/extra verbose test logs (good for debugging deadlocks but results in very long and possibly interleaved logs).

Also allows these to be set via the PR body if the label name is in brackets, e.g. [label name], as in the test above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117668
Approved by: https://github.com/huydhn
2024-01-26 00:17:29 +00:00
8c167f9fc3 [CMake] Explicitly error out if CuDNN older than 8.5 (#118235)
Also update README.md
Fixes https://github.com/pytorch/pytorch/issues/118193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118235
Approved by: https://github.com/zou3519
2024-01-25 23:41:04 +00:00
71757093c5 [dynamo] avoid graph break on torch.backends.cuda.matmul.allow_tf32 (#118236)
Before the PR, we have a graph break for the following test:
```python
    def test_cublas_allow_tf32(x):
        if torch.backends.cuda.matmul.allow_tf32:
            return x.sin() + 1

        return x.cos() - 1
```

In this PR, we first add "torch.backends.cuda" to MOD_INLINELIST to trace through the python binding and get the actual call torch._C._get_cublas_allow_tf32, where it's already a TorchInGraphVariable. Because _get_cublas_allow_tf32 is accessing the same variable as at::globalContext().allowTF32CuBLAS(), which is guarded by dynamo as a global state [here](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp#L443), we could safely assume it returns a ConstantVariable during tracing.

After this pr, we get the following graph:
```python
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]     def forward(self, L_x_ : torch.Tensor):
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_x_ = L_x_
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: /home/yidi/local/pytorch/test/dynamo/test_functions.py:515 in test_cublas_allow_tf32, code: return x.cos() - 1
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         cos = l_x_.cos();  l_x_ = None
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sub = cos - 1;  cos = None
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         return (sub,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118236
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-01-25 23:40:23 +00:00
b5c9623835 [export] Add node meta into UnflattenedModule (#118138)
Summary: Reland of #117686

Test Plan: CI

Differential Revision: D53012028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118138
Approved by: https://github.com/zhxchen17
2024-01-25 23:37:41 +00:00
a93940b5db [export] Allow constant outputs + None input/outputs (#117894)
Added support for constant outputs. We will just embed the constant directly into the output, like `return (x, 1)`.
Also adds support for None inputs/outputs. For None inputs, we address them the same way we do constants: a placeholder with no users is inserted into the graph, and the None is embedded into whatever operator uses it. For None outputs, we also address them the same way we do constants: we embed the None into the output, like `return (x, None)`.

Differential Revision: D52881070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117894
Approved by: https://github.com/zhxchen17
2024-01-25 23:37:34 +00:00
24133e44b1 Fix return type hint for list types (#118238)
All single element list types are `Tensor[]` so they will always be Tuple.
I don't know of any way to easily access the pyi type and compare that to a real run so no testing here :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118238
Approved by: https://github.com/ezyang
2024-01-25 23:35:20 +00:00
52c5803088 [NestedTensor] Support ragged_idx != 1 in pointwise ops (#118157)
This PR allows pointwise ops to operate on tensors with ragged_idx != 1. It does this by passing the ragged_idx metadata into the construction of the returned NestedTensor when computing pointwise ops. The assumption is that pointwise ops can operate directly on the values tensors, and the resulting tensor should have all the same metadata properties as the input tensors. For binary ops, a test is added to verify that two tensors with different ragged_idx cannot be added.

Previously:
* unary pointwise ops would error out when performed on nested tensors with ragged_idx != 1
* binary pointwise ops would produce tensors with nonsense shapes

Differential Revision: [D53032641](https://our.internmc.facebook.com/intern/diff/D53032641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118157
Approved by: https://github.com/jbschlosser
2024-01-25 23:34:15 +00:00
91d5f94f85 [FSDP] Idempotent reshard (#117997)
Address the assertion error "Expects storage to be allocated" by making reshard idempotent (https://github.com/pytorch/pytorch/issues/117510).

```pytest test/distributed/fsdp/test_fsdp_fine_tune.py -k test_parity_with_non_frozen_fsdp```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117997
Approved by: https://github.com/awgu
2024-01-25 23:29:23 +00:00
b10b08227a Passes process group to _all_gather_keys in dcp.load (#118301)
As title

Fixes #118277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118301
Approved by: https://github.com/Skylion007, https://github.com/fegin
2024-01-25 23:07:57 +00:00
02a411d4a6 [mergebot] Dry run for labels + easier to read Dr CI result (#118240)
Enable dry run for labels so we can run trymerge locally with dry run without actually affecting the PR.

Make Dr.CI results easier to read (previously a massive json dump, now just the job names + ids, in a nicer format)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118240
Approved by: https://github.com/huydhn
2024-01-25 23:06:43 +00:00
26f1da0b1b Fix node traversal when setting up stacktrace preservation hooks (#118252)
We only want to traverse over each node in the graph exactly once, and we do that by inserting nodes into the "seen" set. The issue is that we forget to check the "seen" set when inserting the root nodes. Typically that is not a problem, because the root nodes come from the different outputs and thus usually correspond to different nodes. With split_with_sizes, though, all of the outputs correspond to the same node, and this leads to the node being iterated over 3 times, and 3 sets of hooks being attached to the same node.
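
A generic sketch of the corrected traversal pattern (illustrative only, not the actual autograd hook code): duplicate roots must also go through the "seen" check, otherwise the same node is visited, and hooked, more than once.

```python
def iter_nodes_once(roots):
    seen = set()
    stack = []
    for root in roots:  # e.g. the grad_fn of each output; may repeat for split_with_sizes
        if root is not None and root not in seen:  # the check that was missing for roots
            seen.add(root)
            stack.append(root)
    while stack:
        node = stack.pop()
        yield node  # attach hooks exactly once per node
        for child, _ in getattr(node, "next_functions", ()):
            if child is not None and child not in seen:
                seen.add(child)
                stack.append(child)
```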

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118252
Approved by: https://github.com/zou3519
ghstack dependencies: #117552, #118234, #118249
2024-01-25 22:56:20 +00:00
b8bd3bb30a Fix aot_autograd seq_nr logic (#118249)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118249
Approved by: https://github.com/zou3519
ghstack dependencies: #117552, #118234
2024-01-25 22:56:20 +00:00
3c77a3ed03 export ATen/native/sparse/*.h (#118274)
We are trying to adapt `SparsePrivateUse1` in our code. However, I found that the sparse stub (`sparse_stup`) has not been exposed yet, which makes it impossible for me to implement and register the stub. I hope that the header files in this directory can be exposed. @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118274
Approved by: https://github.com/ezyang
2024-01-25 22:47:39 +00:00
fae569b4f2 [dynamo] avoid graph break on tensor.element_size() (#118229)
Before this PR, for the following code, we have a graph break `torch._dynamo.exc.Unsupported: torch.* op returned non-Tensor int call_method element_size`
```python
import torch
def f(x):
  return x.sin().element_size() + x.sin()

x = torch.randn(2, 2)
torch.compile(f, backend="eager", fullgraph=True)(x)
```
After this PR, we got the following graph, where element_size() is baked in as a constant.
```python
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]     def forward(self, L_x_ : torch.Tensor):
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_x_ = L_x_
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: /home/yidi/local/pytorch/test.py:4 in f, code: return x.sin().element_size() + x.sin()
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sin = l_x_.sin()
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sin_1 = l_x_.sin();  l_x_ = None
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add = 4 + sin_1;  sin_1 = None
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         return (add,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118229
Approved by: https://github.com/yanboliang, https://github.com/jansel, https://github.com/anijain2305
2024-01-25 22:28:37 +00:00
bd6bf97ea5 stop using torch.Tensor in dynamo/test_export_mutations.py (#118287)
This causes test flakiness, because torch.Tensor allocates a Tensor with
uninitialized memory.
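
A tiny illustration of the point above (purely for context): the bare `torch.Tensor` constructor returns uninitialized memory, so its contents are nondeterministic, whereas explicit factory functions give reproducible values.

```python
import torch

flaky = torch.Tensor(2, 3)   # uninitialized: arbitrary garbage values
stable = torch.zeros(2, 3)   # deterministic contents
random = torch.randn(2, 3)   # random, but drawn from a controlled RNG
```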

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118287
Approved by: https://github.com/ydwu4
2024-01-25 22:21:41 +00:00
f7f7283ec7 Skip test_none_names_refcount under Dynamo-wrapped CI (#118309)
Fixes https://github.com/pytorch/pytorch/issues/117716
Dynamo does some things that modifies the refcount. Skipping this test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118309
Approved by: https://github.com/ydwu4, https://github.com/yanboliang, https://github.com/albanD
ghstack dependencies: #118152
2024-01-25 22:21:22 +00:00
4e45d791e7 Remove set_ exclusion in FakeTensor dispatch cache (#118154)
Summary: Now that set_ is marked as a view op, this special case is no longer necessary

Test Plan: CI exposed the need for this special case in the first place, so I think we can just rely on the existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118154
Approved by: https://github.com/bdhirsh
2024-01-25 21:54:36 +00:00
13bdd6c4e2 Revert "[Dynamo, ONNX] use environment variable ONNXRT_DUMP_PATH to dump onnx models created by onnxrt backend (#117551)"
This reverts commit 3221585af0f78cee20f1fb739e140ab59a517ee1 as this
commit was already landed as 83581f91ca9c3b78b0f8dc3a0a2c1cb229d20e99.
2024-01-25 13:41:39 -08:00
ea851eb027 Uses Serial Loader for DCP.save when more then one thread is used. (#118114)
The OverlappingCPU Loader is causing a major drop in performance when used with multiple threads. This PR is a temporary fix while we investigate why this is the case.

Benchmarks for save, using a 7.25GB FSDP model, as per the TSS benchmark. Both benchmarks run on 8 ranks.

Before this PR
9.475 s
8 threads

After this PR
1.632 s
8 threads

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118114
Approved by: https://github.com/wz337, https://github.com/fegin
2024-01-25 21:11:16 +00:00
708e6241ed Fix sympy_subs to preserve integer and non-negative properties. (#118150)
This diff introduces the following changes:
1. Fix sympy_subs to preserve the integer and non-negative properties of the replaced symbol when the replacement is a string.
Why is this needed?
I was compiling the expression
 x*abs(y)  where y = -2.
 What happens is that this expression is passed as ``s1*abs(s0)``, then s0 is replaced with ks0 by a call to sympy_subs,
 but sympy_subs used to replace s0 (integer=False, nonnegative=False) with ks0 (integer=True, nonnegative=True),
 resulting in ``x*abs(ks0) = x*ks0``, which is wrong (see the small sympy sketch after this list).

2. Rename sympy_symbol to sympy_index_symbol to make it explicit.
3. Add an assertion that the replaced expression is not passed as a string but is always a sympy expression.
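
A small sympy check of the property issue described in item 1 (the symbol names are illustrative): replacing a generic symbol with one declared integer and non-negative lets sympy simplify `Abs(ks0)` to `ks0`, silently dropping the absolute value even though the original value could be negative (e.g. -2).

```python
import sympy

s0 = sympy.Symbol("s0")
s1 = sympy.Symbol("s1")
expr = s1 * sympy.Abs(s0)

ks0_plain = sympy.Symbol("ks0")
ks0_typed = sympy.Symbol("ks0", integer=True, nonnegative=True)

print(expr.subs(s0, ks0_plain))  # s1*Abs(ks0) -- sign information preserved
print(expr.subs(s0, ks0_typed))  # ks0*s1      -- Abs dropped because ks0 claims nonnegativity
```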

Fixes https://github.com/pytorch/pytorch/issues/117757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118150
Approved by: https://github.com/ezyang
2024-01-25 20:54:55 +00:00
2de24c11f6 [inductor] Slightly faster memory allocation on CUDA (#118255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118255
Approved by: https://github.com/peterbell10
ghstack dependencies: #118065, #118070, #118171
2024-01-25 20:49:14 +00:00
3e76a0e9c2 Install an excepthook which annotates exceptions with rank information when distributed is initialized (#118190)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118190
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2024-01-25 20:43:18 +00:00
1565d58ad9 [inductor] correctly generate grid info for benchmark_kernel (#118202)
Previously, we generated the grid argument with tree.numel for
a benchmark TritonKernel. This was not correct, because it
didn't match the launch config used for profiling and running.

This PR fixed the issue by emitting the grid value computed
by the kernel's grid_fn, which is used by the profiler and
the kernel's runner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118202
Approved by: https://github.com/shunting314, https://github.com/jansel
2024-01-25 20:37:44 +00:00
b47cf4182e Fix support non tensor inputs to operator.pos function (#118251)
Fixes #118231

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118251
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2024-01-25 20:37:40 +00:00
476b744e23 [AOTI] Forward fix https://github.com/pytorch/pytorch/pull/117989 (#118291)
Summary: https://github.com/pytorch/pytorch/pull/117989 disabled use_thread_local_cached_output_tensor for CUDA, but that is not always correct, because we can still have CPU tensors when running CUDA models.

Differential Revision: D53089956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118291
Approved by: https://github.com/Skylion007, https://github.com/frank-wei, https://github.com/chenyang78, https://github.com/khabinov
2024-01-25 20:30:17 +00:00
1f6aa4b336 [mypy] Enable follow_imports = normal for mypy-torch.backends.* (#116311)
Summary:

Test Plan:

```
lintrunner --take MYPYINDUCTOR --all-files
ok No lint issues.

lintrunner -a
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116311
Approved by: https://github.com/int3
2024-01-25 20:17:22 +00:00
3221585af0 [Dynamo, ONNX] use environment variable ONNXRT_DUMP_PATH to dump onnx models created by onnxrt backend (#117551)
With this PR, if environment variable `ONNXRT_DUMP_PATH` is set, the backend onnxrt dumps every onnx it creates as well as the graph_module stored as a text file. This allows users to see what onnx file is generated when this backend is used.
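
A hedged usage sketch based on the description above (the dump-path value is an assumed example; the `onnxrt` backend requires onnxruntime to be installed):

```python
import os
import torch

os.environ["ONNXRT_DUMP_PATH"] = "/tmp/onnxrt_dump/model"  # assumed to act as a file prefix

def f(x):
    return torch.sin(x) + 1.0

compiled = torch.compile(f, backend="onnxrt")
compiled(torch.randn(4))  # the ONNX models built by the backend are dumped alongside the graph_module text
```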

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117551
Approved by: https://github.com/thiagocrepaldi, https://github.com/wschin
2024-01-25 20:00:14 +00:00
9768f73cb2 [AOTI] Skip test_index_put_with_none_index on rocm (#118290)
Summary: The test was added in https://github.com/pytorch/pytorch/pull/118187 and is failing on rocm.

Differential Revision: [D53089729](https://our.internmc.facebook.com/intern/diff/D53089729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118290
Approved by: https://github.com/DanilBaibak
2024-01-25 19:36:00 +00:00
83581f91ca [Dynamo, ONNX] use environment variable ONNXRT_DUMP_PATH to dump onnx models created by onnxrt backend (#117551)
With this PR, if environment variable `ONNXRT_DUMP_PATH` is set, the backend onnxrt dumps every onnx it creates as well as the graph_module stored as a text file. This allows users to see what onnx file is generated when this backend is used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117551
Approved by: https://github.com/thiagocrepaldi, https://github.com/wschin
2024-01-25 18:53:41 +00:00
bb3db079b1 [Export] Introduce class_fqn into CustomObjArgument (#118158)
Summary:
Class FQN is needed when unpacking a CustomObj instance.
For all other Arguments, e.g. Tensor, TensorList, SymInt, we always know their exact type; however, CustomObjArgument had an opaque type.
Adding this field also helps unveil the type of this opaque object.

Test Plan: CI

Differential Revision: D53029847

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118158
Approved by: https://github.com/zhxchen17
2024-01-25 18:44:25 +00:00
fed0f2946f [FSDP][BE] Fix optim_state_dict_to_load doc errors (#118195)
As title

Differential Revision: [D53038703](https://our.internmc.facebook.com/intern/diff/D53038703/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118195
Approved by: https://github.com/rohan-varma, https://github.com/wz337
ghstack dependencies: #118197
2024-01-25 18:29:04 +00:00
01388d0790 [dynamo] Slightly better error message if key not in dict (#117902)
Was debugging an export issue, and currently when `key` does not exist in `self.items`, the error message is
```
  File "/opt/pytorch/torch/_dynamo/variables/dicts.py", line 208, in getitem_const
    return self.items[key]
           ~~~~~~~~~~^^^^^
torch._dynamo.exc.InternalTorchDynamoError: <torch._dynamo.variables.dicts.ConstDictVariable._HashableTracker object at 0x7fd7697cbf90>
```
This PR changes it to be the following.
```
File "/data/users/angelayi/pytorch/torch/_dynamo/variables/dicts.py", line 199, in getitem_const
    raise KeyError(arg.value)
torch._dynamo.exc.InternalTorchDynamoError: shape
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117902
Approved by: https://github.com/williamwen42
2024-01-25 18:13:40 +00:00
e1f9eca113 [DeviceMesh] Reuse sub_group pg if exists (#115716)
Currently, we create a new_group for the sub_group pg during mesh initialization. This PR changes that so we will:
1) re-use the sub_group pg if it exists,
2) create a new sub_group pg if it does not exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115716
Approved by: https://github.com/wanchaol
2024-01-25 18:07:16 +00:00
a289dba7b1 Add missing cuda libraries for context_gpu_test (#117493)
This adds some missing cuda (curand and cublas) libraries that are required for the context_gpu_test to link.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117493
Approved by: https://github.com/ezyang
2024-01-25 18:04:23 +00:00
eb054cc012 Revert "Fix Auto Functionalize to handle specified default values (#118035)"
This reverts commit 2d7a360911fb7b27be82c51ca86b4b34b6f1b087.

Reverted https://github.com/pytorch/pytorch/pull/118035 on behalf of https://github.com/zou3519 due to needs internal changes, reverting so we can land via co-dev ([comment](https://github.com/pytorch/pytorch/pull/118035#issuecomment-1910706841))
2024-01-25 17:53:15 +00:00
8810fdd21e fsdp: Unit test for ModuleWrapPolicy as a Callable (#117395)
We use `_or_policy` as a `Callable` to wrap a `ModuleWrapPolicy` instance as a `Callable`.

Fixes https://github.com/pytorch/pytorch/issues/109266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117395
Approved by: https://github.com/wconstab
2024-01-25 17:40:06 +00:00
c1e0674485 [DCP][BC] Remove the dependency on _shard.TensorProperties (#116248)
ShardedTensor is in maintenance mode and is going to be deprecated. DCP's metadata should not rely on any definitions in ShardedTensor. This PR creates a replica of TensorProperties in DCP and removes the dependency on _shard.TensorProperties

Differential Revision: [D52357732](https://our.internmc.facebook.com/intern/diff/D52357732/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116248
Approved by: https://github.com/wconstab, https://github.com/LucasLLC, https://github.com/wz337
2024-01-25 17:24:16 +00:00
316579e30c [FSDP2] Introduced initial fully_shard frontend (#117776)
This PR introduces the initial `fully_shard` frontend without any distributed logic that will be built into per-parameter-sharding FSDP.
- We design `fully_shard` to be a _module-level_ API (taking in an `nn.Module`), e.g. as opposed to a tensor-level one.
- We define a `FSDP` class and use a dynamic class swap, setting `module.__class__` to a newly created class that subclasses `FSDP` and `type(module)`, to allow FSDP to override and add methods on the module (see the sketch after this list).
    - We name this class as `FSDP<type(module)>`, e.g. `FSDPLinear` for `Linear`.
    - We disable the `deepcopy` because the state object inserted on the module will not be trivially `deepcopy`-able.
- Calling `fully_shard(module)` inserts a state object on `module` but not any of its children. This state object will be used for any FSDP-specific state.
- We raise an error on `ModuleList` or `ModuleDict` since they do not implement `forward()`, and FSDP will rely on `forward()` to insert logic (https://github.com/pytorch/pytorch/issues/113794).
- In the future, we will deprecate the existing `fully_shard` that calls into the same backend logic as `FullyShardedDataParallel` as there is no adoption for that and we prefer to reuse that name.
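
A generic sketch of the dynamic class-swap pattern described in the list above (not the actual FSDP2 implementation; `unshard` is a purely illustrative method name):

```python
import torch.nn as nn

class FSDP:
    def unshard(self):           # illustrative extra method added by the wrapper class
        print("unsharding", type(self).__name__)

def fully_shard(module: nn.Module) -> nn.Module:
    # Create FSDP<type(module)> on the fly and swap the instance's class in place,
    # so isinstance(module, type(module)) still holds while new methods appear.
    new_cls = type(f"FSDP{type(module).__name__}", (FSDP, type(module)), {})
    module.__class__ = new_cls
    return module

lin = fully_shard(nn.Linear(4, 4))
print(type(lin).__name__)        # FSDPLinear
print(isinstance(lin, nn.Linear))  # True
lin.unshard()                    # method provided by the swapped-in class
```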

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117776
Approved by: https://github.com/wconstab, https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #117994, #118186, #117984
2024-01-25 17:22:07 +00:00
4f78869c18 [state_dict] Calls wait() for the DTensor to_local() result (#118197)
See the discussion in https://github.com/pytorch/pytorch/pull/117799.

There are some issues when returning an AsyncCollectiveTensor (haven't found the
root causes), including OOM and unexpected values.

This PR forces `_gather_state_dict()` to be synchronous with respect to the main stream.

Differential Revision: [D53049807](https://our.internmc.facebook.com/intern/diff/D53049807/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118197
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-01-25 17:14:08 +00:00
817debeb89 [inductor] Slightly faster memory allocation on CPU (#118171)
Based on `python benchmarks/dynamo/microbenchmarks/overheads.py`:
- Before `12.2us`
- After `10.5us`

This is inspired by a2c17a2b00 -- but in Python rather than C++

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118171
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #118065, #118070
2024-01-25 16:54:57 +00:00
d6b556bd98 Added "any" mode to register_multi_grad_hook (#117984)
This is a re-open of https://github.com/pytorch/pytorch/pull/115628/. This PR adds an `"any"` option to `register_multi_grad_hook` that runs the hook when the gradient of _any_ of the input tensors is computed. The existing functionality is folded under the default `"all"` mode.

The multi-threaded test case is based on the existing one for `register_multi_grad_hook`. I would appreciate a closer look on that. ~~I am not sure about the hook signature (i.e. why we see two gradients in the hook that runs instead of just one, as [`register_hook`](https://pytorch.org/docs/stable/generated/torch.Tensor.register_hook.html) docs suggest).~~ It was because I was iterating over the 2 elements in the single tensor 😢 .

I did not update the `notes/autograd.rst`, which currently has a [blurb](https://pytorch.org/docs/stable/notes/autograd.html#special-hooks) on `register_multi_grad_hook`.
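
A small usage sketch of the new mode, assuming the `"any"`-mode hook receives the single gradient that triggered it (the exact callback signature of the final API may differ from this illustration):

```python
import torch

a = torch.randn(2, requires_grad=True)
b = torch.randn(2, requires_grad=True)

def hook(grad):
    # Fires as soon as the gradient of *any* registered tensor is computed,
    # instead of waiting for all of them as the default "all" mode does.
    print("got a gradient of shape", grad.shape)

handle = torch.autograd.graph.register_multi_grad_hook((a, b), hook, mode="any")
(a * 2).sum().backward()  # only `a` participates in the graph, yet the hook still runs
handle.remove()
```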

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117984
Approved by: https://github.com/soulitzer
ghstack dependencies: #117994, #118186
2024-01-25 16:25:52 +00:00
173777461c expose nested tensor header file (#117956)
This PR exposes nested-tensor-related header files, which will make it easier for others to develop nested-tensor kernels in extension modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117956
Approved by: https://github.com/ezyang
2024-01-25 15:53:10 +00:00
865945cc1f Convert requires_cuda to full decorator (#118281)
Don't require using it as `@requires_cuda()`; use `@requires_cuda` instead. There is no need for the partial function to be invoked many times.

Split out this change from the initial large refactoring in #117741 to hopefully get merged before conflicts arise
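
An illustrative sketch of the decorator change, using unittest.skipIf as a stand-in for the real helper (this is not the actual PyTorch test-utility source):

```python
import unittest
import torch

# Before: a zero-argument factory, so call sites needed "@requires_cuda()".
def requires_cuda_factory():
    return unittest.skipIf(not torch.cuda.is_available(), "requires CUDA")

# After: the name is bound directly to the decorator, so "@requires_cuda" works.
requires_cuda = unittest.skipIf(not torch.cuda.is_available(), "requires CUDA")

class MyTest(unittest.TestCase):
    @requires_cuda
    def test_on_gpu(self):
        self.assertTrue(torch.cuda.is_available())
```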

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118281
Approved by: https://github.com/ezyang
2024-01-25 15:50:21 +00:00
87fb8b6218 [DTensor] Relaxed to_local requires_grad warning (#118186)
The existing warning in `DTensor.__new__()` checks `if requires_grad != local_tensor.requires_grad:` and warns with:

> To construct DTensor from `torch.Tensor`, it's recommended to use `local_tensor.detach()` and make `requires_grad` consistent.

Calling `local_tensor.detach()` returns a `Tensor` with `requires_grad=False`, so the error message refers to the case where `local_tensor.requires_grad is True` but the user passed `requires_grad=False` to `to_local()`.

However, there is the converse case, where `local_tensor.requires_grad is False` but the user passed `requires_grad=True`. In this case, the original `if requires_grad != local_tensor.requires_grad:` check succeeds, and the warning is emitted. However, the warning message does not apply in that case.

This can happen via `_prepare_output_fn` -> `redistribute` -> `Redistribute.forward()`, where `output.requires_grad is False` but it passes `requires_grad=input.requires_grad` which can be `True`.

We should not warn in this case since `Redistribute.forward()` is our own framework code, so I was proposing to relax the warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118186
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #117994
2024-01-25 15:49:32 +00:00
a5230e6019 [ez][docs] Fixed render of tensors in backward (#117994)
Before:
<img width="851" alt="Screenshot 2024-01-22 at 2 03 49 PM" src="https://github.com/pytorch/pytorch/assets/31054793/a71111ab-c7c4-4af5-a996-cbd42bcc8326">

After:
![Screenshot 2024-01-23 at 7 13 40 PM](https://github.com/pytorch/pytorch/assets/31054793/36db28a0-a96f-434c-a93f-fe78aff1e035)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117994
Approved by: https://github.com/soulitzer, https://github.com/weifengpy
2024-01-25 15:49:32 +00:00
8f973038d5 Update update_failures.py given feedback (#118237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118237
Approved by: https://github.com/drisspg
2024-01-25 15:42:01 +00:00
b5b36cf0c4 Fix failure of test_dynamo_distributed & test_inductor_collectives (#117741)
When CUDA is not available `c10d.init_process_group("nccl"...)` will fail with
> RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

Hence add a corresponding skip marker to the classes deriving from DynamoDistributedSingleProcTestCase next to the `requires_nccl` marker.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117741
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-01-25 13:25:36 +00:00
ee1dbb2acf [AOTI] Fix a None as index codegen issue (#118187)
Summary: Fix a ABI-compatible codegen issue when index_put has None in its indices.

Differential Revision: [D53047489](https://our.internmc.facebook.com/intern/diff/D53047489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118187
Approved by: https://github.com/chenyang78
ghstack dependencies: #118168, #118169
2024-01-25 11:53:44 +00:00
d1e661a1ce [AOTI] Add _scaled_dot_product_efficient_attention to C shim (#118169)
Summary: _scaled_dot_product_efficient_attention is used in some TIMM models

Differential Revision: [D53032358](https://our.internmc.facebook.com/intern/diff/D53032358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118169
Approved by: https://github.com/chenyang78
ghstack dependencies: #118168
2024-01-25 11:53:44 +00:00
5c7a18c5cb [AOTI] Refactor shim_common.cpp (#118168)
Summary: Use new_tensor_handle to reduce code repetition

Differential Revision: [D53032353](https://our.internmc.facebook.com/intern/diff/D53032353)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118168
Approved by: https://github.com/chenyang78
2024-01-25 11:53:29 +00:00
4b4e6550f2 Update oneDNN build option for older systems (#118057)
Fixes [#116623](https://github.com/pytorch/pytorch/issues/116623).

As we discussed in https://github.com/pytorch/pytorch/issues/116623#issuecomment-1900406773 and https://github.com/pytorch/pytorch/issues/116623#issuecomment-1900825829, we update oneDNN build option to support older systems and document we only support CPUs with SSE4.1+.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118057
Approved by: https://github.com/malfet
2024-01-25 11:34:51 +00:00
eebe7e1d37 Migrate update-viablestrict to test-infra (#118163)
In https://github.com/pytorch/test-infra/pull/4905, so that ExecuTorch can use the same GHA on their CI.

### Testing

https://github.com/pytorch/pytorch/actions/runs/7634906738/job/20799502532#step:2:15480
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118163
Approved by: https://github.com/clee2000
2024-01-25 07:07:34 +00:00
357a06f7c9 [ONNX] Fix type promotion pass (#118246)
Currently, when `node.meta["val"]` is `torch.Sym*`, its `hint` [is extracted](61865205b6/torch/onnx/_internal/fx/passes/type_promotion.py (L86)) and used in type promotion. However, it will [override](61865205b6/torch/onnx/_internal/fx/passes/type_promotion.py (L1409)) dynamic shape information carried in `node.meta["val"]` during [type propagation](61865205b6/torch/onnx/_internal/fx/passes/type_promotion.py (L1401)) and the FX graph seen in `onnxrt` always carries static shapes. Let's use `torch.Sym*` directly so that the type promotion propagates and stores dynamic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118246
Approved by: https://github.com/titaiwangms
2024-01-25 07:04:18 +00:00
2c6a233c45 Report the type of a tensor in wrap_to_fake (#118220)
This could help diagnose why a tensor wasn't considered static.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118220
Approved by: https://github.com/albanD, https://github.com/bdhirsh
ghstack dependencies: #118215, #118217
2024-01-25 06:53:12 +00:00
8b95fb4eb8 Add stack trace to "start tracing" log (#118217)
When debugging problems on unfamiliar model code, I often want to know
"how did I end up in this compiled region."  Printing the stack trace at
tracing start lets me find out this information.

Looks like this:

```
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing f /data/users/ezyang/c/pytorch/b.py:3
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO] Stack (most recent call last):
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/b.py", line 9, in <module>
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     f(torch.randn(5))
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/eval_frame.py", line 437, in _fn
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/eval_frame.py", line 601, in catch_errors
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     return callback(frame, cache_entry, hooks, frame_state)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 743, in _convert_frame
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     result = inner_convert(frame, cache_entry, hooks, frame_state)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 386, in _convert_frame_assert
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     return _compile(
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 645, in _compile
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     guarded_code = compile_inner(code, one_graph, hooks, transform)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     r = func(*args, **kwargs)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 526, in compile_inner
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     out_code = transform_code_object(code, transform)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     transformations(instructions, code_options)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 151, in _fn
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 473, in transform
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     tracer = InstructionTranslator(
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/symbolic_convert.py", line 2030, in __init__
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     _step_logger()(
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/logging.py", line 55, in log
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     logger.log(level, "Step %s: %s", step, msg, **kwargs)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118217
Approved by: https://github.com/albanD
ghstack dependencies: #118215
2024-01-25 06:53:12 +00:00
2a178dade8 Augment create_symbol with user/infra backtrace fragment (#118215)
Looks like this:

```
[2024-01-24 11:59:41,656] [0/1] torch.fx.experimental.symbolic_shapes: [INFO] create_symbol s0 = 5 for L['x'].size()[0] [2, 9223372036854775806] at b.py:5 in f (_dynamo/variables/builder.py:1788 in <lambda>)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118215
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2024-01-25 06:53:12 +00:00
514159ddcb Add torch_dynamo to resume_in for ease of debugging (#118201)
resume_in_* code objects show up in user backtraces when failures occur
in code that has been Dynamo processed.  It is obvious to me, a PT2
developer, that these are generated by PT2, but it is NOT obvious to a
non-core dev that this has happened.  Add an extra torch_dynamo
breadcrumb to help get people to the right place.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118201
Approved by: https://github.com/albanD
2024-01-25 06:52:17 +00:00
5a83c47d98 [vision hash update] update the pinned vision hash (#117594)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117594
Approved by: https://github.com/pytorchbot
2024-01-25 05:33:01 +00:00
e0903b0720 [executorch hash update] update the pinned executorch hash (#118040)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118040
Approved by: https://github.com/pytorchbot
2024-01-25 05:27:53 +00:00
e5e9f390be [dynamo] Optimize overheads from _TorchDynamoContext (#118070)
Based on `python benchmarks/dynamo/microbenchmarks/overheads.py`:
- Before `18.1us`
- After `12.2us`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118070
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
ghstack dependencies: #118065
2024-01-25 05:04:56 +00:00
a40951defd [C10D] Fix nccl flightrecorder ignored dump timeout (#118142)
Don't call future.get() unless it's ready, because it waits.
Also, refactor the code a bit for simplicity.

We should do a follow-on PR to clean up the timeouts further, but this
should fix the glaring timeout bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118142
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118044, #118046, #118047
2024-01-25 04:25:36 +00:00
87335fabae [Exception] [6/N] Remove use of torch::TypeError (#117964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117964
Approved by: https://github.com/albanD
2024-01-25 03:35:58 +00:00
67300a11cb Support custom autograd Function forward AD return non-Tensor in forward (#118234)
Fixes https://github.com/pytorch/pytorch/issues/117491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118234
Approved by: https://github.com/albanD
ghstack dependencies: #117552
2024-01-25 03:24:29 +00:00
2d7a360911 Fix Auto Functionalize to handle specified default values (#118035)
Summary: When there were optionals with specified default values, the code improperly handled the number of parameters, causing `IndexError: tuple index out of range`.

Test Plan: new tests

Differential Revision: D52977644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118035
Approved by: https://github.com/williamwen42
2024-01-25 01:22:12 +00:00
4a49e2b52d refactoring (#118111)
No real changes, just moving mutation checking skip to a helper file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118111
Approved by: https://github.com/bdhirsh
ghstack dependencies: #118110
2024-01-25 00:36:46 +00:00
4448f2a49d Log stack trace of mutated idx reland (#118110)
Relanding of https://github.com/pytorch/pytorch/pull/117720 with a fixed `next(iter(dict.values()))` instead of `next(dict.values())` and a corresponding test that would have caught the problem (as well as a type annotation that also would have).
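
For context, a tiny illustration of the bug class fixed here: `dict.values()` returns a view, not an iterator, so it must be wrapped in `iter()` before `next()` can be called.

```python
d = {"a": 1, "b": 2}

try:
    next(d.values())            # TypeError: 'dict_values' object is not an iterator
except TypeError as e:
    print(e)

print(next(iter(d.values())))   # 1, the corrected form used in this reland
```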

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118110
Approved by: https://github.com/bdhirsh
2024-01-25 00:30:03 +00:00
5b819d9ef0 Properly move retains_grad hook on in-place over view for base (#117552)
Fixes https://github.com/pytorch/pytorch/issues/117366
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117552
Approved by: https://github.com/albanD
2024-01-25 00:27:13 +00:00
9c1348feb3 [pytorch][kineto] log process group config in distributed info (#117774)
Summary: Process group config is essential for analyzing collective patterns. We have already added this to Execution Trace; now we expose the same information in Kineto as well.

Test Plan: Tested in HPC

Differential Revision: D52882292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117774
Approved by: https://github.com/wconstab, https://github.com/aaronenyeshi
2024-01-25 00:08:10 +00:00
89530c8590 [dynamo] Test for using torch.nn when replay_records are enabled (#116215)
This adds a reproducer for a failure that has since been fixed in main.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116215
Approved by: https://github.com/jansel
ghstack dependencies: #116230, #116214
2024-01-24 23:42:35 +00:00
7c33ce7702 [CI] Install dill in ci (#116214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116214
Approved by: https://github.com/malfet
ghstack dependencies: #116230
2024-01-24 23:42:35 +00:00
b53cc6cf8d [dynamo] Fix test_replay_record.py (#116230)
This test isn't run in CI because the CI runners don't have dill installed.
This fixes the tests so they run for me locally, and in the next PR I add
dill to the CI so we can test it properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116230
Approved by: https://github.com/jansel
2024-01-24 23:42:35 +00:00
61865205b6 Deflake Dynamo stream tests (#118205)
streams need to be synchronized, otherwise, there is undefined behavior.
This PR adds the necessary synchronization. This exposed some bugs
(https://github.com/pytorch/pytorch/issues/118204), so I just marked the
tests as expectedFailure.

Test Plan:
- tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118205
Approved by: https://github.com/yanboliang
2024-01-24 23:31:47 +00:00
5e0ef84b01 [dynamo] Refactor install_global_once, remove usages of install_global_unsafe (#118100)
We split install_global_once into two APIs:
- `install_global_by_id(prefix, value) -> name`: installs a global if it hasn't
been installed yet
- `install_global(prefix, value) -> name`: always installs the global (and
  generates a unique name for it)
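
An illustrative sketch of the contract of the two helpers (not the actual Dynamo implementation):

```python
import itertools

_globals = {}
_counter = itertools.count()
_installed_by_id = {}

def install_global(prefix, value):
    # Always installs under a fresh, unique name.
    name = f"{prefix}_{next(_counter)}"
    _globals[name] = value
    return name

def install_global_by_id(prefix, value):
    # Installs at most once per (prefix, value identity) and reuses the name.
    key = (prefix, id(value))
    if key not in _installed_by_id:
        _installed_by_id[key] = install_global(prefix, value)
    return _installed_by_id[key]

fn = print
assert install_global_by_id("hook", fn) == install_global_by_id("hook", fn)
assert install_global("hook", fn) != install_global("hook", fn)
```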

Then, we refactor most callsites of `install_global_unsafe` to one of
the previous. Some callsites cannot be refactored because we create the
global name first, do a lot of stuff with it, and then install it.

This fixes more test flakiness.

Test Plan:
- Existing tests; I can't reliably repro the flakiness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118100
Approved by: https://github.com/ezyang, https://github.com/mlazos
2024-01-24 23:25:44 +00:00
2abb812a78 Check if enable inside run call (#118101)
In theory this way we never have to worry about subclasses calling super().setUp() ever again

Also, dynamically creating classes (ex via type in instantiate_device_type_tests) makes super() calls a bit odd
https://stackoverflow.com/questions/71879642/how-to-pass-function-with-super-when-creating-class-dynamically
https://stackoverflow.com/questions/43782944/super-does-not-work-together-with-type-supertype-obj-obj-must-be-an-i

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118101
Approved by: https://github.com/huydhn
2024-01-24 22:38:41 +00:00
dba160e676 [13/N][Dynamo] Refactor torch ctx manager classes check out of trace_rules.lookup (#118130)
I'm going to merge inline/skip/allow_in_graph check into ```trace_rules.lookup```, so it's better to make it only handle function types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118130
Approved by: https://github.com/williamwen42
2024-01-24 22:33:41 +00:00
4e29f01bf2 Remove sdp_kernel and replace with sdpa_kernel in attention namespace (#114689)
# Summary
Simplification of Backend Selection

This PR deprecates the `torch.backends.cuda.sdp_kernel` context manager and replaces it with a new context manager, `torch.nn.attention.sdpa_kernel`, which also changes the API.

With `sdp_kernel`, one would specify the backend choice by negating the kernels they did not want to run. The purpose of this backend manager was only to be a debugging tool: "turn off the math backend" and see if you can run one of the fused implementations.

Problems:
- This pattern makes sense if the majority of users don't care to know anything about the backends that can be run. However, if users are seeking out this context manager, they are explicitly trying to run a specific backend.
- This is not scalable. We are working on adding the cuDNN backend, and this API makes it so that more implementations will need to be turned off if a user wants to explicitly run a given backend.
- Discoverability of the current context manager. It is somewhat unintuitive that this backend manager lives in backends/cuda/__init__ when it now also controls the CPU fused kernel behavior. I think centralizing it in the attention namespace will be helpful.

Other concerns:
- Typically, backends (kernels) for operators are entirely hidden from users and are implementation details of the framework. We have exposed this to users already, albeit not by default and with beta warnings. Does making backend choices even more explicit lead to problems when we potentially want to remove existing backends (perhaps input shapes will get covered by newer backends)?

A nice side effect is that, now that we aren't using the `BACKEND_MAP` in test_transformers, many dynamo failures are passing for CPU tests.
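
A usage sketch comparing the two APIs, based on the description above (running the fused backends requires a suitable CUDA device; enum and argument names follow `torch.nn.attention`):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(2, 4, 8, 16, device="cuda", dtype=torch.float16) for _ in range(3))

# Old, now-deprecated style: disable everything except the backend you want.
with torch.backends.cuda.sdp_kernel(enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)

# New style: name the backend you want to run directly.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```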

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114689
Approved by: https://github.com/cpuhrsch
2024-01-24 22:28:04 +00:00
77186af028 [DTensor][BE] re-enable test_dtensor_ops in CPU CI (#118134)
**Test**
`pytest test/distributed/_tensor/test_dtensor_ops.py`
This only runs CPU test and completes in 1 minute on local.
<img width="3002" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/bfbcaff0-2581-41a7-817d-f68e4041b8b1">

CI Run: https://hud.pytorch.org/pr/pytorch/pytorch/118134
Search for "distributed" test and click any of them. Then search for "test_dtensor_ops". Saw successful run of `test_dtensor_ops`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118134
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/wanchaol
ghstack dependencies: #117726, #118132
2024-01-24 22:11:51 +00:00
e6288820e3 Revert "Update triton ROCm version to 6.0" (#118179)
Reverting [this commit](https://github.com/pytorch/pytorch/pull/117433) due to failures observed in wheel environment e.g:
```
ImportError: /tmp/torchinductor_root/triton/0/ebfa57c0b7b95873c96cad6f9bca148d/hip_utils.so: undefined symbol: hipGetDevicePropertiesR0600
```

Will revert for now and investigate and aim to re-land this as part of https://github.com/pytorch/pytorch/pull/116270

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118179
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-01-24 22:01:27 +00:00
af9b6fa04e Revert "Check if enable inside run call (#118101)"
This reverts commit 6fc015fedc96e532da756e9408fcedb9c81a423f.

Reverted https://github.com/pytorch/pytorch/pull/118101 on behalf of https://github.com/clee2000 due to possibly causing failures on b025e5984ce30eed10df0cc89111e88983d823d3 ([comment](https://github.com/pytorch/pytorch/pull/118101#issuecomment-1908940940))
2024-01-24 21:26:35 +00:00
15608d8cb4 Add guardrails preventing complex params in LBFGS & SparseAdam (#118161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118161
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #118160
2024-01-24 21:22:47 +00:00
17ecd1e9cd Migrate test_complex_optimizer to OptimizerInfo (#118160)
This PR does what it says and more.

1. We increase coverage by a LOT! Previously, complex was not tested for many many configs, including foreach + maximize at the same time. Or the fused impls. Or just random configs people forgot about.
2. I rearranged the maximize conditional and the _view_as_real to preserve list-ness. This is needed for _view_as_real to function properly, I did add a comment in the Files Changed. This new order also just...makes more aesthetic sense.
3. Note that LBFGS and SparseAdam are skipped--they don't support complex and now we know.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118160
Approved by: https://github.com/mikaylagawarecki
2024-01-24 21:22:47 +00:00
6978c3ddf3 Removes an Incorrect Type Specification from AdaptiveMaxPool1d (#118162)
The return type for the forward pass of nn.AdaptiveMaxPool1d is specified to be Tensor, but if self.return_indices is True, the result type should be tuple[Tensor, Tensor].

For users trying to trace/script this function with indices, the incorrect typing is problematic.
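
A quick illustration of the behavior behind the type fix:

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveMaxPool1d(output_size=2, return_indices=True)
out, indices = pool(torch.randn(1, 3, 8))  # a (values, indices) tuple, not a single Tensor
print(out.shape, indices.shape)            # torch.Size([1, 3, 2]) torch.Size([1, 3, 2])
```
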
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118162
Approved by: https://github.com/albanD
2024-01-24 20:31:02 +00:00
821b2c543c [AOTI] Support .item() in the ABI-compatible mode (#117989)
Summary:

Differential Revision: [D52965076](https://our.internmc.facebook.com/intern/diff/D52965076)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117989
Approved by: https://github.com/ezyang, https://github.com/chenyang78
2024-01-24 20:17:59 +00:00
2f6fc33c20 Move skip sets into a new file. (#118032)
This PR moves the skip sets that lived in benchmarks/dynamo/torchbench.py into a more
readable YAML file, so that it is consumable from other projects (e.g. XLA).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118032
Approved by: https://github.com/lezcano, https://github.com/ezyang
2024-01-24 19:22:01 +00:00
e599a08796 [dtensor] rewrite embedding ops using op strategy (#118079)
This PR rewrites the sharded embedding rule to use OpStrategy instead of a rule, one step further toward getting rid of rules and consolidating the embedding
operator implementation. This prepares for the rowwise embedding implementation, which will come in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118079
Approved by: https://github.com/tianyu-l
2024-01-24 19:12:12 +00:00
b025e5984c Get Device instance with correct type when privateuse1 backend is registered (#117966)
Fixes #ISSUE_NUMBER
If a privateuse1 backend is registered, let torch.device return the corresponding Device instance when only an index is given.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117966
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-24 19:03:18 +00:00
6fc015fedc Check if enable inside run call (#118101)
In theory this way we never have to worry about subclasses calling super().setUp() ever again

Also, dynamically creating classes (ex via type in instantiate_device_type_tests) makes super() calls a bit odd
https://stackoverflow.com/questions/71879642/how-to-pass-function-with-super-when-creating-class-dynamically
https://stackoverflow.com/questions/43782944/super-does-not-work-together-with-type-supertype-obj-obj-must-be-an-i

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118101
Approved by: https://github.com/huydhn
2024-01-24 18:51:05 +00:00
fc135454ca [PT2][Optimus][Reliability]Fix a bug in gradients computation for runtime numeric check (#118105)
Summary:
We observed the following error when launching the e2e AFOC model test:
```
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
```
f524190245

Differential Revision: D53011463

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118105
Approved by: https://github.com/jackiexu1992
2024-01-24 18:45:10 +00:00
1e185c7803 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-24 18:42:14 +00:00
6e78592cbb Added type checking for ExportedProgram (#117231)
Fixes #116952

Added type checking for ExportedProgram in the save function. Please review.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117231
Approved by: https://github.com/avikchaudhuri
2024-01-24 18:24:44 +00:00
af1ebc45d3 [quant][pt2e] Add fold_quantize=True for all convert_pt2e calls (#117797)
Summary: In preparation for enabling fold_quantize=True by default

Test Plan: CI

Differential Revision: D52879612

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117797
Approved by: https://github.com/andrewor14
2024-01-24 17:54:13 +00:00
90b3cf33ac [C10] Make Scalar constructable from longs (#118149)
On Linux and Mac `int64_t` is an alias to either `long` (Linux) or  `long long` (Mac)

Because of that, attempt to construct `c10::Scalar` from the other type will fail with `conversion from ‘long long int’ to ‘c10::Scalar’ is ambiguous`.

I.e. attempt to compile:
```cpp
int main() {
  c10::Scalar s = 1L;
}
```
on MacOS failed with:
```
foo.cpp:3:15: error: conversion from 'long' to 'c10::Scalar' is ambiguous
  c10::Scalar s = 1L;
              ^   ~~
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
      DEFINE_IMPLICIT_CTOR)
      ^
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:62:3: note: candidate constructor
  Scalar(uint16_t vv) : Scalar(vv, true) {}
  ^
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:63:3: note: candidate constructor
  Scalar(uint32_t vv) : Scalar(vv, true) {}
  ^
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:64:3: note: candidate constructor
  Scalar(uint64_t vv) {
  ^

```

Prevent this by providing the missing constructors when needed. Alas, one cannot use SFINAE, as template constructors on Scalar mess up a lot of implicit conversions, so I use `static_asserts` to detect early on whether the premise for constructing this class holds.

Add ScalarTest::LongsAndLongLongs that is essentially a compile time test

Discovered while trying to enable AOTI on MacOS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118149
Approved by: https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #118077, #118076
2024-01-24 17:32:29 +00:00
880f9bb57e Remove xfails for consistently succeeding tests (#118152)
Fixes https://github.com/pytorch/pytorch/issues/117786, https://github.com/pytorch/pytorch/issues/117785
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118152
Approved by: https://github.com/yanboliang
2024-01-24 15:47:55 +00:00
bd99115276 [AOTI] Enable for MacOS (#118076)
- Add `darwin` to the list of supported platform
- Add `#include <sstream>` to `aoti_runtime/model.h`
- Refactor Linux specific constant compilation logic to `_compile_consts_linux`
- Add `_compile_consts_darwin` that converts consts to .S file that is linked into a shared library
   - Patch file using magic to avoid converting bytes to large hexadecimal string
- Generate integer constants with `LL` suffix on MacOS (corresponds to int64_t definition)
- Enable test_aot_inductor.py tests on MacOS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118076
Approved by: https://github.com/desertfire
ghstack dependencies: #118077
2024-01-24 14:24:05 +00:00
a545ebc870 Switched macOS runners type to macos-m1-stable (#117651)
Switched macOS runners type to `macos-m1-stable`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117651
Approved by: https://github.com/huydhn
2024-01-24 11:55:13 +00:00
12662f4d95 [dynamo] add username in debug path (#117820)
Summary: Not including a username in the debug path may cause conflicts and permission errors when people share a dev server.

bypass-github-pytorch-ci-checks

Test Plan: ci

Differential Revision: D52895486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117820
Approved by: https://github.com/kflu, https://github.com/DanilBaibak
2024-01-24 10:14:20 +00:00
7d396918c6 [Inductor] Fix argument unused during compilation warning (#118077)
By not passing linker flag if `compile_only` is set to `True`
Before that change every invocation of AOTI compiler resulted in emitting at least 4 warnings:
```
clang: warning: -lomp: 'linker' input unused [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-shared' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-undefined dynamic_lookup' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-L/Users/nshulga/miniforge3/lib' [-Wunused-command-line-argument]
```
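
A simplified sketch of the fix (not the actual Inductor build code): linker-only flags are emitted only when we actually link.

```python
def build_command(compile_only: bool) -> list[str]:
    cmd = ["clang++", "-O3", "model.cpp"]
    if compile_only:
        cmd += ["-c", "-o", "model.o"]  # compile step only: no linker flags, no warnings
    else:
        cmd += ["-shared", "-undefined", "dynamic_lookup", "-lomp", "-o", "model.so"]
    return cmd

print(" ".join(build_command(compile_only=True)))
print(" ".join(build_command(compile_only=False)))
```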

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118077
Approved by: https://github.com/desertfire
2024-01-24 09:52:16 +00:00
50ead5d8ae [fx] add an option to not retrace when doing op fusion (#118120)
Summary: If the given model is already a graph module, we want to skip retracing in some cases.

Test Plan: CI

Differential Revision: D53018283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118120
Approved by: https://github.com/zyan0
2024-01-24 09:41:26 +00:00
c5702a0891 [dynamo] Optimize BACKEND_MATCH guard (#118065)
As measured by `benchmarks/dynamo/microbenchmarks/overheads.py`:
- Before `22.5us`
- After `18.1us`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118065
Approved by: https://github.com/ydwu4
2024-01-24 07:47:52 +00:00
ed0ec2e0be Remove dynamo runner's dependency on distributed build (#117903)
This lets us bisect faster without needing to rebuild the distributed module. We remove the annotation to avoid the flake8 undefined-name lint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117903
Approved by: https://github.com/xuzhao9
2024-01-24 06:51:14 +00:00
725f4b58ac Cache dfs path in propose_partitions and re-use that later when trying to find cycles in the graph (#115943)
Summary:
This diff introduces a caching mechanism to improve the performance of the partitioner in PyTorch. The changes involve adding a cache to store the DFS path of each node in the graph, which can be reused later when trying to find cycles in the graph.

This shows significant improvements for edge use cases: an ASR model with around 6000+ nodes used to take 26 minutes to partition, and after this change it takes around 8 minutes.
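
A generic sketch of the caching idea, assuming a DAG and a reachability-based cycle check (this is not the actual partitioner code):

```python
from functools import lru_cache

graph = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}

@lru_cache(maxsize=None)
def reachable(node):
    # Memoized DFS: the reachable set of each node is computed once and reused.
    out = set()
    for succ in graph[node]:
        out.add(succ)
        out |= reachable(succ)
    return frozenset(out)

def would_create_cycle(src, dst):
    # Adding an edge src -> dst creates a cycle iff src is already reachable from dst.
    return src in reachable(dst)

print(would_create_cycle("a", "c"))  # False
print(would_create_cycle("a", "d"))  # True, since d -> a already exists
```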

Test Plan: Relying on the existing ExecuTorch CI tests that heavily use this partitioning mechanism and also tested out locally via Bento notebooks.

Differential Revision: D51289200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115943
Approved by: https://github.com/SherlockNoMad
2024-01-24 05:30:11 +00:00
d59c2d6e05 [dtensor] refactor partial redistribution logic (#113334)
This PR:

* Moves the remaining placement transforms, specifically the Partial-related logic, from redistribute.py to placement_types
* Redefines the Partial interface to make things more consistent, and adds
  docs about the transformation relationships

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113334
Approved by: https://github.com/tianyu-l, https://github.com/XilunWu
ghstack dependencies: #118078
2024-01-24 04:56:16 +00:00
03205ff3ba [dtensor] make local_shard_size_on_dim be staticmethod (#118078)
As titled, this is so that we can use it in cases where we don't need
to construct a Shard placement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118078
Approved by: https://github.com/XilunWu
2024-01-24 04:56:16 +00:00
8d49737f2b [CUDA][Complex] Bump thresholds for conv3d (#118151)
Seeing a 1/1000 numerical mismatch

CC @coyotelll

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118151
Approved by: https://github.com/ezyang
2024-01-24 04:18:31 +00:00
46c228f0e2 [DTensor][BE] rename PlacementStrategy.output_spec to output_specs since now we support a tuple of DTensorSpec as output (#116437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116437
Approved by: https://github.com/wanchaol
2024-01-24 03:33:58 +00:00
26968cefb0 [DTensor][fix] re-enable [add]mm tensor test (#118132)
**Summary**
Re-enable tests that were disabled in #118045 as #117726 fixed the empty tensor case for DTensor [add]mm.

**Test Plan**
`pytest test/distributed/_tensor/test_dtensor_ops.py -s -k mm`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118132
Approved by: https://github.com/malfet
ghstack dependencies: #117726
2024-01-24 03:17:18 +00:00
155f27a97b [DTensor][fix] fix is_tensor_shardable to correctly handle Replicate placement (#117726)
**Summary**
Previously, the DTensor sharding-plan filter (i.e. `is_tensor_shardable()`) could not correctly handle the case where the input `DTensor` has a dimension of size 0. The filter should return `True` if the sharding placement on that dimension is `Replicate`, even though `tensor dim < num of shards` on that dimension (in which case `tensor dim == 0` and `num of shards == 1`).

In this PR we also noticed a behavior discrepancy of `torch.addmm`. See #118131

**Test Plan**
```
pytest test/distributed/_tensor/test_dtensor_ops.py -s -k addmm
pytest test/distributed/_tensor/test_dtensor_ops.py -s -k mm_cpu_float32
CUDA_VISIBLE_DEVICES="" pytest test/distributed/_tensor/test_matrix_ops.py -s -k empty_operand
pytest test/distributed/_tensor/test_matrix_ops.py -s -k empty_operand
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117726
Approved by: https://github.com/wanchaol
2024-01-24 03:17:18 +00:00
e9c240670f [sigmoid] Add canonicalized IR as an option. (#116758)
Summary: As titled, the "canonical" flag is added to the sigmoid serializer so that we can optionally "normalize" the IR to give stable names and orderings to IR nodes, which helps when comparing IR definitions.

Test Plan: buck run @//mode/opt //aps_models/ads/config_model_authoring/stability:cli export-generated-module-state-command

Differential Revision: D52431965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116758
Approved by: https://github.com/avikchaudhuri
2024-01-24 03:11:25 +00:00
21e8546b11 [inductor][fx] Fix broadcast_tensors with unbacked symints when translation validation is off (#118066)
## Context
This is an example that runs into an AssertionError while lowering in Inductor.
```
# While lowering, b will be expanded because b.size(1) == 1.
a = torch.zeros([u0, 512])
b = torch.ones([u0, 1])
return a * b
```

Below is the tail end of the stack trace. Here are the important bits:
1. In _inductor/sizevars.py, we'll call `self.shape_env.defer_runtime_assert(expr, msg, fx_node=V.graph.current_node)`.
2. This leads to the creation of a `ShapeEnvEvent` with an FX node via `kwargs={"fx_node": V.graph.current_node}` ([see](0c9b513470/torch/fx/experimental/recording.py (L245-L247))).
3. Eventually, we try to call `maybe_convert_node()` but it expects translation validation to be on ([see](0c9b513470/torch/fx/experimental/recording.py (L118-L121))).
```
  File "pytorch/torch/_inductor/lowering.py", line 221, in transform_args
    for i, x in zip(indices, broadcast_tensors(*[args[i] for i in indices])):
  File "pytorch/torch/_inductor/lowering.py", line 294, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "pytorch/torch/_inductor/lowering.py", line 676, in broadcast_tensors
    x = expand(x, target)
  File "pytorch/torch/_inductor/lowering.py", line 294, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "pytorch/torch/_inductor/lowering.py", line 793, in expand
    return TensorBox(ExpandView.create(x.data, tuple(sizes)))
  File "pytorch/torch/_inductor/ir.py", line 1871, in create
    new_size = cls._normalize_size(x, new_size)
  File "pytorch/torch/_inductor/ir.py", line 1862, in _normalize_size
    new_size[i] = V.graph.sizevars.expect_equals(
  File "pytorch/torch/_inductor/sizevars.py", line 338, in expect_equals
    self.expect_true(sympy.Eq(left, right), msg=msg)
  File "pytorch/torch/_inductor/sizevars.py", line 333, in expect_true
    self.shape_env.defer_runtime_assert(expr, msg, fx_node=V.graph.current_node)  # (1) is here
  File "pytorch/torch/fx/experimental/recording.py", line 257, in wrapper
    return event.run(self)   # (2) happens right before this
  File "pytorch/torch/fx/experimental/recording.py", line 155, in run
    replacearg(index=3, key="fx_node", fn=maybe_convert_node)
  File "pytorch/torch/fx/experimental/recording.py", line 138, in replacearg
    kwargs[key] = fn(kwargs[key])
  File "pytorch/torch/fx/experimental/recording.py", line 128, in maybe_convert_node
    assert hasattr(shape_env, "name_to_node")  # (3) is here
```

## Approach
Since [translation validation](c6be5d55a5/torch/fx/experimental/validator.py (L574)) may not be on during Inductor lowering, we can check if that's True and return the FX node's name in this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118066
Approved by: https://github.com/ezyang, https://github.com/peterbell10
2024-01-24 03:07:30 +00:00
41a56f7828 Fix swap_tensors to swap PyObjects associated with TensorImpl (#116955)
This PR intends to fix the following issue when swapping two tensors

```python
>>> import torch
>>> torch.manual_seed(5)
>>> t1 = torch.randn(2)
>>> t2 = torch.randn(3)
>>> t1
tensor([-0.4868, -0.6038])
>>> t2
tensor([-0.5581,  0.6675, -0.1974])
>>> torch.utils.swap_tensors(t1, t2)
>>> t1
tensor([-0.5581,  0.6675, -0.1974])
>>> t2
tensor([-0.4868, -0.6038])
>>> t1.fill_(0.5) # t1 back to its unswapped state :o
tensor([-0.4868, -0.6038])
```

What happens here is that in `THPVariable_Wrap` (which is used when going back from C++ --> Python), we check if the TensorImpl of the tensor to be returned already has a pointer to a PyObject in its PyObject slot. If this is the case then this object is returned.

57491d2046/torch/csrc/autograd/python_variable.cpp (L271-L292)

When we run any operation that returns the same TensorImpl (e.g. inplace op, `t.to(dtype=t.dtype)`, etc.), although `t1` now has `t2`'s TensorImpl, `t2`'s TensorImpl still has a reference to `t2`, so when we do the op on `t1` and `THPVariable_Wrap` attempts to return the pointer to the TensorImpl's PyObject, we return a pointer to `t2` instead.

The TensorImpl should have the PyObjects in their PyObjectSlots swapped as well in `swap_tensors`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116955
Approved by: https://github.com/albanD
2024-01-24 01:40:18 +00:00
fc30c4d769 Migrate forloop directional tests to OptimizerInfo (#117410)
This PR is another step towards modernizing our optimizer tests by tackling the simplest foreach tests. The replaced tests are now removed in `test/optim/test_optim.py`.

**Changes in coverage?** Yes!
- This PR _decreases_ coverage (!!!!) by only checking the direction on the forloop implementations vs both the forloop and foreach. Why? I believe it should be sufficient to check the forloop only, as the foreach parity is already checked in the `foreach_matches_forloop` test.
- This PR also _increases_ coverage for SparseAdam with contiguous params on CUDA, which was previously forbidden due to an old bug that has since been fixed.

What will it take to fully remove `test_basic_cases`?
- We need to flavor the tests with LRSchedulers
- Testing for param groups --> which all just distinguish between lrs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117410
Approved by: https://github.com/albanD
2024-01-24 01:28:40 +00:00
5b671ce486 [dynamo] fix typo in 3.11 resume_execution.py (#118108)
whoopsie

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118108
Approved by: https://github.com/angelayi, https://github.com/zou3519
2024-01-24 00:59:04 +00:00
b7b1affe97 Add half specializations for load of sum (#106454)
Add half specializations for load of sum

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106454
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-24 00:35:20 +00:00
c0732c8d5e [Dynamo] Add complex to literal constant (#117819)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117819
Approved by: https://github.com/zou3519
2024-01-23 23:46:46 +00:00
cd084c4909 Add TensorIteratorConfig::add_const_input to avoid COW materialize (#118053)
Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118053
Approved by: https://github.com/ezyang
2024-01-23 22:32:39 +00:00
abd759d50d [fx] Add hooks to intercept node replacements. (#117825)
Summary: Adds an experimental API to FX GraphModule that places "hooks" every time we change or replace nodes in a graph, so that we can properly update the new name in the graph signature and potentially other places.

Test Plan:
buck test mode/opt  -c fbcode.enable_gpu_sections=true caffe2/test/distributed/_tensor/experimental:tp_transform

buck test mode/opt caffe2/test:test_export -- -r test_replace_hook

Differential Revision: D52896531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117825
Approved by: https://github.com/avikchaudhuri
2024-01-23 22:28:40 +00:00
b369888bec Replace constraints with dynamic_shapes in caffe2/test/cpp & torchrec/distributed/tests/test_pt2 (#118026)
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `caffe2/test/cpp` and `torchrec/distributed/test/test_pt2`.

Test Plan: CI

Differential Revision: D52977354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118026
Approved by: https://github.com/chenyang78
2024-01-23 22:15:15 +00:00
6ac284122b [Memory Snapshot] Track context for SEGMENT_FREE and SEGMENT_UNMAP (#118055)
Summary: Show the stack when SEGMENT_FREE and SEGMENT_UNMAP occur. This may be useful for debugging, for example when empty_cache() causes a segment to be freed. If the free context is unavailable, fall back to the segment allocation stack.
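A usage sketch of the snapshot tooling this touches (the underscore-prefixed APIs are private; the file name is arbitrary):

```python
import torch

torch.cuda.memory._record_memory_history()      # record allocator events with stacks
x = torch.randn(1024, 1024, device="cuda")
del x
torch.cuda.empty_cache()                         # may emit SEGMENT_FREE / SEGMENT_UNMAP
# With this change those events carry the freeing stack (or, if unavailable,
# the segment's allocation stack) in the dumped snapshot.
torch.cuda.memory._dump_snapshot("snapshot.pickle")
```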

Test Plan: CI

Differential Revision: D52984953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118055
Approved by: https://github.com/zdevito
2024-01-23 21:48:57 +00:00
c6930aad46 Update Triton pin (#117873)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117873
Approved by: https://github.com/shunting314, https://github.com/malfet
2024-01-23 21:05:30 +00:00
13d2cdffa2 Remove optimizer.step patching for profiler hook (#115772)
1. I'd like to remove the patching that avoids the profiler hook, but it adds an additional graph break due to nested wrappers. #117767 if interested, see (internal only) paste for [before](P996529232) and [after](P997507449) this PR.

```
I've locally run perf benchmarks for yolov3: Before the speedup is 4.183x, and after it is 4.208x.
I've also run it for resnet50: before, speedup is 3.706x and now it is 3.924x.
```

2. @mlazos I now unwrap twice in the dynamo and inductor tests. This feels like we're testing deficiently--should we add tests to test that tracing through the profiler hook and the use_grad hook are functioning according to expectations (I know there's at least one graph break in one).
3. There's a strange memory thing going on...what is happening? This has been resolved with @voznesenskym's [change](https://github.com/pytorch/pytorch/pull/116169). (for details see below)

<details>
This PR will fail the test_static_address_finalizer test due to a mysterious thing that is happening (idk what, but maybe the dynamo cache or a frame _expecting_ the patching to have been done).

There is no Python refcycle, as the backrefs for `p_ref()` look like:
![image](https://github.com/pytorch/pytorch/assets/31798555/4d6cbf50-3924-4efe-b578-d93389eebec8)
(so 5 backrefs but none of them python)

And the refs:
![image](https://github.com/pytorch/pytorch/assets/31798555/25e01105-bcb9-44ca-997a-2cf1670a6d42)
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115772
Approved by: https://github.com/jansel, https://github.com/mlazos
2024-01-23 20:15:41 +00:00
77705e7486 [dtensor] fix unnecessary redistribute in new_factory_strategy (#118037)
**Summary**
Previously, assuming `x` is a DTensor with non-replicate placement, calling `x.new_full` would create a replicated (but unused) copy of `x`, incurring unnecessary communications. This PR fixes the issue.

**Test**
`python test/distributed/_tensor/test_tensor_ops.py -k test_new_full`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118037
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2024-01-23 19:35:43 +00:00
58e7ec5843 Revert "Log stack trace of mutated idx (#117720)"
This reverts commit 365c7a292fedbf776014b878849ebd3dcb7463f0.

Reverted https://github.com/pytorch/pytorch/pull/117720 on behalf of https://github.com/eellison due to cause of https://github.com/pytorch/pytorch/issues/118104 ([comment](https://github.com/pytorch/pytorch/pull/117720#issuecomment-1906693119))
2024-01-23 18:40:20 +00:00
364728b27b Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before yours segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-23 18:39:30 +00:00
5ec2d7959d Revert "[ez] Provide a slightly better error message if process times out (#117865)"
This reverts commit 5538b37a065e5a68c3fb9d1f8eaa3e4fd12fd0b8.

Reverted https://github.com/pytorch/pytorch/pull/117865 on behalf of https://github.com/clee2000 due to Does not play nice with retry_shell, which expects timeoutexpired, but i cant control the error message of that ([comment](https://github.com/pytorch/pytorch/pull/117865#issuecomment-1906640922))
2024-01-23 18:13:41 +00:00
6784594532 Fix sparse windows on CPU with MKL (#102604)
Fix https://github.com/pytorch/pytorch/issues/97352.
This PR changes the way the linking to Intel MKL is done and updates MKL on Windows to mkl-2021.4.0.
Both conda and pip provide MKL package versions that you can link against dynamically: mkl-devel contains the static versions of the libraries and MKL contains the DLLs needed at runtime. MKL DLLs and static libs starting with 2021.4.0 have the version in their names (for MKL 2023 we have mkl_core.2.dll and for 2021.4.0 we have mkl_core.1.dll), so it is possible to have multiple versions installed and it will work properly.
For the wheel build I added a dependency on the MKL wheel, for conda a dependency on the conda MKL package, and for libtorch I copied the MKL binaries into libtorch.
In order to test this PR I had to use a custom builder https://github.com/pytorch/builder/pull/1467

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102604
Approved by: https://github.com/IvanYashchuk, https://github.com/malfet
2024-01-23 17:41:18 +00:00
7598a4efdc [ROCm] Disable MIOpen for empty tensors for RNN (#117672)
Some MIOpen RNN functions (lstm, rnn, gru) can't work with empty tensors and return the error "MIOpen Error: Lengths must be > 0".
This PR disables MIOpen for empty tensors and forces the use of the native methods.
The solution is based on the condition used for CUDNN 3a52147cc5/aten/src/ATen/native/TensorProperties.cpp (L91)
It also fixes [test_nn.py::TestNN::test_RNN_input_size_zero](29fa6fbc4e/test/test_nn.py (L4592)) on ROCm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117672
Approved by: https://github.com/cpuhrsch
2024-01-23 17:30:18 +00:00
0c9b513470 [Export] Fix serialize_metadata (#118031)
Summary: As title.

Test Plan: CI

Differential Revision: D52979069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118031
Approved by: https://github.com/zhxchen17
2024-01-23 17:03:04 +00:00
9ebaa27922 Fix types.MethodDescriptorType related bug in dynamo (#118041)
Methods that were `types.MethodDescriptorType` were failing because the `tensor.method()` to `method(tensor)` conversion was dropping the tensor and just calling `method`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118041
Approved by: https://github.com/yanboliang
ghstack dependencies: #118000
2024-01-23 16:11:38 +00:00
3b38f7b266 Remove skips for passing tests (#118000)
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 16:11:38 +00:00
3ec4f00316 [inductor] Allow reinplacing functionalized scatter ops (#116899)
This expands the reinplacing pass to allow reinplacing view-scatter operations.
e.g. if our python code is:
```
a = view1(inp)
b = view2(a)
b.copy_(src)
```
this generates a functionalized graph like:
```python
a = view1(inp)
a_updated = view2_scatter(a, src)
inp_updated = view1_scatter(inp, a_updated)
```

First, the `canonicalize_view_scatter_ops` step rewrites the functionalized graph
in the form:
```python
inp_updated = _generalized_scatter(inp, src, [view1, view2])
a_updated = view1(inp_updated)
```

I then register `_generalized_scatter` as a normal inplacable op which can be
handled by the pre-existing mechanism. Since we've fused the two scatter ops into one,
the reinplacing pass sees only one user of `inp` which allows the entire operation to be
reinplaced  if desired (and I add heuristics that sometimes choose not to reinplace).

Finally, there is a decomposition step which decomposes out-of-place or in-place
`_generalized_scatter` operations either back into view_scatter operations, or
into the version with mutations. When introducing mutations, the reinplaced
version is equivalent to the original mutation:
```
a = view1(inp)
b = view2(a)
b.copy_(src)
```

Or when out-of-place we end up with a minor restructuring of the graph:
```
a = view1(inp)
tmp = view2_scatter(a, src)
inp_updated = view1_scatter(inp, tmp)
a_updated = view1(inp_updated)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116899
Approved by: https://github.com/lezcano
ghstack dependencies: #116898, #117121
2024-01-23 15:31:28 +00:00
5502a63b22 [inductor] Allow reinplacing before meta-only users (#117121)
Currently if you have the code:
```python
idx = torch.arange(10, device=x.device)
src = torch.ones(10, dtype=x.dtype, device=x.device)
x.index_put_((idx,), src)
expand = x.expand((2, x.shape[0]))
```

The `index_put_` cannot be reinplaced under dynamic shapes due to the user
`aten.sym_size(x, 0)` however since this function only looks at the tensor
metadata, it is actually fine to reinplace.

Here I ignore these operators in the analysis of the reinplacing pass, so
reinplacing can happen under dynamic shapes as well. I also handle cases
where views are created just to be fed to `sym_size`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117121
Approved by: https://github.com/lezcano
ghstack dependencies: #116898
2024-01-23 15:31:28 +00:00
eb0fcab421 [inductor] Move reinplace pass to its own file (#116898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116898
Approved by: https://github.com/lezcano
2024-01-23 15:31:28 +00:00
e309d6fa1c Better unsupported op error message (#117770)
Previously, if someone wrote a python abstract impl but didn't import
the module it is in, then we would raise an error message suggesting
that the user needs to add an abstract impl for the operator.

In addition to this, we suggest that the user try importing the module
associated with the operator in the pystub (it's not guaranteed that
an abstract impl does exist) to avoid confusion.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117770
Approved by: https://github.com/ydwu4, https://github.com/williamwen42
2024-01-23 15:05:16 +00:00
4d625c1c92 [AOTI] Fix a bug in the torch._export.aot_load API (#118039)
Summary:
tree_flatten_spec should use args instead of *args

clone of https://github.com/pytorch/pytorch/pull/117948 but with some fbcode specific changes

Test Plan: CI

Differential Revision: D52982401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118039
Approved by: https://github.com/angelayi
2024-01-23 14:54:02 +00:00
bff348b28f [AOTI] Add missing include to model.h (#118075)
At least if one tries to compile the AOTI code on Darwin, compilation
fails with an "implicit instantiation of undefined template" error:
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3:
/Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:69:21: error: implicit instantiation of undefined template 'std::basic_stringstream<char>'
  std::stringstream ss;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118075
Approved by: https://github.com/desertfire
ghstack dependencies: #118074
2024-01-23 14:34:00 +00:00
2963e85a3f [EZ][AOTI] Fix typos (#118074)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118074
Approved by: https://github.com/desertfire
2024-01-23 14:34:00 +00:00
ae459c5809 Don't use private accessor on SymNode to get _expr (#118007)
This materially impacts https://github.com/pytorch/pytorch/pull/117862
split this out for testing

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118007
Approved by: https://github.com/tugsbayasgalan
2024-01-23 14:29:19 +00:00
73c9be1395 Don't use private accessor on SymNode to get _expr (round 2) (#118013)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118013
Approved by: https://github.com/tugsbayasgalan
2024-01-23 14:29:12 +00:00
905a7cc340 [ROCm] skip test_eager_transforms.py test_compile_vmap_hessian_cuda (#118009)
Memory leak detected on ROCm.  Skip until it can be addressed.

PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test_eager_transforms.py -k test_compile_vmap_hessian_cuda

See #117642 for moving rocm CI to unstable due to this test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118009
Approved by: https://github.com/jeanschmidt
2024-01-23 09:57:18 +00:00
4cfd16cb6d [Inductor] optimize transpose_mxn with bf16 data type (#117958)
**Summary**
Add a vectorized implementation of `transpose_mxn` for the BFloat16 data type when the matrix size is 16x16 or 32x32, as observed in Stable Diffusion BF16.

**TestPlan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_transpose_mxn_16_16_bf16_fp16
python -u -m pytest -s -v test_cpu_repro.py -k test_transpose_mxn_32_32_bf16_fp16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117958
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-01-23 09:43:35 +00:00
40890ba8e7 [CI] Add python test skip logic for XPU (#117621)
Add python test skip logic for XPU

For testing purposes, cherry-pick #116833 & #116850 first; the xpu test passed https://github.com/pytorch/pytorch/actions/runs/7566746218/job/20604997985?pr=117621. Revert them now.

Works for #114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117621
Approved by: https://github.com/huydhn
2024-01-23 08:20:42 +00:00
455bba38f4 [C10D] Make Flight Recorder report time_created in ns (#118047)
Addresses (6) from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118047
Approved by: https://github.com/zdevito
ghstack dependencies: #118044, #118046
2024-01-23 08:18:08 +00:00
5df92a9244 [C10D] Add version tag to NCCL Flight Recorder Dump (#118046)
Addresses (3) from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118046
Approved by: https://github.com/zdevito
ghstack dependencies: #118044
2024-01-23 08:18:08 +00:00
dace1fda2e [C10D] Make NCCL Flight Recorder dump produce a dict (#118044)
Putting the list of entries into a particular key of a top-level dict
paves the way for adding other metadata as other top level keys.

Addresses 1 and 2 from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118044
Approved by: https://github.com/zdevito
2024-01-23 08:18:08 +00:00
28c8a07b4d add mask_convert_to_lp to support bool->fp16/bf16 convert (#117830)
Fix
https://github.com/pytorch/pytorch/issues/117624
https://github.com/pytorch/pytorch/issues/117627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117830
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-01-23 07:52:43 +00:00
6049998971 [C10D] Finer-grain nccl heartbeat, avoid false positive hangs (#118016)
Summary:
Previously, the heartbeat was incremented once per finishing a for loop over a list
of in-progress work items, under the assumption that either the processing
would be predictably quick, or it would hang completely.

In fact, there can be CUDA API contention that causes the processing of work items
to slow down arbitrarily but not truly deadlock.  To guard against this, we
bump the heartbeat at the smallest unit of progress: one work item being
successfully processed.
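A Python-flavored sketch of the change (the real code is the C++ ProcessGroupNCCL watchdog; `work_items` and `heartbeat` are hypothetical stand-ins):

```python
def watchdog_pass(work_items, heartbeat):
    # Iterate over the in-progress work list, retiring whatever has finished.
    for work in list(work_items):
        if work.is_completed():
            work_items.remove(work)
            heartbeat.increment()  # new: bump per successfully processed work item
    # old behavior: heartbeat.increment() only here, once per full pass, so a
    # slow-but-progressing pass was indistinguishable from a hang
```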

Test Plan: CI

Differential Revision: D52973948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118016
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
2024-01-23 07:25:18 +00:00
a8978d3676 [dynamo] Add size(), get_coordinate() support for DeviceMesh in dynamo (#117710)
Summary: This fix is part of: https://github.com/pytorch/pytorch/issues/117670

Test Plan: Unit tetst and CI

Differential Revision: D52857348

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117710
Approved by: https://github.com/wconstab, https://github.com/yanboliang, https://github.com/wanchaol, https://github.com/anijain2305
2024-01-23 07:10:52 +00:00
bb28965924 Revert "Remove skips for passing tests (#118000)"
This reverts commit 3c339b5b21fdbd530f82765f84bcabde8266d3e0.

Reverted https://github.com/pytorch/pytorch/pull/118000 on behalf of https://github.com/oulgen due to test passing on diff but failing on hud... ([comment](https://github.com/pytorch/pytorch/pull/118000#issuecomment-1905351752))
2024-01-23 06:10:25 +00:00
suo
d84173c025 [export] fix unlifting of custom class constants (#117979)
we didn't have a test covering this case, add one.

Aside: we should invest in actually unit testing the lifting/unlifting passes, both separately and also against each other. I have a diff cooking for that.

Differential Revision: [D52962180](https://our.internmc.facebook.com/intern/diff/D52962180/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117979
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #115222, #117978
2024-01-23 05:51:00 +00:00
suo
7b0979ef8e [export] fixes to unflatten + custom obj composition (#117978)
The test I added for this didn't actually enable torchbind tracing, oops. Fix that and fix the issues that cropped up.

Differential Revision: [D52962205](https://our.internmc.facebook.com/intern/diff/D52962205/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117978
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #115222
2024-01-23 05:50:41 +00:00
e056cf5507 [ac][pattern matcher] Do not percolate tags beyond the inputs of matched portion (#118034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118034
Approved by: https://github.com/yf225
2024-01-23 05:02:32 +00:00
3708f2608e [DTensor] Skip [add]mm empty tensor test (#118045)
As DTensor does not support multiplication of [4,0] and [0,4] matrices

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118045
Approved by: https://github.com/yf225, https://github.com/wanchaol
2024-01-23 04:08:11 +00:00
0036385b55 [Inductor][Reliability] Add runtime numeric check for pt2 Optimus in the pre grad pass (#115142)
Summary: Titled

Test Plan:
# local reproduce
Patch `icfg.fx_passes_numeric_check["pre_fx_passes"] = True`
```
buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```
P965217137

# MC candidates
### FIRST + CMF
f520754604
P1056796962
### ICVR
f520816217
P1056839342
### IG_CTR
f520819178
P1056903302
### MAI
f520823559
P1057712009
### AFOC
f520822438
P1057760058
### DPA
f520826815
P1057808574
### How the runtime numeric check catches [SEVs](https://docs.google.com/document/d/1WOtlbgCBbmU1klK1LiGSO0lYf_7mtSP4nAdvhQHM0JE/edit#heading=h.k61fy2rhaijp)
bug fix diff: D51378532
### CMF+(FIRST)
f509587388
P1058305139
by running the numeric check, we can catch the forward loss differences (e.g., diffing(https://www.internalfb.com/intern/diffing/?paste_number=1058293804))
https://pxl.cl/4bQDG

f501760099
P1058400691
by running the numeric check, we can catch the forward loss differences (e.g., diffing(https://www.internalfb.com/intern/diffing/?paste_number=1058412054))
https://pxl.cl/4bQMw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115142
Approved by: https://github.com/jackiexu1992, https://github.com/yanboliang
2024-01-23 03:56:50 +00:00
3c339b5b21 Remove skips for passing tests (#118000)
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 03:41:23 +00:00
4646d0e1b2 Update xla.txt (#117999)
XLA CI is currently broken in PyTorch. I think there are 2 reasons causing that:
1. There is an offending PyTorch PR c393b2f1ee. Han is working on a fix in https://github.com/pytorch/xla/pull/6345
2. The commit that the PyTorch pin points to, 2990cb38c17e06d0dbe25437674ca40130d76a8f, was not a valid commit. I think this is because we tried to help them land a breaking PR in https://github.com/pytorch/xla/pull/6307; however, we did a rebase which made that commit vanish, so now the CI fails with
```
fatal: reference is not a tree: 2990cb38c17e06d0dbe25437674ca40130d76a8f
585
```
Let me first update the pin to master so it at least runs some tests; this way we can discover if there are any additional issues. I will rebase after @qihqi 's fix passes all CI

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117999
Approved by: https://github.com/clee2000
2024-01-23 03:36:32 +00:00
fed45aee54 Replace invoking self.value if there is a user defined init, avoiding arbitrary code execution (#117818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117818
Approved by: https://github.com/ezyang
2024-01-23 03:14:58 +00:00
dc1b9d758e Update passrate calculation script to skip inductor and export (#118030)
We don't want to count running test/inductor/ and test/export/ with
PYTORCH_TEST_WITH_DYNAMO=1 as a part of the pass rate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118030
Approved by: https://github.com/ydwu4
ghstack dependencies: #117998
2024-01-23 02:33:57 +00:00
162f643090 Script to generate failures histogram (#118008)
Generates something that looks like
https://gist.github.com/zou3519/43aa8ef28a327bd68cfbac83d84c0999
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118008
Approved by: https://github.com/yanboliang, https://github.com/oulgen
2024-01-23 02:28:55 +00:00
af7cd5c32a [Dynamo] Install module globals per output_graph (#117998)
Fixes https://github.com/pytorch/pytorch/issues/117851

In tests, we ran into an issue where:
- In frame A, Dynamo would install a global
- We call reset()
- reset() did not delete the installed global due to a refcycle
- In frame B, Dynamo would re-use the same global
- Python gc ran, deleting the installed global, leading to the compiled
  version of frame B raising NameNotFound

This PR changes the following:
- module globals are now installed at a per-frame basis.
- renames install_global to install_global_unsafe: if the names are not
  unique and end up being re-used across frames, then we've got trouble.

Test Plan:
- I tested that this got rid of the test flakiness locally. I'm not sure
  how to easily write a test for this, because I don't actually know
  what the refcycle in the above is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117998
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-01-23 02:28:02 +00:00
a85fd20d45 [ONNX] Improve support to mmap for ONNXProgram.save (#117863)
Currently, when the user passes a model state_dict which is not a file,
ONNXProgram.save calls torch.save along with io.BytesIO, which does not
support memory mapping. That causes the file stream to be fully allocated in
memory.

This PR removes the torch.save call and passes the dict directly to the
serializer. This is beneficial for the scenario where model_state_dict
is generated by torch.load(..., mmap=True), as the state dict will be
mapped in memory instead of fully loaded into memory.

This PR leverages https://github.com/pytorch/pytorch/pull/102549
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117863
Approved by: https://github.com/wschin
2024-01-23 02:00:00 +00:00
052860294f Replace constraints with dynamic_shapes in export-to-executorch tutorial (#117916)
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in export-to-executorch tutorial.

Test Plan: CI

Differential Revision: D52932772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117916
Approved by: https://github.com/angelayi, https://github.com/avikchaudhuri
2024-01-23 01:17:19 +00:00
d810b10232 Add beta1 support to CyclicLR momentum (#113548)
Fixes #73910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113548
Approved by: https://github.com/janeyx99
2024-01-23 01:16:58 +00:00
d01ba4e94e enable fp8 cast for inductor CPU (#117737)
Enable FP8 cast for this issue https://github.com/pytorch/pytorch/issues/117119.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117737
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-01-23 01:16:15 +00:00
d8420c0b0c [Nested Tensor]Add helper functions to set max_seqlen/min_seqlen directly (#117815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117815
Approved by: https://github.com/soulitzer
2024-01-23 01:00:45 +00:00
a27a6e8cf1 [ROCm] skip test_sparse_csr test_triton_bsr_softmax_cuda (#118006)
The tests were taking too long and leading to CI timeouts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118006
Approved by: https://github.com/huydhn
2024-01-23 00:09:42 +00:00
c6be5d55a5 Migrate param_group testing to OptimizerInfo (#117675)
Today, our param_group testing does the equivalent of giving weight and bias different optimizer hyperparams and then checking that the overall result moves in the right direction based on maximize.

This PR introduces two tests to encompass coverage:
1. For every optimizer input (no differentiable), always force bias to have 0 weight_decay, and then check that the direction is expected. This is basically a replica of today's tests, but is more methodical as the test is a real use case.
2. To ensure that the different groups have distinct behavior, I added another test where lr is basically 0 in the default group, and ensure that the param in the default group doesn't move while the loss does.

Together, these tests do a better job of testing param groups than today's tests, **though we do lose some flavors**. For example, RMSProp also pits centered=True vs False across the param_groups, Adadelta has a variation on rho, and ASGD has a variation for t0. I don't think this is really a loss, as the previous test was just testing for direction and our new tests test stronger guarantees.

The leftover param group configs are used in conjunction with LRSchedulers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117675
Approved by: https://github.com/albanD
2024-01-22 23:48:46 +00:00
d280b6ae58 Ensure that deleter is called even for a no-data tensor. (#117418)
Summary:

When using a custom deleter, InefficientStdFunctionContext was using a
std::unique_ptr<> to store the pointer and call the deleter, but this failed to
call the deleter if the pointer was null. Since we have a separate holder class
anyway, take out the std::unique_ptr<> and call the deleter directly.

Fixes #117273

Test Plan:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117418
Approved by: https://github.com/wjakob, https://github.com/yanboliang
2024-01-22 23:27:27 +00:00
cef5b93f28 [ez] Serial when NUM_PROCS is 1 (#117977)
Makes it easier to understand what's going on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117977
Approved by: https://github.com/huydhn
2024-01-22 23:11:41 +00:00
f9fca33baf [codemod][highrisk] Fix shadowed variable in caffe2/caffe2/onnx/onnx_exporter.cc (#117996)
Summary:
Our upcoming compiler upgrade will require us not to have shadowed variables. Such variables have a _high_ bug rate and reduce readability, so we would like to avoid them even if the compiler was not forcing us to do so.

This codemod attempts to fix an instance of a shadowed variable. Please review with care: if it's failed the result will be a silent bug.

**What's a shadowed variable?**

Shadowed variables are variables in an inner scope with the same name as another variable in an outer scope. Having the same name for both variables might be semantically correct, but it can make the code confusing to read! It can also hide subtle bugs.

This diff fixes such an issue by renaming the variable.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: igorsugak

Differential Revision: D52582853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117996
Approved by: https://github.com/PaliC, https://github.com/kit1980, https://github.com/malfet
2024-01-22 22:57:06 +00:00
b901999350 [inductor] For View.create(x, sizes) call realize_input() instead of realize() when handling unbacked symints (#117013)
# Context
Let's say we do `View.create(x, sizes)` where `x` is a `SliceView` and `sizes` contains unbacked symints e.g. `sizes = [i14, 256]`. Then we'll run ([this code](7e37f63e5e/torch/_inductor/ir.py (L2058-L2071))) where we:
1. Call `x.realize()` -- SliceView(Pointwise) -> SliceView(ComputedBuffer).
2. Retrieve storage & layout via `as_storage_and_layout(x)`
3. Calculate `new_layout` based off layout & `new_sizes`
3. `return ReinterpretView(storage, new_layout)`
However, (2) will raise `NotImplementedError` ([see](7e37f63e5e/torch/_inductor/ir.py (L1704-L1731))) since `x` is a `SliceView` and that isn't supported.

Thus, I tried adding support for `SliceView` in `as_storage_and_layout`. This worked for my case, but if instead `sizes` had backed symints e.g. `sizes=[s0, 256]` then some benchmarked models lost accuracy.
```
    if isinstance(x, SliceView):
        return as_storage_and_layout(
            x.data,
            freeze=freeze,
            want_contiguous=want_contiguous,
            stride_order=stride_order,
        )
```

So instead of the above, I tried unwrapping the `SliceView` via `x = x.unwrap_view()`. This works for my use case and passes CI but I'm not entirely sure why. If we unwrap our `SliceView` and create a `ReinterpretView`, I'd assume we'd lose the reindexer from `SliceView`. ~~But maybe we can re-create the same indexing from the `ReinterpretView`'s strides?~~ edit: we do lose vital information (like offset) when you release your `SliceView` and create a `ReinterpretView`, so that's a no-go.

Moving onto the final version of this PR. We call `ExternKernel.realize_input()` (feels a bit weird to use `ExternKernel` but it's exactly what I need). It will go ahead and handle our `SliceView` case ([see](a468b9fbdf/torch/_inductor/ir.py (L3733-L3739))) by converting it to a `ReinterpretView` with the correct offset.

# Test
```
$ python test/inductor/test_unbacked_symints.py
..
----------------------------------------------------------------------
Ran 10 tests in 20.813s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117013
Approved by: https://github.com/jansel, https://github.com/ezyang
2024-01-22 22:34:10 +00:00
f96b7d06d7 [export] skip export tests when test with dynamo in ci (#117988)
Fixes https://github.com/pytorch/pytorch/issues/117947.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117988
Approved by: https://github.com/suo, https://github.com/zou3519
2024-01-22 22:14:36 +00:00
c14751b6cf Remove extraneous [[fallthrough]] in ivalue.cpp (#117985)
Test Plan: Sandcastle

Differential Revision: D52963965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117985
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-01-22 21:54:39 +00:00
b5799d9977 Revert "[c10d] Barrier uses stream sync instead of device sync (#117804)"
This reverts commit 0f6bbb1c070c3a9713893659377e20e147c125f6.

Reverted https://github.com/pytorch/pytorch/pull/117804 on behalf of https://github.com/clee2000 due to sorry the docs test failure is real, I think it wants the lines after the .. note to be indented https://github.com/pytorch/pytorch/actions/runs/7616827874/job/20745016788.  Marking as nosignal due to bad Dr. CI categorization ([comment](https://github.com/pytorch/pytorch/pull/117804#issuecomment-1904882487))
2024-01-22 21:54:03 +00:00
792dfa7e16 Allow dynamic shapes of tuple type for inputs of dataclass type (#117917)
Summary:
In `torch.export.export(f, args, kwargs, ..., dynamic_shapes=None, ...)`, `dataclass` is an acceptable type of input (for args and kwargs). The `dynamic_shapes` of a `dataclass` input needs to be the same `dataclass` type, with each tensor attribute replaced by the `dynamic_shapes` of the corresponding tensor. (https://github.com/pytorch/pytorch/blob/main/torch/export/dynamic_shapes.py#L375)

However, some `dataclass` may have limitations on the types of attributes (e.g., having to be tensors) such that the same `dataclass` cannot be constructed for dynamic shapes.

For an input of `dataclass` type, this task enables a `dynamic_shapes` of a tuple type that specifies dynamic shape specifications for each tensor of the input in the same order as the input dataclass type's flatten_fn (https://github.com/pytorch/pytorch/blob/main/torch/utils/_pytree.py#L103)
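A hedged sketch of the new spelling (the dataclass, its pytree registration, and the module are illustrative, not from the PR):

```python
import dataclasses
import torch
import torch.utils._pytree as pytree

@dataclasses.dataclass
class Inp:
    x: torch.Tensor
    y: torch.Tensor

# Register the dataclass as a pytree node so export flattens it in (x, y) order.
pytree.register_pytree_node(
    Inp,
    lambda d: ((d.x, d.y), None),
    lambda children, _: Inp(*children),
)

class M(torch.nn.Module):
    def forward(self, inp: Inp):
        return inp.x + inp.y

batch = torch.export.Dim("batch")
# The spec for the dataclass input no longer has to be another Inp instance;
# a tuple ordered like the flattened tensors (x, then y) is accepted.
ep = torch.export.export(
    M(),
    (Inp(torch.randn(4, 8), torch.randn(4, 8)),),
    dynamic_shapes=(({0: batch}, {0: batch}),),
)
```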

Test Plan: buck test //caffe2/test:test_export

Differential Revision: D52932856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117917
Approved by: https://github.com/avikchaudhuri
2024-01-22 21:50:28 +00:00
4df65bf51b Optimize recursive_add_node in fx splitter (#117969)
Summary: The `FxNetAccFusionsFinder.recursive_add_node` function can run into exponential complexity when applied to an fx graph with multiple densely connected layers of nodes. Here we add a `visited` set, which reduces the worst-case complexity to linear.
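The idea, as a generic sketch (names are illustrative, not the actual splitter code):

```python
def collect_reachable(node, deps, visited=None):
    """Walk `deps` (node -> iterable of upstream nodes) visiting each node once.

    Without `visited`, a densely connected layered graph revisits shared
    ancestors exponentially many times; with it, the traversal is linear
    in the number of nodes and edges.
    """
    if visited is None:
        visited = set()
    if node in visited:
        return visited
    visited.add(node)
    for parent in deps.get(node, ()):
        collect_reachable(parent, deps, visited)
    return visited
```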

In the internal MRS models with the densely connected layer structure, this fix reduces the fx split time from forever to < 100ms, hence unblocking the internal enablement.

P.S. As much as I want to add a unit test, I can't find any existing tests for the `_SplitterBase` infra. Happy to add one if pointed to where. Thanks!

Test Plan: CI

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D52951321](https://our.internmc.facebook.com/intern/diff/D52951321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117969
Approved by: https://github.com/oulgen, https://github.com/khabinov
2024-01-22 21:49:36 +00:00
86e8551446 [dtensor] switch softmax forward ops to OpStrategy (#117723)
**Summary**
This PR switches the softmax and log_softmax ops to use OpStrategy instead of rules. This PR also adds support when the softmax dimension is sharded -- a replication is performed before computation.

**Test**
`python test/distributed/_tensor/test_math_ops.py -k test_softmax_fwd`
`python test/distributed/_tensor/test_math_ops.py -k test_softmax_with_bwd`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117723
Approved by: https://github.com/XilunWu
2024-01-22 21:26:48 +00:00
fdac55c35d Added example regarding weight_decay distinction with per-parameter API (#117436)
Added new example and description regarding per-parameter `weight_decay` distinction for bias and non-bias terms.
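The kind of per-parameter distinction being documented, sketched with an arbitrary module and hyperparameters:

```python
import torch

model = torch.nn.Linear(10, 10)
# Apply weight decay to the weight matrix but not to the bias term.
optimizer = torch.optim.SGD(
    [
        {"params": [model.weight], "weight_decay": 1e-4},
        {"params": [model.bias], "weight_decay": 0.0},
    ],
    lr=0.1,
)
```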

Fixes #115935

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117436
Approved by: https://github.com/janeyx99
2024-01-22 21:26:02 +00:00
b14d57ceda Replace constraints with dynamic_shapes in scripts/sijiac/prototypes and test/inductor (#117915)
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `scripts/sijiac/prototypes` and `test/inductor`.

Test Plan: buck test mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor

Differential Revision: D52931743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117915
Approved by: https://github.com/angelayi
2024-01-22 21:24:03 +00:00
95a6866220 Migrate fused optim load_state_dict to OptimizerInfo (#117890)
The new tests look like:

```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (29f899ef)]$ python test/test_optim.py -v -k test_cpu_load_state_dict
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
test_cpu_load_state_dict_impl_capturable_AdamW_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_capturable_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_capturable_SGD_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_fused_AdamW_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_fused_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_fused_SGD_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_capturable_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_cpu_load_state_dict_impl_capturable_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_cpu_load_state_dict_impl_capturable_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... skipped 'SGD does not currently support capturable'
test_cpu_load_state_dict_impl_fused_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_cpu_load_state_dict_impl_fused_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_cpu_load_state_dict_impl_fused_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 12 tests in 12.865s

OK (skipped=6)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117890
Approved by: https://github.com/albanD
2024-01-22 21:14:38 +00:00
9a2c8f644b Mark DynamicShapesExportTests::test_retracibility_dynamic_shapes as slow (#117896)
Mark `dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_retracibility_dynamic_shapes` explicitly as slow

I cannot figure out what the correct way to do this is

Tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117896
Approved by: https://github.com/huydhn
2024-01-22 21:12:03 +00:00
903e1913ff Rename unbacked SymInt prefix to u (#117859)
Currently, it conflicts with Inductor's naming convention for index
variables

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117859
Approved by: https://github.com/lezcano, https://github.com/jansel, https://github.com/avikchaudhuri
2024-01-22 20:53:47 +00:00
0f6bbb1c07 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-22 20:14:51 +00:00
c170fbd309 [dtensor] refactor redistribute and fix uneven sharding redistribution (#115525)
This PR:
- refactors the redistribute implementation logic to make it more
sound, by figuring out the transform information first and then applying
the transformations step by step; we also cache the decisions so that they
can be reused
- for uneven sharding, refactors the uneven sharding logic and uses a logical
  shape concept for each transform information to fix the uneven sharding
  multi-mesh redistribute bug

fixes https://github.com/pytorch/pytorch/issues/115310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115525
Approved by: https://github.com/XilunWu
2024-01-22 18:57:44 +00:00
2bb2cc0b71 [tp] add clarification to doc and improve TP examples (#117618)
This PR adds a clarification about the evenly-sharded assumption in the main
TP doc and improves the TP examples by adding device mesh constructions

fixes https://github.com/pytorch/pytorch/issues/100044

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117618
Approved by: https://github.com/wconstab, https://github.com/awgu
2024-01-22 18:56:50 +00:00
01abb5af21 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10, https://github.com/malfet
2024-01-22 18:33:41 +00:00
56ef5afdee [dynamo] Add more dynamo call_methods and getattr support or Placement (#117733)
Summary:
Explained by title.
This fix is part of: https://github.com/pytorch/pytorch/issues/117670

Test Plan:
Unit tetst and CI
- Unit test: `buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:dtensor_compile -- test_placement_compile`

Differential Revision: D52863073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117733
Approved by: https://github.com/yanboliang
2024-01-22 18:22:54 +00:00
suo
f612e96180 [export] set proper fqn in lift constant tensor pass (#115222)
See comments: previously we were populating the lifted constant in the buffer list without an FQN, which messed up unflattening.

Differential Revision: [D50568062](https://our.internmc.facebook.com/intern/diff/D50568062/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115222
Approved by: https://github.com/tugsbayasgalan
2024-01-22 18:13:49 +00:00
80cf0ce153 Enhance torch.vmap support from inside torch.compile (#116050)
This work rewrites vmap support in torch.compile by inlining most of
the frames into the existing FX graph. It also unlocks PyTorch support for
features that were previously missing, such as keyword args.
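A small sketch of what this enables (backend and function are arbitrary; the keyword argument to the vmapped function is the previously missing piece):

```python
import torch

def pointwise(x, scale=1.0):
    return torch.sin(x) * scale

@torch.compile(backend="eager")
def f(x):
    # vmap is traced by inlining its frames into the FX graph,
    # including the keyword argument passed to the mapped function.
    return torch.vmap(pointwise)(x, scale=2.0)

out = f(torch.randn(4, 3))
```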

Fixes: https://github.com/pytorch/pytorch/issues/114306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116050
Approved by: https://github.com/zou3519
2024-01-22 17:53:45 +00:00
b2a3d6ba0d [exportdb] Remove torch/fb/exportdb (#117866)
Summary: This has already been moved to torch/_export/db

Test Plan: no tests? I think?

Reviewed By: avikchaudhuri

Differential Revision: D52875607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117866
Approved by: https://github.com/ydwu4
2024-01-22 17:41:33 +00:00
a359afbc3f Make and/or on uint8 tensors properly return 0x00 or 0x01 (#117827)
Fixes https://github.com/pytorch/pytorch/issues/117215

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117827
Approved by: https://github.com/albanD
2024-01-22 17:30:22 +00:00
c6c54df81b Fix incorrect type hints of Module.to (#117937)
Fixes #117936

While #113647 fixed the issue that `device` did not accept strings, it did not get the type hints fully correct. This PR removes the `str` variants from the type hints for the `dtype` parameter(s) in all overloads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117937
Approved by: https://github.com/albanD
2024-01-22 16:47:30 +00:00
60519fa3b7 change master to main in datapipes readme (#117919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117919
Approved by: https://github.com/albanD
2024-01-22 16:29:41 +00:00
86b4b27e26 [docs] start a new FSDP notes doc (#117323)
As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion.

I hope I did the RST right, I haven't done RST in a while.

- The first section is Andrew's words verbatim + formatting
- The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better.

tagging @albanD as requested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323
Approved by: https://github.com/awgu
2024-01-22 15:46:35 +00:00
8dc421a6b4 Revert "accelerate binary_cross_entropy_with_logits by using log_sigmoid operator (#115539)"
This reverts commit 03b12e56c758431df6f95075ce3a0113ccaeb3f9.

Reverted https://github.com/pytorch/pytorch/pull/115539 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/115539#issuecomment-1904157729))
2024-01-22 14:48:35 +00:00
cyy
c3780010a5 Remove calls of c10::guts::void_t (#117942)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117942
Approved by: https://github.com/Skylion007
2024-01-22 06:12:37 +00:00
3580e5d407 [executorch hash update] update the pinned executorch hash (#117953)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117953
Approved by: https://github.com/pytorchbot
2024-01-22 04:34:44 +00:00
cyy
39df084001 [Clang-tidy header][16/N] Enable clang-tidy on headers in torch/csrc/autograd (#117821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117821
Approved by: https://github.com/Skylion007
2024-01-22 00:52:56 +00:00
cyy
3baade4425 Remove calls of c10::guts::conjunction,c10::guts::disjunction,c10::guts::negation (#117926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117926
Approved by: https://github.com/Skylion007
2024-01-22 00:35:42 +00:00
02209b5880 Revert "[docs] start a new FSDP notes doc (#117323)"
This reverts commit 7f474da6bcc735cde5ef1417dc28472769307f5d.

Reverted https://github.com/pytorch/pytorch/pull/117323 on behalf of https://github.com/awgu due to broke docs ([comment](https://github.com/pytorch/pytorch/pull/117323#issuecomment-1902740900))
2024-01-21 19:47:27 +00:00
suo
c393b2f1ee [export] require Module to be passed to export (#117528)
This PR changes torch.export to require an nn.Module as input, rather than taking an arbitrary callable.

The rationale for this is that we have several invariants on the ExportedProgram that are ambiguous if the top-level object being traced is a function:
1. We "guarantee" that every call_function node has an `nn_module_stack` populated.
2. We offer ways to access the state_dict/parameters/buffers of the exported program.

We'd like torch.export to offer strong invariants—the value proposition of export is that you can trade flexibility for stronger guarantees about your model.

An alternative design would be to implicitly convert the top-level function into a module, rather than require that the user provide a module. I think that's reasonable (it's what we did in TorchScript), but in the spirit of being explicit (another design tenet of export) I avoid that here.
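Under the new requirement, a plain function can still be exported by wrapping it in a module explicitly (a minimal sketch):

```python
import torch

def fn(x):
    return x.relu() + 1

class Wrapper(torch.nn.Module):
    def forward(self, x):
        return fn(x)

# torch.export.export now expects an nn.Module as its first argument.
ep = torch.export.export(Wrapper(), (torch.randn(3),))
```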

Differential Revision: [D52789321](https://our.internmc.facebook.com/intern/diff/D52789321/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117528
Approved by: https://github.com/thiagocrepaldi, https://github.com/zhxchen17, https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
2024-01-21 19:36:13 +00:00
3ee092f75b VSX: Fix overflow in complex division (#116972)
For large complex values the division produces inf or NaN values, which leads other functions to produce them too,
e.g. `torch._refs.sgn` used in a test.
Example:
```
$ python -c 'import torch; print(torch._refs.sgn(torch.complex(torch.tensor([-501]*16, dtype=torch.float32), torch.tensor([-1e20]*16, dtype=torch.float32))))'
tensor([-0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj])

$ python -c 'import torch; t = torch.complex(torch.tensor([-501]*16, dtype=torch.float32), torch.tensor([-1e20]*16, dtype=torch.float32)); print(t / t.abs())'
tensor([-0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj])
```
Implement the same algorithm as used in numpy and x86 (#93277)

The reason is that for a tensor with a component of `1e20`, the abs-squared value used in the division contains a term `1e20 * 1e20`, which overflows the dynamic range of float32 (3e38) and yields an "inf", so the division yields "nan".
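A scalar Python sketch of the scaled division being adopted (the actual change is in the vectorized VSX C++ kernel); it avoids forming the squared magnitude directly:

```python
def safe_complex_div(ar, ai, br, bi):
    """(ar + ai*j) / (br + bi*j) without computing br*br + bi*bi."""
    if abs(br) >= abs(bi):
        r = bi / br            # |r| <= 1, so nothing below overflows
        d = br + bi * r
        return (ar + ai * r) / d, (ai - ar * r) / d
    else:
        r = br / bi
        d = br * r + bi
        return (ar * r + ai) / d, (ai * r - ar) / d

# (-501 - 1e20j) / 1e20: the naive float32 form overflows to inf/nan,
# the scaled form stays finite.
print(safe_complex_div(-501.0, -1e20, 1e20, 0.0))  # (-5.01e-18, -1.0)
```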

Output after change:
```
$ python -c 'import torch; t = torch.complex(torch.tensor([-501]*16, dtype=torch.float32), torch.tensor([-1e20]*16, dtype=torch.float32)); print(torch._refs.sgn(t), t.sgn(), t / t.abs())'
tensor([-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j]) tensor([-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j]) tensor([-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j])
```

CC @quickwritereader who wrote the initial code and @VitalyFedyunin who was involved in the initial review and @lezcano who reviewed #93277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116972
Approved by: https://github.com/lezcano
2024-01-21 19:21:13 +00:00
afabed6ae6 [inductor][custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298)
fixes #116715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117298
Approved by: https://github.com/eellison
2024-01-21 18:47:01 +00:00
41556324a9 [cpp_wrapper] Change CppWrapperCodeCache to use faster python binding (#117693)
Summary: Using faster binding following https://github.com/pytorch/pytorch/pull/117500. torch.utils.cpp_extension.load_inline builds a lot of things and is very slow. With this change, later we can further reduce the included header files using the ABI-compatible mode and thus further speed up the compilation.

Result:
```
python test/inductor/test_cuda_cpp_wrapper.py -k test_relu_cuda_cuda_wrapper

Before: Ran 1 test in 32.843s
After: Ran 1 test in 26.229s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117693
Approved by: https://github.com/jansel
2024-01-21 16:07:52 +00:00
7f474da6bc [docs] start a new FSDP notes doc (#117323)
As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion.

I hope I did the RST right, I haven't done RST in a while.

- The first section is Andrew's words verbatim + formatting
- The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better.

tagging @albanD as requested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323
Approved by: https://github.com/albanD, https://github.com/awgu
2024-01-21 15:11:24 +00:00
b50ccad86e [BE]: Add type alias typing annotation to prims_common (#117928)
Explicitly mark unions assignments as type aliases to make it easier for static type checkers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117928
Approved by: https://github.com/ezyang
2024-01-21 14:26:59 +00:00
df4e3d9d08 Document OpsHandler protocol (#117790)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117790
Approved by: https://github.com/jansel
2024-01-21 07:20:53 +00:00
eqy
8f7caaee67 [cuDNN] Fix cuDNN version parsing against future versions of cuDNN (#117908)
Remove the unnecessary dependence on assuming a fixed number of digits per field

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117908
Approved by: https://github.com/cpuhrsch
2024-01-21 05:00:01 +00:00
fbd1d567ed [inductor] Fix CPP wrapper codegen for ExternKernel args (#117931)
Summary: We see IR nodes `repr`-ed directly in the CPP wrapper codegen. Recently, this issue has been fixed for the Python wrapper codegen in D52899373 (https://github.com/pytorch/pytorch/pull/117838). Here we extend the fix to CPP wrapper codegen / AOTInductor.

Test Plan:
New unit tests. In OSS:

```
python test/inductor/test_aot_inductor.py -k test_triton_kernel_multi_output_arg
```

```
python test/inductor/test_aot_inductor.py -k test_triton_kernel_extern_kernel_arg
```

Differential Revision: D52936248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117931
Approved by: https://github.com/oulgen, https://github.com/chenyang78, https://github.com/desertfire
2024-01-21 04:58:56 +00:00
fa1e89b337 Ban mutation on dropout outputs in export (#117879)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117879
Approved by: https://github.com/ezyang
ghstack dependencies: #117811
2024-01-21 04:53:40 +00:00
949a76a7f0 [executorch hash update] update the pinned executorch hash (#117899)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117899
Approved by: https://github.com/pytorchbot
2024-01-21 04:19:27 +00:00
suo
2ae66ddba0 [export] fix test ownership (#117886)
as title

Differential Revision: [D52924188](https://our.internmc.facebook.com/intern/diff/D52924188/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117886
Approved by: https://github.com/ydwu4
2024-01-21 01:18:16 +00:00
bad5e1e0bb [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op hardswish (#117489)
**Summary**
Enable the fusion pattern `QConv2d -> hardswish`, lowering `hardswish` as a `QConv2d` post operator.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardswish_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_hardswish
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117489
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #117487, #117488
2024-01-21 00:01:32 +00:00
05ef2030ea [c10d] Add logs for NCCL Comm Abort call (#117868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117868
Approved by: https://github.com/kwen2501
2024-01-20 21:34:13 +00:00
2de3474711 Simplify kwargs propagation in __call__. (#117880)
In case no keyword arguments are passed, `**kwargs` would expand just fine without the need for extra overhead of `or {}`. In addition to reducing boilerplate, this also comes with a small perf improvement:
```
In [1]: def null(*args, **kwargs):
   ...:     pass
   ...:

In [2]: def call1(*args, **kwargs):
   ...:     return null(*args, **(kwargs or {}))
   ...:

In [3]: def call2(*args, **kwargs):
   ...:     return null(*args, **kwargs)
   ...:

In [4]: %timeit call1()
145 ns ± 2.07 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: %timeit call2()
118 ns ± 2.14 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [6]: %timeit call1()
147 ns ± 6.19 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [7]: %timeit call2()
117 ns ± 0.846 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117880
Approved by: https://github.com/Skylion007
2024-01-20 19:29:35 +00:00
50633620b2 sympy.Symbol is a subclass of sympy.Expr (#117857)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117857
Approved by: https://github.com/peterbell10
2024-01-20 18:09:44 +00:00
af831415a8 fix cpp backend relu codegen with inf input (#117622)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/117544.
For the CPP backend, `ReLU` currently codegens to `f"{x} * ({x}>0)"` in `CppOverrides`. The result mismatches eager when the input has `inf`, since `inf * 0` results in `nan` based on [IEEE_754](https://en.wikipedia.org/wiki/IEEE_754). Change the codegen to `f"std::max({x}, decltype({x})(0))"` to align with the eager implementation as in 1deb75b584/aten/src/ATen/native/cpu/TensorCompareKernel.cpp (L392)
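A quick demonstration of the mismatch (eager relu stands in for the expected semantics):

```python
import torch

x = torch.tensor([float("-inf"), float("inf"), 1.0])
print(x * (x > 0))            # tensor([nan, inf, 1.])  old codegen: -inf * 0 -> nan
print(torch.clamp_min(x, 0))  # tensor([0., inf, 1.])   max(x, 0), matching eager relu
print(torch.relu(x))          # tensor([0., inf, 1.])
```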

**TestPlan**
```
python -u -m pytest test_cpu_repro.py -k test_relu_with_inf_value
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117622
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-01-20 13:28:03 +00:00
4bf481fb1b Fix inductor pattern match error for qlinear with bmm (#117633)
Summary:

PR https://github.com/pytorch/pytorch/pull/116599 converts `bmm` to `qlinear` when the input dim exceeds 2 and the input is not contiguous. However, there is an error when checking the weight size because the permute op is not considered.

Test Plan:
python test_mkldnn_pattern_matcher.py -k test_qlinear_input_dim_exceeds_2_and_not_contiguous

Fixes: -

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117633
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-01-20 12:26:26 +00:00
0ae952db76 enable mkldnn bf32 matmul (#116015)
### Testing
FP32 matmul vs. mkldnn BF32 matmul  on SPR

single core:

Input | BF32   / ms | FP32  /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 32.842 | 38.279 | 1.165
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 38.590 | 73.967 | 1.917
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 18456.267 | 74588.002 | 4.041

56 cores:
Input | BF32   / ms | FP32 /   ms | Speed up
-- | -- | -- | --
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 1199.400 | 1715.548 | 1.430
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True |1129.204 | 1708.912 |  1.513
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 3655.915  | 7992.877 | 2.186
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 3707.993 |  8026.191 | 2.165
Batch: 768, M: 128, N: 64, K: 128  | 1296.419 | 1308.411 | 1.009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116015
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-20 09:30:23 +00:00
aaae2d8bb6 Add compilable and capturable foreach adamax with tests (#117835)
Based off of https://github.com/pytorch/pytorch/pull/110345

Fixes https://github.com/pytorch/pytorch/issues/117812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117835
Approved by: https://github.com/janeyx99
2024-01-20 05:29:05 +00:00
suo
e732adf0a7 [pytree] add access api (#117771)
This PR introduces an API to use KeyPaths to actually access values on pytrees.

Differential Revision: [D52881260](https://our.internmc.facebook.com/intern/diff/D52881260/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117771
Approved by: https://github.com/zou3519, https://github.com/XuehaiPan
2024-01-20 04:03:26 +00:00
a1b3b5748f [Pytorch][Vulkan] Create context for conv1d (#117780)
Summary:
`conv1d` has two arguments `weight` and `bias` which are stored as constant tensors on the CPU and transferred to the GPU at every inference call. We create a context for this operator to avoid repeatedly passing them. Specifically, we
- created `Conv1dPackedContext`, `create_conv1d_context` and `run_conv1d_context` in `Convolution.h` and `Convolution.cpp`
- registered them in `Register.cpp`
- rewrote the graph representation of the op in `vulkan_rewrite.cpp`

Test Plan:
## Numerical test
```
[luwei@82308.od /data/sandcastle/boxes/fbsource (8a8d911dc)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*conv1d*"
Buck UI: https://www.internalfb.com/buck2/7760800b-fd75-479a-9368-be5fcd5a7fef
Network: Up: 0B  Down: 0B
Jobs completed: 4. Time elapsed: 0.6s.
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.conv1d_simple
[       OK ] VulkanAPITest.conv1d_simple (159 ms)
[ RUN      ] VulkanAPITest.conv1d
[       OK ] VulkanAPITest.conv1d (57 ms)
[----------] 2 tests from VulkanAPITest (217 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (217 ms total)
[  PASSED  ] 2 tests.
```

Full test result in P1053644934, summary as below
```
[----------] 419 tests from VulkanAPITest (28080 ms total)
[----------] Global test environment tear-down
[==========] 419 tests from 1 test suite ran. (28080 ms total)
[  PASSED  ] 418 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```
## Graph representation comparison
We created a model using `conv1d` and traced it as below
```
# Define a simple model that uses conv1d
class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1d = nn.Conv1d(16, 33, 3)

    def forward(self, x):
        return self.conv1d(x)

# Create an instance of the model
model = MyModel()

# Create a dummy input tensor for tracing
input_tensor = torch.randn(20, 16, 50)

# Use torch.jit.trace to trace the model and generate a graph
traced_model = torch.jit.trace(model, input_tensor)
```
Then we converted the traced model to Vulkan backend using `optimize_for_mobile`
```
from torch.utils import mobile_optimizer

vulkan_model = mobile_optimizer.optimize_for_mobile(
    traced_model, backend="vulkan", preserved_methods=to_preserve
)
```
Next we can print the graph of the `vulkan_model` as `print(vk_model.graph)`
- before this diff: `conv1d` was used
```
graph(%self.1 : __torch__.___torch_mangle_16.MyModel,
      %x : Tensor):
  %60 : Device = prim::Constant[value="cpu"]()
  %self.conv1d.bias : Float(33, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
  %37 : bool = prim::Constant[value=0]()
  %36 : NoneType = prim::Constant()
  %59 : Device = prim::Constant[value="vulkan"]()
  %self.conv1d.weight : Float(33, 16, 3, strides=[48, 3, 1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
  %7 : int = prim::Constant[value=1](), scope: __module.conv1d # /mnt/xarfuse/uid-23453/243f3953-seed-nspid4026532834_cgpid7972545-ns-4026532831/torch/nn/modules/conv.py:306:0
  %18 : int[] = prim::Constant[value=[1]]()
  %19 : int[] = prim::Constant[value=[0]]()
  %39 : Tensor = aten::to(%x, %59, %36, %37, %37)
  %20 : Tensor = aten::conv1d(%39, %self.conv1d.weight, %self.conv1d.bias, %18, %19, %18, %7)
  %58 : Tensor = aten::to(%20, %60, %36, %37, %37)
  return (%58)
```
- after this diff: `conv1d` was replaced with `run_conv1d_context`
```
graph(%self.1 : __torch__.___torch_mangle_6.MyModel,
      %x : Tensor):
  %85 : Device = prim::Constant[value="cpu"]()
  %51 : bool = prim::Constant[value=0]()
  %50 : NoneType = prim::Constant()
  %84 : Device = prim::Constant[value="vulkan"]()
  %53 : Tensor = aten::to(%x, %84, %50, %51, %51)
  %prepack_folding_forward._jit_pass_packed_weight_0 : __torch__.torch.classes.vulkan.Conv1dPackedContext = prim::GetAttr[name="prepack_folding_forward._jit_pass_packed_weight_0"](%self.1)
  %22 : Tensor = vulkan_prepack::run_conv1d_context(%53, %prepack_folding_forward._jit_pass_packed_weight_0)
  %83 : Tensor = aten::to(%22, %85, %50, %51, %51)
  return (%83)
```

Differential Revision: D52865379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117780
Approved by: https://github.com/yipjustin
2024-01-20 02:35:32 +00:00
10923f8720 Revert "[inductor][custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298)"
This reverts commit 1967394690f144a7ba1717eccec977286cafe2da.

Reverted https://github.com/pytorch/pytorch/pull/117298 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it is failing in MacOS 1967394690, may be due to a landrace ([comment](https://github.com/pytorch/pytorch/pull/117298#issuecomment-1901594120))
2024-01-20 02:14:58 +00:00
94f0472579 [Quant] [PT2] Add Hardswish into X86InductorQuantizer Conv2d Unary Annotation (#117488)
**Summary**
Add `hardswish`  into X86InductorQuantizer Conv2d Unary Annotation

**TestPlan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_unary
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_unary
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117488
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #117487
2024-01-20 01:37:33 +00:00
1967394690 [inductor][custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298)
fixes #116715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117298
Approved by: https://github.com/eellison
2024-01-20 01:37:28 +00:00
181e6dafd0 [MPS] Fix linear for 5D tensors (#117837)
torch.nn.Linear crashes with an internal assert if invoked with 5D tensors,
due to a bug in the MPS framework, i.e. invoking
```swift
import MetalPerformanceShadersGraph

let graph = MPSGraph()
let x = graph.constant(1, shape: [2, 1, 2, 1, 2], dataType: .float32)
let y = graph.constant(1, shape: [2, 3], dataType: .float32)
let z = graph.matrixMultiplication(primary: x, secondary: y, name: nil)
let device = MTLCreateSystemDefaultDevice()!
let buf = device.makeBuffer(length: 48)!
let td = MPSGraphTensorData(buf, shape: [2, 1, 2, 1, 3], dataType: .int32)
let cmdBuf = MPSCommandBuffer(from: device.makeCommandQueue()!)
graph.encode(to: cmdBuf, feeds: [:], targetOperations: nil, resultsDictionary: [z:td], executionDescriptor: nil)
cmdBuf.commit()
```
crashes with
```
AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayIdentity.mm:813: failed assertion `New volume: 4 should match old volume: 8 [reshapeWithCommandBuffer] MPSNDArrayIdentity.'
zsh: abort      ./build/matmul
```

Work around the issue by flattening the forward and backward tensors if the number of dimensions is greater than 4

Add regression tests to Linear opinfo samples
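For reference, a minimal Python-side repro of the failing case (assumes an MPS-capable Apple Silicon machine); with the fix the >4-D input is flattened before the MPS matmul and reshaped back:

```python
import torch

linear = torch.nn.Linear(2, 3).to("mps")
x = torch.randn(2, 1, 2, 1, 2, device="mps")
print(linear(x).shape)  # torch.Size([2, 1, 2, 1, 3])
```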

Fixes https://github.com/pytorch/pytorch/issues/114942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117837
Approved by: https://github.com/janeyx99
2024-01-20 01:19:19 +00:00
d4cc1c5bff Add new pattern matchers for SDPA (#113004)
Add two new pattern matchers to enable SDPA in more models.

- Pattern 14: `BertLarge`
- Pattern 15: `DistilBert`

Perf on SPR:

![Perf on SPR](https://github.com/pytorch/pytorch/assets/23010269/f0813343-c9e8-4fd4-9fa0-d0e67e1d57af)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113004
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
2024-01-20 00:46:46 +00:00
8f91a53e9a Add environment for close-nonexistent-disable-issues (#117885)
Made a new environment called rockset-read-only that has a read-only API key for Rockset
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117885
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-01-19 23:45:46 +00:00
3c1498d117 [ONNX] Add bfloat16 support for scaled_dot_product_attention (#117878)
Using ONNX opset 14, the aten scaled_dot_product_attention operator can be implemented with bfloat16 support because Add-14 supports bfloat16.

This PR simply adds bfloat16 to the list of supported types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117878
Approved by: https://github.com/BowenBao
2024-01-19 23:24:44 +00:00
f684e44fd6 Revert "Reduce pytest prints (#117069)"
This reverts commit 40dbd567e04483c671f9c897171bf9d1e7162b68.

Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to need to handle timeout expired better ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1901270953))
2024-01-19 23:07:51 +00:00
5538b37a06 [ez] Provide a slightly better error message if process times out (#117865)
Just a slightly clearer error message
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117865
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-19 22:58:00 +00:00
29f899ef87 [pytorch][vulkan] cumsum dim <= 1 (#117580)
Summary:
Following the implementation of Softmax, striding over the texture differently based on the desired dimension.

Softmax performs a similar operation as cumsum (generally called "scan") iterating over all items in a dimension, but cumsum only needs to iterate once to collate the sum, compared to softmax which needs to iterate multiple times to collect the max and denominator for the final calculation.

Similar to the softmax implementation, there are likely opportunities to optimize, but this gets all dims < 4 functional first.

Test Plan:
`LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*cumsum*"`:
```
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *cumsum*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.cumsum_1d
[       OK ] VulkanAPITest.cumsum_1d (93 ms)
[ RUN      ] VulkanAPITest.cumsum_2d
[       OK ] VulkanAPITest.cumsum_2d (74 ms)
[ RUN      ] VulkanAPITest.cumsum_3d
[       OK ] VulkanAPITest.cumsum_3d (105 ms)
[ RUN      ] VulkanAPITest.cumsum_4d
[       OK ] VulkanAPITest.cumsum_4d (73 ms)
[----------] 4 tests from VulkanAPITest (346 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (346 ms total)
[  PASSED  ] 4 tests.
```

Differential Revision: D52814000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117580
Approved by: https://github.com/yipjustin
2024-01-19 21:52:48 +00:00
dd6c0f6844 Trim Dynamo shards 7->3 (#117869)
We added all of the tests we wanted for now. These fit comfortably in 3
shards (the total test time previously was 0.5 hours on each shard).
Going to decrease the number of shards to 3 so that it's less unwieldy
to work with.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117869
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-01-19 21:48:35 +00:00
365c7a292f Log stack trace of mutated idx (#117720)
Log stack trace of mutated tensor that prevents cudagraphs. Will do some subsequent refactors when all of the checks are moved to this fashion.

Differential Revision: [D52896588](https://our.internmc.facebook.com/intern/diff/D52896588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117720
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117823
2024-01-19 21:38:44 +00:00
6c99bf0766 move disable_cudagraph_reason disabling after codecache is accessed (#117823)
Disabling cudagraphs has to happen after the codecache is loaded, or it won't properly be disabled on a cache hit.

Differential Revision: [D52896590](https://our.internmc.facebook.com/intern/diff/D52896590)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117823
Approved by: https://github.com/bdhirsh, https://github.com/masnesral
2024-01-19 21:33:25 +00:00
c4eab49ded [MacOS] Embed libomp.dylib/omp.h into MacOS wheel (#114816)
To keep them on par with what we do on x86
And `omp.h` as it is needed for `torch.compile` on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114816
Approved by: https://github.com/atalman
2024-01-19 21:21:33 +00:00
414a1fd29f [PyTorch] Add IValue::IValue(std::vector<T>&&) ctors (#117769)
There are two IValue constructors that take `const std::vector<T>&`. Add moving variants to allow callers to save on reference counting.

Differential Revision: [D52879065](https://our.internmc.facebook.com/intern/diff/D52879065/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117769
Approved by: https://github.com/suo, https://github.com/Skylion007
2024-01-19 21:11:11 +00:00
d45fd68012 OIDC for update_pytorch_labels (#117876)
Companion: https://github.com/pytorch-labs/pytorch-gha-infra/pull/339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117876
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-01-19 21:08:28 +00:00
ad3d41692e [PyTorch] return decltype(auto) from getItem (#117569)
This allows getItem to take advantage of the nicer (sometimes-const-reference) return type from `List::get() const` added in the previous diff.

Differential Revision: [D52809097](https://our.internmc.facebook.com/intern/diff/D52809097/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117569
Approved by: https://github.com/iseeyuan, https://github.com/malfet
ghstack dependencies: #117568
2024-01-19 21:04:53 +00:00
632fcc4831 [PyTorch] Make List::get() const match List::operator[]() const (#117568)
As far as I can tell, `get()` is supposed (and documented) to be the same as a const `operator[]`. We have an efficient implementation for `operator[]`. Let's use it for `get()`.

Differential Revision: [D52809098](https://our.internmc.facebook.com/intern/diff/D52809098/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117568
Approved by: https://github.com/suo, https://github.com/malfet
2024-01-19 21:04:53 +00:00
15d568d621 [Inductor] Use codegen reference for buffer to string (#117838)
Summary: The added test case ends up emitting an inductor IR as the buffer string, lets properly emit the buffer name instead.

Test Plan: added new test

Differential Revision: D52899373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117838
Approved by: https://github.com/aakhundov
2024-01-19 20:18:53 +00:00
1f5c27eb18 cleanup code comments _compute_numerical_gradient (#117484)
cleanup code comments for ` _compute_numerical_gradient`:
- reference parameters passed
- indicate that central difference approximation is used (see the sketch below)
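For context, a small numeric sketch of the central difference approximation that `_compute_numerical_gradient` relies on (plain Python, not the gradcheck internals):

```python
import torch

def central_diff(f, x, eps=1e-6):
    # f'(x) ~= (f(x + eps) - f(x - eps)) / (2 * eps)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = torch.tensor(0.5)
print(central_diff(torch.sin, x))  # close to cos(0.5)
print(torch.cos(x))
```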
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117484
Approved by: https://github.com/soulitzer
2024-01-19 18:51:52 +00:00
ab216bbaeb cleanup code comments analytical Jacobian as vjp projection (#117483)
Cleanup code comments for `_compute_analytical_jacobian_rows` to make clear Jacobian is computed by standard basis vector projections using the vector-Jacobian-product operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117483
Approved by: https://github.com/soulitzer
2024-01-19 18:50:26 +00:00
40dbd567e0 Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-19 18:42:12 +00:00
2f4456a73e Remove xfail on test_make_weak_keyed_dict_from_weak_keyed_dict (#117848)
Based on the logs, this test has been consistently passing, so we remove
the xfail.

Fixes https://github.com/pytorch/pytorch/issues/116765
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117848
Approved by: https://github.com/Skylion007
ghstack dependencies: #117765
2024-01-19 18:05:30 +00:00
b637fdc8b3 Revert "additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)"
This reverts commit 74e13624998f2a4de29bce73a949d7f0339ec04e.

Reverted https://github.com/pytorch/pytorch/pull/115214 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/115214#issuecomment-1900815152))
2024-01-19 17:35:04 +00:00
f316c35a34 [export] Support preserving submodule callling convention in non-strict export (#117796)
Summary: Title

Test Plan: CI

Reviewed By: zhxchen17

Differential Revision: D52889236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117796
Approved by: https://github.com/angelayi
2024-01-19 17:16:45 +00:00
249a226113 [export] Error on not pytree-flattened nodes (#117598)
Attempts to make the input/output mismatch error better by first checking whether the inputs/outputs can be pytree-flattened into supported types (tensors, symints, ...). So if the user passes in a data structure which does not have a pytree flatten registration, this will error with a message along the lines of "It looks like one of the inputs with type CustomType is not supported or pytree flatten-able... please register a pytree flatten/unflatten function using the pytree.register_pytree_node API".

The check inside of produce_matching should now only error if something unexpected happens (dynamo accidentally adds an input or removes an output), and should be considered an internal error.
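As an illustration, a hedged sketch of registering a custom container so export can flatten it. The helper name comes from the error message above; its exact signature (and whether top-level custom containers need extra serialization hooks) is an assumption here and may differ across versions:

```python
from dataclasses import dataclass

import torch
import torch.utils._pytree as pytree

# Hypothetical user container that export cannot flatten by default.
@dataclass
class Pair:
    a: torch.Tensor
    b: torch.Tensor

# Assumed signature: (cls, flatten_fn -> (children, context), unflatten_fn).
pytree.register_pytree_node(
    Pair,
    lambda p: ([p.a, p.b], None),
    lambda values, context: Pair(*values),
)

class M(torch.nn.Module):
    def forward(self, pair):
        return pair.a + pair.b

ep = torch.export.export(M(), (Pair(torch.ones(2), torch.ones(2)),))
print(ep)
```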

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117598
Approved by: https://github.com/avikchaudhuri, https://github.com/BowenBao
2024-01-19 17:13:39 +00:00
6c5c2121b1 Run some OOMing tests serially (#117759)
They were disabled due to being flaky due to OOMs but got renamed.  Seeing if running serially helps

I kind of want to keep this test disabled since the rest of the file is probably fine...

Issues in question: #113132 #113136 #113140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117759
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-19 16:45:35 +00:00
de25718300 [release] Docker Release build trigger on rc for testing (#117849)
Enable triggering the Docker Release builds on an RC. Use the test channel in this case. Hence the following logic is applied:
1. On RC trigger use test channel and upload to pytorch-test : https://github.com/orgs/pytorch/packages/container/package/pytorch-test
2. On Final RC use prod channel and upload to pytorch : https://github.com/orgs/pytorch/packages/container/package/pytorch
3. Nightly: https://github.com/orgs/pytorch/packages/container/package/pytorch-nightly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117849
Approved by: https://github.com/malfet
2024-01-19 15:01:46 +00:00
03b12e56c7 accelerate binary_cross_entropy_with_logits by using log_sigmoid operator (#115539)
When I was reimplementing BCEwithLogits, I found that `log_sigmoid` operator could accelerate the function.

Simple benchmark on AMD 3600 CPU Ubuntu 22.04:
|avg time (ms)|with `pos_weight`|no `pos_weight`|
|-|-|-|
|original|1986|1658|
|this PR|1295|995|

35-40% faster. This probably benefits from the `log_sigmoid` vectorization code.

A CUDA benchmark was not obtained, but I believe CUDA can also benefit from reducing kernel launches, as https://github.com/pytorch/pytorch/pull/11054#issuecomment-442233714 and https://github.com/pytorch/pytorch/pull/78267#issue-1248398454 mention.
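For reference, the identity being exploited, written as a small eager sketch (base case without `pos_weight`, not the actual ATen kernel):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8)   # logits
y = torch.rand(8)    # targets in [0, 1]

# BCE-with-logits via log_sigmoid:
#   -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
# = -[y * logsigmoid(x) + (1 - y) * logsigmoid(-x)]
via_logsigmoid = -(y * F.logsigmoid(x) + (1 - y) * F.logsigmoid(-x)).mean()

print(torch.allclose(via_logsigmoid, F.binary_cross_entropy_with_logits(x, y)))  # True
```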

The simple benchmark cpp file:
[demo.txt](https://github.com/pytorch/pytorch/files/13635355/demo.txt)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115539
Approved by: https://github.com/lezcano
2024-01-19 14:56:43 +00:00
98a044d33e [CI] Build M1 conda binaries on M1 runners (#117801)
As usual, almost no work on PyTorch side, all changes are on the builder end, namely:
- 8b67d32929 - depend on `blas * mkl` only on x86 machines
- eb78393f1e - install arm64 conda when running on Apple Silicon
- 0d3aea4ee0 - constrain llvmdev-9 to x86 machines only
- 6c6a33b271 - set correct DEVELOPER_DIR path

TODO:
 - We should auto-detect this `DEVELOPER_DIR` via `xcode-select`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117801
Approved by: https://github.com/atalman
2024-01-19 14:31:12 +00:00
17c5f69852 Run test_jit with PYTORCH_TEST_WITH_DYNAMO=1 in CI (#117765)
Gets rid of all the single test excludes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117765
Approved by: https://github.com/voznesenskym
2024-01-19 13:42:41 +00:00
f115f1cde1 [Quant] Enable QConv2d with hardswish post op (#117487)
**Summary**
Enable QConv2d implementation with post op `hardswish`

**Test Plan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_hardswish_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117487
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-01-19 13:24:06 +00:00
cyy
5756b7a08e Remove math_compat.h (#117828)
Follows #116167
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117828
Approved by: https://github.com/malfet
2024-01-19 12:56:17 +00:00
f2d6e99f8d Workaround a cusolver bug on CUDA < 12.1 in triangular_solve (#117636)
Fix https://github.com/pytorch/pytorch/issues/79191

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117636
Approved by: https://github.com/malfet
2024-01-19 12:42:37 +00:00
suo
4057d005ff Initial torchbind support in PT2 (#117697)
This PR adds the bare minimum functionality to get torchbind working in an e2e testable way on PT2.

It implements:
* ProxyTensor support
* Simple torch.export support (proxytensor-only path, e.g. non-strict).
* add some tests exercising the path.

Because all this is not fully baked, I hide the functionality behind a feature flag (`enable_torchbind_tracing()`) so it does not affect regular users for now.

Still on the agenda:
* Dynamo support
* Actual FakeMode support
* Mutability support

Hoping to get this first bit in as a standalone, as it will unblock some more extensive experimentation/testing going on internally.

Differential Revision: [D51825372](https://our.internmc.facebook.com/intern/diff/D51825372/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117697
Approved by: https://github.com/SherlockNoMad
2024-01-19 06:28:20 +00:00
c51a4e64c0 Add support for compiling SDPAParams (#117207)
Allows us to `allow_in_graph` this `torch._C` struct for supporting scaled dot product attention.
helps unblock https://github.com/pytorch/pytorch/pull/116071

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117207
Approved by: https://github.com/voznesenskym
2024-01-19 05:51:15 +00:00
8524fa566c [executorch hash update] update the pinned executorch hash (#117593)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117593
Approved by: https://github.com/pytorchbot
2024-01-19 04:34:12 +00:00
f302a0d380 Re-enable SGD (#117434)
Re-enables the SGD optimizer now that compile times are more reasonable. [Benchmark run](https://github.com/pytorch/pytorch/actions/runs/7511073761)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117434
Approved by: https://github.com/anijain2305, https://github.com/janeyx99
2024-01-19 04:28:50 +00:00
924ed91612 Move getDurationFromFirstEvent to USE_C10D_NCCL ifdef (#117738)
Fixes #117517

Move the NCCL-related function *getDurationFromFirstEvent* under the USE_C10D_NCCL ifdef (related to https://github.com/pytorch/pytorch/issues/114575).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117738
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-01-19 04:28:47 +00:00
cyy
38d9b3d937 Remove use of math_compat.h (#116167)
Because  ANDROID>=21 is assumed in CI tests, it is time to remove old workarounds. math_compat.h contains solely wrapper math functions for ANDROID, so we can remove its usage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116167
Approved by: https://github.com/ezyang
2024-01-19 03:37:55 +00:00
cyy
5c17f66a3d [Exception] [5/N] Remove torch::IndexError (#117713)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117713
Approved by: https://github.com/ezyang
2024-01-19 03:36:15 +00:00
3131e0460e Changed return type of randint64_cpu to int64_t to prevent codegen issues (#117443)

Fixes #117435.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117443
Approved by: https://github.com/ezyang
2024-01-19 03:23:20 +00:00
1adf77ce5e Don't use functional tensor inside _unstack_pytree (#117811)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117811
Approved by: https://github.com/ydwu4
2024-01-19 03:15:06 +00:00
c16e6e4cf7 [ProcessGroup] Make watchdog check work queue more frequently (#117297)
Today the watchdog's sleep interval is 1s. That's a bit long compared to a modern GPU link's (or network link's) speed.

Take DDP and Ampere for example:

DDP's bucket size = 25 MB
Ampere's NVLink speed = 250 GB/s

25 MB / 250 GB/s = 100 ms.
So we are updating the interval to 100 ms.

Update:
25 MB / 250 GB/s = 0.1 ms
But let's see how it goes before making the checking even more aggressive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117297
Approved by: https://github.com/fduwjj
2024-01-19 02:33:31 +00:00
aadbaf8e2d [EZ][BE] Move build_android_gradle.sh (#117795)
From `.circleci/scripts` to `scripts`, next to another `build_android.sh`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117795
Approved by: https://github.com/huydhn
2024-01-19 02:14:28 +00:00
d618e86328 [ONNX] Bump transformers in CI test (#117703)
Fixes #117660

(1) skip dynamic tests for exported program in `test_fx_to_onnx_onnxruntime.py`, as they are not expected to pass anyway.
(2) Move the dolly model to runtime, since it works in export but is blocked by non-persistent buffers as well.
(3) openai whisper has changed/regressed due to modeling modifications.
(4) Replace OpenLlama with Llama, because OpenLlama is deprecated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117703
Approved by: https://github.com/thiagocrepaldi
2024-01-19 02:10:10 +00:00
74e1362499 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10
2024-01-19 00:50:18 +00:00
c317bf2c2b [HigherOrderOp][BE] factor out merge_graph_inputs (#116912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116912
Approved by: https://github.com/zou3519
ghstack dependencies: #116721, #116823
2024-01-19 00:35:26 +00:00
c6028f8f73 [HigherOrderOp] Add while_loop support (#116823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116823
Approved by: https://github.com/zou3519
ghstack dependencies: #116721
2024-01-19 00:35:26 +00:00
113f0749f5 [HigherOrderOp] move some common utils in cond to utils.py (#116721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116721
Approved by: https://github.com/zou3519
2024-01-19 00:35:26 +00:00
77cfacab55 Revert "Reduce pytest prints (#117069)"
This reverts commit 2f89ef23007626aca1a577a4a388e315253c834f.

Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to distributed tests are not printing items ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1899433816))
2024-01-19 00:27:03 +00:00
a468b9fbdf Update xla.txt to fix missing commit (#117708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117708
Approved by: https://github.com/masnesral, https://github.com/huydhn
2024-01-18 23:51:51 +00:00
2f84a9d37c Revert "[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)"
This reverts commit 5aa92b5090e3db4a053548a3f360dd06c16df2f7.

Reverted https://github.com/pytorch/pytorch/pull/115663 on behalf of https://github.com/PaliC due to Unfortunately, this pr breaks cuda builds internally ([comment](https://github.com/pytorch/pytorch/pull/115663#issuecomment-1899388813))
2024-01-18 23:40:30 +00:00
2f89ef2300 Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-18 23:30:59 +00:00
e432b2e607 [inductor] multi-kernel support (#103469)
For a persistent reduction, we generate 2 flavors of 'equivalent' kernels at the same time
- persistent reduction
- regular reduction

A MultiKernel wraps these 2 kernels and picks the one with better performance at runtime.

Here I talk more about implementation details:
- Inductor maintains state for generating kernels, e.g. the wrapper code. After we generate code for one kernel, we need to restore the inductor state before we can generate the counterpart.

***There is one thing I need comments from others on***:
There is one tricky thing about kernel arguments. In general, inductor removes a buffer from the argument list if it's only used inside the kernel. But somehow a buffer removed by the persistent reduction kernel may still be kept by the regular (non-persistent) reduction kernel because of some CSE invalidation rule. My current implementation avoids removing buffers if multi_kernel is enabled. This makes sure both flavors of reduction have consistent argument lists. Another idea I have is to generate the multi-kernel definition with the union of arguments from both sub-kernels and let each sub-kernel pick the subset of arguments it wants. But this would make the code-gen for multi-kernel much more complex.

I'm not sure if there is some easy and clean way to resolve this.

Testing command:
```

TORCHINDUCTOR_MULTI_KERNEL=1 TORCH_LOGS=+torch._inductor.graph TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103469
Approved by: https://github.com/jansel
2024-01-18 23:16:31 +00:00
fee96adde7 [EZ] Update weekly.yml to use actions from test-infra (#117775)
It was deleted from `pytorch/pytorch` by https://github.com/pytorch/pytorch/pull/117506

Thanks [BowenBao](https://github.com/BowenBao) for alerting
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117775
Approved by: https://github.com/huydhn
2024-01-18 22:58:32 +00:00
6d9432c44c [ONNX][dynamo_export] Decomposition skips using custom operator (#117314)
A context manager that disables the decomposition of certain ops during dynamo tracing.

The approach is to temporarily hijack the operator callable with PT2 custom operator.
The custom operator will not be decomposed and will show up as a single node to be exported to ONNX.

For the time being the decomposition of these ops is otherwise unavoidable.

https://github.com/pytorch/pytorch/issues/116684
https://github.com/pytorch/pytorch/issues/115883

This solution will no longer be required once the issue is resolved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117314
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-01-18 22:19:28 +00:00
92d718aed1 [export] Add lifted constant obj to input (#116985)
Test Plan: wip

Differential Revision: D52556070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116985
Approved by: https://github.com/suo
2024-01-18 22:10:53 +00:00
eba5d5485d [dynamo] make ConstantSource propagate through built-in ops for TensorVariable (#117704)
Fixes #117685.

This PR only makes ConstantSource preserved for built-in ops when we find all the inputs are either constant tensors or python constants.

It doesn't fundamentally solve the problem of preserving ConstantSource information through all operators that could potentially be constant folded.

For the following code in the issue:
```
class Bob(torch.nn.Module):
    def __init__(self, p, val) -> None:
        super().__init__()
        self.p = p
        self.y = torch.nn.Parameter(torch.tensor(val))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # This only looks dynamic but it's actually a constant value
        if get_y(self.y) < self.p:
            return torch.cat([x,x])
        else:
            return x
```
The graph exported looks like following:
```python
class GraphModule(torch.nn.Module):
    def forward(self, x):
        arg0: "f32[s0, s1]";

        arg0, = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
        l_x_ = arg0

        # File: /home/yidi/local/pytorch/test/dynamo/test_export.py:1498 in forward, code: return torch.cat([x, x])
        cat = torch.cat([l_x_, l_x_]);  l_x_ = None
        return pytree.tree_unflatten([cat], self._out_spec)
```

Test Plan:
Added a new test for the given repro.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117704
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-01-18 20:18:34 +00:00
1462d72904 Speed up triu_tril_kernel (#115013)
1. Batch Processing: Enhance kernel efficiency by having each thread handle multiple elements, reducing the frequency of offset calculations.
2. Inplace Operation Optimization: For inplace functions, eliminate unnecessary copying to enhance performance.

Up to 5x speed up compared to torch 2.1.1

# Benchmark
Test on NVIDIA RTX 3080, WSL, CUDA 12.1. Peak performance is recorded.

  | function | dtype | shape | k | torch 2.1.1 | this PR | speed up
-- | -- | -- | -- | -- | -- | -- | --
various   dtype |   |   |   |   |   |  
  | triu_ | int8 | [1, 3072, 3072] | 0 | 0.107 | 0.028 | 3.76x
  | triu_ | float16 | [1, 3072, 3072] | 0 | 0.108 | 0.029 | 3.79x
  | triu_ | float32 | [1, 3072, 3072] | 0 | 0.114 | 0.045 | 2.52x
  | triu_ | float64 | [1, 3072, 3072] | 0 | 0.172 | 0.082 | 2.11x
  | triu | int8 | [1, 3072, 3072] | 0 | 0.111 | 0.056 | 2.00x
  | triu | float16 | [1, 3072, 3072] | 0 | 0.108 | 0.049 | 2.22x
  | triu | float32 | [1, 3072, 3072] | 0 | 0.116 | 0.091 | 1.27x
  | triu | float64 | [1, 3072, 3072] | 0 | 0.175 | 0.176 | 1.00x
various   shape |   |   |   |   |   |  
  | triu_ | float32 | [1, 8192, 8192] | 0 | 0.798 | 0.311 | 2.56x
  | triu_ | float32 | [4, 1024, 1024] | 0 | 0.054 | 0.023 | 2.37x
  | triu_ | float32 | [4, 1021, 1021] | 0 | 0.054 | 0.023 | 2.33x
  | triu_ | float32 | [256, 128, 256] | 0 | 0.111 | 0.038 | 2.92x
  | triu_ | float32 | [128, 257, 125] | 0 | 0.051 | 0.029 | 1.77x
  | triu_ | float32 | [20480, 16, 16] | 0 | 0.072 | 0.036 | 1.97x
  | triu | float32 | [1, 8192, 8192] | 0 | 0.797 | 0.611 | 1.31x
  | triu | float32 | [4, 1024, 1024] | 0 | 0.056 | 0.042 | 1.32x
  | triu | float32 | [4, 1021, 1021] | 0 | 0.058 | 0.044 | 1.32x
  | triu | float32 | [256, 128, 256] | 0 | 0.114 | 0.093 | 1.22x
  | triu | float32 | [128, 257, 125] | 0 | 0.051 | 0.036 | 1.43x
  | triu | float32 | [20480, 16, 16] | 0 | 0.075 | 0.061 | 1.23x
various dim |   |   |   |   |   |  
  | triu_ | float32 | [3072, 3072] | 0 | 0.093 | 0.037 | 2.49x
  | triu_ | float32 | [1, 3072, 3072] | 0 | 0.114 | 0.045 | 2.52x
  | triu_ | float32 | [1, 1, 3072, 3072] | 0 | 0.138 | 0.053 | 2.60x
  | triu | float32 | [3072, 3072] | 0 | 0.097 | 0.091 | 1.07x
  | triu | float32 | [1, 3072, 3072] | 0 | 0.116 | 0.091 | 1.27x
  | triu | float32 | [1, 1, 3072, 3072] | 0 | 0.140 | 0.090 | 1.55x
various k |   |   |   |   |   |   |  
  | triu_ | float16 | [1, 3072, 3072] | 0 | 0.108 | 0.029 | 3.79x
  | triu_ | float16 | [1, 3072, 3072] | 1536 | 0.103 | 0.042 | 2.44x
  | triu_ | float16 | [1, 3072, 3072] | -1536 | 0.114 | 0.020 | 5.68x
  | triu | float16 | [1, 3072, 3072] | 0 | 0.108 | 0.049 | 2.22x
  | triu | float16 | [1, 3072, 3072] | 1536 | 0.104 | 0.039 | 2.65x
  | triu | float16 | [1, 3072, 3072] | -1536 | 0.115 | 0.058 | 2.00x

# Benchmark Code

```python3
import time
import torch

torch.manual_seed(42)

def timeit(f, run_times=1000):
    torch.cuda.synchronize()
    t1 = time.time()
    for _ in range(run_times):
        f()
    torch.cuda.synchronize()
    t2 = time.time()
    return (t2 - t1) / run_times

for dtype in [torch.int8, torch.float16, torch.float32, torch.float64]:
    for shape in [
        [1, 8192, 8192],
        [3072, 3072],
        [1, 3072, 3072],
        [1, 1, 3072, 3072],
        [4, 1024, 1024],
        [4, 1021, 1021],
        [256, 128, 256],
        [128, 257, 125],
        [20480, 16, 16],
    ]:
        for k in [0, shape[-1] // 2, -shape[-1] // 2]:
            a = torch.empty(shape, dtype=dtype, device="cuda")
            for _ in range(4):
                t_triu = timeit(lambda: a.triu(k))
                t_triu_ = timeit(lambda: a.triu_(k))
                t_clone = timeit(lambda: a.clone())
                print(dtype, shape, f"{k=}", f"triu_ {t_triu_ * 1000:.6f} ({t_triu_ / t_clone:.2f}xMemcpy)", f"triu {t_triu * 1000:.6f} ({t_triu / t_clone:.2f}xMemcpy)")

            a = torch.rand(shape, device="cuda")
            a = (a * 10).to(dtype)
            assert (a.triu(k) == a.cpu().triu(k).cuda()).all()
            assert (a.tril(k) == a.cpu().tril(k).cuda()).all()
            assert (a.clone().triu_(k) == a.triu(k)).all()
            assert (a.clone().tril_(k) == a.tril(k)).all()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115013
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-01-18 19:58:00 +00:00
16ebfbbf07 All tests run with markDynamoStrictTest now (#117763)
Last test to remove from the denylist was dynamo/test_logging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117763
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117729, #117747, #117754, #117761
2024-01-18 19:42:41 +00:00
5278200507 Add some better docs for dynamo_test_failures.py (#117761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117761
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117729, #117747, #117754
2024-01-18 19:42:41 +00:00
07216721cf [codemod] markDynamoStrictTest batch 23 (#117754)
[codemod] markDynamoStrictTest test_custom_ops
[codemod] markDynamoStrictTest test_python_dispatch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117754
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117729, #117747
2024-01-18 19:37:04 +00:00
def4959662 Revert "[inductor] allow mm template to accumulate with float16 dtype (#117479)"
This reverts commit a7fbbc2a4a05fa4863f9d0e2adabcdc5e276c675.

Reverted https://github.com/pytorch/pytorch/pull/117479 on behalf of https://github.com/PaliC due to breaking tests internally ([comment](https://github.com/pytorch/pytorch/pull/117479#issuecomment-1899032973))
2024-01-18 18:53:37 +00:00
suo
23d53a4360 add test_public_bindings to internal CI (#117712)
enable this test in meta-internal CI, since it's mildly infuriating to not be able to locally test this when working inside meta

One change:
This test uses `pkgutil.walk_packages`, which ignores namespace packages. A quirk in Meta's internal python packaging system is that it adds `__init__.py` to each source directory. So this test picks up more files to check internally than in the GitHub CI.

So I changed this test from using raw `pkgutil` to a version that also looks into namespace packages, so we're checking the same thing across both CIs.

Differential Revision: [D52857631](https://our.internmc.facebook.com/intern/diff/D52857631/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117712
Approved by: https://github.com/ezyang
2024-01-18 18:20:43 +00:00
1b773df3c6 Place .lrodata later in the binary (#117575)
Summary:
By default, in LLD 16, .lrodata is placed immediately after .rodata.
However, .lrodata can be very large in our compiled models, which leads to
relocation out-of-range errors for relative relocations. So we place it
after other the sections that are referenced from .text using relative
relocations. This is the default behavior in GNU ld.
Reviewed By: muchulee8, desertfire, khabinov, chenyang78

Differential Revision: D52557846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117575
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2024-01-18 17:58:18 +00:00
7451dd0585 Revert "Add node meta value into UnflattenedModule (#117686)"
This reverts commit cbf24ba962f72175ec1c71a25f3379f7d9149ec1.

Reverted https://github.com/pytorch/pytorch/pull/117686 on behalf of https://github.com/PaliC due to breaks internal modeling tests ([comment](https://github.com/pytorch/pytorch/pull/117686#issuecomment-1898939899))
2024-01-18 17:46:38 +00:00
5aa895e53e Don't run inductor tests in Dynamo shard (#117747)
In theory we could, but these get really slow once we turn on strict
mode, so we're not going to for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117747
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117729
2024-01-18 17:43:30 +00:00
646229218f Revert "[export] Error on not pytree-flattened nodes (#117598)"
This reverts commit 560213de2d8f734987e25680e72d565501ab8318.

Reverted https://github.com/pytorch/pytorch/pull/117598 on behalf of https://github.com/PaliC due to breaking executorch tests internally ([comment](https://github.com/pytorch/pytorch/pull/117598#issuecomment-1898926720))
2024-01-18 17:37:59 +00:00
4720109d7f [dynamo] add common methods to DistributedVariable (#117590)
This PR refactors the distributed related variables to use
DistributedVariable for common methods, so that things like
`python_type` works for all distributed variables.

Maybe we can add `as_python_constant` to the DistributedVariable too? I
didn't add it in this PR, but if that makes sense I can update.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117590
Approved by: https://github.com/voznesenskym
2024-01-18 17:32:31 +00:00
044b9012d5 Update PocketFFT (#117595)
This updates PocketFFT submodule to 9d3ab05a7f

Probably fixes https://github.com/pytorch/pytorch/issues/117589 (as it includes https://github.com/mreineck/pocketfft/issues/5 that should fix PocketFFT compilation on Windows)

Also adjust `#if __cplusplus >= 201703` replace path in Android scripts (need to submit the fix back to PocketFFT)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117595
Approved by: https://github.com/huydhn
2024-01-18 17:08:44 +00:00
db1a6eda9e [codemod] markDynamoStrictTest batch 22 (#117729)
[codemod] markDynamoStrictTest test_autograd
[codemod] markDynamoStrictTest test_ao_sparsity
[codemod] markDynamoStrictTest test_jit
[codemod] markDynamoStrictTest test_quantization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117729
Approved by: https://github.com/bdhirsh
2024-01-18 16:59:26 +00:00
fa86fa7a61 Fix MSVC 14.38 - VS 2022 Build (#117497)
Fixes #115922

This PR is prepared to separate existing https://github.com/pytorch/pytorch/pull/116926 and to apply suggestions in the review.

`scalar_t` which is defined as `c10::impl::ScalarTypeToCPPType<ScalarType::Half>::t` appears to be causing the issue with `Visual Studio 2022 17.8.4`  (coming with `MSVC 14.38.33130`)

Error message:
```
aten\src\ATen/cpu/vec/vec_base.h(150): fatal error C1001: Internal compiler error.
(compiler file 'D:\a_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\toinil.c', line 910)
```

---

Related line was added for a similar issue before as a workaround (`scalar_t` definition) [Fix compile error for vs2022](https://github.com/pytorch/pytorch/pull/85958)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117497
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-01-18 16:53:27 +00:00
a669319450 [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
2024-01-18 16:20:12 +00:00
6e4e81a9ef [dynamo] Extend LazyVariableTracker to tuples (#117426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117426
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-01-18 15:51:28 +00:00
26956980c6 [AOTI] Add torch._export.aot_load (#117610)
Summary: Add a torch._export.aot_load API that can load an AOTInductor-compiled model.so into a python executable.
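A hedged usage sketch (the `model.so` path, device and shapes are placeholders; assumes the shared library was produced earlier, e.g. with `torch._export.aot_compile`):

```python
import torch

# Load an AOTInductor-compiled shared library into the current Python process.
runner = torch._export.aot_load("model.so", device="cuda")

x = torch.randn(8, 16, device="cuda")
out = runner(x)
```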

Test Plan: CI

Differential Revision: D52825456

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117610
Approved by: https://github.com/angelayi, https://github.com/khabinov, https://github.com/chenyang78
2024-01-18 15:02:16 +00:00
2fb9d8811f Don't try to directly compare symbols, it won't work (#117674)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117674
Approved by: https://github.com/lezcano
2024-01-18 12:18:45 +00:00
8bf788c390 [SAC][Dynamo] Add support for functools.partial in CheckpointHigherOrderVariable (#117657)
# Context

In some cases, we might want to build the `context_fn` with runtime-defined policies. One way of implementing this is to make `context_fn` be a partial, which holds the information that we want to pass. One concrete example is the [automatic policy selection from `xformers`](ad986981b1/xformers/checkpoint.py (L185)).

# The problem

The previous implementation wouldn't work with partials because `FunctoolsPartialVariable` doesn't have a `fn` attribute.

This PR addresses this case, but ideally we could get this solved in a more general fashion, as callable classes and `NestedUserFunctionVariable` are not supported by this PR.
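For illustration, a minimal sketch of the pattern this PR unblocks; the policy argument and the no-op contexts are placeholders for a real policy-driven `context_fn` such as the xformers one:

```python
import functools
from contextlib import nullcontext

import torch
from torch.utils.checkpoint import checkpoint

def make_contexts(policy):
    # A real implementation would build caching/recompute contexts from `policy`.
    return nullcontext(), nullcontext()

def gn(x):
    return torch.sigmoid(torch.matmul(x, x))

def fn(x):
    # context_fn is a functools.partial, which Dynamo can now handle.
    return checkpoint(
        gn,
        x,
        use_reentrant=False,
        context_fn=functools.partial(make_contexts, policy="recompute_all"),
    )

x = torch.randn(4, 4, requires_grad=True)
torch.compile(fn)(x).sum().backward()
```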

# Tests

I've added a basic test that mimics the tests around it. The tests could probably be simplified, but I've decided to keep changes to a minimum.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117657
Approved by: https://github.com/yf225
2024-01-18 11:59:23 +00:00
b0084be114 Revert "Re-enable SGD (#117434)"
This reverts commit e7fac72be75a9fa7a31c6fc8062364fdfc4aaa3a.

Reverted https://github.com/pytorch/pytorch/pull/117434 on behalf of https://github.com/lezcano due to breaks test_profiler.py when run with dynamo ([comment](https://github.com/pytorch/pytorch/pull/117434#issuecomment-1898311961))
2024-01-18 11:37:36 +00:00
0d1e7053ac [easy] Log guard failure (#117639)
Facilitates greatly debugging guard creation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117639
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #112252, #117630, #110524, #108420
2024-01-18 09:37:33 +00:00
4ba5318d3f [dynamo] Add DictView variable tracker (#108420)
This also starts a comparison pattern where we don't ask variables
what their type is, but what their capabilities are.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108420
Approved by: https://github.com/jansel
ghstack dependencies: #112252, #117630, #110524
2024-01-18 09:37:33 +00:00
f4df0f061c Implement set in terms of dict (#110524)
This allows us to heavily simplify the implementation of set, which was
"quite unique". Now we represent a set as a dict where all its values
are None.
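A plain-Python sketch of the representation (not Dynamo's VariableTracker code): a set behaves like a dict whose values are all None.

```python
items = {"a", "b", "c"}
as_dict = dict.fromkeys(items)   # {'a': None, 'b': None, 'c': None}

assert set(as_dict) == items     # same elements
assert "b" in as_dict            # membership works the same way
as_dict["d"] = None              # "add" an element
del as_dict["a"]                 # "discard" an element
```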

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110524
Approved by: https://github.com/jansel
ghstack dependencies: #112252, #117630
2024-01-18 09:36:41 +00:00
bc85eb948f Break on unsupported keys for dicts / elements for sets (#117630)
As per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117630
Approved by: https://github.com/jansel
ghstack dependencies: #112252
2024-01-18 09:35:46 +00:00
4512a95371 [easy]Remove specialized value (#112252)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112252
Approved by: https://github.com/jansel
2024-01-18 09:34:50 +00:00
2dd4a254a0 add Half support for interpolate operators on CPU (#105648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105648
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-18 09:07:16 +00:00
c9528a11dd Add Half support for masked_softmax on CPU (#117028)
Add Half support for `masked_softmax` on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117028
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-18 08:59:20 +00:00
e60bc502b4 [Inductor Intel GPU backend Upstream] Generalize part of Inductor test case (#117513)
Following the RFC https://github.com/pytorch/pytorch/issues/114856, before upstreaming the Intel XPU Inductor backend, we need to prepare corresponding Inductor test cases. This PR aims to generalize part of the Inductor test cases so that a new GPU backend can reuse the existing test cases with minimal code change.

This Pull Request preferentially generalizes the test cases that cover Inductor's base functionality as follows:
- test/inductor/test_codecache.py
- test/inductor/test_codegen_triton.py
- test/inductor/test_kernel_benchmark.py
- test/inductor/test_torchinductor.py
- test/inductor/test_torchinductor_codegen_dynamic_shapes.py
- test/inductor/test_torchinductor_dynamic_shapes.py
- test/inductor/test_torchinductor_opinfo.py
- test/inductor/test_triton_heuristics.py
- test/inductor/test_triton_wrapper.py

Feature request: https://github.com/pytorch/pytorch/issues/114856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117513
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-01-18 08:26:21 +00:00
cyy
b72ddbab60 [Clang-tidy header][15/N] Enable clang-tidy on headers in c10/cuda and c10/mobile (#116602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116602
Approved by: https://github.com/ezyang
2024-01-18 08:15:50 +00:00
57ca455471 [dynamo] Add hasattr support for TupleVariable (#117694)
Summary:
This change adds hasattr support for TupleVariable in dynamo.

This fix is part of: https://github.com/pytorch/pytorch/issues/117670
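A small sketch of the kind of code this lets Dynamo trace without falling back (the function itself is made up for illustration):

```python
import torch

def fn(x):
    t = (x, x + 1)
    # hasattr on a tuple is now handled by Dynamo's TupleVariable.
    if hasattr(t, "index"):
        return t[0] + t[1]
    return x

print(torch.compile(fn, fullgraph=True)(torch.ones(3)))
```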

Test Plan: Unit test and CI

Differential Revision: D52850665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117694
Approved by: https://github.com/yanboliang
2024-01-18 07:47:43 +00:00
bc9cb04822 Replaced CHECK with TORCH_CHECK in order to not abort, but throw a RuntimeError instead (#117653)

Fixes #117499.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117653
Approved by: https://github.com/antoniojkim, https://github.com/JackCaoG, https://github.com/alanwaketan
2024-01-18 07:47:22 +00:00
e7fac72be7 Re-enable SGD (#117434)
Re-enables the SGD optimizer now that compile times are more reasonable. [Benchmark run](https://github.com/pytorch/pytorch/actions/runs/7511073761)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117434
Approved by: https://github.com/anijain2305, https://github.com/janeyx99
2024-01-18 06:47:15 +00:00
79811e765c [2/4] Intel GPU Runtime Upstreaming for Device (#116833)
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR  covers the changes under `aten`.

# Design
We will compile the code for XPU separately into a library named `libtorch_xpu.so`. Currently, it primarily offers device-related APIs, including
- `getCurrentDeviceProperties`
- `getDeviceProperties`
- `getGlobalIdxFromDevice`
- `getDeviceFromPtr`

# Additional Context
`XPUHooks` is an indispensable part of the runtime. We upstream `XPUHooks` in this PR since it contains some `Device`-related code; we also refine some logic to avoid a forward declaration in `DLPack`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116833
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
2024-01-18 05:02:42 +00:00
61ea3036bc Allow explicit shutdown of the compile-worker pools (#117664)
Summary: Allow the trainer to explicitly shut down the compile-worker pools to save CPU resources, thereby avoiding QPS degradation.

Test Plan: See the test plan in D52839313
Differential Revision: D52839313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117664
Approved by: https://github.com/yanboliang
2024-01-18 04:56:11 +00:00
1859895ffa Docs: fix docstring errors in model_averaging (#117038)
pydocstyle check

averagers.py

Pre
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:1 at module level:
        D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:20 in public method `__init__`:
        D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:27 in public method `average_parameters`:
        D102: Missing docstring in public method
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:84 in public method `__init__`:
        D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:106 in public method `average_parameters`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:106 in public method `average_parameters`:
        D400: First line should end with a period (not '`')
6

Post
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:1 at module level:
        D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:20 in public method `__init__`:
        D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:27 in public method `average_parameters`:
        D102: Missing docstring in public method
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:84 in public method `__init__`:
        D107: Missing docstring in __init__
4

utils.py

Pre
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:1 at module level:
        D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:17 in public function `average_parameters`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:45 in public function `get_params_to_average`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:45 in public function `get_params_to_average`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:68 in public function `average_parameters_or_parameter_groups`:
        D200: One-line docstring should fit on one line with quotes (found 3)
5

Post
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:1 at module level:
        D100: Missing docstring in public module
1

hierarchical_model_averager.py

Pre
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:1 at module level:
        D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:16 in public class `HierarchicalModelAverager`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:98 in public method `__init__`:
        D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`:
        D400: First line should end with a period (not ',')
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:151 in public method `average_parameters`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:151 in public method `average_parameters`:
        D400: First line should end with a period (not '`')
8

Post
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:1 at module level:
        D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:99 in public method `__init__`:
        D107: Missing docstring in __init__
2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117038
Approved by: https://github.com/H-Huang
2024-01-18 04:12:51 +00:00
4f2620ce56 [PT2][split_cat] fix a bug in merge_splits (#117707)
Summary: Recently, we found that merge_splits (D45204109) is not working for the AFOC model, so this patches a fix.

Test Plan:
The error log: P1046934021
# Flows used to local reproduce
### non-first:
f522317780
after the fix: P1047603217
### first:
f522253163
after the fix: P1047764917

Differential Revision: D52856359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117707
Approved by: https://github.com/jackiexu1992
2024-01-18 04:04:32 +00:00
suo
02c96f6949 [export] modify torch.export tests to pass a Module in (#117572)
We have a lot of tests that pass a function to torch.export.

We are planning to disallow this, so fix up the tests to pass a module in.

Differential Revision: [D52791309](https://our.internmc.facebook.com/intern/diff/D52791309/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117572
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #117570, #117571
2024-01-18 03:40:40 +00:00
suo
ccc8440609 [export] introduce WrapperModule (#117571)
Simple module to wrap a callable. This is a useful utility for when we start requiring that torch.export take an nn.Module.
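
A rough sketch of the idea (not necessarily the exact class that landed):

```python
import torch

class WrapperModule(torch.nn.Module):
    """Wrap a plain callable so it can be passed to torch.export, which expects an nn.Module."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, *args, **kwargs):
        return self.fn(*args, **kwargs)

ep = torch.export.export(WrapperModule(lambda x: x + 1), (torch.ones(3),))
```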

Differential Revision: [D52791310](https://our.internmc.facebook.com/intern/diff/D52791310/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117571
Approved by: https://github.com/tugsbayasgalan, https://github.com/avikchaudhuri
ghstack dependencies: #117570
2024-01-18 03:40:34 +00:00
suo
5697986482 [export] change exportdb to require torch.nn.Module (#117570)
Part of the effort to make torch.export require nn.Module.

Differential Revision: [D52631366](https://our.internmc.facebook.com/intern/diff/D52631366/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117570
Approved by: https://github.com/tugsbayasgalan
2024-01-18 03:40:10 +00:00
41153542ae Use wait stream instead of synchronize() in cudagraph warmup (#117578)
Fix for https://github.com/pytorch/pytorch/issues/113895

There are three phases to cudagraph trees: warmup, recording, and execution. During recording and execution we execute under the current stream. In warmup we execute under a side stream that we also use for cudagraph recording, so as to reuse memory.

After we execute on the side stream we need to sync the current stream to the side stream. Previously there was a `torch.cuda.synchronize` but not a `torch.cuda.current_stream().wait_stream(stream)`. This PR removes the global sync and adds a wait_stream. I have confirmed that it fixes https://github.com/pytorch/pytorch/issues/113895.

It's not entirely clear to me why torch.cuda.synchronize would be insufficient - I would have thought the global sync would encompass the stream to stream sync. However, we do have a number of [instances](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/compile_fx.py#L748-L749) throughout the code base where we do a stream->stream sync after the global sync, so clearly I am missing something here. In any case, the stream->stream sync is better perf than a global synchronize.
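
A simplified sketch of the stream-to-stream sync pattern described above (the model and input are placeholders, not the cudagraph-trees code itself):

```python
import torch

model = torch.nn.Linear(8, 8).cuda()
inp = torch.randn(4, 8, device="cuda")

side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    out = model(inp)  # warmup runs on the side stream also used for cudagraph recording
# make the current stream wait on the side stream instead of a global synchronize()
torch.cuda.current_stream().wait_stream(side_stream)
```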

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117578
Approved by: https://github.com/zdevito
2024-01-18 03:33:44 +00:00
560213de2d [export] Error on not pytree-flattened nodes (#117598)
Attempts to make the input/output mismatch error better by first checking whether the inputs/outputs can be pytree-flattened into supported types (tensors, symints, ...). So if the user passes in a data structure which does not have a pytree flatten registration, this will error with the message "It looks like one of the inputs is with type CustomType is not supported or pytree flatten-able.... please register a pytree flatten/unflatten function using the pytree.register_pytree_node API".

The check inside of produce_matching should now only error if something unexpected happens (dynamo accidentally adds an input or removes an output), and should be considered an internal error.
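
A minimal sketch of registering a custom container so it becomes pytree flatten-able (the `Pair` class below is made up for illustration):

```python
import torch.utils._pytree as pytree

class Pair:
    def __init__(self, a, b):
        self.a, self.b = a, b

pytree.register_pytree_node(
    Pair,
    lambda p: ((p.a, p.b), None),           # flatten: (children, context)
    lambda children, ctx: Pair(*children),  # unflatten
)

leaves, spec = pytree.tree_flatten(Pair(1, 2))
print(leaves)  # [1, 2]
```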

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117598
Approved by: https://github.com/avikchaudhuri, https://github.com/BowenBao
2024-01-18 03:06:42 +00:00
634ce3c913 Document and type torch._inductor.virtualized (#117658)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117658
Approved by: https://github.com/eellison, https://github.com/peterbell10
ghstack dependencies: #117650
2024-01-18 03:03:20 +00:00
16ff6cd340 Catch some missing unbacked symbol dependencies (#117650)
Whenever an IR node has reference to an unbacked SymInt, we must
register it as a use of the unbacked SymInt.

This fix isn't complete but the rest of the fix is fairly difficult, so
putting this in to start.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117650
Approved by: https://github.com/lezcano
2024-01-18 03:03:20 +00:00
cb2b98ad6b [codemod] markDynamoStrictTest batch 21 (#117609)
[codemod] markDynamoStrictTest test_torch
[codemod] markDynamoStrictTest test_ops_gradients
[codemod] markDynamoStrictTest test_ops
[codemod] markDynamoStrictTest test_modules
[codemod] markDynamoStrictTest test_ops_jit
[codemod] markDynamoStrictTest test_ops_fwd_gradients
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117609
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117700, #117701, #117702
2024-01-18 02:49:26 +00:00
bbf65bc451 Revert "[Dynamo] Remove the workaround since it has been fixed (#117615)"
This reverts commit b3e2571e83eff4a5ce45a7ad037c2fa2df87da9d.

Reverted https://github.com/pytorch/pytorch/pull/117615 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it seems to start failing some dynamo tests in trunk b3e2571e83.  I try to disable the failed test but yet another one shows up ([comment](https://github.com/pytorch/pytorch/pull/117615#issuecomment-1897683076))
2024-01-18 02:48:34 +00:00
cbf24ba962 Add node meta value into UnflattenedModule (#117686)
Fixes #116670
Following the lead of #116720, added node.meta['val'] back to newly created subgraphs.

node.meta['val'] is essential to ONNX in terms of the shape and type information.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117686
Approved by: https://github.com/angelayi
2024-01-18 02:37:15 +00:00
6d96beb6be [c10d] Remove health check (#117699)
https://github.com/pytorch/pytorch/pull/114916 and https://github.com/pytorch/pytorch/pull/116222 added support for eager NCCL comm init (performed as soon as `init_process_group` is called).

If any user cares about the time difference and wants to see NCCL init errors early, they can use eager init now.
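
A hedged sketch of opting into eager init, assuming the `device_id` argument introduced by the linked PRs is the trigger:

```python
import torch
import torch.distributed as dist

local_rank = 0  # placeholder; normally taken from the launcher environment
# passing device_id eagerly creates the NCCL communicator at init time,
# surfacing init errors early instead of at the first collective
dist.init_process_group(
    backend="nccl",
    device_id=torch.device("cuda", local_rank),
)
```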

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117699
Approved by: https://github.com/wconstab
2024-01-18 02:14:49 +00:00
21ddca4225 Enable HIP build for //sigrid/predictor:pytorch_disagg_gpu_task (#117616)
Summary: Tweak some header includes, and explicitly ignore the hipEventDestroy return value.

Test Plan: CI

Reviewed By: jiaqizhai

Differential Revision: D52722234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117616
Approved by: https://github.com/xw285cornell
2024-01-18 01:37:50 +00:00
3882714168 Fix check-labels.yml for ghstack PRs (#117680)
Otherwise check-labels doesn't run on ghstack PRs; see https://github.com/pytorch/pytorch/pull/117609 for example: no Check Labels workflow run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117680
Approved by: https://github.com/izaitsevfb
2024-01-18 01:33:55 +00:00
f7143b79bd Stricter pull_request_target in labeler.yml (#117677)
Copied from https://github.com/pytorch/pytorch/blob/main/.github/workflows/check-labels.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117677
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2024-01-18 01:33:49 +00:00
58c4bc62bb [c10d] Deprecate Work.result() (#117565)
Work.result() returns a vector of tensors. This signature is problematic as some collectives may return just one tensor (e.g. all-reduce), while others may return multiple tensors (e.g. all-gather).

It would be clearer/easier for users to directly access the result via the tensor/tensorlist passed to the collective APIs.

Deprecating work.result() would also allow us to remove the `outputs_` field in the Work class, avoiding an "artificial" reference to the tensor, which could potentially hold up the tensor's memory.
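
A small sketch of the recommended pattern: read results from the tensor passed to the collective instead of `work.result()`:

```python
import torch
import torch.distributed as dist

t = torch.ones(4, device="cuda")
work = dist.all_reduce(t, async_op=True)
work.wait()
print(t)  # the reduced values live in the tensor that was passed in
```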

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117565
Approved by: https://github.com/wconstab
2024-01-18 01:22:37 +00:00
5aa92b5090 [CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)
#113713

Going to clean up some of the checks and will remove draft status after.
Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`.
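
A minimal sketch of exercising the path (shapes and dtype are illustrative):

```python
# run with: TORCH_CUDNN_MHA_ENABLED=1 python sdpa_example.py   (SM80+ GPU)
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16) for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v)
```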

CC @drisspg @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663
Approved by: https://github.com/drisspg
2024-01-18 01:20:36 +00:00
a60b566d37 [TorchElastic] Support for overprovisioning in C10 based rendezvous (#117066)
Summary:
Allow TorchElastic to manage more nodes than the maximum nnodes specified in a job. They will be used as spare capacity/warm nodes for schedulers that support elasticity.

RFC: https://github.com/pytorch/pytorch/issues/114097

Test Plan: Integration tests

Differential Revision: D52343874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117066
Approved by: https://github.com/zdevito
2024-01-18 01:16:55 +00:00
a1afd1b195 Revert "[inductor] Faster C++ kernel python bindings (#117500)"
It should have never been landed, but was landed again thanks to ghstack grafting/ungrafting; see discussion on https://github.com/pytorch/pytorch/pull/116910

This reverts commit e457b6fb18782425661e8a09d0222d0b29518ad1.
2024-01-17 17:06:32 -08:00
410515241d [c10d] Remove CoalescedWorkNCCL (#117696)
`CoalescedWorkNCCL` is dead code now. Nowhere is it used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117696
Approved by: https://github.com/wconstab
2024-01-18 01:00:43 +00:00
387ea260af [c10d] Enable watchdog for coalesced work (#117682)
Fixes https://github.com/pytorch/pytorch/issues/114301

Previously, coalesced work (created by `end_coalescing`) was not watched by the watchdog, which resulted in silent timeouts.

The culprit is that we reset `coalescing_state_` to 0 before checking it to see if we should enqueue a work.

Example:
```
import torch
import torch.distributed as dist
from datetime import timedelta

dist.init_process_group(backend="nccl", timeout=timedelta(seconds=10))
rank = dist.get_rank()
world_size = dist.get_world_size()
device = torch.device(f"cuda:{rank}")

# Create tensors of different sizes to create hang
s = 100 * 1024 * 1024 * (world_size - rank)
with dist._coalescing_manager(device=device):
    dist.all_reduce(torch.ones(s, device=device))
    dist.broadcast(torch.ones(s, device=device), src=0)

torch.cuda.synchronize()
print(f"{dist.get_rank()} done")

```

Watchdog fires:
```
$ torchrun --nproc-per-node 2 example.py
...
[rank1]:[E ProcessGroupNCCL.cpp:545] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=10000) ran for 10000 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:545] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=10000) ran for 10567 milliseconds before timing out.
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117682
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-01-18 00:42:36 +00:00
cyy
396a5c3091 [Exception] [4/N] Replace torch::IndexError and torch::ValueError with C10 counterparts (#117317)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117317
Approved by: https://github.com/ezyang
2024-01-18 00:35:29 +00:00
c64fd8b89c [codemod] markDynamoStrictTest batch 20 (#117702)
[codemod] markDynamoStrictTest test_tensorexpr_pybind
[codemod] markDynamoStrictTest test_tensorexpr
[codemod] markDynamoStrictTest test_jit_llga_fuser
[codemod] markDynamoStrictTest test_jit_fuser_te

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117702
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117700, #117701
2024-01-18 00:30:22 +00:00
3770311093 [codemod] markDynamoStrictTest batch 19 (#117701)
[codemod] markDynamoStrictTest export/test_verifier
[codemod] markDynamoStrictTest export/test_upgrade
[codemod] markDynamoStrictTest export/test_unflatten
[codemod] markDynamoStrictTest export/test_serialize
[codemod] markDynamoStrictTest export/test_serdes
[codemod] markDynamoStrictTest export/test_retraceability
[codemod] markDynamoStrictTest export/test_passes
[codemod] markDynamoStrictTest export/test_pass_infra
[codemod] markDynamoStrictTest export/test_functionalized_assertions
[codemod] markDynamoStrictTest export/test_export_nonstrict
[codemod] markDynamoStrictTest export/test_export
[codemod] markDynamoStrictTest export/test_experimental
[codemod] markDynamoStrictTest export/test_db

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117701
Approved by: https://github.com/bdhirsh, https://github.com/malfet
ghstack dependencies: #117700
2024-01-18 00:30:22 +00:00
82c0083819 Fix triton wheels build (take 2) (#117706)
Sorry, I should have been more thorough in reviewing https://github.com/pytorch/pytorch/pull/117648. Triton wheels are built off the `main` branch, rather than `nightly`; see
2db53a01e5/.github/workflows/build-triton-wheel.yml (L1-L6)

Test plan: merge and hope for the best :P

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117706
Approved by: https://github.com/huydhn, https://github.com/atalman
2024-01-18 00:26:36 +00:00
898f6a48a9 [codemod] markDynamoStrictTest batch 18 (#117700)
[codemod] markDynamoStrictTest functorch/test_vmap
[codemod] markDynamoStrictTest profiler/test_profiler_tree
[codemod] markDynamoStrictTest profiler/test_profiler
[codemod] markDynamoStrictTest profiler/test_memory_profiler
[codemod] markDynamoStrictTest functorch/test_ops
[codemod] markDynamoStrictTest functorch/test_aotdispatch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117700
Approved by: https://github.com/bdhirsh
2024-01-18 00:25:38 +00:00
b3e2571e83 [Dynamo] Remove the workaround since it has been fixed (#117615)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117615
Approved by: https://github.com/angelayi
2024-01-18 00:21:22 +00:00
3114813314 Replace constraints with dynamic_shapes in deeplearning/aot_inductor test (#117573)
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `deeplearning/aot_inductor/test/test_custom_ops.py`.

Test Plan: buck test mode/dev-nosan fbcode//deeplearning/aot_inductor/test:test_custom_ops -- test_export_extern_fallback_nodes_dynamic_shape

Differential Revision: D52790332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117573
Approved by: https://github.com/angelayi
2024-01-17 23:50:08 +00:00
2db53a01e5 propagate torch stack trace metadata to copy_() nodes during input mutations (#117587)
Tested by running the below script:
```
import torch
@torch.compile(backend="aot_eager", fullgraph=True)
def f(x):
    y = x.view(-1)
    y.mul_(2)
    return

x = torch.ones(4)
f(x)
```

Which gives me this ATen graph (notice that the copy_() node is bundled under the stacktrace for `mul_(2)`):
```
 ===== Forward graph 0 =====
 <eval_with_key>.2 from /data/users/hirsheybar/e/pytorch/torch/fx/experimental/proxy_tensor.py:521 in wrapped class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[4]"):
        # File: /data/users/hirsheybar/e/pytorch/tmp5.py:8, code: y = x.view(-1)
        view: "f32[4]" = torch.ops.aten.view.default(arg0_1, [-1])

        # File: /data/users/hirsheybar/e/pytorch/tmp5.py:9, code: y.mul_(2)
        mul: "f32[4]" = torch.ops.aten.mul.Tensor(view, 2);  view = None
        view_1: "f32[4]" = torch.ops.aten.view.default(mul, [4]);  mul = None
        copy_: "f32[4]" = torch.ops.aten.copy_.default(arg0_1, view_1);  arg0_1 = view_1 = None
        return ()

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117587
Approved by: https://github.com/eellison
2024-01-17 23:07:45 +00:00
26a63907ba Ordering placeholder and get_attr nodes in unflattened module (#116910)
Previous to this PR, the generated unflattened module could mix up the order of `placeholder` and newly created `get_attr`. As `placeholder` is the input of a function, it should be placed ahead of `get_attr` nodes.

Before:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
get_attr       bias         bias                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
placeholder    l_x_         l_x_                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

After:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
placeholder    l_x_         l_x_                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
get_attr       bias         bias                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116910
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #117409, #116667, #117591, #117500
2024-01-17 23:03:15 +00:00
e457b6fb18 [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
ghstack dependencies: #117409, #116667, #117591
2024-01-17 23:03:15 +00:00
763ddb396d Revert "[codemod] markDynamoStrictTest batch 18 (#117604)"
This reverts commit 24f288114a696a27771c075b8e8df556c13eced6.

Reverted https://github.com/pytorch/pytorch/pull/117604 on behalf of https://github.com/zou3519 due to probably a crossed merge? ([comment](https://github.com/pytorch/pytorch/pull/117604#issuecomment-1897082562))
2024-01-17 22:16:27 +00:00
01c0c67937 Revert "[codemod] markDynamoStrictTest batch 19 (#117605)"
This reverts commit 0cda1e0b218895ce6121531991348b8bcbce9b94.

Reverted https://github.com/pytorch/pytorch/pull/117605 on behalf of https://github.com/zou3519 due to probably a crossed merge? ([comment](https://github.com/pytorch/pytorch/pull/117605#issuecomment-1897065994))
2024-01-17 22:12:59 +00:00
87c2427173 Revert "[codemod] markDynamoStrictTest batch 20 (#117606)"
This reverts commit 308e154af5fd6388f49eabe631e7b78ca3ac9c39.

Reverted https://github.com/pytorch/pytorch/pull/117606 on behalf of https://github.com/zou3519 due to probably a crossed merge? ([comment](https://github.com/pytorch/pytorch/pull/117606#issuecomment-1897042843))
2024-01-17 22:08:20 +00:00
84cfe6d8b2 Drop all gather stats to debug not warning (#117669)
The logger's default level results in these all-gather stats being spammed into every run, which is very annoying.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117669
Approved by: https://github.com/Skylion007, https://github.com/awgu
2024-01-17 21:44:59 +00:00
8841d26046 [dynamo] LazyVariable - redirect __str__ to the realized variable __str__ (#117583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117583
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-01-17 21:12:12 +00:00
a7fbbc2a4a [inductor] allow mm template to accumulate with float16 dtype (#117479)
Fixes #108621

replace #108637 and #108982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117479
Approved by: https://github.com/jansel
2024-01-17 21:01:14 +00:00
208e64a9ba Initial implementation of FakeTensor caching (#113873)
Summary: Cache the result of FakeTensor dispatch and skip re-evaluation on cache hits.
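
A conceptual sketch only (not FakeTensorMode's actual cache): results are keyed on the op plus the metadata of the fake inputs, so an identical call skips re-evaluation:

```python
def make_cached_dispatch(dispatch_fn):
    cache = {}
    def cached(op, args):
        key = (op, tuple((tuple(a.shape), a.dtype) for a in args))
        if key not in cache:
            cache[key] = dispatch_fn(op, args)  # only evaluated on a cache miss
        return cache[key]
    return cached
```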

Test Plan: New unit tests. Caching is enabled in this diff, so all existing tests exercise the cache as well.

Differential Revision: [D52841637](https://our.internmc.facebook.com/intern/diff/D52841637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113873
Approved by: https://github.com/eellison
2024-01-17 20:38:54 +00:00
c0940d2e93 [pytree] reuse flatten_fn in flatten_with_keys_fn to ensure consistency (#117656)
Reuse `flatten_fn` in `flatten_with_keys_fn` to ensure `flatten_fn` and `flatten_with_keys_fn` get the same `leaves` and `context`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117656
Approved by: https://github.com/suo
2024-01-17 20:38:49 +00:00
bffc8ecfb0 [codemod] Fix shadows in PyTorch (#117562)
Test Plan: Sandcastle

Differential Revision: D52802592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117562
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-01-17 20:33:50 +00:00
da6abaeeac Revert "[inductor] Faster C++ kernel python bindings (#117500)"
This reverts commit bb0fd1bd3ca145b77159427bc5bacf5f98ec3896.

Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896516512))
2024-01-17 19:34:26 +00:00
cb0bfcf590 Revert "Ordering placeholder and get_attr nodes in unflattened module (#116910)"
This reverts commit 12561bb5fed08283baf7a31e6678341a04e83adb.

Reverted https://github.com/pytorch/pytorch/pull/116910 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896516512))
2024-01-17 19:34:26 +00:00
89cf1ddb5c [AOTInductor] Allow user to explicitly specify Device to run on (#117413)
Summary:
AOTInductor currently infers the CUDA device index via `cudaGetDevice()`. This assumes the outer runtime calls `cudaSetDevice()` somewhere before invoking the AOTInductor run.

This diff adds an explicit argument for specifying the target device, e.g. compiled on "cuda:0", run on "cuda:1".

todo:
- Are the changes in interface.h BC-breaking? They change the function signatures in the .so file. We might just need to introduce a new "Create" function.

Test Plan: CI

Differential Revision:
D52747132

Privacy Context Container: 368960445142440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117413
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/khabinov
2024-01-17 19:28:04 +00:00
308e154af5 [codemod] markDynamoStrictTest batch 20 (#117606)
[codemod] markDynamoStrictTest test_tensorexpr_pybind
[codemod] markDynamoStrictTest test_tensorexpr
[codemod] markDynamoStrictTest test_jit_llga_fuser
[codemod] markDynamoStrictTest test_jit_fuser_te
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117606
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219, #117604, #117605
2024-01-17 19:20:11 +00:00
0cda1e0b21 [codemod] markDynamoStrictTest batch 19 (#117605)
[codemod] markDynamoStrictTest export/test_verifier
[codemod] markDynamoStrictTest export/test_upgrade
[codemod] markDynamoStrictTest export/test_unflatten
[codemod] markDynamoStrictTest export/test_serialize
[codemod] markDynamoStrictTest export/test_serdes
[codemod] markDynamoStrictTest export/test_retraceability
[codemod] markDynamoStrictTest export/test_passes
[codemod] markDynamoStrictTest export/test_pass_infra
[codemod] markDynamoStrictTest export/test_functionalized_assertions
[codemod] markDynamoStrictTest export/test_export_nonstrict
[codemod] markDynamoStrictTest export/test_export
[codemod] markDynamoStrictTest export/test_experimental
[codemod] markDynamoStrictTest export/test_db
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117605
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219, #117604
2024-01-17 19:20:11 +00:00
24f288114a [codemod] markDynamoStrictTest batch 18 (#117604)
[codemod] markDynamoStrictTest functorch/test_vmap
[codemod] markDynamoStrictTest profiler/test_profiler_tree
[codemod] markDynamoStrictTest profiler/test_profiler
[codemod] markDynamoStrictTest profiler/test_memory_profiler
[codemod] markDynamoStrictTest functorch/test_ops
[codemod] markDynamoStrictTest functorch/test_aotdispatch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117604
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219
2024-01-17 19:20:01 +00:00
006d655956 [codemod] markDynamoStrictTest batch 17 (#117219)
[codemod] markDynamoStrictTest test_xnnpack_integration
[codemod] markDynamoStrictTest test_vulkan
[codemod] markDynamoStrictTest test_package
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117219
Approved by: https://github.com/bdhirsh
2024-01-17 19:19:50 +00:00
1967165d4d [codemod] markDynamoStrictTest batch 16 (#117218)
[codemod] markDynamoStrictTest test_public_bindings
[codemod] markDynamoStrictTest test_package
[codemod] markDynamoStrictTest test_legacy_vmap
[codemod] markDynamoStrictTest test_namedtensor
[codemod] markDynamoStrictTest test_fx
[codemod] markDynamoStrictTest test_dataloader
[codemod] markDynamoStrictTest test_content_store
[codemod] markDynamoStrictTest test_schema_check
[codemod] markDynamoStrictTest lazy/test_ts_opinfo
[codemod] markDynamoStrictTest functorch/test_vmap_registrations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117218
Approved by: https://github.com/bdhirsh, https://github.com/voznesenskym
ghstack dependencies: #117409, #116667, #117591, #117500, #116910, #117553
2024-01-17 19:12:41 +00:00
ca0abf8606 Add inductor-specific testing strict mode denylist (#117553)
We have one for Dynamo that currently applies to all "compile"
configurations (PYTORCH_TEST_WITH_DYNAMO, PYTORCH_TEST_WITH_INDUCTOR). I
don't want to figure out the inductor situation right now, so we're
going to add another denylist for inductor and work through it later.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117553
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117409, #116667, #117591, #117500, #116910
2024-01-17 19:12:41 +00:00
12561bb5fe Ordering placeholder and get_attr nodes in unflattened module (#116910)
Previous to this PR, the generated unflattened module could mix up the order of `placeholder` and newly created `get_attr`. As `placeholder` is the input of a function, it should be placed ahead of `get_attr` nodes.

Before:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
get_attr       bias         bias                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
placeholder    l_x_         l_x_                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

After:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
placeholder    l_x_         l_x_                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
get_attr       bias         bias                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116910
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #117409, #116667, #117591, #117500
2024-01-17 19:12:33 +00:00
bb0fd1bd3c [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
ghstack dependencies: #117409, #116667, #117591
2024-01-17 19:12:24 +00:00
0c26565d5d Revert "Add pull request target to bc lint (#106065)"
This reverts commit d4136c90882337a0891f5216292e9e3d55c13262.

Reverted https://github.com/pytorch/pytorch/pull/106065 on behalf of https://github.com/izaitsevfb due to Tightening CI security ([comment](https://github.com/pytorch/pytorch/pull/106065#issuecomment-1896439167))
2024-01-17 18:51:46 +00:00
9da01affd3 Revert "[inductor] Faster C++ kernel python bindings (#117500)"
This reverts commit 3a52147cc59b240737602d3d046080bbf6f567f1.

Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))
2024-01-17 18:42:39 +00:00
8c7e3a18ff Revert "Ordering placeholder and get_attr nodes in unflattened module (#116910)"
This reverts commit 5e0e78585d9f662ecb957c327c8d3fa31bff4f9a.

Reverted https://github.com/pytorch/pytorch/pull/116910 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))
2024-01-17 18:42:39 +00:00
e877c2e6ff Revert "Add inductor-specific testing strict mode denylist (#117553)"
This reverts commit ab6207a34248fdf2d2766d0062f358b63380e151.

Reverted https://github.com/pytorch/pytorch/pull/117553 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))
2024-01-17 18:42:39 +00:00
7f3cac06b9 Revert "[codemod] markDynamoStrictTest batch 16 (#117218)"
This reverts commit 46a8408fa123da571dc1c13dba9479ba6d540249.

Reverted https://github.com/pytorch/pytorch/pull/117218 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))
2024-01-17 18:42:39 +00:00
29fa6fbc4e [Dynamo] Fix a corner case of reinplace_inplaceable_ops pass for triton kernels (#117612)
Summary:
We saw the following failure when compiling custom triton kernels:
```
RuntimeError: Argument 'getitem_22' of Node 'triton_kernel_wrapper_functional_proxy_3' was used before it has been defined! Please check that Nodes in the graph are topologically ordered
```
The root cause is that, when doing the replacement, the replacement is itself replaced by another replacement. The fix keeps following the replacement chain until it reaches a node that is not replaced, as sketched below.
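
A conceptual sketch of the fix (names are illustrative, not the actual pass code):

```python
def resolve_replacement(node, replacements):
    # follow the chain until we reach a node that is not itself replaced
    while node in replacements:
        node = replacements[node]
    return node
```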

Test Plan:

Added a test case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117612
Approved by: https://github.com/aakhundov
2024-01-17 18:41:42 +00:00
e94b79f627 Revert "[codemod] markDynamoStrictTest batch 17 (#117219)"
This reverts commit 5bb2298da769121421711504da47955d3129b54f.

Reverted https://github.com/pytorch/pytorch/pull/117219 on behalf of https://github.com/PaliC due to sadly I need to revert these in order to revert https://github.com/pytorch/pytorch/pull/117500 ([comment](https://github.com/pytorch/pytorch/pull/117219#issuecomment-1896407436))
2024-01-17 18:35:56 +00:00
8483f493af Revert "[codemod] markDynamoStrictTest batch 18 (#117604)"
This reverts commit 70b22be32a2e6a1a51cb70a1418d73bfba533cc0.

Reverted https://github.com/pytorch/pytorch/pull/117604 on behalf of https://github.com/PaliC due to sadly I need to revert these in order to revert https://github.com/pytorch/pytorch/pull/117500 ([comment](https://github.com/pytorch/pytorch/pull/117219#issuecomment-1896407436))
2024-01-17 18:35:56 +00:00
0bfd9653ef Revert "[codemod] markDynamoStrictTest batch 19 (#117605)"
This reverts commit 45d7859e751dff2096df8b346226b71cf6031424.

Reverted https://github.com/pytorch/pytorch/pull/117605 on behalf of https://github.com/PaliC due to sadly I need to revert these in order to revert https://github.com/pytorch/pytorch/pull/117500 ([comment](https://github.com/pytorch/pytorch/pull/117219#issuecomment-1896407436))
2024-01-17 18:35:56 +00:00
d51583b214 Revert "[codemod] markDynamoStrictTest batch 20 (#117606)"
This reverts commit ab847a2f5c903c629f4e2ab9bfea11f7edc1cf0e.

Reverted https://github.com/pytorch/pytorch/pull/117606 on behalf of https://github.com/PaliC due to sadly I need to revert these in order to revert https://github.com/pytorch/pytorch/pull/117500 ([comment](https://github.com/pytorch/pytorch/pull/117219#issuecomment-1896407436))
2024-01-17 18:35:56 +00:00
06dab05405 Revert "[export] Error on not pytree-flattened nodes (#117598)"
This reverts commit 35e847830511b2c700586d312177794be094d67e.

Reverted https://github.com/pytorch/pytorch/pull/117598 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it is failing ONNX test in trunk 35e8478305, probably a landrace as the PR signal looks fine ([comment](https://github.com/pytorch/pytorch/pull/117598#issuecomment-1896389009))
2024-01-17 18:29:04 +00:00
d0fc268918 Fixed issue in upsample_nearestnd lowering with scales (#117538)
Fixed #116848

Related to the bug introduced in my previous PR here: https://github.com/pytorch/pytorch/pull/113749/files#diff-a1b077971cddfabfa0071c5162265066e867bc07721816d95b9cbe58431c38e3R3264

Originally, the code was
```python
def upsample_nearestnd(
    x,
    output_size,
    scales_x: Tuple[Optional[float], ...],
    n: int = 2,
    exact: bool = False,
):
   # ...
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales):
        if scale:
            scales[i] = scale
```
which is wrong, as `scales_x` is not used even though it can be provided by the user. The code worked for cases where the user-provided scale value can be recomputed from the input/output sizes, e.g. scale=2.0. However, it would fail if the input scale is a float value such as 2.3: in this case the recomputed scale is slightly different (e.g. 2.292682926829268, depending on the input and output sizes) and can lead to inconsistent output.
This problem was "fixed" as follows in my previous PR: https://github.com/pytorch/pytorch/pull/113749
```python
def upsample_nearestnd(
    x,
    output_size,
    scales_x: Tuple[Optional[float], ...],
    n: int = 2,
    exact: bool = False,
):
   # ...
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales_x):
        if scale:
            scales[i] = scale
```
however, this stores a wrong scale value, since the user-provided scale should be inverted (1 / scale) to match the recomputed input/output ratio; see the sketch below.
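
A hedged sketch of the corrected handling, continuing the excerpt above (an explicit user scale overrides the recomputed ratio and is stored as its reciprocal):

```python
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales_x):
        if scale:
            scales[i] = 1.0 / scale  # invert the user-provided scale to match i/o
```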

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117538
Approved by: https://github.com/peterbell10
2024-01-17 18:14:35 +00:00
ab847a2f5c [codemod] markDynamoStrictTest batch 20 (#117606)
[codemod] markDynamoStrictTest test_tensorexpr_pybind
[codemod] markDynamoStrictTest test_tensorexpr
[codemod] markDynamoStrictTest test_jit_llga_fuser
[codemod] markDynamoStrictTest test_jit_fuser_te
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117606
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219, #117604, #117605
2024-01-17 17:43:27 +00:00
45d7859e75 [codemod] markDynamoStrictTest batch 19 (#117605)
[codemod] markDynamoStrictTest export/test_verifier
[codemod] markDynamoStrictTest export/test_upgrade
[codemod] markDynamoStrictTest export/test_unflatten
[codemod] markDynamoStrictTest export/test_serialize
[codemod] markDynamoStrictTest export/test_serdes
[codemod] markDynamoStrictTest export/test_retraceability
[codemod] markDynamoStrictTest export/test_passes
[codemod] markDynamoStrictTest export/test_pass_infra
[codemod] markDynamoStrictTest export/test_functionalized_assertions
[codemod] markDynamoStrictTest export/test_export_nonstrict
[codemod] markDynamoStrictTest export/test_export
[codemod] markDynamoStrictTest export/test_experimental
[codemod] markDynamoStrictTest export/test_db
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117605
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219, #117604
2024-01-17 17:43:27 +00:00
70b22be32a [codemod] markDynamoStrictTest batch 18 (#117604)
[codemod] markDynamoStrictTest functorch/test_vmap
[codemod] markDynamoStrictTest profiler/test_profiler_tree
[codemod] markDynamoStrictTest profiler/test_profiler
[codemod] markDynamoStrictTest profiler/test_memory_profiler
[codemod] markDynamoStrictTest functorch/test_ops
[codemod] markDynamoStrictTest functorch/test_aotdispatch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117604
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219
2024-01-17 17:43:17 +00:00
6d1406d177 [oidc] Migrate Triton wheel upload to oidc (#117648)
Fix for triton upload job that is currently failing:
https://github.com/pytorch/pytorch/actions/runs/7555471235/job/20574022304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117648
Approved by: https://github.com/DanilBaibak, https://github.com/jeanschmidt, https://github.com/malfet
2024-01-17 17:04:36 +00:00
35e8478305 [export] Error on not pytree-flattened nodes (#117598)
Attempts to make the input/output mismatch error better by first checking whether the inputs/outputs can be pytree-flattened into supported types (tensors, symints, ...). So if the user passes in a data structure which does not have a pytree flatten registration, this will error with the message "It looks like one of the inputs is with type CustomType is not supported or pytree flatten-able.... please register a pytree flatten/unflatten function using the pytree.register_pytree_node API".

The check inside of produce_matching should now only error if something unexpected happens (dynamo accidentally adds an input or removes an output), and should be considered an internal error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117598
Approved by: https://github.com/avikchaudhuri
2024-01-17 16:33:57 +00:00
40a6710ad3 Mark set_ as an inplace view op (#115769)
Summary: To be used in https://github.com/pytorch/pytorch/pull/113873. Since set_ is effectively an inplace view op, we'll need to skip caching it.

Test Plan: Built pytorch; specifically this step: `/home/slarsen/local/miniconda3/envs/pytorch-3.10/bin/python -m torchgen.gen --source-path /home/slarsen/local/pytorch/cmake/../aten/src/ATen --install_dir /home/slarsen/local/pytorch/build/aten/src/ATen --per-operator-headers --generate sources --output-dependencies /home/slarsen/local/pytorch/build/aten/src/ATen/generated_sources.cmake`

Differential Revision: [D52814561](https://our.internmc.facebook.com/intern/diff/D52814561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115769
Approved by: https://github.com/bdhirsh
2024-01-17 15:32:18 +00:00
5bb2298da7 [codemod] markDynamoStrictTest batch 17 (#117219)
[codemod] markDynamoStrictTest test_xnnpack_integration
[codemod] markDynamoStrictTest test_vulkan
[codemod] markDynamoStrictTest test_package
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117219
Approved by: https://github.com/bdhirsh
2024-01-17 14:41:07 +00:00
3bb8d2b905 Update triton ROCm version to 6.0 (#117433)
Related to PyTorch nightly wheels upgrade to ROCm6.0: https://github.com/pytorch/pytorch/pull/116983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117433
Approved by: https://github.com/malfet, https://github.com/jeffdaily
2024-01-17 12:09:45 +00:00
e2830e6328 [PyTorch] SDPA decomp: actually use attn_mask (#117579)
Summary: Need to pass this along

Test Plan:
```
cd ~/fbsource/fbcode/executorch/backends/xnnpack/test
buck test fbcode//mode/dev-nosan :test_xnnpack_ops -- test_fp32_sdpa
buck run fbcode//mode/dev-nosan :test_xnnpack_models -- executorch.backends.xnnpack.test.models.llama2_et_example.TestLlama2ETExample.test_fp32
```

Reviewed By: larryliu0820

Differential Revision: D52812369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117579
Approved by: https://github.com/larryliu0820
2024-01-17 10:26:43 +00:00
1deb75b584 [c10d] Move the timeout dump check from watchdog to monitoring thread (#117168)
To avoid a potential hang in the watchdog thread, which would prevent us from dumping timeout debugging info, we move the check for global collective timeout signals and the dumping of debugging info to the monitoring thread. We also need to ensure that we don't wait too long between checks of the timeout signal from the store; otherwise, we will miss the signal and the debugging info won't get dumped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117168
Approved by: https://github.com/wconstab
2024-01-17 08:05:40 +00:00
ed6006ee5d [Reland][ONNX] Guard xfail tests with error messages (#117592)
Reland #117425

Prior to this PR, xfail tests provided neither (1) a guarantee of the error message/reason (it could be outdated), nor (2) execution of the test (xfail_if_model_type_is_not_exportedprogram). Therefore, tests labeled xfail were less robust, as we couldn't be sure whether they were still failing for the same reason, or even still failing at all. This PR fixes the issue with a try/except and error-message matching to consolidate the xfail truth and reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117592
Approved by: https://github.com/BowenBao
2024-01-17 08:05:35 +00:00
suo
9448065061 [pytree] add key path api (#116786)
This PR introduces a key path API to pytrees, drawing direct inspiration from JAX's [key path API](https://jax.readthedocs.io/en/latest/jax-101/05.1-pytrees.html#key-paths).

I added the 3 APIs described there, and a registry of `flatten_with_keys` fns for each node type, which is a version of `flatten` that also returns `KeyEntry`s describing how to access values from the original pytree.

Current use cases for this API:
- Folks would like to do argument traversal over input pytrees to do verification and compatibility enforcement. Keypaths are useful for this—https://fburl.com/code/06p7zrvr is a handrolled pass doing basically the same thing but probably more fragilely.
- In export non-strict mode, we need to figure out a way to track sources for pytree inputs. In strict mode, dynamo handles this for us, but we'd like a decoupled component to handle this when we're not using dynamo.

I'm sure there are places it would be useful.

Some design notes:
- I only implemented the API for  the Python pytree impl. optree has some differences in how their keypath APIs are designed (see https://github.com/pytorch/pytorch/issues/113378 for discussion). I have some issues with the proposed typed_path solution in that discussion and prefer JAX's API, but we can hash that out separately.
- The way folks register a `flatten_with_keys` fn is through a new kwarg to `register_pytree_node`. This follows how we do serialization fns, although the list of additional arguments is getting unwieldy.
- My impl handles pytrees with an undefined `flatten_with_keys` fn differently from JAX: I raise an error, while JAX creates a fallback keyentry.
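
A hedged usage sketch, assuming the three APIs mirror JAX's names (`tree_flatten_with_path`, `keystr`); treat the exact names as assumptions:

```python
import torch.utils._pytree as pytree

tree = {"weights": [1.0, 2.0], "bias": 3.0}
flat, spec = pytree.tree_flatten_with_path(tree)
for keypath, leaf in flat:
    print(pytree.keystr(keypath), leaf)  # key path string plus the leaf value
```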

Differential Revision: [D52547850](https://our.internmc.facebook.com/intern/diff/D52547850/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116786
Approved by: https://github.com/voznesenskym
2024-01-17 07:24:35 +00:00
5667a990fd Chore: improve log message about cache size limit exceeded (#116557)
Fixes #114527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116557
Approved by: https://github.com/ezyang
2024-01-17 06:07:18 +00:00
3cd2c68fbe Fix syntax highlighting in android (#117439)
Hi, I have found that code blocks are not highlighted properly.

This PR aims to fix that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117439
Approved by: https://github.com/ezyang
2024-01-17 05:17:13 +00:00
735715e6d3 [Dynamo] Make "profiler function will be ignored" warn only once (#117585)
Fix #111632

#111622 accidentally reverted #111921; we should bring it back.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117585
Approved by: https://github.com/williamwen42, https://github.com/mlazos, https://github.com/msaroufim
2024-01-17 04:05:45 +00:00
2c5488d719 Match all_gather_into_tensor args names in remapping (#117224)
Fixes #114179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117224
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2024-01-17 03:50:29 +00:00
8f1bc876b2 [quant] Support custom qmin/qmax for activation and weight for xnnpack quantizer (#117305)
Summary:
As titled, this allows us to experiment with 4-bit quant in XNNPACK.

Test Plan:
python test/test_quantization.py -k test_dynamic_linear_int4_weight

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117305
Approved by: https://github.com/digantdesai
2024-01-17 03:22:49 +00:00
e4c2dfb35b [Dynamo, ONNX] Run llama attention with onnxrt and dynamic shapes (#117009)
As title. This PR enables dynamic shapes for running llama with ORT. Both forward and backward are captured as a single graph with this PR.

Summary of changes:
- Test llama attention, llama decoder, llama model to ensure (1) no graph breaks (2) models exported with dynamic shapes with onnxrt dynamo backend
- Reshape SymInt to tensor with shape (1,) to align with the cast done for int in fx_onnx_interpreter.py
- Create an util function to map Python types (e.g., float) to ONNX tensor element type (e.g., onnx.TensorProto.FLOAT).
- Return `hint` for torch.Sym* in type promotion pass.
- Remove _replace_to_copy_with_to since exporter supports aten::_to_copy it now.
- Modify _get_onnx_devices to return CPU device for torch.Sym*.
- Introduce _adjust_scalar_from_fx_to_onnx (e.g., change 0 to tensor(0)) and _adjust_scalar_from_onnx_to_fx (e.g., change tensor(0) to 0) for adjusting scalars when passing values to and receiving values from ORT.
- Now, ValueInfoProto of graph inputs (i.e., input_value_infos) are stored and used as `ORT-expected type` when calling `_adjust_scalar_from_fx_to_onnx`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117009
Approved by: https://github.com/titaiwangms
2024-01-17 03:02:41 +00:00
fb06ed36d1 Change dynamo_test_failures.py to silently run skipped tests (#117401)
- We silently run skipped tests and then raise a skip message with the
  error message (if any)
- Instead of raising expectedFailure, we raise a skip message with the
  error message (if any)

We log the skip messages in CI, so this will let us read the logs and do
some basic triaging of the failure messages.

Test Plan:
- existing tests. I hope that there are no tests that cause each other
  to fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117401
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117391, #117400
2024-01-17 02:48:19 +00:00
9056c7d941 use getPinnedMemoryAllocator for privateuseone (#117530)
Fixes #117482

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117530
Approved by: https://github.com/ezyang
2024-01-17 02:33:02 +00:00
8852bb561c More efficient multi-threading in Softmax & LogSoftmax CPU kernels (#116367)
### Summary
In #85398, while fixing a bug (which was _not caused by, but was exposed by_ AVX512 implementation) in `_vec_logsoftmax_lastdim`, I had made some revisions to use more threads in some cases, but was asked to roll back [those changes](https://github.com/pytorch/pytorch/pull/85398#discussion_r1087680237) during the PR's review.
At the time, landing that PR ASAP seemed essential, so I agreed to roll back that change.

In some cases, more threads can be used than the current approach allows.
<strike>In this PR, I'm reintroducing those changes, which are geared towards more efficient multi-threading.</strike>
On second thought, even for other softmax kernels besides `_vec_log_softmax_lastdim` and `_vec_softmax_lastdim`, we could simply use a `grain_size` of 0 or 1 instead of complicating the code, because `CHUNK_SIZE` for each thread is already computed per a heuristic. With a `grain_size` of `0`, the work would be distributed equitably among the OpenMP threads (which stay constant in number unless explicitly changed, since we don't use the OpenMP `num_threads` clause in PyTorch), yielding a similar speedup to the approach in the first commit of this PR.
I've also added op-level benchmarks pertaining to example input shapes in this PR.

### Benchmarks

Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen, formerly codenamed Sapphire Rapids)
One socket of 48 physical cores was used, with & without HyperThreading.
Intel OpenMP & tcmalloc were preloaded.

Softmax benchmarks can be run with the following command, but the relevant benchmarks are the last dim ones -
`KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 python -m pt.softmax_test --tag-filter all`

#### Already existing benchmarks
|Benchmark name (dim is 1, by default) | Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup Percentage = (old-new)*100/old | Speedup ratio (old/new)|
|-------------|--------|-------|----------------------------|----------|
|Softmax_N1_C3_H256_W256_cpu|31.364|11.594|63.03%  |2.705|
|Softmax_N4_C3_H256_W256_cpu|34.475|24.966| 27.58%|1.380|
|Softmax_N8_C3_H512_W256_cpu|94.044|78.372|16.66%|1.199|
|Softmax2d_N8_C3_H512_W256_cpu|100.195|79.529|20.62%|1.259|

#### Some of the following benchmarks are being added in this PR
|Benchmark name| Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup percentage = (old-new)*100/old| Speedup ratio  (old/new) |
|-------------|--------|-------|----------------------------|--------------------|
|LogSoftmax_M128_N128_dim1_cpu|7.629|6.475|15.12%| 1.178|
|LogSoftmax_M48_N128_dim1_cpu|6.848|5.969|12.83%| 1.147|
|LogSoftmax_M16_N1024_dim1_cpu|7.004|6.322|9.73%| 1.107|
|LogSoftmax_M32_N1024_dim1_cpu|7.037|6.558|6.80%| 1.073|
|LogSoftmax_M48_N1024_dim1_cpu|7.155|6.773|5.33%|1.056|
|LogSoftmax_M16_N512_dim1_cpu|6.797|5.862|13.75%|1.159|
|LogSoftmax_M32_N512_dim1_cpu|7.223|6.202|14.13%|1.164|
|LogSoftmax_M48_N512_dim1_cpu|7.159|6.301|11.98%|1.136|
|LogSoftmax_M16_N256_dim1_cpu|6.842|5.682|16.95%|1.204|
|LogSoftmax_M32_N256_dim1_cpu|6.840|6.086|11.02%|1.123|
|LogSoftmax_M48_N256_dim1_cpu|7.005|6.031|13.94%|1.161|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116367
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-17 02:26:29 +00:00
4a54ab328c Removed an internal assertion for the optional stable value and instead defaulted to the standard (=false) (#117414)

Fixes #117255.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117414
Approved by: https://github.com/ezyang
2024-01-17 02:25:21 +00:00
1872834247 [MPS] Fix torch.mm correctness for large matrices (#117549)
Currently `matrixMultiplicationWithPrimaryTensor:secondaryTensor:` returns incorrect results if one of the matrix dimensions is greater than 32K
Solve it by providing a very naive matrix multiplication Metal shader and calling it if the stride size is greater than 32768 elements. Slicing inside the MPSGraph doesn't work either, since `-sliceTensor:starts:ends:strides:` somehow affects matmul as well when tiling is done as follows:
```objc
  NSMutableArray<MPSGraphTensor*>* rows = [NSMutableArray new];
  for (int64_t i = 0; i < M; i += tile_size) {
    const auto i_end = std::min(i + tile_size, M);
    NSMutableArray<MPSGraphTensor*>* row_chunks = [NSMutableArray new];
    for (int64_t j = 0; j < K; j += tile_size) {
      const auto j_end = std::min(j + tile_size, K);
      MPSGraphTensor* tile = nil;
      for (int64_t k = 0; k < N; k += tile_size) {
        const auto k_end = std::min(k + tile_size, N);
        auto selfChunk = [graph sliceTensor:selfTensor
                                     starts:@[ @(i), @(k) ]
                                       ends:@[ @(i_end), @(k_end) ]
                                    strides:@[ @(1), @(1) ]
                                       name:nil];
        auto otherChunk = [graph sliceTensor:otherTensor
                                      starts:@[ @(k), @(j) ]
                                        ends:@[ @(k_end), @(j_end) ]
                                     strides:@[ @(1), @(1) ]
                                        name:nil];
        auto chunkMM = [graph matrixMultiplicationWithPrimaryTensor:selfChunk secondaryTensor:otherChunk name:nil];

        tile = tile ? [graph additionWithPrimaryTensor:tile secondaryTensor:chunkMM name:nil] : chunkMM;
      }
      [row_chunks addObject:tile];
    }
    auto row = row_chunks.count > 1 ? [graph concatTensors:row_chunks dimension:1 name:nil] : row_chunks.firstObject;
    [rows addObject:row];
  }
  return rows.count > 1 ? [graph concatTensors:rows dimension:0 name:nil] : rows.firstObject;
```

One can always use the Metal MM by defining the `PYTORCH_MPS_PREFER_METAL` environment variable
Fixes https://github.com/pytorch/pytorch/issues/116769
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117549
Approved by: https://github.com/kulinseth
2024-01-17 01:33:08 +00:00
f518cf811d [DCP] Adds support for meta tensor loading for DCP.load_state_dict() (#113319)
Currently, DCP requires the `model.state_dict()` to be materialized before passing it to DCP to load, since DCP uses the pre-allocated storage from the initialized model state_dict. Therefore, even for fine-tuning and distributed inference, users would need to explicitly materialize the model on GPU before `DCP.load_state_dict()`.

Today's flow:
```
with torch.device("meta"):
    model2 = parallelize_module(
        MLPModule("meta"), tp_mesh, parallelize_plan=parallelize_plan
    )

model2.to_empty(device='cuda')
state_dict_to_load = model2.state_dict()
DCP.load_state_dict(
    state_dict=state_dict_to_load,
    storage_reader=DCP.FileSystemReader(CHECKPOINT_DIR),
)
model2.load_state_dict(state_dict_to_load)
```

This PR adds support for meta tensor loading. In DCP's planner, when encountering tensors/DTensor on meta device, we initialize tensor/DTensor on the current device on the fly and replace the tensor/DTensor on meta device in the state_dict.  After the change, users no longer needs to manually call `model.to_empty()` when loading existing checkpoints for fine-tuning and distributed inference.

Updated user flow:
```
with torch.device("meta"):
    model2 = parallelize_module(
        MLPModule("meta"), tp_mesh, parallelize_plan=parallelize_plan
    )
# no longer need to call model.to_empty(device='cuda')
state_dict_to_load = model2.state_dict()
DCP.load_state_dict(
    state_dict=state_dict_to_load,
    storage_reader=DCP.FileSystemReader(CHECKPOINT_DIR),
)
model2.load_state_dict(state_dict_to_load, assign=True)
```

Note that for distributed training, it's still the users' responsibility to reset the parameters (`model.reset_parameters()`) as checkpoint might not exist.

Note that we need to loop thru the state_dict to replace meta tensor/DTensor instead of calling `model.to_empty()` since `DCP.load()` only takes in state_dict but not model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113319
Approved by: https://github.com/fegin, https://github.com/LucasLLC
2024-01-17 00:23:29 +00:00
4a44a3c76d update kineto submodule (#114297)
Rework roctracer shutdown flushing

9365c1aa09

This fixes flaky unit tests that use kineto to verify certain kernels have executed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114297
Approved by: https://github.com/malfet, https://github.com/atalman
2024-01-17 00:17:03 +00:00
cf470e7b59 Migrate update-commit-hash to test-infra (#117506)
After https://github.com/pytorch/test-infra/pull/4885, the GHA is now reusable on `test-infra`.  This tests the change and we can also land it after https://github.com/pytorch/test-infra/pull/4885 lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117506
Approved by: https://github.com/malfet, https://github.com/atalman
2024-01-17 00:15:04 +00:00
1d14adfa66 [mta] Fused SGD (#116585)
depends on #116583

rel:
- #94791

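A hypothetical usage sketch, assuming this lands a `fused=True` flag on `torch.optim.SGD` (mirroring the existing fused Adam/AdamW optimizers); requires a CUDA build:
```python
import torch

model = torch.nn.Linear(8, 8, device="cuda")
# `fused=True` is the assumed new flag: parameter updates run in fused multi-tensor kernels.
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, fused=True)

loss = model(torch.randn(4, 8, device="cuda")).sum()
loss.backward()
opt.step()
opt.zero_grad()
```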
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116585
Approved by: https://github.com/janeyx99
2024-01-16 23:54:38 +00:00
5aac95c713 Introduce slice_inverse() op (#117041)
Introduces a new op `slice_inverse()`. This is used in the reverse view_func for slice and several other ops (e.g. `split_with_sizes`, `chunk`). It's implemented behind the scenes by a call to `as_strided()`, but it's easier for subclasses to implement the more limited `slice_inverse()` than the full `as_strided()`. This PR:
* Introduces the op itself
* Updates all relevant functional inverses to call `slice_inverse()` instead of `as_strided()` directly
* Makes codegen changes to allow `slice_scatter()` to be the copy variant for `slice_inverse()`
    * Need to avoid view_copy codegen (assumes if view name ends in inverse, we don't need to gen one, which is possibly a bad assumption)

@albanD / @soulitzer / @bdhirsh: I'm most interested in your thoughts on the codegen changes and whether this is the right way to go.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117041
Approved by: https://github.com/bdhirsh
2024-01-16 23:44:54 +00:00
f6767244cf Added meta function for _upsample_bicubic2d_aa (#117347)
This should fix remaining errors with Resize op in torchvision: https://github.com/pytorch/vision/actions/runs/7298953575?pr=8127
```
/opt/conda/envs/ci/lib/python3.8/site-packages/torch/nn/functional.py:4072: in interpolate
    return torch._C._nn._upsample_bicubic2d_aa(input, output_size, align_corners, scale_factors)
E   torch._dynamo.exc.TorchRuntimeError: Failed running call_function <function interpolate at 0x7f4443fe00d0>(*(FakeTensor(..., size=(1, s0, s1, s2)),), **{'size': [s4, floor(s3*s4/floor(s1*s3/s2))], 'mode': 'bicubic', 'align_corners': False, 'antialias': True}):
E   aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:5567: SymIntArrayRef expected to contain only concrete integers
E
E   from user code:
E      File "/pytorch/vision/torchvision/transforms/v2/functional/_geometry.py", line 260, in resize_image
E       image = interpolate(
E
E   Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
E
E
E   You can suppress this exception and fall back to eager by setting:
E       import torch._dynamo
E       torch._dynamo.config.suppress_errors = True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117347
Approved by: https://github.com/peterbell10
2024-01-16 23:33:55 +00:00
b1c3f9f1b9 Fix missing mkl-dnn include paths (#117492)
Fixes #91968 and #100960
This commit fixes missing  include paths by linking `caffe2_pybind11_state_gpu` against `caffe2::mkldnn`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117492
Approved by: https://github.com/ezyang
2024-01-16 23:28:17 +00:00
46a8408fa1 [codemod] markDynamoStrictTest batch 16 (#117218)
[codemod] markDynamoStrictTest test_public_bindings
[codemod] markDynamoStrictTest test_package
[codemod] markDynamoStrictTest test_legacy_vmap
[codemod] markDynamoStrictTest test_namedtensor
[codemod] markDynamoStrictTest test_fx
[codemod] markDynamoStrictTest test_dataloader
[codemod] markDynamoStrictTest test_content_store
[codemod] markDynamoStrictTest test_schema_check
[codemod] markDynamoStrictTest lazy/test_ts_opinfo
[codemod] markDynamoStrictTest functorch/test_vmap_registrations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117218
Approved by: https://github.com/bdhirsh, https://github.com/voznesenskym
ghstack dependencies: #117553
2024-01-16 23:04:31 +00:00
ab6207a342 Add inductor-specific testing strict mode denylist (#117553)
We have one for Dynamo that currently applies to all "compile"
configurations (PYTORCH_TEST_WITH_DYNAMO, PYTORCH_TEST_WITH_INDUCTOR). I
don't want to figure out the inductor situation right now, so we're
going to add another denylist for inductor and work through it later.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117553
Approved by: https://github.com/voznesenskym
2024-01-16 23:04:31 +00:00
5e0e78585d Ordering placeholder and get_attr nodes in unflattened module (#116910)
Previous to this PR, the generated unflattened module could mix up the order of `placeholder` and newly created `get_attr`. As `placeholder` is the input of a function, it should be placed ahead of `get_attr` nodes.

Before:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
get_attr       bias         bias                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
placeholder    l_x_         l_x_                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

After:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
placeholder    l_x_         l_x_                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
get_attr       bias         bias                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116910
Approved by: https://github.com/tugsbayasgalan
2024-01-16 22:58:37 +00:00
4ec667cc64 Revert "[ONNX] Guard xfail tests with error messages (#117425)"
This reverts commit 1993956da33376f34125306209930ed00c486abd.

Reverted https://github.com/pytorch/pytorch/pull/117425 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing in trunk 1993956da3 ([comment](https://github.com/pytorch/pytorch/pull/117425#issuecomment-1894650769))
2024-01-16 22:56:35 +00:00
3a52147cc5 [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
2024-01-16 22:30:04 +00:00
2a3fb7dbb6 [ROCm] Fix NHWC related tests in test_inductor_freezing (#117158)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117158
Approved by: https://github.com/eellison, https://github.com/pruthvistony
2024-01-16 20:48:49 +00:00
4712c7dac8 [inductor] add C-shim for index_put (#116667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116667
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2024-01-16 20:29:14 +00:00
3e8c8ce37b Update Reviewers for PT-D team (#117409)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117409
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/fduwjj
2024-01-16 19:40:41 +00:00
1993956da3 [ONNX] Guard xfail tests with error messages (#117425)
Previous to this PR, xfail tests didn't provide (1) guarantee of error message/reason (**could be outdated**), and (2) execution of the test (`xfail_if_model_type_is_not_exportedprogram`). Therefore, the tests are less robust with xfail labeled, as we can't be sure if it's still failing with the same reason, and if it's even still failing. This PR fixes the issue with try/except with error message matching to consolidate the xfail truth and reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117425
Approved by: https://github.com/thiagocrepaldi
2024-01-16 19:33:51 +00:00
28be47c267 [RELAND][export] Exempt autograd ops for predispatch export (#117448)
Summary: Reland of https://github.com/pytorch/pytorch/pull/116527/files

Test Plan: CI

Differential Revision: D52675324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117448
Approved by: https://github.com/ydwu4
2024-01-16 19:32:15 +00:00
99e54744f7 Fix ExecuTorch pinned commit update failure (#117518)
https://github.com/pytorch/pytorch/pull/117003 shows an interesting failure in which building the ExecuTorch runner fails because it needs the change from https://github.com/pytorch/pytorch/pull/117378.  This reveals a chicken-and-egg bug in the job setup where building the ExecuTorch runner depends on PyTorch and thus couldn't be part of the Docker image build where PyTorch is not yet available.  The failure happens because an outdated version of PyTorch is there on the Docker image.

So, like vision and audio, the step to build ExecuTorch runner needs to be done during test time.

I also fix the installation of vision and audio in ET job because they are now installed using PyTorch pinned commits as usual after https://github.com/pytorch/executorch/pull/1247
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117518
Approved by: https://github.com/larryliu0820, https://github.com/malfet
2024-01-16 18:25:15 +00:00
c30346db0e Check in some torch.compile helper scripts (#117400)
- passrate.py: compute the pass rate
- update_failures.py: update `dynamo_test_failures.py`

Both of these scripts require you to download the test results from CI
locally. Maybe we can automate this more in the future. Checking these
in for now, with no tests :P.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117400
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117391
2024-01-16 17:14:43 +00:00
a7a2773567 Check invariants for dynamo_test_failures.py (#117391)
Test that:
- the xfail list and the skip list don't intersect
- the test names look sane
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117391
Approved by: https://github.com/voznesenskym
2024-01-16 17:14:43 +00:00
29516bd2a0 add _amp_foreach_non_finite_check_and_unscale_cpu_ and _amp_update_scale_cpu_ kernels on CPU (#109281)
Step1 of https://github.com/pytorch/pytorch/issues/111559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109281
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-16 15:25:08 +00:00
0fa6ee44d9 [CI] Skip lib for xpu binary unit test (#117514)
Skip .so and .a libraries under build/bin/ for test_xpu_bin in CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117514
Approved by: https://github.com/malfet
2024-01-16 12:07:15 +00:00
13473df0d7 [MPS] Make addmm support empty matmul (#117223)
Refactor common part between `mm_out_mps` and `addmm_out_mps` into `do_mm` static function.
Change the input placeholder initialization logic so that `addmm` can handle matrix multiplication with an empty dimension.
Add tests for `mm`+`addmm` with empty tensors to OpInfo but skip addmm with empty matrices from onnx tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117223
Approved by: https://github.com/albanD
2024-01-16 06:46:20 +00:00
28bb31e4a5 [Dynamo] Trace autograd.function in dynamo when inputs require grad (#116358) (#116897)
For training graphs (when inputs require grad), previously we would speculate the forward and backward graph to determine if there are any graph breaks, side effects, etc., but would not actually use these speculated graphs. We would just insert a call_function node on the graph and later rely on autograd's tracing.

This approach does not work for more generalized graphs, like graphs that include user-defined triton kernels, because autograd is not able to do the higher-order function conversion.

This PR speculates the forward and backward functions and emits them in a HOF that later gets used via templating mechanism.

While working on this PR, I have exposed some bugs in the current tracing due to trampoline functions losing the source information resulting in incorrect graphs being produced. I have fixed these source information bugs and killed the trampolines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116897
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/voznesenskym
2024-01-16 03:57:13 +00:00
f20eaadfef [vision hash update] update the pinned vision hash (#117509)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117509
Approved by: https://github.com/pytorchbot
2024-01-16 03:17:24 +00:00
ae3d7091cb [BE] Replace deprecated set_default_tensor_type (#117505)
Not sure what it was doing there, but replaced it with `set_default_dtype`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117505
Approved by: https://github.com/Skylion007
2024-01-16 02:32:49 +00:00
dd2cff1591 [Dynamo] Use isinstance rather than istype when check if python module type (#117022)
This is to fix an issue from a Meta internal use case, where a third-party ```DictConfig``` has a bug in [```__eq__```](fd730509ef/omegaconf/dictconfig.py (L596)) that triggers a Dynamo error because we use an ```obj in [x, y]``` check. Then I found we can use ```isinstance``` to cover all cases and remove these special cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117022
Approved by: https://github.com/ckluk2, https://github.com/jansel
2024-01-15 23:25:30 +00:00
bac0878780 Error if compiled nondeterministic backward called in deterministic mode (#114780)
Part of #113707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114780
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-01-15 22:45:40 +00:00
c1ab2777c0 Update state_dict.py to propagate cpu offload (#117453)
Update state_dict.py to propagate cpu offload. It looks like this flag is accidentally ignored?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117453
Approved by: https://github.com/Skylion007
2024-01-15 22:13:37 +00:00
1a57c18760 Fixed cuda grads for interpolate::trilinear on non-contig grad output (#117373)
Fixes #113642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117373
Approved by: https://github.com/lezcano
2024-01-15 18:05:47 +00:00
001585f446 [fx][inductor] Add statically_known_true utility for SymBool (#117359)
This adds a function `statically_known_true` for `SymBool` that works
like inductor's `is_expr_static_and_true`. That is, it tries to simplify the
expression to a constant or returns `False` if it cannot be simplified.

This is useful in cases that can be optimized if the condition is met; otherwise it doesn't affect correctness, so we can avoid adding guards.

I also use this new function in inductor for `FakeTensorUpdater` and
`remove_noop_pass` which both generated unexpected guards previously.

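A minimal sketch of the intended use, assuming the helper is exposed from `torch.fx.experimental.symbolic_shapes`:
```python
from torch.fx.experimental.symbolic_shapes import statically_known_true

def can_skip_copy(src, dst):
    # bool(src.shape[0] == dst.shape[0]) on symbolic sizes would install a guard;
    # statically_known_true returns True only when the expression can be proven
    # without guarding, and simply returns False otherwise (it never guards).
    return statically_known_true(src.shape[0] == dst.shape[0])
```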
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117359
Approved by: https://github.com/lezcano
2024-01-15 18:01:10 +00:00
661747c727 XPU, move oidc to top level workflow and use gha_workflow_s3_and_ecr_read_only policy (#117498)
1. oidc permissions need to be set on top level workflow
2. rename gha_workflow_s3_and_ecr_read_only to gha_workflow_s3_and_ecr_read_only policy which better reflects the policy usage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117498
Approved by: https://github.com/chuanqi129, https://github.com/huydhn
2024-01-15 17:46:20 +00:00
7a8013fbfa [inductor] Handle more edge cases in slice and slice_scatter (#117377)
Fixes #117110

When slicing we can end up with start and end which are out of bounds, which is
handled in python slicing by clamping to the correct bounds. There is also the
case where end < start which should result in an empty slice.

In the isoneutral_mixing failure we have the second case, with `start=2, end=0`
which in `slice_scatter` became `src_size[dim] = -2`.

This PR improves slice's edge case handling and factors the start and end
normalization code out so it can be shared with slice_scatter.

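An illustrative sketch (not the inductor code itself) of the Python slicing semantics the shared normalization has to reproduce:
```python
def normalize_slice(start, end, dim_size):
    # Negative indices count from the end; out-of-range values clamp to the
    # valid range; end < start yields an empty (length-0) slice.
    if start < 0:
        start += dim_size
    if end < 0:
        end += dim_size
    start = min(max(start, 0), dim_size)
    end = min(max(end, 0), dim_size)
    return start, max(end, start)

assert normalize_slice(2, 0, 4) == (2, 2)     # the isoneutral_mixing case: empty slice
assert normalize_slice(-1, 100, 4) == (3, 4)  # both ends clamped into bounds
```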
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117377
Approved by: https://github.com/lezcano
2024-01-15 17:05:48 +00:00
5c700f60a5 Properly preserve SymInt input invariant when splitting graphs (#117406)
Fixes https://github.com/pytorch/pytorch/issues/111636
Fixes https://github.com/pytorch/pytorch/issues/108877
Fixes https://github.com/pytorch/pytorch/issues/116956

Inductor has an invariant that every dynamic shape symbol s0, s1, etc. which is referenced by an input tensor must also be passed in explicitly as an argument. It has some capability of reverse engineering symbols if it's obvious how to get them (e.g., if you pass in `arg: f32[s0, 4]` it will know that it can retrieve `s0 = arg.size(0)`) but in full generality it is not always possible to derive this (e.g., if the only mention of s0 is in `arg2: f32[s0 + s1, 4]`).  However, the graph splitter used by optimize_ddp did not respect this invariant. This PR makes it respect it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117406
Approved by: https://github.com/wconstab
2024-01-15 15:04:57 +00:00
75818adcf7 Pyi doc inclusion + fix (#117267)
Reland of https://github.com/pytorch/pytorch/pull/114705 with extra fix to smoothly handle when the modules we're trying to load are not available (and thus the pyi won't contain the docs in this case).

Tested locally that it works properly in fbcode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117267
Approved by: https://github.com/ezyang
2024-01-15 13:06:53 +00:00
7a851fedc8 support torch.mm with conjugate transposed inputs (#117238)
Fix https://github.com/pytorch/pytorch/issues/116855.

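A minimal sketch of the now-supported pattern (illustrative shapes, not the test from the fix):
```python
import torch

a = torch.randn(3, 4, dtype=torch.complex64)
b = torch.randn(3, 5, dtype=torch.complex64)

out = torch.mm(a.conj().T, b)                      # conjugate-transposed view input
expected = torch.mm(a.conj().resolve_conj().T, b)  # materialized reference
torch.testing.assert_close(out, expected)
```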
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117238
Approved by: https://github.com/lezcano
2024-01-15 12:36:01 +00:00
41ffea2f99 Properly unwrap_storage tensors sent to DynamicScalar (#117444)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117444
Approved by: https://github.com/Skylion007
2024-01-15 12:15:04 +00:00
d9b265adaf modify the conditions as PythonModuleVariable (#116856)
## Motivation
The current code `value in [torch.backends.cudnn, torch.ops]` requires `value` to implement `__eq__`. If the value is a custom object that does not implement `__eq__`, dynamo throws an error. For example, for ConvolutionOpContext, the custom 'torch._C.ScriptClass' object registered in IPEX, dynamo throws the following error:

**torch._dynamo.exc.InternalTorchDynamoError: '__eq__' is not implemented for __torch__.torch.classes.ipex_prepack.ConvolutionOpContext**

I think this is a common issue. To avoid it, this PR replaces the current code `value in [torch.backends.cudnn, torch.ops]` with `isinstance(value, (torch.backends.cudnn.CudnnModule, torch._ops._Ops))`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116856
Approved by: https://github.com/jansel
2024-01-15 11:10:57 +00:00
d089bb1b72 [xla hash update] update the pinned xla hash (#117485)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117485
Approved by: https://github.com/pytorchbot
2024-01-15 10:33:18 +00:00
2b56d80460 [inductor][cpp] apply simplify_index_in_vec_range to vector store and vector transpose (#117263)
As the title, this PR extends the `simplify_index_in_vec_range` to store and transpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117263
Approved by: https://github.com/jansel
ghstack dependencies: #117221, #117260
2024-01-15 08:41:28 +00:00
3b00dd5843 [inductor][cpp] apply simplify_index_in_vec_range in select_tiling_indices to enable more contiguous vec load (#117260)
For the one of the kernels in the UT `test_vec_contiguous_ModularIndexing`:
Before:
```c++
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(28L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()})
                        #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()})
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(512L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 =
                            [&]
                            {
                                __at_align__ std::array<float, 16> tmpbuf;
                                #pragma GCC unroll 16
                                for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                {
                                    tmpbuf[x1_inner] = in_ptr0[static_cast<long>((128L*(c10::div_floor_integer(x2, 256L))) + (256L*x1) + (256L*x1_inner) + (7168L*(static_cast<long>(c10::div_floor_integer(x2, 128L)) % static_cast<long>(2L))) + (14336L*x0) + (static_cast<long>(x2) % static_cast<long>(128L)))];
                                }
                                return at::vec::Vectorized<float>::loadu(tmpbuf.data());
                            }
                            ()
                            ;
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
                        }
                        tmp_acc0_vec.mean.store(out_ptr0 + static_cast<long>(x1 + (28L*x0)));
                        tmp_acc0_vec.m2.store(out_ptr1 + static_cast<long>(x1 + (28L*x0)));
                    }
                }
                #pragma omp simd simdlen(8)
                for(long x1=static_cast<long>(16L); x1<static_cast<long>(28L); x1+=static_cast<long>(1L))
                {
                    {
                        #pragma omp declare reduction(    welford:Welford<float>:    omp_out = welford_combine(omp_out, omp_in))     initializer(omp_priv={Welford<float>()})
                        Welford<float> tmp_acc0 = Welford<float>();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(512L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr0[static_cast<long>((128L*(c10::div_floor_integer(x2, 256L))) + (256L*x1) + (7168L*(static_cast<long>(c10::div_floor_integer(x2, 128L)) % static_cast<long>(2L))) + (14336L*x0) + (static_cast<long>(x2) % static_cast<long>(128L)))];
                            tmp_acc0 = welford_combine(tmp_acc0, tmp0);
                        }
                        out_ptr0[static_cast<long>(x1 + (28L*x0))] = tmp_acc0.mean;
                        out_ptr1[static_cast<long>(x1 + (28L*x0))] = tmp_acc0.m2;
                    }
                }
```

After:
```c++
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(28L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(28L); x1+=static_cast<long>(1L))
                {
                    {
                        #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()})
                        #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()})
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(512L); x2+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>((128L*(c10::div_floor_integer(x2, 256L))) + (256L*x1) + (7168L*(static_cast<long>(c10::div_floor_integer(x2, 128L)) % static_cast<long>(2L))) + (14336L*x0) + (static_cast<long>(x2) % static_cast<long>(128L))));
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (28L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (28L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
```

This PR also further speeds up the model `swin_base_patch4_window7_224` from 1.25x to 1.28x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117260
Approved by: https://github.com/jansel
ghstack dependencies: #117221
2024-01-15 06:57:25 +00:00
3a0bcd2c12 [audio hash update] update the pinned audio hash (#117423)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117423
Approved by: https://github.com/pytorchbot
2024-01-15 05:50:51 +00:00
19502ff6aa Fixed typo in build_activation_images.py (#117458)
In line 24 of build_activation_images.py, I changed "programmaticly" to "programmatically" to be grammatically correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117458
Approved by: https://github.com/malfet
2024-01-15 03:27:40 +00:00
03c6f79548 [vision hash update] update the pinned vision hash (#117311)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117311
Approved by: https://github.com/pytorchbot
2024-01-15 03:15:20 +00:00
2200118f59 Enable some uint{16,32,64} tests that are working (#116809)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116809
Approved by: https://github.com/albanD
2024-01-15 02:25:21 +00:00
a298fba146 [MPS] Increase metal language support to 2.3 (#117472)
As Conda binaries are still built on MacOS 12, which renders MPS unusable after https://github.com/pytorch/pytorch/pull/116942

Test plan:
```
 % xcrun -sdk macosx metal --std=macos-metal2.3 -Wall -o Index Index.metal
 % xcrun -sdk macosx metal --std=macos-metal2.2 -Wall -o Index Index.metal
Index.metal:167:1: error: type 'const constant ulong3 *' is not valid for attribute 'buffer'
REGISTER_INDEX_OP_ALL_DTYPES(select);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Index.metal:159:5: note: expanded from macro 'REGISTER_INDEX_OP_ALL_DTYPES'
    REGISTER_INDEX_OP(8bit,  idx64, char,  INDEX_OP_TYPE, ulong3);    \
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
```

Fixes https://github.com/pytorch/pytorch/issues/117465

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117472
Approved by: https://github.com/xuzhao9
2024-01-15 01:16:52 +00:00
61a181e83c Report function name in stack trace annotations (#117459)
When working with internal flows, it can sometimes be ambiguous what
version of the code they are working with.  In this case, having the
function name available in the stack trace can help identify what you
are looking at.

Example now looks like:

```
[DEBUG]         # File: /data/users/ezyang/a/pytorch/a.py:5 in f, code: return x + x
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117459
Approved by: https://github.com/Skylion007
2024-01-15 00:29:13 +00:00
a6d33614d6 add float8 types to dtypes table (#117375)
Summary:

As titled

Test Plan:

CI

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117375
Approved by: https://github.com/ezyang
2024-01-15 00:23:07 +00:00
c3e2b94827 Realize non-ReinterpretView Views in custom Triton kernel args (#117468)
Summary: If any of the `TensorBox` arguments of a custom (user-written) Triton kernel in the graph is wrapped into a `BaseView` subclass which is not `ReinterpretView`, this currently conflicts with the cloning (which preserves RVs) and downstream processing (which needs a layout to mark mutation) of the input.

This PR adds conversion of the non-RV views to `ReinterpretView`s by realizing the corresponding inputs to the Triton kernel. As realization happens anyway before the Triton kernel call, this should not affect the perf. But it covers currently missed patterns in the internal models (see the unit test for a repro).

Test Plan:

```
$ python test/dynamo/test_triton_kernels.py -k test_triton_kernel_slice_and_view_input
...
----------------------------------------------------------------------
Ran 1 test in 3.909s

OK
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117468
Approved by: https://github.com/oulgen
2024-01-14 23:31:38 +00:00
62496ffd0d [dynamo][easy]: Add support for operator.truth (#117463)
* This is an old builtin function equivalent to the bool constructor. It is easy enough to add support for (see the sketch below).
* I also realized the tests were in the wrong class (the one reserved for testing default args) so I moved them.

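A small sketch of the kind of pattern this enables (illustrative only):
```python
import operator
import torch

def f(x, flag):
    # operator.truth(flag) is just the functional spelling of bool(flag)
    return x + 1 if operator.truth(flag) else x - 1

compiled = torch.compile(f)
print(compiled(torch.zeros(2), []))   # empty list is falsy  -> x - 1
print(compiled(torch.zeros(2), [1]))  # non-empty is truthy  -> x + 1
```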
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117463
Approved by: https://github.com/jansel
2024-01-14 19:08:31 +00:00
2748f05056 Add torch.fx.interpreter to uninteresting_files (#117460)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117460
Approved by: https://github.com/Skylion007
2024-01-14 18:35:21 +00:00
a1155883d4 Clean up Docker config on ROCm runner (#117432)
This fixes the issue on trunk where logging in to ECR on the ROCm runner fails.  During my test, it's also ok for the login part to fail with that `not implemented` error https://github.com/pytorch/pytorch/actions/runs/7516446579/job/20461801473, and pulling the image from ECR still works, so I set `continue-on-error: true` on the step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117432
Approved by: https://github.com/malfet
2024-01-14 18:27:09 +00:00
a76610e6fb [BE] Delete unused is_dynamo_compiling (#117455)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117455
Approved by: https://github.com/Skylion007, https://github.com/yanboliang
ghstack dependencies: #117451, #117452, #117454
2024-01-14 15:15:29 +00:00
347255809c Make c10::SymInt typecaster support scalar-like fake tensor (#117454)
We can use `__index__` to do this conversion because that will trigger a
guard on data dependent SymInt if the tensor is a fake tensor, but if
we fetch item directly and put it in the Scalar, we may still be able to
make it work out.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117454
Approved by: https://github.com/yanboliang
ghstack dependencies: #117451, #117452
2024-01-14 15:15:29 +00:00
796fe40a96 [BE] Delete unnecessary variable fastpath (#117452)
This fastpath is unnecessary because in the logic below we
do the same thing:

```
        auto& var = THPVariable_Unpack(obj);
        if (var.numel() != 1 ||
            !at::isIntegralType(
                var.dtype().toScalarType(), /*include_bool*/ true)) {
          throw_intlist_exception(this, i, obj, idx);
        }
        auto scalar = var.item();
        TORCH_CHECK(scalar.isIntegral(/*include bool*/ false));
        res.push_back(scalar.toSymInt())
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117452
Approved by: https://github.com/yanboliang
ghstack dependencies: #117451
2024-01-14 14:39:46 +00:00
220cf46c2a Always accept 0-d scalar tensors as int, even if __index__ fails (#117451)
Fixes https://github.com/pytorch/pytorch/issues/117288

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117451
Approved by: https://github.com/yanboliang
2024-01-14 14:39:46 +00:00
38c18f3825 [c10d] Add a timeout check interval variable for timeout dump (#117093)
The current timeout check frequency relies on the monitoring thread's timeout, which can be too long (even if we set it to 2 mins), so let's use a separate timeout variable that users can configure. And we only let the default PG check TCPStore, so even more frequent checks should be fine. (Our stress test is performed every half second.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117093
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-01-14 02:33:17 +00:00
003c900d5e Add _assert_scalar (#117378)
Peeled off from https://github.com/pytorch/pytorch/pull/114148, because that PR is going to take a while to actually land.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117378
Approved by: https://github.com/jansel
2024-01-14 00:50:36 +00:00
1a8545164a [export] Add unit test for SDPA export result (#117390)
Summary:

A follow up for #117097. In that PR I didn't add
`_scaled_dot_product_attention_for_cpu` into the core_aten_decomposition
table. This PR does that and also adds a unit test.

Test Plan: python test/export/test_export.py -k
test_scaled_dot_product_attention

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117390
Approved by: https://github.com/drisspg
2024-01-14 00:21:28 +00:00
bf27dd6df9 Add dynamo support for operator.abs (#117442)
Adds a test case for operator.abs and allows for constant folding with it. Partially addresses #116396.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117442
Approved by: https://github.com/jansel, https://github.com/malfet
2024-01-13 21:38:55 +00:00
1a790f5a61 [RELAND] Error grad mode op in export API (#117420)
Summary: Title

Test Plan: CI

Differential Revision: D52706691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117420
Approved by: https://github.com/angelayi
2024-01-13 21:36:29 +00:00
d6847c5977 [CI] Set correct permissions for auto_request_review (#117408)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117408
Approved by: https://github.com/izaitsevfb, https://github.com/atalman
2024-01-13 20:02:03 +00:00
53f3361319 [BE] Use nested namespaces for sparse (#117415)
C++17 is fu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117415
Approved by: https://github.com/Skylion007
2024-01-13 19:51:28 +00:00
d8bdb50379 [reland] pass shape/stride during tensor unflatten (#117340)
Reland of https://github.com/pytorch/pytorch/pull/113547, as the previous PR was reverted because of a torch.compile symbolic shape issue. Since we now disable tensor unflatten with dynamo.disable, we should not hit this issue again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117340
Approved by: https://github.com/Skylion007
ghstack dependencies: #117336
2024-01-13 19:33:47 +00:00
eebf115686 [fsdp][2d] FSDP sync module states handle tensor subclass (#117336)
This PR lets FSDP's sync module states kwarg handle tensor subclasses: because FSDP works on the "dp" mesh dimension, as long as FSDP works on a different device mesh dimension, we can safely let FSDP just broadcast the DTensor local shards.

fixes https://github.com/pytorch/pytorch/issues/117126

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117336
Approved by: https://github.com/awgu
2024-01-13 19:33:47 +00:00
fc044b5cdb [pt-vulkan] Add build time flag to control descriptor pool sizes (#117398)
Summary:
## Context

When running large models with a lot of operators, the default descriptor pool allocated by the Vulkan compute API may run out of descriptor sets. This changeset introduces the `VULKAN_DESCRIPTOR_POOL_SIZE` build variable (which will default to `1024u`) which can allow for a larger descriptor pool to be allocated if necessary.

## Notes for Reviewers

This is a simple stopgap solution until we have bandwidth to implement the more general solution, which would be to modify the `DescriptorPool` class defined in `api/Descriptor.[h,cpp]` to automatically allocate a new descriptor pool when memory runs out. However, I would consider this change to be low priority since with a delegate/graph mode of execution, the descriptor pool can often be allocated to exactly fit a model's requirements.

Test Plan:
There should be no functional changes under default build settings. Run `vulkan_api_test` to make sure everything works as before; CI should test for that as well.

```
# On devserver
LD_LIBRARY_PATH=/home/ssjia/Github/swiftshader_prebuilt/swiftshader/build/bin/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*"
```

Reviewed By: yipjustin, jorgep31415

Differential Revision: D52742140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117398
Approved by: https://github.com/yipjustin
2024-01-13 13:11:00 +00:00
2c8975387d [Optimus] fix batch layernorm numerical issue (#117404)
Summary:
Fix the numerical issue with addcmul.

Found that torch.addcmul generates different values from torch.add+torch.mul under the 32-bit check. Mini repro: N4823658

Change addcmul to torch.add+torch.mul

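For context, a rough sketch (hypothetical tensors, not the Optimus pass itself) of why the fused op and the decomposed sequence can disagree in float32:
```python
import torch

t, a, b = (torch.randn(1024) for _ in range(3))
fused = torch.addcmul(t, a, b, value=2.0)         # t + 2.0 * a * b in one kernel
decomposed = torch.add(t, torch.mul(a, b) * 2.0)  # same math, different rounding order
print((fused - decomposed).abs().max())           # small, possibly nonzero difference
```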
Test Plan:
buck test

before change
```
the diff index is:  0
the diff index is:  1
the diff index is:  6
```

after change numeric on par

Differential Revision: D52745671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117404
Approved by: https://github.com/mengluy0125
2024-01-13 10:04:12 +00:00
f008efa8e7 Reconstruct streams via global registration, temporary impl to unblock FSDP (#117386)
This is a placeholder implementation for reconstructing streams via global storage to unblock FSDP, pending proper stream support design

This PR does a few things:

1) fixes registration for devices with indices. We were only supporting "cuda"; we now support "cuda:k" interfaces, where k is the GPU index

2) Changes the stream objects in dynamo to take devices as device types, instead of strings, and updates the string based device APIs to gracefully take device types.

3) Introduces a reconstruct-by-global (using existing cleanup hook structures) to streams as a placeholder impl for now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117386
Approved by: https://github.com/jansel
2024-01-13 07:03:33 +00:00
ef3217d9f7 [PyTorch] Mark USDT probes as noinline to avoid duplications in ThinLTO mode (#117381)
Differential Revision: D52710343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117381
Approved by: https://github.com/chaekit
2024-01-13 06:18:01 +00:00
302f931c25 Update Reviewers for PyTorch Distributed team (#116231)
Update merge rule approver list under 'Distributed' section based on current PyTorch distributed team composition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116231
Approved by: https://github.com/fduwjj, https://github.com/XilunWu
2024-01-13 05:07:13 +00:00
96163eb010 Switch nightly binaries to oidc. Remove aws keys (#117416)
This should fix all wheel nightly upload failures:
https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=upload
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117416
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-01-13 03:24:13 +00:00
22ddf91dbb [torch][fx] more strong typed codegen for partial specialized code on boolean (#117201)
Summary:
* In some fx partially specialized codegen via `concrete_args` on boolean arguments, we extend support so the GraphModule can further be used on a strongly typed runtime like TorchScript.
* This diff fixes the type annotation for booleans only and preserves the argument mapping for leaf pytree nodes.

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:fx -- --exact 'caffe2/test:fx - test_partial_trace (test_fx.TestFX)'

Differential Revision: D52667883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117201
Approved by: https://github.com/houseroad
2024-01-13 03:10:02 +00:00
2bc7da1ab7 [HigherOrderOp] change signature of map_impl (#117161)
Summary:
X-link: https://github.com/pytorch/executorch/pull/1580

This PR changes the schema of map_impl from map_impl(f, num_mapped, *operands) to map_impl(f, mapped_args: Tuple, moperands: Tuple). This is to prepare for turning on dynamo for eager mode map, where we want to get rid of the num_mapped scalar.

Test Plan: Existing tests.

Differential Revision: D52495413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117161
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
2024-01-13 02:50:46 +00:00
f2f47c6848 [dynamo] realize LazyVT's in DICT_MERGE (#117282)
Fixes https://github.com/pytorch/pytorch/issues/115029.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117282
Approved by: https://github.com/jansel, https://github.com/mlazos
2024-01-13 01:50:39 +00:00
3e397cefc5 Add uint1 to uint7 dtypes (#117208)
Summary:
These dtypes are added since we see more demand for these sub byte dtypes, especially with
the popularity of LLMs (https://pytorch.org/blog/accelerating-generative-ai-2/#step-4-reducing-the-size-of-the-weights-even-more-with-int4-quantization-and-gptq-2021-toks)

Note these are just placeholders, the operator support for these dtypes will be implemented with tensor subclass.
e.g. torch.empty(..., dtype=torch.uint1) will return a tensor subclass of uint1 that supports different operations like bitwise ops, add, mul, etc. (will be added later)

Also Note that these are not quantized data types, we'll implement quantization logic with tensor subclass backed up by these dtypes as well.
e.g `Int4GroupedQuantization(torch.Tensor)` will be implemented with torch.uint4 Tensors (see https://github.com/pytorch-labs/ao/pull/13 as an example)

Test Plan:
CIs
python test/test_quantization.py -k test_uint1_7_dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117208
Approved by: https://github.com/ezyang
2024-01-13 01:09:23 +00:00
52575eb1bb The permission id-token write needs to be set on rocm-test callers (#117422)
All these workflows lack the necessary permission to run `_rocm-test` job after https://github.com/pytorch/pytorch/pull/117160, for example https://github.com/pytorch/pytorch/actions/runs/7508520071

### Testing

Confirm that trunk is back https://github.com/pytorch/pytorch/actions/runs/7508830196.  Other workflows would be the same, i.e. rocm https://github.com/pytorch/pytorch/actions/runs/7508830137/job/20444989127.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117422
Approved by: https://github.com/atalman
2024-01-13 00:27:46 +00:00
9746f36e50 [export] Minor fixes to serialization (#117374)
* Checks that the input to torch.export.save is an ExportedProgram (https://github.com/pytorch/pytorch/issues/116952)
* Fixes naming for serialized state dict from `serialized_state_dict.json` to `serialized_state_dict.pt` (https://github.com/pytorch/pytorch/issues/116949)
* Moves some tests to be expectFailure rather than blocklisted
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117374
Approved by: https://github.com/ydwu4
2024-01-13 00:23:06 +00:00
7f1f0b1135 [C10D] Add duration_ms to flight recorder (#114817)
Measures the duration of a collective operation using nccl start/end
events and includes this duration (in ms) in the flight recorder data.

duration_ms will be an optional field, since it only works when
timing is enabled.  Currently timing is enabled when flight recorder
is enabled, but this is not a strict requirement.  Duration is also
not available for collectives not in a completed state.

Note: computing duration can lead to a hang due to calling cudaEventElapsedTime when
the cuda driver queue is full.

We don't ever want the dump() API to hang, since we might want dump to help
debug a hang. Hence, we only query durations from the watchdog thread,
and it's possible that during a dump() call, some of the most recent
collectives' durations won't have been computed yet at the time of dump. We
make this tradeoff to ensure that dump() itself will never hang.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114817
Approved by: https://github.com/fduwjj, https://github.com/zdevito
ghstack dependencies: #116905
2024-01-12 23:34:11 +00:00
7a7535283f Some basic support for uint{16,32,64} codegen in CPU inductor (#116810)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116810
Approved by: https://github.com/chenyang78, https://github.com/eellison, https://github.com/desertfire
2024-01-12 23:13:28 +00:00
4b25948ee6 Torchbench Dynamo Runner: Enable DDP for perf test and traces (#113332)
- Removes an outdated assert that prevents perf tests from running DDP; we now have single-node --multiprocess, and perf tests are already wrapping the model using `deepcopy_and_maybe_ddp`
- Appends the rank name to traces to avoid all ranks trying to create the same file
- Renames `deepcopy_and_maybe_ddp` to `deepcopy_and_maybe_parallelize` to include FSDP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113332
Approved by: https://github.com/H-Huang, https://github.com/wconstab
2024-01-12 22:41:09 +00:00
c329eddcb9 Migrate the rest of state_dict testing to OptimizerInfo (#117186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117186
Approved by: https://github.com/albanD
ghstack dependencies: #116509
2024-01-12 22:32:37 +00:00
bcf1f312a0 Migrate nontensor step and CUDA params state_dict tests to OptimizerInfo (#116509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116509
Approved by: https://github.com/albanD
2024-01-12 22:32:37 +00:00
7b753cc7b8 Skip some slow tests (under Dynamo) (#117389)
Otherwise these may cause timeouts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117389
Approved by: https://github.com/jerryzh168, https://github.com/voznesenskym
ghstack dependencies: #117318, #117320
2024-01-12 22:18:07 +00:00
d73846689d Rename test_legacy_vmap.py TestCase names (#117320)
The problem is that the dynamo_test_failures logic recognizes tests by
their TestClass.test_name. Unfortunately we have duplicate
TestClass.test_name in test_legacy_vmap and test_vmap. This PR
unduplicates them.

Something more robust would have been to include the test file name in
the dynamo_test_failures logic, but... it's a bit too late for that. We
can fix it if it becomes more of a problem in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117320
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117318
2024-01-12 22:18:07 +00:00
06576d859d Stop running ModuleInfo tests under Dynamo (#117318)
This is a policy decision, similar to the OpInfo one. The problem is
that they just take too long to run when we reset() before and after
each.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117318
Approved by: https://github.com/voznesenskym
2024-01-12 22:17:59 +00:00
fbd9bccb75 [C10D](reland) Add GIL checker to NCCL watchdog monitor (#117312)
Whenever the monitor thread kills the watchdog thread for being stuck, we do so to save cluster time and get a faster failure signal, but we want to know more about why it got stuck.

One possible reason for watchdog stuckness is GIL contention, which could be ruled out or observed by making an attempt to acquire the GIL at exit time.

If we cannot acquire the GIL within a short time window (1s) we abort the attempt and report GIL contention, otherwise we report that GIL was acquired successfully.

Reland: uses a function pointer to avoid destructor ordering issues on dlclose. (Looks like the destructor for the std::function was being run later than the libtorchpython lib was unloaded, leading to a crash).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117312
Approved by: https://github.com/zdevito
2024-01-12 21:48:45 +00:00
7b0926cc3e Fix wrong class inheritance in pyi (#116404)
As the title stated.

f6dfbffb3b/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L153)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116404
Approved by: https://github.com/ezyang, https://github.com/wconstab
2024-01-12 21:25:29 +00:00
c167c34396 Skip unsupported tests on arm (#117344)
Add skips to tests that involve record_context_cpp on ARM, as it is only supported on the Linux x86_64 arch. The error is reported as below:
```
Traceback (most recent call last):
  File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2674, in wrapper
    method(*args, **kwargs)
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 3481, in test_direct_traceback
    c = gather_traceback(True, True, True)
RuntimeError: record_context_cpp is not support on non-linux non-x86_64 platforms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117344
Approved by: https://github.com/malfet, https://github.com/drisspg
2024-01-12 21:12:11 +00:00
384c4885fa [ProcessGroup] Do not print NCCL_DEBUG before NCCL init (#117328)
In case /etc/nccl.conf is used, `NCCL_DEBUG` is not set to sys env until NCCL inits.
The deleted print point is before NCCL inits, hence may be inaccurate.
This PR removes it and relies on the other print point which is after NCCL comm creation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117328
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-01-12 20:46:50 +00:00
18bd5c05bc FFT: Handle noop fftn calls gracefully (#117368)
Fixes #117252
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117368
Approved by: https://github.com/malfet
2024-01-12 20:16:50 +00:00
5cf481d1ac [CI] Explicitly specify read-all permissions on the token (#117290)
Would be nice to have it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117290
Approved by: https://github.com/seemethere, https://github.com/osalpekar, https://github.com/huydhn, https://github.com/atalman
2024-01-12 19:15:54 +00:00
013a59acbd Update BCEWithLogitsLoss documentation regarding pos_weight (#117046)
Added clarification for the example provided for the pos_weight parameter in the BCEWithLogitsLoss class, particularly in multi-label binary classification context. This enhancement addresses potential misunderstandings about the application of 'binary' classification, which typically implies two classes, to scenarios involving multiple classes.
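
A small, hedged illustration of the multi-label scenario the docs now clarify: each of the 3 labels is its own binary problem, and pos_weight carries one weight per label (the 3x figure below is an arbitrary example, not taken from the docs):

```python
import torch
import torch.nn as nn

logits = torch.randn(64, 3)                      # 3 independent labels
targets = torch.randint(0, 2, (64, 3)).float()   # multi-hot targets
pos_weight = torch.full((3,), 3.0)               # up-weight positives, one weight per label

loss = nn.BCEWithLogitsLoss(pos_weight=pos_weight)(logits, targets)
print(loss)
```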

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117046
Approved by: https://github.com/mikaylagawarecki
2024-01-12 18:26:25 +00:00
e54b40e5eb [dynamo] GetItemSource - restrict the supported index Source to be GlobalWeakRefSource (#117138)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117138
Approved by: https://github.com/jansel, https://github.com/mlazos
2024-01-12 18:21:14 +00:00
657545dbdd Migrate rocm test to using oidc (#117160)
Similar to Intel XPU, let's use OIDC for rocm runners.

Refer to this PR: https://github.com/pytorch/pytorch/pull/116554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117160
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-01-12 17:57:26 +00:00
cb42bc705b Make auto_functionalized HOP fallback in inductor (#117084)
It looks like the inductor fallback previously worked with HOPs but no longer
does, so I fixed that:
- all HOPs are exposed under torch.ops.higher_order, so I changed how
  inductor looks them up
- the inductor fallback assumed that an operator's signature was (*args,
  **kwargs). This is true for all the OpOverloads but not HOPs. I
  rewrote the code to not rely on this.

Test Plan:
- existing tests
- new test for auto_functionalized HOP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117084
Approved by: https://github.com/williamwen42
2024-01-12 17:57:01 +00:00
a97d00cca5 [Nested Tensor]Support SDPA math fallback for jagged layout nested tensor (#116445)
Support this fallback by converting the jagged layout NT to a strided layout NT, and then converting the result back to a jagged layout NT.
This fallback might not be efficient since it uses unbind, contiguous and split.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116445
Approved by: https://github.com/soulitzer
2024-01-12 17:30:40 +00:00
21d370819b [CI] Set permissions for stale workflow (#117371)
Hopefully should fix failures one observes in HUD as default permissions for the repo were changed to read-only
<img width="232" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/4047472c-ca3c-4288-add7-97f0ce43106a">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117371
Approved by: https://github.com/clee2000
2024-01-12 16:44:15 +00:00
172dd13ecf [inductor][cpp] improve vector contiguous checks for FloorDiv and ModularIndexing (#117221)
Fix https://github.com/pytorch/pytorch/issues/114488

The PR tries to enable contiguous vector loads for cases where we can reduce `FloorDiv` and `ModularIndexing` in the vectorized loop.

Take the index expression in test case `test_vec_contiguous_ModularIndexing` for example.
`14336*x0 + 256*x1 + 128*((x2//256)) + ModularIndexing(x2, 1, 128) + 7168*ModularIndexing(x2, 128, 2)` can be reduced to `14336*x0 + 256*x1 + x2 + 128*x2_div_c0 + 7168*x2_mod_c0 + x2_mod_c1` where `x2` is a vectorized loop variable and the vector length is 16. This means we can do vectorized load for this index. Check the code comment for more details:
https://github.com/pytorch/pytorch/pull/117221/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R317-R329
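
A standalone sympy spot check of this claim (an assumption-laden sketch, not inductor's actual pass); only the x2-dependent terms of the index are kept, since the x0/x1 terms are constant within the vectorized loop:

```python
import sympy as sp

x2 = sp.Symbol("x2", integer=True, nonnegative=True)
expr = 128 * (x2 // 256) + sp.Mod(x2, 128) + 7168 * sp.Mod(x2 // 128, 2)

# Within any block of 16 consecutive x2 values starting at a multiple of 16,
# the div/mod terms stay constant, so the index advances by exactly 1 per lane.
for start in (0, 16, 128, 240, 256):
    vals = [int(expr.subs(x2, start + i)) for i in range(16)]
    assert all(b - a == 1 for a, b in zip(vals, vals[1:])), (start, vals)
print("contiguous within each aligned 16-lane block")
```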

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117221
Approved by: https://github.com/jansel
2024-01-12 15:20:36 +00:00
6c624aad37 [CPU] Disable floating-point contraction when compiling (#116318)
Fixes #100775.

For CPU inductor path, disable -ffp-contract, such as fma, from optimization flags to fix functional issues.

### Validation
Validation on 3 benchmark suites.

- [x] FP32: Negligible geomean change; No outlier models.

<img width="582" alt="image" src="https://github.com/pytorch/pytorch/assets/23010269/7c14a8b8-eb6c-4794-bff9-2e1ae3a22781">

- [x] BF16: Negligible geomean change; No outlier models.

<img width="589" alt="image" src="https://github.com/pytorch/pytorch/assets/23010269/cf558737-8cb2-411f-8761-27b9f8fc43af">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116318
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-01-12 14:09:05 +00:00
6ebb26d572 Fail Conv Binary Inplace check when act and accum are same tensor (#117331)
**Summary**
When a tensor is used both as the activation input of the conv and as the extra input of the binary add node, we shouldn't do conv binary inplace fusion.
```
       a
     /   \
  conv    |
     \    |
      add
```
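
A hedged sketch of the pattern above as a module (not the test from this PR): `a` is both the conv's activation and the extra input of the add, so writing the fused result in place into `a`'s buffer would clobber data the fused kernel still reads:

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=1)

    def forward(self, a):
        return self.conv(a) + a   # `a` feeds both the conv and the add

m = M().eval()
x = torch.randn(1, 3, 8, 8)
with torch.no_grad():
    out = torch.compile(m)(x)    # the compiler should pick the outplace fusion here
```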

**TestPlan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_failed_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117331
Approved by: https://github.com/jgong5
ghstack dependencies: #117330
2024-01-12 10:34:11 +00:00
19a9fdbf3a Add more alias and mutation check for other input of Conv Binary Inplace fusion (#117330)
**Summary**
Fix the issue: https://github.com/pytorch/pytorch/issues/117108.
Use the outplace conv binary fusion when the other input is of type `TensorBox(View(ReinterpretView()))`, since the other input is a view of some other tensor.

**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_failed_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117330
Approved by: https://github.com/jgong5
2024-01-12 10:29:33 +00:00
f7d9047864 [inductor] Iterative percolate tags (#117306)
Fixes https://github.com/pytorch/pytorch/issues/116581

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117306
Approved by: https://github.com/aorenste, https://github.com/eellison
2024-01-12 07:52:32 +00:00
47c9d12ffd Add super().setUp() to TestFFT1D (#117329)
One day I'll move the check to be somewhere else so we don't need to worry about this anymore
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117329
Approved by: https://github.com/huydhn
2024-01-12 07:47:01 +00:00
50049cfaa0 [1/4] Intel GPU Runtime Upstreaming for Device (#116019)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), The first runtime component we would like to upstream is `Device` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`.

# Design
An Intel GPU device is a wrapper of a SYCL device on which kernels can be executed. In our design, we maintain a SYCL device pool containing all the GPU devices of the current machine, with the status of the device pool managed by PyTorch. Thread-local safety is considered in this design. The corresponding C++ files related to `Device` will be placed in the c10/xpu folder. And we provide the c10 device runtime APIs, like
  - `c10::xpu::device_count`
  - `c10::xpu::set_device`
  - ...

# Additional Context
In our plan, 4 PRs should be submitted to PyTorch for `Device`:
1. for c10
2. for aten
3. for python frontend
4. for lazy initialization shared with CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019
Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
2024-01-12 07:36:25 +00:00
7dac2f9f2d [export][ez] Fix getting meta["val"] (#117313)
Summary: For integer inputs, they do not have a meta["val"].

Test Plan: `buck run @//mode/dev-nosan  //executorch/examples/portable/scripts:export -- -m emformer_predict` passes the export step

Differential Revision: D52716419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117313
Approved by: https://github.com/kirklandsign, https://github.com/tugsbayasgalan
2024-01-12 06:17:38 +00:00
40f12cec93 Change predispatch tracing API (#117278)
Summary: Change the API used in export for aotinductor

Test Plan: buck2 run mode/opt mode/inplace caffe2/test/inductor/fb:test_group_batch_fusion_fb

Differential Revision: D52678653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117278
Approved by: https://github.com/angelayi, https://github.com/khabinov
2024-01-12 06:10:02 +00:00
ec443089c7 enable fp16 mkldnn fusion/prepack in inductor (#117206)
- Extend `linear/conv/rnn` packable with `float16`.
- Extend `Unary fusion` to support `float16`.

Test Case:
    Extend bfloat16 related test in `test_cpu_repro.py` and `test_mkldnn_pattern_matcher.py` to test both `fp16` and `bf16`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117206
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-01-12 06:08:42 +00:00
9d5954e2a9 ignore ill-formed solution of reduce_inequalities (#117310)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/117033

Sometimes the solution returned by `sympy.solvers.inequalities.reduce_inequalities` can contain sub-expressions of the form `CRootOf(...)`, denoting the complex root of some equation in `x`, where `x` is an arbitrary symbol. We will now gracefully fail when this happens, like we already do when the solver itself fails.
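
A hedged sketch of how such a solution can arise (this is a made-up inequality, not the actual shape-env input):

```python
import sympy
from sympy.solvers.inequalities import reduce_inequalities

x = sympy.Symbol("x", real=True)
sol = reduce_inequalities([x**5 - x - 1 > 0], [x])
print(sol)  # the bound is typically expressed via CRootOf(x**5 - x - 1, 0)
```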

Test Plan: added a test

Differential Revision: D52715578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117310
Approved by: https://github.com/ezyang
2024-01-12 06:01:13 +00:00
638f85fd67 Add default parameters to rrelu_with_noise() (#117141)
Summary:
rrelu_with_noise() was listed as having default parameters in the schema but the
actual code definition didn't have them.

The failing example was calling rrelu() which DOES have default parameters and
it passes those defaulted values to C++. Under the covers the C++ code was calling
the python version of rrelu_with_noise().

Although the C++ code was passing all the values to the Python version of
rrelu_with_noise(), the PyTorch C++ -> Python dispatch code looks at the schema
and strips any parameters which match the schema's listed defaults, so if the
schema shows defaults that aren't in the code it will be a problem.
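
A small eager-mode sketch of the schema defaults in question (the reported failure was in the Python-side definition reached through dispatch, so this is only meant to show the intended call shape):

```python
import torch

x = torch.randn(8)
noise = torch.empty_like(x)

# rrelu() applies the defaults (lower=1/8, upper=1/3, training=False) ...
y1 = torch.rrelu(x)
# ... and the underlying op can be called with the same trailing arguments omitted
y2 = torch.ops.aten.rrelu_with_noise(x, noise)
```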

Test Plan:
I added a unit test for this specific case. It would probably be better to write
a more general one to validate all the ops against their schemas - but I haven't
learned enough about the test harness to do that yet.

Fixes #115811

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117141
Approved by: https://github.com/yanboliang, https://github.com/oulgen
2024-01-12 05:32:13 +00:00
d29bf0a37e Fix ONNXProgram.save to use torch.load(..., mmap=True) for large models (#117295)
During ONNXProgram.save, the implicit/explicit state_dict passed in must
be loaded in memory in order to read each initializer and create an
external tensor proto with them

This PR ensures torch.load uses memory-map to support large models that
cannot fit in memory
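
A minimal sketch of the idea, assuming a checkpoint previously written with torch.save (the file name is hypothetical): with mmap=True the tensors are paged in lazily instead of being loaded into RAM up front:

```python
import torch

state_dict = torch.load("model_state.pt", map_location="cpu", mmap=True)
for name, tensor in state_dict.items():
    pass  # each initializer can then be written out as an external tensor, one at a time
```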
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117295
Approved by: https://github.com/BowenBao
ghstack dependencies: #117294
2024-01-12 04:38:27 +00:00
b62ba82cdc Update initializer path for ONNXProgram.save due to onnx.checker limitation (#117294)
According to https://github.com/onnx/onnx/blob/main/docs/ExternalData.md#large-models-2gb when initializers are larger than 2GB, `onnx.checker` requires the model to be in the same directory as the initializer.

Although not strictly necessary for the export and model save to succeed, it is desirable to have the `onnx.checker` to succeed when validation the resulting large model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117294
Approved by: https://github.com/BowenBao
2024-01-12 04:22:12 +00:00
b3b585af64 Revert "[codemod] markDynamoStrictTest batch 16 (#117218)"
This reverts commit 47119785acbfe20d9ef6cf5d90887a441402f5c7.

Reverted https://github.com/pytorch/pytorch/pull/117218 on behalf of https://github.com/zou3519 due to just felt like reverting this ([comment](https://github.com/pytorch/pytorch/pull/117218#issuecomment-1888360366))
2024-01-12 03:06:20 +00:00
ac0bed01df Revert "[dynamo] GetItemSource - restrict the supported index Source to be GlobalWeakRefSource (#117138)"
This reverts commit c278a1b39c8ae33feaa4a87b35b721fff7fdf19a.

Reverted https://github.com/pytorch/pytorch/pull/117138 on behalf of https://github.com/zou3519 due to Broke jobs on main, I'm not sure why ([comment](https://github.com/pytorch/pytorch/pull/117138#issuecomment-1888290068))
2024-01-12 01:55:49 +00:00
3214ada631 [MPS][BE] Better format nested ternary (#117198)
- Replace double ternary with if + ternary
- Replace deprecated `AT_ASSERT` with `TORCH_INTERNAL_ASSERT`
- Replace regular asserts with `TORCH_CHECK` or `TORCH_INTERNAL_ASSERT` depending on context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117198
Approved by: https://github.com/Skylion007
2024-01-12 01:29:17 +00:00
04604eea8a [inductor] check nan/inf for graph inputs (#117189)
This is split out from #103469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117189
Approved by: https://github.com/jansel
2024-01-12 00:59:32 +00:00
47119785ac [codemod] markDynamoStrictTest batch 16 (#117218)
[codemod] markDynamoStrictTest test_dataloader
[codemod] markDynamoStrictTest test_public_bindings
[codemod] markDynamoStrictTest test_namedtensor
[codemod] markDynamoStrictTest test_fx
[codemod] markDynamoStrictTest test_content_store
[codemod] markDynamoStrictTest test_schema_check
[codemod] markDynamoStrictTest lazy/test_ts_opinfo
[codemod] markDynamoStrictTest functorch/test_ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117218
Approved by: https://github.com/bdhirsh
2024-01-12 00:32:36 +00:00
c278a1b39c [dynamo] GetItemSource - restrict the supported index Source to be GlobalWeakRefSource (#117138)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117138
Approved by: https://github.com/jansel
2024-01-11 23:26:25 +00:00
5d2d21a7be [bfloat16][easy] kthvalue, median (#117279)
Fixes #109991
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117279
Approved by: https://github.com/Skylion007
2024-01-11 22:44:07 +00:00
5c6e7962f4 [c10d][EZ] Add more logs in the destructor of ProcessGroupNCCL for better root cause investigation (#117291)
Add logs to the place where we inspect whether a hang happens.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117291
Approved by: https://github.com/XilunWu, https://github.com/shuqiangzhang
2024-01-11 22:33:30 +00:00
53cba40651 [Distributed] Fix tests when CUDA not available (#117163)
NCCL tests failed after https://github.com/pytorch/pytorch/pull/116217 when PyTorch was not built with CUDA. This PR fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117163
Approved by: https://github.com/malfet, https://github.com/wanchaol
2024-01-11 22:27:43 +00:00
9f87760160 Revert "[Nested Tensor]Support SDPA math fallback for jagged layout nested tensor (#116445)"
This reverts commit e55a778cbb518e54c5afa5b8107b352746d7f41a.

Reverted https://github.com/pytorch/pytorch/pull/116445 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but i see it fails ROCm test in trunk due to an unsupported use case e55a778cbb ([comment](https://github.com/pytorch/pytorch/pull/116445#issuecomment-1888060036))
2024-01-11 22:21:45 +00:00
0a5aa5c2d1 [pt-vulkan][ez] Remove reference to c10::MemoryFormat from api/ folder (#117183)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset removes references to `c10::MemoryFormat` in `api/Tensor.[h,cpp]`; when constructing a `vTensor`, the `api::StorageType` (i.e. whether the tensor will be backed by buffer or texture storage) and `api::GPUMemoryLayout` (i.e. which dimension will be the fastest moving dimension) must be specified directly.

Differential Revision: [D52662234](https://our.internmc.facebook.com/intern/diff/D52662234/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117183
Approved by: https://github.com/liuk22, https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178, #117179, #117180, #117181
2024-01-11 22:08:29 +00:00
8b0bfb3aaa [FSDP] remove unused flat_param_part_view (#117082)
flat_param_part_view is unused in pytorch repo: https://fburl.com/ssaomd7x

it became unused since refactoring in https://github.com/pytorch/pytorch/pull/115497

before that, the original code is below. Since flat_param is 1D, we do
not need .view for reshaping

```
self.flat_param.data = padded_unsharded_flat_param[
    : unsharded_size.numel()
].view(
    unsharded_size
)
```

unit test: pytest test/distributed/fsdp/test_fsdp_core.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117082
Approved by: https://github.com/awgu, https://github.com/wconstab, https://github.com/Skylion007
2024-01-11 21:59:51 +00:00
3c66c89057 [pt-vulkan] Replace c10::ScalarType with native equivalent (#117181)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset introduces `api::ScalarType` in `api/Types.h`, which is intended to function the same as `c10::ScalarType`; thus `api/Types.h` is the primary file of interest. The rest of the changes are straightforward replacements of `c10::ScalarType` with `api::ScalarType`.

Differential Revision: [D52662237](https://our.internmc.facebook.com/intern/diff/D52662237/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117181
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178, #117179, #117180
2024-01-11 21:43:33 +00:00
331ae7f89f [pt-vulkan][ez] Replace c10::overflows with native equivalent (#117180)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset is very straightforward, as it simply copies the required components of `c10::overflows` from [`c10/util/Half.h`](https://github.com/pytorch/pytorch/blob/main/c10/util/Half.h#L477) into `api/Utils.h`.

Differential Revision: [D52662236](https://our.internmc.facebook.com/intern/diff/D52662236/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117180
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178, #117179
2024-01-11 21:43:33 +00:00
4205892be6 [pt-vulkan][ez] Replace ArrayRef with std::vector<T>& (#117179)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset replaces all instances of `c10::ArrayRef<T>` with `std::vector<T>&` and all instances of`c10::IntArrayRef` with `std::vector<int64_t>&`. There are a lot of changes in this changeset but that is simply due to the large number of callsites. All the changes are straightforward replacements.

Differential Revision: [D52662235](https://our.internmc.facebook.com/intern/diff/D52662235/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117179
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178
2024-01-11 21:43:15 +00:00
b209de6699 [pt-vulkan] Replace TORCH_CHECK and similar macros with native equivalents (#117178)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset introduces the `api::Error` class in `api/Exception.h`, which is a more barebones copy of the `c10::Error` class from [`c10/util/Exception.h`](https://github.com/pytorch/pytorch/blob/main/c10/util/Exception.h). The macros `VK_CHECK_COND` (equivalent to `TORCH_CHECK(cond, msg)`) and `VK_THROW` (equivalent to `TORCH_CHECK(false, msg)`) are introduced as well to replace calls to `TORCH_CHECK()` and similar macros.

Although this is a large diff, the most meaningful changes are in the added files `api/Exception.[h,cpp]` and `api/StringUtil.[h,cpp]` (which is mostly adapted from [`c10/util/StringUtil.h`](https://github.com/pytorch/pytorch/blob/main/c10/util/StringUtil.h)) which implements `api::Error` and the new macros. The rest of the diff is replacing calls to `TORCH_CHECK()` and similar macros with `VK_CHECK_COND()` and `VK_THROW()`.

Differential Revision: [D52662233](https://our.internmc.facebook.com/intern/diff/D52662233/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117178
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177
2024-01-11 21:43:15 +00:00
fe298e901a [pt-vulkan][ez] Replace ska::flat_hash_map, c10::get_hash with std::unordered_map, std::hash (#117177)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.

## Notes for Reviewers

The majority of the changes in this changeset are:

* Replacing instances of `ska::flat_hash_map` with `std::unordered_map`
   * `ska::flat_hash_map` is an optimized hash map, but the optimizations shouldn't be too impactful so `std::unordered_map` should suffice. Performance regression testing will be done at the final change in this stack to verify this.
* Replacing `c10::get_hash` with `std::hash` where only one variable is getting hashed or the `utils::hash_combine()` function added to `api/Utils.h` (which was copied from `c10/util/hash.h`)

Differential Revision: [D52662231](https://our.internmc.facebook.com/intern/diff/D52662231/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117177
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176
2024-01-11 21:43:15 +00:00
57b76b970b [pt-vulkan][ez] Miscellaneous small c10 deprecations (c10::irange, C10_LIKELY, c10::SmallVector, etc.) (#117176)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset deprecates various easy-to-replace symbols from the `c10` library, replacing them with either C++ STL equivalents or by copying those `c10` symbols as native equivalents. The symbols that were impacted are:

* `c10::irange`
  * removed and replaced with standard for loops
* `C10_LIKELY` and `C10_UNLIKELY`
  * These macros allow for some branch re-ordering compiler optimizations when building with GCC. They aren't strictly necessary and their impact is likely minimal so these have simply been removed.
* `c10::SmallVector<T, N>`
* My understanding is that `c10::SmallVector<T, N>` is essentially a wrapper around `std::vector<T>` that is optimized for array sizes up to `N`. I don't believe that this optimization is worth creating a native equivalent, so I replaced instances of this symbol with `std::vector<T>`
* `c10::multiply_integers`
  * This function is simply a convenient wrapper around `std::accumulate`, so I copied it as a native equivalent in `api/Utils.h`

This changeset consists entirely of the replacements described above.

Differential Revision: [D52662232](https://our.internmc.facebook.com/intern/diff/D52662232/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117176
Approved by: https://github.com/yipjustin
2024-01-11 21:42:24 +00:00
24c39bb5e5 Upgrade nightly wheels to rocm6.0 (#116983)
Follow-up to https://github.com/pytorch/builder/pull/1647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116983
Approved by: https://github.com/jeffdaily, https://github.com/atalman
2024-01-11 20:36:00 +00:00
e55a778cbb [Nested Tensor]Support SDPA math fallback for jagged layout nested tensor (#116445)
Support this fallback by converting the jagged layout NT to a strided layout NT, and then converting the result back to a jagged layout NT.
This fallback might not be efficient since it uses unbind, contiguous and split.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116445
Approved by: https://github.com/soulitzer
2024-01-11 20:28:40 +00:00
92cc8ae172 [FSDP] Cloned unsharded tensor slice in optim state dict load (#117261)
This takes the fix from https://github.com/pytorch/pytorch/issues/116553. Cloning the slice allows the base (much larger) tensor to be freed.
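
An illustrative sketch of why cloning helps (not the FSDP code itself): a narrow slice keeps the whole base tensor alive because it shares the base's storage:

```python
import torch

base = torch.empty(1024 * 1024)   # stand-in for the large unsharded tensor
view = base[:10]                  # shares storage with `base`
del base                          # the large storage is still held alive via `view`

owned = view.clone()              # copies only the 10 elements into fresh storage
del view                          # now the large storage can actually be freed
```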

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117261
Approved by: https://github.com/wz337
2024-01-11 20:21:12 +00:00
88bf84f106 [benchmark] add --compile-autograd to dynamo benchmarks (#117196)
Adds `--compile-autograd` flag to benchmark suite to run accuracy and performance tests. Also adds autograd_captures and autograd_compiles to dynamo stats

e.g. accuracy_inductor.csv
```
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,BERT_pytorch,4,pass,2655,2,8,7,1,1
cuda,Background_Matting,4,pass_due_to_skip,0,0,0,0,0,0
cuda,DALLE2_pytorch,0,eager_fail_to_run,0,0,0,0,0,0
cuda,LearningToPaint,4,pass,639,2,8,7,1,1
...
```

e.g. speedup_inductor.csv
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,hf_T5,8,1.214311,136.236793,88.350570,0.751322,18.754706,24.962275,3298,2,8,8,1,1
cuda,hf_T5,8,1.226645,135.431856,52.461461,1.040973,18.754706,18.016508,795,1,7,7,0,0
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117196
Approved by: https://github.com/jansel
2024-01-11 20:12:58 +00:00
83c45a9931 Faster gc_count update for CUDACachingAllocator (and avoid nullptr de… (#117064)
…reference) (#109065)

Summary:

Modify the way we update gc_count in CUDACachingAlloctor to make it faster.

Originally D48481557, but reverted due to a nullptr dereference in some cases (D49003756). This diff changes it to use the correct constructor for the search key (to avoid the nullptr dereference). Also, adds a nullptr check (returning 0 if null) in the gc_count functions.

Differential Revision: D49068760

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117064
Approved by: https://github.com/zdevito
2024-01-11 19:47:05 +00:00
5bc896e5dc Dockerfile; Add cuda bin to PATH (#117105)
We need this to execute `nvidia-smi` in the officially released containers. We already have it in the Docker CI.

See
94db6578cc/.ci/docker/linter-cuda/Dockerfile (L35)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117105
Approved by: https://github.com/atalman
2024-01-11 18:10:19 +00:00
9e3580f793 Fix #117011: add the TORCH_CHECK(grad_output) of upsample_nearest::backward() (#117100)
Add the TORCH_CHECK(grad_output) to upsample_nearest::backward().

Fixes #117011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117100
Approved by: https://github.com/lezcano
2024-01-11 18:06:22 +00:00
f89725fb41 [DCP][BC] Add the backward compatibility test (#116247)
This PR adds a test to ensure all metadata is backward compatible with the older definination.

Differential Revision: [D52357733](https://our.internmc.facebook.com/intern/diff/D52357733/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116247
Approved by: https://github.com/wz337
ghstack dependencies: #116245, #116246
2024-01-11 18:01:35 +00:00
7e9cbc6834 [CI] Catch more exception types when running eager in PT2 tests (#117120)
Summary: https://github.com/pytorch/pytorch/actions/runs/7467073391/job/20320251143#step:16:1332 shows a case where model loading fails with KeyError but the error is not logged in the report csv file, which can cause an eager model failure to be silently ignored in the PT2 integration test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117120
Approved by: https://github.com/huydhn
2024-01-11 17:46:11 +00:00
5b24877663 Improve uint{16,32,64} dlpack/numpy compatibility (#116808)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116808
Approved by: https://github.com/malfet, https://github.com/albanD
2024-01-11 17:01:54 +00:00
623b7fedc4 [c10d] Add comments to the rest environment variable within NCCLPG (#117092)
Not every environment variable within NCCLPG has a comment; let's add comments to each of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117092
Approved by: https://github.com/kwen2501
ghstack dependencies: #116545
2024-01-11 16:47:25 +00:00
3d1869d0ae [DCP][BE] Improve the readability of filesystem and fsspec filesystem (#116246)
1. Better typing
2. Remove 1-liner function

Differential Revision: [D52357731](https://our.internmc.facebook.com/intern/diff/D52357731/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116246
Approved by: https://github.com/wz337
ghstack dependencies: #116245
2024-01-11 16:27:21 +00:00
4c7b602645 Add Support For Symbolic Shapes in Register_replacement, SDPA Pattern Matching (#115441)
Many of our pattern matching replacements are specified as a `search_fn` and a `replacement_fn`. The search_fn's are traced out once with static shapes, converted to a pattern, and then matched on every graph compiled with inductor.

The static shape patterns would not match with graphs that are traced out with dynamic shapes because SymInts would be added to the graph as `sym_size` fx nodes which added additional uses and prevented matching. The previous PR partially addresses this by deduping SymInts that are resolvable to graph inputs, as is the calling convention in aot autograd.

This PR adjusts our matching of the `search_fn` by adding SymInts to the arguments we trace out the search_fn with so that their symint accesses are deduped. Later, if we have a match, we will trace out the replacement graph with the correct Tensors and corresponding symbolic shapes that will get added to the graph.

Note: the replacement patterns will insert sym_size uses which could potentially be removed, but I'll leave that for follow up.

Fix for https://github.com/pytorch/pytorch/issues/111190.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115441
Approved by: https://github.com/jansel
ghstack dependencies: #116158
2024-01-11 15:58:37 +00:00
bfc336308a Revert "Error grad mode op in export API (#117187)"
This reverts commit 89ef426ba0d87091303f6a3c21c38749f9af72a3.

Reverted https://github.com/pytorch/pytorch/pull/117187 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/117187#issuecomment-1887363580))
2024-01-11 15:01:36 +00:00
767e1b6349 Revert "Bring docstring to .pyi file (#114705)"
This reverts commit 0dd5deecedd136852c7ccc81630eaefbebe5be29.

Reverted https://github.com/pytorch/pytorch/pull/114705 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114705#issuecomment-1887165326))
2024-01-11 13:30:44 +00:00
7005a4bcb6 [dynamo] Added dyn shapes support for math trigo ops: sin(h), cos(h), tan(h) ... (#114866)
Description:
- Added dynamic shapes support for math trigo ops: sin(h), cos(h), tan(h) ...

```python
import math
import torch

def func(x, a, b):
    c = 0
    c = c + math.sqrt(a)
    c = c + math.cos(a)
    c = c + math.cosh(a)
    c = c + math.sin(a)
    c = c + math.sinh(a)
    c = c + math.tan(a)
    c = c + math.tanh(a)
    c = c + math.asin(b)
    c = c + math.acos(b)
    c = c + math.atan(a)
    y = x + c
    return y

cfunc = torch.compile(func, dynamic=True, fullgraph=True)

device = "cpu"  # or "cuda"
x = torch.tensor([0, 1, 2, 3], dtype=torch.float32, device=device)
a = 12
b = 1

out = cfunc(x, a, b)
expected = func(x, a, b)
torch.testing.assert_close(out, expected)
```

and the graph `TORCH_LOGS=+graph_code python check_math_ops.py`:

<details>
<summary>
graph code
</summary>

```
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] TRACED GRAPH
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  ===== __compiled_fn_0 =====
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]     def forward(self, L_a_ : torch.SymInt, s1 : torch.SymInt, L_x_ : torch.Tensor):
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_a_ = L_a_
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_x_ = L_x_
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:57, code: c = c + math.sqrt(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_sqrt = torch.sym_sqrt(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add = 0 + sym_sqrt;  sym_sqrt = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:58, code: c = c + math.cos(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_cos = torch.sym_cos(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_1 = add + sym_cos;  add = sym_cos = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:59, code: c = c + math.cosh(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_cosh = torch.sym_cosh(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_2 = add_1 + sym_cosh;  add_1 = sym_cosh = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:60, code: c = c + math.sin(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_sin = torch.sym_sin(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_3 = add_2 + sym_sin;  add_2 = sym_sin = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:61, code: c = c + math.sinh(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_sinh = torch.sym_sinh(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_4 = add_3 + sym_sinh;  add_3 = sym_sinh = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:62, code: c = c + math.tan(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_tan = torch.sym_tan(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_5 = add_4 + sym_tan;  add_4 = sym_tan = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:63, code: c = c + math.tanh(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_tanh = torch.sym_tanh(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_6 = add_5 + sym_tanh;  add_5 = sym_tanh = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:64, code: c = c + math.asin(b)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_7 = add_6 + 1.5707963267948966;  add_6 = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:65, code: c = c + math.acos(b)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_8 = add_7 + 0.0;  add_7 = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:66, code: c = c + math.atan(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_atan = torch.sym_atan(l_a_);  l_a_ = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_9 = add_8 + sym_atan;  add_8 = sym_atan = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:67, code: y = x + c
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         y = l_x_ + add_9;  l_x_ = add_9 = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         return (y,)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
```
</details>

Generated code with `TORCH_LOGS=+output_code python check_math_ops.py`:
<details>
<summary>
C++ code
</summary>

```
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] cpp_fused_add_0 = async_compile.cpp('''
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] #include "/tmp/torchinductor_root/2l/c2ljzlm4sosod7u6lyrroqdba6hmfcyijrric6p4t3fhbcmw6osp.h"
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] extern "C" void kernel(const float* in_ptr0,
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]                        float* out_ptr0,
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]                        const long ks0,
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]                        const long ks1)
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] {
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]     {
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]         #pragma GCC ivdep
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]         for(long x0=static_cast<long>(0L); x0<static_cast<long>(ks0); x0+=static_cast<long>(1L))
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]         {
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]             auto tmp0 = in_ptr0[static_cast<long>(x0)];
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]             auto tmp1 = c10::convert<float>(1.57079632679490 + (std::sqrt(ks1)) + (std::atan(ks1)) + (std::cos(ks1)) + (std::cosh(ks1)) + (std::sin(ks1)) + (std::sinh(ks1)) + (std::tan(ks1)) + (std::tanh(ks1)));
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]             auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]             out_ptr0[static_cast<long>(x0)] = tmp2;
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]         }
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]     }
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] }
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] ''')
```

</details>

<details>
<summary>
Triton code
</summary>

```
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] @pointwise(
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     size_hints=[4],
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     filename=__file__,
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32', 3: 'i32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1), equal_to_1=(), i
ds_of_folded_args=(), divisible_by_8=())]},
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     inductor_meta={'autotune_hints': set(), 'kernel_name': 'triton_poi_fused_add_0', 'mutated_arg_names': []},
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     min_elem_per_thread=0
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] )
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] @triton.jit
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] def triton_(in_ptr0, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     xoffset = tl.program_id(0) * XBLOCK
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     xindex = xoffset + tl.arange(0, XBLOCK)[:]
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     xmask = xindex < xnumel
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     x0 = xindex
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     tmp0 = tl.load(in_ptr0 + (x0), xmask)
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     tmp1 = 1.57079632679490 + (tl.math.sqrt(ks0.to(tl.float32))) + (tl.math.atan((ks0).to(tl.float32))) + (tl.math.cos((ks0).to(tl.float32))) + (tl.math.cosh((ks0).to(tl.float32))) + (tl.math.sin((ks0)
.to(tl.float32))) + (tl.math.sinh((ks0).to(tl.float32))) + (tl.math.tan((ks0).to(tl.float32))) + (tl.math.tanh((ks0).to(tl.float32)))
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     tmp2 = tmp1.to(tl.float32)
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     tmp3 = tmp0 + tmp2
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     tl.store(out_ptr0 + (x0), tmp3, xmask)
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] ''')
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114866
Approved by: https://github.com/peterbell10
2024-01-11 11:52:28 +00:00
cyy
2b5a201aa6 [Exception] [3/N] Replace torch::NotImplementedError and torch::LinAlgError with C10 counterparts. (#116824)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116824
Approved by: https://github.com/albanD
2024-01-11 11:27:04 +00:00
89ef426ba0 Error grad mode op in export API (#117187)
Summary:
This is a reland of https://github.com/pytorch/pytorch/pull/116339
Needed some internal adjustments to make it work properly. Original credit goes to andrewlee302

Test Plan: CI

Differential Revision: D52674706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117187
Approved by: https://github.com/suo
2024-01-11 09:06:59 +00:00
0e1f43c44d [inductor] don't access cluster_dims for too old version of triton (#117192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117192
Approved by: https://github.com/masnesral
2024-01-11 08:37:30 +00:00
3b2ddb6f71 Update TorchBench pinned commit (#117073)
~~To match their recent v4.36.2 release https://github.com/huggingface/transformers/commits/v4.36.2.  This is to fix the KeyError showing on release branch https://github.com/pytorch/pytorch/actions/runs/7451512288/job/20279117324#step:16:1336.  I think this can be updated in main too because the current pinned commit is already 4-month old.~~

Check with @desertfire, trying to update TorchBench pinned commit instead.

The test is also failing in main https://github.com/pytorch/pytorch/actions/runs/7467073391/job/20320251143#step:16:1120, but for some reason, it doesn't surface as a failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117073
Approved by: https://github.com/atalman, https://github.com/thiagocrepaldi, https://github.com/desertfire
2024-01-11 08:35:00 +00:00
1cefc58905 init tls grad_mode/local_dispatch_key set while fork new thread in (#113246)
TorchDynamo will guard grad_mode and the local dispatch key set.
3a429423fc/torch/csrc/dynamo/guards.cpp (L13-L16)

While using ThroughputBenchmark, those TLS states will not be initialized to match the main thread's status.
3a429423fc/torch/csrc/utils/throughput_benchmark-inl.h (L64-L94)

Run following scripts
```
import torch
linear = torch.nn.Linear(128, 128)
compiled = torch.compile(linear)
x = torch.rand(10, 128)
with torch.no_grad(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
    compiled(x)
    compiled(x)

from torch._dynamo import config
config.error_on_recompile = True
from torch.utils import ThroughputBenchmark
with torch.no_grad(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
    bench = ThroughputBenchmark(compiled)
    bench.add_input(x)
    stats = bench.benchmark(
        num_calling_threads=10,
        num_warmup_iters=100,
        num_iters=100,
    )
    print(stats)
```
will lead to 2 re-compile reasons:
```
triggered by the following guard failure(s): ___check_global_state()
triggered by the following guard failure(s): tensor 'x' dispatch key set mismatch.
```

This will trigger a re-compile in TorchDynamo. But since `ThroughputBenchmark` is used for sharing weights across threads, the model should not be changed anymore while running the benchmark. So this PR initializes the TLS state to match the main thread. Then we can use `ThroughputBenchmark` to run TorchDynamo-optimized models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113246
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-01-11 08:31:46 +00:00
9f57cf502f [inductor][cpu]disable pointwise_cat on CPU (#116313)
We observed a negative performance impact of the pointwise_cat optimization on CPU, so we disabled it. We will revisit this later after enabling vectorization on index_expr.

This PR fixes the following three regression issues:
https://github.com/pytorch/pytorch/issues/115827
https://github.com/pytorch/pytorch/issues/112139
https://github.com/pytorch/pytorch/issues/114495

It may, however, cause the performance regression of pytorch_unet to return. Related issue: https://github.com/pytorch/pytorch/issues/115343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116313
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
2024-01-11 08:00:00 +00:00
e3d4f4d14b [ProxyTensor] dedupe symbolic shapes in tracing (#116158)
Dedupes symbolic shapes in proxy tensor tracing. Reusing the existing sym shape avoids inserting spurious sym_size calls, which can interfere with pattern matching and graph passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116158
Approved by: https://github.com/ezyang
2024-01-11 07:15:11 +00:00
6f9fcc79c2 [DCP][BE] Remove unused fields (#116245)
As title

Differential Revision: [D52357730](https://our.internmc.facebook.com/intern/diff/D52357730/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116245
Approved by: https://github.com/wz337
2024-01-11 06:03:09 +00:00
263cc12fab Add Dynamo Reset in PT2E Quantization testing (#117200)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/117012 by adding `torch._dynamo.reset()` in `PT2EQuantizationTestCase._quantize`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117200
Approved by: https://github.com/jerryzh168
2024-01-11 05:53:55 +00:00
5ae221a214 [ONNX] Refactor op consistency tests (#116319)
Fixes #105338

This PR changes the op consistency tests from manually adding ops to a testing list to automatically testing all ops in the registry. It also spots more complex-dtype bugs in the converter.

Overall, this PR provides:
(1) Full test coverage of the ONNX registry
(2) More complete complex-dtype support
(3) Testing only the same dtypes as torchlib
(4) Auto-xfail for unsupported nodes

Follow-up issue: https://github.com/pytorch/pytorch/issues/117118
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116319
Approved by: https://github.com/justinchuby
2024-01-11 05:17:40 +00:00
9b1fac694e [c10d] Add extra sleep in waitForDumpOrTimeout to ensure enough time for all ranks dump debug info (#116545)
We added an extra sleep and made it configurable so that users can set an extra wait to ensure all ranks have dumped their debug info.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116545
Approved by: https://github.com/wconstab
2024-01-11 04:39:57 +00:00
ca23c56efc [codemod] markDynamoStrictTest batch 15 (#117139)
[codemod] markDynamoStrictTest test_spectral_ops
[codemod] markDynamoStrictTest test_fx_experimental
[codemod] markDynamoStrictTest test_foreach
[codemod] markDynamoStrictTest test_decomp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117139
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117114, #117127, #117128, #117129, #117133
2024-01-11 04:28:57 +00:00
9dbe4eae82 [codemod] markDynamoStrictTest batch 14 (#117133)
[codemod] markDynamoStrictTest test_utils
[codemod] markDynamoStrictTest test_unary_ufuncs
[codemod] markDynamoStrictTest test_sparse_semi_structured
[codemod] markDynamoStrictTest test_sparse_csr
[codemod] markDynamoStrictTest test_sparse
[codemod] markDynamoStrictTest test_reductions
[codemod] markDynamoStrictTest test_proxy_tensor
[codemod] markDynamoStrictTest test_prims
[codemod] markDynamoStrictTest test_maskedtensor
[codemod] markDynamoStrictTest test_masked
[codemod] markDynamoStrictTest test_legacy_vmap
[codemod] markDynamoStrictTest test_binary_ufuncs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117133
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117114, #117127, #117128, #117129
2024-01-11 04:28:57 +00:00
a526d0a926 Skip all OpInfo-based test when running with PYTORCH_TEST_WITH_DYNAMO (#117129)
This is a policy decision. These tests:
- are flaky, and fixing the flakiness is unfeasible at the moment
- are highly redundant
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117129
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117114, #117127, #117128
2024-01-11 04:28:42 +00:00
dc43ad4286 add is_grad_enabled check in runtime_wrapper before running with torch.no_grad (#117089)
We observed that `with torch.no_grad()` in runtime_wrapper introduced a ~10% (0.06ms -> 0.066ms) inference performance regression on lennard_jones on CPU.
For inference tasks in the benchmark, grad is already disabled, yet the current runtime_wrapper enters no_grad again and that overhead is counted in the running time.
Therefore, we add an `is_grad_enabled` check in runtime_wrapper before entering torch.no_grad: if grad is already disabled, there is no need to set no_grad again.
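
A minimal sketch of the guarded pattern described above (illustrative only, not the actual runtime_wrapper code):

```python
import torch

def run_wrapped(fn, *args):
    # only pay the cost of toggling grad mode when grad is currently enabled;
    # if the caller already disabled grad, call straight through
    if torch.is_grad_enabled():
        with torch.no_grad():
            return fn(*args)
    return fn(*args)
```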

Before this pr:
1.043x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,lennard_jones,1,**1.043427**,**0.068366**,4.756151,0.941846,45.056819,47.838822,9,1,0,0

After this pr:
1.146x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,lennard_jones,1,**1.146190**,**0.061844**,4.468380,0.936456,44.427264,47.441920,9,1,0,0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117089
Approved by: https://github.com/jgong5, https://github.com/bdhirsh
2024-01-11 03:37:45 +00:00
203430a778 [dynamo] easy - better assert message for EQUALS_MATCH guard (#117006)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117006
Approved by: https://github.com/lezcano
ghstack dependencies: #116723
2024-01-11 03:14:43 +00:00
79de14546d [export] Add TORCH_LOGS=export (#116993)
Adds TORCH_LOGS=export, which currently includes dynamo/dynamic logs. In the future, if we add any logs under the torch/export directory, they will also show up under TORCH_LOGS=export.
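
A hedged usage sketch (the env-var mechanism is the standard TORCH_LOGS handling; setting it from Python before importing torch is shown only for illustration):

```python
# equivalent to running `TORCH_LOGS=export python my_script.py`
import os

os.environ["TORCH_LOGS"] = "export"  # enables the new export logging bucket
import torch  # noqa: E402
```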

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116993
Approved by: https://github.com/avikchaudhuri
2024-01-11 03:02:23 +00:00
6f0f4f12ca [BugFix] Prevent LSTM to run with wrong input shape (#115542)
Fixes #114874
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115542
Approved by: https://github.com/mikaylagawarecki
2024-01-11 02:57:09 +00:00
10509dac85 [C10D] Rename flightrecorder key vars to avoid confusion (#116905)
Key vars are strings used as dict keys (e.g. the variable `duration_s` held the
string "duration_ms").

The `_s` suffix suggested seconds, which was confusing: `duration_s` was a key
string while `duration_ms` was a separate variable holding a time value.

Now the key variable is named `duration_key` and holds "duration_ms".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116905
Approved by: https://github.com/zdevito
2024-01-11 02:57:04 +00:00
1174e82bde Revert "Add _assert_scalar and teach Inductor to codegen it (#114148)"
This reverts commit b6028acfa46363c1d3262a1522741a06c307843f.

Reverted https://github.com/pytorch/pytorch/pull/114148 on behalf of https://github.com/osalpekar due to Going to revert this given the broken torchrec PT2 tests internally: [D52648865](https://www.internalfb.com/diff/D52648865). Logs aren't too clear but @dstaay-fb can help debug as well ([comment](https://github.com/pytorch/pytorch/pull/114148#issuecomment-1886100368))
2024-01-11 02:30:22 +00:00
0f10a706f6 add a docblock for torch._scaled_mm (#117190)
Summary:

Describes the arguments in more detail. Not in user facing docs for now, but a step towards getting there eventually.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117190
Approved by: https://github.com/drisspg
2024-01-11 02:22:44 +00:00
edec54b9de Add torch._lazy_clone to create COW tensors (#113397)
Part of #109833

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #113397
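
A hedged usage sketch of the new API from the title above, with semantics inferred from #109833 (the first write to an alias is assumed to trigger materialization):

```python
import torch

x = torch.ones(3)
y = torch._lazy_clone(x)  # y lazily shares x's storage (copy-on-write)
y.add_(1)                 # writing materializes y's own copy; x stays all-ones
print(x, y)
```
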
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113397
Approved by: https://github.com/ezyang
2024-01-11 01:32:44 +00:00
71343507cd Add super().setup in test_numeric (#117148)
Call super().setUp() so that it will check the disabled test json (and also reset seeds etc)

Test:
Check that test_all_any is skipped in dynamo shard - success
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117148
Approved by: https://github.com/huydhn
2024-01-11 01:03:46 +00:00
cyy
2f17a21b2b [Reland] [13/N] Enable clang-tidy on headers of torch/csrc (#117088)
Reland of #116560 and fixes the issues reported by #116695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117088
Approved by: https://github.com/albanD
2024-01-10 23:58:04 +00:00
8783fe9cf3 [export] Modify SDPA decomposition to decompose _scaled_dot_product_flash_attention_for_cpu (#117097)
Summary: As titled. #115913 added
`_scaled_dot_product_flash_attention_for_cpu` and the export result of
`scaled_dot_product_attention` includes this op. Adding this
decomposition so that it's being decomposed the same way as
`_scaled_dot_product_attention_math`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117097
Approved by: https://github.com/lezcano
2024-01-10 23:46:14 +00:00
f70aeb4ffd Fix backward for reshape() on jagged layout NT (#117137)
Provides symbolic C++-side `reshape_as()` / `reshape()` decomps for jagged layout NTs to make the backwards pass work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117137
Approved by: https://github.com/soulitzer
2024-01-10 23:35:07 +00:00
e10cfdd895 Update matmul requires_grad checks (#117067)
Fixes https://github.com/pytorch/pytorch/issues/116099
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117067
Approved by: https://github.com/lezcano, https://github.com/albanD
ghstack dependencies: #116523, #116710
2024-01-10 23:16:42 +00:00
7e6a04e542 Allow unMarkDynamoStrictTest to work on tests (instead of just classes) (#117128)
Tested locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117128
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117114, #117127
2024-01-10 22:25:40 +00:00
1b8ebb6c42 [codemod] markDynamoStrictTest batch 13 (#117127)
[codemod] markDynamoStrictTest test_overrides
[codemod] markDynamoStrictTest test_namedtuple_return_api
[codemod] markDynamoStrictTest test_jiterator
[codemod] markDynamoStrictTest test_jit_disabled
[codemod] markDynamoStrictTest test_jit_autocast
[codemod] markDynamoStrictTest test_fx_reinplace_pass
[codemod] markDynamoStrictTest test_fx_passes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117127
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117114
2024-01-10 22:25:40 +00:00
79e6d2ae9d Remove incorrect usages of skipIfTorchDynamo (#117114)
Using bare `@skipIfTorchDynamo` is wrong; the correct usage is
`@skipIfTorchDynamo()` or `@skipIfTorchDynamo("msg")`. The incorrect usage
causes tests to silently stop existing, as the sketch below illustrates.
Added an assertion for this and fixed the incorrect callsites.
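
A hedged sketch of why the bare decorator breaks test collection (a simplified stand-in, not the real helper from common_utils):

```python
def skipIfTorchDynamo(msg="skipped under dynamo"):
    # decorator *factory*: it must be called to obtain the real decorator
    def decorator(fn):
        return fn  # the real version wraps fn with a runtime skip check
    return decorator

@skipIfTorchDynamo        # WRONG: the test function itself becomes `msg`,
def test_a():             # and the returned inner `decorator` replaces the test,
    ...                   # so the original test silently stops existing

@skipIfTorchDynamo()      # correct usage
def test_b():
    ...
```
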
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117114
Approved by: https://github.com/voznesenskym
2024-01-10 22:25:31 +00:00
d6540038c0 Fix 0-dim Index in Index Copy decomp (#117065)
Fix for https://github.com/pytorch/pytorch/issues/115931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117065
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-01-10 22:13:43 +00:00
b9293e74a2 [ROCm] Fixes for hipblasLt for mm use case. (#116537)
This PR fixes the accuracy issues for hipblasLt in the mm case on ROCm.
This PR is a follow-up to the integration PRs https://github.com/pytorch/pytorch/pull/114329 and https://github.com/pytorch/pytorch/pull/114890

The accuracy issue arises in the mm use case on ROCm when hipblasLt is enabled and a bias is passed even though none is required. This PR addresses that issue.
Added a unit test for this (bias=None) case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116537
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-01-10 22:13:18 +00:00
7e37f63e5e [Reference Cycle Detector] Ignore FakeTensor in cycle leak detection (#117116)
Summary: Skip FakeTensors since these tensors are not actually using GPU memory. Reference Cycle Detector does not need to generate plots for these tensors.

Test Plan: CI and internal testing.

Differential Revision: D52637209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117116
Approved by: https://github.com/zdevito, https://github.com/tianfengfrank
2024-01-10 21:33:56 +00:00
3e9bb8d4de Run docker release build on final tag (#117131)
To be successful, the docker release workflow needs to run on final tag, after the Release to conda and pypi are complete.

Please refer to: https://github.com/pytorch/pytorch/blob/main/Dockerfile#L76

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117131
Approved by: https://github.com/huydhn, https://github.com/seemethere, https://github.com/malfet
2024-01-10 21:00:45 +00:00
73990c37e6 [c10d] To make ProcessGroupNCCL to use globalStore for coordination (#117075)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117075
Approved by: https://github.com/wconstab
ghstack dependencies: #117074
2024-01-10 20:39:53 +00:00
180425df9b [c10d] Add a recursive method to get the inner most store (#117074)
In c10d PG initialization, we wrap TCPStore with multiple layers of PrefixStore which adds layers of prefix.

One example is:
"default_pg/0//cuda//timeout_dump"
When initializing the default PG, because no store is passed, we first add the prefix "default_pg" to the TCPStore returned from rendezvous:

bdeaaad70c/torch/distributed/distributed_c10d.py (L1240)

We then add pg_name (aka 0) bdeaaad70c/torch/distributed/distributed_c10d.py (L1376) and device (aka cuda) bdeaaad70c/torch/distributed/distributed_c10d.py (L1387)

to the prefix. Then, when we call store_->set("timeout_dump"), the actual key used for writing into the TCPStore is "default_pg/0//cuda//timeout_dump".

For sub-PGs, things get even more interesting: we put the store wrapped with the default PG name into a cache:
bdeaaad70c/torch/distributed/distributed_c10d.py (L1517)

When creating each sub-PG, its PG name is appended right after the cached store's prefix. Example keys are:
'default_pg/0//10//cuda//timeout_dump', 'default_pg/0//12//cuda//timeout_dump', 'default_pg/0//38//cuda//timeout_dump', 'default_pg/0//39//cuda//timeout_dump'. (10, 12, 38 and 39 are all PG names of each subPG created)

The reason the number in the name is bumped up so high is that every sub-PG creation requires all ranks to call the API together, and the global variable used for the PG name is bumped up monotonically:

bdeaaad70c/torch/distributed/distributed_c10d.py (L3666)

Similar things happen for using hashing for PG names.

This is a potential issue: each sub-PG has its own instance of ProcessGroupNCCL, and if we want to set something global to notify all sub-PGs (and all ranks), the added prefix causes bugs. For example, if on sub-PG 1 we set a value in the TCPStore with the key ('default_pg/0//1//cuda//timeout_dump') while the default PG instances check the TCPStore using the key ('default_pg/0//cuda//timeout_dump'), the default PG instances will never see the signal. So in this PR, we add a new API to PrefixStore that retrieves the innermost non-PrefixStore for set and check. The next PR will make the corresponding changes in the NCCL watchdog.
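
A minimal sketch of the unwrapping idea, assuming the wrapped store is reachable from a PrefixStore (the attribute name below is illustrative, not necessarily the API added here):

```python
import torch.distributed as dist

def innermost_store(store):
    # peel off PrefixStore layers until we reach the underlying TCPStore/FileStore
    while isinstance(store, dist.PrefixStore):
        store = store.underlying_store  # illustrative accessor for the wrapped store
    return store
```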

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117074
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-01-10 20:22:55 +00:00
6f8fc42dba [inductor] Add support for tl.make_block_ptr (#116079)
On A100 this is a small regression:
![image](https://github.com/pytorch/pytorch/assets/533820/b30eee9d-c0fe-4123-99da-d554fc5d0171)

So I will leave it disabled by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116079
Approved by: https://github.com/shunting314
2024-01-10 20:02:49 +00:00
9bf9586c6d Pytest do not rewrite assertions by default (#117060)
From https://pytest.org/en/7.4.x/how-to/assert.html#advanced-assertion-introspection
pytest only rewrites test modules directly discovered by its test collection process, so asserts in supporting modules which are not themselves test modules will not be rewritten.

In CI we usually call the test file directly (`python test_ops.py`), which calls run_test, which calls pytest.main; the test module is therefore already imported as `__main__`, so pytest does not import it itself and relies on the already-imported module.  (#95844)

However, calling `pytest test_ops.py` will rely on pytest to import the module, resulting in asserts being rewritten, so I add --assert=plain by default into the opts so we don't have to worry about this anymore.  Another way to make pytest stop assertion rewriting in a file is to include `PYTEST_DONT_REWRITE` somewhere in the file.
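
A small sketch of the per-file opt-out mentioned above (as documented by pytest, rewriting is skipped for any module whose docstring contains the magic string):

```python
"""Tests for some supporting module.  PYTEST_DONT_REWRITE"""


def test_example():
    assert 1 + 1 == 2  # plain assert, left untouched by pytest's rewriter
```
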
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117060
Approved by: https://github.com/zou3519
2024-01-10 20:02:45 +00:00
fad7734fa7 [AOTI] Remove caching for compiled model.so (#117087)
Summary: Oleg found that the model.so caching does not include the model weights when computing the hash key, which can cause incorrect model.so reuse. Since caching is not really necessary in AOT mode, let's just remove it.

Test Plan: CI

Differential Revision: D52647555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117087
Approved by: https://github.com/frank-wei, https://github.com/chenyang78, https://github.com/khabinov
2024-01-10 19:53:27 +00:00
e4e80dc9b3 [FSDP] sharded grad scaler: copy found_inf after waiting on async reduce_all (#115710)
**Expected behavior**: when rank 0 has an inf grad, ranks 1...k should get `found_inf=1` after `dist.reduce_all`
**Bug addressed in this PR**: for CPU-offloaded param.grad, when rank 0 has inf, ranks 1...k would not get found_inf=1. This is because `found_inf` was copied before `future.wait` on the async `dist.reduce_all`
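
A minimal runnable sketch of the ordering fix on a single-rank gloo group (names illustrative, not the actual ShardedGradScaler code):

```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

found_inf = torch.zeros(1)
work = dist.all_reduce(found_inf, async_op=True)
work.wait()                      # wait on the async reduce *before* reading the result
found_inf_cpu = torch.empty(1)
found_inf_cpu.copy_(found_inf)   # copy only after the reduction has completed

dist.destroy_process_group()
```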

repro the bug using the newly added unit test: `pytest test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py -k test_sharded_grad_scaler_found_inf`

```
  File "/data/users/weif/pytorch/test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py", line 320, in _test_sharded_grad_scaler_found_inf
    self.assertEqual(
  File "/data/users/weif/pytorch/torch/testing/_internal/common_utils.py", line 3576, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Scalars are not close!

Expected 1.0 but got 2.0.
Absolute difference: 1.0 (up to 1e-05 allowed)
Relative difference: 1.0 (up to 1.3e-06 allowed)
rank: 0 iter: 0 expect origin scale 2.0 to be backed off by 0.5 but got 2.0
```

verify the bug is fixed: `pytest test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py -k test_sharded_grad_scaler_found_inf`

```
test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py dist init r=1, world=8
dist init r=3, world=8
dist init r=7, world=8
dist init r=4, world=8
dist init r=6, world=8
dist init r=2, world=8
dist init r=0, world=8
dist init r=5, world=8
NCCL version 2.19.3+cuda12.0
.                                                                                                                 [100%]

====================================================================== 1 passed, 19 deselected in 27.43s =========================

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115710
Approved by: https://github.com/awgu
2024-01-10 19:17:25 +00:00
9eb842cbd6 Compiled autograd: Lift autograd functions' backward and provide default key for custom autograd functions (#115573)
This PR adds support for torch.autograd.Function subclasses in compiled autograd. We do this by:
- Creating a uid for all torch.autograd.Function via its metaclass. This uid is used in the compiled autograd key, which is a subset of the cache key to the compiled graph
- "Lifting" the backward/saved_tensors, having them as input arguments in the compiled graph
  - Creating proxies to track the backward's inputs and outputs. Since the backward's outputs (grads) have to match the forward's inputs, we pass the node's `input_info` (forward's input sizes) to build the proxies tracking the backward's outputs.
  - Use a `FakeContext` class as a replacement for the autograd node's context object (`BackwardCFunction`) during tracing, only support passing saved_tensors from the forward to the backward
  - Index each backward, to support multiple torch.autograd.Functions in the same graph
  - Special case for `CompiledFunctionBackward`: lifting CompiledFunction would fail 4 tests and requires some skipfiles changes that I'd rather do in a separate PR

Example graph: test_custom_fn_saved_multiple_tensors (eager fw + compiled autograd)
```python
class MyFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)
        return torch.sin(x), torch.sin(y)

    @staticmethod
    def backward(ctx, gO_x, gO_y):
        (x, y) = ctx.saved_tensors
        return gO_x * torch.cos(x), gO_y * torch.cos(y)
```
The backwards is lifted via `getitem_5` and `call_backward`
```python
# Compiled autograd graph
 ===== Compiled autograd graph =====
 <eval_with_key>.0 class CompiledAutograd(torch.nn.Module):
    def forward(self, inputs, sizes, hooks):
        # No stacktrace found for following nodes
        getitem: "f32[]" = inputs[0]
        getitem_1: "f32[10]" = inputs[1]
        getitem_2: "f32[10]" = inputs[2]
        getitem_3: "f32[10]" = inputs[3]
        getitem_4: "f32[10]" = inputs[4];  inputs = None
        expand: "f32[10]" = torch.ops.aten.expand.default(getitem, [10]);  getitem = None
        mul: "f32[10]" = torch.ops.aten.mul.Tensor(expand, getitem_2);  getitem_2 = None
        mul_1: "f32[10]" = torch.ops.aten.mul.Tensor(expand, getitem_1);  expand = getitem_1 = None
        getitem_5 = hooks[0];  hooks = None
        call_backward = torch__dynamo_external_utils_call_backward(getitem_5, (getitem_3, getitem_4), mul_1, mul);  getitem_5 = mul_1 = mul = None
        getitem_6: "f32[10]" = call_backward[0]
        getitem_7: "f32[10]" = call_backward[1];  call_backward = None
        accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_4, getitem_7);  getitem_4 = getitem_7 = None
        accumulate_grad__1 = torch.ops.inductor.accumulate_grad_.default(getitem_3, getitem_6);  getitem_3 = getitem_6 = None
        return []
```

then is later inlined by dynamo
```python
# Dynamo graph
 ===== __compiled_fn_0 =====
 <eval_with_key>.1 class GraphModule(torch.nn.Module):
    def forward(self, L_inputs_0_ : torch.Tensor, L_inputs_1_ : torch.Tensor, L_inputs_2_ : torch.Tensor, L_inputs_3_ : torch.Tensor, L_inputs_4_ : torch.Tensor):
        getitem = L_inputs_0_
        getitem_1 = L_inputs_1_
        getitem_2 = L_inputs_2_
        x = L_inputs_3_
        y = L_inputs_4_

        # File: <eval_with_key>.0:10, code: expand = torch.ops.aten.expand.default(getitem, [10]);  getitem = None
        expand = torch.ops.aten.expand.default(getitem, [10]);  getitem = None

        # File: <eval_with_key>.0:11, code: mul = torch.ops.aten.mul.Tensor(expand, getitem_2);  getitem_2 = None
        mul = torch.ops.aten.mul.Tensor(expand, getitem_2);  getitem_2 = None

        # File: <eval_with_key>.0:12, code: mul_1 = torch.ops.aten.mul.Tensor(expand, getitem_1);  expand = getitem_1 = None
        mul_1 = torch.ops.aten.mul.Tensor(expand, getitem_1);  expand = getitem_1 = None

        # File: /data/users/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py:412, code: return gO_x * torch.cos(x), gO_y * torch.cos(y)
        cos = torch.cos(x)
        getitem_6 = mul_1 * cos;  mul_1 = cos = None
        cos_1 = torch.cos(y)
        getitem_7 = mul * cos_1;  mul = cos_1 = None

        # File: <eval_with_key>.0:17, code: accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_4, getitem_7);  getitem_4 = getitem_7 = None
        accumulate_grad__default = torch.ops.inductor.accumulate_grad_.default(y, getitem_7);  y = getitem_7 = None

        # File: <eval_with_key>.0:18, code: accumulate_grad__1 = torch.ops.inductor.accumulate_grad_.default(getitem_3, getitem_6);  getitem_3 = getitem_6 = None
        accumulate_grad__default_1 = torch.ops.inductor.accumulate_grad_.default(x, getitem_6);  x = getitem_6 = None
        return ()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115573
Approved by: https://github.com/jansel
2024-01-10 18:01:28 +00:00
b4a35632f9 Add function to materialize COW storages (#117053)
Summary: From Kurt Mohler, see https://github.com/pytorch/pytorch/pull/113396 (manually imported due to ghimport problems)

Test Plan: sandcastle, OSS CI

Differential Revision: D52610522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117053
Approved by: https://github.com/malfet, https://github.com/kurtamohler
2024-01-10 15:34:16 +00:00
ec98df70f3 [CPU] _vec_softmax_backward, _vec_log_softmax_backward, _vec_logsoftmax: fix CHUNK_SIZE to avoid unnecessarily large allocation (#117029)
Similar to https://github.com/pytorch/pytorch/pull/116990, fixes `CHUNK_SIZE` in `_vec_softmax_backward`, `_vec_log_softmax_backward`, `_vec_logsoftmax`, where `CHUNK_SIZE` is set as
```cpp
int64_t BLOCK_SIZE = 128 * 1024;
int64_t CHUNK_SIZE = std::max<int64_t>(BLOCK_SIZE / dim_size / sizeof(scalar_t), Vec::size());
CHUNK_SIZE = CHUNK_SIZE / Vec::size() * Vec::size();
```
where `BLOCK_SIZE / dim_size / sizeof(scalar_t)` computes the maximum number of inner-dim elements that can fit into the L2 cache, assuming an L2 cache of 128KB, and `CHUNK_SIZE / Vec::size() * Vec::size()` rounds `CHUNK_SIZE` down to a multiple of `Vec::size()`.

Set `CHUNK_SIZE` to the minimum of `CHUNK_SIZE` and `inner_size` to avoid an unnecessarily large `CHUNK_SIZE` and unnecessarily large allocations for the `max` and `tmp_sum` buffers.
```cpp
auto buffer = std::make_unique<scalar_t []>(CHUNK_SIZE * 2);
scalar_t* input_max_data = buffer.get();
scalar_t* tmp_sum_data = buffer.get() + CHUNK_SIZE;
```
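
A schematic Python rendering of the sizing logic above (parameter names are illustrative and the rounding/clamp order is simplified; the actual change lives in the C++ kernels):

```python
def chunk_size(block_size, dim_size, elem_size, vec_size, inner_size):
    c = max(block_size // (dim_size * elem_size), vec_size)  # rows that fit in L2
    c = c // vec_size * vec_size                              # multiple of Vec::size()
    return min(c, inner_size)                                 # the fix: clamp to inner_size

# e.g. dim_size=1 with fp32 used to size the scratch buffers at 32768 elements
# even when inner_size was tiny; with the clamp the buffers track inner_size instead
print(chunk_size(block_size=128 * 1024, dim_size=1, elem_size=4, vec_size=16, inner_size=8))
```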

### Performance

Perf data of `_vec_logsoftmax` collected for `dim_size` in range [2^0, 2^9] and `outer_size` in range [2^0, 2^3]. To measure the benefit from avoiding unnecessarily large allocation, values of `outer_size` were chosen such that `outer_size` is less than `BLOCK_SIZE / dim_size / sizeof(scalar_t)` for all values of `outer_size`.

Tested on 28 physical cores/socket, 1 socket on Skylake.

| **dim_size** 	| **BLOCK_SIZE / dim_size / sizeof(scalar_t)** 	| **input shape: (dim_size, inner_size)** 	| **Baseline (original implementation)** 	| **Optimized** 	| **Speedup Ratio (Baseline/Optimized)** 	|
|--------------	|----------------------------------------------	|-----------------------------------------	|----------------------------------------	|---------------	|----------------------------------------	|
| 1            	| 32768                                        	| (1, 1)                                  	| 0.012578964                            	| 0.003523827   	| **3.569689**                           	|
|              	|                                              	| (1, 2)                                  	| 0.012645721                            	| 0.003550053   	| **3.562122**                           	|
|              	|                                              	| (1, 4)                                  	| 0.01303196                             	| 0.003521442   	| **3.700745**                           	|
|              	|                                              	| (1, 8)                                  	| 0.01275301                             	| 0.003552437   	| **3.589933**                           	|
| 2            	| 16384                                        	| (2, 1)                                  	| 0.008230209                            	| 0.003688335   	| **2.231416**                           	|
|              	|                                              	| (2, 2)                                  	| 0.00821352                             	| 0.003502369   	| **2.345133**                           	|
|              	|                                              	| (2, 4)                                  	| 0.008280277                            	| 0.003442764   	| **2.405125**                           	|
|              	|                                              	| (2, 8)                                  	| 0.0086236                              	| 0.003490448   	| **2.470628**                           	|
| 4            	| 8192                                         	| (4, 1)                                  	| 0.005865097                            	| 0.003454685   	| **1.697723**                           	|
|              	|                                              	| (4, 2)                                  	| 0.005846024                            	| 0.003490448   	| **1.674863**                           	|
|              	|                                              	| (4, 4)                                  	| 0.006036758                            	| 0.0035429     	| **1.703903**                           	|
|              	|                                              	| (4, 8)                                  	| 0.005993843                            	| 0.003669262   	| **1.633528**                           	|
| 8            	| 4096                                         	| (8, 1)                                  	| 0.00469923                             	| 0.003535748   	| **1.329063**                           	|
|              	|                                              	| (8, 2)                                  	| 0.004696846                            	| 0.003600121   	| **1.304636**                           	|
|              	|                                              	| (8, 4)                                  	| 0.005483627                            	| 0.003721714   	| **1.473414**                           	|
|              	|                                              	| (8, 8)                                  	| 0.005180836                            	| 0.00389576    	| **1.329865**                           	|
| 16           	| 2048                                         	| (16, 1)                                 	| 0.00446558                             	| 0.003738403   	| **1.194515**                           	|
|              	|                                              	| (16, 2)                                 	| 0.004258156                            	| 0.00382185    	| **1.114161**                           	|
|              	|                                              	| (16, 4)                                 	| 0.004422665                            	| 0.004007816   	| **1.10351**                            	|
|              	|                                              	| (16, 8)                                 	| 0.004923344                            	| 0.004308224   	| **1.142778**                           	|
| 32           	| 1024                                         	| (32 , 1)                                	| 0.004467964                            	| 0.00402689    	| **1.109532**                           	|
|              	|                                              	| (32, 2)                                 	| 0.004336834                            	| 0.004196167   	| 1.033523                               	|
|              	|                                              	| (32, 4)                                 	| 0.004661083                            	| 0.004513264   	| 1.032752                               	|
|              	|                                              	| (32, 8)                                 	| 0.005385876                            	| 0.005121231   	| **1.051676**                           	|
| 64           	| 512                                          	| (64, 1)                                 	| 0.004725456                            	| 0.00462532    	| 1.021649                               	|
|              	|                                              	| (64, 2)                                 	| 0.005085468                            	| 0.004930496   	| 1.031431                               	|
|              	|                                              	| (64, 4)                                 	| 0.005791187                            	| 0.005600452   	| 1.034057                               	|
|              	|                                              	| (64, 8)                                 	| 0.007030964                            	| 0.006783009   	| 1.036555                               	|
| 128          	| 256                                          	| (128, 1)                                	| 0.005710125                            	| 0.005786419   	| _0.986815_                             	|
|              	|                                              	| (128, 2)                                	| 0.006377697                            	| 0.006473064   	| _0.985267_                             	|
|              	|                                              	| (128, 4)                                	| 0.00754118                             	| 0.007488728   	| 1.007004                               	|
|              	|                                              	| (128, 8)                                	| 0.009772778                            	| 0.009725094   	| 1.004903                               	|
| 256          	| 128                                          	| (256 , 1)                               	| 0.007708073                            	| 0.007715225   	| _0.999073_                             	|
|              	|                                              	| (256, 2)                                	| 0.008938313                            	| 0.009071827   	| _0.985283_                             	|
|              	|                                              	| (256, 4)                                	| 0.011227131                            	| 0.011045933   	| 1.016404                               	|
|              	|                                              	| (256, 8)                                	| 0.016131401                            	| 0.016396046   	| _0.983859_                             	|
| 512          	| 64                                           	| (512, 1)                                	| 0.011544228                            	| 0.011487007   	| 1.004981                               	|
|              	|                                              	| (512, 2)                                	| 0.014071465                            	| 0.014281273   	| _0.985309_                             	|
|              	|                                              	| (512, 4)                                	| 0.019016266                            	| 0.018930435   	| 1.004534                               	|
|              	|                                              	| (512, 8)                                	| 0.028913021                            	| 0.028159618   	| 1.026755                               	|

A bolded speedup ratio indicates a speedup greater than 5%, which we treat as significant. Especially for smaller `dim_size` (1, 2, 4, 8, 16, 32), we observe significant speedups (greater than 5% better, **bolded**): the smaller the `dim_size`, the larger `BLOCK_SIZE / dim_size / sizeof(scalar_t)` becomes, and hence the larger the unnecessary allocation.

For larger `dim_size` (64, 128, 256, 512), we also observe insignificantly better (less than 5% better, unbolded) performance.
For some shapes such as {128, 1}, we also observe insignificantly worse (less than 5% worse, _italicized_) performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117029
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-10 15:04:34 +00:00
e0da05e1ba [codemod] markDynamoStrictTest dynamo/* (#117077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117077
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117076
2024-01-10 14:37:52 +00:00
04f788f925 Unflake test_auto_functionalize (#117076)
Better cleanup of the custom op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117076
Approved by: https://github.com/bdhirsh
2024-01-10 14:37:52 +00:00
5046b4981d [ROCm] Add opt-in option for inductor's layout optimisation on ROCm (#116329)
Disabling layout optimisation in inductor for ROCm (https://github.com/pytorch/pytorch/pull/111474) was a bit shortsighted.

If there are workloads that heavily use NHWC we will see a perf drop from additional transpose ops. Instead of disabling this entirely on ROCm this is now an opt-in feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116329
Approved by: https://github.com/jansel, https://github.com/eellison
2024-01-10 13:56:27 +00:00
94db6578cc [Quant] Add dynamic quantization config for x86 inductor backend (#115337)
**Description**
Add dynamic quantization config for x86 inductor backend.
To support the QKV structure in self-attention, we removed an assertion in port-metadata-pass that required a single dequantize node after a quantize node.

**Test plan**
```
python test/test_quantization.py -k TestQuantizePT2EX86Inductor.test_dynamic_quant_linear
python test/test_quantization.py -k TestQuantizePT2EX86Inductor.test_qat_dynamic_quant_linear
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115337
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-01-10 11:33:37 +00:00
558cc69641 Fix torch function kwarg dispatch (#117083)
Previously, kwargs were incorrectly dispatched by passing them as the true kwargs to the torch function call. To fix this, the kwargs of the original torch op need to be stored in a dictionary and passed as an argument to the torch function implementation.
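
A hedged sketch of the protocol this fix targets: in `__torch_function__`, the original op's kwargs arrive as a dict argument rather than as real keyword arguments.

```python
import torch

class LoggingTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}  # kwargs of the intercepted call, packed into a dict
        print(f"dispatching {getattr(func, '__name__', func)} with kwargs={kwargs}")
        return super().__torch_function__(func, types, args, kwargs)

t = torch.randn(3).as_subclass(LoggingTensor)
torch.add(t, 1, alpha=2)  # prints: dispatching add with kwargs={'alpha': 2}
```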

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117083
Approved by: https://github.com/drisspg
2024-01-10 10:55:10 +00:00
e88d0648ed Revert "[export] Error grad mode op in export API (#116339)"
This reverts commit 943179852102ac0be27aeae5a2c0272e25ccf90e.

Reverted https://github.com/pytorch/pytorch/pull/116339 on behalf of https://github.com/tugsbayasgalan due to PR below this in the stack broke torchrec/sigmoid tests ([comment](https://github.com/pytorch/pytorch/pull/116339#issuecomment-1884599027))
2024-01-10 10:42:33 +00:00
77ecb3d725 Revert "[export] Exempt autograd ops for predispatch export (#116527)"
This reverts commit af2ded23eb398e14cf380b39d46bfa786d26b3ee.

Reverted https://github.com/pytorch/pytorch/pull/116527 on behalf of https://github.com/tugsbayasgalan due to Need to revert this to revert the bottom diff ([comment](https://github.com/pytorch/pytorch/pull/116527#issuecomment-1884592658))
2024-01-10 10:38:27 +00:00
20f394f10a [LLVM/TensorExpr] Update for an API change in LLVM 18. (#117086)
`registerPassBuilderCallbacks` now takes an extra bool argument to print extra information. It is currently set to false so as not to change functional behaviour.

Relevant LLVM commit:
ffb1f20e0d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117086
Approved by: https://github.com/bertmaher
2024-01-10 09:08:42 +00:00
cyy
20f769544c [12/N] Apply clang-tidy and fix warnings in headers of torch/csrc (#116486)
This PR follows #116751.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116486
Approved by: https://github.com/albanD
2024-01-10 08:48:14 +00:00
90df7c008a Migrate state_dict bc test to OptimizerInfo, increase coverage (#116500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116500
Approved by: https://github.com/albanD
2024-01-10 08:19:27 +00:00
19e93b85b9 Fixes last_dim stride check for singleton dimensions (#117001)
Fixes #116333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117001
Approved by: https://github.com/cpuhrsch
2024-01-10 04:46:49 +00:00
8bcdde5058 Support uint{16,32,64} deterministic empty fill and scalar Python binding handling (#116807)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116807
Approved by: https://github.com/albanD
ghstack dependencies: #116805, #116806
2024-01-10 02:17:23 +00:00
43a23a704a Support uint{16,32,64} copy (#116806)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116806
Approved by: https://github.com/albanD
ghstack dependencies: #116805
2024-01-10 02:17:23 +00:00
2e983fcfd3 Support unsigned int for randint, item, equality, fill, iinfo, tensor (#116805)
These are some basic utilities that are often used for testing.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116805
Approved by: https://github.com/albanD
2024-01-10 02:17:23 +00:00
4a10e9eed4 update build guide to use mkl-static. (#116946)
# Background:
We found that the current build guide uses MKL dynamic linking, which may trigger an MKL link issue.

Details:
In the build environment, libtorch_cpu.so dynamically links to the system MKL binaries by default.
If users install another version of the MKL library, this may lead to MKL symbol conflicts.

I also checked that the released PyTorch binaries use static MKL linking. The build script shows it: https://github.com/pytorch/builder/blob/main/common/install_mkl.sh#L10

# Solution:
Update the build guide to use MKL static linking, aligning it with the build script.

Conda install command docs:
https://anaconda.org/intel/mkl-static
https://anaconda.org/intel/mkl-include

# Validation
No MKL library dependencies remain after using `conda install intel::mkl-static intel::mkl-include`.
## Windows
![image](https://github.com/pytorch/pytorch/assets/8433590/cc554ded-d827-4de5-81c6-cc3039155580)

## Linux
<img width="959" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/79766ad8-4ba2-4ff1-adc9-63affd8d419a">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116946
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-01-10 01:35:02 +00:00
b4f1ab4505 Docs: fix docstring errors in ddp_comm_hooks (#116866)
Reopens #115272
Fixes ddp_comm_hooks errors in https://github.com/pytorch/pytorch/issues/112644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116866
Approved by: https://github.com/awgu
2024-01-10 01:24:06 +00:00
16d69290c6 Use view name instead of view_copy name for functional inverses (#117056)
Ex: `unsqueeze_copy_inverse()` -> `unsqueeze_inverse()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117056
Approved by: https://github.com/bdhirsh
2024-01-10 00:52:36 +00:00
fdfdba7c13 [BE] Use __builtin_overflow_sub when available (#117015)
Which is faster than the ternary version.

Following script
```python
import torch
from timeit import default_timer

global_setup = """
"""
setup = """
c10::SymInt a = c10::SymInt(123);
"""
code = """
-a;
"""

from torch.utils.benchmark import Timer

t = Timer(stmt=code, setup=setup, global_setup=global_setup, language="c++", timer=default_timer)

print(t.blocked_autorange())
```

reports a 4.17 ns median time before and 3.61 ns after on x86_64 Linux, and 2.02 ns before and 1.91 ns after on Apple M1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117015
Approved by: https://github.com/albanD
2024-01-10 00:50:09 +00:00
a6325ad86c Fix cuInit test on Windows (#117055)
By changing library name from `libcuda.so.1` to `nvcuda.dll` on Windows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117055
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/atalman
2024-01-10 00:45:18 +00:00
907e80239d Fix broken lint after #117052 (#117080)
https://hud.pytorch.org/pr/pytorch/pytorch/117052#20318344490 breaks lint, forward fixing with `lintrunner -a`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117080
Approved by: https://github.com/atalman, https://github.com/clee2000, https://github.com/Skylion007
2024-01-10 00:44:19 +00:00
d9fc438083 [cpu][vec512][double] unsigned left shift for mask (#117021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117021
Approved by: https://github.com/leslie-fang-intel
2024-01-10 00:32:15 +00:00
0b72ce1bd1 Add at::sparse::full_coo_indices utility function. (#116352)
As in the title.

`full_coo_indices(shape)` should be used instead of `ones(shape).nonzero().T` as `full_coo_indices` is exponentially more efficient.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116352
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #116206
2024-01-10 00:07:09 +00:00
152bde6e27 [MPS][BE] Move kernel_index_offset to HistogramKernel (#117037)
As it has almost nothing in common with the rest of the indexing primitives other than the name.
Also, use `mtl_dispatch1DJob` to dispatch the work and check that the tensor
size is less than 4GB, as this function would not work with larger
tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117037
Approved by: https://github.com/kulinseth
ghstack dependencies: #116903, #116904, #116915, #116940, #116942
2024-01-10 00:02:14 +00:00
8918ce4087 Add TORCH_LOGS_OUT to direct TORCH_LOGS output (#117005)
Twice now, while debugging accuracy bugs, I got dynamo logs that are 100k lines long, and it is impossible to read them on the terminal. Let's add an option to write them to a file.
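
A hedged usage sketch (env-var names from the title; setting them in Python before importing torch is shown only for illustration):

```python
# equivalent to `TORCH_LOGS=dynamo TORCH_LOGS_OUT=/tmp/dynamo_logs.txt python repro.py`
import os

os.environ["TORCH_LOGS"] = "dynamo"
os.environ["TORCH_LOGS_OUT"] = "/tmp/dynamo_logs.txt"
import torch  # noqa: E402
```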

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117005
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #116894
2024-01-09 23:46:22 +00:00
b6028acfa4 Add _assert_scalar and teach Inductor to codegen it (#114148)
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.

So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.

I need to customize the codegen for this operator, so I decide to directly implement it in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
2024-01-09 23:21:26 +00:00
d2033a0639 [quant][pt2e][xnnpack_quantizer] add support for linear_relu (#117052)
Add support for linear_relu annotation in XNNPACKQuantizer; this allows the input to linear and the output of relu to share the same quantization parameters.

Differential Revision: [D52574086](https://our.internmc.facebook.com/intern/diff/D52574086/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117052
Approved by: https://github.com/jerryzh168, https://github.com/digantdesai
2024-01-09 23:19:52 +00:00
4f3d698cac Impl. call_hasattr for BaseUserFunctionVariable (#116049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116049
Approved by: https://github.com/zou3519
2024-01-09 22:58:58 +00:00
8a6c43fbe5 add predispatch_pass to hold pass functions to be run when config.is_predispatch is true (#116788)
Summary:
config.is_predispatch is a flag that instructs inductor to enable predispatch
tracing (high-level pre-dispatch IR). Currently, there is no dedicated pass
for this flag.

In this commit, for better pass function management, we created
`predispatch_pass` to hold the pass functions to be run on the high level
pre-dispatch IR-based graphs.

Test Plan: CI

Differential Revision: D52491332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116788
Approved by: https://github.com/frank-wei
2024-01-09 22:42:24 +00:00
39ae4d8cd7 Revert "[inductor] Add support for tl.make_block_ptr (#116079)"
This reverts commit d527df707acce59bd432763c94399aa7b3fe38cf.

Reverted https://github.com/pytorch/pytorch/pull/116079 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/116079#issuecomment-1883890254))
2024-01-09 22:19:57 +00:00
848cfe8d45 [reland] unflatten_tensor on compute stream for DTensorExtension (#117020)
Reland of https://github.com/pytorch/pytorch/pull/116559, which was reverted internally.

The underlying reason for the revert is that torch._dynamo.disable can't be used inside the
pytorch codebase, as it conflicts with torch.deploy: although the latter
only runs some inference, it somehow takes a weird dependency on FSDP.

We have seen this issue with our functional collectives: we can't
use any dynamo components, otherwise torch.deploy would complain.

Verified internally that after removing torch._dynamo.disable the test
passed again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117020
Approved by: https://github.com/awgu
2024-01-09 21:25:15 +00:00
1dd4813328 [BE][dynamo]: Add operator is and is not tests to dynamo tests (#116397)
Adds tests for an operator that was not unit tested in our test suite - improves coverage. Inspired by looking into https://github.com/pytorch/pytorch/pull/116397 after @XuehaiPan brought up some issues with builtins in #116389

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116397
Approved by: https://github.com/albanD, https://github.com/jansel
2024-01-09 21:13:22 +00:00
5866284d4a Make not passing use_reentrant back to warning instead of erroring and clarify docs (#116710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116710
Approved by: https://github.com/albanD
ghstack dependencies: #116523
2024-01-09 20:58:49 +00:00
4e666ba011 Update torch.autograd.graph logging to not print out grad_output (#116523)
Instead of printing the tensor's data print the dtype and shape metadata of the tensor.
```
Executing: <VarMeanBackward0 object at 0x1352d0e20> with grad_outputs: [None,f32[]]
```
This is important in order to avoid doing a cuda sync and also useful to reduce verbosity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116523
Approved by: https://github.com/albanD
2024-01-09 20:40:02 +00:00
29ae4f22bf Enables private_use_one lazy_init by PrivateUse1HooksInterface (#115067)
Fixes https://github.com/pytorch/pytorch/issues/112369

In my last PR (https://github.com/pytorch/pytorch/pull/113343), I wanted to implement lazy_init for other devices through `REGISTER_LAZY_INIT`. But this might be too big of a change.

Recently, my team found that `torch.load` without `lazy_init` also results in the same error.
bbd5b935e4/torch/csrc/Storage.cpp (L319-L321)
bbd5b935e4/torch/csrc/Storage.cpp (L334-L335)

So, I want to use `PrivateUse1HooksInterface` to implement lazy_init for `PrivateUse1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115067
Approved by: https://github.com/ezyang
2024-01-09 20:12:08 +00:00
ab1ac43752 [pytree] extend pytree operations with is_leaf prediction function (#116419)
Add an extra `is_leaf` predicate function to pytree operations.
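
A hedged usage sketch, assuming the new keyword is exposed on the flatten entry point as described:

```python
import torch.utils._pytree as pytree

tree = {"a": [1, 2], "b": (3, 4)}
# treat tuples as leaves instead of recursing into them
leaves, spec = pytree.tree_flatten(tree, is_leaf=lambda x: isinstance(x, tuple))
print(leaves)  # [1, 2, (3, 4)]
```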

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116419
Approved by: https://github.com/zou3519
2024-01-09 19:50:08 +00:00
suo
902807a86d enable pytree tests in fbcode (#116787)
these were not runnable before

Differential Revision: [D52547846](https://our.internmc.facebook.com/intern/diff/D52547846/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116787
Approved by: https://github.com/zou3519
2024-01-09 19:12:43 +00:00
b4eb97a072 Revert "[C10D] Add GIL checker to NCCL watchdog monitor (#116798)"
This reverts commit 830ace33bcc0291e5c615ad1727799b1d04067cd.

Reverted https://github.com/pytorch/pytorch/pull/116798 on behalf of https://github.com/osalpekar due to This seems to crash torchrec inference unittests: [D52583939](https://www.internalfb.com/diff/D52583939) ([comment](https://github.com/pytorch/pytorch/pull/116798#issuecomment-1883624022))
2024-01-09 19:09:02 +00:00
b8374314cc [AOTI] Update AOTI runner util (#116971)
Summary: Update the runner used in integration tests after https://github.com/pytorch/torchrec/pull/1604

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116971
Approved by: https://github.com/chenyang78
2024-01-09 19:07:54 +00:00
d527df707a [inductor] Add support for tl.make_block_ptr (#116079)
On A100 this is a small regression:
![image](https://github.com/pytorch/pytorch/assets/533820/b30eee9d-c0fe-4123-99da-d554fc5d0171)

So I will leave it disabled by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116079
Approved by: https://github.com/shunting314
ghstack dependencies: #116078
2024-01-09 19:06:51 +00:00
94363cee41 [inductor] Indexing refactors (#116078)
Perf differences seems to be noise:
![image](https://github.com/pytorch/pytorch/assets/533820/d7a36574-0388-46e4-bd4d-b274d37cab2b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116078
Approved by: https://github.com/aakhundov
2024-01-09 19:06:51 +00:00
84b04e42a1 [ROCm] Enable aot_inductor tests (#116713)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116713
Approved by: https://github.com/jithunnair-amd, https://github.com/desertfire
2024-01-09 19:05:44 +00:00
ad22bd2fa1 [export][refactor][6/n] Remove equality_constraints (#116979)
Through the new dynamic_shapes API and using torch.export.Dim, dimensions that are equal will now be represented by the same symbol, so we no longer need to store `equality_constraints`.
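
A small sketch of the `dynamic_shapes` usage described above: reusing a single `Dim` object marks two dimensions as equal, so no separate equality constraint is needed.

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

batch = Dim("batch")  # one symbol shared by both inputs' first dimension
ep = export(M(), (torch.randn(4, 3), torch.randn(4, 3)),
            dynamic_shapes={"x": {0: batch}, "y": {0: batch}})
```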

Differential Revision: D52351705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116979
Approved by: https://github.com/avikchaudhuri
2024-01-09 19:04:47 +00:00
bdeaaad70c [CPU] _vec_log_softmax_lastdim: fix CHUNK_SIZE to avoid unnecessarily large allocation (#116990)
Given input shape of `[outer_size, dim_size]`, `_vec_log_softmax_lastdim` sets `CHUNK_SIZE` as
```cpp
int64_t CHUNK_SIZE = std::max<int64_t>(
      1,
      at::internal::GRAIN_SIZE / (sizeof(scalar_t) * dim_size));
```
where `at::internal::GRAIN_SIZE / (sizeof(scalar_t) * dim_size)` computes the maximum number of rows that can fit into L1d cache size `(GRAIN_SIZE)`.

Set `CHUNK_SIZE` to the minimum of `CHUNK_SIZE` and `outer_size` to avoid an unnecessarily large `CHUNK_SIZE` and unnecessarily large allocations for the `max` and `tmp_sum` buffers.
```cpp
auto tmp_sum_scalar = std::make_unique<scalar_t[]>(CHUNK_SIZE);
auto max_input_arr = std::make_unique<scalar_t[]>(CHUNK_SIZE);
```

### Performance

Perf data collected for `dim_size` in range [2^0, 2^9] and `outer_size` in range [2^0, 2^3]. To measure the benefit from avoiding unnecessarily large allocation, values of `outer_size` were chosen such that `outer_size` is less than `at::internal::GRAIN_SIZE / (sizeof(scalar_t) * dim_size)` for all values of `dim_size`.

Tested on 28 physical cores/socket, 1 socket on Skylake.

| **dim_size** 	| **at::internal::GRAIN_SIZE / (sizeof(scalar_t)   * dim_size)** 	| **input shape: (outer_size, dim_size)** 	| **Baseline (original implementation)** 	| **Optimized** 	| **Speedup Ratio (Baseline/Optimized)** 	|
|--------------	|----------------------------------------------------------------	|-----------------------------------------	|----------------------------------------	|---------------	|----------------------------------------	|
| 1            	| 8192                                                           	| (1, 1)                                  	| 0.006070137                            	| 0.003378391   	| **1.796754**                           	|
|              	|                                                                	| (2, 1)                                  	| 0.006327629                            	| 0.00361681    	| **1.749506**                           	|
|              	|                                                                	| (4, 1)                                  	| 0.006246567                            	| 0.00379324    	| **1.646763**                           	|
|              	|                                                                	| (8, 1)                                  	| 0.006320477                            	| 0.003941059   	| **1.603751**                           	|
| 2            	| 4096                                                           	| (1, 2)                                  	| 0.004889965                            	| 0.003342628   	| **1.46291**                            	|
|              	|                                                                	| (2, 2)                                  	| 0.005021095                            	| 0.003380775   	| **1.48519**                            	|
|              	|                                                                	| (4, 2)                                  	| 0.004897118                            	| 0.003535748   	| **1.38503**                            	|
|              	|                                                                	| (8, 2)                                  	| 0.005195141                            	| 0.003790855   	| **1.37044**                            	|
| 4            	| 2048                                                           	| (1, 4)                                  	| 0.004477501                            	| 0.003364086   	| **1.330971**                           	|
|              	|                                                                	| (2, 4)                                  	| 0.004198551                            	| 0.003452301   	| **1.21616**                            	|
|              	|                                                                	| (4, 4)                                  	| 0.004312992                            	| 0.003650188   	| **1.181581**                           	|
|              	|                                                                	| (8, 4)                                  	| 0.004432201                            	| 0.00399828    	| **1.108527**                           	|
| 8            	| 1024                                                           	| (1, 8)                                  	| 0.004155636                            	| 0.0035429     	| **1.172948**                           	|
|              	|                                                                	| (2, 8)                                  	| 0.003905296                            	| 0.003569126   	| **1.094188**                           	|
|              	|                                                                	| (4, 8)                                  	| 0.004405975                            	| 0.003864765   	| **1.140037**                           	|
|              	|                                                                	| (8, 8)                                  	| 0.004785061                            	| 0.004456043   	| **1.073836**                           	|
| 16           	| 512                                                            	| (1, 16)                                 	| 0.003867149                            	| 0.003504753   	| **1.103401**                           	|
|              	|                                                                	| (2, 16)                                 	| 0.003743172                            	| 0.003340244   	| **1.120628**                           	|
|              	|                                                                	| (4, 16)                                 	| 0.003614426                            	| 0.003519058   	| 1.0271                                 	|
|              	|                                                                	| (8, 16)                                 	| 0.00395298                             	| 0.003488064   	| **1.133288**                           	|
| 32           	| 256                                                            	| (1, 32)                                 	| 0.003900528                            	| 0.003421307   	| **1.14007**                            	|
|              	|                                                                	| (2, 32)                                 	| 0.003569126                            	| 0.003511906   	| 1.016293                               	|
|              	|                                                                	| (4, 32)                                 	| 0.003736019                            	| 0.003590584   	| 1.040505                               	|
|              	|                                                                	| (8, 32)                                 	| 0.003845692                            	| 0.003662109   	| **1.05013**                            	|
| 64           	| 128                                                            	| (1, 64)                                 	| 0.003652573                            	| 0.003437996   	| **1.062413**                           	|
|              	|                                                                	| (2, 64)                                 	| 0.003700256                            	| 0.003516674   	| **1.052203**                           	|
|              	|                                                                	| (4, 64)                                 	| 0.003783703                            	| 0.003638268   	| 1.039974                               	|
|              	|                                                                	| (8, 64)                                 	| 0.003993511                            	| 0.003809929   	| 1.048185                               	|
| 128          	| 64                                                             	| (1, 128)                                	| 0.003848076                            	| 0.003600121   	| **1.068874**                           	|
|              	|                                                                	| (2, 128)                                	| 0.003979206                            	| 0.003826618   	| 1.039875                               	|
|              	|                                                                	| (4, 128)                                	| 0.004360676                            	| 0.004224777   	| 1.032167                               	|
|              	|                                                                	| (8, 128)                                	| 0.005149841                            	| 0.004999638   	| 1.030043                               	|
| 256          	| 32                                                             	| (1, 256)                                	| 0.003943443                            	| 0.003738403   	| **1.054847**                           	|
|              	|                                                                	| (2, 256)                                	| 0.00420332                             	| 0.00408411    	| 1.029189                               	|
|              	|                                                                	| (4, 256)                                	| 0.004820824                            	| 0.00474453    	| 1.01608                                	|
|              	|                                                                	| (8, 256)                                	| 0.006194115                            	| 0.006067753   	| 1.020825                               	|
| 512          	| 16                                                             	| (1, 512)                                	| 0.004277229                            	| 0.004253387   	| 1.005605                               	|
|              	|                                                                	| (2, 512)                                	| 0.004863739                            	| 0.004782677   	| 1.016949                               	|
|              	|                                                                	| (4, 512)                                	| 0.006172657                            	| 0.00607729    	| 1.015692                               	|
|              	|                                                                	| (8, 512)                                	| 0.011193752                            	| 0.010819435   	| 1.034597                               	|

Bolded speedup ratios indicate a speedup greater than 5%, which we treat as significant. We observe significant speedups especially for smaller `dim_size` (1, 2, 4, 8, 16): the smaller the `dim_size`, the larger `at::internal::GRAIN_SIZE / (sizeof(scalar_t) * dim_size)` becomes, and hence the larger the unnecessary allocation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116990
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-09 18:43:02 +00:00
75968e2f94 Optimize operator (#117017)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117017
Approved by: https://github.com/Skylion007
2024-01-09 18:37:22 +00:00
0dd5deeced Bring docstring to .pyi file (#114705)
Fixes #37762

Since the original issue hasn't been making progress for more than 3 years, I am attempting to make this PR to at least make some progress forward.

This PR attempts to add docstrings to the `.pyi` files. The docstrings are read from [`_torch_docs`](https://github.com/pytorch/pytorch/blob/main/torch/_torch_docs.py) by mocking [`_add_docstr`](9f073ae304/torch/csrc/Module.cpp (L329)), which is the only function used to add docstrings.

Luckily, `_torch_docs` has no dependencies on other components of PyTorch, and can be imported with `_add_docstr` mocked, without compiling `torch._C`.

The generated `.pyi` file looks something like the following:

[_VariableFunctions.pyi.txt](https://github.com/pytorch/pytorch/files/13494263/_VariableFunctions.pyi.txt)

<img width="787" alt="image" src="https://github.com/pytorch/pytorch/assets/6421097/73c2e884-f06b-4529-8301-0ca0b9de173c">

And the docstring can be picked up by VSCode:

<img width="839" alt="image" src="https://github.com/pytorch/pytorch/assets/6421097/1999dc89-a591-4c7a-80ac-aa3456672af4">

<img width="908" alt="image" src="https://github.com/pytorch/pytorch/assets/6421097/ecf3fa92-9822-4a3d-9263-d224d87ac288">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114705
Approved by: https://github.com/albanD
2024-01-09 18:37:16 +00:00
cfd0728b24 Feature: cudnn convolution out (#116759)
Fixes #115611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116759
Approved by: https://github.com/albanD
2024-01-09 17:51:29 +00:00
0ef1266bc6 [BE] Fix CUDA build warnings (#117023)
After https://github.com/pytorch/pytorch/pull/116595/files compiling every .cu file results in
```
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=float, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=float, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=float, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=float, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=double, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=double, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=double, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=double, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<float>, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<float>, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<float>, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<float>, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<double>, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<double>, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<double>, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<double>, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
```
Fix it by using `if constexpr` to avoid calling `static_cast<uint64_t>` for any floating-point type.
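
As a hedged illustration of the pattern (not the actual c10 code), the idea is to make the integral cast a discarded branch whenever the destination type is floating point:

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Sketch: for a negative source value f, decide whether it overflows To.
// When To is floating point, any int64_t magnitude is representable, so the
// uint64_t casts (and the warning they triggered) are never instantiated.
template <typename To, typename From>
bool negative_value_overflows_sketch(From f) {
  static_assert(std::is_integral_v<From>, "sketch assumes an integral source");
  using limit = std::numeric_limits<To>;
  if constexpr (std::is_floating_point_v<To>) {
    (void)f;
    return false;
  } else {
    // Integral destination: compare magnitudes as uint64_t, as before.
    return -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max());
  }
}
```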

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117023
Approved by: https://github.com/albanD
2024-01-09 17:40:10 +00:00
b6962208b8 [CI] Add initial ci test workflow for XPU based on IDC runners (#116554)
Add initial CI test for XPU based on IDC self-hosted runners with label `linux.idc.xpu`, which will be triggered by label `ciflow/xpu` for current stage.

Works for RFC https://github.com/pytorch/pytorch/issues/114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116554
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-01-09 17:00:35 +00:00
6784030df4 [MPS] Add support for 64-bit index operations (#116942)
But enable it only if `iter.can_use_32bit_indexing()` is False. Add a test for index_select, but enable it only on Sonoma, as all attempts to create a 4GB+ tensor on Ventura and older fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116942
Approved by: https://github.com/Skylion007, https://github.com/kulinseth
ghstack dependencies: #116903, #116904, #116915, #116940
2024-01-09 16:56:49 +00:00
81b7a09d27 [CI] Test that cuInit is not called during import (#117010)
By making a driver API call in subprocess and expecting it to return `CUDA_ERROR_NOT_INITIALIZED`

Test Plan: run it on nighties before https://github.com/pytorch/pytorch/pull/116201 got reverted and observe the failure

This is very important for lots of distributed launchers

Fixes https://github.com/pytorch/pytorch/issues/116276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117010
Approved by: https://github.com/albanD
2024-01-09 14:44:22 +00:00
db79ceb110 [ROCm] Enabling additional UTs on ROCm (#115738)
Unskips mostly for dynamo/inductor UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115738
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-01-09 08:36:07 +00:00
f0bbc2fcf5 [AOTInductor] Small refactor so both Meta internal and OSS can deal with misplaced args and kwargs for Extern Fallback kernels (#116779)
Summary:
In torch/_inductor/lowering.py (https://fburl.com/code/jd58vxpw), we are using
```
fallback_cumsum(x, dim=axis, dtype=dtype)
```
so this will treat `x` as a positional arg, and `dim` and `dtype` as kwargs (see https://fburl.com/code/cikchxp9).

The issue has been fixed by D52530506 for OSS but not for Meta internal. This diff addresses the Meta-internal issue with some refactoring so that both Meta internal and OSS can use the same helper function. The diff also adds some debug logging.

Test Plan:
before
```
aoti_torch_proxy_executor_call_function(proxy_executor, 2, 1, std::vector<int64_t>{torch.int64}.data(), 2, std::vector<AtenTensorHandle>{buf702, buf708}.data());
```
after
```
aoti_torch_proxy_executor_call_function(proxy_executor, 2, 1, std::vector<int64_t>{0}.data(), 2, std::vector<AtenTensorHandle>{buf702, buf708}.data());
```
so `torch.int64` changed to `0`

Differential Revision: D52532031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116779
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2024-01-09 07:57:46 +00:00
6e2f879d7f [ROCm] hipify mapping for cudaDevAttrMaxSharedMemoryPerBlockOptin (#116984)
Summary: Map `cudaDevAttrMaxSharedMemoryPerBlockOptin` to `hipDeviceAttributeMaxSharedMemoryPerBlock` to make it work for AMD GPUs.

Test Plan: CI

Differential Revision: D52558076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116984
Approved by: https://github.com/jeffdaily
2024-01-09 07:38:20 +00:00
d78776e2e6 Stop unconditionally applying hermetic mode (#116996)
When originally authored, it was not necessary to unconditionally apply
hermetic mode, but I chose to apply it in eager mode to help catch bugs.
Well, multipy is kind of dead, and hermetic mode is causing real
implementation problems for people who want to do fancy Python stuff
from the dispatcher.  So let's yank this mode for now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116996
Approved by: https://github.com/jansel
2024-01-09 05:55:08 +00:00
6cf1fc66e3 [cuda][easy] cosmetic and small syntax changes to layer_norm_kernel.cu (#116920)
Used `auto` and `const` where needed; replaced a CUDA specific `__syncwarp` with device agnostic `WARP_SYNC`; added more comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116920
Approved by: https://github.com/malfet
2024-01-09 04:44:57 +00:00
104a23e4f5 [cpu][vec512] improve int load/store/with mask (#116964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116964
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #116961, #116962, #116963
2024-01-09 04:37:44 +00:00
4e54a70451 [cpu][vec512] improve double load/store with mask (#116963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116963
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #116961, #116962
2024-01-09 04:37:44 +00:00
428807f9bc [cpu][vec512] improve fp32 load/store with mask (#116962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116962
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #116961
2024-01-09 04:32:22 +00:00
a0bd7dfec1 [cpu][vec512] improve bf16/fp16 load/store with mask for inductor (#116961)
Improve the performance of vec512 bfloat16 (and also float16) load and store with partial vector lanes by using masked load/store instead of going through `memcpy` with an auxiliary buffer. In the inductor CPU backend, we load/store half of the vector lanes (16) for bfloat16 and float16.

Using the following micro-benchmark script for `layernorm + add`:
```python
import torch
import torch.nn as nn
from benchmark_helper import time_with_torch_timer

class AddLayernorm(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states):
        return hidden_states + self.ln(hidden_states)

hidden_states = torch.randn(1, 512, 1024).to(torch.bfloat16)

with torch.no_grad():
    compiled_add_ln = torch.compile(add_ln)
    print(time_with_torch_timer(compiled_add_ln, hidden_states, iters=10000))
```

Measured on single-core `Intel(R) Xeon(R) Platinum 8358 CPU`.
Before: 1.39 ms
After: 498.66 us
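
For reference, a rough sketch of what a masked partial-lane load/store of 16-bit elements looks like with AVX-512BW intrinsics (the intrinsic names are from Intel's intrinsics guide; the helper functions are illustrative, not the actual `at::vec::Vectorized` code):

```cpp
#include <immintrin.h>
#include <cstdint>

// Load/store only the first `count` (<= 32) 16-bit lanes of a 512-bit vector,
// never touching memory past the end of the buffer. Requires AVX512BW.
inline __m512i masked_load_16bit(const uint16_t* src, int count) {
  __mmask32 mask = (count >= 32) ? 0xFFFFFFFFu : ((1u << count) - 1);
  return _mm512_maskz_loadu_epi16(mask, src);  // unselected lanes are zeroed
}

inline void masked_store_16bit(uint16_t* dst, __m512i v, int count) {
  __mmask32 mask = (count >= 32) ? 0xFFFFFFFFu : ((1u << count) - 1);
  _mm512_mask_storeu_epi16(dst, mask, v);      // only selected lanes are written
}
```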

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116961
Approved by: https://github.com/sanchitintel, https://github.com/leslie-fang-intel
2024-01-09 04:18:33 +00:00
bac0de160c [ROCm] Add minimal inductor test to rocm-test workflow (#115425)
Adds the `inductor/test_torchinductor` to tests-to-include so we can have some PR-level test coverage for inductor tests on ROCm. This should help catch issues before merging (e.g. https://github.com/pytorch/pytorch/pull/114772)

This unit test takes ~6minutes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115425
Approved by: https://github.com/jithunnair-amd, https://github.com/huydhn, https://github.com/malfet
2024-01-09 03:54:25 +00:00
4c0d63180a Support NNModules as dict keys (#116723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116723
Approved by: https://github.com/lezcano
2024-01-09 03:32:47 +00:00
92cf7ba36b [vision hash update] update the pinned vision hash (#117002)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117002
Approved by: https://github.com/pytorchbot
2024-01-09 03:21:43 +00:00
ff0a3f35a4 [audio hash update] update the pinned audio hash (#116954)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116954
Approved by: https://github.com/pytorchbot
2024-01-09 03:16:00 +00:00
14be2ee271 Inductor qlinear int8_bf16 with bmm (#116604)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/116492: `linear` will be decomposed into `bmm` when the input dim exceeds 2 and the input is not contiguous. Fix this issue by converting the pattern back into `qlinear`. This PR focuses on the int8_bf16 case, following https://github.com/pytorch/pytorch/pull/116599.

**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_qlinear_int8_mixed_bf16_input_dim_exceeds_2_and_not_contiguous
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116604
Approved by: https://github.com/jgong5
ghstack dependencies: #116937, #116599
2024-01-09 01:36:27 +00:00
153b3a0996 Inductor qlinear int8_fp32 with bmm (#116599)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/116492: `linear` will be decomposed into `bmm` when the input dim exceeds 2 and the input is not contiguous. Fix this issue by converting the pattern back into `qlinear`. This PR focuses on the int8_fp32 case; the int8_bf16 case will follow in the next PR.

**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_qlinear_input_dim_exceeds_2_and_not_contiguous
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116599
Approved by: https://github.com/jgong5
ghstack dependencies: #116937
2024-01-09 01:33:46 +00:00
6ca31ae1d3 [CI] Add inductor workflow for rocm (#110544)
This PR is to create a separate CI job for inductor UTs on ROCm. You will need to add `ciflow/inductor` tag on PRs to trigger this job. However, the job will run on its own on any commit merged in main. This job takes around 1.5 hours to run and it is run in parallel to other rocm jobs. It is run only on the MI210 CI runners to ensure maximum inductor functionality is tested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110544
Approved by: https://github.com/jithunnair-amd, https://github.com/jansel, https://github.com/huydhn
2024-01-09 01:32:15 +00:00
227579d6a0 [Inductor] [Quant] Add remaining user check for qconv binary fusion (#115809)
**Summary**
Similar to https://github.com/pytorch/pytorch/pull/115153, when we do the `qconv_binary` fusion with post-op sum, we also need to ensure that all users of the extra input in this pattern are ancestor nodes of the compute node, except for the binary node connected to the compute node.

This diff also renames some variables:

- Rename `qconv2d_node_after_weight_prepack` to `compute_node`
- Rename `extra_input_node` to `extra_input_of_binary_node`

**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_qconv2d_add_3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115809
Approved by: https://github.com/jgong5
ghstack dependencies: #115153
2024-01-09 01:26:50 +00:00
33d90cfd16 Allow for [-oo, oo] ranges for bools (#114362)
This fixes a problem in Seamless M4T in fairseq2; repro
instructions are at https://docs.google.com/document/d/1PVy4KibfljirQDoijOwyHCV97B67r_iElWqFh7h1Acc/edit

I tried extracting a minimal repro but I couldn't actually manage it!

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114362
Approved by: https://github.com/Skylion007
2024-01-09 01:08:34 +00:00
f26ed0a71d [dynamo] Move graph breaks in for/while->skip after logging (#116981)
We were losing critical graph break info if the graph break came from a for or while loop.

Given:

```
def foo(x, y):
    z = x * y
    for i in range(10):
        z = z * y
        print(z)
    return z

a = torch.randn([2, 2])
b = torch.randn([2, 2])

foo = torch._dynamo.optimize('eager')(foo)

foo(a, b)
```

Before:

```
$ TORCH_LOGS=+graph_breaks python x.py
tensor([[-0.1046, -0.1597],
        [-0.0006, -0.1327]])
tensor([[-4.2091e-02,  6.3045e-02],
        [-1.6759e-05,  4.0366e-02]])
tensor([[-1.6929e-02, -2.4892e-02],
        [-4.8690e-07, -1.2281e-02]])
tensor([[-6.8091e-03,  9.8278e-03],
        [-1.4146e-08,  3.7363e-03]])
tensor([[-2.7387e-03, -3.8803e-03],
        [-4.1097e-10, -1.1367e-03]])
tensor([[-1.1015e-03,  1.5320e-03],
        [-1.1940e-11,  3.4584e-04]])
tensor([[-4.4304e-04, -6.0488e-04],
        [-3.4688e-13, -1.0522e-04]])
tensor([[-1.7820e-04,  2.3882e-04],
        [-1.0078e-14,  3.2012e-05]])
tensor([[-7.1672e-05, -9.4293e-05],
        [-2.9279e-16, -9.7392e-06]])
tensor([[-2.8827e-05,  3.7229e-05],
        [-8.5063e-18,  2.9630e-06]])
```
After:

```
$ TORCH_LOGS=+graph_breaks python x.py
[2024-01-08 11:14:49,372] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Graph break: call_function BuiltinVariable(print) [TensorVariable()] {} from user code at:
[2024-01-08 11:14:49,372] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/voz/pytorch/x.py", line 32, in foo
[2024-01-08 11:14:49,372] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     print(z)
[2024-01-08 11:14:49,372] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
tensor([[ 0.2065,  0.0766],
        [-2.0600,  1.8425]])
tensor([[-0.0617, -0.0698],
        [-3.5799,  2.2167]])
tensor([[ 0.0184,  0.0636],
        [-6.2212,  2.6669]])
tensor([[-5.5031e-03, -5.7971e-02],
        [-1.0811e+01,  3.2085e+00]])
tensor([[ 1.6437e-03,  5.2837e-02],
        [-1.8788e+01,  3.8601e+00]])
tensor([[-4.9093e-04, -4.8157e-02],
        [-3.2650e+01,  4.6441e+00]])
tensor([[ 1.4663e-04,  4.3891e-02],
        [-5.6741e+01,  5.5872e+00]])
tensor([[-4.3796e-05, -4.0004e-02],
        [-9.8605e+01,  6.7220e+00]])
tensor([[ 1.3081e-05,  3.6461e-02],
        [-1.7136e+02,  8.0871e+00]])
tensor([[-3.9070e-06, -3.3231e-02],
        [-2.9779e+02,  9.7296e+00]])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116981
Approved by: https://github.com/ezyang
2024-01-09 00:39:03 +00:00
e728ebb66d Small docstring fix (#116947)
Fix a small typo in the docstring of checkpoint function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116947
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2024-01-08 23:51:59 +00:00
28e2e12b2a [quant][be] enable xnnpack_quantizer tests to run in internal CI (#116911)
Summary: fixed an import problem for test_xnnpack_quantizer so that it can run in CI

Test Plan:
internal CI
sanity check: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_conv2d (caffe2.test.quantization.pt2e.test_xnnpack_quantizer.TestXNNPACKQuantizer)'

Differential Revision: D52576449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116911
Approved by: https://github.com/mcr229
2024-01-08 23:43:47 +00:00
534c73d478 Fix NaN bug in torch.signal.windows.kaiser (#116470)
Fixes #115595

As an aside, there are currently no tests checking the output of `torch.signal.windows.kaiser` against the output of scipy's implementation, which is what is done with `torch.kaiser_window`. The same goes for the other window functions in `torch.signal.windows`. I did some tests on my end, but I'm not sure what the best practice is, so I haven't included them for now.

@gchanan @mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116470
Approved by: https://github.com/ezyang
2024-01-08 22:24:52 +00:00
d006cae2a8 Update documentation for unsigned int types (#116804)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116804
Approved by: https://github.com/albanD
ghstack dependencies: #116595, #116803
2024-01-08 22:02:10 +00:00
fd0c071969 Add tolist support for unsigned types (#116803)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116803
Approved by: https://github.com/albanD
ghstack dependencies: #116595
2024-01-08 22:02:10 +00:00
f4e35e2c3d Proposed mechanism for handling uint64_t in Scalar (#116595)
Here's the problem: if we support unsigned integer types, and in particular if we support uint64_t, we need a way to represent these integers in Scalar. However, Scalar currently stores all integral values inside int64_t, which is not wide enough to accommodate all possible uint64_t values. So we need to do something to Scalar to support it.

The obvious thing to do is add a uint64_t field to the union, and use it in some situations. But when should we use it? The proposal is that we use it if and only if the integer in question is not representable in int64_t. The historical precedent for this is our handling for uint8_t. Because this type is representable inside int64_t, we have historically stored it inside Scalar as an int64_t. In general, the concept behind Scalar is that it doesn't know the signedness/unsignedness/bitwidth of its input; in particular, we typically construct Scalar from Python int, which doesn't have any concept of how wide the integer is! So it doesn't make any sense to allow for a small integer like 255 to be representable under both the HAS_i tag and the HAS_u tag. So we forbid the latter case.
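
A rough sketch of that policy (a simplified stand-in type; the tag names come from the description above, everything else is hypothetical):

```cpp
#include <cstdint>
#include <limits>

// Sketch: store an unsigned value under the HAS_u tag only when it cannot be
// represented as int64_t; small unsigned values keep the historical int64 path.
struct ScalarSketch {
  enum class Tag { HAS_i, HAS_u } tag;
  union {
    int64_t i;
    uint64_t u;
  } v;

  explicit ScalarSketch(uint64_t x) {
    if (x <= static_cast<uint64_t>(std::numeric_limits<int64_t>::max())) {
      tag = Tag::HAS_i;
      v.i = static_cast<int64_t>(x);
    } else {
      tag = Tag::HAS_u;  // only values that don't fit in int64_t use the new tag
      v.u = x;
    }
  }
};
```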

Although I have proposed this, the PR as currently written just chokes when you pass it a uint64_t that's too big. There's some more logic that would have to be written out for this. I'm putting this out to start to get some agreement that this is the way to do it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116595
Approved by: https://github.com/albanD
2024-01-08 22:02:03 +00:00
7073dc604e Merge merging rules of CPU inductor and x86 CPU quantization (#116937)
**Summary**
Following the discussion at https://github.com/pytorch/pytorch/pull/116599#issuecomment-1878757581, due to the limitation of the current merging rules that prevent cross-checking all files among different merge groups, it is proposed to merge the groups `x86 CPU quantization` and `CPU inductor` since they are closely related.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116937
Approved by: https://github.com/jgong5, https://github.com/atalman
2024-01-08 15:32:03 +00:00
a2d73e21d1 follow up #115078, broken distributed tests (#116217)
ROCm distributed tests started failing after #115078.  This skips the new tests if the number of GPUs available isn't sufficient.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116217
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-08 15:26:54 +00:00
cyy
ad507789d1 [Reland] [11/N] Enable clang-tidy warnings on c10/util/*.h (#116751)
Reland of #116353 with C++ diagnostic macros restored.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116751
Approved by: https://github.com/albanD
2024-01-08 11:07:58 +00:00
e780213340 [xla hash update] update the pinned xla hash (#116958)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116958
Approved by: https://github.com/pytorchbot
2024-01-08 11:00:59 +00:00
6173386fc4 [MPS][BE] Remove unused nOffsets parameter (#116940)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116940
Approved by: https://github.com/Skylion007
ghstack dependencies: #116903, #116904, #116915
2024-01-08 04:55:35 +00:00
f663935935 [MPS] Fix boundary checks in generateKernelOffsets (#116915)
`TORCH_CHECK(i < UINT32_MAX)` is always true, so the check never fires; it should be `TORCH_CHECK(iterShape[i] < UINT32_MAX)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116915
Approved by: https://github.com/Skylion007, https://github.com/kulinseth
ghstack dependencies: #116903, #116904
2024-01-08 04:55:35 +00:00
aa718065b2 [MPS][BE] Refactor common code (#116904)
Into `generateKernelDataOffsets` which was repeated character by character in BinaryKernel, CrossKernel and Indexing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116904
Approved by: https://github.com/Skylion007
ghstack dependencies: #116903
2024-01-08 04:55:35 +00:00
57491d2046 Add bfloat16 + fp16 support to fractional_max_pool for CUDA and CPU (#116950)
Adds bfloat16 to fractional_max_pool. If an op supports fp32 and fp16, it really should support bf16 as well. Most but not all ops satisfy this, so I am adding support for the few that do not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116950
Approved by: https://github.com/lezcano
2024-01-08 03:54:29 +00:00
7d61fa23df Add float16 support to CUDA logaddexp2 (#116948)
float16 is already supported on CPU for this op and on gpu for `logaddexp` so let's expand support to the function with the base2 variant as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116948
Approved by: https://github.com/lezcano
2024-01-08 03:37:07 +00:00
2fe90e4d47 [vision hash update] update the pinned vision hash (#116908)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116908
Approved by: https://github.com/pytorchbot
2024-01-08 03:24:41 +00:00
6c32cd05a3 [executorch hash update] update the pinned executorch hash (#116936)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116936
Approved by: https://github.com/pytorchbot
2024-01-08 03:18:18 +00:00
376f036570 Add bfloat16 CUDA support to multinomial (#116951)
Add torch bfloat16 support to multinomial. Only a few methods in torch support fp32, fp16, but not bfloat16 so let's go and finish implementing them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116951
Approved by: https://github.com/lezcano
2024-01-08 01:43:16 +00:00
8257b867d8 Add bfloat16 CUDA support to binomial distribution (#116932)
Now all distributions support bfloat16 as input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116932
Approved by: https://github.com/malfet
2024-01-07 19:50:10 +00:00
4a37f57c69 Add batched sparse CSR/CSC/BSR/BSC to sparse COO conversion support (#116206)
As in the title.

Fixes https://github.com/pytorch/pytorch/issues/104868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116206
Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/cpuhrsch
2024-01-07 19:42:02 +00:00
cyy
4b74bb6c34 [Exception] [2/N] Remove THPUtils_assert (#116772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116772
Approved by: https://github.com/albanD
2024-01-07 14:21:43 +00:00
3c7f358c91 Update the expected accuracy value for demucs (#116944)
Update the expected value with `python benchmarks/dynamo/ci_expected_accuracy/update_expected.py b847290ddd9c6a5a598c70f8b660ee2b1e71dc95` as this is now failing in trunk after 95041829c8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116944
Approved by: https://github.com/voznesenskym
2024-01-07 13:34:51 +00:00
de005b14ab [dynamo] fix more broken dict tests (#116943)
Forward fixing after #111196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116943
Approved by: https://github.com/huydhn
2024-01-07 08:00:16 +00:00
8ddac14a15 Add unsigned integer dtypes to PyTorch (#116594)
The dtypes are very useless right now (not even fill works), but this makes torch.uint16, uint32 and uint64 available as dtypes.

Towards https://github.com/pytorch/pytorch/issues/58734

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116594
Approved by: https://github.com/albanD
ghstack dependencies: #116698, #116693
2024-01-07 07:40:49 +00:00
8e273e23b5 Refactor promoteType to no longer use shifting strategy (#116693)
Instead of manually fixing the indices (extremely error-prone when new
dtypes are added), we just set up a lookup table to map ScalarType to the
offsets table.
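
A miniature sketch of the lookup-table idea on a made-up three-type enum (the type names and table contents are hypothetical, purely to show the shape of the mapping):

```cpp
#include <array>
#include <cstdint>

// Map each type to a dense index via an explicit table, then look the result
// up in a 2D promotion matrix, instead of deriving indices by shifting.
enum class MiniType : uint8_t { Bool = 0, Int = 1, Float = 2 };

constexpr int kNumTypes = 3;
constexpr std::array<int, kNumTypes> type_to_index = {0, 1, 2};

constexpr MiniType promotion[kNumTypes][kNumTypes] = {
    /* Bool  */ {MiniType::Bool,  MiniType::Int,   MiniType::Float},
    /* Int   */ {MiniType::Int,   MiniType::Int,   MiniType::Float},
    /* Float */ {MiniType::Float, MiniType::Float, MiniType::Float},
};

constexpr MiniType promote(MiniType a, MiniType b) {
  return promotion[type_to_index[static_cast<uint8_t>(a)]]
                  [type_to_index[static_cast<uint8_t>(b)]];
}
```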

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116693
Approved by: https://github.com/albanD
ghstack dependencies: #116698
2024-01-07 07:40:49 +00:00
c5e6485d14 Add AT_DISPATCH_V2 (#116698)
See top-level comment on Dispatch_v2.h for motivation.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116698
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-07 07:40:49 +00:00
9557b63c85 [MPS][BE] Do not crash if Metal function can not be found (#116938)
As [`newFunctionWithName:`](https://developer.apple.com/documentation/metal/mtllibrary/1515524-newfunctionwithname) does not accept error argument, do not attempt to print it as it'll be guaranteed `nil` at that point, that results in a classic null pointer dereference, when `TORCH_CHECK` will attempt to construct `std::string` from it. See below backtrace for example:
```
 thread #1, queue = 'metal gpu stream', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000018a316dc4 libsystem_platform.dylib`_platform_strlen + 4
    frame #1: 0x00000001471011bc libtorch_cpu.dylib`std::__1::__constexpr_strlen[abi:v160006](__str=0x0000000000000000) at cstring:114:10
    frame #2: 0x0000000147100c24 libtorch_cpu.dylib`std::__1::char_traits<char>::length(__s=0x0000000000000000) at char_traits.h:220:12
  * frame #3: 0x0000000147100bf0 libtorch_cpu.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char>>& std::__1::operator<<[abi:v160006]<std::__1::char_traits<char>>(__os=0x000000016fdfb3a0, __str=0x0000000000000000) at ostream:901:57
    frame #4: 0x0000000147100bb4 libtorch_cpu.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char>>& c10::detail::_str<char const*>(ss=0x000000016fdfb3a0, t=0x000000016fdfb5d0) at StringUtil.h:55:6
    frame #5: 0x00000001471007ac libtorch_cpu.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char>>& c10::detail::_str<char const*, char const*>(ss=0x000000016fdfb3a0, t=0x000000016fdfb4f8, args=0x000000016fdfb5d0) at StringUtil.h:68:10
    frame #6: 0x0000000147101444 libtorch_cpu.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char>>& c10::detail::_str<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, char const*, char const*>(ss=0x000000016fdfb3a0, t="index_select_32bit_idx32", args=0x000000016fdfb4f8, args=0x000000016fdfb5d0) at StringUtil.h:68:10
    frame #7: 0x0000000147101404 libtorch_cpu.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char>>& c10::detail::_str<char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, char const*, char const*>(ss=0x000000016fdfb3a0, t=0x000000016fdfb500, args="index_select_32bit_idx32", args=0x000000016fdfb4f8, args=0x000000016fdfb5d0) at StringUtil.h:68:10
    frame #8: 0x000000014710137c libtorch_cpu.dylib`c10::detail::_str_wrapper<char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, char const*, char const* const&>::call(args=0x000000016fdfb500, args="index_select_32bit_idx32", args=0x000000016fdfb4f8, args=0x000000016fdfb5d0) at StringUtil.h:75:5
    frame #9: 0x0000000147101310 libtorch_cpu.dylib`decltype(auto) c10::str<char [53], std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, char [10], char const*>(args={a\xcb\xa7H\x01\0\0\0}, args="index_select_32bit_idx32", args={\x96\xcb\xa7H\x01\0\0\0}, args=0x000000016fdfb5d0) at StringUtil.h:111:10
    frame #10: 0x0000000147100210 libtorch_cpu.dylib`decltype(auto) c10::detail::torchCheckMsgImpl<char [53], std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, char [10], char const*>((null)="Expected indexFunction to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)", args={a\xcb\xa7H\x01\0\0\0}, args="index_select_32bit_idx32", args={\x96\xcb\xa7H\x01\0\0\0}, args=0x000000016fdfb5d0) at Exception.h:453:10
    frame #11: 0x00000001470fffe8 libtorch_cpu.dylib`at::mps::MPSDevice::metalIndexingPSO(this=0x0000600000381670, kernel="index_select_32bit_idx32") at MPSDevice.mm:62:3
```

This was introduced by https://github.com/pytorch/pytorch/pull/99855 that replaced `newFunctionWithName:constantValues:error:` with `newFunctionWithName:`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116938
Approved by: https://github.com/Skylion007
2024-01-07 07:08:54 +00:00
20c2ec9a15 [CPU] Add flash attention mask version (#115913)
Add a masked version of flash attention for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115913
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-01-07 04:58:23 +00:00
b847290ddd Back out "[2d] unflatten_tensor on compute stream for DTensorExtension (#116559)" (#116939)
Summary:
Original commit changeset: 65298112f3db

Original Phabricator Diff: D52530451

Differential Revision: D52583345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116939
Approved by: https://github.com/842974287
2024-01-07 03:53:40 +00:00
4b5b8f8a75 Add bfloat16 CUDA support to smoothl1loss (#116933)
Gradually ensuring that all CUDA ops that support float16 also support bfloat16 if possible

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116933
Approved by: https://github.com/malfet
2024-01-07 02:42:49 +00:00
a7902571be Add bfloat16 CUDA support to gamma unary functions (#116929)
Add bfloat16 support to unary gamma functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116929
Approved by: https://github.com/malfet
2024-01-07 02:07:55 +00:00
8e1119f7b2 Fix typo in CUDA Macro (#116930)
Found while grepping for remaining _AND macros in CUDA subfolder

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116930
Approved by: https://github.com/malfet
2024-01-07 01:49:32 +00:00
83e8a0721d Reland #111196 (take 4) "Support tensors as Dict keys" (#116934)
Fixes #ISSUE_NUMBER

See that PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116934
Approved by: https://github.com/ezyang, https://github.com/huydhn
2024-01-07 01:37:26 +00:00
95041829c8 Add bfloat16 CUDA support to RNN (#116927)
Fixes #116925
Fixes #116763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116927
Approved by: https://github.com/malfet
2024-01-06 22:55:34 +00:00
a5b86847ef Fix compiler warnings in cuda code (#116921)
Fixes compiler warnings about comparison between signed and unsigned data types

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116921
Approved by: https://github.com/Skylion007
2024-01-06 21:25:19 +00:00
65da4e1ba2 [CI] Use jemalloc for CUDA builds (#116900)
According to @ptrblck it'll likely mitigate non-deterministic NVCC bug
See https://github.com/pytorch/pytorch/issues/116289 for more detail

Test plan: ssh into one of the cuda builds and make sure that `LD_PRELOAD` is set for the top-level make command

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116900
Approved by: https://github.com/atalman
2024-01-06 21:03:02 +00:00
c05dd2aaf0 [EZ][MPS] Use dispatch with rethrow for indexing (#116903)
Otherwise any assert within a sync block will cause an unrecoverable abort rather than a structured exception
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116903
Approved by: https://github.com/Skylion007
2024-01-06 20:36:47 +00:00
9519c8afd4 [export] Remove hacks for passing pinned version test. (#116871)
Summary: nature will heal itself.

Test Plan: CI

Reviewed By: angelayi

Differential Revision: D52566227

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116871
Approved by: https://github.com/angelayi
2024-01-06 18:09:27 +00:00
2dca3e99eb Revert "Support tensors as Dict keys Re-PR of #111196 (#116785)"
This reverts commit 1badad9ce9694ef70f6a3dc01000f2cf310c4c11.

Reverted https://github.com/pytorch/pytorch/pull/116785 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/116785#issuecomment-1879592261))
2024-01-06 08:22:33 +00:00
88197f2202 Rename experimental API (#116895)
Summary: Title

Test Plan: CI

Differential Revision: D52571286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116895
Approved by: https://github.com/zhxchen17
2024-01-06 08:01:09 +00:00
830ace33bc [C10D] Add GIL checker to NCCL watchdog monitor (#116798)
Whenever the monitor thread kills the watchdog thread for being stuck,
we do so to save cluster time and get a faster failure signal, but we
want to know more about why it got stuck.

One possible reason for watchdog stuckness is GIL contention, which
could be ruled out or observed by making an attempt to acquire the GIL
at exit time.

If we cannot acquire the GIL within a short time window (1s) we abort
the attempt and report GIL contention, otherwise we report that GIL was
acquired successfully.
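
A standalone sketch of the idea (the Python C-API calls are real; the helper's structure and names are illustrative, not the watchdog's actual code):

```cpp
#include <Python.h>
#include <atomic>
#include <chrono>
#include <memory>
#include <thread>

// Try to take the GIL from a detached helper thread and report whether it
// succeeded within `timeout`; if the GIL is contended, the flag stays false
// and we stop waiting instead of blocking forever.
bool gil_acquired_within(std::chrono::milliseconds timeout) {
  auto acquired = std::make_shared<std::atomic<bool>>(false);
  std::thread([acquired] {
    PyGILState_STATE s = PyGILState_Ensure();  // blocks until the GIL is free
    acquired->store(true);
    PyGILState_Release(s);
  }).detach();

  auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (acquired->load()) {
      return true;   // report: GIL was acquired successfully
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  return acquired->load();  // false => report possible GIL contention
}
```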

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116798
Approved by: https://github.com/zdevito
2024-01-06 05:13:43 +00:00
f24bba1624 [executorch hash update] update the pinned executorch hash (#116800)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116800
Approved by: https://github.com/pytorchbot
2024-01-06 04:10:52 +00:00
78c3098470 cmake: Include CheckCXXCompilerFlag where it is used (#113028)
Move the `include(CheckCXXCompilerFlag)` above the `append_cxx_flag_if_supported` function that uses it to avoid depending on the caller to have it already included.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113028
Approved by: https://github.com/malfet
2024-01-06 04:05:45 +00:00
1badad9ce9 Support tensors as Dict keys Re-PR of #111196 (#116785)
This prepares the PR where we implement sets in terms of dicts.
To do so, rather than internally storing a dictionary that maps literals
to VariableTrackers, it stores (pretty much) a dictionary from VTs to VTs.
Keys are wrapped in an opaque internal class _Hashable.
The _Hashable class is opaque on purpose so that it fails hard
if it inadvertently leaks back into user code.
We also found and fixed a number of latent bugs and inconsistencies
in the way dynamo checked what can be a dict key. More generally, we
make it much clearer what needs to be modified to add
a new supported key type to Dicts.

Fixes [#107595](https://www.internalfb.com/tasks?t=107595)
Fixes [#111603](https://www.internalfb.com/tasks?t=111603)

Re-PR of https://github.com/pytorch/pytorch/pull/111196 sadly due to reverts, we could not reuse @lezcano's original PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116785
Approved by: https://github.com/mlazos
2024-01-06 03:35:35 +00:00
ff0f79d3c7 [MPS] Mark torch.[all|any] as working with complex on MacOS14 (#116907)
It was enabled by https://github.com/pytorch/pytorch/pulls/116457, but at the time that PR landed, Sonoma testing was still not enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116907
Approved by: https://github.com/osalpekar, https://github.com/kit1980
2024-01-06 01:10:11 +00:00
0b0c76bace Support squeeze.dim for jagged NT (#116891)
As title. Needed for `rev_view_func()` of `unsqueeze()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116891
Approved by: https://github.com/soulitzer
ghstack dependencies: #115894, #116512
2024-01-06 01:00:53 +00:00
8894a97707 [Dynamo] Fix source for autograd.function default value (#116894)
Before this PR, the source guard would emit
```
globals()['Gradient'].__class__.forward.__defaults__[0]
```
which is incorrect

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116894
Approved by: https://github.com/zou3519, https://github.com/yanboliang
2024-01-06 00:36:00 +00:00
5323b2daa5 [docs] add mode="reduce-overhead" into torch.compile to enable cuda graph (#116529)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116529
Approved by: https://github.com/eellison
2024-01-05 22:54:20 +00:00
2753960177 markDynamoStrictTest most of test/lazy/.* (#116893)
[codemod] markDynamoStrictTest lazy/test_step_closures
[codemod] markDynamoStrictTest lazy/test_reuse_ir
[codemod] markDynamoStrictTest lazy/test_meta_kernel
[codemod] markDynamoStrictTest lazy/test_generator
[codemod] markDynamoStrictTest lazy/test_functionalization
[codemod] markDynamoStrictTest lazy/test_extract_compiled_graph
[codemod] markDynamoStrictTest lazy/test_debug_util
[codemod] markDynamoStrictTest lazy/test_bindings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116893
Approved by: https://github.com/Skylion007
ghstack dependencies: #116879, #116880, #116881, #116892
2024-01-05 22:29:35 +00:00
af2ded23eb [export] Exempt autograd ops for predispatch export (#116527)
Summary:
We intend to preserve autograd ops for predispatch export. Therefore, we
need to exempt the autograd ops in some places, e.g. verifier and
proxy_tensor.py.

Test Plan:
python test/export/test_export.py -k test_predispatch_export_with_autograd_op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116527
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #116339
2024-01-05 22:28:57 +00:00
9431798521 [export] Error grad mode op in export API (#116339)
Summary:
As current export doesn't support training, grad mode ops don't
make sense. To avoid confusion, we choose to error early if there
are any grad mode ops.

Test Plan:
python test/export/test_safeguard.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116339
Approved by: https://github.com/tugsbayasgalan
2024-01-05 22:28:57 +00:00
8fd4efacb4 markDynamoStrictTest most test/functorch/* (#116892)
[codemod] markDynamoStrictTest functorch/test_rearrange
[codemod] markDynamoStrictTest functorch/test_parsing
[codemod] markDynamoStrictTest functorch/test_minifier
[codemod] markDynamoStrictTest functorch/test_memory_efficient_fusion
[codemod] markDynamoStrictTest functorch/test_logging
[codemod] markDynamoStrictTest functorch/test_eager_transforms
[codemod] markDynamoStrictTest functorch/test_dims
[codemod] markDynamoStrictTest functorch/test_control_flow
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116892
Approved by: https://github.com/Skylion007
ghstack dependencies: #116879, #116880, #116881
2024-01-05 22:26:20 +00:00
e5f2ac18da [codemod] markDynamoStrictTest batch 12 (#116881)
[codemod] markDynamoStrictTest distributions/test_distributions
[codemod] markDynamoStrictTest distributions/test_constraints
[codemod] markDynamoStrictTest benchmark_utils/test_benchmark_utils
[codemod] markDynamoStrictTest backends/xeon/test_launch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116881
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116879, #116880
2024-01-05 21:59:40 +00:00
7562a00946 Make TORCH_LOGS="dist_ddp" include DDPOptimizer logs (#116794)
Note: ddp_graphs is still 'separate' from log components since it is an
artifact.  Not sure it's possible to enable it by default when dist_ddp
is selected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116794
Approved by: https://github.com/fduwjj
2024-01-05 21:31:42 +00:00
5377b994da [aot_inductor] Retrieve original FQNs for weights (#116157)
Differential Revision: D52303882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116157
Approved by: https://github.com/frank-wei
2024-01-05 21:30:36 +00:00
521dbbfaff Remove cpp/tensorexpr benchmarks (#116868)
Summary: These refer to a deprecated backend of torchscript which is no longer built in releases, and require llvm to be built.

Test Plan:
```
python setup.py develop
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116868
Approved by: https://github.com/hl475, https://github.com/chenyang78, https://github.com/eellison, https://github.com/mikekgfb
2024-01-05 21:23:30 +00:00
99ef47098d Use smaller shapes in lstm test to fix the CI timeout (#116453)
Fixes https://github.com/pytorch/pytorch/issues/108824 by using smaller shapes while keeping the same test scope

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116453
Approved by: https://github.com/huydhn, https://github.com/jgong5
2024-01-05 21:19:56 +00:00
499ca71e49 [codemod] markDynamoStrictTest batch 11 (#116880)
[codemod] markDynamoStrictTest nn/test_pruning
[codemod] markDynamoStrictTest nn/test_pooling
[codemod] markDynamoStrictTest nn/test_parametrization
[codemod] markDynamoStrictTest nn/test_packed_sequence
[codemod] markDynamoStrictTest nn/test_multihead_attention
[codemod] markDynamoStrictTest nn/test_module_hooks
[codemod] markDynamoStrictTest nn/test_lazy_modules
[codemod] markDynamoStrictTest nn/test_init
[codemod] markDynamoStrictTest nn/test_embedding
[codemod] markDynamoStrictTest nn/test_dropout
[codemod] markDynamoStrictTest nn/test_convolution
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116880
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116879
2024-01-05 21:17:43 +00:00
ef7abdbd1a [C10] Mark Complex::imag as C10_HOST_DEVICE (#116877)
It feels weird that `real` is marked as such, but `imag` is not

Found while working on https://github.com/pytorch/pytorch/issues/116628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116877
Approved by: https://github.com/Skylion007
2024-01-05 21:17:05 +00:00
c72d9f5de3 [no ci] Add pytorch-dev-infra as owners of .ci folder (#116901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116901
Approved by: https://github.com/huydhn
2024-01-05 21:15:47 +00:00
0f0020d76f [GHF] Add support for new style stacks (#116873)
In the new style, the stack targets the default branch rather than `base`. But as
the default branch is likely to have advanced since the PR was made, search for
the merge base before determining whether `base`..`head` are in sync with the `orig` branch.
Also, rather than hardcoding the default branch name, fetch it from `GitHubPR.default_branch()`

Test Plan: https://github.com/malfet/deleteme/pull/77

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116873
Approved by: https://github.com/ezyang
2024-01-05 20:32:24 +00:00
71d8fe690f Replace recursive stable_topological_sort() with iterative. (#116761)
Summary:
A graph with a deep chain of nodes caused stable_topological_sort() to recurse deeply and
overflow the stack. Rewrite it to be iterative and avoid recursion.

Fixes #115506
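
For context, a generic iterative sketch of the idea (not the actual inductor code): an explicit stack replaces recursion so a deep graph cannot exhaust Python's recursion limit, and ties are broken by original node order to keep the sort stable.

```python
# Generic sketch only; nodes and dependencies are plain lists/dicts here.
def stable_topological_sort(nodes, deps):
    order = {n: i for i, n in enumerate(nodes)}   # original order breaks ties
    visited, result = set(), []
    for root in nodes:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(sorted(deps.get(root, ()), key=order.get)))]
        while stack:
            node, it = stack[-1]
            for dep in it:
                if dep not in visited:
                    visited.add(dep)
                    stack.append((dep, iter(sorted(deps.get(dep, ()), key=order.get))))
                    break
            else:
                stack.pop()            # all dependencies were emitted first
                result.append(node)
    return result

print(stable_topological_sort(["c", "b", "a"], {"c": ["a", "b"], "b": ["a"]}))
# -> ['a', 'b', 'c']
```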

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116761
Approved by: https://github.com/jansel, https://github.com/oulgen, https://github.com/Skylion007
2024-01-05 20:13:49 +00:00
476e9d5f77 [codemod] markDynamoStrictTest batch 10 (#116879)
[codemod] markDynamoStrictTest test_cpp_extensions_aot_no_ninja
[codemod] markDynamoStrictTest test_cpp_extensions_aot_ninja
[codemod] markDynamoStrictTest test_cpp_api_parity
[codemod] markDynamoStrictTest test_complex
[codemod] markDynamoStrictTest test_compile_benchmark_util
[codemod] markDynamoStrictTest test_comparison_utils
[codemod] markDynamoStrictTest test_bundled_inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116879
Approved by: https://github.com/voznesenskym
2024-01-05 19:46:55 +00:00
764a18016d VSX: Fix vectorized abs function for complex tensors (#116859)
Use a similar approach with Sleef as in #99550
to improve the precision and extremal value handling of the `abs` function for complex tensors.

This fixes
- test_reference_numerics_extremal__refs_abs_cpu_float64
- test_reference_numerics_extremal__refs_abs_cpu_float128

which failed on PPC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116859
Approved by: https://github.com/lezcano
2024-01-05 19:24:42 +00:00
63ee35c4e0 BugFix: Fix F632 bug in dynamo (if statement is always false) (#116867)
This was flagged by a preview ruff check because the if statement always evaluates to false; it was likely a typo between `is` and `in`. I also micro-optimized some list construction into tuple construction, which is semantically identical but faster.
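
A tiny illustration of the class of bug F632 flags (illustrative code, not the actual dynamo snippet):

```python
# `is` compares identity, so testing a type against a tuple with `is` is
# always False; membership needs `in`. This is the is/in typo described above.
def is_number_bad(y):
    return type(y) is (int, float)   # always False

def is_number_good(y):
    return type(y) in (int, float)

print(is_number_bad(3), is_number_good(3))  # False True
```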

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116867
Approved by: https://github.com/lezcano, https://github.com/albanD, https://github.com/yanboliang
2024-01-05 19:15:05 +00:00
d455c33cca [ez][td] Pipe TD logs to log file (#116796)
It is a bit annoying to have them come up when searching through the logs. They're also surprisingly long.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116796
Approved by: https://github.com/huydhn
2024-01-05 19:05:12 +00:00
ebedce24ab [FSDP] enable autograd in forward prefetching (#116792)
**problem**
when prefetching for the next forward, the current forward may be annotated with
`@torch.no_grad`. `param.grad_fn` then stays None during prefetching, so
`_post_backward_hook` never gets triggered.

repro
```pytest test/distributed/fsdp/test_fsdp_freezing_weights.py```

**solution**
this PR enables autograd during prefetching (`_use_unsharded_views`), so
`param.grad_fn` is properly assigned for the next forward

a longer-term fix would be to move `_use_unsharded_views` out of
`_prefetch_handle` and put it in `_pre_forward_unshard`
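
A minimal sketch of the mechanism relied on here (plain tensors, not the actual FSDP internals): re-enabling grad around view creation gives the view a grad_fn even inside a no_grad region.

```python
import torch

base = torch.randn(4, requires_grad=True)

with torch.no_grad():
    # Without enable_grad the view would have grad_fn == None, mirroring the
    # prefetching problem above; wrapping it restores the autograd edge.
    with torch.enable_grad():
        view = base.view(2, 2)

print(view.grad_fn is not None)  # True
```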

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116792
Approved by: https://github.com/awgu
2024-01-05 18:44:27 +00:00
7f124167b5 [BE][Easy]: Update libfmt submodule to 10.2.1 (#116864)
Follow up to #116363. There was an update and 10.2.1 was released that fixes an accidental ABI change in 10.2 with libfmt on windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116864
Approved by: https://github.com/albanD
2024-01-05 18:32:23 +00:00
4b6961a629 [no ci] Fix spelling (#116872)
s/initization/initialization/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116872
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/atalman
2024-01-05 18:04:36 +00:00
0a0209e8a1 [ROCm] Use MI210 CI runners for all trunk commits (#116797)
As a follow-up to https://github.com/pytorch/pytorch/pull/115981

To make sure we catch any regressions/breakages related to flash attention/inductor/etc. functionality that is only enabled for MI210s, we would like to switch the trunk commit CI jobs to always run on MI210 runners. This should help us accurately identify the breaking commits for ROCm CI on the HUD.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116797
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2024-01-05 17:46:38 +00:00
9ac0e6971a Revert "[1/4] Intel GPU Runtime Upstreaming for Device (#116019)"
This reverts commit b4cebe2c34242ceee3a1bc285f426662942a29ac.

Reverted https://github.com/pytorch/pytorch/pull/116019 on behalf of https://github.com/malfet due to Broke internal and periodic buck builds, see https://github.com/pytorch/pytorch/actions/runs/7414664129/job/20176215868 ([comment](https://github.com/pytorch/pytorch/pull/116019#issuecomment-1879030285))
2024-01-05 17:36:39 +00:00
7956ca16e6 Enable reverse view_funcs by default for python subclasses (#116512)
Part 3 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).

Changes codegen to generate `view_func()` / `rev_view_func()` by default for python subclasses. With `view_func()` existing more often now, the lazy view rebase logic [here](f10c3f4184/torch/csrc/autograd/variable.cpp (L665-L695)) causes some slight behavior changes for in-place ops on views:
* Additional view nodes are inserted into output graphs, changing their string representation, although they are functionally the same. The extra nodes are removed in AOTAutograd's DCE pass.
* When `t` is a `FunctionalTensor`, calling `t.grad_fn` will now invoke `view_func()`; we need to make sure we're operating in a `FunctionalTensorMode` so the view op calls succeed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116512
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
ghstack dependencies: #115894
2024-01-05 16:48:12 +00:00
3c21264c9b Introduce reverse view_funcs (#115894)
Part 2 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).

Details:
* Codegen `rev_view_func()` alongside `view_func()`
    * Reverse view_func gives you a "base" from a "view": `rev_view_func(new_view) -> new_base` AKA it plays the original view backwards
* Utilizes the functional inverses defined in `FunctionalInverses.cpp`, passing `InverseReturnMode::AlwaysView`
* Manually implements functional inverses for `narrow()` and `chunk()`
* **NB: Multi-output views now set view_func() / rev_view_func() for each of the output views!**
    * Due to this, the `as_view()` overload that operates on a list of views is scrapped in favor of iteration via codegen

Example codegen in `ADInplaceOrViewTypeN.cpp`:
```cpp
at::Tensor narrow(c10::DispatchKeySet ks, const at::Tensor & self, int64_t dim, c10::SymInt start, c10::SymInt length) {
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::_ops::narrow::redispatch(ks & c10::after_ADInplaceOrView_keyset, self, dim, start, length);
  })();
  std::function<at::Tensor(const at::Tensor&)> func=nullptr;
  std::function<at::Tensor(const at::Tensor&)> rev_func=nullptr;
  if (false || !self.unsafeGetTensorImpl()->support_as_strided() ||
      c10::AutogradState::get_tls_state().get_view_replay_enabled()) {
    func = [=](const at::Tensor& input_base) {
      return at::_ops::narrow::call(input_base, dim, start, length);
    };
    rev_func = [=](const at::Tensor& input_view) {
      // NB: args from narrow() signature are passed along to the inverse
      return at::functionalization::FunctionalInverses::narrow_copy_inverse(self, input_view, at::functionalization::InverseReturnMode::AlwaysView, dim, start, length);
    };
  }
  auto result = as_view(/* base */ self, /* output */ _tmp, /* is_bw_differentiable */ true, /* is_fw_differentiable */ true, /* view_func */ func, /* rev_view_func */ rev_func, /* creation_meta */ InferenceMode::is_enabled() ? CreationMeta::INFERENCE_MODE : (at::GradMode::is_enabled() ? CreationMeta::DEFAULT : CreationMeta::NO_GRAD_MODE));
  return result;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115894
Approved by: https://github.com/soulitzer
2024-01-05 16:48:12 +00:00
053b15c596 [codemod] markDynamoStrictTest batch 9 (#116836)
[codemod] markDynamoStrictTest test_datapipe
[codemod] markDynamoStrictTest test_cuda_trace
[codemod] markDynamoStrictTest test_cuda_sanitizer
[codemod] markDynamoStrictTest test_cuda_primary_ctx
[codemod] markDynamoStrictTest test_cuda_nvml_based_avail
[codemod] markDynamoStrictTest test_cuda_multigpu
[codemod] markDynamoStrictTest test_cuda_expandable_segments
[codemod] markDynamoStrictTest test_cuda
[codemod] markDynamoStrictTest test_cpp_extensions_open_device_registration
[codemod] markDynamoStrictTest test_cpp_extensions_jit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116836
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116802, #116827, #116829, #116834
2024-01-05 16:40:40 +00:00
ee07260337 [codemod] markDynamoStrictTest batch 8 (#116834)
[codemod] markDynamoStrictTest test_flop_counter
[codemod] markDynamoStrictTest test_fake_tensor
[codemod] markDynamoStrictTest test_expanded_weights
[codemod] markDynamoStrictTest test_dynamic_shapes
[codemod] markDynamoStrictTest test_dlpack
[codemod] markDynamoStrictTest test_dispatch
[codemod] markDynamoStrictTest test_deploy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116834
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116802, #116827, #116829
2024-01-05 16:40:24 +00:00
c0da5a4c68 [codemod] markDynamoStrictTest batch 7 (#116829)
[codemod] markDynamoStrictTest test_license
[codemod] markDynamoStrictTest test_itt
[codemod] markDynamoStrictTest test_import_stats
[codemod] markDynamoStrictTest test_hub
[codemod] markDynamoStrictTest test_futures
[codemod] markDynamoStrictTest test_functionalization_of_rng_ops
[codemod] markDynamoStrictTest test_functionalization
[codemod] markDynamoStrictTest test_functional_autograd_benchmark
[codemod] markDynamoStrictTest test_function_schema
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116829
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116802, #116827
2024-01-05 16:33:20 +00:00
6747d1383f [codemod] markDynamoStrictTest batch 6 (#116827)
[codemod] markDynamoStrictTest test_model_exports_to_core_aten
[codemod] markDynamoStrictTest test_model_dump
[codemod] markDynamoStrictTest test_mobile_optimizer
[codemod] markDynamoStrictTest test_mkldnn_verbose
[codemod] markDynamoStrictTest test_mkldnn_fusion
[codemod] markDynamoStrictTest test_mkldnn
[codemod] markDynamoStrictTest test_mkl_verbose
[codemod] markDynamoStrictTest test_meta
[codemod] markDynamoStrictTest test_matmul_cuda
[codemod] markDynamoStrictTest test_logging
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116827
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116802
2024-01-05 16:33:20 +00:00
9543caadc8 [codemod] markDynamoStrictTest batch 5 (#116802)
[codemod] markDynamoStrictTest test_openmp
[codemod] markDynamoStrictTest test_numpy_interop
[codemod] markDynamoStrictTest test_numba_integration
[codemod] markDynamoStrictTest test_nn
[codemod] markDynamoStrictTest test_nestedtensor
[codemod] markDynamoStrictTest test_native_mha
[codemod] markDynamoStrictTest test_native_functions
[codemod] markDynamoStrictTest test_multiprocessing_spawn
[codemod] markDynamoStrictTest test_multiprocessing
[codemod] markDynamoStrictTest test_monitor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116802
Approved by: https://github.com/bdhirsh
2024-01-05 16:33:13 +00:00
0159e3abbd [dynamo] add a handler for itertools_chain_from_iterable and test (#116849)
1. add a handler for itertools_chain_from_iterable
2. a test for itertools_chain_from_iterable

Fixes #116463
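
A hypothetical repro for the new handler (shapes and values are arbitrary):

```python
import itertools
import torch

@torch.compile(fullgraph=True)
def flatten_and_sum(nested):
    # itertools.chain.from_iterable is now handled by dynamo instead of
    # causing a graph break.
    flat = list(itertools.chain.from_iterable(nested))
    return torch.stack(flat).sum()

print(flatten_and_sum([[torch.ones(2), torch.zeros(2)], [torch.full((2,), 2.0)]]))
```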

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116849
Approved by: https://github.com/ezyang
2024-01-05 15:14:18 +00:00
0249c4a785 Add config toggle suggestions for data-dependent/dynamic output shape (#114337)
Fixes https://github.com/pytorch/pytorch/issues/114220

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114337
Approved by: https://github.com/aakhundov
2024-01-05 14:01:01 +00:00
53f8d17d1e Specialize SymNodeVariable when used as module index (#114377)
Fixes #114171

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114377
Approved by: https://github.com/Skylion007
2024-01-05 13:51:52 +00:00
0e8698c3b6 Prevent unbacked symbol reallocation by forcing unification for unbacked symbol def sites (#114368)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114368
Approved by: https://github.com/aakhundov
2024-01-05 13:51:36 +00:00
f692fc9e7f fix typo (#116828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116828
Approved by: https://github.com/Skylion007
2024-01-05 12:35:33 +00:00
5f5405f809 I have seen this deprecation and I am curious if this is the fix (#116714)
Let's see what CI/CD says.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116714
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-01-05 07:02:58 +00:00
79ba39710e [AOTI] Forward fix a Windows build failure (#116790)
Summary: forward fix https://github.com/pytorch/pytorch/pull/116269
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116790
Approved by: https://github.com/khabinov, https://github.com/huydhn
2024-01-05 06:00:58 +00:00
2ccc7af028 Revert "[CPU] Add flash attention mask version (#115913)"
This reverts commit 76a3fbb7092d25638a046c1994030fc8108e5fbf.

Reverted https://github.com/pytorch/pytorch/pull/115913 on behalf of https://github.com/zou3519 due to broke transformer test on dynamo shard ([comment](https://github.com/pytorch/pytorch/pull/115913#issuecomment-1878043389))
2024-01-05 02:39:12 +00:00
bbfd81f513 [codemod] markDynamoStrictTest batch (#116791)
[codemod] markDynamoStrictTest test_sympy_utils
[codemod] markDynamoStrictTest test_serialization
[codemod] markDynamoStrictTest test_segment_reductions
[codemod] markDynamoStrictTest test_schema_check
[codemod] markDynamoStrictTest test_scatter_gather_ops
[codemod] markDynamoStrictTest test_pytree
[codemod] markDynamoStrictTest test_pruning_op
[codemod] markDynamoStrictTest test_per_overload_api
[codemod] markDynamoStrictTest test_out_dtype_op
[codemod] markDynamoStrictTest test_optim
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116791
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116735, #116736, #116739, #116740, #116742, #116743, #116744, #116745
2024-01-05 02:22:53 +00:00
6d9b837c27 Graphbreak when creating a map with unsupported keys (#116460)
As per title. With this, https://github.com/pytorch/pytorch/issues/93697
does not choke, but spits out many of these:
```
[ERROR] Name: "L['self']"
[ERROR]     Source: local
[ERROR]     Create Function: NN_MODULE
[ERROR]     Guard Types: ['ID_MATCH']
[ERROR]     Code List: ["___check_obj_id(L['self'], 139962171127504)"]
[ERROR]     Object Weakref: <weakref at 0x7f4b72f7c9a0; to
'ActorCriticPolicy' at 0x7f4b7b7df6d0>
[ERROR]     Guarded Class Weakref: <weakref at 0x7f4afbd08b30; to
'ABCMeta' at 0x56463a727840 (ActorCriticPolicy)>
[ERROR] Created at:
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 248, in __call__
[ERROR]     vt = self._wrap(value)
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 474, in _wrap
[ERROR]     return self.wrap_module(value)
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 941, in wrap_module
[ERROR]     return self.tx.output.register_attr_or_module(
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/output_graph.py", line
735, in register_attr_or_module
[ERROR]     install_guard(source.make_guard(GuardBuilder.NN_MODULE))
[ERROR] Error while creating guard:
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116460
Approved by: https://github.com/jansel
ghstack dependencies: #116459
2024-01-05 01:48:07 +00:00
7c8f38700a [dynamo] Fix np.issubdtype (#116459)
Fixes the issue described at https://github.com/pytorch/pytorch/issues/93697#issuecomment-1828346590

This doesn't fix the full issue yet, now we hit
```python
  File
  "/home/lezcano/git/pytorch/pytorch/torch/_dynamo/symbolic_convert.py",
  line 744, in step
  getattr(self, inst.opname)(inst)
  File
  "/home/lezcano/git/pytorch/pytorch/torch/_dynamo/symbolic_convert.py",
  line 1366, in BUILD_MAP
      assert (
      AssertionError
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116459
Approved by: https://github.com/peterbell10
2024-01-05 01:48:07 +00:00
76a3fbb709 [CPU] Add flash attention mask version (#115913)
Add a masked-version flash attention for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115913
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-01-05 01:27:36 +00:00
6413511713 [export][refactor][4/n] Make equality_constraints optional (#116233)
Summary: needed to remove equality_constraints eventually :P

Test Plan: CI

Differential Revision: D52351709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116233
Approved by: https://github.com/tugsbayasgalan
2024-01-05 00:50:52 +00:00
db69956feb [Dynamo] Catch ImportError when tracing_rules load objects (#116783)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116783
Approved by: https://github.com/angelayi
2024-01-05 00:26:17 +00:00
b0393ebe9b [MPS] Make test_mps.py passable on Sonoma (#116764)
- Enable Sonoma testing on M2 machines
- Add 70+ ops to the list of supported ones on MacOS Sonoma
- Enable nn.functional.
- Add explicit `TORCH_CHECK` to mark scatter/gather, index_select and linalg ops as not yet supporting Complex, as attempting to call those will crash with various MPS asserts such as:
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: error: 'mps.reduction_min' op operand #0 must be tensor of MPS type values or memref of MPS type values, but got 'tensor<5x5xcomplex<f32>>'
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: note: see current operation: %3 = "mps.reduction_min"(%1, %2) <{keep_dims}> : (tensor<5x5xcomplex<f32>>, tensor<2xsi32>) -> tensor<1x1xcomplex<f32>>
```
- Treat bools as int8 to fix regression re-surfaced in `index_fill` (used to be broken in Monterey, then fixed in Ventura and broken in Sonoma again)
- `nn.functional.max_pool2d` results now match CPU output for uint8 dtype in Sonoma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116764
Approved by: https://github.com/kulinseth, https://github.com/seemethere
2024-01-05 00:25:47 +00:00
d0cf2182ea Fix TransformerEncoderLayer for bias=False (#116760)
Fixes https://github.com/pytorch/pytorch/issues/116385

Don't call `torch._transformer_encoder_layer_fwd` when `bias=False`

`bias=False` was not something that `torch._transformer_encoder_layer_fwd` was meant to work with; it was my bad that this wasn't tested when I approved https://github.com/pytorch/pytorch/pull/101687.

`bias=False` was causing the `tensor_args` in [`TransformerEncoder`](a17de2d645/torch/nn/modules/transformer.py (L663-L677)) to contain `None`s and error on checks for the fastpath like `t.requires_grad for t in tensor_args`.

Alternative fix would be to
1) Pass `torch.zeros_like({*}.weight)` to the kernel when `bias=False` and filter `tensor_args` as appropriate
2) Fix `torch._transformer_encoder_layer_fwd` to take `Optional<Tensor>` for biases and fix the kernels as appropriate

Let me know if these approaches are preferable
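
A hypothetical repro of the failing configuration (dimensions are arbitrary); after this fix the layer simply takes the regular, non-fastpath code path:

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=2, bias=False, batch_first=True)
layer.eval()
with torch.no_grad():
    # Previously the fastpath was entered with None biases and errored on
    # checks like `t.requires_grad for t in tensor_args`.
    out = layer(torch.randn(2, 4, 16))
print(out.shape)  # torch.Size([2, 4, 16])
```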

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116760
Approved by: https://github.com/jbschlosser
2024-01-05 00:13:10 +00:00
e3ca7346ce Re-add initial Flash Attention support on ROCM (#115981)
Note about the Updates:

This PR:
1. Skips more flash-attention-related UTs on MI200
2. Fixes additional ATen compile errors after hipification
3. Fixes the author "root" of a specific commit
4. Includes the patch from Nikita in favor of block-level static initialization.

CAVEAT: This revised PR has a commit that modifies the CI to force its running on MI200 nodes. That specific commit must be reverted before merge.

Original PR (https://github.com/pytorch/pytorch/pull/114309) Note:

This pull request adds initial Flash Attention support for the AMD/ROCm platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCm. This Triton submodule is not used at runtime and will not be shipped in the final PyTorch package. We plan to release this specialized Triton as a separate project.

Known limitations:

- Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- Only supports power-of-two sequence lengths.
- No support for varlen APIs.
- Only supports head dimensions 16, 32, 64, and 128.
- Performance is still being optimized.

Fixes #112997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115981
Approved by: https://github.com/malfet
2024-01-04 22:21:31 +00:00
8195a0aaa7 Move array_of helper to c10/util (#116749)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116749
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #116685
2024-01-04 21:58:32 +00:00
5ac57a06eb [export] Refactor ExportPassBase. (#116778)
Summary:
X-link: https://github.com/pytorch/executorch/pull/1532

As titled. This diff decouples the pass base library from torch export and exir, so that different layers can evolve in their own fashion and we have more headroom to divide and conquer in the future.

Test Plan: CI

Reviewed By: angelayi

Differential Revision: D52514517

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116778
Approved by: https://github.com/angelayi
2024-01-04 21:32:14 +00:00
e7d741b0fd [C10D] Dump cpp stacktraces on heartbeat monitor timeout (#116717)
Summary:
If the heartbeat monitor times out and kills the process, we want to know why.

It's convenient to use an internal tool for this, but we plan to later
integrate with torchelastic to call into pyspy or something else, which will be
both better (including py stacks) and compatible with OSS.

Test Plan: tested manually, observed c++ stacktraces were dumped

Reviewed By: fduwjj

Differential Revision: D52370243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116717
Approved by: https://github.com/zdevito
2024-01-04 21:11:47 +00:00
cyy
d23972df00 Update libfmt submodule to 10.2.0 (#116363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116363
Approved by: https://github.com/ezyang
2024-01-04 19:25:40 +00:00
70f3a530d7 [AOTI] Add pybind for AOTIModelContainerRunnerCpu and AOTIModelContainerRunnerCuda (#116269)
Summary: Now we can allocate an AOTIModelContainerRunner object instead of relying on torch.utils.cpp_extension.load_inline. Also renamed AOTInductorModelRunner to AOTIRunnerUtil in this PR.

Test Plan: CI

Reviewed By: khabinov

Differential Revision: D52339116

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116269
Approved by: https://github.com/khabinov
2024-01-04 18:58:24 +00:00
56d7a47806 [BE] Use precompiled headers to speedup clang-tidy (#116780)
This brings the time down by 30% (from [30](https://github.com/pytorch/pytorch/actions/runs/7412899917/job/20170674075#step:11:64) min to [20](https://github.com/pytorch/pytorch/actions/runs/7413082213/job/20171286833?pr=116780#step:11:64) min)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116780
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2024-01-04 18:37:44 +00:00
39f8853313 [inductor] Use max sm clock when calculating device tflops (#116754)
See openai/triton#2801

Current SM clocks may fluctuate at runtime and change the result of
`get_device_tflops`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116754
Approved by: https://github.com/lezcano
2024-01-04 17:38:21 +00:00
6793b99107 [BugFix] Fix SegFault when torch.all/any dispatched to mps or other backends (#116457)
The old implementation results in an infinite recursive loop, leading to a stack overflow and a segfault.

If TORCH_SHOW_DISPATCH_TRACE is on with a debug build of PyTorch, we can see the following endless output in the terminal:
```
[call] op=[aten::quantize_per_tensor], key=[AutogradCPU]
  [redispatch] op=[aten::quantize_per_tensor], key=[CPU]
 [call] op=[aten::any.dims], key=[AutogradCPU]
  [redispatch] op=[aten::any.dims], key=[QuantizedCPU]
   [call] op=[aten::empty.memory_format], key=[BackendSelect]
    [redispatch] op=[aten::empty.memory_format], key=[CPU]
   [call] op=[aten::any.dims_out], key=[QuantizedCPU]
    [call] op=[aten::any.dims], key=[QuantizedCPU]
     [call] op=[aten::empty.memory_format], key=[BackendSelect]
      [redispatch] op=[aten::empty.memory_format], key=[CPU]
     [call] op=[aten::any.dims_out], key=[QuantizedCPU]
      [call] op=[aten::any.dims], key=[QuantizedCPU]
       [call] op=[aten::empty.memory_format], key=[BackendSelect]
        [redispatch] op=[aten::empty.memory_format], key=[CPU]
       [call] op=[aten::any.dims_out], key=[QuantizedCPU]
        [call] op=[aten::any.dims], key=[QuantizedCPU]
         [call] op=[aten::empty.memory_format], key=[BackendSelect]
          [redispatch] op=[aten::empty.memory_format], key=[CPU]
         [call] op=[aten::any.dims_out], key=[QuantizedCPU]
          [call] op=[aten::any.dims], key=[QuantizedCPU]
           [call] op=[aten::empty.memory_format], key=[BackendSelect]
            [redispatch] op=[aten::empty.memory_format], key=[CPU]
           [call] op=[aten::any.dims_out], key=[QuantizedCPU]
            [call] op=[aten::any.dims], key=[QuantizedCPU]
             [call] op=[aten::empty.memory_format], key=[BackendSelect]
              [redispatch] op=[aten::empty.memory_format], key=[CPU]
             [call] op=[aten::any.dims_out], key=[QuantizedCPU]
              [call] op=[aten::any.dims], key=[QuantizedCPU]
               [call] op=[aten::empty.memory_format], key=[BackendSelect]
                [redispatch] op=[aten::empty.memory_format], key=[CPU]
               [call] op=[aten::any.dims_out], key=[QuantizedCPU]
                [call] op=[aten::any.dims], key=[QuantizedCPU]
                 [call] op=[aten::empty.memory_format], key=[BackendSelect]
                  [redispatch] op=[aten::empty.memory_format], key=[CPU]
                 [call] op=[aten::any.dims_out], key=[QuantizedCPU]
                  [call] op=[aten::any.dims], key=[QuantizedCPU]
.....
.....
.....
```

Fixes #116452
Fixes #116451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116457
Approved by: https://github.com/malfet
2024-01-04 17:37:17 +00:00
b4cebe2c34 [1/4] Intel GPU Runtime Upstreaming for Device (#116019)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), The first runtime component we would like to upstream is `Device` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`.

# Design
An Intel GPU device is a wrapper around a SYCL device on which kernels can be executed. In our design, we maintain a SYCL device pool containing all the GPU devices of the current machine, and PyTorch manages the status of the device pool. Thread-local safety is considered in this design. The corresponding C++ files related to `Device` will be placed in the c10/xpu folder. And we provide the c10 device runtime APIs, like
  - `c10::xpu::device_count`
  - `c10::xpu::set_device`
  - ...

# Additional Context
In our plan, 4 PRs should be submitted to PyTorch for `Device`:
1. for c10
2. for aten
3. for python frontend
4. for lazy initialization shared with CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019
Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
2024-01-04 17:35:04 +00:00
43fb1b671c [export] Improve verifier to not specialize on dialect. (#116705)
Summary:
Currently we have a very ugly specialization on edge dialect in verifier like the following:
```
 # TODO Remove this branch.
            if ep.dialect == "EDGE":  # !!! Don't change this allowlist. !!!
                pass
            else:
                raise e
```
In this diff we do some additional work to make signature checking also work in exir. We decouple the transformation stack in torch export and exir so that different layers of the stack can evolve in their own fashion and the team can divide and conquer them separately.

Test Plan: CI

Differential Revision: D52499225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116705
Approved by: https://github.com/tugsbayasgalan
2024-01-04 17:17:23 +00:00
f1a393c029 [codemod] markDynamoStrictTest batch (#116745)
- test_show_pickle
- test_set_default_mobile_cpu_allocator
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116745
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736, #116739, #116740, #116742, #116743, #116744
2024-01-04 15:04:18 +00:00
311548b79c [codemod] markDynamoStrictTest test_sort_and_select (#116744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116744
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736, #116739, #116740, #116742, #116743
2024-01-04 15:04:18 +00:00
30f0a05207 [codemod] markDynamoStrictTest test_stateless (#116743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116743
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736, #116739, #116740, #116742
2024-01-04 15:03:21 +00:00
46b44fb246 [codemod] markDynamoStrictTest test_subclass (#116742)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116742
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736, #116739, #116740
2024-01-04 15:02:46 +00:00
c2174974ae [codemod] markDynamoStrictTest test_tensor_creation_ops (#116740)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116740
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736, #116739
2024-01-04 15:02:03 +00:00
7c5704fc00 [codemod] markDynamoStrictTest test_tensorboard (#116739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116739
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736
2024-01-04 15:01:25 +00:00
caa33e1eb1 [codemod] markDynamoStrictTest test_testing (#116736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116736
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735
2024-01-04 15:01:07 +00:00
882d1f4ea6 [codemod] markDynamoStrictTest test_transformers (#116735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116735
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734
2024-01-04 15:00:23 +00:00
eb958d7552 Fix bug in unflatten pytree (#116750)
Summary: Title

Test Plan: CI

Differential Revision: D52529088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116750
Approved by: https://github.com/zhxchen17
2024-01-04 14:23:40 +00:00
75dae4f691 Revert "[dynamo] Fix np.issubdtype (#116459)"
This reverts commit b5c33ccdb3198a48a354e21a4fdace0ec6d04146.

Reverted https://github.com/pytorch/pytorch/pull/116459 on behalf of https://github.com/zou3519 due to Broke CI, seems to be a landrace ([comment](https://github.com/pytorch/pytorch/pull/116459#issuecomment-1877135999))
2024-01-04 14:00:11 +00:00
3a0f6897c5 Revert "Graphbreak when creating a map with unsupported keys (#116460)"
This reverts commit c2a020a2184982361a712bbb1e9766caba26dba6.

Reverted https://github.com/pytorch/pytorch/pull/116460 on behalf of https://github.com/zou3519 due to I think the bottom PR broke CI ([comment](https://github.com/pytorch/pytorch/pull/116460#issuecomment-1877132374))
2024-01-04 13:56:57 +00:00
c2a020a218 Graphbreak when creating a map with unsupported keys (#116460)
As per title. With this, https://github.com/pytorch/pytorch/issues/93697
does not choke, but spits out many of these:
```
[ERROR] Name: "L['self']"
[ERROR]     Source: local
[ERROR]     Create Function: NN_MODULE
[ERROR]     Guard Types: ['ID_MATCH']
[ERROR]     Code List: ["___check_obj_id(L['self'], 139962171127504)"]
[ERROR]     Object Weakref: <weakref at 0x7f4b72f7c9a0; to
'ActorCriticPolicy' at 0x7f4b7b7df6d0>
[ERROR]     Guarded Class Weakref: <weakref at 0x7f4afbd08b30; to
'ABCMeta' at 0x56463a727840 (ActorCriticPolicy)>
[ERROR] Created at:
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 248, in __call__
[ERROR]     vt = self._wrap(value)
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 474, in _wrap
[ERROR]     return self.wrap_module(value)
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 941, in wrap_module
[ERROR]     return self.tx.output.register_attr_or_module(
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/output_graph.py", line
735, in register_attr_or_module
[ERROR]     install_guard(source.make_guard(GuardBuilder.NN_MODULE))
[ERROR] Error while creating guard:
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116460
Approved by: https://github.com/jansel
ghstack dependencies: #116459
2024-01-04 12:36:31 +00:00
81f98f1082 Experimental non-strict mode (#114658)
This is a proof-of-concept implementation of how people can use a `mark_strict` marker to enable torchdynamo while exporting under non-strict mode. The main idea is that `mark_strict` will turn into an HOO which then utilizes dynamo to do correctness analysis in the same way torch.cond works today. There are some notable limitations:
1. This API is not meant for public use yet
2. Strict region can't work with arbitrary container inputs
3. We don't preserve `nn_module_stack` and other node metadata for the strict region.
4. strict_mode HOO will show up in the final graph. This is undesirable in the long term, but for short term experiments, it should be good enough. Will fix this in the follow up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114658
Approved by: https://github.com/ydwu4
2024-01-04 12:24:58 +00:00
cyy
91bbcf8c71 [1/N] replace THPUtils_assert with TORCH_CHECK (#116675)
This PR replaces THPUtils_assert with TORCH_CHECK.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116675
Approved by: https://github.com/albanD
2024-01-04 11:15:33 +00:00
faea6f2c7a [C10D] Make heartbeat_ atomic (#116702)
Summary:
Currently, the code is working. We know this because we observe heartbeat
timeouts.

However, there is a chance that if the code were refactored, the compiler could
optimize away the load of heartbeat_ inside heartbeatMonitor, and we wouldn't
know.

Using an atomic here is not really for thread synchronization, but more to ensure
compiler optimizations (hoisting the read outside the loop) can never be
allowed to happen. Again, we know this isn't currently happening because if it
were, it would not be an intermittent failure; it would fail consistently
(at least with a fixed compiler/platform).

I previously avoided an atomic because we didn't want shared locks between the heartbeat
monitor and the watchdog thread. Why? If the watchdog held the lock and hung, the monitor
could also hang. However, this really can't happen (AFAIK) when using an
atomic.

Test Plan: existing CI tests

Differential Revision: D52378257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116702
Approved by: https://github.com/fduwjj, https://github.com/zdevito
2024-01-04 06:06:32 +00:00
2bdc2a68cb [ez][td] Fix for emit metrics can't find JOB_NAME (#116748)
After #113884
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116748
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-04 05:31:25 +00:00
670e7992fd [Easy] Document AGGRESSIVE_RECOMPUTATION flag in min-cut partitioner (#114007)
As titled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114007
Approved by: https://github.com/wanchaol
2024-01-04 05:05:08 +00:00
a8a9695047 Move promoteTypes to cpp file (#116685)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116685
Approved by: https://github.com/albanD
2024-01-04 04:42:14 +00:00
f071687ef1 Clean up macOS x86 CI build and test jobs (#116725)
We're ready to pull the plug on MacOX x86 build and test jobs on CI.

* [ ] https://github.com/pytorch/pytorch/pull/116725
* [ ] https://github.com/pytorch/pytorch/pull/116726

More details is at https://github.com/pytorch/pytorch/issues/114602
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116725
Approved by: https://github.com/malfet, https://github.com/seemethere
2024-01-04 04:26:32 +00:00
9b88354b80 [executorch hash update] update the pinned executorch hash (#116668)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116668
Approved by: https://github.com/pytorchbot
2024-01-04 04:12:25 +00:00
b5c33ccdb3 [dynamo] Fix np.issubdtype (#116459)
Fixes the issue described at https://github.com/pytorch/pytorch/issues/93697#issuecomment-1828346590

This doesn't fix the full issue yet, now we hit
```python
  File
  "/home/lezcano/git/pytorch/pytorch/torch/_dynamo/symbolic_convert.py",
  line 744, in step
  getattr(self, inst.opname)(inst)
  File
  "/home/lezcano/git/pytorch/pytorch/torch/_dynamo/symbolic_convert.py",
  line 1366, in BUILD_MAP
      assert (
      AssertionError
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116459
Approved by: https://github.com/peterbell10
2024-01-04 03:55:50 +00:00
e2359f72c8 [BE]: Update ruff to 0.1.11 (#116704)
Updates ruff to 0.1.11
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116704
Approved by: https://github.com/malfet
2024-01-04 03:35:45 +00:00
e70dfe07f6 [audio hash update] update the pinned audio hash (#116747)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116747
Approved by: https://github.com/pytorchbot
2024-01-04 03:27:48 +00:00
c14a0b6c84 [codemod] markDynamoStrictTest batch (#116734)
- test_type_promotion
- test_type_info
- test_type_hints
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116734
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733
2024-01-04 03:18:06 +00:00
bfb9df3684 [codemod] markDynamoStrictTest batch (#116733)
- test_weak
- test_view_ops
- test_typing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116733
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732
2024-01-04 03:18:06 +00:00
a308a25fb7 [codemod] markDynamoStrictTest batch (#116732)
- torch_np/numpy_tests/core/test_getlimits
- torch_np/numpy_tests/core/test_einsum
- torch_np/numpy_tests/core/test_dtype
- torch_np/numpy_tests/core/test_dlpack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116732
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731
2024-01-04 03:17:57 +00:00
9255f55767 [codemod] markDynamoStrictTest batch (#116731)
- torch_np/numpy_tests/core/test_numerictypes
- torch_np/numpy_tests/core/test_numeric
- torch_np/numpy_tests/core/test_indexing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116731
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730
2024-01-04 03:17:47 +00:00
1f7badd856 [codemod] markDynamoStrictTest batch (#116730)
- torch_np/numpy_tests/core/test_scalarinherit
- torch_np/numpy_tests/core/test_scalar_methods
- torch_np/numpy_tests/core/test_scalar_ctors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116730
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729
2024-01-04 03:17:39 +00:00
d1d6b90a1b [codemod] markDynamoStrictTest torch_np/numpy_tests/core/test_scalarmath (#116729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116729
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728
2024-01-04 03:17:29 +00:00
3ba35548c3 [codemod] markDynamoStrictTest torch_np/numpy_tests/core/test_shape_base (#116728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116728
Approved by: https://github.com/voznesenskym
2024-01-04 03:17:22 +00:00
3acb7972b0 [BE] Test CrossEntropyLoss for torch.half (#116681)
To test it on MPS and CUDA devices
Also, move some float64 skip-tests for MPS to xfail, same as CPU tests for torch.half
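
A minimal sketch of the kind of half-precision check being added (guarded on CUDA availability; shapes are arbitrary):

```python
import torch
import torch.nn as nn

if torch.cuda.is_available():
    loss = nn.CrossEntropyLoss()
    logits = torch.randn(4, 10, device="cuda", dtype=torch.half)
    target = torch.randint(0, 10, (4,), device="cuda")
    # The loss should run end-to-end in torch.half on CUDA (and MPS).
    print(loss(logits, target))
```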
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116681
Approved by: https://github.com/xuzhao9, https://github.com/mikaylagawarecki
2024-01-04 02:16:09 +00:00
6fece41e9a [codemod][lowrisk] Remove extra semi colon from caffe2/c10/util/Float8_e5m2.h (#115761)
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D51995078

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115761
Approved by: https://github.com/Skylion007
2024-01-04 02:02:26 +00:00
5395331644 Avoid GIL during exit (#116709)
Stacks recorded when tensors are being freed during exit could
try to acquire the GIL. Py_IsInitialized can be used to check if we
are post Python exit and should not attempt to acquire the GIL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116709
Approved by: https://github.com/aaronenyeshi
2024-01-04 01:56:44 +00:00
4926146537 [Inductor] Fix Conv Binary Inplace Fusion issue (#115153)
**Summary**
Take this Pattern as example
```
  #      ReLU
  #     /   \
  #  Conv1
  #   /      \
  # Conv2
  #   \      /
  #      Add
```
The current `ConvBinaryInplace` check will fail to perform Inplace fusion (using outplace fusion instead) due to `ReLU` having 2 users. However, if all users of `ReLU` are ancestor nodes of `Conv2`, we should be able to proceed with the `ConvBinaryInplace` fusion. This diff relaxes the `ConvBinaryInplace` check accordingly.
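
A hypothetical module matching the diagram above (channel counts and sizes are arbitrary); whether the in-place variant actually fires also depends on the CPU backend and freezing settings:

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 3, 3, padding=1)
        self.conv2 = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, x):
        y = torch.relu(x)                      # ReLU has two users: conv1 and the add
        return self.conv2(self.conv1(y)) + y   # Conv1 -> Conv2 -> Add, plus the ReLU branch

compiled = torch.compile(M().eval())
with torch.no_grad():
    print(compiled(torch.randn(1, 3, 8, 8)).shape)
```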

**TestPlan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_pass_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_failed_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115153
Approved by: https://github.com/CaoE, https://github.com/jgong5
2024-01-04 01:06:27 +00:00
ce2df3f690 [HigherOrderOp] set set_subgraph_inputs to flatten_manual for map (#115853)
We change manually_set_subgraph_inputs to three modes: manual, automatic and flatten_manual. The flatten_manual mode will first flatten the sub_args and then recursively call set_subgraph_inputs = "manual". This allows us to control the order in which placeholders show up in the graph, which is necessary for map, where we want to keep the mapped arguments before the rest of the positional arguments.

Right now, map only takes a single tensor as the mapped argument, but it becomes pretty easy to match the subgraph inputs to the original proxies if we have a "flatten_manual" option.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115853
Approved by: https://github.com/zou3519
2024-01-04 00:27:07 +00:00
a2f3770b24 [BE] Remove arch -arch arm64 (#116724)
It was needed back in the day when there were no arm64 runner daemon binaries, so the trick was needed to execute native arm64 tests when invoked from an x86 runner daemon.

Followup after  https://github.com/pytorch/pytorch/pull/116680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116724
Approved by: https://github.com/huydhn
2024-01-03 23:59:53 +00:00
4e330882da [inductor] Add ABI shim function for torch.scatter_reduce (#116700)
Ran into the following exception during C++ file compilation.
```
error: use of undeclared identifier 'aoti_torch_scatter_reduce_out'
    aoti_torch_scatter_reduce_out(buf12, buf12,0,buf13,buf14, "sum",1);
    ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116700
Approved by: https://github.com/aakhundov
2024-01-03 23:43:44 +00:00
a75b587803 [codemod] markDynamoStrictTest torch_np/numpy_tests/fft/test_helper (#116654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116654
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648, #116649, #116650, #116651, #116652, #116653
2024-01-03 23:03:06 +00:00
f3e2661555 [codemod] markDynamoStrictTest torch_np/numpy_tests/fft/test_pocketfft (#116653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116653
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648, #116649, #116650, #116651, #116652
2024-01-03 23:03:06 +00:00
bf4c1a3d66 [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_arraypad (#116652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116652
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648, #116649, #116650, #116651
2024-01-03 23:03:06 +00:00
f4168c0e2e [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_arraysetops (#116651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116651
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648, #116649, #116650
2024-01-03 23:03:06 +00:00
dab1599d81 [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_function_base (#116650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116650
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648, #116649
2024-01-03 23:03:06 +00:00
8a76c07b98 [threaded pg] add devices to avoid seeing warnings (#116678)
This PR adds devices to register_backend of the multithreaded pg, to avoid
seeing tons of warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116678
Approved by: https://github.com/awgu, https://github.com/XilunWu
ghstack dependencies: #116426, #116559, #116573
2024-01-03 23:01:19 +00:00
b10cb168a7 [tp] disable some assertion temporarily for torch.compile (#116573)
Disable some runtime assertions first, as they do not work properly with
torch.compile. I'll have a follow-up fix in dynamo and re-enable
this check again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116573
Approved by: https://github.com/awgu, https://github.com/XilunWu
ghstack dependencies: #116426, #116559
2024-01-03 23:01:19 +00:00
7309f6fdf0 Remove hardcoding arch to arm64 (#116680)
https://github.com/pytorch/pytorch/pull/116627 hardcodes arch to arm64 and it's failing on x86 GitHub runner (yup, they are still there on periodic, we haven't pulled the plug yet).

https://github.com/pytorch/pytorch/actions/runs/7392059632/job/20112760709#step:2:12 is an example failure.

There is no need to set the arch here because it has already been set earlier in the workflow https://github.com/pytorch/pytorch/blob/main/.github/workflows/_mac-test.yml#L47

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116680
Approved by: https://github.com/seemethere
2024-01-03 22:42:14 +00:00
f6be25bae6 [inductor] Add shape checks to ExpandView (#113839)
Currently `ExpandView` doesn't check that the expanded shape is valid, which may
allow bugs to slip through and cause silent correctness issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113839
Approved by: https://github.com/ezyang
2024-01-03 22:31:43 +00:00
1c69d0bdb5 Revert "[11/N] Enable clang-tidy warnings on c10/util/*.h (#116353)"
This reverts commit 37aae5932c26c3729d68b6ebdf00e618fe229b1c.

Reverted https://github.com/pytorch/pytorch/pull/116353 on behalf of https://github.com/izaitsevfb due to Reverting, breaks internal builds: error: implicit conversion from 'long long' to 'float' may lose precision [-Werror,-Wimplicit-int-float-conversion] ([comment](https://github.com/pytorch/pytorch/pull/116353#issuecomment-1876045800))
2024-01-03 22:22:11 +00:00
0aa50909f3 Revert "[12/N] Apply clang-tidy and fix warnings in headers of torch/csrc (#116486)"
This reverts commit 5aa258eb09d5ecd62aea4d2bd02bbfa5eda0d554.

Reverted https://github.com/pytorch/pytorch/pull/116486 on behalf of https://github.com/izaitsevfb due to Reverting, as it depends on https://github.com/pytorch/pytorch/pull/116353, which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/116486#issuecomment-1876042948))
2024-01-03 22:18:54 +00:00
791db94c62 Revert "[13/N] Enable clang-tidy on headers of torch/csrc (#116560)"
This reverts commit b0629cdd67ea5dd264250262e0af75579ed26952.

Reverted https://github.com/pytorch/pytorch/pull/116560 on behalf of https://github.com/izaitsevfb due to Reverting, as it depends on #116353, which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/116560#issuecomment-1876033363))
2024-01-03 22:08:40 +00:00
71523c2289 Add 116583 to .git-blame-ignore-revs (#116676)
since #116583 is purely cosmetic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116676
Approved by: https://github.com/janeyx99
2024-01-03 19:37:31 +00:00
9693b3740b [easy] [c10d] Add documentation for the device_id parameter for init_process_group (#116222)
Follow-up to add missing docs for #114916
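
A hypothetical single-process illustration of the documented parameter (assumes a CUDA build with NCCL; the rendezvous values and device index are placeholders):

```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# device_id lets the process group bind to a specific device up front
# instead of inferring it later.
dist.init_process_group(
    backend="nccl",
    world_size=1,
    rank=0,
    device_id=torch.device("cuda:0"),
)
dist.destroy_process_group()
```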

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116222
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2024-01-03 19:32:18 +00:00
f543093e06 [ONNX] Fix output mismatch issue of repeat_interleave when dim is None (#116689)
'input' is introduced but it's mixed with 'self' in repeat_interleave, which causes the mismatch issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116689
Approved by: https://github.com/thiagocrepaldi
2024-01-03 18:38:00 +00:00
68105da229 Revert "[Dynamo] Trace autograd.function in dynamo when inputs require grad (#116358)"
This reverts commit 97891b184c12763f335fbe1ff63fab843edafab5.

Reverted https://github.com/pytorch/pytorch/pull/116358 on behalf of https://github.com/izaitsevfb due to Breaks internal accuracy test, see D52491095, pytorch/benchmark/fb/test_gpu:run_test_gpu - test_train_ig_feed_over_inductor_accuracy  ([comment](https://github.com/pytorch/pytorch/pull/116358#issuecomment-1875779697))
2024-01-03 18:20:51 +00:00
68b77311ad Fix bug in non-strict input processor (#116674)
Summary: Title

Test Plan: CI

Reviewed By: zhxchen17

Differential Revision: D52499932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116674
Approved by: https://github.com/tugsbayasgalan
2024-01-03 18:13:25 +00:00
1429c204f8 Increase hub download chunk size (#116536)
This PR increases the read size for the `hub.download_url_to_file` function from 8,192 bytes to 131,072 bytes (128 * 1,024), as reading in larger chunks should be more efficient. The size could probably be larger still, at the expense of the progress bar not getting updated as often.

It re-introduces use of the `READ_DATA_CHUNK` constant that was originally used for this purpose in 4a3baec961 and since forgotten.
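
A generic sketch of the chunked read loop described here (not the actual torch.hub code; the constant mirrors the new 128 KiB size):

```python
import urllib.request

READ_DATA_CHUNK = 128 * 1024  # 131,072 bytes

def download_to_file(url, dst):
    # Read the response in fixed-size chunks; a larger chunk means fewer
    # read calls and fewer progress-bar updates.
    with urllib.request.urlopen(url) as src, open(dst, "wb") as f:
        while True:
            buf = src.read(READ_DATA_CHUNK)
            if not buf:
                break
            f.write(buf)
```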

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116536
Approved by: https://github.com/NicolasHug
2024-01-03 17:38:45 +00:00
c919935cb7 [export] Update schema versioning format. (#116462)
Summary: Update the old versioning scheme to a major and minor version.

Test Plan: CI

Differential Revision: D52431963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116462
Approved by: https://github.com/tugsbayasgalan
2024-01-03 17:34:58 +00:00
2ae55e99fe [release] Add Launch Execution XFN meeting process to release runbook (#116701)
Make sure we have this process documented in the runbook.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116701
Approved by: https://github.com/seemethere
2024-01-03 17:16:18 +00:00
d2fc00d2cc [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_histograms (#116649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116649
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648
2024-01-03 17:00:32 +00:00
2d1011d84f [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_index_tricks (#116648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116648
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647
2024-01-03 17:00:32 +00:00
c47ab693ff [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_shape_base_ (#116647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116647
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646
2024-01-03 17:00:23 +00:00
6a300bd1c6 [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_twodim_base (#116646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116646
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645
2024-01-03 17:00:13 +00:00
34a8c64c92 [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_type_check (#116645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116645
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644
2024-01-03 17:00:07 +00:00
fe287af812 [codemod] markDynamoStrictTest torch_np/numpy_tests/linalg/test_linalg (#116644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116644
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643
2024-01-03 16:59:59 +00:00
28a8e4bdb6 [codemod] markDynamoStrictTest torch_np/test_basic (#116643)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116643
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642
2024-01-03 16:59:50 +00:00
146426a0df [codemod] markDynamoStrictTest torch_np/test_binary_ufuncs (#116642)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116642
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641
2024-01-03 16:59:41 +00:00
efe3b7f457 [codemod] markDynamoStrictTest torch_np/test_dtype (#116641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116641
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640
2024-01-03 16:59:32 +00:00
d760014b9f [codemod] markDynamoStrictTest torch_np/test_function_base (#116640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116640
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639
2024-01-03 16:59:25 +00:00
efee9e689e [codemod] markDynamoStrictTest torch_np/test_ndarray_methods (#116639)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116639
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673
2024-01-03 16:59:19 +00:00
608091e4d1 [codemod] markDynamoStrictTest torch_np/numpy_tests/core/test_multiarray (#116673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116673
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116638
2024-01-03 16:59:12 +00:00
70eb53505b [export] Update range constraints to runtime_var_to_range (#115427)
Updated range_constraints to be the union of shape_env.var_to_range and shape_env.runtime_var_to_range, with shape_env.runtime_var_to_range taking priority.

Due to 0/1 specialization, if we bound an unbacked symint to be less than 5, the range of possible values for this symint is actually recorded as [2, 5] in shape_env.var_to_range. To fix this so that users will be able to see a more understandable range of [0, 5], shape_env.runtime_var_to_range was created to store the range of [0, 5]. Since range_constraints is a user-facing attribute to query the ranges of certain symints, we want to use shape_env.runtime_var_to_range to get the unbacked symints ranges, rather than shape_env.var_to_range.

Additionally, run_decompositions() has an issue where it will always add assertions to the graph, even if a previous run has already added the assertions. So, I added a part to the AddRuntimeAssertionsForInlineConstraints which will store which assertions have already been added.
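
A minimal sketch of the merge described above, assuming both mappings are plain dicts from symbol to range; the variable contents are illustrative, not the actual export internals.

```python
var_to_range = {"u0": (2, 5), "s0": (2, 128)}   # 0/1-specialized ranges
runtime_var_to_range = {"u0": (0, 5)}           # user-facing runtime ranges

# runtime_var_to_range takes priority on conflicts
range_constraints = {**var_to_range, **runtime_var_to_range}
assert range_constraints["u0"] == (0, 5)    # unbacked symint shows the runtime range
assert range_constraints["s0"] == (2, 128)  # other symbols fall through unchanged
```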

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115427
Approved by: https://github.com/zhxchen17
2024-01-03 16:55:04 +00:00
f081c45a34 Add out_dtype support for sparse semi-structured CUTLASS back-end (#116519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116519
Approved by: https://github.com/cpuhrsch
2024-01-03 16:23:17 +00:00
ba06951c66 [BE] [cuDNN] Always build assuming cuDNN >= 8.1 (#95722)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 27084ed</samp>

This pull request simplifies and cleans up the code that uses the cuDNN library for convolution, batch normalization, CTC loss, and quantized operations. It removes the unnecessary checks and conditions for older cuDNN versions and the experimental cuDNN v8 API, and ~~replaces them with the stable `cudnn_frontend` API that requires cuDNN v8 or higher. It also adds the dependency and configuration for the `cudnn_frontend` library in the cmake and bazel files.~~ Correction: The v7 API will still be available with this PR, and can still be used, without any changes to the defaults. This change simply always _builds_ the v8 API, and removes the case where _only_ the v7 API is built.

This is a re-land of https://github.com/pytorch/pytorch/pull/91527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95722
Approved by: https://github.com/malfet, https://github.com/atalman
2024-01-03 15:41:28 +00:00
3407541b0c add cpu inductor merge rule (#116679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116679
Approved by: https://github.com/huydhn
2024-01-03 15:09:36 +00:00
b57d473091 [codemod] markDynamoStrictTest torch_np/test_nep50_examples (#116638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116638
Approved by: https://github.com/bdhirsh
2024-01-03 14:45:43 +00:00
49de03f0fd adapt to other acceleration devices (#116682)
Fixes #116504

When this API is invoked, a runtime error occurs: with an NPU acceleration device, one branch does not process the input tensor, so some input tensors end up on the CPU and some on the NPU, and an error is reported.
Here, I adapt the code to other acceleration devices and move tensors that live on the acceleration device to the CPU. This has been tested and works.

The details are in issue #116504.
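
A minimal sketch of the device-agnostic fallback described above; the helper name is hypothetical, and the point is only that any tensor living on a non-CPU accelerator is moved back to the CPU before the CPU-only code path runs.

```python
import torch

def _ensure_cpu(t: torch.Tensor) -> torch.Tensor:
    # Works for CUDA, NPU, or any other accelerator: anything not already on
    # the CPU is copied back before it is mixed with CPU tensors.
    return t if t.device.type == "cpu" else t.cpu()
```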

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116682
Approved by: https://github.com/lezcano
2024-01-03 12:41:19 +00:00
c1b88723f8 Fix buck build after recent clang-tidy updates (#116669)
Broken after either https://github.com/pytorch/pytorch/pull/116486 or https://github.com/pytorch/pytorch/pull/116353 I think.  Here is an example build failure 0bc21c6a6b
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116669
Approved by: https://github.com/Skylion007
2024-01-03 09:02:58 +00:00
2a87ab4508 Refactor some tests by using TEST_CUDA & TEST_MULTIGPU instead (#116083)
As https://github.com/pytorch/pytorch/pull/116014#discussion_r1430510759 stated, refactor some related tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116083
Approved by: https://github.com/fduwjj
2024-01-03 08:53:59 +00:00
d9c0e37bab [2d] unflatten_tensor on compute stream for DTensorExtension (#116559)
Context: The existing FSDPExtension has a bug when unflattening a tensor
involves compute/communication on a CUDA stream. The current FSDPExtension
logic unflattens the tensor on the unshard stream, which makes the runtime
lose synchronization with the compute stream; if there are dependencies
between the compute stream and the unflatten logic, the missing sync point
can lead to NaNs.

This PR makes the FSDPExtension record the compute stream and lets the
DTensorExtension use the compute stream directly for unflatten_tensor.

In the long term we might want the FSDP runtime to perform only the unshard
on the unshard stream and create the unshard views on the compute stream.
We currently fix this in the extension directly, as that is the simplest
change that does not affect the FSDP runtime logic.
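
A rough sketch of the idea, with hypothetical class and method names: the extension records the compute stream once and runs the unflatten logic under it, so any compute/communication issued there stays ordered with respect to the compute stream.

```python
import torch

class DTensorExtensionSketch:
    def __init__(self) -> None:
        # Recorded by FSDP at init time; None on CPU-only runs.
        self.compute_stream = (
            torch.cuda.current_stream() if torch.cuda.is_available() else None
        )

    def unflatten_tensor(self, flat_param):
        if self.compute_stream is None:
            return self._unflatten(flat_param)
        # Run the (possibly communicating) unflatten on the compute stream
        # instead of the unshard stream, keeping the sync point intact.
        with torch.cuda.stream(self.compute_stream):
            return self._unflatten(flat_param)

    def _unflatten(self, flat_param):
        raise NotImplementedError  # placeholder for the real DTensor logic
```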

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116559
Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/yifuwang
ghstack dependencies: #116426
2024-01-03 07:29:08 +00:00
29674b8e1d [dtensor] fix dtensor _to_copy op for mix precision (#116426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116426
Approved by: https://github.com/fduwjj
2024-01-03 07:29:08 +00:00
b0749bce6c [export] Allow None as the meta value for tensor output. (#116664)
Summary: Sometimes we get a None value from ops whose schema declares a Tensor return type. Allow this case during serialization.

Test Plan: test__scaled_dot_product_flash_attention

Differential Revision: D52491668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116664
Approved by: https://github.com/SherlockNoMad
2024-01-03 07:07:39 +00:00
3fe437b24b [BE]: Update flake8 to v6.1.0 and fix lints (#116591)
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parenthesis i.e.
  - `assert(a == b)` -> `assert a == b`
  - `if(x > y or y < z):`->`if x > y or y < z:`
  - And `return('...')` -> `return '...'`

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-03 06:04:44 +00:00
09ee96b69d [MPS] Fix CrossEntropyLoss for float16 (#116597)
Looks like neither [`divisionNoNaNWithPrimaryTensor:`](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3675593-divisionnonanwithprimarytensor) nor `oneHotWithIndicesTensor:` works for `MPSDataTypeFloat16`, so provide an explicit cast for one-hot tensor and alternative implementation using the formula from the official doc, i.e.
> `resultTensor = select(secondaryTensor, primaryTensor / secondaryTensor, 0)`
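
In eager PyTorch terms, that formula is just a NaN-safe division; here is a minimal sketch with `torch.where` for illustration (the actual change is in the MPSGraph backend, not in Python):

```python
import torch

def division_no_nan(primary: torch.Tensor, secondary: torch.Tensor) -> torch.Tensor:
    # select(secondary, primary / secondary, 0): where the divisor is zero,
    # return 0 instead of inf/nan.
    return torch.where(secondary != 0, primary / secondary, torch.zeros_like(primary))
```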

Alas, at the moment  it can not be tested via `test_modules.py` as it runs only `torch.float32` and `torch.float64` tests (and `torch.half` implementation is not available for CPU)

Fixes https://github.com/pytorch/pytorch/issues/116095

TODO: Enable testing via TestModules, but will do in separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116597
Approved by: https://github.com/kulinseth
2024-01-03 05:58:26 +00:00
75359934bd [C10D] Improve Heartbeat Monitor exit logs (#116268) (#116661)
Summary:

- add workMetaList_.size() so we know how many outstanding works there
  were when killing
- Print our first log before debuginfo dump instead of after, since it
  is clearer when reading the logs that we time out and then dump
- Organize the log strings- put them near where they are used

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225

imported-using-ghimport

Test Plan: Imported from OSS

Reviewed By: fduwjj

Differential Revision: D52369167

Pulled By: wconstab

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116661
Approved by: https://github.com/fduwjj
2024-01-03 05:35:06 +00:00
1ae39a372e Inductor cpp wrapper: fix cumsum codegen (#116171)
Fixes https://github.com/pytorch/pytorch/issues/115829

For `cumsum(Tensor self, int dim, *, ScalarType? dtype=None) -> Tensor`, `dim` is not a `kwarg_only` argument, but it could be provided as a kwarg when calling this OP.
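
A tiny repro-style sketch of the call pattern involved, assuming cpp_wrapper mode is enabled through the inductor config (the flag name is an assumption here, not quoted from this commit):

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.cpp_wrapper = True  # assumed flag for cpp_wrapper mode

@torch.compile
def f(x):
    # dim is not kwarg-only in the schema, but passing it as a kwarg is legal
    # and must codegen correctly in cpp_wrapper mode.
    return torch.cumsum(x, dim=1)

out = f(torch.randn(4, 8))
```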

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116171
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2024-01-03 05:33:17 +00:00
ef98987017 Fix user input mutations for run_decompositions (#116382)
Fixes #115106

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116382
Approved by: https://github.com/angelayi
2024-01-03 05:04:22 +00:00
c5bd88b56a [export] Improve serialization of union types. (#116511)
Summary:
Making union types harder to use wrong:
1. Still initialize unset fields with None, but do not assert that exactly one field is not None, since it is possible to set a real field to None.
2. Raise an error on reading unset fields of a union, reducing the error surface and enforcing type safety.
3. Serialize a union type with only its tag and omit all unset fields; this makes the serialized model more readable and debuggable.
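
A minimal sketch of these rules with a hypothetical two-member union; only the tag and the single set field are emitted, and reading an unset field raises instead of silently returning None.

```python
class Argument:
    _members = ("as_tensor", "as_int")  # hypothetical union members

    def __init__(self, **kwargs):
        self._fields = {k: kwargs.get(k) for k in self._members}

    def __getattr__(self, key):
        fields = object.__getattribute__(self, "_fields")
        if key not in fields:
            raise AttributeError(key)
        if fields[key] is None:
            raise AttributeError(f"union field '{key}' is not set")  # rule 2
        return fields[key]

    def serialize(self):
        tag = next(k for k, v in self._fields.items() if v is not None)
        return {"$type": tag, tag: self._fields[tag]}  # rule 3: tag + set field only

arg = Argument(as_int=3)
print(arg.serialize())  # {'$type': 'as_int', 'as_int': 3}
# arg.as_tensor         # would raise: union field 'as_tensor' is not set
```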

Test Plan:
buck test mode/opt caffe2/test:test_export
buck test mode/opt executorch/exir/...
buck test mode/opt mode/inplace aps_models/ads/icvr/tests:export_test

Differential Revision: D52446586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116511
Approved by: https://github.com/angelayi
2024-01-03 04:58:59 +00:00
ca4df16fdd [c10d] Make DebugInfoWriter Singleton across all PG objects (#116489)
Previously the writer was registered on each NCCL PG (backend), so every PG had its own writer instance. If a customized writer is used while multiple sub-PGs exist, the user has to register the writer for every backend, which is bad UX. Furthermore, the debug info is global, so it does not make sense to have a writer per instance. We even keep a static mutex in `dumpDebuggingInfo` to serialize the writes, which makes it more obvious that the writer can be a singleton with one instance shared by all PGs.

Although the rationale is clear, the implementation may vary a lot, so this PR is an RFC for now to see whether this implementation makes sense.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116489
Approved by: https://github.com/kwen2501
2024-01-03 03:42:54 +00:00
41f265b06a [quant][pt2e] Preserve numeric_debug_handle in quantization flows (#116477)
Summary:
We introduced `node.meta["numeric_debug_handle"]` in https://github.com/pytorch/pytorch/pull/114315 to
indicate the numeric debug handle for values in the graph. In this PR we support preserving this field
in prepare and convert so that we can use it for numerical debugging.

Next: we also want to preserve these in deepcopy of GraphModule as well

Test Plan:
python test/test_quantization.py -k test_quantize_pt2e_preserve_handle

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116477
Approved by: https://github.com/tugsbayasgalan
2024-01-03 03:39:00 +00:00
f73b1b9388 [EZ] Update lxml dependency to 5.0.0 (#116657)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116657
Approved by: https://github.com/atalman
2024-01-03 02:57:31 +00:00
6e9ca2f220 Enable eye on CPU for bfloat16 dtype (#116616)
Fixes https://github.com/pytorch/pytorch/issues/116609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116616
Approved by: https://github.com/Skylion007
2024-01-03 02:53:27 +00:00
5005f36c12 Clean up files under fb/vulkan/... (#116665)
Remove files accidentally imported in https://github.com/pytorch/pytorch/pull/114712
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116665
Approved by: https://github.com/izaitsevfb, https://github.com/seemethere
2024-01-03 01:55:32 +00:00
3ac0aaf478 [codemod] markDynamoStrictTest torch_np/test_random (#116637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116637
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116632, #116634, #116635, #116636
2024-01-03 00:51:36 +00:00
884e449753 [codemod] markDynamoStrictTest torch_np/test_reductions (#116636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116636
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116632, #116634, #116635
2024-01-03 00:51:36 +00:00
8ec606d4c5 [codemod] markDynamoStrictTest torch_np/test_scalars_0D_arrays (#116635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116635
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116632, #116634
2024-01-03 00:51:36 +00:00
9b27fcf65a [codemod] markDynamoStrictTest torch_np/test_ufuncs_basic (#116634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116634
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116632
2024-01-03 00:51:36 +00:00
0ce32ce409 [codemod] markDynamoStrictTest torch_np/test_unary_ufuncs (#116632)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116632
Approved by: https://github.com/bdhirsh
2024-01-03 00:51:36 +00:00
a1191ce4bf optimize (u)int8 vectorized operator* (#116235)
Summary: optimize (u)int8 vectorized operator*

Test Plan: sandcastle github

Differential Revision: D52318192

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116235
Approved by: https://github.com/hl475, https://github.com/malfet
2024-01-03 00:50:23 +00:00
0f6f582c0d Add config to disable TransformerEncoder/MHA fastpath (#112212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112212
Approved by: https://github.com/jbschlosser
2024-01-02 23:59:30 +00:00
9dc68d1aa9 clangformat: fused adam (#116583)
Apply clangformat to fused adam/adamw files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116583
Approved by: https://github.com/janeyx99
2024-01-02 22:30:23 +00:00
3ff4572fe7 delete sharded tensor from fsdp/tp tests (#116244)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116244
Approved by: https://github.com/awgu, https://github.com/wz337, https://github.com/fduwjj
ghstack dependencies: #116122
2024-01-02 22:11:36 +00:00
dfccaac31b [2d] Ensure gradient clear out pending AsyncCollectiveTensor in FSDP Extension (#116122)
As titled, this PR adds a gradient hook to the FSDP DTensor extension to check whether any gradients are AsyncCollectiveTensors; if so, we eagerly wait on them there.

This is needed because a parameter's gradient might still be a pending AsyncCollectiveTensor; if we feed it to FSDP directly, FSDP would use the ACT's storage for reduce_scatter, which is wrong.
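
A rough sketch of the hook, assuming the `AsyncCollectiveTensor` type (and its `wait()` method) from `torch.distributed._functional_collectives`; the helper names are hypothetical, and the real logic lives inside the FSDP extension rather than plain parameter hooks.

```python
import torch
from torch.distributed._functional_collectives import AsyncCollectiveTensor

def _wait_if_async(grad: torch.Tensor) -> torch.Tensor:
    # Eagerly complete any pending collective so FSDP's reduce_scatter sees
    # real storage rather than a not-yet-materialized placeholder.
    if isinstance(grad, AsyncCollectiveTensor):
        return grad.wait()
    return grad

def _register_grad_wait_hooks(module: torch.nn.Module) -> None:
    for p in module.parameters():
        if p.requires_grad:
            p.register_hook(_wait_if_async)
```
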
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116122
Approved by: https://github.com/awgu, https://github.com/fduwjj
2024-01-02 22:11:36 +00:00
a2061ceefe ci: Output runner OS / HW for macOS (#116627)
It's difficult to debug these failures since we have no record of the OS / HW
that CI is running on, so output it to give us a better understanding here.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116627
Approved by: https://github.com/janeyx99
2024-01-02 22:05:53 +00:00
640d46f823 [inductor] Control the cpp_wrapper mode with an env variable (#116615)
Summary: also add one model test for the cpp_wrapper mode on CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116615
Approved by: https://github.com/angelayi
2024-01-02 21:50:25 +00:00
295bdaafb7 [codemod] markDynamoStrictTest test_module_init (#116625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116625
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116618, #116619, #116621, #116622, #116624
2024-01-02 20:55:48 +00:00
074dfc2648 [codemod] markDynamoStrictTest test_linalg (#116624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116624
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116618, #116619, #116621, #116622
2024-01-02 20:55:48 +00:00
5d8e066f6b [codemod] markDynamoStrictTest test_indexing (#116622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116622
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116618, #116619, #116621
2024-01-02 20:55:39 +00:00
fc7546e9db [codemod] markDynamoStrictTest test_functional_optim (#116621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116621
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116618, #116619
2024-01-02 20:55:31 +00:00
88d1638139 [codemod] markDynamoStrictTest test_autograd_fallback (#116619)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116619
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116618
2024-01-02 20:55:21 +00:00
39339df8d7 [codemod] markDynamoStrictTest test_autocast (#116618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116618
Approved by: https://github.com/bdhirsh
2024-01-02 20:54:24 +00:00
0bc21c6a6b [C10d] Fix Log Prefix in NCCLPG so that each instance gets its own prefix (#116520)
Somehow the log prefix only ever shows ProcessGroup 0 and the rank [global rank]. This does not give the expected result, since per the comment it should be "a prefix that is unique to this process group and rank". This PR fixes it so the prefix differs across sub-PGs.

The reason is that the prefix is declared static and is therefore shared across all NCCLPG instances, so whichever instance calls this function first bakes its `rank_` and `uid_` into the prefix. We always initialize PG 0 first, which is why we always see PG[0] + global ranks for all sub-PGs.

<img width="484" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/7fbb0226-7e25-4306-9cee-22e17b00bc8e">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116520
Approved by: https://github.com/wconstab
ghstack dependencies: #116218
2024-01-02 20:23:58 +00:00
6d8d3c1334 add a DTensor test for weight tying (#116475)
Weight tying is useful when we'd like to share weights (and their gradients) between two modules, e.g. the word/token embedding module and the output linear module in language models. This test demonstrates that with DTensor it can be achieved just as with normal tensor, e.g. using `model.fc.weight = model.embedding.weight`.

To test: `python test/distributed/tensor/parallel/test_tp_examples.py -k test_weight_tying`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116475
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2024-01-02 20:19:36 +00:00
fb5a9f2f5c Fix implicit conversion to double (#116614)
Summary:
Forward fix for https://github.com/pytorch/pytorch/pull/116185 / D52390113

Error:
```
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:602:23: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]             std::ceil(num_elements / static_cast<double>(_max_load_factor))));
[CONTEXT]                       ^~~~~~~~~~~~ ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:923:22: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]         num_elements + 1 >
[CONTEXT]         ~~~~~~~~~~~~~^~~ ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:924:34: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]             (num_slots_minus_one + 1) * static_cast<double>(_max_load_factor)) {
[CONTEXT]              ~~~~~~~~~~~~~~~~~~~~^~~  ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:923:22: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]         num_elements + 1 >
[CONTEXT]         ~~~~~~~~~~~~~^~~ ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:924:34: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]             (num_slots_minus_one + 1) * static_cast<double>(_max_load_factor)) {
[CONTEXT]              ~~~~~~~~~~~~~~~~~~~~^~~  ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:923:22: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]         num_elements + 1 >
[CONTEXT]         ~~~~~~~~~~~~~^~~ ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:924:34: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]             (num_slots_minus_one + 1) * static_cast<double>(_max_load_factor)) {
```

Fixed by casting int parts to double explicitly.

Test Plan: SC

Differential Revision: D52482968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116614
Approved by: https://github.com/jeanschmidt, https://github.com/seemethere
2024-01-02 20:08:51 +00:00
77d979f748 Autograd attaches logging hooks only in debug level (#116522)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116522
Approved by: https://github.com/albanD
2024-01-02 20:06:18 +00:00
b18d8d4595 Add a wrapper to transform a NumPy function into a PyTorch function (#114610)
A less general version of this wrapper was used in the keynote on
`torch.compile(numpy)`. We expose a generic version of the wrapper
that works seamlessly with `torch.compile`.
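
A minimal usage sketch, assuming the wrapper is exposed as `torch.compiler.wrap_numpy` (the commit text does not name the API, so treat the name as an assumption); the decorated body is ordinary NumPy code, while tensors go in and come out.

```python
import numpy as np
import torch

@torch.compile
@torch.compiler.wrap_numpy
def numpy_fn(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    return np.sum(x * y, axis=1)

x = torch.randn(8, 4)
out = numpy_fn(x, x)  # returns a torch.Tensor; the NumPy body is traced into the graph
```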

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114610
Approved by: https://github.com/albanD
2024-01-02 18:35:29 +00:00
be455921f5 Fix missing words in README.md (#116606)
minor fix to wording

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116606
Approved by: https://github.com/Skylion007
2024-01-02 18:24:58 +00:00
95a86ed9ca [Quant] Add int8 linear op gelu for quantization PT2E with Inductor. input is an int8 CPU tensor; weight is an int8 MkldnnCPU tensor (#114852)
**Summary**
Enable Int8 Linear Gelu post operator fusions for Stock PyTorch Inductor. The input is an int8 CPU tensor and weight is an int8 MkldnnCPU tensor.

**Test plan**
python test/test_quantization.py -k test_qlinear_gelu_pt2e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114852
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-01-02 08:11:26 +00:00
a81edf9f23 [inductor] Fix cpp_wrapper codegen for ir.ComplexView (#116481)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116481
Approved by: https://github.com/htyu
2024-01-02 05:38:58 +00:00
cyy
b0629cdd67 [13/N] Enable clang-tidy on headers of torch/csrc (#116560)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116560
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-01-02 05:33:04 +00:00
1ed8efa9b3 [MPS] Speedup addmm (#116548)
- Do not copy bias to output
- Skip respective multiplication op if either alpha or beta are equal to 1.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116548
Approved by: https://github.com/albanD
ghstack dependencies: #116547
2024-01-02 00:43:37 +00:00
abd80cbb15 [Inductor] Decompose bmm if batch2's last dim size is 1 and coordinate_descent_tuning is enabled (#116582)
We found this perf optimization opportunity at https://github.com/pytorch-labs/gpt-fast/pull/71. This would bring 5%+ perf gain for Mixtral 8x7B on gpt-fast.
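
For reference, a small sketch of the algebraic identity behind such a decomposition: when `batch2` has shape `(B, N, 1)`, the bmm reduces to a broadcasted multiply plus a sum over the last dimension (this illustrates the identity only, not the actual inductor decomposition).

```python
import torch

B, M, N = 8, 16, 32
a = torch.randn(B, M, N)
b = torch.randn(B, N, 1)  # last dim of batch2 is 1

ref = torch.bmm(a, b)                                       # (B, M, 1)
dec = (a * b.transpose(-2, -1)).sum(dim=-1, keepdim=True)   # same result, no matmul

torch.testing.assert_close(ref, dec)
```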

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116582
Approved by: https://github.com/lezcano
2024-01-01 21:24:02 +00:00
4ffe1fb7f4 [BE]: Improve typing to respect ruff PYI058 (#116588)
Tried out rule PYI058 and it flagged one typing recommendation in our codebase that would be better to fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116588
Approved by: https://github.com/malfet, https://github.com/kit1980
2024-01-01 20:49:55 +00:00
cf618452d3 [BE]: Fix F821 error in torch/fx/experimental (#116587)
Fix F821 error in torch/fx/experimental. Fixes a bug I did not fix in #116579
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116587
Approved by: https://github.com/kit1980
2024-01-01 19:45:49 +00:00
035e55822a vulkan: fix gcc build errors (#115976)
Fixes #96617

There was already an attempt to fix this build issue - see #96618. One commit is reused from this attempt (@zboszor) with adjustments to commit message. Another one differs and takes into account provided review feedback (@ezyang).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115976
Approved by: https://github.com/ezyang
2024-01-01 11:10:42 +00:00
4451ca068c [xla hash update] update the pinned xla hash (#116388)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116388
Approved by: https://github.com/pytorchbot
2024-01-01 10:30:59 +00:00
bd10fea79a [BE]: Enable F821 and fix bugs (#116579)
Fixes #112371

I tried to fix as many of the bugs as I could, a few I could not figure out what the proper fix for them was though and so I left them with noqas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
6c02520466 Remove unneeded comment and link for BuildExtension (#115496)
`BuildExtension` is no longer derived from object, but from `build_ext`. Py2 is also deprecated, so this comment wouldn't be required anyways

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115496
Approved by: https://github.com/Skylion007
2024-01-01 08:29:48 +00:00
db752f2f1a Pin the version of expecttest to 0.1.6 in requirements.txt (#116238)
Version 0.2.0 of expecttest removed the `ACCEPT` variable in this [PR](https://github.com/ezyang/expecttest/pull/11), so when someone installs Python dependencies using `pip install -r PyTorch_Root/requirements.txt`, the latest version of expecttest is installed, which causes failures in some PyTorch tests. So pinning the version of expecttest to 0.1.6, like [this](db35ccf463/.ci/docker/requirements-ci.txt (L28)), is needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116238
Approved by: https://github.com/ezyang
2024-01-01 05:25:39 +00:00
60844ccc4f [MPS][BE] Refactor common code (#116566)
Introduce `mtl_setBuffer` and `mps_dispatch1DJob` and use them to bind a
Tensor to a Metal kernel as well as dispatch the Metal job.

This avoids potential typos/bugs when one tries to bind a tensor to a
Metal kernel but forgets about the storage offset.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116566
Approved by: https://github.com/Skylion007
2024-01-01 04:58:18 +00:00
aec4377257 Optimize batch_norm_cpu_collect_stats_channels_last_impl when N <= num_threads (#113619)
Currently `batch_norm_cpu_collect_stats_channels_last_impl` uses a two-path reduction to vertically reduce from shape `{NHW, C}` to `{C}`. The first path reduces `{NHW, C}` to an intermediate buffer `{num_threads, C}`; the second path reduces `{num_threads, C}` to `{C}`.

Optimization is as follows:
- Add if/else path.

1. if `NHW > num_threads`, do the two-path reduction.
2. else `NHW <= num_threads`, do single-path reduction -- `NHW` is small enough that there is no need to first reduce to intermediate buffer.

- Moreover, when `NHW <= num_threads`, one of two methods is used, Method 1 or Method 2.
[Method 1](https://github.com/pytorch/pytorch/pull/113619/files#diff-e39a21a7125ac201b766a585b57ebf8429a7ac28cd723b09930aceb198fd25b0R372-R397): parallel on C, vertical reduce `{NHW, C} => {C}`
[Method 2](https://github.com/pytorch/pytorch/pull/113619/files#diff-e39a21a7125ac201b766a585b57ebf8429a7ac28cd723b09930aceb198fd25b0R325-R370): parallel on tiles of C, vectorized vertical reduce on each tile `{NHW, TILE_SIZE} => {TILE_SIZE}`

1. if `(num_threads == 1 || (C <= TILE_SIZE || C > THRESHOLD))`, use Method 2.
2. else, use Method 1.

- When `num_threads == 1`, there is no thread synchronization overhead, so it is better to use Method 2 than Method 1.
- When `C > THRESHOLD`, C is large enough that the benefit from tiling and vectorization outweigh the synchronization overhead.
- When `C <= TILE_SIZE`, the problem size is small enough (`C <= TILE_SIZE && NHW <= num_threads`) that it's better to launch single thread with vectorization than C threads without vectorization.
- `TILE_SIZE` is set to `16`.
- `THRESHOLD` is set to `2048`, it is an empirically found threshold to tile on C or not.

See comments for details.
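
A compact Python illustration of the dispatch heuristic above (the real kernel is C++ with explicit threading and vectorization, which is not modeled here):

```python
import torch

TILE_SIZE = 16
THRESHOLD = 2048

def collect_stats_channels_last(x: torch.Tensor, num_threads: int) -> torch.Tensor:
    # x is the channels-last input viewed as {NHW, C}; only the per-channel
    # sum is shown, to illustrate which reduction strategy is chosen.
    NHW, C = x.shape
    if NHW > num_threads:
        # Two-path reduction: {NHW, C} -> {num_threads, C} -> {C}.
        partial = torch.stack([c.sum(dim=0) for c in x.chunk(num_threads, dim=0)])
        return partial.sum(dim=0)
    if num_threads == 1 or C <= TILE_SIZE or C > THRESHOLD:
        # Method 2: tile C and vertically reduce each {NHW, TILE_SIZE} tile.
        return torch.cat([t.sum(dim=0) for t in x.split(TILE_SIZE, dim=1)])
    # Method 1: single vertical reduction, parallelized over C in the real kernel.
    return x.sum(dim=0)
```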

### Performance

Perf data collected for C in the range [2^1, 2^20], with (N,H,W) = (1,2,14) for all values of C. The values (N,H,W) = (1,2,14) were arbitrarily chosen to satisfy the condition NHW <= num_threads = 28.
Tested on 28 physical cores/socket, 1 socket on Skylake.

| **(N, H, W) = (1, 2, 14)** 	|                                                            	|               	|                                        	|
|----------------------------	|------------------------------------------------------------	|---------------	|----------------------------------------	|
|                            	| **Avg Latency (ms)**                                       	|               	|                                        	|
| **n_channel**              	| **Baseline (original implementation): two-path reduction** 	| **Optimized** 	| **Speedup Ratio (Baseline/Optimized)** 	|
| 1048576                    	| 13.67034435                                                	| 3.059654236   	| 4.467937649                            	|
| 524288                     	| 5.230793953                                                	| 0.840408802   	| 6.224106578                            	|
| 262144                     	| 2.131233215                                                	| 0.353398323   	| 6.030682876                            	|
| 131072                     	| 0.990390778                                                	| 0.213630199   	| 4.636005491                            	|
| 65536                      	| 0.422859192                                                	| 0.107388496   	| 3.937658186                            	|
| 32768                      	| 0.224406719                                                	| 0.075747967   	| 2.962544459                            	|
| 16384                      	| 0.143647194                                                	| 0.049884319   	| 2.879606175                            	|
| 8192                       	| 0.10917902                                                 	| 0.031619072   	| 3.452948273                            	|
| 4096                       	| 0.08869648                                                 	| 0.024063587   	| 3.685920935                            	|
| 2048                       	| 0.075721741                                                	| 0.022127628   	| 3.422045038                            	|
| 1024                       	| 0.06685257                                                 	| 0.018239021   	| 3.665359477                            	|
| 512                        	| 0.051283836                                                	| 0.017580986   	| 2.917005696                            	|
| 256                        	| 0.043172836                                                	| 0.020868778   	| 2.06877642                             	|
| 128                        	| 0.042669773                                                	| 0.018148422   	| 2.351156069                            	|
| 64                         	| 0.038774014                                                	| 0.015704632   	| 2.468954                               	|
| 32                         	| 0.038630962                                                	| 0.013871193   	| 2.784977656                            	|
| 16                         	| 0.027766228                                                	| 0.008444786   	| 3.287972897                            	|
| 8                          	| 0.019891262                                                	| 0.007579327   	| 2.624410192                            	|
| 4                          	| 0.018217564                                                	| 0.008151531   	| 2.234863995                            	|
| 2                          	| 0.017716885                                                	| 0.008127689   	| 2.179818128                            	|

### Single Thread Performance
Perf data collected for C in range [2^1, 2^20], and (N,H,W) = (1,1,1) for all values of C. Values of (N,H,W)=(1,1,1) were chosen to satisfy the condition NHW <= num_threads = 1 for single thread performance.
Tested on 1 physical core/socket, 1 socket on Skylake.

| **(N, H, W) = (1, 1, 1)** 	|                                                            	|               	|                                        	|
|---------------------------	|------------------------------------------------------------	|---------------	|----------------------------------------	|
|                           	| **Avg Latency (ms)**                                       	|               	|                                        	|
| **n_channel**             	| **Baseline (original implementation): two-path reduction** 	| **Optimized** 	| **Speedup Ratio (Baseline/Optimized)** 	|
| 1048576                   	| 10.97419                                                   	| 8.390961      	| 1.307859                               	|
| 524288                    	| 4.860618                                                   	| 4.128075      	| 1.177454                               	|
| 262144                    	| 2.782302                                                   	| 1.981447      	| 1.404177                               	|
| 131072                    	| 2.105565                                                   	| 1.073592      	| 1.961234                               	|
| 65536                     	| 0.857651                                                   	| 0.523462      	| 1.63842                                	|
| 32768                     	| 0.309389                                                   	| 0.247979      	| 1.24764                                	|
| 16384                     	| 0.13869                                                    	| 0.098376      	| 1.409796                               	|
| 8192                      	| 0.072258                                                   	| 0.050876      	| 1.420263                               	|
| 4096                      	| 0.038414                                                   	| 0.027308      	| 1.40667                                	|
| 2048                      	| 0.021684                                                   	| 0.015688      	| 1.382219                               	|
| 1024                      	| 0.013294                                                   	| 0.009842      	| 1.350775                               	|
| 512                       	| 0.008659                                                   	| 0.006645      	| 1.303193                               	|
| 256                       	| 0.006964                                                   	| 0.005393      	| 1.291335                               	|
| 128                       	| 0.005918                                                   	| 0.00464       	| 1.275437                               	|
| 64                        	| 0.005324                                                   	| 0.004292      	| 1.240556                               	|
| 32                        	| 0.004981                                                   	| 0.004163      	| 1.196449                               	|
| 16                        	| 0.004833                                                   	| 0.003943      	| 1.225514                               	|
| 8                         	| 0.004768                                                   	| 0.003896      	| 1.22399                                	|
| 4                         	| 0.004828                                                   	| 0.003955      	| 1.220615                               	|
| 2                         	| 0.004776                                                   	| 0.003934      	| 1.213939                               	|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113619
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-01 04:43:42 +00:00
fc5fda14bc Try creating a bf16 tensor as a last resort of is_bf16_supported(). (#115924)
Fix: #115900 https://github.com/pytorch/xla/issues/6085

This PR adds a last resort for testing for BF16 support on CUDA. This is necessary on GPUs
such as RTX 2060, where `torch.cuda.is_bf16_supported()` returns False, but we can
successfully create a BF16 tensor on CUDA.

Before this PR:

```python
>>> torch.cuda.is_bf16_supported()
False
>>> torch.tensor([1.], dtype=torch.bfloat16, device="cuda")
tensor([...], device='cuda:0', dtype=torch.bfloat16)
```

After this PR:

```python
>>> torch.cuda.is_bf16_supported()
True
>>> torch.tensor([1.], dtype=torch.bfloat16, device="cuda")
tensor([...], device='cuda:0', dtype=torch.bfloat16)
```
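
A minimal sketch of what such a last-resort check can look like (not the exact code in the PR): if the existing heuristics say no, try allocating a small bf16 tensor and let success override the answer.

```python
import torch

def bf16_supported_fallback() -> bool:
    if not torch.cuda.is_available():
        return False
    try:
        torch.tensor([1.0], dtype=torch.bfloat16, device="cuda")
        return True
    except RuntimeError:
        return False
```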

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115924
Approved by: https://github.com/jansel
2024-01-01 01:15:30 +00:00
127812efee [BE]: Further improve pathlib checks in torch serialization (#116577)
Follow up #116564. `os.path` functions can accept an os.PathLike object too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116577
Approved by: https://github.com/malfet
2023-12-31 20:24:40 +00:00
4bfaa6bc25 [MPS] Fix addmm (#116547)
Remove the weird logic that designates matrices as transposed if their sizes match (which is always true when square matrices are multiplied with each other), which resulted in `torch.addmm` returning a transposed matrix compared to `torch.mm`, see below:
```
% python -c "import torch;torch.set_default_device('mps');a=torch.eye(2);b=torch.arange(4.0).reshape(2, 2);print(a@b);print(torch.addmm(torch.zeros(2, 2), a,b))"
tensor([[0., 1.],
        [2., 3.]], device='mps:0')
tensor([[0., 2.],
        [1., 3.]], device='mps:0')
```

The fixes introduced to `torch.mm` in https://github.com/pytorch/pytorch/pull/77462 suggest that this is not needed.

Modify `sample_inputs_addmm` to test `torch.addmm` with square matrices, but skip this config for `test_autograd_dense_output_addmm`, see https://github.com/pytorch/pytorch/issues/116565

TODO: probably tweak tolerances, as `test_output_match_addmm_cpu_float16` fails with 2x2 matrices, but passes using 3x3 ones with errors slightly exceeding the tolerance

Fixes https://github.com/pytorch/pytorch/issues/116331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116547
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-12-31 02:28:59 +00:00
aef06c316b [BE]: Add better handling of pathlib.Path with os calls (#116564)
Builds on #116562 to the rest of the instances of pathlib in the PyTorch.
* Uses more generic `os.PathLike` and `os.fspath` calls where appropriate
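
As a small illustration of the pattern (not the exact torch.serialization code), a function can accept any path-like object and normalize it with `os.fspath`:

```python
import os
from typing import Union

FileLike = Union[str, os.PathLike]

def open_for_write(f: FileLike):
    # os.fspath handles str, bytes, pathlib.Path, or any object with __fspath__.
    return open(os.fspath(f), "wb")
```
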
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116564
Approved by: https://github.com/malfet
2023-12-31 01:46:03 +00:00
86cd6655a1 [BE]: Use exist_ok arg for os.makedirs calls (#116561)
Optimize os.makedirs calls to use exist_ok parameter when possible to avoid unnecessary checks.
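
The change amounts to replacing the check-then-create pattern with the built-in flag, roughly:

```python
import os

cache_dir = "/tmp/example_cache"  # hypothetical path

# Before: racy check-then-create.
# if not os.path.exists(cache_dir):
#     os.makedirs(cache_dir)

# After: a single call that tolerates an existing directory.
os.makedirs(cache_dir, exist_ok=True)
```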

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116561
Approved by: https://github.com/malfet
2023-12-30 21:12:53 +00:00
4f9858a902 [BE]: Use os.fspath and os.PathLike in torch serialization (#116562)
Use the proper `os.fspath` call to convert an `os.PathLike` object to a path.
Replace `pathlib.Path` with `os.PathLike`, which is more generic and more correct for typing; `pathlib.Path` is an instance of `os.PathLike`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116562
Approved by: https://github.com/malfet
2023-12-30 20:53:10 +00:00
cyy
5aa258eb09 [12/N] Apply clang-tidy and fix warnings in headers of torch/csrc (#116486)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116486
Approved by: https://github.com/albanD
2023-12-30 18:38:53 +00:00
cyy
37aae5932c [11/N] Enable clang-tidy warnings on c10/util/*.h (#116353)
This PR enables clang-tidy coverage on c10/util/*.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116353
Approved by: https://github.com/albanD
2023-12-30 14:38:39 +00:00
97891b184c [Dynamo] Trace autograd.function in dynamo when inputs require grad (#116358)
For training graphs (when inputs require grad), previously we would speculate the forward and backward graphs to determine if there are any graph breaks, side effects, etc., but would not actually use these speculated graphs. We would just insert a call_function node into the graph and later rely on autograd's tracing.

This approach does not work for more generalized graphs, such as graphs that include user-defined triton kernels, because autograd is not able to do the higher-order function conversion.

This PR speculates the forward and backward functions and emits them in a HOF that later gets used via templating mechanism.

While working on this PR, I have exposed some bugs in the current tracing due to trampoline functions losing the source information resulting in incorrect graphs being produced. I have fixed these source information bugs and killed the trampolines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116358
Approved by: https://github.com/jansel
2023-12-30 01:51:30 +00:00
c5d9173d04 [BE]: Enable readability-redundant-function-ptr-dereference check (#116538)
Enable an additional clang-tidy check to remove redundant function ptr dereferences to help make the code more readable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116538
Approved by: https://github.com/malfet
2023-12-30 01:15:35 +00:00
5e58be678c Make collect env BC compatible (#116532)
To avoid errors like the one in https://github.com/pytorch/pytorch/issues/116531 when the user tries to run collect_env
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116532
Approved by: https://github.com/malfet
2023-12-30 01:13:37 +00:00
bd7d26bb96 [CI] Fix docker builds (#116549)
By pinning lxml to 4.9.4 as 5.0.0 is missing Python-3.9 binaries, see https://pypi.org/project/lxml/5.0.0/#files
<img width="568" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/fbd64512-b788-4bf6-9c1f-084dcedfd082">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116549
Approved by: https://github.com/houseroad, https://github.com/aakhundov
2023-12-30 00:38:14 +00:00
961fbbe967 [CI] Add initial ci build test for XPU (#116100)
Add initial CI build test for XPU, which will be triggered by label `ciflow/xpu` for current stage.

Works for RFC #114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116100
Approved by: https://github.com/EikanWang, https://github.com/huydhn, https://github.com/atalman
2023-12-29 23:44:46 +00:00
de4d48df34 [c10d] Fix timeout dump path write path overlap when there are multiple PGs (#116218)
Basically we observed that if there are multiple PGs and the timeout happens on one of the sub-PGs, we somehow use the local rank in the dump file name. We realize that:
1. For setting the timeout signal in the store, any watchdog thread from any PG can do it.
2. For checking and dumping, only the watchdog thread of the default PG is needed, since we always create that PG and it contains all ranks (so there are no file name conflicts), and both the store signal and the debug-info dump are global.
3. Since the dump is global, we want to avoid sub-PG ranks polluting logs from global ranks (local rank 0 vs global rank 0), so we use global ranks here to initialize the debug info writer. (Down the road, we are thinking about making it a singleton so that users only register it once in the multi-PG case.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116218
Approved by: https://github.com/wconstab
2023-12-29 21:58:25 +00:00
db2b4078b9 Add missing cstdint includes (#116458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116458
Approved by: https://github.com/Skylion007
2023-12-29 18:30:26 +00:00
wgb
71ec3edbf7 Enhance Opinfo to support privateuse1 (#116417)
Fix the issue where OpInfo does not support third-party devices when the current test-framework instantiation method is privateuse1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116417
Approved by: https://github.com/albanD
2023-12-29 13:43:29 +00:00
e01e00fba8 fix code spell (#116530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116530
Approved by: https://github.com/albanD
2023-12-29 12:58:38 +00:00
afadfa0175 [c10d] Add stream info during nccl comm abort call (#116076)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116076
Approved by: https://github.com/XilunWu
2023-12-29 06:58:26 +00:00
e8a9d088c6 [DevX] Add tool and doc on partial debug builds (#116521)
Turned command sequence mentioned in https://dev-discuss.pytorch.org/t/how-to-get-a-fast-debug-build/1597 and in various discussions into a tool that I use almost daily to debug crashes or correctness issues in the codebase

Essentially it allows one to turn this:
```
Process 87729 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x00000001023d55a8 libtorch_python.dylib`at::indexing::impl::applySelect(at::Tensor const&, long long, c10::SymInt, long long, c10::Device const&, std::__1::optional<c10::ArrayRef<c10::SymInt>> const&)
libtorch_python.dylib`at::indexing::impl::applySelect:
->  0x1023d55a8 <+0>:  sub    sp, sp, #0xd0
    0x1023d55ac <+4>:  stp    x24, x23, [sp, #0x90]
    0x1023d55b0 <+8>:  stp    x22, x21, [sp, #0xa0]
    0x1023d55b4 <+12>: stp    x20, x19, [sp, #0xb0]
```
into this
```
Process 87741 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x00000001024e2628 libtorch_python.dylib`at::indexing::impl::applySelect(self=0x00000001004ee8a8, dim=0, index=(data_ = 3), real_dim=0, (null)=0x000000016fdfe535, self_sizes= Has Value=true ) at TensorIndexing.h:239:7
   236 	    const at::Device& /*self_device*/,
   237 	    const c10::optional<SymIntArrayRef>& self_sizes) {
   238 	  // See NOTE [nested tensor size for indexing]
-> 239 	  if (self_sizes.has_value()) {
   240 	    auto maybe_index = index.maybe_as_int();
   241 	    if (maybe_index.has_value()) {
   242 	      TORCH_CHECK_INDEX(
```
while retaining good performance for the rest of the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116521
Approved by: https://github.com/atalman
2023-12-29 05:15:35 +00:00
df85a920cf [Inductor][Observability] Add logging for split cat pass (#116442)
Summary: Add logs for both in the pre and post grad passes

Test Plan:
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```
[2023-12-26 17:14:24,203] [0/0] torch._inductor.fx_passes.post_grad: [INFO] counters of inductor dict after apply the split cat in the post grad pass: Counter({'pattern_matcher_nodes': 4076, 'pattern_matcher_count': 2917, 'remove_split_with_size_one': 1322, 'split_cat_norm': 461, 'consecutive_split_merged': 371, 'scmerge_cat_removed': 41, 'scmerge_cat_added': 32, 'scmerge_split_removed': 28, 'getitem_cat_merged': 11, 'batch_fusion': 7, 'scmerge_split_sections_removed': 3, 'scmerge_split_added': 2, 'split_squeeze_replaced': 2})

[2023-12-26 17:16:28,437] torch._inductor.fx_passes.post_grad: [INFO] counters of inductor dict after apply the split cat in the post grad pass: Counter({'pattern_matcher_nodes': 4122, 'pattern_matcher_count': 2935, 'remove_split_with_size_one': 1322, 'split_cat_norm': 461, 'consecutive_split_merged': 371, 'scmerge_cat_removed': 41, 'batch_fusion': 39, 'scmerge_cat_added': 32, 'scmerge_split_removed': 28, 'getitem_cat_merged': 11, 'scmerge_split_sections_removed': 3, 'scmerge_split_added': 2, 'split_squeeze_replaced': 2})

Differential Revision: D52425400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116442
Approved by: https://github.com/yanboliang
2023-12-29 05:10:45 +00:00
8deaa13417 [EZ][Distributed] Add 'c10d' to distributed TORCH_LOG comment (#116526)
Address the comment in https://github.com/pytorch/pytorch/pull/116434, which I confused in the first beginning. Let's add c10d to the comment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116526
Approved by: https://github.com/XilunWu
2023-12-29 04:40:37 +00:00
ef94499ad7 [executorch hash update] update the pinned executorch hash (#116474)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116474
Approved by: https://github.com/pytorchbot
2023-12-29 03:13:51 +00:00
240121587a [vision hash update] update the pinned vision hash (#116524)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116524
Approved by: https://github.com/pytorchbot
2023-12-29 03:08:39 +00:00
cab79ceb51 [Inductor Intel GPU backend Upstream] Step 2: Register and add Intel GPU Inductor backend (#116330)
Right after the first PR https://github.com/pytorch/pytorch/pull/116020, this PR focuses on generalizing device-biased runtime code used in the basic workflow, including triton kernel generation, codecache, and autotuning.

 Feature request: #114856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116330
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/desertfire
2023-12-29 02:49:37 +00:00
8173d98c57 [quant][be] Skip conv-bn folding when there are no batchnorm ops (#116440)
Summary:
`_fold_conv_bn_qat` is currently taking a long time, so skip it when it is not necessary.
Follow-up fixes can actually reduce the patterns or cache them where possible.
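
A minimal sketch of the early-exit idea, using a loose string match on the node target (the real pattern set and matching are more precise):

```python
import torch

def _has_batch_norm(gm: torch.fx.GraphModule) -> bool:
    return any(
        n.op == "call_function" and "batch_norm" in str(n.target)
        for n in gm.graph.nodes
    )

def _fold_conv_bn_qat_sketch(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    if not _has_batch_norm(gm):
        return gm  # nothing to fold; skip the expensive pattern matching
    ...  # run the (slow) conv-bn folding patterns
    return gm
```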

Test Plan:
uncomment the print in `test_speed`, run

python test/test_quantization.py -k test_speed

and make sure the convert time is low, e.g. 0.1s instead of 8-9 seconds

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116440
Approved by: https://github.com/andrewor14
2023-12-28 23:33:21 +00:00
33917150d3 Cleanup scope ref properly (#116169)
Fixes https://github.com/pytorch/pytorch/issues/116143

See test in PR for a case where this happens. Discovered while debugging optimizers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116169
Approved by: https://github.com/janeyx99, https://github.com/williamwen42, https://github.com/jansel
2023-12-28 23:29:37 +00:00
4371939751 Removing HTA documentation (#116513)
Removing HTA documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116513
Approved by: https://github.com/aaronenyeshi, https://github.com/malfet, https://github.com/atalman
2023-12-28 23:04:23 +00:00
8220d5c66d Support pathlib.Path as input to torch.load when mmap=True (#116104)
Fixes #116103

This now works:

```py
import torch
from pathlib import Path

file = Path("example.pt")
torch.save(torch.rand(5, 3), file)
torch.load(file, mmap=True)   # works!
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116104
Approved by: https://github.com/mikaylagawarecki
2023-12-28 22:54:11 +00:00
02e2158e75 Fix for out of bounds read in mobile interpreter INTERFACE_CALL opcode handler (#110301)
Summary:
The INTERFACE_CALL opcode for the mobile TorchScript interpreter contained an out of bounds read issue leading to memory corruption.

This change adds an explicit check that the number of inputs passed to the format method called when handling the INTERFACE_CALL opcode is a valid and within bounds of the stack.

Test Plan: contbuild + OSS signals

Differential Revision: D49739450

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110301
Approved by: https://github.com/dbort
2023-12-28 22:09:03 +00:00
7e12e722af [Dynamo][12/N] Remove allowed_functions.py (#116401)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116401
Approved by: https://github.com/angelayi
2023-12-28 21:26:06 +00:00
439f2a6c1f [RelEng] Missing signal for release branches (#116516)
Run slow/periodic and inductor workflows on push to release branches

Right now there is no signal from those jobs on release branches at all.
This will run periodic jobs on every commit to the release branch, which is fine, as they are short-lived and have much lower traffic than regular jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116516
Approved by: https://github.com/clee2000
2023-12-28 20:19:55 +00:00
4af1c27fa8 Migrate repr, deterministic state_dict test to OptimizerInfo (#116496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116496
Approved by: https://github.com/albanD
ghstack dependencies: #116471
2023-12-28 19:49:04 +00:00
f3c4395358 [BE] Add helper in common_optimizers to get all optim inputs (#116471)
This will be a common utility in test_optim.py. Printing out the optimizer inputs when using this helper looks reasonable:

For local test plan, click below.
<details>

```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d186986c)]$ python test/test_optim.py -vv -k test_step_is_noop_when_params_have_no_grad
test_step_is_noop_when_params_have_no_grad_ASGD_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.02, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': False}, desc=t0
params=None, kwargs={'t0': 100, 'foreach': True, 'differentiable': False}, desc=t0 & foreach
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': True}, desc=t0 & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adadelta_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=rho
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=rho & foreach
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=rho & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adagrad_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=initial_accumulator_value
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=initial_accumulator_value & foreach
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=initial_accumulator_value & differentiable
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=lr_decay
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=lr_decay & foreach
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=lr_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_AdamW_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True}, desc=amsgrad & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True}, desc=amsgrad & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adamax_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_LBFGS_cpu_float32 (__main__.TestOptimRenewedCPU) ... ok
test_step_is_noop_when_params_have_no_grad_NAdam_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=non-zero momentum_decay
params=None, kwargs={'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=non-zero momentum_decay & foreach
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=non-zero momentum_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RAdam_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': False}, desc=non-default eps
params=None, kwargs={'eps': 1e-06, 'foreach': True, 'differentiable': False}, desc=non-default eps & foreach
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': True}, desc=non-default eps & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RMSprop_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': False}, desc=centered
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': True, 'differentiable': False}, desc=centered & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': True}, desc=centered & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Rprop_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.0002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': False}, desc=non-default etas
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': True, 'differentiable': False}, desc=non-default etas & foreach
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': True}, desc=non-default etas & differentiable
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': False}, desc=non-default step_sizes
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': True, 'differentiable': False}, desc=non-default step_sizes & foreach
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': True}, desc=non-default step_sizes & differentiable
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SGD_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': False}, desc=dampening
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': True, 'differentiable': False}, desc=dampening & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': True}, desc=dampening & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=non-zero weight_decay
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=non-zero weight_decay & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=non-zero weight_decay & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nesterov
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nesterov & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nesterov & differentiable
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SparseAdam_cpu_float32 (__main__.TestOptimRenewedCPU) ... ok
test_step_is_noop_when_params_have_no_grad_ASGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.02, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': False}, desc=t0
params=None, kwargs={'t0': 100, 'foreach': True, 'differentiable': False}, desc=t0 & foreach
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': True}, desc=t0 & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adadelta_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=rho
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=rho & foreach
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=rho & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adagrad_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=initial_accumulator_value
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=initial_accumulator_value & foreach
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=initial_accumulator_value & differentiable
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=lr_decay
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=lr_decay & foreach
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=lr_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
ok
test_step_is_noop_when_params_have_no_grad_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
ok
test_step_is_noop_when_params_have_no_grad_Adamax_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_LBFGS_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_step_is_noop_when_params_have_no_grad_NAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=non-zero momentum_decay
params=None, kwargs={'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=non-zero momentum_decay & foreach
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=non-zero momentum_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': False}, desc=non-default eps
params=None, kwargs={'eps': 1e-06, 'foreach': True, 'differentiable': False}, desc=non-default eps & foreach
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': True}, desc=non-default eps & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RMSprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': False}, desc=centered
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': True, 'differentiable': False}, desc=centered & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': True}, desc=centered & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Rprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.0002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': False}, desc=non-default etas
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': True, 'differentiable': False}, desc=non-default etas & foreach
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': True}, desc=non-default etas & differentiable
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': False}, desc=non-default step_sizes
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': True, 'differentiable': False}, desc=non-default step_sizes & foreach
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': True}, desc=non-default step_sizes & differentiable
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': False}, desc=dampening
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': True, 'differentiable': False}, desc=dampening & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': True}, desc=dampening & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=non-zero weight_decay
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=non-zero weight_decay & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=non-zero weight_decay & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nesterov
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nesterov & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nesterov & differentiable
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SparseAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 26 tests in 19.089s

OK
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116471
Approved by: https://github.com/albanD
2023-12-28 19:49:04 +00:00
577529daec [Dynamo] Implement a simple mutation tracker for user defined triton kernels (#116466)
This PR adds a very simple mutation tracking mechanism to dynamo which can later be improved to be more thorough. Currently it allows tensors to be in tl.load but if it sees a tensor used anywhere else (including a tl.load), it bails out.

One question about the method: is `ast.NodeVisitor` the best thing to use here? Having to detect mutations with it is not exactly pretty, since you need to keep setting state at each transition.
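
For illustration only, a minimal sketch of this kind of conservative `ast.NodeVisitor` (not the actual dynamo code; the names are made up):

```python
import ast

class SimpleMutationDetector(ast.NodeVisitor):
    """Conservatively flag a kernel if a tensor argument is used outside tl.load."""

    def __init__(self, tensor_args):
        self.tensor_args = set(tensor_args)
        self.might_mutate = False

    def visit_Call(self, node):
        f = node.func
        is_tl_load = (
            isinstance(f, ast.Attribute)
            and isinstance(f.value, ast.Name)
            and f.value.id == "tl"
            and f.attr == "load"
        )
        if is_tl_load:
            return  # tensors are allowed inside tl.load; don't descend
        self.generic_visit(node)

    def visit_Name(self, node):
        if node.id in self.tensor_args:
            self.might_mutate = True  # any other use: assume mutation and bail out


src = "def kern(x_ptr):\n    y = tl.load(x_ptr)\n    tl.store(x_ptr, y)"
detector = SimpleMutationDetector(["x_ptr"])
detector.visit(ast.parse(src))
assert detector.might_mutate  # the tl.store use of x_ptr trips the detector
```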

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116466
Approved by: https://github.com/aakhundov
2023-12-28 18:59:44 +00:00
f10c3f4184 Fix module pre bw hooks when input doesn't req grad but gradients are changed by the user (#116454)
As per title.

FYI @vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116454
Approved by: https://github.com/mikaylagawarecki
2023-12-28 18:32:50 +00:00
fb91acd33b [release] Add specific section about building and testing final rc (#116476)
Formalize the process of building and testing the final RC, to avoid having missing PRs in the release similar to this one: https://github.com/pytorch/pytorch/pull/114197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116476
Approved by: https://github.com/huydhn
2023-12-28 15:25:08 +00:00
b5e83b8c50 Fix edge case for size 1 channels dim in AdaptiveMaxPool (#116482)
Fixes https://github.com/pytorch/pytorch/issues/107842

Unlike `AdaptiveAvgPool`, `AdaptiveMaxPool` does not have a CUDA kernel for ChannelsLast. We work around this by calling `contiguous()` on the input. However, there is an edge case when the channels dimension has size 1.

```python
>>> t = torch.randn(2, 1, 3, 3)
>>> t.stride()
(9, 9, 3, 1)
>>> t_c =  t.to(memory_format=torch.channels_last)
>>> t_c.stride()
(9, 1, 3, 1)  # (CHW, 1, CW, C)
>>> t_c.is_contiguous()
True  # contiguity check doesn't check strides for singleton dimensions
```

Since the CUDA kernel treats the batch, `B`, and channels, `C`, dimensions as implicitly flattened and increments the data pointer for `input` to the start of the next plane using

669b182d33/aten/src/ATen/native/cuda/AdaptiveMaxPooling2d.cu (L67)

If our input falls into the aforementioned edge case, the `data_ptr` will not be incremented correctly. The simple fix for this is to calculate the stride for the channels dimension using $\prod_{i > 1}size(i)$

Analogous fix for the 3D case.
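
A small Python illustration of that stride computation (the actual fix is applied in the CUDA kernel's host code; this is only a sketch of the arithmetic):

```python
import math

def channels_plane_stride(sizes):
    # Number of elements in one (H, W) plane: prod(size(i) for i > 1).
    # Using this instead of the reported stride gives the correct pointer
    # increment even when the channels dim has size 1.
    return math.prod(sizes[2:])

# For the edge case above, shape (2, 1, 3, 3):
assert channels_plane_stride((2, 1, 3, 3)) == 9  # advance by 9 elements per plane
```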

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116482
Approved by: https://github.com/albanD
2023-12-28 15:02:29 +00:00
dfc898ede4 Don't decompose functional ops in predispatch functionalization (#116383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116383
Approved by: https://github.com/bdhirsh
ghstack dependencies: #115188, #115210
2023-12-28 11:54:04 +00:00
80c07df659 Update doc for the constraints of FractionalMaxPool2d (#116261)
Fixes [#115531](https://github.com/pytorch/pytorch/issues/115531)
Update doc for the constraints of FractionalMaxPool2d.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116261
Approved by: https://github.com/mikaylagawarecki
2023-12-28 06:55:36 +00:00
d791074c81 Clean up PyTorch op BC check list (#116468)
Summary: Remove the expired items.

Test Plan: CI

Differential Revision: D52435764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116468
Approved by: https://github.com/feikou
2023-12-28 06:05:59 +00:00
6243dbb5c0 [DTensor][BE] unify PlacementStrategy print function (#116428)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116428
Approved by: https://github.com/wanchaol
ghstack dependencies: #115683, #115689
2023-12-28 01:10:20 +00:00
87fea086aa [DTensor] remove experimental DTensor op backward layer norm (#115689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115689
Approved by: https://github.com/wanchaol, https://github.com/yoyoyocmu
ghstack dependencies: #115683
2023-12-28 01:10:20 +00:00
575f17ebd4 [DTensor] add layer norm backward support (#115683)
**Summary**
This PR adds a DTensor implementation for the ATen op `native_layer_norm_backward`.

**Test Plan**
pytest test/distributed/_tensor/test_math_ops.py -s -k layer_norm
pytest test/distributed/_tensor/test_dtensor_ops.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115683
Approved by: https://github.com/wanchaol
2023-12-28 01:10:10 +00:00
b3f7fdbf0a Add decomp for pad_sequence (#116285)
Summary: currently pad_sequence causes symbolic shape specialization in export, which is unintended. Adding a decomp seems to avoid the C++ kernel that caused the specialization.
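
A rough sketch of what such a decomposition could look like (illustrative only; the decomp actually registered by this PR may differ):

```python
import torch

def pad_sequence_decomp(sequences, batch_first=False, padding_value=0.0):
    # Pad every sequence to the max length along dim 0, then stack.
    # Staying in plain tensor ops avoids the C++ kernel that specialized shapes.
    max_len = max(seq.size(0) for seq in sequences)
    padded = []
    for seq in sequences:
        pad_rows = max_len - seq.size(0)
        pad = seq.new_full((pad_rows, *seq.shape[1:]), padding_value)
        padded.append(torch.cat([seq, pad], dim=0))
    out = torch.stack(padded, dim=0)             # (batch, max_len, ...)
    return out if batch_first else out.transpose(0, 1)
```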

Test Plan: buck test mode/opt caffe2/test:test_export -- -r pad_sequence

Reviewed By: SherlockNoMad

Differential Revision: D52345667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116285
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-12-27 23:56:51 +00:00
d59350cc1c [Dynamo] Consolidate common constant types (#116366)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116366
Approved by: https://github.com/Skylion007
2023-12-27 23:54:35 +00:00
6375eb15ef [Dynamo][11/N] allow_in_graph/disallow_in_graph decorator refactor (#116365)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116365
Approved by: https://github.com/jansel
2023-12-27 23:50:35 +00:00
53e32d12c4 [c10] Use nested namespace in c10/cuda (#116464)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116464
Approved by: https://github.com/Skylion007
2023-12-27 23:14:00 +00:00
93b86bf531 [GHF] Implement stacked revert (#116447)
By adding `get_ghstack_dependent_prs`, which uses `git branch --contains` to find all PRs containing the stacked branch, selects the longest one (in terms of distance between origin and the default branch), and skips all open PRs.

Please note that reverts should be applied in the reverse of the order in which the PRs were originally landed.

Use a bit of defensive programming, i.e. revert a single PR if the attempt to fetch its dependencies fails for some reason.
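
As a loose sketch of the selection logic described above (all names and helpers here are hypothetical, not the `trymerge` API):

```python
def plan_stacked_revert(pr, find_dependent_chains, is_open):
    try:
        # Branches that contain this PR's commit, each as a list of PRs
        # ordered from origin to the default branch.
        chains = find_dependent_chains(pr)
        chain = max(chains, key=len, default=[pr])   # pick the longest chain
    except Exception:
        return [pr]                                  # defensive fallback: revert just this PR
    closed = [p for p in chain if not is_open(p)]    # skip PRs that are still open
    return list(reversed(closed))                    # revert in reverse landing order
```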

Test plan:
 - Lint
 -  ```
    >>> from trymerge import GitRepo, GitHubPR, get_ghstack_prs, get_ghstack_dependent_prs
    >>> pr=GitHubPR("pytorch", "pytorch", 115188)
    >>> pr1=GitHubPR("pytorch", "pytorch", 115210)
    >>> repo=GitRepo("/Users/nshulga/git/pytorch/pytorch")
    >>> get_ghstack_dependent_prs(repo, pr1)
    [('22742d93a5357c9b5b45a74f91a6dc5599c9c266', <trymerge.GitHubPR object at 0x100f32f40>)]
    >>> get_ghstack_dependent_prs(repo, pr)
    [('22742d93a5357c9b5b45a74f91a6dc5599c9c266', <trymerge.GitHubPR object at 0x10102eaf0>), ('76b1d44d576c20be79295810904c589241ca1bd2', <trymerge.GitHubPR object at 0x10102eb50>)]
    >>> rc=get_ghstack_dependent_prs(repo, pr)
    >>> rc[0][1].pr_num
    115210
    >>> rc[1][1].pr_num
    115188
    ```
 - see: https://github.com/malfet/deleteme/pull/59#issuecomment-1869904714 and https://github.com/malfet/deleteme/pull/74#issuecomment-1870542702

Fixes https://github.com/pytorch/test-infra/issues/4845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116447
Approved by: https://github.com/huydhn
ghstack dependencies: #116446
2023-12-27 23:01:16 +00:00
5fcc2519f5 [GHF] Refactors (#116446)
Prep change for allowing stacked reverts

This is a no-op that factors out some helper functions that would be
useful later:
 - `get_pr_commit_sha` finds a committed sha for a given PR
 - `_revlist_to_prs` converts a revlist to GitHubPRs conditionally
   filtering some out
 - `do_revert_prs` reverts multiple PRs in a batch, but so far is
   invoked with only one PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116446
Approved by: https://github.com/huydhn, https://github.com/seemethere
2023-12-27 23:01:16 +00:00
85628c0e57 Revert "[export] Update range constraints to runtime_var_to_range (#115427)"
This reverts commit f8ad664cf267bcbdd8f8f85e27ad3a6e7d9fa86f.

Reverted https://github.com/pytorch/pytorch/pull/115427 on behalf of https://github.com/angelayi due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/115427#issuecomment-1870671728))
2023-12-27 22:44:45 +00:00
a17069684c Improve nn.modules.activation and batchnorm docs (#113531)
Fixes #112602

For some reason, I could not get the same output when running the pycodestyle command as indicated in the issue. I manually ran ruff checks, fixing the following issues: `D202`, `D204`, `D205`, `D207`, `D400` and `D401`.

### Requested output

nn.modules.activation:
before: 135
after: 79

nn.modules.batchnorm
before: 21
after: 3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113531
Approved by: https://github.com/mikaylagawarecki
2023-12-27 21:06:47 +00:00
3149e4a667 [dynamo] fix sum() function with start argument (#116389)
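A hypothetical repro of the pattern being fixed (illustrative only; not the test added in the PR):

```python
import torch

def f(xs):
    # builtin sum() with an explicit start value, traced by dynamo
    return sum(xs, torch.zeros(2))

out = torch.compile(f)([torch.ones(2), torch.ones(2)])
print(out)  # tensor([2., 2.])
```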
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116389
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-12-27 20:42:27 +00:00
83502feabe [BE]: Enable readability-simplify-subscript-expr clang-tidy check (#116356)
[BE]: enable clang-tidy check for readability-simplify-subscript-expr which looks for unnecessarily complex subscripting of the underlying data array of STL types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116356
Approved by: https://github.com/lezcano
2023-12-27 20:22:20 +00:00
8d84b5041c [pt-vulkan] Address CLANGTIDY warnings in api, graph, and impl folders (#116431)
## Context

**Currently, `*.h` and `*.cpp` produce many lint warnings/errors from `clang-tidy` in the Meta internal Phabricator mirror**. These changes address all the lint warnings in the `api`, `graph`, and `impl` folders in preparation for upcoming planned work.

## Review Guide

* Most changes are the result of automatically applied patches from `clang-tidy`
  * However, some warnings had to be manually addressed
  * There should be no functional changes
* Many of the `clang-tidy` warnings arose from the `facebook-hte-BadMemberName` rule which checks for compliance with variable naming rules from Meta's internal C++ style guide
  * However, the rest of the ATen codebase does not conform to this rule, and PyTorch Vulkan was written to be consistent with ATen's naming conventions; thus, to stay consistent with the rest of ATen, this rule is disabled wherever relevant using `// @lint-ignore-every CLANGTIDY facebook-hte-BadMemberName`
* Lint was disabled entirely for `vulkan_api_test.cpp` since there are too many warnings to address at the moment. Addressing all of them will be a small project of its own; thus, in the interim lint will be disabled to reduce distracting signals for developers.

Internal:

## Notes for Internal Reviewers

This diff was largely created with

```
cd ~/fbsource/xplat/caffe2/aten/src/ATen/native/vulkan
arc lint -e extra -a --take CLANGTIDY * 2>&1 | tee ~/scratch/lint.txt
```

The above command automatically applied patches suggested by `clang-tidy`, and the rest of the warnings were addressed manually.

To disable `facebook-hte-BadMemberName`, I found that disabling it via a `.clang-tidy` file didn't work with `arc lint`, and the only way that worked was adding a comment

```
// @lint-ignore-every CLANGTIDY facebook-hte-BadMemberName
```

Differential Revision: [D50336057](https://our.internmc.facebook.com/intern/diff/D50336057/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116431
Approved by: https://github.com/GregoryComer, https://github.com/kirklandsign
2023-12-27 19:29:18 +00:00
bbe3261dd3 [BE]: Use iterable.chain.from_iterable where possible (#116376)
This is more readable and more efficient when dealing with lots of sequences to chain together.
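
For reference, `chain.from_iterable` lives in the standard-library `itertools` module; a minimal comparison:

```python
import itertools

groups = [[1, 2], [3], [4, 5, 6]]

eager = list(itertools.chain(*groups))              # unpacks every group up front
lazy = list(itertools.chain.from_iterable(groups))  # consumes the outer iterable lazily

assert eager == lazy == [1, 2, 3, 4, 5, 6]
```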

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116376
Approved by: https://github.com/albanD
2023-12-27 19:20:07 +00:00
e0e90bc0d4 Revert "[dynamo] fix sum() function with start argument (#116389)"
This reverts commit 3c9076f070fab5b27eae3b7846755c98b7c97a1a.

Reverted https://github.com/pytorch/pytorch/pull/116389 on behalf of https://github.com/kit1980 due to Breaks Meta-internal tests, but the issue could have been caught on GitHub ([comment](https://github.com/pytorch/pytorch/pull/116389#issuecomment-1870556927))
2023-12-27 19:05:55 +00:00
5c9464fb51 add CALL_FINALLY opcode (#116159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116159
Approved by: https://github.com/yanboliang
2023-12-27 19:01:08 +00:00
f657b2b1f8 [Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
  - The only one left is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It is called when users put the ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide a function's trace rule, the decorator would only be used on custom user functions rather than torch functions. I'll defer this to a separate decorator refactor PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
2023-12-27 18:47:05 +00:00
87da0e1d23 [GHF] Fix gh_get_labels for small repos (#116444)
Not sure if this is a recent API change or what, but `gh_get_labels('malfet', 'deleteme')` used to raise an exception (see https://github.com/malfet/deleteme/actions/runs/7334535266/job/19971328673#step:6:37 )
```
  File "/home/runner/work/deleteme/deleteme/.github/scripts/label_utils.py", line 50, in get_last_page_num_from_header
    link_info[link_info.rindex(prefix) + len(prefix) : link_info.rindex(suffix)]
AttributeError: 'NoneType' object has no attribute 'rindex'
```

And with this fix it returns the expected list
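
A hypothetical sketch of the kind of guard that avoids calling `rindex` on `None` (names mirror the traceback; the actual fix in label_utils.py may differ):

```python
def get_last_page_num_from_header(headers: dict) -> int:
    link_info = headers.get("link")
    # Small repos fit all labels on one page, so GitHub returns no "link"
    # header at all; fall back to a single page instead of crashing.
    if link_info is None:
        return 1
    prefix, suffix = "page=", ">"
    return int(link_info[link_info.rindex(prefix) + len(prefix): link_info.rindex(suffix)])
```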

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116444
Approved by: https://github.com/huydhn
2023-12-27 15:50:42 +00:00
e14026bc2a [CUDNN] RNNv6 API deprecation support (#115719)
The cuDNN RNNv6 API has been deprecated and support will be dropped in an upcoming release; this PR migrates to the newer API to support newer cuDNN versions that would otherwise break the build.

Note that it may not be tested yet in upstream CI if the upstream CI cuDNN version is less than 8.9.7.

CC @ptrblck @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115719
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-27 09:31:08 +00:00
0aa5b751bb [executorch hash update] update the pinned executorch hash (#116438)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116438
Approved by: https://github.com/pytorchbot
2023-12-27 09:13:54 +00:00
924f1b841a [optim] Allow torch.float64 scalars for forloop + foreach implementations (#115841)
Should allow for use cases mentioned in #110940

This would allow scalars to also be float64s in the foreach implementation. The fused implementation would still create a float32 step on Adam and AdamW. This PR also does NOT worry about performance and is mainly for enablement.

Next steps:
- Relax the constraint on fused adam(w) and allow torch.float64 scalars there
- Allow _performant_ mixed dtypes in foreach (a bigger project in itself).

This PR will conflict with my other PRs, I will figure out a landing order

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115841
Approved by: https://github.com/albanD
2023-12-27 09:13:49 +00:00
1d13086492 [BE] force DTensorTestBase.build_device_mesh to use world_size rather than NUM_DEVICES constant (#116439)
**Test**:
`python test/distributed/fsdp/test_shard_utils.py -k test_create_chunk_dtensor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116439
Approved by: https://github.com/wanchaol
2023-12-27 07:37:07 +00:00
6b91e6907e Add setUserEnabledNNPACK config (#116152)
When exporting a model with a convolution kernel on cpu, if mkldnn is disabled and nnpack is enabled, export will go down the nnpack-optimized convolution kernel path for certain shapes ([code pointer](cd449e260c/aten/src/ATen/native/Convolution.cpp (L542-L552))). This means that we will automatically create a guard on that specific shape. If users want to export without any restrictions, one option is to disable nnpack. However, no config function exists for this, so this PR is adding a config function, similar to the `set_mkldnn_enabled` function.

Original context is in https://fb.workplace.com/groups/1075192433118967/posts/1349589822345892/?comment_id=1349597102345164&reply_comment_id=1349677642337110.

To test the flag, the following script runs successfully:
```
import os

import torch
from torchvision.models import ResNet18_Weights, resnet18

torch.set_float32_matmul_precision("high")

model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.eval()

with torch.no_grad():
    # device = "cuda" if torch.cuda.is_available() else "cpu"
    torch.backends.mkldnn.set_flags(False)
    torch.backends.nnpack.set_flags(False)   # <--- Added config
    device = "cpu"
    model = model.to(device=device)
    example_inputs = (torch.randn(2, 3, 224, 224, device=device),)
    batch_dim = torch.export.Dim("batch", min=2, max=32)
    so_path = torch._export.aot_compile(
        model,
        example_inputs,
        # Specify the first dimension of the input x as dynamic
        dynamic_shapes={"x": {0: batch_dim}},
        # Specify the generated shared library path
        options={
            "aot_inductor.output_path": os.path.join(os.getcwd(), "resnet18_pt2.so"),
            "max_autotune": True,
        },
    )

```

I'm not sure who to add as reviewer, so please feel free to add whoever is relevant!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116152
Approved by: https://github.com/malfet
2023-12-27 06:00:16 +00:00
9c3ae37fc4 [Distributed] Add finer granularity tag for distributed submodule (#116434)
This PR is the start of integrating PyTorch distributed logs into Torch LOGs. We already have one tag, "distributed", for all distributed components, but distributed is a very large component and we want some hierarchy, giving users the option to turn on logs only for certain submodules. So we also added tags starting with "dist_*" for each submodule. (This PR only adds some of them; we are going to add more down the road.)

Related discussions can be found here: https://github.com/pytorch/pytorch/issues/113544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116434
Approved by: https://github.com/awgu, https://github.com/wanchaol
2023-12-27 04:09:34 +00:00
2c89e5a5e5 [inductor] Sort unbacked symbols before iterating on them (#116421)
get_unbacked_symbol_defs and get_unbacked_symbol_uses inconsistently return dicts vs. sets. Most use cases of these methods rely on set membership, which is deterministic, but set iteration is non-deterministic. Therefore, in the one place where we iterate through unbacked symbols, we sort by the symbol name before iterating to preserve determinism.

Another approach would be to have these functions consistently return dictionaries, where the key of the dictionary is the name of the symbol. I'm happy to do that approach if we think it's likely future code will forget to sort before iteration.
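
A minimal sketch of the determinism issue and the sorting fix described above (hypothetical symbol names; `sympy` used only for illustration):

```python
import sympy

# Unbacked symbols live in a set: membership tests are deterministic,
# but iteration order can vary between processes (hash randomization).
unbacked = {sympy.Symbol("u2"), sympy.Symbol("u0"), sympy.Symbol("u1")}

# Sorting by the symbol's name before iterating restores a stable order.
for sym in sorted(unbacked, key=lambda s: s.name):
    print(sym)  # u0, u1, u2
```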

Fixes #113130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116421
Approved by: https://github.com/oulgen, https://github.com/aakhundov
2023-12-27 03:35:58 +00:00
362bc6d7cb Fixed a segfault issue when passing an empty kernel to quantized_max_… (#116342)
…pool1d.

Fixes #116323.

Reused the same check as for `max_pool1d`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116342
Approved by: https://github.com/jerryzh168
2023-12-27 01:22:49 +00:00
d0395239c1 [DTensor] allow OpStrategy to represent ops whose return type is a tuple (#115682)
**Summary**:
Ops like `native_layer_norm_backward` return a tuple of optional torch.Tensor.
This PR allows using OpStrategy to represent `native_layer_norm_backward`'s
return value sharding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115682
Approved by: https://github.com/wanchaol
2023-12-27 00:44:11 +00:00
44b98c09ca [BE] migrate all assertRaises tests to OptimizerInfo test_errors (#116315)
Removes a part of the sparse adam test and the following three tests: `test_fused_optimizer_raises`, `test_duplicate_params_across_param_groups`, `test_duplicate_params_in_one_param_group`

```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_fused_optimizer_raises -k test_duplicate_params_across_param_groups -k test_duplicate_params_in_one_param_group
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...
----------------------------------------------------------------------
Ran 3 tests in 0.023s

OK
```

Increases coverage by running the duplicate-param tests on ALL the optims instead of just one each. Also fixes a SparseAdam bug where a bare tensor param was accidentally unbound via list() instead of being put into a list. This bug was caught by migrating the weird warning handling to one simple warning context manager, which checks that nothing else gets raised.
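
A small sketch of the list()-vs-[...] mix-up behind the SparseAdam bug (hypothetical parameter; not the optimizer's internal code):

```python
import torch

param = torch.zeros(3, requires_grad=True)

# list(tensor) iterates the tensor, i.e. effectively torch.unbind(param, 0),
# yielding three 0-dim views instead of one 3-element parameter.
unbound = list(param)
assert len(unbound) == 3 and unbound[0].dim() == 0

# The intended construction: wrap the parameter itself in a list.
opt = torch.optim.SparseAdam([param], lr=0.1)
```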

The new test_errors does not run slower than before, overhead is still king:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..........................
----------------------------------------------------------------------
Ran 26 tests in 10.337s

OK
```

Compared to test_errors BEFORE my commit :p
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.............sssssssssssss
----------------------------------------------------------------------
Ran 26 tests in 11.980s

OK (skipped=13)
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116315
Approved by: https://github.com/mikaylagawarecki
2023-12-27 00:08:31 +00:00
8abeacda6f Refactor user defined triton kernel tests (#116425)
I will be adding more triton tests of different types, so I'm moving them to a brand new file. While doing this, I also cleaned up some flake8 linting opt-outs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116425
Approved by: https://github.com/aakhundov
2023-12-26 23:54:26 +00:00
3b709d7c1e Revert "[Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)"
This reverts commit 015bd0e0a189f929e469c6bc75fe1541c18a014d.

Reverted https://github.com/pytorch/pytorch/pull/116312 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/116312#issuecomment-1869825506))
2023-12-26 23:47:15 +00:00
13505898c9 Revert "[Dynamo][11/N] allow_in_graph/disallow_in_graph decorator refactor (#116365)"
This reverts commit 951da38800f66e2d2bb2bb8e87e12218d1e28b8c.

Reverted https://github.com/pytorch/pytorch/pull/116365 on behalf of https://github.com/kit1980 due to Need to revert this because of https://github.com/pytorch/pytorch/pull/116312 ([comment](https://github.com/pytorch/pytorch/pull/116365#issuecomment-1869824468))
2023-12-26 23:43:45 +00:00
0aa185f394 [BE] Make torch.cuda.has_magma a build time check (#116299)
Perhaps originally one needed to query about GPU capability, but right now it's a simple check for a build time flag: 52f0457d7d/aten/src/ATen/cuda/detail/CUDAHooks.cpp (L165-L171)

Alternative, to avoid `at::hasMAGMA()` call  one can implement it as follows:
```cpp
  const auto use_magma = caffe2::GetBuildOptions().at("USE_MAGMA");
  return PyBool_FromLong(use_magma == "1");
```

Make this check very similar to `_has_mkldnn`
0978482afa/torch/csrc/Module.cpp (L1793-L1794)

Test plan:
 Run `lldb -- python3 -c "import torch;print(torch.cuda.has_magma)"` and make sure it returns True and that `cuInit` is not called

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116299
Approved by: https://github.com/seemethere, https://github.com/albanD
2023-12-26 23:37:23 +00:00
0edc348788 Revert "[Dynamo] Consolidate common constant types (#116366)"
This reverts commit 36dccc2aba61a2637aa5d42f38b6fd1fe10dcbdc.

Reverted https://github.com/pytorch/pytorch/pull/116366 on behalf of https://github.com/kit1980 due to Need to revert this because of https://github.com/pytorch/pytorch/pull/116312 ([comment](https://github.com/pytorch/pytorch/pull/116366#issuecomment-1869821625))
2023-12-26 23:36:52 +00:00
e86636266f [Quantized] Fixed equal_quantized_cpu for QUInt4 (#116307)
- Return false if scalar_type is different (because QInt8 and QUInt8 have identical item_size but shouldn't be compared by comparing data)
- Compute data_size correctly for QUInt4x2 and QUInt2x4 dtypes
- Add regression test

Fixes https://github.com/pytorch/pytorch/issues/116087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116307
Approved by: https://github.com/jerryzh168
2023-12-26 21:52:28 +00:00
e5bcfe205e [inductor] fix cpp_wrapper inputs mismatch (#116197)
Summary: fixes https://github.com/pytorch/pytorch/issues/115035, where in the cpp_wrapper JIT inductor, the input args should contain the lifted parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116197
Approved by: https://github.com/jansel
2023-12-26 21:41:47 +00:00
7571511af9 [inductor] More tweaks to fusion logs (#115084)
I think it's more useful to print out actual fusions rather than
possible fusions.

I also updated `speedup_by_fusion`'s logs to include the node names in
the log output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115084
Approved by: https://github.com/jansel, https://github.com/aakhundov
2023-12-26 20:25:57 +00:00
6051f9f404 multiply int8/uint8 for AVX512 (#116346)
Summary: multiply int8/uint8 for AVX512

Test Plan: sandcastle, github

Differential Revision: D52393918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116346
Approved by: https://github.com/jgong5
2023-12-26 19:44:05 +00:00
51eef859eb min, max, clamp* for AVX2 hosts (#116236)
Summary: min, max, clamp* for AVX2 hosts

Test Plan: sandcastle, github

Differential Revision: D52353148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116236
Approved by: https://github.com/alexsamardzic, https://github.com/malfet
2023-12-26 19:43:43 +00:00
427ecc61c0 [Easy][BE]: Fix none type comparison (#116399)
Simplifies the comparison: checking types is unneeded since None is a singleton, so any variable set to None refers to the same None object and an identity check suffices.
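
A minimal before/after sketch of the simplification (hypothetical variable):

```python
x = None

# Before: comparing types does extra work for no benefit.
if type(x) == type(None):
    print("x is unset")

# After: None is a singleton, so an identity check is enough and idiomatic.
if x is None:
    print("x is unset")
```
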
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116399
Approved by: https://github.com/XuehaiPan, https://github.com/lezcano, https://github.com/malfet
2023-12-26 19:36:34 +00:00
0978482afa Revert "Implement aten::upsample_linear1d on mps (#115031)"
This reverts commit c6969cb8a93a7dfd3f1bf17716470174bb973076.

Reverted https://github.com/pytorch/pytorch/pull/115031 on behalf of https://github.com/malfet due to Broke lint, will fwd fix and re-land ([comment](https://github.com/pytorch/pytorch/pull/115031#issuecomment-1869693081))
2023-12-26 18:01:49 +00:00
f4230ec9fd [inductor] Remove the float16 restriction for cpu cpp_wrapper (#116205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116205
Approved by: https://github.com/jgong5, https://github.com/chunyuan-w, https://github.com/jansel
2023-12-26 16:01:20 +00:00
Kai
c6969cb8a9 Implement aten::upsample_linear1d on mps (#115031)
Related to #77764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115031
Approved by: https://github.com/malfet
2023-12-26 15:44:21 +00:00
4c6e842496 [inductor][cpp] load as scalar for the index invariant in the vector range (#116387)
For the test `test_expr_vec_non_contiguous`, the index_expr `31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L))` is invariant under the vector range of `x2`.
Before change
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 =
                            [&]
                            {
                                __at_align__ std::array<int, 16> tmpbuf;
                                #pragma GCC unroll 16
                                for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                {
                                    tmpbuf[x1_inner] = static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (c10::div_floor_integer(x2, 32L)));
                                }
                                return at::vec::Vectorized<int>::loadu(tmpbuf.data());
                            }
                            ()
                            ;
                            auto tmp1 = static_cast<int>(2048);
                            auto tmp2 = at::vec::Vectorized<int>(tmp1);
                            auto tmp3 = to_float_mask(tmp0 < tmp2);
                            auto tmp4 = [&]
                            {
                                auto tmp5 =
                                [&]
                                {
                                    __at_align__ std::array<float, 16> tmpbuf;
                                    #pragma GCC unroll 16
                                    for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                    {
                                        if (vector_lane_mask_check(tmp3, x1_inner))
                                        {
                                            tmpbuf[x1_inner] = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (2048L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
                                        }
                                    }
                                    return at::vec::Vectorized<float>::loadu(tmpbuf.data());
                                }
                                ()
                                ;
                                return tmp5;
                            }
                            ;
                            auto tmp6 =
                            [&]
                            {
                                if (all_zero(to_float_mask(tmp3)))
                                {
                                    return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                }
                                else
                                {
                                    return decltype(tmp4())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp4(), to_float_mask(tmp3));
                                }
                            }
                            ()
                            ;
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp6);
                        }
                        tmp_acc0_vec.store(out_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                    }
                }
            }
        }
```
After change
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = c10::convert<int>(31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L)));
                            auto tmp1 = static_cast<int>(2048);
                            auto tmp2 = tmp0 < tmp1;
                            auto tmp3 = [&]
                            {
                                auto tmp4 =
                                [&]
                                {
                                    __at_align__ std::array<float, 16> tmpbuf;
                                    #pragma GCC unroll 16
                                    for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                    {
                                        if (tmp2 != 0)
                                        {
                                            tmpbuf[x1_inner] = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (2048L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
                                        }
                                    }
                                    return at::vec::Vectorized<float>::loadu(tmpbuf.data());
                                }
                                ()
                                ;
                                return tmp4;
                            }
                            ;
                            auto tmp5 =
                            [&]
                            {
                                if (all_zero(to_float_mask(tmp2)))
                                {
                                    return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                }
                                else
                                {
                                    return decltype(tmp3())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp3(), to_float_mask(tmp2));
                                }
                            }
                            ()
                            ;
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp5);
                        }
                        tmp_acc0_vec.store(out_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                    }
                }
            }
        }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116387
Approved by: https://github.com/EikanWang, https://github.com/lezcano
ghstack dependencies: #114545
2023-12-26 08:45:04 +00:00
3c9076f070 [dynamo] fix sum() function with start argument (#116389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116389
Approved by: https://github.com/Skylion007
2023-12-26 06:37:55 +00:00
cyy
bb2a1e9941 Enable readability-redundant-smartptr-get in clang-tidy (#116381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116381
Approved by: https://github.com/Skylion007
2023-12-26 06:05:15 +00:00
ffe6f9ac91 [inductor cpp] support vectorization for index_expr that depends on tiling itervar or with indirect indexing (#114545)
As the title says, this PR enables vectorization for the situation where the index_expr depends on the vectorized itervar. There are two cases here:
1. The vectorized itervar has constant stride in the index_expr. We vectorize the index_expr with `Vectorized<int32>::arange` for this case.
2. Otherwise, we load the index_expr vector in a non-contiguous way with a loop.

Below is the generated code for the first case from the test `test_concat_inner_vec`. Here `x1` is the index_expr and depends on the vectorized itervar `x1`. It has constant stride 1. We vectorized it with arange. We use `all_zero` to implement a short-cut for masks to avoid unnecessary execution of nested masked regions which are invalid.
Before:
```c++
            #pragma omp for  collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(155L); x1+=static_cast<long>(1L))
                {
                    auto tmp0 = c10::convert<long>(x1);
                    auto tmp1 = static_cast<long>(0);
                    auto tmp2 = tmp0 >= tmp1;
                    auto tmp3 = static_cast<long>(35);
                    auto tmp4 = tmp0 < tmp3;
                    auto tmp5 = [&]
                    {
                        auto tmp6 = in_ptr0[static_cast<long>(x1 + (35L*x0))];
                        return tmp6;
                    }
                    ;
                    auto tmp7 = tmp4 ? tmp5() : static_cast<decltype(tmp5())>(0.0);
                    auto tmp8 = tmp0 >= tmp3;
                    auto tmp9 = static_cast<long>(155);
                    auto tmp10 = tmp0 < tmp9;
                    auto tmp11 = [&]
                    {
                        auto tmp12 = in_ptr1[static_cast<long>((-35L) + x1 + (120L*x0))];
                        return tmp12;
                    }
                    ;
...
```
After:
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(144L); x1+=static_cast<long>(16L))
                {
                    auto tmp0 = c10::convert<int>(x1);
                    auto tmp1 = at::vec::Vectorized<int32_t>::arange(tmp0, 1);
                    auto tmp2 = static_cast<int>(0);
                    auto tmp3 = at::vec::Vectorized<int>(tmp2);
                    auto tmp4 = to_float_mask(tmp1 >= tmp3);
                    auto tmp5 = static_cast<int>(35);
                    auto tmp6 = at::vec::Vectorized<int>(tmp5);
                    auto tmp7 = to_float_mask(tmp1 < tmp6);
                    auto tmp8 = [&]
                    {
                        auto tmp9 = masked_load(in_ptr0 + static_cast<long>(x1 + (35L*x0)), to_float_mask(tmp7));
                        return tmp9;
                    }
                    ;
                    auto tmp10 =
                    [&]
                    {
                        if (all_zero(to_float_mask(tmp7)))
                        {
                            return at::vec::Vectorized<float>(static_cast<float>(0.0));
                        }
                        else
                        {
                            return decltype(tmp8())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp8(), to_float_mask(tmp7));
                        }
                    }
                    ()
                    ;
...
```

Below is the generated code for the second case from the test case `test_expr_vec_non_contiguous`. Here, the index_expr is `31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L))` which depends on the vectorized itervar `x2` and doesn't have constant stride. So, we load the index_expr vector with a loop. (In fact, this can be further optimized since the index_expr is invariant with the data points in the range [x2, x2+16). So it can be regarded as a scalar. This will be optimized in the follow-up PR.) The code uses `vector_lane_mask_check` to implement the masked version of non-contiguous load.
Before:
```c++
            #pragma omp for  collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(1L))
                {
                    {
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = c10::convert<long>(31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L)));
                            auto tmp1 = static_cast<long>(2048);
                            auto tmp2 = tmp0 < tmp1;
                            auto tmp3 = [&]
                            {
                                auto tmp4 = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer(x1, 32L))) + (2048L*(static_cast<long>(x1) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
                                return tmp4;
                            }
                            ;
                            auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                            tmp_acc0 = max_propagate_nan(tmp_acc0, tmp5);
                        }
                        out_ptr0[static_cast<long>(x1 + (1024L*x0))] = tmp_acc0;
                    }
                }
            }
```
After:
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 =
                            [&]
                            {
                                __at_align__ std::array<int, 16> tmpbuf;
                                #pragma GCC unroll 16
                                for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                {
                                    tmpbuf[x1_inner] = static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (c10::div_floor_integer(x2, 32L)));
                                }
                                return at::vec::Vectorized<int>::loadu(tmpbuf.data());
                            }
                            ()
                            ;
                            auto tmp1 = static_cast<int>(2048);
                            auto tmp2 = at::vec::Vectorized<int>(tmp1);
                            auto tmp3 = to_float_mask(tmp0 < tmp2);
                            auto tmp4 = [&]
                            {
                                auto tmp5 =
                                [&]
                                {
                                    __at_align__ std::array<float, 16> tmpbuf;
                                    #pragma GCC unroll 16
                                    for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                    {
                                        if (vector_lane_mask_check(tmp3, x1_inner))
                                        {
                                            tmpbuf[x1_inner] = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (2048L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
                                        }
                                    }
                                    return at::vec::Vectorized<float>::loadu(tmpbuf.data());
                                }
                                ()
                                ;
                                return tmp5;
                            }
                            ;
                            auto tmp6 =
                            [&]
                            {
                                if (all_zero(to_float_mask(tmp3)))
                                {
                                    return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                }
                                else
                                {
                                    return decltype(tmp4())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp4(), to_float_mask(tmp3));
                                }
                            }
                            ()
                            ;
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp6);
                        }
                        tmp_acc0_vec.store(out_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                    }
                }
            }
        }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114545
Approved by: https://github.com/lezcano
2023-12-26 05:36:39 +00:00
a254fbfd61 Initialize variable for all codepaths in dynamo benchmarks (#116260)
Sometimes the first statement that sets this variable in the try block fails due to out-of-memory issues; the finally block then tries to delete the variable even though it was never assigned in the first place.
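
A minimal sketch of the failure mode and the fix (hypothetical names; not the benchmark harness itself):

```python
def run_once(make_inputs):
    inputs = None  # initialize up front so every code path defines the name
    try:
        inputs = make_inputs()  # may raise, e.g. CUDA out of memory
        # ... run and time the model ...
    finally:
        # Without the initialization above, `del inputs` would raise
        # UnboundLocalError whenever make_inputs() failed.
        del inputs
```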

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116260
Approved by: https://github.com/lezcano
2023-12-26 05:15:39 +00:00
f6dfbffb3b [c10d] Add hashing as a debug feature for before and after NCCL collective call (#113238)
For now, we use `TORCH_DISTRIBUTED_DEBUG=DETAIL` to turn on a debug feature which calculates a hash of the input tensors and output results of c10d collectives in NCCL. This is a debugging feature so that we can rule bugs in or out at the c10d level.
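
A rough sketch of the idea (the real feature hashes on the C++ side inside the NCCL process group; this helper and its names are hypothetical):

```python
import hashlib
import torch

def debug_hash(t: torch.Tensor) -> str:
    # Hash the raw bytes of a contiguous CPU copy so identical data
    # produces an identical digest on every rank, before and after a collective.
    return hashlib.sha256(t.detach().contiguous().cpu().numpy().tobytes()).hexdigest()[:16]

x = torch.arange(8, dtype=torch.float32)
print(debug_hash(x))
```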

<img width="840" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/cdc70b0b-ae3c-4efd-86ff-adc5c5ba505f">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113238
Approved by: https://github.com/wconstab, https://github.com/fegin
2023-12-25 22:25:38 +00:00
039fbeb016 [dynamo] fix functools.reduce() function with None as initial (#116398)
The `initial` argument in `functools.reduce` can be `None`.

```python
initial_missing = object()

def reduce(function, iterable, initial=initial_missing, /):
    it = iter(iterable)
    if initial is initial_missing:
        value = next(it)
    else:
        value = initial
    for element in it:
        value = function(value, element)
    return value
```

Reference:

- python/cpython#102759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116398
Approved by: https://github.com/Skylion007
2023-12-25 21:23:28 +00:00
c7e9c15102 Ignore SIGINT in codecache workers (#116380)
Fixes #116379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116380
Approved by: https://github.com/Skylion007
2023-12-25 08:59:54 +00:00
951da38800 [Dynamo][11/N] allow_in_graph/disallow_in_graph decorator refactor (#116365)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116365
Approved by: https://github.com/jansel
2023-12-25 07:15:09 +00:00
22742d93a5 Expose functional IR to capture_pre_autograd (#115210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115210
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115188
2023-12-25 04:51:21 +00:00
76b1d44d57 pre_dispatch aot_export (#115188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115188
Approved by: https://github.com/bdhirsh
2023-12-25 04:51:21 +00:00
36dccc2aba [Dynamo] Consolidate common constant types (#116366)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116366
Approved by: https://github.com/Skylion007
2023-12-24 22:58:01 +00:00
199e07f108 [pytree][BE] update treespec num_children access (#116370)
Change `len(treespec.children_specs)` -> `treespec.num_children`.
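
A tiny sketch of the access pattern being updated (using the private pytree API referenced in the message):

```python
import torch.utils._pytree as pytree

_, spec = pytree.tree_flatten({"a": 1, "b": (2, 3)})

# Before: len(spec.children_specs)
# After:  the dedicated property
print(spec.num_children)  # 2 top-level children: "a" and "b"
```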

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116370
Approved by: https://github.com/Skylion007
2023-12-24 20:54:32 +00:00
81cebca3d2 [Inductor] [Quant] Fix QConv Binary Inplace Layout Issue (#115613)
This pull request primarily addresses two issues to resolve the `QConvPointWiseBinaryPT2E` layout problem:

- Following the changes made in 611a7457ca, for `QConvPointWiseBinaryPT2E` with post-op `sum`, we should also utilize `NoneLayout` and return `accum` instead of `QConvPointWiseBinaryPT2E`.

- Additionally, this pull request fixes an issue in the `_quantized_convolution_onednn` implementation. Given that we expect `accum` to be inplace changed, we should avoid copying `accum` by changing the memory format or data type inside the kernel implementation. Instead, we have moved the necessary changes of memory format or data type to the lowering of `QConvPointWiseBinaryPT2E`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115613
Approved by: https://github.com/jgong5, https://github.com/oulgen
ghstack dependencies: #116172
2023-12-24 08:04:29 +00:00
dfb6815170 [Reland] [PT2] [Quant] Change the QConv2d Binary post op name from add to sum (#116172)
**Summary**
Re-land https://github.com/pytorch/pytorch/pull/115329. Open a new PR since the origin branch has been deleted.
Change the QConv2d Binary fusion post op name from `add` to `sum`, since we are actually using OneDNN `post op sum` instead of `Binary_Add` for now.

**TestPlan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_sum_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_float_output_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116172
Approved by: https://github.com/kit1980
2023-12-24 08:00:21 +00:00
7cdbdc789d [executorch hash update] update the pinned executorch hash (#116362)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116362
Approved by: https://github.com/pytorchbot
2023-12-24 05:02:05 +00:00
f1cdb39da3 [dynamo] Fix handling of one_hot (#116338)
Fixes #115817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116338
Approved by: https://github.com/yanboliang
2023-12-24 04:55:35 +00:00
dbbe8485b4 Fake Tensor refactors part 2 (#116345)
This should help trace time a bit.
This refactors `op_implementations` (which requires O(n) checks per op) to mostly use a dict with O(1) cost per op.
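
A minimal sketch of the O(n)-to-O(1) dispatch change (hypothetical op names and handlers, not the FakeTensor code):

```python
# Before: scan a list of (predicate, handler) pairs -- O(n) per op.
op_impls_list = [
    (lambda op: op == "aten::add", lambda: "add impl"),
    (lambda op: op == "aten::mul", lambda: "mul impl"),
]

def dispatch_slow(op):
    for matches, handler in op_impls_list:
        if matches(op):
            return handler()
    return None

# After: key handlers directly by op -- O(1) per op for the common case.
op_impls_dict = {
    "aten::add": lambda: "add impl",
    "aten::mul": lambda: "mul impl",
}

def dispatch_fast(op):
    handler = op_impls_dict.get(op)
    return handler() if handler is not None else None

assert dispatch_slow("aten::mul") == dispatch_fast("aten::mul") == "mul impl"
```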

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116345
Approved by: https://github.com/yanboliang
2023-12-24 04:54:50 +00:00
6c419a0efd Fixed a segfault when calling topk on a quantized scalar tensor. (#116337)
Fixes #116324.

Added an extra check for empty sizes (=scalars) when running `topk` on quantized tensors. Added a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116337
Approved by: https://github.com/Skylion007
2023-12-23 23:21:12 +00:00
3a4fe835cc Fixed segfault when trying to permute empty tensor (#116335)
Fixes #116325.

Fixed unchecked access to first element of `dims` when permuting an empty tensor. Added test to prevent regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116335
Approved by: https://github.com/Skylion007
2023-12-23 23:14:28 +00:00
015bd0e0a1 [Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
  - The only one left is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It is called when users put the ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide a function's trace rule, the decorator would only be used on custom user functions rather than torch functions. I'll defer this to a separate decorator refactor PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
2023-12-23 09:44:09 +00:00
4912922297 Fake Tensor refactors part 1 (#116344)
These are mostly small performance optimizations to move constant list construction into global scope and replace O(n) `x in list` checks with O(1) `x in dict` checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116344
Approved by: https://github.com/yanboliang
2023-12-23 08:38:26 +00:00
08b404e3a2 [Dynamo] Remove ExecutionRecorder.MOD_EXCLUDES during replay & record (#116347)
Remove ```ExecutionRecorder.MOD_EXCLUDES``` since now torch python modules are wrapped as ```PythonModuleVariable``` after #115724.
This was reported from Meta-internal use cases, where it triggered a failure when replay & record was enabled. The enablement came from ```TORCH_COMPILE_DEBUG=1``` rather than from an actual need for the feature; according to a conversation with the team members, they are not really using it. Since we don't maintain replay & record well, we can probably remove it from our codebase to avoid such issues in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116347
Approved by: https://github.com/jansel
2023-12-23 08:13:14 +00:00
cyy
7663ffb673 [10/N] Fixes clang-tidy warnings in c10/util/*.h (#116326)
Still a continued work for clean up c10/util/*.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116326
Approved by: https://github.com/Skylion007
2023-12-23 04:59:55 +00:00
84b2a32359 [executorch hash update] update the pinned executorch hash (#115599)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115599
Approved by: https://github.com/huydhn
2023-12-23 04:07:23 +00:00
60f4114769 Support nn_module_stack in non_strict mode (#116309)
Summary: Title

Test Plan: CI

Differential Revision: D52382672

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116309
Approved by: https://github.com/zhxchen17
2023-12-23 03:34:58 +00:00
0931170a13 [vision hash update] update the pinned vision hash (#116343)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116343
Approved by: https://github.com/pytorchbot
2023-12-23 03:16:06 +00:00
4f4b931aba [inductor] Do variance calculation in opmath type (#115181)
Fixes #114903

Previously, large split variance reductions stored the intermediates in float16 precision, which may lead to overflow since the intermediate result is unnormalized.
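
A small numeric illustration of the overflow (hypothetical sizes and values):

```python
import torch

x = torch.full((100_000,), 60.0, dtype=torch.float16)

# The unnormalized intermediate (sum of squares) is ~3.6e8, far above
# float16's max of ~65504, so the float16 result is inf.
print((x * x).sum(dtype=torch.float16))  # inf

# Accumulating in the opmath type (float32) keeps the intermediate finite.
print((x.float() * x.float()).sum())     # 3.6e8
print(x.float().var())                   # 0.0 after normalization
```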

In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
2023-12-23 01:06:43 +00:00
65c5eed01d [sigmoid] Remove workaround for constant output. (#116288)
Summary: no more workaround_export_bug_constant_buffer_output

Test Plan:
buck2 run mode/dev-nosan //scripts/ads_pt2_inference:pt2_cli -- --src_model manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/473164617/6/gpu_lowering/input.predictor.disagg.gpu.merge

buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:generate_merge_net_file -- --action=generate --lower_backend=aot_inductor_ep --input_file=/data/users/zhxchen17/fbsource/fbcode/input.predictor.disagg.gpu.merge --output_file=/tmp/409501788_66.predictor.disagg.gpu.merge

buck2 run mode/opt -c fbcode.nvcc_arch=a100 caffe2/torch/fb/model_transform/fx2trt/packaging:load_merge_net_predictor -- --loadMode=Normal --inputMergeNetFile=/tmp/409501788_66.predictor.disagg.gpu.merge --pytorch_predictor_sigmoid_enabled=true

Reviewed By: khabinov, SherlockNoMad

Differential Revision: D52210429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116288
Approved by: https://github.com/tugsbayasgalan
2023-12-22 20:33:09 +00:00
3f9e9ecfe4 Fix torch.detach doc-string (#115850)
Fixes https://github.com/pytorch/pytorch/issues/98976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115850
Approved by: https://github.com/albanD
2023-12-22 20:04:33 +00:00
b940fa2fce Delete unused global variable (#116228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116228
Approved by: https://github.com/angelayi
ghstack dependencies: #116225, #116226
2023-12-22 19:07:59 +00:00
f08c4da86d Add a decomposition for take() (#114813)
Presumably this can close https://github.com/pytorch/pytorch/pull/109784

Also related to https://github.com/pytorch/pytorch/issues/93757 (though `take` is not listed there).

There's no bounds checking here (out of bounds indices cause a segfault or undefined behavior). Should that be added somehow?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114813
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-12-22 18:14:57 +00:00
341c4227a8 Update F32 sparse semi-structured support for CUTLASS back-end (#116017)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116017
Approved by: https://github.com/jcaip
2023-12-22 16:53:04 +00:00
0b9146bf5d [BE][Easy]: Update ruff to 0.1.9 (#116290)
Updates the ruff linter with lots of bugfixes, speed improvements, and fix improvements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116290
Approved by: https://github.com/janeyx99, https://github.com/malfet
2023-12-22 15:26:02 +00:00
0e39f4db92 Disables denormal floating numbers on ARM CPU (#115184)
**Motivation:**
Denormal numbers are used to store extremely small numbers that are close to 0, and operating on them can incur extra computational cost. To address the low performance caused by denormal numbers, PyTorch supports flushing them to zero via torch.set_flush_denormal().

Currently set_flush_denormal() is only supported on x86 architectures supporting SSE3 (https://pytorch.org/docs/stable/generated/torch.set_flush_denormal.html), and we now want to extend this functionality to the ARM architecture.

**This PR:**
- Supports set_flush_denormal() on ARM.
- Datatypes supported and tested: FP64, FP32, BFloat16
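
A short usage sketch of the API this PR extends to ARM (mirroring the documented behavior):

```python
import torch

# Returns True if the platform supports flushing denormals
# (x86 with SSE3, and ARM after this change).
if torch.set_flush_denormal(True):
    print(torch.tensor([1e-323], dtype=torch.float64))  # tensor([0.]) -- flushed
    torch.set_flush_denormal(False)
    print(torch.tensor([1e-323], dtype=torch.float64))  # tensor([9.8813e-324])
```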

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115184
Approved by: https://github.com/jgong5
2023-12-22 13:56:46 +00:00
cyy
9a0c217a0a [9/N] Fixes clang-tidy warnings in c10/util/*.h (#116185)
Continued work to clean headers in c10/util.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116185
Approved by: https://github.com/Skylion007
2023-12-22 09:35:44 +00:00
c7514ccc8c Delete unused API again (#116226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116226
Approved by: https://github.com/angelayi
ghstack dependencies: #116225
2023-12-22 09:30:00 +00:00
7a6cb9fdfb [Inductor Intel GPU backend Upstream] Step 1/3: Generalize device-bias code in code generation. (#116020)
As the [RFC](https://github.com/pytorch/pytorch/issues/114856) mentions, this is step 1 of adding the Intel GPU backend as an alternative Inductor backend.

### Design
Typically, in order to integrate Intel GPU backend into Inductor, we need to inherit from `WrapperCodegen` and `TritonScheduling` and implement the corresponding subclasses respectively. However, since `WrapperCodegen` and `TritonScheduling` have some device-bias code generation **scattered** in their methods, overriding them in subclasses would introduce a lot of duplicated parent class code.
For example:
2a44034895/torch/_inductor/codegen/wrapper.py (L487)

2a44034895/torch/_inductor/codegen/triton.py (L1996)

So we abstract the device-bias code scattered in WrapperCodegen and TritonScheduling and provide a unified interface, "DeviceOpOverrides". This way, when integrating a new backend, we can maximize the reuse of `WrapperCodegen` and `TritonScheduling` code by inheriting from and implementing this interface for device flexibility.

Currently `DeviceOpOverrides` only covers Python wrapper code generation. We can further extend it to cover C++ wrapper code generation on demand.
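
A rough sketch of the shape of such an interface (method names here are illustrative, not necessarily the exact ones added by the PR):

```python
class DeviceOpOverrides:
    """Collects the device-specific snippets the Python wrapper codegen emits."""

    def import_get_raw_stream_as(self, name: str) -> str:
        raise NotImplementedError

    def device_guard(self, device_idx: int) -> str:
        raise NotImplementedError

class CUDADeviceOpOverrides(DeviceOpOverrides):
    def import_get_raw_stream_as(self, name: str) -> str:
        return f"from torch._C import _cuda_getCurrentRawStream as {name}"

    def device_guard(self, device_idx: int) -> str:
        return f"torch.cuda._DeviceGuard({device_idx})"

# A hypothetical Intel GPU backend would only implement this interface,
# reusing WrapperCodegen/TritonScheduling unchanged.
```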

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116020
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-12-22 08:42:51 +00:00
7d0ad6e870 Make native c10d_functional ops work with AOTInductor (#113735)
Summary:
- Revised `c10d_functional` ops to conform to https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/native#func
- Modified `get_cpp_op_schema()` to handle mutable args and aliasing returns

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113735
Approved by: https://github.com/desertfire
ghstack dependencies: #113438
2023-12-22 08:12:13 +00:00
718b576e2c Port all_to_all_single to native c10d_functional (#113438)
Summary:
- Ported `all_to_all_single` to native c10d_functional
- Added Inductor support for the native `all_to_all_single` via the new collective IR's `create_out_of_place()`
- Since the new collective IR derives from `FallbackKernel` which implements a generic `free_unbacked_symbols`, no additional unbacked symbol handling for all_to_all_single is required

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113438
Approved by: https://github.com/yf225, https://github.com/ezyang
2023-12-22 08:12:13 +00:00
cb489e769c Delete unused API (#116225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116225
Approved by: https://github.com/angelayi
2023-12-22 06:38:47 +00:00
b6473065c6 [AMD] Fix build for intra_node_comm (#116291)
Summary: amd build is broken

Test Plan:
```
buck-out/v2/gen/fbcode/75c2b50d9f8b18d8/caffe2/__fb_libtorch_hipify_gen_eqsb_torch/csrc/distributed/c10d/intra_node_comm.hip__/out/torch/csrc/distributed/c10d/intra_node_comm.hip:37:1: error: non-void function does not return a value [-Werror,-Wreturn-type]
}
^
1 error generated when compiling for gfx90a.
```

Now it's gone

Differential Revision: D52373348

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116291
Approved by: https://github.com/yifuwang
2023-12-22 05:51:50 +00:00
b342286646 adds async save, makes checkpointer private (#116293)
Adds Async Save and also makes `Checkpointer` classes private.

The original PR was here: https://github.com/pytorch/pytorch/pull/115864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116293
Approved by: https://github.com/fegin
2023-12-22 05:22:39 +00:00
ad3c0b2c00 [torch.export] fixes for unlifting lifted tensor constants (#116266)
Summary: lifted tensor constants were not being treated the same way as named buffers when unlifting, i.e. they were not getting the name correction that converts "." in FQNs to "_" to form valid attribute names. Additionally, future torchbind object support will allow objects to be registered, so only call register_buffer for lifted constants if the value is a tensor.
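
A minimal sketch of the two behaviors described above (hypothetical names, not the export/unlift code):

```python
import torch

mod = torch.nn.Module()
fqn, value = "sub.block.scale", torch.ones(3)   # a lifted tensor constant

# Attribute names cannot contain ".", so the FQN is corrected first.
attr_name = fqn.replace(".", "_")

# Only register tensors as buffers; future torchbind objects would be
# attached through a different mechanism.
if isinstance(value, torch.Tensor):
    mod.register_buffer(attr_name, value)

print(hasattr(mod, "sub_block_scale"))  # True
```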

Differential Revision: D52367846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116266
Approved by: https://github.com/angelayi
2023-12-22 04:46:25 +00:00
cyy
764b4cd44e Remove outdated string function wrapper for Android and Caffe2 (#116186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116186
Approved by: https://github.com/janeyx99
2023-12-22 04:31:56 +00:00
b47aa69685 [c10d] Fix the hang issue in store.check(TIMEOUT_DUMP) (#116297)
Summary:
We have found that the root cause of the hang is NOT the destruction of stores. The hang in check() only happens when the store is of type FileStore.

The file held by each FileStore was a temp file created by Python's tempfile module; by default it was deleted when the file was closed.

Note that the file was opened and closed by every check() in the watchdog and in the constructor of FileStore.

Then, when check() tried to open the deleted file again, open() would fail after the timeout value (5 minutes by default), hence the hang.

The fix is simple: just avoid the default deletion after the file is closed.
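
A minimal sketch of the tempfile behavior behind the hang (standard-library only; not the c10d code):

```python
import os
import tempfile

# delete=True (the default) removes the file as soon as it is closed,
# so a later re-open -- like FileStore's check() -- finds nothing.
f = tempfile.NamedTemporaryFile()
f.close()
print(os.path.exists(f.name))  # False

# delete=False keeps the backing file around for later re-opens.
g = tempfile.NamedTemporaryFile(delete=False)
g.close()
print(os.path.exists(g.name))  # True
os.unlink(g.name)              # cleanup is now the caller's responsibility
```
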
Test Plan:

1. We first reproduce the hang in check() in the existing unit test:
   test_init_process_group_for_all_backends by enabling the
   DumpOnTimeOut and making the main thread sleep for 2s, to give enough time for tempfile
   to be deleted
2. Adding log to check ref count of fileStore and also the sequence of
   file opening and closing
3. With the repro, an exception will be thrown as "no such file or
   directory' and unit test would fail
4. Verify the tests now passes with the above knob change
5. add an unit test in test_c10d_nccl to cover the fileStore check() code path
python test/distributed/test_c10d_common.py ProcessGroupWithDispatchedCollectivesTests
python test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_file_store_check
Reviewers:

Subscribers:

Tasks:
T173200093
Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116297
Approved by: https://github.com/fduwjj
ghstack dependencies: #116296
2023-12-22 04:04:30 +00:00
94f3781145 Fixed bug with unpickling integers > 64-bits (#115264)
Fixes #115234

Currently, the unpickling code does not support integers larger than 64 bits in size. However, these are part of what the Python unpickling code supports.

See `pickle.py` in CPython:
```
def decode_long(data):
    r"""Decode a long from a two's complement little-endian binary string.

    >>> decode_long(b'')
    0
    >>> decode_long(b"\xff\x00")
    255
    >>> decode_long(b"\xff\x7f")
    32767
    >>> decode_long(b"\x00\xff")
    -256
    >>> decode_long(b"\x00\x80")
    -32768
    >>> decode_long(b"\x80")
    -128
    >>> decode_long(b"\x7f")
    127
    """
    return int.from_bytes(data, byteorder='little', signed=True)
```

E.g.:
```
>>> int.from_bytes(bytearray(b'\xff\xff\xff\xff\xff\xff\xff\xff\x00'), byteorder='little', signed=True)
18446744073709551615
```

This PR makes it so that integers of arbitrary size are supported with JS BigNums.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115264
Approved by: https://github.com/zdevito
2023-12-22 03:17:34 +00:00
9736deae76 [vision hash update] update the pinned vision hash (#109957)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109957
Approved by: https://github.com/pytorchbot
2023-12-22 03:12:23 +00:00
db25462ffd [quant][pt2e] Relax constraints on dtype and qscheme to allow for customizations (#116287)
Summary:
att

Test Plan:
CI

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116287
Approved by: https://github.com/kimishpatel
2023-12-22 03:12:04 +00:00
fdf8718225 Update reviewes for PyTorch Distributed (#116296)
Summary:
Add shuqiangzhang as a reviewer
Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116296
Approved by: https://github.com/fduwjj
2023-12-22 02:49:51 +00:00
4b97ed2ed8 [SparseCompressed] support csc layout for add sparse/dense. (#115433)
`add`, when passed one sparse and one dense argument, will error if the sparse argument does not have csr layout. This PR modifies the underlying algorithm to be generic over the compressed dimension, handling both csr and csc. The functions are renamed to use the `sparse_compressed` qualifier rather than `sparse_csr`.

Fixes: #114807
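
A minimal usage sketch of the behavior this enables (illustrative only, not taken from the PR's tests):

```python
import torch

dense = torch.ones(3, 3)
sparse_csc = torch.eye(3).to_sparse_csc()

# Previously only a CSR-layout sparse operand was accepted here.
out = torch.add(dense, sparse_csc)
# The sparse-on-the-LHS pattern is handled by permuting arguments (see #115432).
out_swapped = torch.add(sparse_csc, dense)
```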

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115433
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
ghstack dependencies: #115432
2023-12-22 01:47:55 +00:00
910baa3a03 [SparseCompressed] Support add(sparse_compressed, dense) (#115432)
Addition involving sparse compressed and dense arguments is implemented
with the requirement that the dense tensor be on the LHS. This change adds support
for the other pattern, `sparse + dense`, by permuting arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115432
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-12-22 01:47:55 +00:00
suo
d2d129de65 [sigmoid] replace unflatten with upstream version (#115468)
as title

Differential Revision: [D52000213](https://our.internmc.facebook.com/intern/diff/D52000213/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115468
Approved by: https://github.com/zhxchen17
2023-12-22 00:56:19 +00:00
127cae7ec8 [C10D] Increase TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC (#116267)
Change default from 2 min to 10 min.

Why? Many cases of heartbeat timeout were reported, but increasing
timeout led to the same job hanging in a different place, suggesting
heartbeat kill was working well and not a false positive.  However, some
others reported jobs running fine with increased timeouts.  One such
case was investigated below, and suggests that indeed a 2 min timeout is
too aggressive.  While we have not fully root caused the issue, it
is better to avoid killing jobs that would otherwise complete.

Current theory is that watchdog is not totally deadlocked, but is slowed
down in its processing of work objs due to some intermittent resource
contention.  Hence, allowing more time is more of a workaround than a
fix.

Debug/Analysis:
https://docs.google.com/document/d/1NMNWoTB86ZpP9bqYLZ_EVA9byOlEfxw0wynMVEMlXwM

Differential Revision: [D52368791](https://our.internmc.facebook.com/intern/diff/D52368791)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116267
Approved by: https://github.com/fduwjj
2023-12-22 00:47:45 +00:00
d6de2df6b6 Improve the error message when a PR lacks the necessary approvals (#116161)
The error message from https://github.com/pytorch/pytorch/pull/115329#issuecomment-1857135047 is pretty confusing because it lists some random `pytorch/metamates` folks from `superuser` merge rule.  My attempt here is to make the error message clearer by pointing out:

* All the matching merge rules and
* Their list of approvers

The message will now become:

```
Approvers from one of the follow rules are needed:
- Core Reviewers (1, 2, 3, 4, 5, ...)
- Core Maintainers (1, 2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116161
Approved by: https://github.com/malfet, https://github.com/PaliC, https://github.com/atalman, https://github.com/ZainRizvi
2023-12-22 00:22:43 +00:00
99f7e721fe [inductor] make inductor work with new triton compile interface (#115878)
Recent 2 triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface for triton.compile, this PR added the necessary change on inductor side to work with both old and new compile API.

Also there is some simplification between compilation call in subprocess and the one in main process
- previously we passed warm_cache_only=True if the compilation happened in a subprocess. But triton never uses that argument in the currently pinned version, so I removed it.
- previously we only passed compute_capability if compilation happened in a subprocess. The PR changes that to always pass compute_capability to triton.compile, no matter whether the compilation happens in the main or a sub process.

Updated:
There are more interface change from triton side. E.g.
- tl.math.{min, max} now require a propagate_nan argument
- JITFunction.run now requires a warmup argument. This affects the benchmarking phase of matmul max-autotune; on the other hand, JITFunction.run now forbids the stream argument. Simply not passing it when benchmarking the matmul triton kernel works for both the old and new versions of triton.
- triton Autotuner renamed its attributes from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This caused dynamo to fail handling the triton Autotuner object, since dynamo's TritonKernelVariable makes assumptions about attribute names. It's exercised in some test cases where a model calls the triton Autotuner directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
2023-12-22 00:09:29 +00:00
247f9c3de4 Preserve strides of custom Triton kernel args (#116219)
Summary: Currently, we [`clone`](19207b9183/torch/_inductor/lowering.py (L5273)) every `TensorBox` argument of custom Triton kernels while lowering them to the Inductor IR, during which the stride information of the kernel inputs is lost. This is problematic in the common case when the strides of a `torch.Tensor` argument are passed as scalars to a custom Triton kernel alongside the tensor itself (due to the underlying Triton code interpreting the tensors as raw pointers, so the contained stride semantics of the `torch.Tensor` is lost).

In this PR, we add an extended version of the existing [`clone` lowering](19207b9183/torch/_inductor/lowering.py (L2289))---`clone_preserve_reinterpret_view`---which carries over the `ir.ReinterpretVew` layers (if any) from the source `TensorBox` to the cloned one. The rationale behind adding a new function (and switching to it in the `triton_kernel_wrap` only for now) as opposed to extending the existing `clone` is keeping the semantics of the latter untouched, as it is a lowering of `torch.clone` (albeit incomplete, as the `memory_format` is currently ignored). Changing the existing `clone` would change the semantics which is not necessarily desirable in general. Open to suggestions, though.
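
A hypothetical illustration (not from this PR) of why the strides matter: a custom Triton kernel that receives a tensor pointer plus its strides as plain integers, so a contiguity-changing clone of the tensor would silently desynchronize the two (requires a CUDA device and triton).

```python
import torch
import triton
import triton.language as tl

@triton.jit
def copy_strided(x_ptr, out_ptr, stride_x0, stride_x1, N: tl.constexpr):
    # Each program copies one row of a strided 2D input into a contiguous output.
    row = tl.program_id(0)
    cols = tl.arange(0, N)
    vals = tl.load(x_ptr + row * stride_x0 + cols * stride_x1)
    tl.store(out_ptr + row * N + cols, vals)

x = torch.randn(4, 8, device="cuda").t().contiguous().t()  # non-contiguous, strides (1, 4)
out = torch.empty(4, 8, device="cuda")
copy_strided[(4,)](x, out, x.stride(0), x.stride(1), N=8)
```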

Test Plan:

```
$ python test/dynamo/test_functions.py -k test_triton_kernel_strided_input
...
----------------------------------------------------------------------
Ran 1 test in 5.568s

OK
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116219
Approved by: https://github.com/jansel
2023-12-21 22:46:32 +00:00
a27ed4d364 [dynamo / DDP] Add optimize_ddp_lazy_compile config to control lazy compile for DDPOptimizer (False by default) (#116292)
We want to enable `optimize_ddp_lazy_compile` by default as soon as possible, because it will fix stride mismatch errors (see motivation: https://github.com/pytorch/pytorch/pull/114154).

However, lazy compile currently causes shape mismatch in other cases (`test_graph_split_inductor_transpose`) and we need to fix them before we can enable it by default.
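
A hedged usage sketch (the config module path is assumed from similar dynamo knobs; only the flag name comes from this commit):

```python
import torch._dynamo.config as dynamo_config

# Opt in to lazy compilation in DDPOptimizer; defaults to False per this PR.
dynamo_config.optimize_ddp_lazy_compile = True
```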

Differential Revision: D52373445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116292
Approved by: https://github.com/williamwen42, https://github.com/wconstab
2023-12-21 22:34:24 +00:00
1e834e0e50 Fix bug in mem_eff kernel with attention mask and MQA (#116234)
# Summary

Found using the repros mentioned in this issue: #112577

After many go rounds with compute-sanitizer and eventual printf debugging I feel pretty confident that this was the underlying issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116234
Approved by: https://github.com/malfet, https://github.com/danthe3rd, https://github.com/atalman
2023-12-21 21:52:21 +00:00
52f0457d7d Support view returns for functional inverses on narrowing views (#115893)
Part 1 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).

The following functional inverses are currently implemented scatter-style and thus never return views:
* `as_strided_copy_inverse()`
* `diagonal_copy_inverse()`
* `expand_copy_inverse()`
* `select_copy_int_inverse()`
* `slice_copy_Tensor_inverse()`
* `split_copy_Tensor_inverse()`
* `split_with_sizes_copy_inverse()`
* `unbind_copy_int_inverse()`
* `unfold_copy_inverse()`

We need to get actual views for the introduction of reverse view funcs coming next.

Details:
* Use `as_strided()` to implement actual view inverses for the above
    * Assumes we're given a mutated_view that is actually part of a bigger storage; this isn't really the case for functionalization
* Introduce `InverseReturnMode` enum for customization of functional inverses
    * `AlwaysView` - always return an actual view; needed for reverse view_funcs()
    * `NeverView` - always do a copy; useful for certain functionalization use cases (e.g. XLA, executorch)
    * `ViewOrScatterInverse` - return an actual view in most cases, but prefer scatter inverses when they exist. this avoids the need to implement `as_strided()` for subclasses, which can be difficult or impossible
* Make sure functionalization works as before
    * Use `ViewOrScatterInverse` when reapply_views TLS is True or `NeverView` otherwise
    * Adds tests to ensure old behavior for above inverses **in functionalization**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115893
Approved by: https://github.com/bdhirsh
2023-12-21 21:39:22 +00:00
suo
b5c866db13 [export] Add FlatArgsAdapter to unflatten (#115467)
This is the final divergence between our internal/external unflatteners.

Differential Revision: [D52001135](https://our.internmc.facebook.com/intern/diff/D52001135/)

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115467
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115466, #115795
2023-12-21 20:52:36 +00:00
suo
01ec3d1113 [export] upstream some final fixes to OSS unflatten (#115795)
as title

Differential Revision: [D52141387](https://our.internmc.facebook.com/intern/diff/D52141387/)

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115795
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115466
2023-12-21 20:52:36 +00:00
suo
bc3ef1684e [export] refactor unflatten.py to be a top-level API (#115466)
This is in preparation for the merging of the internal and external versions of
the unflattener. Unflatten needs to be its own API because we are adding more
options to it in forthcoming diffs.

Differential Revision: [D52001133](https://our.internmc.facebook.com/intern/diff/D52001133/)

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115466
Approved by: https://github.com/zhxchen17
2023-12-21 20:52:29 +00:00
497777e302 Revert "Mark set_ as an inplace view op (#115769)"
This reverts commit cd449e260c830c9ce0f06ed4833b46aa638f1529.

Reverted https://github.com/pytorch/pytorch/pull/115769 on behalf of https://github.com/jeanschmidt due to breaking landing signals internally, more details on the diff, author is tagged ([comment](https://github.com/pytorch/pytorch/pull/115769#issuecomment-1866846607))
2023-12-21 19:53:32 +00:00
0e63837ec7 [dynamo] Skip some tests using scipy.kstest (#116263)
These tests are failing in CI with this error
```
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/torch/_dynamo/variables/builder.py", line 1126, in wrap_numpy_ndarray
    value.flags.writeable = True
    ^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.InternalTorchDynamoError: cannot set WRITEABLE flag to True of this array
```

And it may be related to a `SIGKILL` exception being raised shortly after the
failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116263
Approved by: https://github.com/lezcano
2023-12-21 18:08:29 +00:00
199b04fdbd Back out "Implement pass-through state_dict and load_state_dict for dynamo OptimizedModule (#113423)" (#116243)
Summary:
Original commit changeset: 2a9588cfd51b

Original Phabricator Diff: D52062368

Test Plan: In investigating S386328 and S382826, we found checkpoint loading succeed after backout D52062368: S386328_backout_1220_193648

Differential Revision: D52356011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116243
Approved by: https://github.com/voznesenskym
2023-12-21 17:57:05 +00:00
ed03834693 Revert "Expose functional IR to capture_pre_autograd (#115210)"
This reverts commit 4b59b4dffba633f638f3d7ccffff2abc2e53f25e.

Reverted https://github.com/pytorch/pytorch/pull/115210 on behalf of https://github.com/malfet due to This should fix test_export_constraints_error_non_strict failures, see https://github.com/pytorch/pytorch/issues/116273 ([comment](https://github.com/pytorch/pytorch/pull/115210#issuecomment-1866706302))
2023-12-21 17:49:43 +00:00
a357a0f315 Back out "[Kineto] Initialize libkineto profilers during torch init process during pybind set-up (#112623)" (#116201)
Summary:
This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116201
Approved by: https://github.com/xuzhao9
2023-12-21 16:32:19 +00:00
ff4aac109a [BE][Easy]: Enable clang-tidy check readability-misplaced-array-index (#116210)
Enable clang-tidy check readability which checks for a bizarre C++ construct that is usually indicative of an error: https://clang.llvm.org/extra/clang-tidy/checks/readability/misplaced-array-index.html (indexing a number by a pointer, which surprisingly inverts the operands).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116210
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-21 15:09:10 +00:00
cc2c2c6ca9 [Easy][BE]: Enable clang-tidy check for duplicate includes (#116193)
Adds a clang-tidy check to flag duplicate include files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116193
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-21 14:58:12 +00:00
2dce364634 [AOTI][refactor] Remove model_container_runner_cuda.cpp (#116113)
Differential Revision: [D52301272](https://our.internmc.facebook.com/intern/diff/D52301272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116113
Approved by: https://github.com/khabinov
ghstack dependencies: #116047
2023-12-21 14:56:25 +00:00
f71d302c63 Revert "[Easy][BE]: Enable clang-tidy check for duplicate includes (#116193)"
This reverts commit 71cb13869b4eced76589f47e26bd64cdc2d54aa2.

Reverted https://github.com/pytorch/pytorch/pull/116193 on behalf of https://github.com/jeanschmidt due to Breaking internal test (bolt_nn_espresso_operator_test_eureka-scheduler) and job (build-rdk-diff-windows-debug-cuda11) @malfet and @albanD, please help the author get this PR merged by providing more information ([comment](https://github.com/pytorch/pytorch/pull/116193#issuecomment-1866391726))
2023-12-21 14:43:07 +00:00
348cb2f8f9 Revert "[BE][Easy]: Enable clang-tidy check readability-misplaced-array-index (#116210)"
This reverts commit 5d5ef016a622c8259b328e8b6f8fa7ffcf3c80dc.

Reverted https://github.com/pytorch/pytorch/pull/116210 on behalf of https://github.com/jeanschmidt due to unfortunately, It is required to revert this PR in order to properly revert https://github.com/pytorch/pytorch/pull/116193 ([comment](https://github.com/pytorch/pytorch/pull/116210#issuecomment-1866380974))
2023-12-21 14:37:41 +00:00
ec6c4fed3f Revert "Support nn_module_stack in torch.export(strict=False) (#115454)"
This reverts commit 6730b5bcb41e0519572759d9ad9852a113d0a7e4.

Reverted https://github.com/pytorch/pytorch/pull/115454 on behalf of https://github.com/jeanschmidt due to Breaking internal tests recycle_bin_citadel and executorch, check internal diff to see more details ([comment](https://github.com/pytorch/pytorch/pull/115454#issuecomment-1866315233))
2023-12-21 14:05:43 +00:00
0567f71ac6 Revert " pre_dispatch aot_export (#115188)"
This reverts commit a267d6735051a4714fa2ac1c163315b650118744.

Reverted https://github.com/pytorch/pytorch/pull/115188 on behalf of https://github.com/jeanschmidt due to sadly, it is required to revert this commit in order to revert https://github.com/pytorch/pytorch/pull/115454 ([comment](https://github.com/pytorch/pytorch/pull/115188#issuecomment-1866310014))
2023-12-21 14:03:18 +00:00
f170d6665c [DCP] Add a profiler function for benchmarking save and load (#116007)
Many operations when calling DCP's save and load are executed on CPU. Thus we can easily profile these operations with cProfile. This PR adds the ability to profile the save() and load()

One follow-up for this PR is to integrate the feature with the distributed logging flags.

Differential Revision: [D52245434](https://our.internmc.facebook.com/intern/diff/D52245434/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116007
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #116006
2023-12-21 08:03:07 +00:00
a548ff40de [DCP][BE] Remove unused function (#116006)
As title

Differential Revision: [D52245433](https://our.internmc.facebook.com/intern/diff/D52245433/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116006
Approved by: https://github.com/wz337
2023-12-21 07:20:08 +00:00
4b59b4dffb Expose functional IR to capture_pre_autograd (#115210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115210
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115188
2023-12-21 07:16:07 +00:00
8fd1963ae2 [dynamo][collective_op] Use the value of the wrappered attribute async_op in dynamo when checking supported or not (#115921)
I found that whether the attribute `async_op` in collective ops is explicitly set to `True` or `False` by users, it always leads to a graph break, because the argument `async_op` is wrapped as `ConstantVariable(bool)` in dynamo. So here we need to use its `value` for the check.
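
A hedged, single-process sketch (gloo backend, one rank) of the kind of user code affected; the endpoint and port are placeholders:

```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)

@torch.compile
def allreduce(t):
    # async_op is a plain Python bool; dynamo wraps it as ConstantVariable(bool),
    # so the support check has to inspect its .value.
    dist.all_reduce(t, async_op=False)
    return t

print(allreduce(torch.ones(4)))
dist.destroy_process_group()
```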

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115921
Approved by: https://github.com/jansel, https://github.com/wconstab
2023-12-21 03:27:57 +00:00
74e8cfc9a0 Forward fix torch package bug - dont depend on dynam in fsdp directly (#116229)
Differential Revision: [D52350752](https://our.internmc.facebook.com/intern/diff/D52350752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116229
Approved by: https://github.com/janeyx99, https://github.com/zou3519
2023-12-21 03:10:22 +00:00
db35ccf463 Revert "[innductor] make inductor work with new triton compile interface (#115878)"
This reverts commit bbded928b3556cf5678edf8fa41109d418312bcc.

Reverted https://github.com/pytorch/pytorch/pull/115878 on behalf of https://github.com/kit1980 due to Broke ROCm https://github.com/pytorch/pytorch/actions/runs/7282149837/job/19844618618 ([comment](https://github.com/pytorch/pytorch/pull/115878#issuecomment-1865369349))
2023-12-21 02:00:17 +00:00
65d3dde665 Fix allowed dtypes for mem_eff attention (#116026)
# Summary

Fix issue bug in detecting mem eff capability for cuda devices less than sm80:
https://github.com/pytorch-labs/gpt-fast/issues/49

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116026
Approved by: https://github.com/janeyx99
2023-12-21 01:56:38 +00:00
c1d960aadd [Quant] [Inductor] add input shape check for quantized conv binary lowering (#115247)
Add an input shape check for quantized conv binary lowering, since qconv2d_pointwise.binary does not yet support the case of broadcast-shaped inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115247
Approved by: https://github.com/leslie-fang-intel, https://github.com/eellison
2023-12-21 01:36:49 +00:00
be9de33240 [Dynamo][9/N] Make SkipFilesVariable wrap functions only (#115963)
Make ```SkipFilesVariable``` handle only the function type, and route skipped classes to ```UserDefinedClassVariable```. The reasons behind this are:
* We'd like to remove ```is_allowed```, so the allowed/disallowed torch classes need a proper place to be handled. We can put them in either ```SkipFilesVariable``` or ```UserDefinedClassVariable``` under the current architecture, but it's confusing to have two places doing one thing.
   - Going forward, let's make ```SkipFilesVariable``` only handle functions, and I'll probably rename it to ```SkippedFunctionVariable``` in the following PRs.
   - Let's dispatch by the value's type; all torch classes stuff would go to ```UserDefinedClassVariable``` in the next PR.
* We'd like to merge the in_graph/skip/inline trace decision into the same API ```trace_rule.lookup```, so we probably have to limit the input to functions only to better organize the ```VariableBuilder._wrap``` logic.
   - As a next step, I'll merge ```skipfiles.check``` into ```trace_rules.lookup```, and do the skipfile check before wrapping values into the correct variable tracker.
   - Though ```TorchCtxManagerClassVariable``` is decided by ```trace_rules.lookup```, I'll refactor it out in the following PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115963
Approved by: https://github.com/jansel
2023-12-21 01:35:07 +00:00
a734085a63 [ONNX][Dort] Fix bug preventing running with OrtValueVector (#116124)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116124
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
ghstack dependencies: #115945
2023-12-21 01:20:46 +00:00
259b0af367 [ONNX] Add copy before export for perf bench to avoid mutating base model (#115945)
Otherwise the base model might be mutated, which affects the measured performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115945
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
2023-12-21 01:20:46 +00:00
feafbcf437 [AOTI][refactor] Refactor model runner API (#116047)
Summary: 1) make the proxy executor a private member; 2) use std::string instead of char*

Differential Revision: [D52301106](https://our.internmc.facebook.com/intern/diff/D52301106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116047
Approved by: https://github.com/khabinov
2023-12-21 01:05:37 +00:00
9502fa8d84 add a transformer suite in TP/SP tests (#115530)
This is to address issue #115309.

Test plan
`python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_False`
`python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_True`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115530
Approved by: https://github.com/wanchaol
2023-12-21 01:04:36 +00:00
7ca6e0d38f [EZ] Add CUSPARSELT to build variables (#116213)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116213
Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/atalman
ghstack dependencies: #116212
2023-12-21 01:02:11 +00:00
74119a3482 [EZ] Fix typo in USE_GLOO var (#116212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116212
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-12-21 01:02:11 +00:00
f206e31e2f Swap slots if slots match in swap_tensor (#116128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116128
Approved by: https://github.com/albanD
2023-12-21 00:43:30 +00:00
8aae46f843 [ROCm] fix nightly 5.6 build (#116029)
The ROCm 5.6 nightly wheel build was broken by #114329. This fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116029
Approved by: https://github.com/huydhn, https://github.com/jithunnair-amd, https://github.com/atalman
2023-12-21 00:22:42 +00:00
be90b757d9 Enable compiled Adam in the benchmarks (#116093)
Commit b697bcc583 of mlazos/compiled-adam2 at https://hud.pytorch.org/benchmark/compilers
is an initial benchmark run

Increases compile time by 20s for torchbench and HF, and 30s for TIMM

I expect the compile time to come down significantly with fake tensor prop caching

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116093
Approved by: https://github.com/janeyx99
2023-12-21 00:17:36 +00:00
bbded928b3 [innductor] make inductor work with new triton compile interface (#115878)
Recent 2 triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface for triton.compile, this PR added the necessary change on inductor side to work with both old and new compile API.

Also there is some simplification between compilation call in subprocess and the one in main process
- previously we passed warm_cache_only=True if the compilation happened in a subprocess. But triton never uses that argument in the currently pinned version, so I removed it.
- previously we only passed compute_capability if compilation happened in a subprocess. The PR changes that to always pass compute_capability to triton.compile, no matter whether the compilation happens in the main or a sub process.

Updated:
There are more interface change from triton side. E.g.
- tl.math.{min, max} now require a propagate_nan argument
- JITFunction.run now requires a warmup argument. This affects the benchmarking phase of matmul max-autotune; on the other hand, JITFunction.run now forbids the stream argument. Simply not passing it when benchmarking the matmul triton kernel works for both the old and new versions of triton.
- triton Autotuner renamed its attributes from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This caused dynamo to fail handling the triton Autotuner object, since dynamo's TritonKernelVariable makes assumptions about attribute names. It's exercised in some test cases where a model calls the triton Autotuner directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
2023-12-21 00:03:38 +00:00
5d5ef016a6 [BE][Easy]: Enable clang-tidy check readability-misplaced-array-index (#116210)
Enable clang-tidy check readability which checks for a bizarre C++ construct that is usually indicative of an error: https://clang.llvm.org/extra/clang-tidy/checks/readability/misplaced-array-index.html (indexing a number by a pointer, which surprisingly inverts the operands).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116210
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-21 00:00:20 +00:00
897600eb35 [inductor] Some tests have both CPU and CUDA variants running with CPU tensors (#116131)
I don't think that's intended.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116131
Approved by: https://github.com/jansel
2023-12-21 00:00:15 +00:00
7c7208a9e7 Forward fix to remove xfails for vmap NT tests in Dynamo (#116216)
Resolves land race between #116111 and #114523.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116216
Approved by: https://github.com/kit1980
2023-12-20 22:55:08 +00:00
edf1ea622d Move step is noop tests (#115299)
As stated. I do notice there is perhaps opportunity to abstract, but the tests as written are also super understandable and more abstraction might not be desirable.

This PR _increases coverage_. The original tests each tested 12 default configs (leaving out Rprop). Now the tests cover ~80 configs, and then foreach + fused on top of that! Test time basically increases over 10-fold, but this test is tiny so we are not worried:

Old:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 0.028s

OK
```

New (includes the old test):
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........................
----------------------------------------------------------------------
Ran 27 tests in 0.456s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115299
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023, #115025
2023-12-20 22:49:44 +00:00
8f3a0594e9 Move tests depending on listed configs to OptimizerInfo (#115025)
Removing 4 tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py -v -k test_fused_optimizers_with_large_tensors -k test_fused_optimizers_with_varying_tensors -k test_multi_tensor_optimizers_with_large_tensors -k test_multi_tensor_optimizers_with_varying_tensors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_fused_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok

----------------------------------------------------------------------
Ran 4 tests in 22.731s

OK
```

For the same 4 but more granular:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py  -v -k test_fused_large_tensor -k test_fused_mixed_device_dtype -k test_foreach_large_tensor -k test_foreach_mixed_device_dtype
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_large_tensor_ASGD_cpu_float16 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
....
test_fused_mixed_device_dtype_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_foreach_large_tensor_ASGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adadelta_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adagrad_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_NAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RMSprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Rprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_SGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_ASGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adadelta_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adagrad_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adamax_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_NAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RMSprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Rprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 50 tests in 50.785s

OK (skipped=25)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115025
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023
2023-12-20 22:49:44 +00:00
05d60931b3 Migrate test_peak_mem_multi_tensor_optimizers to OptimizerInfo (#115023)
Replace the following:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k test_peak_mem_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 38.599s

OK
```

with 11 tests (one for each foreach optim :))
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k TestOptimRenewedCUDA.test_foreach_memory
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........
----------------------------------------------------------------------
Ran 11 tests in 39.293s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115023
Approved by: https://github.com/albanD
ghstack dependencies: #114802
2023-12-20 22:49:44 +00:00
4fb92b591d [BE] remove redundant _test_derived_optimizers by migrating more to OptimizerInfo (#114802)
New tests look like:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py -v -k TestOptimRenewedCUDA.test_fused
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 2 tests in 34.591s

OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py
-v -k test_set_default_dtype_works_with_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_set_default_dtype_works_with_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
...
test_set_default_dtype_works_with_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 22 tests in 32.915s

OK (skipped=11)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114802
Approved by: https://github.com/albanD
2023-12-20 22:49:44 +00:00
0fae3dfef7 Add convenient things for Dynamo testing (#116173)
- added a way to easily add a skip
- added a way to easily turn markDynamoStrictTest on by default for a
  particular test file
- added an envvar to turn markDynamoStrictTest on by default
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116173
Approved by: https://github.com/voznesenskym
2023-12-20 22:49:26 +00:00
19207b9183 Allow more backend worker threads with each using a separate cuda stream (#116190)
Added a `--num_workers` option to `server.py` that allows more than 1 worker in the `ThreadPoolWorker` used for model predictions. Each worker uses its own `cuda.Stream()` that is created when the worker thread is initialized.

Ran benchmark for 2-4 workers with `compile=False` (since compile is not thread-safe)
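
A minimal sketch of the per-worker stream setup (assumed structure, not the benchmark's actual code; requires a CUDA device):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

import torch

_tls = threading.local()

def _init_worker():
    # Each worker thread gets its own CUDA stream when the pool spins it up.
    _tls.stream = torch.cuda.Stream()

def predict(model, batch):
    # Run this worker's predictions on its private stream.
    with torch.cuda.stream(_tls.stream):
        with torch.no_grad():
            out = model(batch)
        _tls.stream.synchronize()
    return out

executor = ThreadPoolExecutor(max_workers=4, initializer=_init_worker)
```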

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116190
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187, #116188, #116189
2023-12-20 22:08:29 +00:00
0dd64174bd Do H2D/D2H of input/result on separate threads/cuda.Streams (#116189)
Added two `ThreadPoolExecutor`s with 1 worker each for D2H and H2D copies. Each uses its own `cuda.Stream`. The purpose is to try to overlap D2H and H2D with compute and allow the worker handling prediction to launch compute kernels without being blocked by D2H/H2D.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116189
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187, #116188
2023-12-20 22:08:29 +00:00
3793ad6a7e Fix bugs in metrics calculation in inference benchmark and rerun baseline (#116188)
Before this PR, each `request_time` was separated by the time for a `torch.randn(...)` to create the fake `data` tensor on CPU. This meant that the gap between `request_times` **scaled with the batch_size**. So the latency comparisons across batch sizes were inaccurate. In this PR we generate all the fake data outside the loop to avoid this.

Other bug fixes:
- Only start polling GPU utilization after warmup event is complete
- Correct calculation of throughput: previously `(num_batches * batch_size) / sum(response_times)`; it should have been `(num_batches * batch_size) / (last_response_time - first_request_time)` (see the sketch after this list)
- Make sure that response sent back to frontend is on CPU
- Use a lock to ensure writing to `metrics_dict` in `metrics_thread` and `gpu_utilization_thread` in a thread-safe manner
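
A sketch of the corrected throughput computation described in the list above (variable names are assumed):

```python
def throughput(num_batches, batch_size, request_times, response_times):
    # Wall time from the first request to the last response, not a sum of latencies.
    elapsed = response_times[-1] - request_times[0]
    return (num_batches * batch_size) / elapsed
```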

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116188
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187
2023-12-20 22:08:22 +00:00
75a4b10d56 [easy] Add option for profiling backend in inference benchmark (#116187)
Some misc fixes, also added option for experiment name to add to result table

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116187
Approved by: https://github.com/albanD
ghstack dependencies: #115286
2023-12-20 22:08:11 +00:00
31f21e033e Run inference in an Executor (#115286)
Experiment: run model predictions in the backend in a ThreadPoolExecutor so that each model prediction does not block reading requests from the queue

The baseline is reset in the above PR, which fixes a lot of the metrics calculations, but I kept the metrics here anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115286
Approved by: https://github.com/albanD
2023-12-20 22:08:02 +00:00
b72127cd4b [inductor] Support sym exprs in lowering constant promotion (#116196)
Follow-up to https://github.com/pytorch/pytorch/pull/115920

This PR fixes the error with symbolic expression in aten.div:
```python
import torch
aten = torch.ops.aten

def func(x, a):
    return aten.div(x * 0.5, a, rounding_mode=None)

cfunc = torch.compile(func, dynamic=True, fullgraph=True)
device = "cpu"
x = 124
a = 33
out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message:
```
  File "/pytorch/torch/_inductor/graph.py", line 700, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/pytorch/torch/_inductor/lowering.py", line 293, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/pytorch/torch/_inductor/lowering.py", line 4823, in div_mode
    return div(a, b)
  File "/pytorch/torch/_inductor/lowering.py", line 293, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/pytorch/torch/_inductor/lowering.py", line 4857, in div
    a, b = promote_constants(
  File "/pytorch/torch/_inductor/lowering.py", line 368, in promote_constants
    ex = next(x for x in inputs if isinstance(x, (TensorBox, ExpandView)))
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: StopIteration:
  target: aten.div.Tensor_mode
  args[0]: 1.0*s0
  args[1]: s1
  kwargs: {'rounding_mode': None}

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116196
Approved by: https://github.com/peterbell10
2023-12-20 21:59:51 +00:00
a267d67350 pre_dispatch aot_export (#115188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115188
Approved by: https://github.com/bdhirsh
2023-12-20 21:36:25 +00:00
4afe2687d5 Reland "Serve multistream graph captures from correct pool (#114647)" (#116199)
Fixes a variable shadowing problem that broke internal builds.

This reverts commit fe156456194ed64bdf8b086d469b3643515a2baf.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116199
Approved by: https://github.com/eellison
2023-12-20 21:22:34 +00:00
199bacaf77 [Dynamo] Fix broken trunk and re-enable test_torch_name_rule_map_updated (#116146)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116146
Approved by: https://github.com/williamwen42
2023-12-20 21:22:29 +00:00
6e2c9be501 [Easy][BE]: Enable RUF008 and RUF016 checks (#116195)
Enables a few more static linting checks for mutable defaults in dataclasses and for detecting a common type error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116195
Approved by: https://github.com/malfet
2023-12-20 21:16:49 +00:00
bc0d8649a4 Fix missing dependency in torch.utils.tensorboard (#115598)
Fixes #114591

The version package was removed in pull request #114108 but is still used in `torch.utils.tensorboard`, causing import errors. The fix removes the import and uses a simpler check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115598
Approved by: https://github.com/malfet
2023-12-20 21:11:52 +00:00
1d5a9a1c1a [Easy][BE]: remove itertools.accumulate Python 2 shim and apply UFMT (#116192)
Removes an unnecessarily duplicated utility function and just relies on itertools. Since the file is low traffic, I also added the modified files to the UFMT'd files and formatted them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116192
Approved by: https://github.com/malfet
2023-12-20 20:36:59 +00:00
602abf6b55 [ROCm] more 6.0 changes (#115946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115946
Approved by: https://github.com/pruthvistony, https://github.com/huydhn, https://github.com/malfet
2023-12-20 20:19:29 +00:00
ea3a5f8ddc Add chunk for jagged layout NT (#115842)
Nice to have for the [SDPA tutorial](https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html)
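
A minimal usage sketch (shapes are illustrative; mirrors the kind of last-dim split used when separating fused q/k/v projections):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 6), torch.randn(3, 6)], layout=torch.jagged
)
a, b = torch.chunk(nt, 2, dim=-1)  # split the last, non-ragged dim into two pieces
```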
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115842
Approved by: https://github.com/soulitzer
ghstack dependencies: #115192, #116111
2023-12-20 20:13:20 +00:00
29b198dcf8 Add markDynamoStrictTest to NT tests (#116111)
Decorates all NT tests with `@markDynamoStrictTest` to ensure we get the correct signal. Adds xfails where needed to get things passing.

Includes a fix in meta_utils.py for a bug that was breaking several python 3.11 tests. In particular, a dense tensor graph input that is a view of a strided NT would slip past Dynamo's check and break in meta-ification.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116111
Approved by: https://github.com/soulitzer, https://github.com/zou3519
ghstack dependencies: #115192
2023-12-20 20:13:20 +00:00
f2c1fb3ee4 Fix crash in SymInt unary minus (#116160)
Before this change `-SymInt(std::numeric_limits<int64_t>::min()) == 0` would reliably crash with null pointer dereference, as `data_` of the SymInt returned by `operator-` would be `0x8000000000000000`, because of the carry/overflow flags set by `negq`.

Before the change x86_64 assembly generated for
4f02cc0670/c10/core/SymInt.cpp (L137)
looked as follows:
```
    0x7ffff7f2f490 <+115>: movq   %rax, %rdx
    0x7ffff7f2f493 <+118>: negq   %rdx
    0x7ffff7f2f496 <+121>: movq   %rdx, (%rbp)
    0x7ffff7f2f49a <+125>: movabsq $0x4000000000000000, %rdx ; imm = 0x4000000000000000
    0x7ffff7f2f4a4 <+135>: cmpq   %rdx, %rax
    0x7ffff7f2f4a7 <+138>: jle    0x7ffff7f2f520            ; <+259> at SymInt.cpp:141:1
```
`negq %rdx` corresponds to the unary minus, and `cmpq %rdx, %rax` against the `0x4000000000000000` constant is the inverted `check_range`
b6d0d0819a/c10/core/SymInt.h (L247-L249)
Flags raised by `negq` will affect the result of `cmpq`, and as a result the value would not be allocated on the heap, but rather left as `nullptr`.

Not sure if it's worth benchmarking, but perhaps using `__builtin_sub_overflow` would be faster, as it does not require an extra comparison and just guarantees that the overflow flag is cleared after the op.
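
A small Python illustration of the wraparound being guarded against (the real code is C++; ctypes is only used here to emulate 64-bit two's-complement arithmetic):

```python
import ctypes

INT64_MIN = -2**63
# ctypes does no overflow checking, so negating INT64_MIN wraps back to INT64_MIN,
# i.e. the 0x8000000000000000 bit pattern mentioned above.
negated = ctypes.c_int64(-INT64_MIN).value
assert negated == INT64_MIN
```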
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116160
Approved by: https://github.com/Skylion007, https://github.com/colesbury
2023-12-20 20:12:57 +00:00
f8ad664cf2 [export] Update range constraints to runtime_var_to_range (#115427)
Updated range_constraints to be the union of shape_env.var_to_range and shape_env.runtime_var_to_range, with shape_env.runtime_var_to_range taking priority.

Due to 0/1 specialization, if we bound an unbacked symint to be less than 5, the range of possible values for this symint is actually recorded as [2, 5] in shape_env.var_to_range. To fix this so that users will be able to see a more understandable range of [0, 5], shape_env.runtime_var_to_range was created to store the range of [0, 5]. Since range_constraints is a user-facing attribute to query the ranges of certain symints, we want to use shape_env.runtime_var_to_range to get the unbacked symints ranges, rather than shape_env.var_to_range.

Additionally, run_decompositions() has an issue where it will always add assertions to the graph, even if a previous run has already added the assertions. So, I added a part to the AddRuntimeAssertionsForInlineConstraints which will store which assertions have already been added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115427
Approved by: https://github.com/zhxchen17
2023-12-20 20:00:41 +00:00
1be6a070bc Add support for torch.cond in vmap (#114523)
Fixes: https://github.com/pytorch/pytorch/issues/114136

The patch enables conversion of a BatchedTensor into a FakeTensor and implements torch.cond vmap support using torch.where.
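
A hedged sketch of the behavior this enables (the import path is an assumption; in some builds cond is exposed elsewhere, e.g. as torch.cond):

```python
import torch
from functorch.experimental.control_flow import cond

def f(pred, x):
    return cond(pred, lambda t: t.sin(), lambda t: t.cos(), (x,))

preds = torch.tensor([True, False])  # batched predicate
xs = torch.randn(2, 3)
out = torch.vmap(f)(preds, xs)       # lowered via torch.where per the description
```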

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114523
Approved by: https://github.com/zou3519
2023-12-20 19:54:38 +00:00
06ae9b79ed [mtia] add module exporter to net minimizer (#115687)
Summary: add module exporter to net minimizer

Reviewed By: amylittleyang

Differential Revision: D52086699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115687
Approved by: https://github.com/jfix71
2023-12-20 19:36:23 +00:00
6de28e92d2 [BE]: Apply FURB118 (prev): replaces unnecessary lambdas with operator. (#116027)
This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027
Approved by: https://github.com/malfet
2023-12-20 19:35:08 +00:00
2d2016fdf8 WIP Add compatibility with channels_last_3d for conv3d (#114790)
Part of a multi-PR work to fix #59168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114790
Approved by: https://github.com/albanD
2023-12-20 19:28:25 +00:00
8bff59e41d [ROCm] add hipblaslt support (#114329)
Disabled by default. Enable with env var DISABLE_ADDMM_HIP_LT=0. Tested on both ROCm 5.7 and 6.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114329
Approved by: https://github.com/malfet
2023-12-20 19:09:25 +00:00
0b0b9b3275 [c10d][libuv] add partial read test for libuv backend and fix an error which only happens when partially reading a buffer (#116141)
**Test Plan**
1. build pytorch
2. execute `TORCH_CPP_LOG_LEVEL=INFO build/bin/TCPStoreTest --gtest_filter=TCPStoreTest.testLibUVPartialRead` from the pytorch root directory.

without the change:
<img width="761" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/1942e3c2-a9c1-4fe4-87e8-7e21f4d8f9aa">

with the change:
<img width="747" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/f3e96a5b-0ed1-49bd-9184-bb8a5ebebc33">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116141
Approved by: https://github.com/wconstab
2023-12-20 18:37:55 +00:00
ee5d981249 [BE]: Enable RUFF PERF402 and apply fixes (#115505)
* Enable PERF402. Makes code more efficient and succinct by removing useless list copies that could be accomplished either via a list constructor or extend call. All test cases have noqa added since performance is not as sensitive in that folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115505
Approved by: https://github.com/malfet
2023-12-20 18:01:24 +00:00
8837df1d71 [c10d] Expose check method to Python for store via pybind (#116144)
Differential Revision: [D52310987](https://our.internmc.facebook.com/intern/diff/D52310987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116144
Approved by: https://github.com/wconstab
2023-12-20 17:57:13 +00:00
71cb13869b [Easy][BE]: Enable clang-tidy check for duplicate includes (#116193)
Adds a clang-tidy check to flag duplicate include files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116193
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-20 17:56:21 +00:00
fe15645619 Revert "Serve multistream graph captures from correct pool (#114647)"
This reverts commit 8a445f7bd5bef43b30b61b20483d606c6e42e606.

Reverted https://github.com/pytorch/pytorch/pull/114647 on behalf of https://github.com/jeanschmidt due to breaking multiple internal build jobs, please check internal diff in order to obtain more details ([comment](https://github.com/pytorch/pytorch/pull/114647#issuecomment-1864840724))
2023-12-20 17:11:42 +00:00
ea7f2de6f3 [docker] Fix typo in docker-release workflow (#116191)
Fix copy-paste typo in docker-release workflow.  After https://github.com/pytorch/pytorch/pull/116097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116191
Approved by: https://github.com/malfet
2023-12-20 16:44:36 +00:00
16e539e0e6 Fix index range check (#116062)
Fixes an incorrect range check when index is `std::numeric_limits<int64_t>::min()`, as the result of the unary minus operation for such a value is undefined, but in practice equals the value itself; see https://godbolt.org/z/Wxhh44ocr

The lower bound check was `size >= -index`, which was incorrect if `index` is `INT64_MIN`; it is replaced with a check based on `-1 - index`, which for all int64_t values returns a result that also fits into the int64_t range. `- (index + 1)` is more readable and results in identical optimized assembly, see https://godbolt.org/z/3vcnMYf9a , but its intermediate result for `INT64_MAX` is outside of the `int64_t` range, which leads to similar problems as with `INT64_MIN` in the original example.
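
A Python rendering of the corrected bound check (illustrative only; the real code is C++, where negating INT64_MIN is undefined behavior):

```python
def index_in_range(index: int, size: int) -> bool:
    # Valid indices are [-size, size - 1].
    if index < 0:
        # Equivalent to size >= -index, but never negates INT64_MIN.
        return size - 1 >= -1 - index
    return index < size
```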

Added regression test.

Fixes https://github.com/pytorch/pytorch/issues/115415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116062
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-12-20 15:40:57 +00:00
fabf9433e7 [AOTI][refactor] Organize model runner files (#116022)
Summary: Move runner util files into a subdirectory and put AOTIModelContainerRunnerCpu into a separate file

Differential Revision: [D52300693](https://our.internmc.facebook.com/intern/diff/D52300693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116022
Approved by: https://github.com/khabinov
2023-12-20 15:35:34 +00:00
4d6a1ad400 Activation checkpoint and checkpoint_sequential errors if use_reentrant not passed explicitly (#115868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115868
Approved by: https://github.com/albanD
ghstack dependencies: #115438
2023-12-20 15:23:44 +00:00
cfb3cd11c1 Add basic autograd TORCH_LOGS support (#115438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115438
Approved by: https://github.com/albanD
2023-12-20 15:23:44 +00:00
cfbf647adb Add aten/src/ATen/native/quantized/cpu/ path to CPU quantization merge rule (#116145)
Observing following PR: https://github.com/pytorch/pytorch/pull/115329
Comment from author: https://github.com/pytorch/pytorch/pull/115329#issuecomment-1851339555

pytorchbot merge failed.
The reason is this logic: we expect all files in a PR to match one merge rule:
110339a310/.github/scripts/trymerge.py (L1310-L1324)

This should mitigate the issue; a follow-up PR will refactor this code to allow cross-rule matching of approvers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116145
Approved by: https://github.com/huydhn, https://github.com/kit1980, https://github.com/malfet
2023-12-20 14:43:15 +00:00
8eb7f6276b Ensure wrapping subclasses with as_subclass is supported (#116091)
As title
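
A minimal sketch of the pattern being supported (not the PR's actual test):

```python
import torch

class MyTensor(torch.Tensor):
    pass

@torch.compile
def wrap_and_add(x):
    # Wrapping into a Tensor subclass via as_subclass inside compiled code.
    return x.as_subclass(MyTensor) + 1

wrap_and_add(torch.randn(4))
```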

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116091
Approved by: https://github.com/pmeier, https://github.com/zou3519
2023-12-20 14:37:08 +00:00
c215e59bf2 Revert "[inductor] Avoid bool being upcast to int (#109913)"
This reverts commit 92998693a9455af6259cae468265f01cfff8810e.

Reverted https://github.com/pytorch/pytorch/pull/109913 on behalf of https://github.com/jeanschmidt due to causing performance regression in relevant metrics, @malfet I believe you are the correct person to help identify and fix the issues. For more details, check the internal OPS count for the ads metrics in the related internal diff ([comment](https://github.com/pytorch/pytorch/pull/109913#issuecomment-1864397407))
2023-12-20 12:33:50 +00:00
cyy
968b94bef2 [8/N] Fixes clang-tidy warnings in c10/{core,util}/*.h (#116082)
This patch enables clang-tidy coverage on c10/**/*.h and contains other fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116082
Approved by: https://github.com/Skylion007
2023-12-20 12:22:21 +00:00
d72d99e591 Fix sparse compressed tensor invariants checks when nnz==0 (#115826)
Fixes https://github.com/pytorch/pytorch/issues/115755

This PR is a step toward deprecating `torch.empty(..., layout=<sparse compressed tensor layout>)`, whose usage should be minimized as it will produce invalid tensors; see also https://github.com/pytorch/pytorch/issues/90695 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115826
Approved by: https://github.com/cpuhrsch, https://github.com/amjames
2023-12-20 12:16:07 +00:00
bdfabe5e7d Revert "[Dynamo][9/N] Make SkipFilesVariable wrap functions only (#115963)"
This reverts commit bb5a27052fa989f2365793c7ffe2d5a453aca31a.

Reverted https://github.com/pytorch/pytorch/pull/115963 on behalf of https://github.com/jeanschmidt due to causing significant performance regression, identified by number of ops in ads, please check internal diff ([comment](https://github.com/pytorch/pytorch/pull/115963#issuecomment-1864361697))
2023-12-20 12:06:55 +00:00
af8a50e656 Revert "Fix allowed dtypes for mem_eff attention (#116026)"
This reverts commit fc58909babcd07ea9652a1c1b3c2c7803f407a37.

Reverted https://github.com/pytorch/pytorch/pull/116026 on behalf of https://github.com/jeanschmidt due to breaking internal windows buck builds, check internal diff for more details ([comment](https://github.com/pytorch/pytorch/pull/116026#issuecomment-1864354665))
2023-12-20 12:01:34 +00:00
6e1ba79b7f [re-land] Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001) (#116125)
This is an attempt to re-land https://github.com/pytorch/pytorch/pull/114001. The previous attempt used `std::array` in cuda kernels which wasn't compatible with Meta's internal build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116125
Approved by: https://github.com/yf225
2023-12-20 07:13:50 +00:00
9df4ee8d38 Fix ColwiseParallel typo (#116151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116151
Approved by: https://github.com/wanchaol
2023-12-20 06:40:32 +00:00
545d2126f6 [pt-vulkan] Enable Python code blocks in shader templates and upgrade shader template generation (#115948)
Summary:
This change makes two major improvements to PyTorch Vulkan's shader authoring workflow.

## Review Guide

There are a lot of changed files because every GLSL shader had to be touched. Most of the changes consist of changing

```
#define PRECISION $precision
#define FORMAT $format
```

to

```
#define PRECISION ${PRECISION}
#define FORMAT ${FORMAT}
```

due to changes in how shader templates are processed.

For reviewers, the primary functional changes to review are:

* `gen_vulkan_spv.py`
  * Majority of functional changes are in this file, which controls how shader templates are processed.
* `shader_params.yaml`
  * controls how shader variants are generated

## Python Codeblocks in Shader Templates

From now on, every compute shader (i.e. `.glsl`) is treated as a shader template. To this effect, the `templates/` folder has been removed and there is now a global `shader_params.yaml` file to describe the shader variants that should be generated for all shader templates.

**Taking inspiration from XNNPACK's [`xngen` tool](https://github.com/google/XNNPACK/blob/master/tools/xngen.py), shader templates can now use Python codeblocks**.  One example is:

```
$if not INPLACE:
  layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict writeonly image3D uOutput;
  layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput;
  layout(set = 0, binding = 2) uniform PRECISION sampler3D uOther;
  layout(set = 0, binding = 3) uniform PRECISION restrict Block {
    ivec4 output_sizes;
    ivec4 input_sizes;
    ivec4 other_sizes;
    float alpha;
  }
  uArgs;
$else:
  layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict image3D uOutput;
  layout(set = 0, binding = 1) uniform PRECISION sampler3D uOther;
  layout(set = 0, binding = 2) uniform PRECISION restrict Block {
    ivec4 output_sizes;
    ivec4 other_sizes;
    float alpha;
  }
  uArgs;
```

Another is:

```
  // PYTHON CODEBLOCK
  $if not IS_DIV:
    const int c_index = (pos.z % ((uArgs.output_sizes.z + 3) / 4)) * 4;
    if (uArgs.other_sizes.z != 1 && c_index + 3 >= uArgs.output_sizes.z) {
      ivec4 c_ind = ivec4(c_index) + ivec4(0, 1, 2, 3);
      vec4 mask = vec4(lessThan(c_ind, ivec4(uArgs.output_sizes.z)));
      other_texel = other_texel * mask + vec4(1, 1, 1, 1) - mask;
    }

  // PYTHON CODEBLOCK
  $if not INPLACE:
    ivec3 input_pos =
        map_output_pos_to_input_pos(pos, uArgs.output_sizes, uArgs.input_sizes);
    const vec4 in_texel =
        load_texel(input_pos, uArgs.output_sizes, uArgs.input_sizes, uInput);

    imageStore(uOutput, pos, OP(in_texel, other_texel, uArgs.alpha));
  $else:
    const vec4 in_texel = imageLoad(uOutput, pos);
    imageStore(uOutput, pos, OP(in_texel, other_texel, uArgs.alpha));
```

In addition to making shader templates easier and clearer to write, this enables shaders that previously could not be consolidated, such as the non-inplace and inplace variants of the same shader, to now be represented using a single template.

## `generate_variant_forall` in shader variant YAML configuration

YAML files that describe how shader variants should be generated can now use a `generate_variant_forall` field to iterate over various settings for a specific parameter for each variant defined. Example:

```
unary_op:
  parameter_names_with_default_values:
    OPERATOR: exp(X)
    INPLACE: 0
  generate_variant_forall:
    INPLACE:
      - VALUE: 0
        SUFFIX: ""
      - VALUE: 1
        SUFFIX: "inplace"
  shader_variants:
    - NAME: exp
      OPERATOR: exp(X)
    - NAME: sqrt
      OPERATOR: sqrt(X)
    - NAME: log
      OPERATOR: log(X)
```

Previously, the `inplace` variants would need to have separate `shader_variants` entries. If there are multiple variables that need to be iterated across, then all possible combinations will be generated. Would be good to take a look to see how the new YAML configuration works.

Test Plan:
There is no functional change to this diff; we only need to make sure that the generated shaders are still correct. Therefore, we only need to run `vulkan_api_test`.

```
# On Mac Laptop
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*"
```

Reviewed By: digantdesai

Differential Revision: D52087084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115948
Approved by: https://github.com/manuelcandales
2023-12-20 05:47:33 +00:00
9766781512 Skip some flaky Dynamo tests (#116165)
The goal right now is to get the Dynamo CI back to green.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116165
Approved by: https://github.com/drisspg, https://github.com/aakhundov, https://github.com/huydhn, https://github.com/khabinov
2023-12-20 05:05:02 +00:00
3747aca49a [C10D] Make all PGNCCL LOG usages use logPrefix() (#116060)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116060
Approved by: https://github.com/fduwjj
ghstack dependencies: #116059
2023-12-20 04:19:45 +00:00
6ffe1da375 Add support for multi device foreach ops (#116064)
Fix for https://github.com/pytorch/pytorch/issues/102023
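A hedged sketch of the op pattern this targets (assumes at least two visible CUDA devices; not code from the PR):

```python
import torch

xs = [torch.ones(4, device="cuda:0"), torch.ones(4, device="cuda:1")]
ys = [torch.ones(4, device="cuda:0"), torch.ones(4, device="cuda:1")]

@torch.compile
def step(a, b):
    # a single foreach op over tensors living on different devices
    return torch._foreach_add(a, b)

out = step(xs, ys)  # each result stays on its input's device
```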

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116064
Approved by: https://github.com/mlazos
2023-12-20 04:19:40 +00:00
c72bc61bcd [ROCm] Fix caffe2 build with hipblasv2 api (#116073)
Summary: we need this change along with D52244365 to make caffe2 build happy

Test Plan: OSS CI

Differential Revision: D52275058

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116073
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2023-12-20 04:02:29 +00:00
a597a00c87 [AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115972)
Summary: Both ExternKernelAlloc and ExternKernelOut need the two fields, so declaring them in the base class. Also add cpp codegen for IndexPutFallback and InplaceBernoulliFallback in this PR.

This is a reland of https://github.com/pytorch/pytorch/pull/115831

Differential Revision: [D52290900](https://our.internmc.facebook.com/intern/diff/D52290900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115972
Approved by: https://github.com/chenyang78
2023-12-20 03:22:03 +00:00
4f02cc0670 [C10D] Add logPrefix to abortCommsFromMap (#116059)
Prints additional info such as PG ID/Rank.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116059
Approved by: https://github.com/fduwjj
2023-12-20 02:17:04 +00:00
c3bc65d9d8 [dynamo] Restore constant tensor original FQNs (#116086)
Differential Revision: D52192693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116086
Approved by: https://github.com/angelayi, https://github.com/muchulee8
2023-12-20 02:10:02 +00:00
6730b5bcb4 Support nn_module_stack in torch.export(strict=False) (#115454)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115454
Approved by: https://github.com/suo, https://github.com/bdhirsh
2023-12-20 01:43:39 +00:00
c173a9d9b3 add Half support for layer_norm on CPU (#99590)
### Testing
Single socket (icx, 32cores):
| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.051 | 0.051 | 0.050 |
| (8 ,8, 16) | 0.013 | 0.013 | 0.013 | 0.054 | 0.053 | 0.051 |
| (32, 8, 16) | 0.015 | 0.014 | 0.014 | 0.059 | 0.054 | 0.052 |
| (64, 128, 56, 56) | 1.875 | 0.790 | 1.016 | 12.845 | 7.151 | 6.985 |
| (64, 128, 256, 256) | 50.226 | 25.462 | 35.736 | 328.957 | 179.615 | 175.618 |

Single core (icx):

| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.040 | 0.041 | 0.041 |
| (8 ,8, 16) | 0.012 | 0.012 | 0.012 | 0.042 | 0.042 | 0.042 |
| (32, 8, 16) | 0.027 | 0.014 | 0.014 | 0.048 | 0.048 | 0.046 |
| (64, 128, 56, 56) | 58.054 | 11.034 | 17.928 | 108.603 | 48.816 | 50.244 |
| (64, 128, 256, 256) | 1327.758 | 352.394 | 496.994 | 2846.182 | 1224.247 | 1218.422 |
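For reference, a minimal sketch of the newly supported path (shapes mirror one of the benchmark rows above; the normalized shape is an illustrative choice, not from the PR):

```python
import torch
import torch.nn.functional as F

x = torch.randn(64, 128, 56, 56, dtype=torch.half)   # fp16 input on CPU
w = torch.randn(128, 56, 56, dtype=torch.half)
b = torch.randn(128, 56, 56, dtype=torch.half)

# fp16 layer_norm on CPU, previously unsupported
out = F.layer_norm(x, normalized_shape=(128, 56, 56), weight=w, bias=b)
```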

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99590
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/cpuhrsch
2023-12-20 01:11:15 +00:00
45cfe9cdf7 [export] Fix test to run internally (#116118)
Test Plan: `buck2 run @//mode/dev-nosan //caffe2/test:test_export`

Reviewed By: suo

Differential Revision: D52297701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116118
Approved by: https://github.com/suo
2023-12-20 01:02:16 +00:00
c55210b4f0 [Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)
Noticed that on many MRS kernels the grid wrapper for autotuning is huge, with a bunch of duplicates, because num_warps and num_stages are not needed for grid calculation. Let's deduplicate these entries.

Previously, we would see wrapper like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
2023-12-20 00:25:32 +00:00
9a2a44457a SDPA extend backward realized tensor alignment checking to forward realized tensors (#116069)
The logic to check alignment for realized tensors in the backward can be extended for realized tensors in the forward. This fixes an interaction with freezing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116069
Approved by: https://github.com/drisspg
2023-12-20 00:14:20 +00:00
110339a310 Fix c10::div_floor_floating compile error (#115647)
Introduced by #113276. I've added a test to catch future regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115647
Approved by: https://github.com/desertfire, https://github.com/vfdev-5
2023-12-20 00:09:01 +00:00
68c7aac809 [export][reland] non-strict export with dynamic shapes (#116048)
Reland of https://github.com/pytorch/pytorch/pull/115862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116048
Approved by: https://github.com/ydwu4
2023-12-19 23:57:22 +00:00
cd449e260c Mark set_ as an inplace view op (#115769)
Summary: To be used in https://github.com/pytorch/pytorch/pull/113873. Since set_ is effectively an inplace view op, we'll need to skip caching it.
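A small illustration of why `set_` behaves like an inplace view op (hedged sketch, not from the PR):

```python
import torch

a = torch.zeros(4)
b = torch.empty(0)
b.set_(a)          # b now shares a's storage, sizes, and strides, like a view created in place
b[0] = 1.0
assert a[0].item() == 1.0   # mutation through b is visible through a
```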

Test Plan: Built pytorch; specifically this step: `/home/slarsen/local/miniconda3/envs/pytorch-3.10/bin/python -m torchgen.gen --source-path /home/slarsen/local/pytorch/cmake/../aten/src/ATen --install_dir /home/slarsen/local/pytorch/build/aten/src/ATen --per-operator-headers --generate sources --output-dependencies /home/slarsen/local/pytorch/build/aten/src/ATen/generated_sources.cmake`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115769
Approved by: https://github.com/bdhirsh
2023-12-19 23:08:05 +00:00
0759240001 [sparse] update cslt to 0.5.2.1 (#115988)
Summary:

- update install_cusparselt to download 0.5.2.1 for 12.1
- add ifdef for new compute_type changes


Pull Request resolved: https://github.com/pytorch/pytorch/pull/115988
Approved by: https://github.com/malfet
ghstack dependencies: #115369
2023-12-19 23:02:54 +00:00
eqy
d55365dc05 [CUDA] Workaround shmem limit for certain input sizes in AdaptiveAvgPool1D (#115231)
Reference issue #68248

CC @ptrblck @malfet @xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115231
Approved by: https://github.com/mikaylagawarecki
2023-12-19 22:40:10 +00:00
7d92449171 Add call to run_tests for more tests? (#115781)
To make sure they get run in CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115781
Approved by: https://github.com/kshitij12345, https://github.com/mlazos, https://github.com/voznesenskym
2023-12-19 22:20:10 +00:00
7f7a7b0b48 Reset stepcurrent cache if file succeeds (#115775)
Attempt to surface the segfault that happens on exit by resetting the "pytest last run" cache if pytest succeeds.  CI does not rerun on success so we won't hit an infinite loop anywhere, and I don't expect people to rerun on success (unless they're looking for flakes? Either way I highly doubt anyone is using the --sc/--scs flag locally).

This ensures that if pytest succeeds but the process gets a non zero exit code, the rerun will start at beginning instead of skipping all the "succeeding" tests.

This only applies if the --sc/--scs flags are used; these are custom to PyTorch and probably not used anywhere other than CI. They are not to be confused with --stepwise, which pytest provides by default.

Here's a list of segfaulting inductor/test_aot_inductor tests, which I added skips for:
```
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_duplicated_params_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_fqn_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_no_args_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_output_misaligned_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_pytree_inputs_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_seq_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_simple_split_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_addmm_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_aliased_buffer_reuse_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_buffer_reuse_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_convolution_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_duplicated_params_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_empty_graph_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_fqn_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_large_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_missing_output_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_no_args_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_output_misaligned_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_output_path_1_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_pytree_inputs_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_repeat_interleave_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_return_constant_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_reuse_kernel_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_seq_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_simple_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_simple_split_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_small_constant_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_with_no_triton_profiler_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_with_offset_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_with_profiler_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_zero_size_weight_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115775
Approved by: https://github.com/desertfire
2023-12-19 22:19:57 +00:00
f88c9af98e [TEST] Skip scaled_dot_product_attention test on sm < 80 (#115760)
According to the [functionality](https://github.com/NVIDIA/cutlass/blob/main/media/docs/functionality.md) page, CUTLASS supports `bfloat16` aka `bf16` only on compute capability 80+ devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115760
Approved by: https://github.com/drisspg
2023-12-19 22:00:33 +00:00
ae6f1f4a47 [BE]: enable readability-delete-null-pointer clang-tidy check (#116107)
* Enables an additional clang-tidy check that removes unnecessary nullptr checks around delete statements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116107
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-19 21:08:37 +00:00
d85314c95c Support Predispatch functionalization (#113728)
In this PR, we are implementing functionalization on the pre-dispatch graph. Today, every dispatch key except for DispatchKey.Python has a dedicated mode stack in Python. PreDispatch tracing relies on this behaviour by pushing ProxyTorchDispatchMode to the DispatchKey.PreDispatch mode stack and handling the dispatching logic in Python. To make pre-dispatch functionalization work, we now need to push FunctionalTensorMode onto the DispatchKey.PreDispatch mode stack and make sure it runs before ProxyTorchDispatchMode (this is very similar to how post-dispatch tracing works). Here are some design decisions we made for this flow to work:

1. FunctionalTensorMode internally calls the C++ Functionalize key. Since C++ functionalization goes after PreDispatch, if we are not careful we will keep re-entering the PreDispatch key. We solve this by directly dispatching to the C++ Functionalize key.

2. We delete the mode_stack_per_key logic because the only realistic time it is exercised is for PreDispatch, and it is in general not safe to use a plain list: the ordering of FunctionalTensorMode and ProxyTorchDispatchMode matters and is hard to enforce on a plain list. Instead, we now have a private class that tracks the PreDispatch mode stack.

3.  We will still run CompositeImplicitAutograd decomps in this PR, and disable this logic later as a followup.

Some missing bits after this PR:
1. Preserving autograd ops in a functional form. Right now they still show up in the graph but in a "non-functional" way.
2. Turn off CompositeImplicitAutograd decomps
3. Functionalizing HOO

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113728
Approved by: https://github.com/bdhirsh
2023-12-19 20:28:35 +00:00
1474eb5f29 Fix jagged composite impl of flatten() (#115192)
Need to handle this in `NestedTensor.__torch_function__()` since it's CompositeImplicit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115192
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2023-12-19 19:15:21 +00:00
cbc70e9b9c [caffe2] Add option for build_cpukernel_avx2 (#116008)
Summary: We would like a more flexible way to customize the build option for the AVX2 instruction set, in order to address other issues.

Test Plan: CI

Differential Revision: D52247916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116008
Approved by: https://github.com/mattjgalloway
2023-12-19 18:49:52 +00:00
77d5f60740 [fsdp][torch.compile] FSDP changes (#115497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115497
Approved by: https://github.com/albanD
2023-12-19 18:44:36 +00:00
e52983939c fix(conv_v8): optimize lru cache in conv v8 (#114110)
Fixes #108474

The main issue is due to GCC's dual ABI.

https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html
> requires lists to keep track of their size.

It seems that in GCC's old ABI, `std::list::size` is linear.

The other optimization is:
* use `splice` instead of erase-then-push, which saves some memory and time.

More perf benchmarks are coming...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114110
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/malfet
2023-12-19 18:43:37 +00:00
d749b4a152 Implements permute_tensor in functional collectives (#115078)
Implementation of `permute_tensor` as per @yifuwang's suggestion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115078
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
2023-12-19 18:33:28 +00:00
71bedc3a69 [Inductor UT] fix unreachable code (#116094)
The test case test_uint4x2_mixed_mm has an indentation error. This PR makes the test code reachable.

test result:
```
pytest test_torchinductor.py -k test_uint4x2_mixed_mm -v
=========================================================================================== test session starts ===========================================================================================
platform linux -- Python 3.10.12, pytest-7.4.2, pluggy-1.3.0 -- /usr/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/workspace/pytorch/test/inductor/.hypothesis/examples')
rootdir: /workspace/pytorch
configfile: pytest.ini
plugins: shard-0.1.2, xdoctest-1.0.2, flakefinder-1.1.0, xdist-3.3.1, rerunfailures-12.0, hypothesis-5.35.1
collected 964 items / 962 deselected / 2 selected
Running 2 items in this shard: test/inductor/test_torchinductor.py::CpuTests::test_uint4x2_mixed_mm_cpu, test/inductor/test_torchinductor.py::CudaTests::test_uint4x2_mixed_mm_cuda

test_torchinductor.py::CpuTests::test_uint4x2_mixed_mm_cpu PASSED [2.2136s]                                                                                                                         [ 50%]
test_torchinductor.py::CudaTests::test_uint4x2_mixed_mm_cuda PASSED [1.9466s]                                                                                                                       [100%]

=================================================================================== 2 passed, 962 deselected in 15.70s ====================================================================================

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116094
Approved by: https://github.com/peterbell10
2023-12-19 17:14:25 +00:00
5ba87a31bc Unflake test_reference_numerics_large__refs_special_multigammaln_mvlgamma_p_1_cpu_bfloat16 (#116058)
Run the test under markDynamoStrict mode and record an expected failure
under the Dynamo CI shard.

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116058
Approved by: https://github.com/atalman
2023-12-19 16:42:29 +00:00
7b7f11f230 [dynamo] test number of guards when inputs are views (#115793)
After # 113734 landed (adding dynamic storage offsets), we found that compilation times increased significantly. The reason: tensors_definitely_do_not_overlap was doing comparisons on storage offsets which were adding guards

626b7dc847/torch/_functorch/_aot_autograd/input_output_analysis.py (L268-L276)

This guard is added on all pairs of tensors which are views of the same source tensor - i.e. the number of guards can be quadratic in the number of input tensors. This PR adds a test to prevent similar regressions.
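A hedged sketch of the kind of scenario the test guards against (names and sizes are illustrative):

```python
import torch

base = torch.randn(1024)
# several inputs sharing one storage, each with a different storage offset
views = [base[i : i + 8] for i in range(0, 64, 8)]

@torch.compile(dynamic=True)
def f(*tensors):
    return sum(t.sum() for t in tensors)

f(*views)  # the number of guards should stay roughly linear in len(views), not quadratic
```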

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115793
Approved by: https://github.com/yanboliang
2023-12-19 16:09:29 +00:00
91e184fd74 Revert "Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)"
This reverts commit 4edc921857f39ba9510b6ab1c454149cfb2de157.

Reverted https://github.com/pytorch/pytorch/pull/114001 on behalf of https://github.com/jeanschmidt due to Breaking multiple internal tests, might be flakiness but multiple retries did not elicit an improvement, please check internal diff ([comment](https://github.com/pytorch/pytorch/pull/114001#issuecomment-1863036417))
2023-12-19 16:01:19 +00:00
b6d0d0819a Revert "[PT2] [Quant] Change the QConv2d Binary post op name from add to sum (#115329)"
This reverts commit 9ae0e6292944139ea598e7347c95ebd7df09e819.

Reverted https://github.com/pytorch/pytorch/pull/115329 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, please check internal diff to get the list and logs, @jerryzh168 please support the author in order to get these changes merged and landed ([comment](https://github.com/pytorch/pytorch/pull/115329#issuecomment-1863021726))
2023-12-19 15:52:57 +00:00
c539f7df10 Revert "[Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)"
This reverts commit 21b8127f1c9f31c02145d906aae2db1ada703067.

Reverted https://github.com/pytorch/pytorch/pull/115849 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, please check internal diff for more details ([comment](https://github.com/pytorch/pytorch/pull/115849#issuecomment-1863012933))
2023-12-19 15:47:55 +00:00
505a9e4854 add support for dynamic shapes in round (#115259)
Fixes #114310 and supersedes #114748.

There are two reasons why we have quite a few special cases for `round`:

1. `round` is actually two ops. With `ndigits=None` (default), `round` always returns an integer. When `ndigits` is an integer, the returned type is a float.
2. Although `round` takes two arguments, it is a unary function with a parameter rather than a binary one.
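A plain-Python reminder of the first point (this is standard `round` behavior, not code from the PR):

```python
assert isinstance(round(3.7), int)       # ndigits=None -> integer result
assert isinstance(round(3.7, 2), float)  # ndigits given -> float result
```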

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115259
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-12-19 15:45:50 +00:00
a7bfa04da6 Revert "More markDynamoStrictTest (#115870)"
This reverts commit 7f686c8fe127cc7db07134297fa09be20ab87918.

Reverted https://github.com/pytorch/pytorch/pull/115870 on behalf of https://github.com/jeanschmidt due to Breaking internal tests and builds, please check diff ([comment](https://github.com/pytorch/pytorch/pull/115870#issuecomment-1862997125))
2023-12-19 15:40:57 +00:00
24af118e55 Revert "markDynamoStrictTest more tests (#115871)"
This reverts commit 478f0e96dc2593db401903ac2ae053f8cd1e29ea.

Reverted https://github.com/pytorch/pytorch/pull/115871 on behalf of https://github.com/jeanschmidt due to Breaking internal tests and builds, please check diff, this is required to revert #115870 ([comment](https://github.com/pytorch/pytorch/pull/115871#issuecomment-1862992931))
2023-12-19 15:36:27 +00:00
5b6b680517 Revert "Adamw refactor (#115983)"
This reverts commit eafeba71c1ed35f8cf2d39016bf66c0b088e4a9f.

Reverted https://github.com/pytorch/pytorch/pull/115983 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, @janeyx99 please help @tfsingh to have this PR landed ([comment](https://github.com/pytorch/pytorch/pull/115983#issuecomment-1862976954))
2023-12-19 15:26:44 +00:00
92998693a9 [inductor] Avoid bool being upcast to int (#109913)
Currently the inductor code for `x.any(-1)` does this strange dance:
```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask)
tmp1 = tmp0.to(tl.int64)
tmp2 = (tmp1 != 0)
```

This happens because `register_lowering` is doing type promotion with the
dimension argument, and so promotes to `int64` which we then cast back to bool.
A better fix would be to fix `register_lowering` but for now I just remove
the unnecessary type promotion from `aten.any`.

In the current code we also see:
```python
     tmp5 = tl.where(rmask & xmask, tmp3, 0)
```
which promotes the boolean value to int since `0` is an int32 in triton.
This fixes it to generate a boolean constant instead.

Finally there is also a triton bug where the `tl.load` itself upcasts to
`tl.int8`. I fix this by adding an explicit cast to `tl.int1`. The final
kernel code looks like:

```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask).to(tl.int1)
tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
tmp3 = tl.full([1, 1], 0, tl.int1)
tmp4 = tl.where(rmask & xmask, tmp1, tmp3)
tmp5 = triton_helpers.any(tmp4, 1)[:, None]

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109913
Approved by: https://github.com/lezcano
2023-12-19 14:16:10 +00:00
992c4e7b24 Actually run Dynamo tests in all Dynamo shards (#115962)
We weren't doing this before. Also adds some more skips so that CI
passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115962
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115925
2023-12-19 14:12:53 +00:00
0bd5a3fed7 [releng] Docker release Refactor Push nightly tags step. Move cuda and cudnn version to docker tag rather then name (#116097)
Follow up after : https://github.com/pytorch/pytorch/pull/116070

This PR does 2 things.

1. Refactor the Push nightly tags step; we no longer need to extract CUDA_VERSION. The new tag should be in this format: ``${PYTORCH_VERSION}-cuda$(CUDA_VERSION_SHORT)-cudnn$(CUDNN_VERSION)-runtime``
2. Move cuda$(CUDA_VERSION_SHORT)-cudnn$(CUDNN_VERSION) from the docker name to the tag

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116097
Approved by: https://github.com/jeanschmidt
2023-12-19 13:53:08 +00:00
a31effa15f Update device_mesh.py docs imports (#116074)
These are not importable from `torch.distributed`, at least today.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116074
Approved by: https://github.com/wz337, https://github.com/fegin
2023-12-19 09:44:55 +00:00
eqy
2a44034895 [CUDA] Include <thrust/swap.h> in LinearAlgebra.cu (#116072)
Fixes build against the latest `NVIDIA/cccl`.

CC @malfet @xwang233 @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116072
Approved by: https://github.com/malfet, https://github.com/xwang233
2023-12-19 05:56:52 +00:00
327bdcdb14 Some tiny modification about torch.set/get_default_device (#116014)
1. Fix a bug in torch.set_default_device under multi-threading
2. Add a new interface named torch.get_default_device
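A minimal usage sketch of the new accessor (assumes a CUDA-capable build; the printed output is illustrative):

```python
import torch

torch.set_default_device("cuda")    # existing API, now safe across threads per item 1
print(torch.get_default_device())   # new accessor from item 2, e.g. device(type='cuda', index=0)
```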

Fixes #115333
Fixes #115917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116014
Approved by: https://github.com/malfet, https://github.com/jansel
2023-12-19 05:08:06 +00:00
b48abbc020 [DeviceMesh] Fix DeviceMesh docstring (#116053)
1. remove outdated comments
2. fix examples in docstring

Doc after fix:
<img width="706" alt="image" src="https://github.com/pytorch/pytorch/assets/31293777/19f4f03c-0fd7-4e88-bca1-1a6ce693fbb7">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116053
Approved by: https://github.com/wanchaol
2023-12-19 04:05:49 +00:00
8b0122ad33 Add lowerings for reflection_pad{1, 3}d_backward (#115645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115645
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2023-12-19 04:05:10 +00:00
9dda4b20a0 [MPS] Enable select/[broad]cast ops for complex dtypes (#115727)
By representing `torch.cfloat`/`torch.chalf` as the `float2`/`half2` Metal types and modifying `SCATTER_OPS_TEMPLATE`/`GATHER_OPS_TEMPLATE` to accept a third argument: a fully specialized `cast` function, which is a no-op for regular types but special-cased for float->complex and complex->float.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115727
Approved by: https://github.com/kulinseth
2023-12-19 02:25:28 +00:00
cyy
1544c37520 [7/N] Fixes clang-tidy warnings in c10/{core,util}/*.h (#115495)
This PR continues to fix clang-tidy warnings for headers in c10/core and c10/util.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115495
Approved by: https://github.com/malfet
2023-12-19 02:14:30 +00:00
9b8f934068 Remove memory_format check for native_group_norm_backward (#115721)
To fix https://github.com/pytorch/pytorch/issues/115940.
Remove memory_format check for native_group_norm_backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115721
Approved by: https://github.com/mikaylagawarecki
2023-12-19 02:12:26 +00:00
01b979fc9a [Inductor] Fix constant folding and extern kernel mutation tracking bugs (#115908)
This PR fixes two bugs
1) Constant folding a triton kernel results in the kernel's inputs being returned without any modification. Disable constant folding for triton kernels; this needs more investigation.
2) NoneLayout buffers should not be deleted, as they do not exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115908
Approved by: https://github.com/aakhundov, https://github.com/jansel
2023-12-19 02:06:50 +00:00
bb5a27052f [Dynamo][9/N] Make SkipFilesVariable wrap functions only (#115963)
Make ```SkipFilesVariable``` only handle function type, and route skipped classes to ```UserDefinedClassVariable```. The reasons behind this are:
* We'd like to remove ```is_allowed```, so the allowed/disallowed torch classes need a proper place to be handled. We can put them in either ```SkipFilesVariable``` or ```UserDefinedClassVariable``` under the current architecture, but it's confusing to have two places do one thing.
   - Going forward, let's make ```SkipFilesVariable``` only handle functions, and probably I'll rename it to ```SkippedFunctionVariable``` in the following PRs.
   - Let's do dispatch by value's type, all torch classes stuff would go to ```UserDefinedClassVariable``` in the next PR.
* We'd like to merge the in_graph/skip/inline trace decision into the same API ```trace_rule.lookup```, so we probably have to limit the input to functions only, for better organization of the ```VariableBuilder._wrap``` logic.
   - Next step, I'll merge ```skipfiles.check``` into ```trace_rules.lookup```, and do the skipfile check before wrapping them into correct variable tracker.
   - Though the ```TorchCtxManagerClassVariable``` is decided by ```trace_rules.lookup```, I'll refactor it out in the following PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115963
Approved by: https://github.com/jansel
2023-12-19 02:01:47 +00:00
47908a608f Revert "[ROCm] add hipblaslt support (#114329)"
This reverts commit b062ea38039234c80404a8f5f4d5a93c4cb9832d.

Reverted https://github.com/pytorch/pytorch/pull/114329 on behalf of https://github.com/jeanschmidt due to Reverting due to inconsistencies on internal diff ([comment](https://github.com/pytorch/pytorch/pull/114329#issuecomment-1861933267))
2023-12-19 01:04:58 +00:00
ed0c0c49ef Revert "[ROCm] fix nightly 5.6 build (#116029)"
This reverts commit 63e242b1e41759f9b24a0fbb997f157a06a9dd13.

Reverted https://github.com/pytorch/pytorch/pull/116029 on behalf of https://github.com/jeanschmidt due to Need to revert, in order to be able to revert #114329 ([comment](https://github.com/pytorch/pytorch/pull/116029#issuecomment-1861931736))
2023-12-19 01:01:42 +00:00
368a0c06d4 [releng] Docker Official release make sure cuda version is part of image name (#116070)
Follow up on https://github.com/pytorch/pytorch/pull/115949

Change docker build image name:
``pytorch:2.1.2-devel``-> ``2.1.2-cuda12.1-cudnn8-devel and 2.1.2-cuda11.8-cudnn8-devel``

Ref: https://github.com/orgs/pytorch/packages/container/package/pytorch-nightly

Naming will be same as in https://hub.docker.com/r/pytorch/pytorch/tags
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116070
Approved by: https://github.com/huydhn, https://github.com/seemethere
2023-12-19 00:58:15 +00:00
5894af83be Use dequantized weight and bias in conv2d quantized ops (#115615)
Summary:
Dequantize weight and bias for conv2d ops to improve performance. The weight and bias are usually small in size, hence they do not increase the memory footprint by much when dequantized.

With optimization cunet-enc ops:
vulkan.quantized_conv2d  {96, 72, 2}                      3753204
vulkan.quantized_conv2d  {96, 72, 2}                      6977048
vulkan.quantized_conv2d_dw{96, 72, 2}                      2499640
vulkan.quantized_conv2d_pw_2x2{96, 72, 2}                       842088
vulkan.quantized_conv2d  {48, 36, 4}                      2388152
vulkan.quantized_conv2d  {48, 36, 4}                      4775940
vulkan.quantized_conv2d_dw{48, 36, 4}                       709800
vulkan.quantized_conv2d_pw_2x2{48, 36, 4}                       483236
vulkan.quantized_conv2d  {24, 18, 8}                      2562144
vulkan.quantized_conv2d  {24, 18, 8}                      5447624
vulkan.quantized_conv2d_dw{24, 18, 8}                       392756
vulkan.quantized_conv2d_pw_2x2{24, 18, 8}                       509080

Without optimization:
vulkan.quantized_conv2d  {96, 72, 2}                      4291768
vulkan.quantized_conv2d  {96, 72, 2}                      7871344
vulkan.quantized_conv2d_dw{96, 72, 2}                      2658500
vulkan.quantized_conv2d_pw_2x2{96, 72, 2}                       891020
vulkan.quantized_conv2d  {48, 36, 4}                      2966860
vulkan.quantized_conv2d  {48, 36, 4}                      5661812
vulkan.quantized_conv2d_dw{48, 36, 4}                       816556
vulkan.quantized_conv2d_pw_2x2{48, 36, 4}                       528632
vulkan.quantized_conv2d  {24, 18, 8}                      3139604
vulkan.quantized_conv2d  {24, 18, 8}                      6202820
vulkan.quantized_conv2d_dw{24, 18, 8}                       452660
vulkan.quantized_conv2d_pw_2x2{24, 18, 8}                       557388

Test Plan:
Ensure all vulkan quantize tests pass:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest

...
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[  PASSED  ] 78 tests.

buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"

Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest

...
[----------] 395 tests from VulkanAPITest (6515 ms total)

[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[  PASSED  ] 394 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 5 DISABLED TESTS

Reviewed By: yipjustin

Differential Revision: D50997532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115615
Approved by: https://github.com/manuelcandales, https://github.com/yipjustin
2023-12-19 00:23:52 +00:00
270ed13e87 [DTensor] Make DTensor from_local backward partial() to replicate() pass through (#115967)
Summary:
This change makes the `DTensor.from_local()` backward-pass placement conversion from `Partial()` to `Replicate()` a pass-through, for the following reasons:
1. When we run the backward pass of DTensor.from_local, if the target placement is Partial() (i.e., manually overwritten in user code instead of coming from torch_dispatch), we keep the grad as Replicate(). This is because converting the gradients back to `Partial()` is meaningless.
2. The current div logic would lead to wrong numerical values in the above case.

Test Plan:
**CI**:
CI Tests

**Unit test**:
`buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:redistribute`
- Passed

**With model training**:
```
# We tested the case where the input tensor is manually overwritten as Partial() and
# the output tensor is manually overwritten to Shard() and then converted to local.

# Before the change: numerical value not correct
Forward pass:
    collective: ReduceScatter
backward pass:
    collective: AllGather + div by process group size

# After the change: div is removed as expected.
Forward pass:
    collective: ReduceScatter
Backward pass:
    collective: AllGather
```

Differential Revision: D52175709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115967
Approved by: https://github.com/wanchaol
2023-12-19 00:16:10 +00:00
3472a9200d expand subclass type tests in dynamo (#116024)
Following up on my own comments in https://github.com/pytorch/pytorch/pull/115323#pullrequestreview-1769491483.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116024
Approved by: https://github.com/mlazos
2023-12-19 00:08:55 +00:00
054f9548b4 [dynamo] Store CompilationEvents in a buffer in torch._dynamo.utils (#115788)
Motivation: it would be nice to be able to test using the metrics in log_compilation_event; currently it dumps logs (or logs to a database in fbcode), and these are hard to use in unit tests.

This change:
* always record the information in torch._dynamo.utils.record_compilation_metrics; here, log into a limited-size deque to prevent the list of metrics from getting too long
* if config.log_compilation_metrics, then call back into the original log_compilation_event function
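A minimal sketch of the buffering idea described above (names are illustrative and do not mirror the exact torch._dynamo.utils implementation):

```python
import collections

LOG_COMPILATION_METRICS = True                        # stands in for config.log_compilation_metrics
_compilation_metrics = collections.deque(maxlen=64)   # bounded buffer so the history cannot grow unbounded

def log_compilation_event(metrics):
    print("logged:", metrics)                         # stands in for the original logging/DB path

def record_compilation_metrics(metrics):
    _compilation_metrics.append(metrics)              # always buffered, so unit tests can inspect it
    if LOG_COMPILATION_METRICS:
        log_compilation_event(metrics)                # optionally forward to the original sink
```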

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115788
Approved by: https://github.com/yanboliang
2023-12-18 23:26:13 +00:00
fc58909bab Fix allowed dtypes for mem_eff attention (#116026)
# Summary

Fix a bug in detecting memory-efficient attention capability for CUDA devices with compute capability below sm80:
https://github.com/pytorch-labs/gpt-fast/issues/49

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116026
Approved by: https://github.com/janeyx99
2023-12-18 23:20:52 +00:00
6b120c6cf9 Update the sdpa benchmark to measure forward backward time in isolation (#115986)
# Summary

The benchmarks were getting a little stale and I think it makes more sense to measure in isolation now rather than E2E in a mha component.

This is a pre-req for getting the data for https://github.com/pytorch/pytorch/pull/115357

Output from run:
``` Shell
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
| batch_size | num_heads | q_seq_len | kv_seq_len | embed_dim | is_causal |     dtype      |    forward_time    |   backward_time    |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
|     1      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 23.86634959839284  | 66.21150835417211  |
|     1      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 23.452017060481012 | 66.90612225793302  |
|     1      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 24.478124547749758 |  76.4232068322599  |
|     1      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 |  24.6928428998217  | 75.76151192188263  |
|     1      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 28.69622849393636  | 114.73898496478796 |
|     1      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 34.399422979913645 | 112.96746158041059 |
|     1      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 |  65.4690912924707  | 216.26344555988908 |
|     1      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 88.57532404363155  | 212.07790216431025 |
|     8      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.582905380055308 | 70.09557797573505  |
|     8      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 12.068384909071026 | 70.01491216942668  |
|     8      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 31.671419646590945 | 203.54910241439939 |
|     8      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 |  33.0585768679157  | 209.45609430782497 |
|     8      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 87.43969700299202  | 469.8729298543185  |
|     8      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 123.9265550393611  | 580.1084265112877  |
|     8      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 561.1918237991632  | 1181.655174586922  |
|     8      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 884.2707145959139  | 1662.4679416418073 |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115986
Approved by: https://github.com/mikaylagawarecki
2023-12-18 22:40:47 +00:00
bf62511e07 Reshape decomposition for jagged layout NT (#115191)
No more segfault from using `reshape()` on jagged NT :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115191
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2023-12-18 22:34:41 +00:00
63e242b1e4 [ROCm] fix nightly 5.6 build (#116029)
The ROCm 5.6 nightly wheel build was broken by #114329.  This fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116029
Approved by: https://github.com/huydhn, https://github.com/jithunnair-amd, https://github.com/atalman
2023-12-18 22:12:30 +00:00
8452f41305 Adds allreduce to inductor remap (#115950)
Fixes #115728

Implements a rewrite path for allreduce

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115950
Approved by: https://github.com/wconstab
2023-12-18 22:00:22 +00:00
2a5659a797 add length assertion to PrepareModuleInput and PrepareModuleOutput (#115957)
## summary

`zip(inputs, self.input_layouts, self.desired_input_layouts)` is used in `_prepare_input_fn`; similarly for `_prepare_output_fn`. Without an assertion, unmatched dimensions in inputs/outputs will be silently lost, potentially causing unexpected behaviors.
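A plain-Python illustration of the failure mode (not code from the PR):

```python
inputs = ("a", "b", "c")
input_layouts = ("row", "col")           # one layout short
print(list(zip(inputs, input_layouts)))  # [('a', 'row'), ('b', 'col')] -- 'c' is silently dropped
# An explicit length assertion (or zip(..., strict=True) on Python 3.10+) surfaces the mismatch instead.
```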

## test plan
`python test/distributed/tensor/parallel/test_tp_style.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115957
Approved by: https://github.com/wanchaol
2023-12-18 21:50:18 +00:00
a699b10339 [buck2][win] fix caffe2 protobuf_rule (#115954)
Summary:
c2_protobuf_rule ([here](https://fburl.com/code/iyiulpmv)) is broken on buck2, ultimately due to the following error:

> .\./caffe2.proto: File does not reside within any path specified using --proto_path (or -I).  You must specify a --proto_path which encompasses this file.  Note that the proto_path must be an exact prefix of the .proto file names -- protoc is too dumb to figure out when two paths (e.g. absolute and relative) are equivalent (it's harder than you think).

The root cause is differences in how buck1 and buck2 handle `%SRCDIR%` (absolute versus relative paths). This diff fixes the build.

Test Plan:
# Before

```
buck2 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```

```
More details at https://www.internalfb.com/intern/buck/build/c6550454-ae6d-479e-9d08-016e544ef050
BUILD SUCCEEDED
```

```
Action failed: fbsource//xplat/caffe2:caffe2.pb.h (genrule)
Remote command returned non-zero exit code <no exit code>
Reproduce locally: frecli cas download-action 5df17cf64b7e2fc5ab090c91e1129f2f3cad36dc72c7c182ab052af23d3f32aa:145
stdout:
stderr:
OUTMISS: Missing outputs: buck-out/v2/gen/fbsource/dd87aacb8683145b/xplat/caffe2/caffe2.pb.h/out/caffe2.pb.h
```

# After

Buck1 still works

```
buck1 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```

Buck2 works

```
buck2 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```

```
Buck UI: https://www.internalfb.com/buck2/e5dae607-325a-4eab-b0c9-66fe4e9a6254
BUILD SUCCEEDED
```

Differential Revision: D52218365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115954
Approved by: https://github.com/mcr229
2023-12-18 21:41:10 +00:00
2f7bb18def [Doc] Add padding size constraint in nn.ReflectionPad2d (#115995)
Fixes #115532
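For context, a hedged sketch of the constraint being documented (reflection padding must be smaller than the corresponding input dimension; the printed message is illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)
out = nn.ReflectionPad2d(3)(x)   # ok: padding 3 < input size 4
try:
    nn.ReflectionPad2d(4)(x)     # padding 4 >= input size 4
except RuntimeError as e:
    print(e)                     # padding size should be less than the corresponding input dimension
```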

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115995
Approved by: https://github.com/mikaylagawarecki
2023-12-18 21:29:14 +00:00
1e272fb6d6 [export] Undo "module: export" labeling (#116042)
Delete the auto-labeling of "module: export" as this is not really used, and we want to delete the "module: export" label.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116042
Approved by: https://github.com/clee2000
2023-12-18 21:23:17 +00:00
c4748b425e Add main in dynamo/test_compile.py (#115941)
Need to verify that it uses dynamo's custom TestCase and run_tests instead of the general common_utils TestCase and run_tests.
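A minimal sketch of the pattern being verified (assuming the dynamo test harness layout at the time):

```python
# at the bottom of dynamo/test_compile.py
from torch._dynamo.test_case import run_tests

if __name__ == "__main__":
    run_tests()
```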

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115941
Approved by: https://github.com/msaroufim
2023-12-18 20:53:28 +00:00
a1a0b290d2 [tp] further fix the docs (#115974)
A typo resulted in the note section not being rendered properly; this couldn't be seen
from the last PR directly, as the last PR only showed the first commit's
documentation :(

Also make the parallelize_module doc example more concrete

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115974
Approved by: https://github.com/wz337
2023-12-18 20:41:53 +00:00
8868c1cfae [sparse][ci] Add cuSPARSELt to CI (#115369)
Summary:

This PR adds cuSPARSELt v0.4.07 to CI (CUDA 12.1 and 11.8) to run our cuSPARSELt-specific tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115369
Approved by: https://github.com/malfet
2023-12-18 20:33:30 +00:00
2b2ed52799 [xla hash update] update the pinned xla hash (#116003)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116003
Approved by: https://github.com/clee2000
2023-12-18 20:31:49 +00:00
7b6210e8a4 Use matrix generate script for docker release workflows (#115949)
Enable both supported CUDA version builds for docker release. Rather then building only 1 version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115949
Approved by: https://github.com/huydhn
2023-12-18 20:20:59 +00:00
e30d436b01 [fx][split][testing] Add testing for #107981 (#108731)
- Follow-up to #107981, adding testing for metadata copying in placeholder nodes within the `split_by_tags` utility
- Validation included in the test from #107248, since both tests are relevant to the same aspect of the utility
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108731
Approved by: https://github.com/angelayi
2023-12-18 20:19:18 +00:00
bf20b56e9d Fix PyTorch build error on ppc64le (#115729)
The PyTorch build breaks when building from tip on ppc64le with the following error: pytorch/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp:863:46: error: no matching function for call to 'at::vec::DEFAULT::Vectorized<c10::qint8>::dequantize(at::vec::DEFAULT::Vectorized&, at::vec::DEFAULT::Vectorized&)

Issue reported #115165

This patch fixes the build issue.

Fixes #115165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115729
Approved by: https://github.com/albanD
2023-12-18 19:00:56 +00:00
77366ba637 Increased hardcoded limit for number of GPUs. (#115368)
Fixes #115331.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115368
Approved by: https://github.com/albanD
2023-12-18 18:39:19 +00:00
80b1ecc308 Run eager adam optimizer in benchmarks where possible (#115445)
Runs eager Adam (instead of SGD) on all models that don't fail accuracy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115445
Approved by: https://github.com/desertfire
2023-12-18 18:28:23 +00:00
8a445f7bd5 Serve multistream graph captures from correct pool (#114647)
This fixes #114320 by placing the logic for determining whether to allocate
to a pool inside a callback that is controlled by CUDAGraph.cpp or by the
python bound api to allocate a stream directly to a pool.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114647
Approved by: https://github.com/ngimel, https://github.com/eellison
2023-12-18 18:24:15 +00:00
3b70bd3970 Take 2 of "Add an option to log the source of the Triton kernels generated by torch._inductor (#115979)
Summary: This is useful for comparing the Triton kernels generated by two different invocations of torch.compile on the same model (e.g., checking whether serial compile and parallel compile generate identical Triton kernels).

Test Plan:
Unit test:
buck2 test mode/opt //caffe2/torch/fb/module_factory/sync_sgd/tests:test_torchdynamo_wrapper -- --print-passing-details >& ~/tmp/log.test
PyPer Mast job:
https://www.internalfb.com/mast/job/sw-951074659-OfflineTraining_87587a4e
See the *.py files generated in:
pyper_traces/tree/torchinductor_traces/sw-951074659-OfflineTraining_87587a4e/4623

Differential Revision: D52221500

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115979
Approved by: https://github.com/yanboliang
2023-12-18 18:16:44 +00:00
386776c49a [torch] Reduce memory usage by adding flags to clear intermediate graphs used for optimization during inference. (#115657)
Summary: At inference time the intermediate graphs used for optimization are not needed, so the Executor's graph is the only graph we need to keep around when these two flags are set.

Test Plan:
the FLAGS are all off by default

baseline
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true
I1212 10:24:20.407408 401092 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 182863 Kb
```
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true
I1212 10:31:37.663487 464000 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 186127 Kb
```
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true --torch_jit_execution_plan_avoid_extra_graph_copy=true
I1212 10:29:42.848093 447218 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 129451 Kb
```

Differential Revision: D52081631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115657
Approved by: https://github.com/houseroad
2023-12-18 17:56:39 +00:00
dd367b7c8f check tensor subclass when using torch.compile + SAC (#115960)
As titled: when using SAC + torch.compile, the check currently only looks for
functional tensors, not tensor subclasses, so SAC under torch.compile would
ignore tensor types like tensor subclasses. Fixed in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115960
Approved by: https://github.com/bdhirsh
2023-12-18 17:49:06 +00:00
e43d33f4f7 [export] Support torch.sym* ops (#115854)
Fixes https://github.com/pytorch/pytorch/issues/108830 and https://github.com/pytorch/executorch/issues/1379#issuecomment-1853322866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115854
Approved by: https://github.com/zhxchen17
2023-12-18 17:48:47 +00:00
647f14e70b [BE]: Enable clang-tidy check for readability-string-compare (#115994)
Adds a clang-tidy check to flag uses of string compare() that are less efficient and less readable than the equality operator when an equality overload exists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115994
Approved by: https://github.com/albanD
2023-12-18 16:13:00 +00:00
d7caef7996 [CI] Update clang-format (#116002)
To 17.0.6, built using https://github.com/pytorch/test-infra/blob/main/.github/workflows/clang-tidy-linux.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116002
Approved by: https://github.com/suo
2023-12-18 14:58:46 +00:00
c285ca7916 [AOTInductor] Add updating constant buffer to active buffer. (#116001)
Summary:
Refactor update inactive constant buffer to allow updating with active
buffer.

Test Plan:
Existing test to test inactive buffer updates.
UpdateConstantsCuda in cpp test for active buffer updates.

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116001
Approved by: https://github.com/chenyang78
2023-12-18 11:49:03 +00:00
34fe850d00 SymInt'ify sparse_compressed_tensor (#107903)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107903
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #115586
2023-12-17 17:36:20 +00:00
419f2ca3e3 Fix a crash in sparse compressed tensor invariants check when nnz == 0 (#115825)
Fixes the Python crash example from https://github.com/pytorch/pytorch/issues/115755
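
For illustration, a minimal sketch of the nnz == 0 construction this check now handles (shapes are illustrative, not the exact repro from the issue; the `check_invariants` argument is assumed to be available on the factory function):

```python
import torch

# A 2x3 CSR tensor with zero stored elements (nnz == 0).
t = torch.sparse_csr_tensor(
    torch.tensor([0, 0, 0]),              # crow_indices: no stored elements in either row
    torch.tensor([], dtype=torch.int64),  # col_indices
    torch.tensor([]),                     # values
    size=(2, 3),
    check_invariants=True,                # invariant checking is the code path that used to crash
)
print(t)
```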

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115825
Approved by: https://github.com/cpuhrsch
2023-12-17 17:36:15 +00:00
eafeba71c1 Adamw refactor (#115983)
Fixes #104899; refactors AdamW by abstracting out common code shared with Adam.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115983
Approved by: https://github.com/janeyx99
2023-12-17 06:58:39 +00:00
87ea6fb844 Make input contiguous for DTensor reduce scatter to fix the incorrect numerical values (#115847)
Summary:
This change is to make the input tensor contiguous for DTensor reduce scatter in the case no padding is needed.

There's no exception thrown during training, but we ran into numerical value correctness issue without the change.

Test Plan:
**CI**
CI test

**WHEN model test**:
- Verified loss for each iteration within the expected range.
- Verified NE on-par with this change with 4B training data.

Differential Revision: D52170822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115847
Approved by: https://github.com/wanchaol
2023-12-17 01:35:09 +00:00
bc4115ffcf [Inductor][Observability] Change to log.debug to avoid excessively long logs (#115474)
Summary: As titled.

Test Plan: CI

Differential Revision: D52003825

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115474
Approved by: https://github.com/jackiexu1992, https://github.com/yanboliang
2023-12-17 00:25:54 +00:00
4123cca859 [AARCH64] Fall back to GEMM if mkldnn_matmul fails (#115936)
- Add call to `at::globalContext().userEnabledMkldnn()` to `apply_mkldnn_matmul_heur`
- Surround calls to `mkldnn_matmul` with `try {} catch {}`
- Print warning and fall back to BLAS (by calling  `at::globalContext().setUserEnabledMkldnn()`) if `mkldnn_matmul()` fails

Test plan: On Linux arm run:
```shell
$ sudo chmod 400 /sys; python -c "import torch;m=torch.nn.Linear(1, 32);print(torch.__version__);print(m(torch.rand(32, 1)))"
Error in cpuinfo: failed to parse the list of possible processors in /sys/devices/system/cpu/possible
Error in cpuinfo: failed to parse the list of present processors in /sys/devices/system/cpu/present
Error in cpuinfo: failed to parse both lists of possible and present processors
2.3.0.dev20231215
bad err=11 in Xbyak::Error
bad err=11 in Xbyak::Error
/home/ubuntu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/nn/modules/linear.py:116: UserWarning: mkldnn_matmul failed, switching to BLAS gemm:internal error (Triggered internally at /pytorch/aten/src/ATen/native/LinearAlgebra.cpp:1509.)
  return F.linear(input, self.weight, self.bias)
tensor([[-0.5183,  0.2279, -0.4035,  ..., -0.3446,  0.0938, -0.2113],
        [-0.5111,  0.2362, -0.3821,  ..., -0.3536,  0.1011, -0.2159],
        [-0.6387,  0.0894, -0.7619,  ..., -0.1939, -0.0282, -0.1344],
        ...,
        [-0.6352,  0.0934, -0.7516,  ..., -0.1983, -0.0247, -0.1366],
        [-0.4790,  0.2733, -0.2862,  ..., -0.3939,  0.1338, -0.2365],
        [-0.5702,  0.1682, -0.5580,  ..., -0.2796,  0.0412, -0.1782]],
       grad_fn=<AddmmBackward0>)
```
Fixes https://github.com/pytorch/pytorch/issues/114750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115936
Approved by: https://github.com/lezcano
2023-12-16 21:37:56 +00:00
b06b02559e Support non grapharg and intermediary grad access (#115898)
Support for something we need for both FSDP and optimizers. For sourced args that are not inputs (params, etc.), we use the dynamic_getattr flow on tensors. This soundly handles the storage, registration, and guarding downstream of tensor_wrap for the grad values. For non-sourced args (true intermediates), we only support None (the idea being that if we have a true intermediate in the graph with grad, we are already doing something weird).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115898
Approved by: https://github.com/bdhirsh
ghstack dependencies: #115315, #112184
2023-12-16 18:43:37 +00:00
c5dcb50c00 [easy] aten ops: support passing all args as kwargs, including self (#114920)
Summary:
This is important for writing aten-IR-based graph transformations.

```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']

In [8]: torch.ops.aten.reshape.default(torch.rand(1,2), shape=[2])
Out[8]: tensor([0.7584, 0.4834])

# === CANNOT CALL `self` BY KWARGS ===

In [7]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])

TypeError: OpOverload.__call__() got multiple values for argument 'self'

```

# Where's the problem?

1. the aten ops first arg is usually named `self` (aten/src/ATen/native/native_functions.yaml)
2. Unfortunately, in `torch._ops.{OpOverload, OpOverloadPacket}.__call__()`, the first arg is (by python convention) named `self` too.

So when `self` is passed as a kwarg, `OpOverloadPacket.__call__` receives:

```
OpOverloadPacket.__call__(self, {"self": ...})
```

Python does not allow the same argument name to be supplied twice, and hence:

> TypeError: OpOverload.__call__() got multiple values for argument 'self'

# How to fix?

**Note that**, in the above, `self` is an instance of `OpOverloadPacket`, and the "self" kwarg is the input tensor to the aten op. To fix, we only need to differentiate the two `self`s.

In Python, the first arg of a method does not need to be named `self`. So we change the `__call__` definition to:

```
def __call__(_self, ...):
```

Now the call becomes:

```
OpOverloadPacket.__call__(_self, {"self": ...})
```

where:
* `_self` is the instance to the `OpOverloadPacket`
* `"self"` is the input tensor to the aten op.

Test Plan:
```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']

In [3]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
Out[3]: tensor([0.5127, 0.3051])
```

Differential Revision: D51731996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114920
Approved by: https://github.com/houseroad
2023-12-16 18:32:58 +00:00
88207b10ca Enable thp(transparent huge pages) for buffer sizes >=2MB (#107697)
The 2MB THP pages provide better allocation latencies compared to the standard 4KB pages. This change has shown substantial improvement for batch-mode use cases where the tensor sizes are larger than 100MB.

Only enabled if THP_MEM_ALLOC_ENABLE environment variable is set.
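
A minimal opt-in sketch (assuming the environment variable only needs to be visible to the process before the large allocation happens):

```python
import os

# Opt in to 2MB transparent huge pages for large CPU allocations.
os.environ["THP_MEM_ALLOC_ENABLE"] = "1"

import torch

# A buffer well above the 2MB threshold (~256MB of fp32).
buf = torch.empty(64 * 1024 * 1024)
```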

Relanding https://github.com/pytorch/pytorch/pull/93888 with functionality disabled for Android

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107697
Approved by: https://github.com/malfet
2023-12-16 18:16:19 +00:00
622947afa8 [BE] Use nested namespace in ATen/native (#115938)
It's a C++17 feature that usually makes code a bit more compact, and should have no side-effects otherwise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115938
Approved by: https://github.com/Skylion007
2023-12-16 06:07:40 +00:00
e3aefe2970 Revert "Initial Flash Attention support on ROCM (#114309)" (#115975)
This reverts commit 5bddbed399a89bf2875a38bb84cb869f382f1809.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115975
Approved by: https://github.com/atalman, https://github.com/malfet
2023-12-16 03:40:14 +00:00
8283491eff [TEST] Increase numerical tolerances in test_torchinductor_opinfo:test_comprehensive (#115768)
There are numerical mismatches that cause some tests of `test_comprehensive` to fail. I propose to just increase tolerances a bit to make them pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115768
Approved by: https://github.com/jansel
2023-12-16 03:00:22 +00:00
49af19cd8e Skip some flaky Dynamo tests in test_linalg.py (#115925)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115925
Approved by: https://github.com/lezcano
2023-12-16 02:38:56 +00:00
2a2f2e454a [inductor] Fixed issue with true div on integer input with dyn shapes (#115920)
Related to https://github.com/pytorch/pytorch/issues/115742, `Cpu/CudaTests.test_div8`

Description:
- Fixed issue with true div on integer input with dyn shapes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115920
Approved by: https://github.com/peterbell10
2023-12-16 02:06:39 +00:00
d08905db7e Trigger a mergeability check on ghstack PRs (#115944)
Works to solve https://github.com/pytorch/test-infra/issues/4816

In conjunction with https://github.com/pytorch/test-infra/pull/4823, this PR should make it such that all ghstack PRs kick off a mergeability-check job.

Test plan, once https://github.com/pytorch/test-infra/pull/4823 is merged, I'll resubmit this diff to make sure the workflow job triggers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115944
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2023-12-16 01:53:10 +00:00
14a6b24c8b [Dynamo][8/N] Wrap itertools.* as ItertoolsVariable (#115802)
This is part of a series of changes before removing ```is_allowed```.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115802
Approved by: https://github.com/voznesenskym
2023-12-16 01:42:02 +00:00
056a882cb9 add markDynamoStrictTest to TestOptimRenewed, removing flakiness (#115947)
fixes #115406 fixes #115394 fixes #115393 fixes #115392 fixes #115391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115947
Approved by: https://github.com/albanD, https://github.com/zou3519
2023-12-16 01:33:32 +00:00
0597eb56c2 Generate exhaustive compiled optimizer tests (#115906)
Generates tests for all permutations of arguments using the existing optimizer infos.
Covers capturable, cpu/gpu, single/multitensor and optimizer specific constants like rho/etas, etc.

[new test list](https://gist.github.com/mlazos/d3404383e7c3d490cbb51b7d6c750629)
[old test list](https://gist.github.com/mlazos/e0043aee1b6a0962d2f3ac8193aa62f8)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115906
Approved by: https://github.com/janeyx99
2023-12-16 00:42:43 +00:00
034e871710 [Dynamo] Look up variables from the old frame, rather than copying variables to the new frame; skip some copies to save time. (#115062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115062
Approved by: https://github.com/williamwen42
2023-12-16 00:02:59 +00:00
94d28161fa Fix broken PyYAML 6.0 on MacOS x86 (#115956)
Maybe we should just get rid of x86 jobs, but that's for another day.  This one should fix the broken build in trunk, i.e. https://github.com/pytorch/pytorch/actions/runs/7227220153/job/19694420117.

I guess that the failure looks flaky depending on the version of the default python3 on the GitHub x86 runner.

The issue from PyYAML https://github.com/yaml/pyyaml/issues/601
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115956
Approved by: https://github.com/malfet
2023-12-15 23:17:05 +00:00
74dfdc567b [MPS] aten::erfinv bug fix: add storage offset buffers to handle slicing (#105801)
A bug fix of a recently merged PR per comment: https://github.com/pytorch/pytorch/pull/101507#discussion_r1271393706

The follow test would fail without this bug fix:

```
import torch
def test_erfinv():
    for device in ['cpu', 'mps']:
        x = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5], device=device)
        y = x[2:].erfinv()

        x2 = torch.tensor([0.3, 0.4, 0.5], device=device)
        y2 = x2.erfinv()

        print(y)
        print(y2)

        torch.testing.assert_close(y, y2)
        print(f"{device} passes.")

test_erfinv()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105801
Approved by: https://github.com/malfet
2023-12-15 23:14:03 +00:00
d92d4133e7 [8/n] Update XNNPACK Submodule Version Part 8 Everything Remaining to get it to work (#115714)
> **__Note:__** The XNNPACK upgrade is very large (on the order of **40k** files and **10m** lines of code), so we break the update of the library into multiple parts. All parts [1 - n] must be landed together for it to work. ***This also means that if there is a revert, please revert the entire stack.***

This change contains everything remaining that is required for the XNNPACK version update to work.

@allow-large-files

Differential Revision: [D52099769](https://our.internmc.facebook.com/intern/diff/D52099769/)

---
submodule
(unblock merge to make ShipIt happy)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115714
Approved by: https://github.com/digantdesai
2023-12-15 23:08:08 +00:00
2e517b20d9 [MPS] Add Conv3D support for MPS (#114183)
Fixes #77818

I saw that PR #99246 was approved, but no one fixed the rebase conflicts, so I am bringing this up again to be merged.
I am leveraging @mattiaspaul's work. Quoting the description here:

> * this pull request enables 3D convolutions (forward/backward) for MPS (Apple Silicon) within the same Convolution.mm file as conv2d.
> * does not support channel_last (since pytorch doesn't implement channel_last for 3D tensors)
> * does not support conv3d_transpose and treats depth-separable convolutions not as normal case (there are no MPS kernels available for either of those so far)
> * requires MacOS >=13.2 (Ventura)

Please, let me know if there are any other changes needed and I'll be happy to implement them.
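
For reference, a minimal usage sketch (assuming an Apple Silicon machine on macOS >= 13.2 with the MPS backend available):

```python
import torch

if torch.backends.mps.is_available():
    conv = torch.nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3).to("mps")
    x = torch.randn(1, 3, 16, 32, 32, device="mps")  # contiguous input; channels_last is not supported for 3D tensors
    y = conv(x)
    print(y.shape)  # torch.Size([1, 8, 14, 30, 30])
```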

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114183
Approved by: https://github.com/malfet
2023-12-15 23:05:01 +00:00
9fcf6fb6fe [C10D] Add waitForDumpOrTimeout to log on dump abandonment (#115876)
Helps call attention to any cases where the dump actually times out.

The timeout is likely to hit if we run into slow stacktrace processing.

Log any exceptions encountered in the background thread, but don't raise
them- we're already willing to abandon the debug dump, and want to
proceed with our normal execution (in the case of dumppipe) or shutdown
process (when dumping happens on timeout and shutdown is already
initiated).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115876
Approved by: https://github.com/zdevito
ghstack dependencies: #115807
2023-12-15 22:13:06 +00:00
82e0d00da9 [c10d] Polish NCCL PG monitor thread log message (#115888)
We turned on monitor thread by default in https://github.com/pytorch/pytorch/pull/112518, and we want the error message that is displayed when the monitor kills the process to be more informative.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115888
Approved by: https://github.com/wconstab
2023-12-15 22:00:29 +00:00
1f3bdf40ad [export] Update schema version (#115712)
Since the PyTorch 2.1 release we've made some BC-breaking changes to the serialized schema. We should update it in time for the 2.2 release. Some of the changes include:

* https://github.com/pytorch/pytorch/pull/114371 - custom class objects / pybinded objects are no longer saved directly to the `ExportedProgram` structure. Instead, the name is serialized inside the program, and the actual bytes are stored in a separate location from the exported program, allowing them to be saved to a different location.
* https://github.com/pytorch/pytorch/pull/111204 - `GraphSignature` structure changed and `call_spec` is removed from the `GraphModule` schema
* https://github.com/pytorch/pytorch/pull/111407 - `loss_outout` -> `loss_output`
* https://github.com/pytorch/pytorch/pull/113075 - `example_inputs` removed from the `ExportedProgram` structure (this originally did not store anything), `dialect` added to the `ExportedProgram` structure.
* https://github.com/pytorch/pytorch/pull/113689 - tensor constants are now lifted as inputs to the graph, and their locations are stored in the `GraphSignature`
* https://github.com/pytorch/pytorch/pull/114172 - removed `equality_constraints` and added a `SymExprHint` for all symbolic expressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115712
Approved by: https://github.com/gmagogsfm
2023-12-15 21:43:03 +00:00
715d663794 [inductor] split test_cpp_wrapper.py into cpu and cuda test files (#115479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115479
Approved by: https://github.com/atalman
ghstack dependencies: #115167
2023-12-15 21:21:10 +00:00
50c9665f92 Revert "[export] Support torch.sym* ops (#115854)"
This reverts commit 347cb91946318eaedc350c2c3cda659d1cbde931.

Reverted https://github.com/pytorch/pytorch/pull/115854 on behalf of https://github.com/atalman due to OSSCI oncall, broke multple jobs ([comment](https://github.com/pytorch/pytorch/pull/115854#issuecomment-1858486796))
2023-12-15 21:07:52 +00:00
80a9625d9f Revert "non-strict export with dynamic shapes (#115862)"
This reverts commit 1bb0d0fc1f1da750206fad45f32e9564f0edd1f4.

Reverted https://github.com/pytorch/pytorch/pull/115862 on behalf of https://github.com/atalman due to OSSCI oncall, failing trunk / macos-12-py3-arm64 / test ([comment](https://github.com/pytorch/pytorch/pull/115862#issuecomment-1858482486))
2023-12-15 21:04:12 +00:00
1bb0d0fc1f non-strict export with dynamic shapes (#115862)
Differential Revision: [D52175048](https://our.internmc.facebook.com/intern/diff/D52175048/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115862
Approved by: https://github.com/zhxchen17
2023-12-15 20:11:30 +00:00
347cb91946 [export] Support torch.sym* ops (#115854)
Fixes https://github.com/pytorch/pytorch/issues/108830 and https://github.com/pytorch/executorch/issues/1379#issuecomment-1853322866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115854
Approved by: https://github.com/zhxchen17
2023-12-15 20:08:04 +00:00
6c2103bdf7 Fixed some failing inductor tests with exact_dtype=True (#115828)
Addresses point 1 from #115742: fixing  CPUReproTest.test_embedding_vec_bf16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115828
Approved by: https://github.com/peterbell10
2023-12-15 20:02:19 +00:00
91b848bf81 Revert "markDynamoStrictTest on more tests (#115879)"
This reverts commit 8b650cdd3cdd1174b399f312ec2f7955551a2f5d.

Reverted https://github.com/pytorch/pytorch/pull/115879 on behalf of https://github.com/atalman due to OSSCI oncall, broke inductor ([comment](https://github.com/pytorch/pytorch/pull/115879#issuecomment-1858418921))
2023-12-15 20:00:09 +00:00
c006c8b50e Revert "markDynamoStrictTest some more (#115885)"
This reverts commit 55ce4693ff2c0b6e50b8af323f36ecc7ff929638.

Reverted https://github.com/pytorch/pytorch/pull/115885 on behalf of https://github.com/atalman due to OSSCI oncall, broke inductor ([comment](https://github.com/pytorch/pytorch/pull/115885#issuecomment-1858409669))
2023-12-15 19:51:24 +00:00
61abacf829 [tp] improve documentation (#115880)
Improve the TP documentation in terms of format and descriptions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115880
Approved by: https://github.com/XilunWu
2023-12-15 18:44:22 +00:00
d5115bfb06 Revert "[AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115831)"
This reverts commit 287a86567731ff4d87f71dcd285d0ab4253cfceb.

Reverted https://github.com/pytorch/pytorch/pull/115831 on behalf of https://github.com/desertfire due to rocm CI failure ([comment](https://github.com/pytorch/pytorch/pull/115831#issuecomment-1858322270))
2023-12-15 18:34:55 +00:00
72eab5aa43 Configures distributed_checkpoint label (#115833)
Configures the existing `module: distributed_checkpoint` label
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115833
Approved by: https://github.com/wconstab, https://github.com/wz337
2023-12-15 18:17:25 +00:00
1b506e7469 Revert "non-strict export with dynamic shapes (#115862)"
This reverts commit f54bb1ed566f27affff9fdbd5c1ceee854ef2de5.

Reverted https://github.com/pytorch/pytorch/pull/115862 on behalf of https://github.com/atalman due to OSSCI oncall, failing trunk / macos-12-py3-arm64 / test ([comment](https://github.com/pytorch/pytorch/pull/115862#issuecomment-1858197497))
2023-12-15 17:03:42 +00:00
7ed2bc7c67 [GHF] Do not block reverts with internal changes (#115903)
As the check is more often than not unreliable, it is better to just post a
warning and let the revert proceed.

Fixes https://github.com/pytorch/test-infra/issues/4797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115903
Approved by: https://github.com/clee2000, https://github.com/atalman
2023-12-15 17:00:07 +00:00
f54bb1ed56 non-strict export with dynamic shapes (#115862)
Differential Revision: [D52175048](https://our.internmc.facebook.com/intern/diff/D52175048/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115862
Approved by: https://github.com/zhxchen17
2023-12-15 16:38:45 +00:00
b062ea3803 [ROCm] add hipblaslt support (#114329)
Disabled by default. Enable with env var DISABLE_ADDMM_HIP_LT=0. Tested on both ROCm 5.7 and 6.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114329
Approved by: https://github.com/malfet
2023-12-15 15:36:46 +00:00
287a865677 [AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115831)
Summary: Both ExternKernelAlloc and ExternKernelOut need the two fields, so declare them in the base class. Also add cpp codegen for IndexPutFallback and InplaceBernoulliFallback in this PR.

Differential Revision: [D52189999](https://our.internmc.facebook.com/intern/diff/D52189999)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115831
Approved by: https://github.com/chenyang78
2023-12-15 14:40:44 +00:00
66994bca5f Revert "[inductor] split test_cpp_wrapper.py into cpu and cuda test files (#115479)"
This reverts commit 653acd8fe1d0a7b4a084a47ee022f163015fee64.

Reverted https://github.com/pytorch/pytorch/pull/115479 on behalf of https://github.com/desertfire due to will cause land race in fbcode because https://github.com/pytorch/pytorch/pull/115831 is already landed internally ([comment](https://github.com/pytorch/pytorch/pull/115479#issuecomment-1857979948))
2023-12-15 14:35:40 +00:00
55ce4693ff markDynamoStrictTest some more (#115885)
Featuring
test_native_mha.py
test_nn.py
test_prims.py
test_schema_check.py
test_serialization.py
test_show_pickle.py
test_sort_and_select.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115885
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870, #115871, #115879
2023-12-15 13:19:52 +00:00
8b650cdd3c markDynamoStrictTest on more tests (#115879)
Featuring:
test_mobile_optimizer.py
test_module_init.py
test_modules.py
test_multiprocessing.py
test_multiprocessing_spawn.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115879
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870, #115871
2023-12-15 13:19:52 +00:00
2d43e31aa9 Fix wrong behavior of is_alias_of and c10d::reducer on MTIA (#115553)
Reviewed By: kirteshpatil

Differential Revision: D51860023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115553
Approved by: https://github.com/fduwjj
2023-12-15 11:14:41 +00:00
4ea7430ffb [BE] Don't copy CuDNN libs twice (#115872)
- It was installed twice: once in the `/usr/local/cuda/lib64` folder and a second time in `/usr/lib64`
- And don't install CuDNN headers thrice, only in `/usr/local/cuda/include`
- Error on unknown CUDA version
- Modify bazel builds to look for cudnn in the `/usr/local/cuda` folder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115872
Approved by: https://github.com/huydhn
2023-12-15 09:47:14 +00:00
b4d6443bcf [Dynamo] Log innermost user frame filename & lineno for better error aggregation (#115899)
CompilationMetrics example:
```
frame_key='1',
co_name='fn',
co_filename='/data/users/ybliang/debug/debug1.py',
co_firstlineno=58,
cache_size=0,
accumulated_cache_size=0,
guard_count=None,
graph_op_count=None,
graph_node_count=None,
graph_input_count=None,
entire_frame_compile_time_s=None,
backend_compile_time_s=None,
fail_type="<class 'torch._dynamo.exc.Unsupported'>",
fail_reason='custome dict init with args/kwargs unimplemented',
fail_user_frame_filename='/data/users/ybliang/debug/debug1.py',
fail_user_frame_lineno=61
```
where:
* ```fail_type``` and ```fail_reason``` are exceptions inside of Dynamo.
* ```fail_user_frame_filename``` and ```fail_user_frame_lineno``` are where the original user code triggered the exception.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115899
Approved by: https://github.com/davidberard98, https://github.com/ydwu4
2023-12-15 08:24:55 +00:00
4edc921857 Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` data from other ranks. Then all ranks read accumulated data from other ranks. (effectively one-shot reduce-scatter + one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.

## Micro Benchmarks
![image](https://github.com/pytorch/pytorch/assets/4156752/7bd25ffc-cd5b-4acb-bd65-b01bc136726e)

![image](https://github.com/pytorch/pytorch/assets/4156752/3ced31b4-6c31-4f34-a2d8-c072df29ae0e)

![image](https://github.com/pytorch/pytorch/assets/4156752/5b942c05-4fcc-4ec9-ae29-12c64080bb1c)

## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.

`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initializes a `c10d::IntraNodeComm` for its ranks.
  - If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` whether to use intra-node allreduce and carries out the communication accordingly.

We currently detect two types of topologies from the NVLink connection mesh (a sketch of the resulting selection logic follows the list below):
- Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or a fully connected subset of a hybrid cube mesh)
  - `msg <= 256KB`: one-shot allreduce.
  - `256KB < msg <= 10MB`: two-shot allreduce.
  -  `msg > 10MB`: instructs the caller to fallback to NCCL.
- Hybrid cube mesh
  - `msg <= 256KB`: one-shot allreduce.
  - `msg > 256KB`: instructs the caller to fallback to NCCL.
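
As a rough illustration of the thresholds above, here is a hypothetical Python helper (the actual selection lives in `c10d::IntraNodeComm`, not in Python):

```python
KB, MB = 1024, 1024 * 1024

def select_allreduce_algo(topology: str, msg_bytes: int) -> str:
    # Mirrors the message-size thresholds listed above.
    if topology == "fully_connected":
        if msg_bytes <= 256 * KB:
            return "one-shot"
        if msg_bytes <= 10 * MB:
            return "two-shot"
        return "fallback-to-NCCL"
    if topology == "hybrid_cube_mesh":
        return "one-shot" if msg_bytes <= 256 * KB else "fallback-to-NCCL"
    return "fallback-to-NCCL"

print(select_allreduce_algo("fully_connected", 1 * MB))  # two-shot
```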

## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred up to 50MB. There might be room for improvement, but PyTorch does impose more constraints:
  - FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access.
  - PyTorch uses multiple processes to drive multiple devices. With cuda IPC, a device can only share a specific region with other devices. This means extra copies may be unavoidable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
2023-12-15 08:17:35 +00:00
cd47e335d1 [TEST] Skip test_schema_correctness for float8 dtype (#115757)
According to https://github.com/pytorch/pytorch/issues/107256#issuecomment-1705341870, the ops tested in `test_schema_correctness` are not yet supported with `torch.float8_e4m3fn`. Until they are supported, it is best to skip the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115757
Approved by: https://github.com/drisspg
2023-12-15 06:26:46 +00:00
c1c9b739e2 Back out "[aotinductor] replace lld with the default ld linker (#115478)" (#115875)
Summary:
Back out the diff

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115875
Approved by: https://github.com/chenyang78
2023-12-15 05:56:06 +00:00
478f0e96dc markDynamoStrictTest more tests (#115871)
For:
test_dispatch.py
test_fake_tensor.py
test_indexing.py
test_linalg.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115871
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870
2023-12-15 05:26:54 +00:00
7f686c8fe1 More markDynamoStrictTest (#115870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115870
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858
2023-12-15 05:26:54 +00:00
9ae0e62929 [PT2] [Quant] Change the QConv2d Binary post op name from add to sum (#115329)
**Summary**
Change the QConv2d Binary fusion post op name from `add` to `sum`, since we are actually using OneDNN `post op sum` instead of `Binary_Add` for now.

**TestPlan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_sum_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_float_output_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115329
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-12-15 05:10:47 +00:00
653acd8fe1 [inductor] split test_cpp_wrapper.py into cpu and cuda test files (#115479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115479
Approved by: https://github.com/atalman
ghstack dependencies: #115167
2023-12-15 04:04:08 +00:00
eqy
9056903b09 [CUDA] 64-bit indexing for avg_pool_backward (#114193)
Fixes #113833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114193
Approved by: https://github.com/malfet
2023-12-15 03:58:46 +00:00
8e2d63cbc3 [export][reland] Remove runtime assertion pass (#115597)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/115196
D52054112 to fix internal failures.

Test Plan: CI

Differential Revision: D52054110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115597
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
2023-12-15 03:22:03 +00:00
7d4ccd7b9e [AOTI][refactor][2/n] Rename kernel to python_kernel_name (#115766)
Differential Revision: [D52164940](https://our.internmc.facebook.com/intern/diff/D52164940)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115766
Approved by: https://github.com/chenyang78
ghstack dependencies: #115783
2023-12-15 03:08:13 +00:00
8e1cff96e3 [C10D] Log PG size in init log (#115807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115807
Approved by: https://github.com/XilunWu
2023-12-15 02:38:54 +00:00
5989e1222d [BE] Set torch.cuda.has_half to True (#115884)
This check was introduced by https://github.com/pytorch/pytorch/pull/5417 and then turned into a tautology by https://github.com/pytorch/pytorch/pull/10147

So I guess it's time to let go of all that dynamic initialization (and maybe just delete it in 2.3?)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115884
Approved by: https://github.com/kit1980
2023-12-15 02:30:55 +00:00
a8e354a9a0 [sparse][semi-structured] enable fp32 support, separate sparse and dense constraints (#115550)
Summary:

Both cuSPARSELt and CUTLASS support 1:2 semi-structured sparsity for
fp32, which this PR enables (thanks @alexsamardzic).

Furthermore, this PR also updates the sparse_config to take into account
the different shape constraints for sparse and dense matrices.

Technically, cuSPARSELt supports smaller sparse matrix constraints as it
seems to pad to the CUTLASS constraints under the hood. However, in
practice small sparse matrices are not commonly used and we care more
about the dense constraints for LLM inference.

For now, we keep the CUTLASS constraints in place for both cuSPARSELt
and CUTLASS tensors

This PR also reconnects the _FUSE_TRANSPOSE flag for cuSPARSELt tensors.

Test Plan:
```
python test/test_sparse_semi_structured.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115550
Approved by: https://github.com/cpuhrsch
2023-12-15 02:28:17 +00:00
6d5fe07659 Fix numpy warning when importing torch without numpy installed (#115867)
Fixes #115638

I verified locally that with no numpy installed, the warning no longer occurs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115867
Approved by: https://github.com/soulitzer
2023-12-15 02:22:12 +00:00
9e84d0fa60 [MPS] Fix opposite error message in empty_mps (#115746)
Fixes #115625
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115746
Approved by: https://github.com/mikaylagawarecki
2023-12-15 01:31:40 +00:00
85262b0a9e markDynamoStrictTest some test_cpp_extensions.* (#115858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115858
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857
2023-12-15 01:22:38 +00:00
8ddca5aeae markDynamoStrictTest some more tests (#115857)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115857
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856
2023-12-15 01:22:38 +00:00
3477a2ee03 unMarkDynamoStrictTest on OpInfo-based tests (#115856)
These take too long to run under strict mode. We'll worry about them
later. Note that these decorators don't do anything yet (unless we flip
the default from non-strict to strict).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115856
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855
2023-12-15 01:22:31 +00:00
0722ce35f5 Increase number of Dynamo shards from 2->7 (#115855)
In preparation for ~3x increased test time coming in the upcoming PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115855
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845
2023-12-15 01:22:24 +00:00
4ccd8eb613 Add Dynamo test expected failure mechanism (#115845)
Tests that are added to a list in dynamo_test_failures.py will
automatically be marked as expectedFailure when run with
PYTORCH_TEST_WITH_DYNAMO=1. I'm splitting this PR off on its own so that
I can test various things on top of it.

Also added an unMarkDynamoStrictTest that is not useful until we turn
on strict mode by default.
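
A hedged sketch of the intended decorator usage (the import path is assumed here, not taken from this PR):

```python
from torch.testing._internal.common_utils import TestCase, markDynamoStrictTest, run_tests

@markDynamoStrictTest
class TestExample(TestCase):
    def test_add(self):
        self.assertEqual(1 + 1, 2)

if __name__ == "__main__":
    run_tests()
```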

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115845
Approved by: https://github.com/voznesenskym
2023-12-15 01:22:17 +00:00
5477120ebf [executorch] Update iOS toolchain with a modern cmake syntax. (#115799)
Summary: Replace exec_program with execute_process

Test Plan: CI

Differential Revision: D52147108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115799
Approved by: https://github.com/huydhn
2023-12-15 00:51:30 +00:00
f90a5f891b [AOTI][refactor][1/n] Rename cpp_kernel to cpp_kernel_name (#115783)
Differential Revision: [D52142184](https://our.internmc.facebook.com/intern/diff/D52142184)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115783
Approved by: https://github.com/chenyang78, https://github.com/jansel
2023-12-15 00:50:17 +00:00
1b8599283f Optimize quantized max pool 2d (#115690)
Summary:
We do not need to dequantize and quantize again for this op.

With this optimization cunet-enc ops:

vulkan.quantized_max_pool2d_quint8{48, 36, 2}                       207532
vulkan.quantized_max_pool2d_quint8{24, 18, 4}                        78832
vulkan.quantized_max_pool2d_quint8{12, 9, 8}                         49296

Without optimization:
vulkan.quantized_max_pool2d_quint8{48, 36, 2}                       234416
vulkan.quantized_max_pool2d_quint8{24, 18, 4}                        94380
vulkan.quantized_max_pool2d_quint8{12, 9, 8}                         58760

Test Plan:
Ensure all vulkan quantize tests pass:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest

...
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[  PASSED  ] 78 tests.

buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"

Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest
...
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 395 tests from VulkanAPITest (6515 ms total)

[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[  PASSED  ] 394 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 5 DISABLED TESTS

Reviewed By: yipjustin, copyrightly

Differential Revision: D50998619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115690
Approved by: https://github.com/SS-JIA
2023-12-15 00:45:37 +00:00
6fee208064 Handle -1 in jagged layout NT view ops (#115843)
Allows for inheriting the ragged and batch dims via -1:
```python
nt.view(-1, -1, D)
nt.expand(B, -1, D)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115843
Approved by: https://github.com/soulitzer
ghstack dependencies: #115636
2023-12-15 00:42:47 +00:00
c947ed1135 [BE][ROCm] Use modern C++ (#115844)
This removes global (but ROCM_ONLY) `over_arch` and `gcn_arch_override_flag` variables in favor of block-level static initialization introduced in C++11.

To quote from [ISO/IEC 14882-2014](https://www.open-std.org/jtc1/sc22/wg21/docs/standards)
>The zero-initialization (8.5) of all block-scope variables with static storage duration (3.7.1) or thread storage
> duration (3.7.2) is performed before any other initialization takes place. Constant initialization (3.6.2) of a
> block-scope entity with static storage duration, if applicable, is performed before its block is first entered.
> An implementation is permitted to perform early initialization of other block-scope variables with static or
> thread storage duration under the same conditions that an implementation is permitted to statically initialize
> a variable with static or thread storage duration in namespace scope (3.6.2). Otherwise such a variable is
> initialized the first time control passes through its declaration; such a variable is considered initialized upon
> the completion of its initialization. If the initialization exits by throwing an exception, the initialization
> is not complete, so it will be tried again the next time control enters the declaration. If control enters
> the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for
> completion of the initialization.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115844
Approved by: https://github.com/huydhn
2023-12-15 00:38:43 +00:00
7e6ec8d3db [ONNX] Add proper iobinding synchronize for ONNX cuda bench (#115773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115773
Approved by: https://github.com/thiagocrepaldi
ghstack dependencies: #115670, #115673
2023-12-15 00:37:32 +00:00
823523acc0 [ONNX] Dump sarif diagnostics for failed onnx exports in benchmark (#115673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115673
Approved by: https://github.com/thiagocrepaldi
ghstack dependencies: #115670
2023-12-15 00:37:32 +00:00
0959e67de3 [ONNX] Set correct cuda.current_device for multi-device onnx performance bench (#115670)
Otherwise `torch.cuda.synchronize()` works on a different device from the one that
runs the PyTorch model, which leads to incorrect performance numbers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115670
Approved by: https://github.com/thiagocrepaldi
2023-12-15 00:37:32 +00:00
59f7355f86 Revert "[ROCm] add hipblaslt support (#114329)"
This reverts commit bb2bb8cca1c00e3f6e7025a62688d0cfcbfee144.

Reverted https://github.com/pytorch/pytorch/pull/114329 on behalf of https://github.com/atalman due to OSSCI oncall, trunk  tests are failing ([comment](https://github.com/pytorch/pytorch/pull/114329#issuecomment-1857003155))
2023-12-14 23:53:30 +00:00
66b04e3cb7 [nccl flight recorder] nullptr profiling name (#115851)
Sometimes the profiling name can be a nullptr, which
throws on conversion to std::string. This adds a check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115851
Approved by: https://github.com/wconstab
2023-12-14 23:40:54 +00:00
21b8127f1c [Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)
Noticed that on many MRS kernels the grid wrapper for autotuning is huge, with a bunch of duplicates, because num_warps and num_stages are not needed for grid calculation. Let's deduplicate these entries.

Previously, we would see a wrapper like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
2023-12-14 23:26:04 +00:00
194d57dae7 Add values backward support for sparse CSR, CSC, BSR, and BSC tensors (#115586)
Fixes https://github.com/pytorch/pytorch/issues/107286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115586
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-12-14 23:09:13 +00:00
49d826bcd3 [dtensor] update op db tests (#115722)
This PR updates the op db test xfails; we should see whether we can
enable this again in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115722
Approved by: https://github.com/XilunWu
2023-12-14 22:49:13 +00:00
ef6a0faf89 [export] Fix canonicalization. (#115830)
Summary: Add the missing layout argument branch.

Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:export_package_sparse_toy_test

Differential Revision: D52166501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115830
Approved by: https://github.com/angelayi
2023-12-14 22:48:26 +00:00
bb2bb8cca1 [ROCm] add hipblaslt support (#114329)
Disabled by default. Enable with env var DISABLE_ADDMM_HIP_LT=0. Tested on both ROCm 5.7 and 6.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114329
Approved by: https://github.com/malfet
2023-12-14 21:41:22 +00:00
04ef21f5dd [C10D] Make dumpDebuggingInfo share a mutex across PGs (#115803)
The mutex was originally added to avoid racing to dump debuginfo,
where a race in this case would result in a corrupted dump file.

The reason a mutex helps is that it forces all dump requests to be
serialized, so that an observer would either see an in-progress file, a
complete file, or no file.  Without a mutex, a fourth state is possible
(a file that has been written to by multiple threads and is invalid).

Because the mutex was a ProcessGroupNCCL class member, and each PG
instance has its own watchdog thread that can launch a dump, it was not
doing its job.  Making the mutex static shares it between instances of
the class and ensures serialization of dumps triggered by any PG.

(Note: dumps triggered by different PGs have the same, global contents
anyway; there is only one global flight recorder, so it doesn't matter
who triggers it.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115803
Approved by: https://github.com/kwen2501
ghstack dependencies: #115771, #115798, #115800, #115801
2023-12-14 21:17:44 +00:00
7ecddaef23 Revert "Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)"
This reverts commit adfbd2b219f4995d3f13870927022b67550f8b0e.

Reverted https://github.com/pytorch/pytorch/pull/114001 on behalf of https://github.com/atalman due to OSSCI oncall, breaks periodic jobs ([comment](https://github.com/pytorch/pytorch/pull/114001#issuecomment-1856539040))
2023-12-14 20:33:10 +00:00
67232199b1 [dynamo] Log shape_env_guard_count separately from guard_count (#115776)
guard_count counts all the shape_env guards as a single guard; log the shape_env_guard_count separately so those metrics can be used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115776
Approved by: https://github.com/yanboliang
2023-12-14 20:12:49 +00:00
eqy
353f2dbd9c [CUDA] Fix V100 expected failures in test_mm_decomp and test_linalg (#115666)
BFloat16 isn't supported on sm70 and we get an unexpected cuBLAS success in 12.3+

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115666
Approved by: https://github.com/malfet
2023-12-14 19:17:53 +00:00
28e37d4f3b Update Triton pin (#115743)
To include a cherry-pick of https://github.com/openai/triton/pull/2771 that should fix  cuda-11.8 runtime issues

Also, tweak the build wheel script to update both the ROCm and vanilla Triton build versions to 2.2 (even though on trunk it should probably be 3.3 already)

TODO: Remove `ROCM_TRITION_VERSION` once both trunk and ROCM version are in sync again

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115743
Approved by: https://github.com/davidberard98
2023-12-14 18:54:24 +00:00
87547a26b8 [aotinductor] add no weight change version of fuse_parallel_linear (#115791)
Summary: We need a new version of fuse_parallel_linear w/o creating new weights for real-time update.

Reviewed By: khabinov

Differential Revision: D52128296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115791
Approved by: https://github.com/khabinov
2023-12-14 18:36:17 +00:00
ca4caf4eac Revert "[inductor] Do variance calculation in opmath type (#115181)"
This reverts commit 42390a097b987cd3384511c3df3747699f2281f4.

Reverted https://github.com/pytorch/pytorch/pull/115181 on behalf of https://github.com/atalman due to OSSCI oncall, broke periodic tests ([comment](https://github.com/pytorch/pytorch/pull/115181#issuecomment-1856360644))
2023-12-14 18:21:49 +00:00
0fe014bd8a [C10D] Change PGNCCL logs to prefix [PG {} Rank {}] (#115801)
Adds a PG {process group uid} prefix component to logs.

This is helpful in situations where there are multiple process groups,
and rank information by itself is confusing.  (For example, rank0 on PG1
may correspond to rank3 on PG0.  People may assume 'rank0' references
the global (PG0) world, but it may reference a sub-pg.  Prefacing the PG
helps clarify this.)

Does NOT change logs from inside WorkNCCL functions, since WorkNCCL
doesn't know what PG ID it corresponds to. Will address these logs
separately.

Example:

```
[I ProcessGroupNCCL.cpp:787] [PG 0 Rank 0] ProcessGroupNCCL initialization ...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115801
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798, #115800
2023-12-14 18:17:16 +00:00
e94267587b [C10D] Refactor NCCL logs to use common prefix helper (#115800)
Put the repeated code that string formats [Rank {rank}] in one place.

Sets up for the next PR that also adds more info to this prefix.

(Does not change exception messages, which could be done as well.
Exception messages are not formatted quite the same way. This PR tries
instead to keep from changing log behavior and only
refactors code.)

Did limited testing (some logs were observed OK).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115800
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798
2023-12-14 18:13:24 +00:00
eb6e70cf66 [C10D] Only open NCCL dump pipe file once per process (#115798)
The NCCL flight recorder is per-process (it is shared by all
processgroups), but individual process groups used to construct their
own pipe for being signaled to dump the flight recorder.

This ensures that only one pipe per process is created, by only creating
the pipe on the first ProcessGroup (uid_ == 0) which should be the world
group.

Filenames are still keyed off of rank, but this should now be global
rank instead of sub-pg rank, making the filenames unique across the
whole trainer process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115798
Approved by: https://github.com/zdevito
ghstack dependencies: #115771
2023-12-14 17:48:26 +00:00
74d2b9dd15 [C10D] Make DumpPipe disabled when FlightRecorder disabled (#115771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115771
Approved by: https://github.com/fduwjj
2023-12-14 17:42:46 +00:00
b618869208 [inductor] label cpp test files with oncall: cpu inductor (#115167)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115167
Approved by: https://github.com/atalman
2023-12-14 17:39:27 +00:00
c80e2d5bb2 [fbcode] consolidate usage of fp8 linears for inference models (#115808)
Summary:
ATT, this will use the implementation of D51812709 for fp8 linears.

Meanwhile, it also adds a use case of delay quantization.

Test Plan:
```
CUDA_VISIBLE_DEVICES=7 buck run mode/opt  -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 -c fbcode.use_link_groups=false caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --local-model /home/xiaoruichao/test_models/463113248.input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR --fp8-linear-quantization-type delay_quantization --disable-acc-tracer-aot-inductor
```

```
CUDA_VISIBLE_DEVICES=7 buck run mode/opt  -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 -c fbcode.use_link_groups=false caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --local-model /home/xiaoruichao/test_models/463113248.input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR --fp8-linear-quantization-type delay_quantization --disable-acc-tracer-aot-inductor
```

Reviewed By: tter1

Differential Revision: D51840344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115808
Approved by: https://github.com/ipiszy
2023-12-14 16:59:48 +00:00
5bddbed399 Initial Flash Attention support on ROCM (#114309)
This pull request adds initial Flash Attention support for the AMD/ROCm platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCm. This Triton submodule is not used at runtime and will not be shipped in the final PyTorch package. We plan to release this specialized Triton as a separate project.

Known limitations (see the usage sketch after this list):

- [ ] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`.
- [ ] Only supports power of two sequence lengths.
- [ ] No support for varlen APIs.
- [ ] Only support head dimension 16,32,64,128.
- [ ] Performance is still being optimized.
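
A hedged usage sketch that stays within the limits above (power-of-two sequence length, head dim in {16, 32, 64, 128}); it assumes an MI200-class GPU exposed through the "cuda" device and uses only existing public APIs:

```python
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)  # (batch, heads, seq, head_dim)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# Restrict SDPA to the flash-attention backend.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```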

Fixes https://github.com/pytorch/pytorch/issues/112997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309

Approved by: https://github.com/jeffdaily, https://github.com/malfet

---------

Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>
2023-12-14 08:52:57 -08:00
ac60a70e06 Migrated loss functions to ModuleInfos (#115584)
Migrates most tests in `common_nn.py:criterion_tests` to ModuleInfos.

**I can split this up if it is too large to review**

What this PR does not include:
- [`no_batch_dim` tests](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L3995-L4112)
- [tests that use the functional variant of the loss function and `wrap_functional`](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L1079-L1128)

#### On test times
This PR increases test time by ~58s locally
Before this PR:
```
>>> python test/test_nn.py -k Loss
Ran 1003 tests in 28.977s
```
After this PR
```
>>> python test/test_nn.py -k Loss
Ran 368 tests in 23.073s
```

```
>>> python test/test_modules.py -k Loss
Ran 836 tests in 63.900s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115584
Approved by: https://github.com/janeyx99
ghstack dependencies: #115617
2023-12-14 16:21:05 +00:00
f727bed2e6 [inductor] Updated upsample_bilinear2d decomposition (#104182)
Description:
- Updated upsample_bilinear2d decomposition
  - added uint8 dtype support
  - code improvements
- Added uint8 dtype tests

Perf considerations:
- There is a minor perf regression (speed-up ~0.7x) for uint8, align_corners=True cases when the output is smaller than or equal to (256, 256)
- For cases where the output is larger than (256, 256) and the input dtype is uint8, the nightly output is wrong, so IMO the large perf regression (speed-up around ~0.2x) should not be taken into account.
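
A hedged sketch (not from the PR) of the kind of call benchmarked below: a compiled uint8 bilinear downsample on CPU; shapes and options are illustrative.

```python
import torch
import torch.nn.functional as F

# Compile a bilinear resize; the updated decomposition handles uint8 inputs directly
resize = torch.compile(
    lambda t: F.interpolate(t, size=(256, 256), mode="bilinear", align_corners=False)
)

x = torch.randint(0, 256, (1, 3, 500, 400), dtype=torch.uint8)
out = resize(x)  # uint8 in, uint8 out
```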

## Perfs benchmarks

```
[--------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cpu --------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                    |  Eager (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitde89a53) Nightly  |  speed-up PR vs Nightly  |  Eager (2.3.0a0+gitde89a53) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)       |        565.212 (+-3.548)        |        1384.210 (+-10.798)         |           1230.996 (+-32.930)           |     0.889 (+-0.000)      |          566.253 (+-1.526)
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)      |        565.404 (+-1.614)        |         1491.649 (+-7.763)         |            2974.959 (+-6.006)           |     1.994 (+-0.000)      |          566.476 (+-1.742)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)           |        270.761 (+-0.861)        |         1557.777 (+-4.699)         |            1080.919 (+-4.243)           |     0.694 (+-0.000)      |          269.829 (+-0.986)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)          |        270.960 (+-0.995)        |        1723.913 (+-12.433)         |            3191.938 (+-6.194)           |     1.852 (+-0.000)      |          269.962 (+-1.657)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)     |        1555.884 (+-5.169)       |         1178.753 (+-4.957)         |            1910.445 (+-5.988)           |     1.621 (+-0.000)      |          1560.804 (+-6.793)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)    |        1651.193 (+-6.952)       |         1323.466 (+-6.059)         |            3374.842 (+-8.168)           |     2.550 (+-0.000)      |          1653.497 (+-8.018)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)         |        978.482 (+-10.183)       |         1383.768 (+-4.341)         |            2147.841 (+-6.581)           |     1.552 (+-0.000)      |          979.983 (+-1.499)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)        |        1074.472 (+-5.031)       |         1414.912 (+-5.754)         |           3590.968 (+-10.042)           |     2.538 (+-0.000)      |          1074.589 (+-3.948)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)       |        2168.703 (+-8.964)       |        5400.528 (+-26.628)         |           4777.299 (+-11.891)           |     0.885 (+-0.000)      |          2168.133 (+-7.667)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)      |       2169.132 (+-12.618)       |        6583.866 (+-28.959)         |           11986.894 (+-45.838)          |     1.821 (+-0.000)      |         2174.488 (+-10.317)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)           |        992.808 (+-6.086)        |         5985.028 (+-9.532)         |            4334.158 (+-9.423)           |     0.724 (+-0.000)      |          989.604 (+-5.499)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)          |        987.618 (+-6.350)        |        6963.044 (+-28.885)         |           15441.096 (+-55.324)          |     2.218 (+-0.000)      |          985.573 (+-5.159)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)     |       6695.557 (+-35.067)       |        4657.603 (+-14.220)         |           8058.708 (+-41.684)           |     1.730 (+-0.000)      |         6714.996 (+-38.626)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)    |       7040.481 (+-39.486)       |        5445.704 (+-16.659)         |           13906.618 (+-53.298)          |     2.554 (+-0.000)      |         7034.453 (+-44.626)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)         |       3926.186 (+-10.660)       |        5741.433 (+-12.748)         |           9356.036 (+-40.848)           |     1.630 (+-0.000)      |         3930.598 (+-17.086)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)        |        4308.536 (+-9.607)       |        6122.755 (+-47.278)         |           15637.567 (+-54.392)          |     2.554 (+-0.000)      |         4307.463 (+-11.268)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)     |       2512.740 (+-10.860)       |         1573.590 (+-5.061)         |            451.355 (+-1.210)            |     0.287 (+-0.000)      |         2511.727 (+-10.930)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)    |       2489.926 (+-11.915)       |         1537.233 (+-4.212)         |            2501.470 (+-7.446)           |     1.627 (+-0.000)      |         2500.000 (+-12.155)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)         |        632.032 (+-2.108)        |         1496.994 (+-4.194)         |            404.759 (+-1.064)            |     0.270 (+-0.000)      |          630.122 (+-4.086)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)        |        629.174 (+-4.386)        |         1708.935 (+-8.817)         |            2643.296 (+-9.723)           |     1.547 (+-0.000)      |          628.388 (+-1.326)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)   |        4409.941 (+-8.016)       |         1160.133 (+-4.698)         |            1897.089 (+-9.392)           |     1.635 (+-0.000)      |         4450.959 (+-10.438)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)  |       4493.427 (+-11.703)       |         1329.226 (+-4.740)         |           2835.872 (+-12.241)           |     2.133 (+-0.000)      |          4506.973 (+-9.914)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)       |        901.712 (+-4.071)        |         1320.739 (+-5.197)         |            2207.605 (+-8.219)           |     1.671 (+-0.000)      |          904.757 (+-4.558)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)      |        990.080 (+-3.922)        |         1702.563 (+-7.909)         |           3074.196 (+-10.478)           |     1.806 (+-0.000)      |          990.482 (+-4.444)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)     |       9785.550 (+-58.445)       |        6135.680 (+-33.569)         |           1628.572 (+-19.770)           |     0.265 (+-0.000)      |         9893.606 (+-62.377)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)    |       9710.191 (+-57.597)       |        6066.824 (+-36.364)         |           10469.110 (+-42.775)          |     1.726 (+-0.000)      |         9919.022 (+-72.190)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)         |       2790.356 (+-12.188)       |        6134.101 (+-28.694)         |            1576.832 (+-6.030)           |     0.257 (+-0.000)      |         2761.122 (+-11.503)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)        |       2778.711 (+-13.603)       |        6608.528 (+-37.776)         |           10841.549 (+-49.429)          |     1.641 (+-0.000)      |         2753.037 (+-10.995)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)   |      45533.868 (+-102.618)      |         4962.994 (+-8.215)         |           9003.968 (+-38.179)           |     1.814 (+-0.000)      |        43531.261 (+-102.951)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)  |       45932.699 (+-81.207)      |        5595.682 (+-11.482)         |           12302.907 (+-50.254)          |     2.199 (+-0.000)      |         43916.455 (+-80.468)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)       |        3827.804 (+-8.057)       |        6311.580 (+-25.021)         |           11760.614 (+-51.531)          |     1.863 (+-0.000)      |         3849.959 (+-10.848)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)      |        4169.007 (+-8.452)       |        6820.716 (+-35.310)         |           15264.633 (+-49.982)          |     2.238 (+-0.000)      |         4183.875 (+-19.104)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)       |        1306.914 (+-7.470)       |        10598.101 (+-38.410)        |           2678.031 (+-11.051)           |     0.253 (+-0.000)      |          1307.470 (+-8.519)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)      |        1307.268 (+-8.197)       |        10161.123 (+-45.643)        |           17148.842 (+-55.402)          |     1.688 (+-0.000)      |          1308.077 (+-8.553)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)           |        548.574 (+-2.157)        |        10072.806 (+-41.368)        |            2408.971 (+-6.997)           |     0.239 (+-0.000)      |          547.726 (+-1.721)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)          |        546.664 (+-1.484)        |        11123.694 (+-43.636)        |           18058.070 (+-48.552)          |     1.623 (+-0.000)      |          547.151 (+-1.627)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)     |       7935.051 (+-71.022)       |        7654.533 (+-29.512)         |           12414.194 (+-87.450)          |     1.622 (+-0.000)      |         7900.056 (+-53.997)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)    |       8546.732 (+-53.118)       |        8583.572 (+-35.656)         |          19111.824 (+-166.978)          |     2.227 (+-0.000)      |         8515.433 (+-63.300)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)         |       6202.642 (+-34.355)       |        8915.622 (+-62.293)         |           14327.295 (+-52.188)          |     1.607 (+-0.000)      |         6213.329 (+-39.740)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)        |       6811.128 (+-33.747)       |        9647.316 (+-50.837)         |           20830.594 (+-62.979)          |     2.159 (+-0.000)      |         6822.512 (+-37.092)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)       |       5079.586 (+-19.067)       |        42238.442 (+-87.643)        |           11282.141 (+-42.477)          |     0.267 (+-0.000)      |         5104.234 (+-17.706)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)      |       5079.575 (+-16.306)       |        41512.995 (+-83.710)        |          68789.816 (+-440.001)          |     1.657 (+-0.000)      |         5097.446 (+-21.724)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)           |        2039.974 (+-8.614)       |       42322.773 (+-111.866)        |           10399.237 (+-43.140)          |     0.246 (+-0.000)      |         2043.808 (+-10.707)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)          |       2036.214 (+-10.083)       |        44353.281 (+-71.548)        |          73340.412 (+-324.780)          |     1.654 (+-0.000)      |          2039.000 (+-9.554)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)     |       33821.523 (+-96.639)      |        30552.094 (+-65.023)        |          49494.486 (+-872.916)          |     1.620 (+-0.000)      |         33844.404 (+-92.466)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)    |      36196.104 (+-128.169)      |        34038.432 (+-79.697)        |          75761.226 (+-905.194)          |     2.226 (+-0.000)      |         36260.473 (+-94.642)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)         |       24827.821 (+-77.335)      |        37006.218 (+-86.318)        |          61297.625 (+-898.192)          |     1.656 (+-0.000)      |         24823.275 (+-80.945)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)        |       27266.138 (+-70.262)      |        40109.475 (+-94.248)        |          92086.075 (+-404.922)          |     2.296 (+-0.000)      |         27287.992 (+-89.507)

Times are in microseconds (us).

[--------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cuda ---------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                      |  Eager (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitde89a53) Nightly  |  speed-up PR vs Nightly  |  Eager (2.3.0a0+gitde89a53) Nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)   |         98.259 (+-0.014)        |          97.156 (+-0.008)          |             97.443 (+-0.031)            |     1.003 (+-0.000)      |           98.248 (+-0.021)
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)  |         97.048 (+-0.016)        |          97.480 (+-0.018)          |             96.819 (+-0.126)            |     0.993 (+-0.000)      |           97.045 (+-0.015)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)       |         97.944 (+-0.028)        |          91.686 (+-0.411)          |             93.894 (+-1.011)            |     1.024 (+-0.000)      |           97.933 (+-0.008)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)      |         98.008 (+-0.011)        |          91.205 (+-0.346)          |             96.854 (+-0.058)            |     1.062 (+-0.000)      |           97.203 (+-0.010)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)   |        384.318 (+-0.011)        |         382.793 (+-0.007)          |            382.472 (+-0.011)            |     0.999 (+-0.000)      |          384.701 (+-0.012)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)  |        384.266 (+-0.009)        |         385.333 (+-0.024)          |            382.554 (+-0.022)            |     0.993 (+-0.000)      |          384.386 (+-0.016)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)       |        383.924 (+-0.011)        |         570.071 (+-0.030)          |            545.615 (+-0.051)            |     0.957 (+-0.000)      |          384.044 (+-0.012)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)      |        384.184 (+-0.016)        |         560.857 (+-0.026)          |            552.447 (+-0.040)            |     0.985 (+-0.000)      |          384.063 (+-0.016)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)   |        122.188 (+-0.053)        |         116.744 (+-1.006)          |            163.762 (+-0.015)            |     1.403 (+-0.000)      |          121.874 (+-0.015)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)  |        122.156 (+-0.012)        |         182.692 (+-0.013)          |            161.653 (+-0.018)            |     0.885 (+-0.000)      |          121.926 (+-0.014)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)       |        105.852 (+-0.324)        |         119.545 (+-0.294)          |            190.527 (+-0.023)            |     1.594 (+-0.000)      |          105.999 (+-0.446)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)      |        106.507 (+-0.282)        |         120.060 (+-0.257)          |            162.330 (+-0.012)            |     1.352 (+-0.000)      |          106.567 (+-0.385)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)   |        447.907 (+-0.015)        |         463.863 (+-1.779)          |            650.492 (+-0.331)            |     1.402 (+-0.000)      |          446.596 (+-0.017)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)  |        447.750 (+-0.017)        |         723.832 (+-0.170)          |            641.539 (+-0.075)            |     0.886 (+-0.000)      |          446.467 (+-0.019)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)       |        439.549 (+-0.031)        |         507.772 (+-2.879)          |            758.795 (+-0.482)            |     1.494 (+-0.000)      |          440.372 (+-0.025)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)      |        439.538 (+-0.029)        |         509.260 (+-2.704)          |            654.195 (+-2.621)            |     1.285 (+-0.000)      |          440.362 (+-0.026)

Times are in microseconds (us).
```

[Source](f4751a3196/perf_interp_mode.py), [Output](899f34c024/output/20231213-214209-upsample-bilinear-pr_vs_nightly-speedup.md)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104182
Approved by: https://github.com/lezcano
2023-12-14 14:50:06 +00:00
28e4004286 Add doc for torch.distributed.breakpoint (#115656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115656
Approved by: https://github.com/wanchaol, https://github.com/fegin
ghstack dependencies: #115705
2023-12-14 14:45:36 +00:00
cyy
fcb95bf31b [2/N] Use std::in_place (#115480)
Remove c10/util/in_place.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115480
Approved by: https://github.com/soulitzer
2023-12-14 12:54:22 +00:00
6500ccebd7 enable fp16 autocast for dynamo benchmark (#114088)
`--amp` enables the amp path for `CUDA` (default amp_dtype will be float16) and `CPU` (default amp_dtype will be bfloat16).

If users set `--amp_dtype`, the user-specified amp_dtype takes the highest priority.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114088
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-12-14 12:38:44 +00:00
afe6d272c6 Fix buck OSS build after #115570 (#115804)
From #115570, `supports_shlib_interfaces` is only available in buck2 (https://buck2.build/docs/api/rules/), not in buck (https://buck.build/rule/cxx_library.html). The best long-term fix is probably to migrate OSS CI to buck2, so this is a temporary workaround; the fix from #115570 is only needed internally anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115804
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-12-14 08:33:07 +00:00
adfbd2b219 Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` data from other ranks. Then all ranks read accumulated data from other ranks. (effectively one-shot reduce-scatter + one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.

## Micro Benchmarks
![image](https://github.com/pytorch/pytorch/assets/4156752/7bd25ffc-cd5b-4acb-bd65-b01bc136726e)

![image](https://github.com/pytorch/pytorch/assets/4156752/3ced31b4-6c31-4f34-a2d8-c072df29ae0e)

![image](https://github.com/pytorch/pytorch/assets/4156752/5b942c05-4fcc-4ec9-ae29-12c64080bb1c)

## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.

`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initialize a `c10d::IntraNodeComm` for its ranks.
  - If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` whether to use intra-node allreduce and carries out the communication accordingly.

We currently detect two types of topologies from the NVLink connection mesh:
- Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or a fully connected subset of a hybrid cube mesh)
  - `msg <= 256KB`: one-shot allreduce.
  - `256KB < msg <= 10MB`: two-shot allreduce.
  - `msg > 10MB`: instructs the caller to fall back to NCCL.
- Hybrid cube mesh
  - `msg <= 256KB`: one-shot allreduce.
  - `msg > 256KB`: instructs the caller to fall back to NCCL.
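
A hedged launch sketch of the opt-in described above; the process-group setup and message size are illustrative.

```python
import os
import torch
import torch.distributed as dist

# Must be set before the NCCL process group is created
os.environ["ENABLE_INTRA_NODE_COMM"] = "1"

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

t = torch.ones(1024, device="cuda")  # small message, eligible for the one-shot path
dist.all_reduce(t)                   # ProcessGroupNCCL consults IntraNodeComm and may take it
```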

## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred up to 50MB. There might be room for improvement, but PyTorch does impose more constraints:
  - FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access.
  - PyTorch uses multiple processes to drive multiple devices. With cuda IPC, a device can only share a specific region with other devices. This means extra copies may be unavoidable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
2023-12-14 08:13:08 +00:00
36c6c0c7dc [pytree] expand tree_map to accept multi-inputs (#115642)
Fixes #115419
Fixes #91323
Closes #115549

- #115419
- #91323
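
A hedged sketch of the new multi-input form, assuming the signature follows `tree_map(fn, tree, *rests)` as in the linked issues:

```python
from torch.utils._pytree import tree_map

a = {"w": 1, "b": [2, 3]}
b = {"w": 10, "b": [20, 30]}

# Leaves of the extra trees are zipped with the first tree's leaves
print(tree_map(lambda x, y: x + y, a, b))  # {'w': 11, 'b': [22, 33]}
```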

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115642
Approved by: https://github.com/vmoens, https://github.com/zou3519
2023-12-14 06:16:42 +00:00
eqy
7e1542b938 [CUDA][FP8] Skip test_dtypes on FP8 _scaled_mm (#115661)
This test isn't actually parametrized by `dtype`, so it seems to surface bogus failures where "unsupported" types "work", but in reality fp8 is used every time.

CC @drisspg I'm guessing this doesn't surface in upstream CI because there are no SM9.0 runners yet?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115661
Approved by: https://github.com/drisspg
2023-12-14 05:12:33 +00:00
f5458f8f00 [C10D] Make DumpPipe pipe file configurable (#115770)
Add the TORCH_NCCL_DEBUG_INFO_PIPE_FILE env var, allowing the pipe file
location to be separate from the dump file location.

Defaults PIPE_FILE to empty, meaning disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115770
Approved by: https://github.com/zdevito
2023-12-14 03:54:43 +00:00
ef01e78fd9 disable test_ddp_profiling_autograd_profiler in distributed_test.py (#115704)
test was previously disabled in upstream: https://github.com/pytorch/pytorch/issues/77342, currently failing in NVIDIA internal CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115704
Approved by: https://github.com/soulitzer
2023-12-14 01:41:37 +00:00
722752fc28 Revert "Increased hardcoded limit for number of GPUs. (#115368)"
This reverts commit c039f01bd932d4f67d5b1d63ade8f1db11bfb72e.

Reverted https://github.com/pytorch/pytorch/pull/115368 on behalf of https://github.com/osalpekar due to This was reverted internally due to a release breakage ([comment](https://github.com/pytorch/pytorch/pull/115368#issuecomment-1854956224))
2023-12-14 01:28:01 +00:00
5e615f5f3a [BE] Use version.txt to determine version of nightly builds (#115794)
Fixes TODO from https://github.com/pytorch/pytorch/pull/33326
Test plan: check version generated by CI:
 - https://github.com/pytorch/pytorch/actions/runs/7202798334/job/19621620744?pr=115794#step:9:64
 - https://github.com/pytorch/pytorch/actions/runs/7202798329/job/19621639791?pr=115794#step:11:104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115794
Approved by: https://github.com/atalman
2023-12-14 01:09:51 +00:00
661c1cf2aa numerical mismatch fix for test_mem_efficient_attention_attn_mask_vs_math_ref_grads in test_transformers.py (#115707)
Adjust dropout_fudge_factor since the previous fudge factor was too small and led to a numerical mismatch in NVIDIA internal CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115707
Approved by: https://github.com/drisspg
2023-12-14 01:04:39 +00:00
ffc826bf10 [nccl-pg] Store PG global rank information in tracing logs (#115730)
Storing the list of global ranks associated with each PG allows us to correlate traces across different ranks.

Test Plan:

OSS CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115730
Approved by: https://github.com/fduwjj
2023-12-14 00:59:17 +00:00
b38e14c12a [Reland][HigherOrderOp] remove unused get_item in MapHigherOrder (#115758)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/115207

Test Plan: Modified existing tests.

Reviewed By: yanboliang

Differential Revision: D52045157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115758
Approved by: https://github.com/angelayi
2023-12-14 00:41:46 +00:00
626b7dc847 Revert "Migrated loss functions to ModuleInfos (#115584)"
This reverts commit f138b08d2e9c8d676f2a404e97d773f42132b0c7.

Reverted https://github.com/pytorch/pytorch/pull/115584 on behalf of https://github.com/atalman due to OSS CI oncall, breaks slow test ([comment](https://github.com/pytorch/pytorch/pull/115584#issuecomment-1854855080))
2023-12-13 23:34:30 +00:00
3fa3ed4923 Workaround to avoid MSVC std ambiguous symbol error (#115748)
I don't know what the correct fix is, but it appears that this is the known workaround: https://github.com/pytorch/pytorch/issues/18607

Failing windows build: https://hud.pytorch.org/pytorch/pytorch/pull/114897?sha=574a6f7cfe979f1bac62c6b0b51380ff67a31a09

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115748
Approved by: https://github.com/jbschlosser
ghstack dependencies: #114895, #115739
2023-12-13 23:22:52 +00:00
67ce57ff66 Add pragma once to headers (#115739)
This reverts commit 9b93c23b5e2d695c2fbd9c886cc0c8010edab717.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115739
Approved by: https://github.com/Skylion007, https://github.com/jbschlosser
ghstack dependencies: #114895
2023-12-13 23:22:52 +00:00
c7ae2c170f [inductor] Added non-integer expr support for floordiv in triton codegen (#115751)
Description:
- Added non-integer expr support for floordiv in triton codegen
- Added a test
  - cpp test is skipped as failing and https://github.com/pytorch/pytorch/pull/115647 may fix it

This PR is fixing compilation error with the following code:
```python
import torch

def func(x, a):
    n = (a * 1.234) // 8.234
    y = x + n
    return y

cfunc = torch.compile(func, dynamic=True, fullgraph=True)

device = "cuda"
x = torch.tensor(0, dtype=torch.float32, device=device)
a = 33

out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message on Nightly:
```
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
CompilationError: at 7:38:def triton_(in_ptr0, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = ((1.23400000000000*ks0) // 8.23400000000000)
                                      ^
AssertionError()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115751
Approved by: https://github.com/peterbell10
2023-12-13 23:17:42 +00:00
3643548447 [Export] Support ser/des test on existing cases (#115413)
Summary:
Similar as #115399

Test Plan:
```
$ python test/export/test_serdes.py
...
Ran 72 tests in 29.097s

OK (expected failures=13)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115413
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #115402
2023-12-13 23:17:12 +00:00
a34d56a64a [Export] Support retraceability test on existing cases (#115402)
Summary:
Similar as #115399

Test Plan:
python test/export/test_retraceability.py

    Ran 71 tests in 31.929s

    OK (expected failures=14)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115402
Approved by: https://github.com/tugsbayasgalan
2023-12-13 23:17:12 +00:00
43efe39cb1 [codemod][lowrisk] Remove extra semi colon from caffe2/caffe2/opt/optimizer.cc (#115018)
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: dmm-fb

Differential Revision: D51777924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115018
Approved by: https://github.com/Skylion007
2023-12-13 23:11:33 +00:00
ad76a4e1e7 [inductor] Allow sympy expressions to participate in type promotion (#115676)
In the test example we have `add(i64[10], sympy.Expr)` where
`sympy.Expr` is not considered a promoting arg, so it isn't factored into
the type promotion. However, in eager it would promote to float32.
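
A hedged repro in the same spirit (values and shapes are illustrative, not the PR's test):

```python
import torch

def f(x, a):
    # Under dynamic shapes, `a * 0.5` becomes a symbolic float expression
    return x + a * 0.5

cf = torch.compile(f, dynamic=True, fullgraph=True)
x = torch.arange(10, dtype=torch.int64)

# Eager promotes int64 + float to torch.float32; the compiled result should now match
print(f(x, 3).dtype, cf(x, 3).dtype)
```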

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115676
Approved by: https://github.com/lezcano
ghstack dependencies: #115677, #115699, #115700
2023-12-13 22:22:37 +00:00
869e52e3dd Support torch function user objects (#111765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111765
Approved by: https://github.com/jansel
2023-12-13 22:11:52 +00:00
81321baf5c [PyTorch] Remove ArrayRefTensor::dtype (#113578)
Knocks off a few nanoseconds from CPU inference due to not having to set this field; paths that would've needed it are expensive anyway.

Differential Revision: [D51182794](https://our.internmc.facebook.com/intern/diff/D51182794/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113578
Approved by: https://github.com/khabinov, https://github.com/Neilblaze
ghstack dependencies: #112800, #113577
2023-12-13 21:32:14 +00:00
8c57fde21f Let all_reduce_coalesced accept one tensor as well (#115650)
This diff introduces a change to the `all_reduce_coalesced` function in `distributed_c10d.py`. The function now accepts a single tensor as well as a list of tensors. This allows for more flexibility in the use of the function.

This is just syntax sugar so the compiler can use `all_reduce_coalesced` without worrying about converting the input to a list.
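
A hedged usage sketch; the process-group setup is illustrative.

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
t = torch.ones(8, device=f"cuda:{dist.get_rank()}")

dist.all_reduce_coalesced([t])  # the existing list form
dist.all_reduce_coalesced(t)    # now also accepted: a bare tensor, as syntax sugar
```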

Differential Revision: [D51433236](https://our.internmc.facebook.com/intern/diff/D51433236/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115650
Approved by: https://github.com/wconstab
ghstack dependencies: #115523, #115302, #115648, #115649
2023-12-13 21:32:01 +00:00
b9af126908 [PyTorch] Add input numel assert for minimal arrayref interface (#113577)
We currently have no shape checking on CPU IIUC. Now we at least do numel checking for the minimal arrayref interface.

Differential Revision: [D51165703](https://our.internmc.facebook.com/intern/diff/D51165703/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113577
Approved by: https://github.com/chenyang78, https://github.com/jansel
ghstack dependencies: #112800
2023-12-13 21:31:55 +00:00
db851b1bc9 [Dynamo][7/N] Wrap python modules under torch as regular PythonModuleVariable (#115724)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115724
Approved by: https://github.com/jansel
2023-12-13 21:23:14 +00:00
54d552e991 [funcol] Directly import DeviceMesh to avoid circular dependency (#115649)
This diff aims to directly import DeviceMesh from torch.distributed.device_mesh instead of importing it from dist._tensor. This is done to avoid a circular dependency issue. The code changes in each file of the diff are as follows:

- torch/distributed/_functional_collectives.py: import DeviceMesh from torch.distributed instead of dist._tensor.

Overall, this diff aims to improve the code by avoiding circular dependencies and improving the import statements.

==
The above summary is generated by LLM with minor manual fixes. The following summary is by me.

The original import causes some issues when compiling DDP with compiled_autograd. The root cause of compilation failure is not identified but it is good to fix the lazy initialization, which indirectly fixes the compilation issues for DDP.

Differential Revision: [D51857246](https://our.internmc.facebook.com/intern/diff/D51857246/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115649
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #115523, #115302, #115648
2023-12-13 20:44:58 +00:00
7388d40165 Make pytorch_qnnpack a shared library (#115570)
Summary:
This library contains global state, e.g. pytorch_qnnp_params. If we make
it a static library, different shared libraries linking that static
library can end up with their own copies of the global state, leading to
bugs. Make it a shared library instead, to avoid this issue.

Test Plan: buck2 test fbsource//fbandroid/javatests/com/facebook/playground/apps/fb4aplayground/scenarios/pytorchscenario:pytorchscenario -- --run-disabled --regex runBundledInputWithLocalAsset

Differential Revision: D51926024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115570
Approved by: https://github.com/malfet
2023-12-13 20:44:37 +00:00
c90fdb9ac0 Fix torch.distributed.breakpoint (#115705)
Switches from calling breakpoint() internally to using a subclass of
Pdb.
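
A hedged sketch of the intended use, assuming the `rank` argument shown in its docs:

```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo")

# Rank 0 drops into the Pdb subclass; the other ranks wait on a barrier until it continues
torch.distributed.breakpoint(rank=0)
```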

Fixes #115685

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115705
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-12-13 20:33:56 +00:00
8a8d0adc0b Fix torch.gradient check for spacing arg list length (#115686)
Fixes #114207
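
For context (not from the PR), a minimal legal call: `spacing` must supply one entry per differentiated dimension, and a mismatched length should now be rejected.

```python
import torch

y = torch.tensor([[1., 2., 4.], [3., 5., 9.]])

# Two differentiated dims -> two spacing values
dy, dx = torch.gradient(y, spacing=[2.0, 1.0], dim=(0, 1))

# torch.gradient(y, spacing=[2.0], dim=(0, 1))  # mismatched length is now an error
```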

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115686
Approved by: https://github.com/albanD
2023-12-13 20:17:20 +00:00
23bff71de4 [llvm][oncall] Fix build for llvm-18+ (#115652)
Summary:
https://reviews.llvm.org/D137838 moved Host.h and some other files under TargetParser.
https://github.com/llvm/llvm-project/pull/74261 Removed it from Support folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115652
Approved by: https://github.com/davidberard98
2023-12-13 20:11:31 +00:00
4d8ad4fb82 Move SingletonSymNodeImpl from c10 to aten (#114895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114895
Approved by: https://github.com/jbschlosser
2023-12-13 20:01:18 +00:00
2a514f48d7 Add huggingface gpt2 fake tensor unit test for torch.onnx.dynamo_export (#115380)
open llama, dolly v2 and falcon are still broken regardless of `ExportedProgram`, so they were not moved from `test_fx_to_onnx.py` to `fx_to_onnx_onnxruntime.py`.

Dolly and falcon already have tracking issues, but a tracking issue was created for open llama: https://github.com/pytorch/pytorch/issues/115552

A tracking issue was created for `xfail_if_model_type_is_exportedprogram` and `xfail_if_model_type_is_not_exportedprogram` issues with unexpected success runs: https://github.com/pytorch/pytorch/issues/115747
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115380
Approved by: https://github.com/titaiwangms
2023-12-13 19:49:06 +00:00
suo
926236305f [sigmoid] fix for FX tracing unflattened modules (#115708)
Differential Revision: [D52095387](https://our.internmc.facebook.com/intern/diff/D52095387/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115708
Approved by: https://github.com/zhxchen17
2023-12-13 19:43:46 +00:00
75d3bbaaa2 Fix cudagraph check message (#115664)
This error message is printed when CUDAGraph trees are used with multiple device indices.

However, the message seems to say the opposite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115664
Approved by: https://github.com/soulitzer
2023-12-13 18:44:43 +00:00
42390a097b [inductor] Do variance calculation in opmath type (#115181)
Fixes #114903

Previously, large split variance reductions stored the intermediates in float16
precision, which can lead to overflow since the intermediate result is
unnormalized.

In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.
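
A hedged repro in the spirit of #114903; the size is illustrative, chosen so the unnormalized fp16 intermediates would overflow unless accumulated in float32 (opmath).

```python
import torch

x = torch.full((1, 65536), 100.0, dtype=torch.float16, device="cuda")
cvar = torch.compile(lambda t: t.var(dim=-1))

# Both should be finite and close now that intermediates use the opmath type
print(cvar(x), x.float().var(dim=-1))
```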

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
2023-12-13 18:40:44 +00:00
95de4f5764 add sm80orlater check to test_sdpa (#115702)
test_sdpa and test_sdpa2 in test_aot_inductor.py use bfloat16, which is not supported on sm < 80, so skip the tests if sm < 80.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115702
Approved by: https://github.com/soulitzer
2023-12-13 18:21:32 +00:00
caddcf9de5 Fix lint error in aten/src/ATen/native/cuda/CUDALoops.cuh (#115616)
Fix lint error in `aten/src/ATen/native/cuda/CUDALoops.cuh`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115616
Approved by: https://github.com/soulitzer
2023-12-13 18:13:00 +00:00
afa62d6237 [nccl-pg] Pass group global rank information to NCCL PG (#114736)
We were only passing a subset of the group creation information to the
NCCL PG. Specifically, we were missing the information on which global
ranks belong to a particular PG.

This allows the NCCL PG to use this additional information for things
like better trace logging.

Test Plan:

OSS CI

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114736
Approved by: https://github.com/kwen2501
2023-12-13 18:02:51 +00:00
193f87857e [BC breaking] Remove check_sparse_nnz argument of gradcheck (#115658)
As in the title, per the deprecation plan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115658
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2023-12-13 17:34:30 +00:00
310f6ab11a [fsdp] Replace acc_grad hooking with register_post_accumulate_grad_hook on flat_param (#112184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112184
Approved by: https://github.com/albanD
ghstack dependencies: #115315
2023-12-13 16:24:44 +00:00
97888725c5 [Export] Test non-strict mode on existing test cases (#115399)
Summary:
The Dynamo test methodology provides a good example of how to apply various
treatments to the same set of test cases. A pitfall is the global config,
which could easily be modified somewhere. Here we change the behavior of
the export API by hijacking it with self-defined code.

To support the non-strict test suite, `strict=False` is explicitly
passed into the export API whether it is called with or without the strict arg.
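
A hedged sketch of the hijacking idea (helper names are illustrative, not the PR's):

```python
import torch.export

_real_export = torch.export.export

def _nonstrict_export(*args, **kwargs):
    kwargs["strict"] = False  # force non-strict regardless of what the test passed
    return _real_export(*args, **kwargs)

torch.export.export = _nonstrict_export  # patched for the duration of the non-strict suite
```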

Test Plan:
python test/export/test_export_nonstrict.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115399
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2023-12-13 16:01:06 +00:00
66a76516bf [ROCm] Disabling Kernel Asserts for ROCm by default - fix and clean up and refactoring (#114660)
Related to #103973  #110532 #108404 #94891

**Context:**
As commented in 6ae0554d11/cmake/Dependencies.cmake (L1198)
Kernel asserts are enabled by default for CUDA and disabled for ROCm.
However, it is somewhat broken, and kernel assert was still enabled for ROCm.

Disabling kernel asserts is also needed for users who do not have PCIe atomics support. These community users have verified that disabling kernel asserts on the PyTorch/ROCm platform fixed their PyTorch workflows, such as torch.sum scripts and stable-diffusion (see the related issues).

**Changes:**

This pull request serves the following purposes:
* Refactor and clean up the logic, making it simpler for ROCm to enable and disable kernel asserts
* Fix the bug that Kernel Asserts for ROCm was not disabled by default.

Specifically,
- Renamed `TORCH_DISABLE_GPU_ASSERTS` to `C10_USE_ROCM_KERNEL_ASSERT` for the following reasons:
(1) This variable only applies to ROCm.
(2) The new name aligns better with the `#define CUDA_KERNEL_ASSERT` function.
(3) With USE_ in front of the name, we can easily control it with an environment variable to turn this feature on and off during build (e.g. `USE_ROCM_KERNEL_ASSERT=1 python setup.py develop` will enable kernel assert for a ROCm build).
- Get rid of `ROCM_FORCE_ENABLE_GPU_ASSERTS` to simplify the logic and make it easier to understand and maintain
- Added `#cmakedefine` to carry over the CMake variable to C++

**Tests:**
(1) Build in default mode and verify that USE_ROCM_KERNEL_ASSERT is OFF (0) and that kernel assert is disabled:

```
python setup.py develop
```
Verify CMakeCache.txt has correct value.
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=0
```
Tested the following code in both ROCm and CUDA builds, expecting different return codes.

```
subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
```
This piece of code is adapted from the unit test below to get around the limitation that this unit test is currently skipped for ROCm. (We will look into enabling this unit test in the future.)

```
python test/test_cuda_expandable_segments.py -k test_fixed_cuda_assert_async
```

Ran the following script, expecting r == 0 since CUDA_KERNEL_ASSERT is defined as nothing:
```
>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>> r
0
```

(2) Enable the kernel assert by building with USE_ROCM_KERNEL_ASSERT=1, or USE_ROCM_KERNEL_ASSERT=ON
```
USE_ROCM_KERNEL_ASSERT=1 python setup.py develop
```

Verify `USE_ROCM_KERNEL_ASSERT` is `1`
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=1
```

Run the assert test, and expect a return code not equal to 0.

```
>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>>/xxxx/pytorch/aten/src/ATen/native/hip/TensorCompare.hip:108: _assert_async_cuda_kernel: Device-side assertion `input[0] != 0' failed.
:0:rocdevice.cpp            :2690: 2435301199202 us: [pid:206019 tid:0x7f6cf0a77700] Callback: Queue 0x7f64e8400000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016

>>> r
-6
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114660
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/jithunnair-amd
2023-12-13 15:44:53 +00:00
fb80f05ee2 [inductor] Fix angle decomposition return type (#115700)
The current decomposition always returns float32 when the input isn't complex.
Instead, we should do proper type promotion.
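
For reference, the eager behavior the decomposition should now match (a hedged illustration):

```python
import torch

# The decomposition used to return float32 here; eager preserves the input's float dtype
print(torch.angle(torch.tensor([1.0], dtype=torch.float64)).dtype)  # torch.float64
```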

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115700
Approved by: https://github.com/lezcano
ghstack dependencies: #115677, #115699
2023-12-13 14:16:31 +00:00
9cdc80d581 [inductor] Fix torch.bernoulli decomposition return type (#115699)
Strangely enough, `torch.bernoulli` doesn't return a boolean and instead
it matches the output type of the inplace bernoulli.
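
For reference, the eager behavior being matched (a hedged illustration):

```python
import torch

p = torch.rand(4, dtype=torch.float64)
print(torch.bernoulli(p).dtype)  # torch.float64, not torch.bool
```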

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115699
Approved by: https://github.com/lezcano
ghstack dependencies: #115677
2023-12-13 14:16:31 +00:00
0e0dd8f985 [dynamo][BE] Move torchvision import inside of test_multi_import (#115677)
Currently this skip imports torchvision when the file is collected, so if your
torchvision install is broken the entire file fails at collection time. After
this change, only the test itself will fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115677
Approved by: https://github.com/lezcano
2023-12-13 14:16:31 +00:00
3807fc690f [OSSCI oncall] fix lint (#115737)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115737
Approved by: https://github.com/DanilBaibak
2023-12-13 14:15:26 +00:00
0870afb85c Revert "[Export] Test non-strict mode on existing test cases (#115399)"
This reverts commit 2411a92e9d9f90e2db3cde9190e1301bd02cb221.

Reverted https://github.com/pytorch/pytorch/pull/115399 on behalf of https://github.com/atalman due to OSSCI oncall, broke CI tests ([comment](https://github.com/pytorch/pytorch/pull/115399#issuecomment-1853869965))
2023-12-13 12:59:09 +00:00
bda6f02343 Revert "[Export] Support retraceability test on existing cases (#115402)"
This reverts commit b0c7dd47cdb8d17bbfd0ab2963b1afb908dab716.

Reverted https://github.com/pytorch/pytorch/pull/115402 on behalf of https://github.com/atalman due to OSSCI oncall, broke CI tests ([comment](https://github.com/pytorch/pytorch/pull/115402#issuecomment-1853864075))
2023-12-13 12:55:07 +00:00
3b87681ddc Revert "[Export] Support ser/des test on existing cases (#115413)"
This reverts commit 47443591631ebb80a84487bbdab3233e0077941d.

Reverted https://github.com/pytorch/pytorch/pull/115413 on behalf of https://github.com/atalman due to OSSCI oncall, broke CI tests ([comment](https://github.com/pytorch/pytorch/pull/115413#issuecomment-1853859443))
2023-12-13 12:51:34 +00:00
f9cf6ae889 [PyTorch] AOTI: add minimal arrayref interface (#112800)
This implements an optional alternate interface to the AOTI
generated DSO, intended to increase efficiency for models running on
CPU and requiring minimal overhead. See comment in config.py for more
explanation.

This took a while to get right (e.g., I initially required 1-D
MiniArrayRef<T> for the inputs, but found that multi-dimensional
ArrayRefTensor<T> ended up simplifying the implementation and allowed
test_aot_inductor.py to run) and is somewhat intricate, so I am
anticipating that review will require some back-and-forth.

Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112800
Approved by: https://github.com/chenyang78
2023-12-13 12:06:35 +00:00
331128b444 [c10] signal_handler: atomically exchange the signal count to fix data race in ExecuteStepRecursive() (#115510)
Summary:
`CheckForSignals()` can be called from multiple threads concurrently, e.g. from within `ExecuteStepRecursive()`. This means that `my_sigint_count_` and `my_sighup_count_` can be written concurrently, causing data races.

To fix, use atomic exchange which writes the new value and returns the old value in one atomic operation.

Test Plan: Running TSAN tests that failed before and now pass

Differential Revision: D52018963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115510
Approved by: https://github.com/malfet
2023-12-13 12:06:06 +00:00
50db2aa70a [funcol][BE] Apply ufmt to _functional_collectives.py and turn on lintrunner for functional_collective (#115648)
No logic change, just formatting.

Differential Revision: [D51857236](https://our.internmc.facebook.com/intern/diff/D51857236/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115648
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #115523, #115302
2023-12-13 11:19:29 +00:00
db8d409d08 [DCP][BE] Apply ufmt to DCP and turn on lintrunner for DCP (#115302)
No logic change. Just typing and ufmt.

Differential Revision: [D51914982](https://our.internmc.facebook.com/intern/diff/D51914982/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115302
Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #115523
2023-12-13 10:32:36 +00:00
cc28f61fa3 [DCP][BE] Move DCP._state_dict_utils out from DCP (#115523)
DCP._state_dict_utils is also used by FSDP, which can sometimes cause a circular import. Move it out of DCP to avoid the circular import.

Differential Revision: [D52022440](https://our.internmc.facebook.com/intern/diff/D52022440/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115523
Approved by: https://github.com/wz337
2023-12-13 08:59:48 +00:00
1500379b6d [MPS] Enable torch.rand[n] for complex types (#115514)
Test plan:
```
% python -c "import torch;print(torch.rand(3, 3, dtype=torch.chalf, device='mps'))"
tensor([[0.4639+0.8350j, 0.0479+0.1650j, 0.2510+0.9551j],
        [0.4746+0.3984j, 0.1484+0.8242j, 0.0098+0.7129j],
        [0.7979+0.6162j, 0.7188+0.9580j, 0.5186+0.2559j]], device='mps:0',
       dtype=torch.complex32)
% python3 -c "import torch; x=torch.randn(1000000, dtype=torch.cfloat, device='mps'); print((x-x.mean()).abs().pow(2).div(x.numel()-1).sum().sqrt())"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115514
Approved by: https://github.com/lezcano
ghstack dependencies: #115512, #115513, #115554
2023-12-13 07:30:56 +00:00
4744359163 [Export] Support ser/des test on existing cases (#115413)
Summary:
Similar as #115399

Test Plan:
```
$ python test/export/test_serdes.py
...
Ran 72 tests in 29.097s

OK (expected failures=13)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115413
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #115399, #115402
2023-12-13 06:01:17 +00:00
b0c7dd47cd [Export] Support retraceability test on existing cases (#115402)
Summary:
Similar as #115399

Test Plan:
python test/export/test_retraceability.py

FAILED (failures=6, errors=8, expected failures=7)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115402
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #115399
2023-12-13 06:01:17 +00:00
2411a92e9d [Export] Test non-strict mode on existing test cases (#115399)
Summary:
The Dynamo test methodology provides a good example of how to apply various
treatments to the same set of test cases. A pitfall is the global config,
which could easily be modified somewhere. Here we change the behavior of
the export API by hijacking it with self-defined code.

To support the non-strict test suite, `strict=False` is explicitly
passed into the export API whether it is called with or without the strict arg.

Test Plan:
python test/export/test_export_nonstrict.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115399
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2023-12-13 06:01:17 +00:00
dd42201cb8 [export] Preserve FQN in export_to_torch_ir (#115462)
AOTInductor currently relies on export_to_torch_ir to generate a graph, and passes it to Inductor to generate the .so. They would like the FQNs to be consistent so that they can easily find/update the weights in the .so.

Note that since export flattens all modules into a single computational graph, we change the FQNs in the original module by replacing all periods with underscores. For example, `foo.child1param`, which points to the parameter `child1param` of a submodule named `foo`, will be renamed to `foo_child1param` since we no longer have the submodule `foo`. This is done simply with `name.replace(".", "_")`.
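A small sketch of that renaming applied to a state dict (illustrative; `flatten_fqns` is a hypothetical helper, not part of the PR):

```python
def flatten_fqns(state_dict):
    # Replace the '.' separators in fully qualified names with '_' so the
    # flattened export graph can refer to each weight by a single identifier.
    return {name.replace(".", "_"): value for name, value in state_dict.items()}


# e.g. {"foo.child1param": tensor} becomes {"foo_child1param": tensor}
```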

Outputted AOTInductor c++ code: https://www.internalfb.com/phabricator/paste/view/P900120950?lines=377-355%2C354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115462
Approved by: https://github.com/tugsbayasgalan
2023-12-13 04:58:47 +00:00
0dad85b402 [Dynamo] Fix torch.tensor call with tuple (#115713)
Land #114383 on behalf of @ezyang since he is on recharge and this is a high-priority issue.
Fixes #114231

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115713
Approved by: https://github.com/angelayi, https://github.com/voznesenskym
2023-12-13 04:08:12 +00:00
38101e349e [usdt][torch] Sample dispatch operator integration (#115593)
Summary:
By default the instruction at the USDT is a nop; when the tracepoint is attached (e.g. through bpftrace), the code inside the semaphore check is executed. There should therefore be no performance impact as long as the USDT is not attached from the tracepoint execution code itself; however, the semaphore check itself (`TORCH_SDT_IS_ENABLED`) incurs the cost of a `read_global_volatile` operation.

See https://github.com/dtrace4linux/linux/blob/master/doc/usdt.html for more info.

Test Plan:
```
buck2  build  mode/opt caffe2/torch/fb/observers:strobelight_observer_runner --show-full-output
```

```
/data/users/rihams/fbsource/buck-out/v2/gen/fbcode/0bc8cf217a8cf352/caffe2/torch/fb/observers/__strobelight_observer_runner__/strobelight_observer_runner
```

```
sudo bpftrace -e 'usdt:/data/users/rihams/fbsource/buck-out/v2/gen/fbcode/6081734815403318/caffe2/torch/fb/observers/__strobelight_observer_runner__/strobelight_observer_runner:pytorch:operator_* { printf("%s --> %s\n", probe, str(arg0)); }' -v

usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::empty_strided
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::empty_strided
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::empty_like
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::fill_
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::fill_
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::ones_like
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::mul
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::mul
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::add
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::add
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::detach
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::detach
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::randn
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::empty
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::empty
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::normal_
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::normal_
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::randn
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::to
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::_to_copy
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::empty_strided
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::empty_strided

```

Differential Revision: D44636587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115593
Approved by: https://github.com/malfet
2023-12-13 02:41:48 +00:00
17c104ac18 [export] Do not copy state_dict in run_decomp (#115269)
Fixes https://github.com/pytorch/pytorch/issues/114628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115269
Approved by: https://github.com/thiagocrepaldi, https://github.com/ydwu4
2023-12-13 01:21:21 +00:00
99554112d3 [pytorch] add namespace for optTypeMetaToScalarType in codegen to avoid not declared when compile (#115623)
Fixes a compilation failure in some environments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115623
Approved by: https://github.com/albanD
2023-12-13 00:59:01 +00:00
1392843e7b [inductor] make sure bitcast input and target type have the same bitwidth (#115619)
This PR fixes #104791.

bitcast requires that the source and target have the same bitwidth.
Because the input tensor's dtype could be promoted, e.g. from float16 to
float, we have to cast the tensor back to its original source dtype before
invoking bitcast in such cases. After that, we also need to convert
the bit-casted tensor back to float to make sure we keep using higher-precision
values for the rest of the computation.
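As a rough Python-level analogy of that cast/bitcast/cast sequence (the actual fix lives in the Inductor C++ codegen; `Tensor.view(dtype)` stands in for the bitcast here):

```python
import torch

x = torch.randn(4, dtype=torch.float16)
promoted = x.to(torch.float32)  # dtype promotion inside the generated kernel

# An element-wise bitcast needs source and target dtypes of equal bitwidth,
# so cast back to the original float16 first, reinterpret the bits, and then
# return to float32 so the rest of the computation keeps the higher precision.
as_bits = promoted.to(torch.float16).view(torch.int16)
restored = as_bits.view(torch.float16).to(torch.float32)
```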

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115619
Approved by: https://github.com/jansel, https://github.com/eellison
2023-12-13 00:53:04 +00:00
469d6d45fe [BE] Bye bye, CircleCI (#115701)
In PyTorch, a change we now see,
CircleCI's gone, set it free.
With commits and a push,
No more waiting in hush,
For a simpler CI spree!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115701
Approved by: https://github.com/PaliC, https://github.com/suo, https://github.com/seemethere
2023-12-13 00:26:49 +00:00
76ced0df03 Consider storage_changed for assigning alias_of_input in aot_autograd when computing differentiable outputs that alias each other (#115315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115315
Approved by: https://github.com/bdhirsh
2023-12-12 23:21:58 +00:00
946de1cf4c [export][fix] Add back export strict argument (#115668)
Summary:
#115556 omitted the strict argument, which is necessary for non-strict mode
development.

Test Plan:
python test/export/test_export.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115668
Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi
2023-12-12 22:59:10 +00:00
48ed165380 [FSDP][state_dict] Create a FSDP/EP unittest (#115567)
As title

Differential Revision: [D52043394](https://our.internmc.facebook.com/intern/diff/D52043394/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115567
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2023-12-12 22:48:11 +00:00
639060cb0b Use get_mkldnn_enabled for decompositions (#115448)
`torch._C.has_mkldnn` does not respect cases where users try to disable mkldnn using `torch._C._set_mkldnn_enabled()`. This is relevant to edge use cases, where users do not want decompositions to go to the ATen opset and do not want the mkldnn operator to appear in the graph.
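A hedged sketch of the distinction (`mkldnn_decomp_allowed` is an illustrative helper, not the function changed by this PR):

```python
import torch

print(torch._C.has_mkldnn)             # build-time capability; unaffected by the toggle

torch._C._set_mkldnn_enabled(False)    # the runtime switch users may flip
print(torch._C._get_mkldnn_enabled())  # False


def mkldnn_decomp_allowed() -> bool:
    # Respect the runtime switch, not just the build-time flag.
    return torch._C.has_mkldnn and torch._C._get_mkldnn_enabled()
```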
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115448
Approved by: https://github.com/jgong5, https://github.com/ydwu4
2023-12-12 22:42:51 +00:00
f78f23d753 [export] Turn off output value from sources for export. (#115442)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115442
Approved by: https://github.com/tugsbayasgalan
2023-12-12 22:41:23 +00:00
af09fe256a [Inductor] Implement a deduplist data structure for name to user tracking (#115609)
Summary:
An internal MRS model was taking over a day to compile due to many duplicates in dependency tracking. This PR replaces the list with a custom dedup list.
Normally one could use a set/dict for this purpose; however, the list in question gets elements appended while it is being iterated over, which means we need to keep list semantics.
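A minimal sketch of such a dedup list (an assumed shape; the class name and methods are illustrative, not the actual Inductor data structure):

```python
class DedupList:
    """Append-only list that skips duplicates in O(1) while keeping list
    semantics, so callers may append while iterating over it by index."""

    def __init__(self):
        self._items = []
        self._seen = set()

    def append(self, item):
        if item not in self._seen:
            self._seen.add(item)
            self._items.append(item)

    def __getitem__(self, idx):
        return self._items[idx]

    def __len__(self):
        return len(self._items)


users = DedupList()
users.append("buf0")
users.append("buf0")   # duplicate is dropped without scanning the list
assert len(users) == 1
```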

Test Plan: ad hoc testing

Differential Revision: D52060659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115609
Approved by: https://github.com/jansel
2023-12-12 22:28:30 +00:00
ffb2a28a67 Fixes expected behavior when no_dist=True in state_dict_loader.load (#115660)
Fixes expected behavior when `no_dist=True` in `state_dict_loader.load`

Fixes #115591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115660
Approved by: https://github.com/wz337, https://github.com/fegin
2023-12-12 22:21:16 +00:00
f138b08d2e Migrated loss functions to ModuleInfos (#115584)
Migrates most tests in `common_nn.py:criterion_tests` to ModuleInfos.

**I can split this up if it is too large to review**

What this PR does not include:
- [`no_batch_dim` tests](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L3995-L4112)
- [tests that use the functional variant of the loss function and `wrap_functional`](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L1079-L1128)

#### On test times
This PR increases test time by ~58s locally
Before this PR:
```
>>> python test/test_nn.py -k Loss
Ran 1003 tests in 28.977s
```
After this PR
```
>>> python test/test_nn.py -k Loss
Ran 368 tests in 23.073s
```

```
>>> python test/test_modules.py -k Loss
Ran 836 tests in 63.900s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115584
Approved by: https://github.com/janeyx99
ghstack dependencies: #115617
2023-12-12 22:20:20 +00:00
1becd2c314 Align checks in _use_cudnn_ctc_loss with those in _cudnn_ctc_loss (#115617)
This PR is intended to fix the following problem:

When using `CTCLoss`, there is a cuDNN path gated by a call to `_use_cudnn_ctc_loss` (e918461377: aten/src/ATen/native/cudnn/LossCTC.cpp, L73-L101), which checks some conditions (e918461377: aten/src/ATen/native/LossCTC.cpp, L486-L496).

However, there are more checks in `_cudnn_ctc_loss` (e918461377: aten/src/ATen/native/cudnn/LossCTC.cpp, L122-L130), some of which are not present in `_use_cudnn_ctc_loss` (e.g. the check that `targets` is on CPU, which will cause a RuntimeError after dispatching to `_cudnn_ctc_loss`). Instead, these checks should be in `_use_cudnn_ctc_loss` so that the normal `_ctc_loss` path is used when the checks are not met.

e.g. Before this PR

```python
>>> import torch
>>> ctcloss = torch.nn.CTCLoss()
>>> log_probs = torch.randn((50, 3, 15), device='cuda').log_softmax(2)
>>> target = torch.randint(1, 15, (30 + 25 + 20,), dtype = torch.int)
>>> input_lengths = torch.tensor((50, 50, 50), device='cuda')
>>> target_lengths = torch.tensor((30, 25, 20), device='cuda')
>>> ctcloss(log_probs, target, input_lengths, target_lengths)
tensor(4.1172, device='cuda:0')
>>> target = target.to('cuda')
>>> ctcloss(log_probs, target, input_lengths, target_lengths)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/users/mg1998/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/mg1998/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/mg1998/pytorch/torch/nn/modules/loss.py", line 1779, in forward
    return F.ctc_loss(log_probs, targets, input_lengths, target_lengths, self.blank, self.reduction,
  File "/data/users/mg1998/pytorch/torch/nn/functional.py", line 2660, in ctc_loss
    return torch.ctc_loss(
RuntimeError: Expected tensor to have CPU Backend, but got tensor with CUDA Backend (while checking arguments for cudnn_ctc_loss)
```

After this PR the above snippet runs without error.
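A hypothetical Python rendering of the gating idea (the real check lives in the C++ dispatch code; the helper and the exact set of conditions below are illustrative only):

```python
import torch


def can_use_cudnn_ctc_loss(log_probs, targets, blank=0):
    # Mirror every condition the cuDNN kernel will later assert, so that
    # unsupported inputs fall back to the generic _ctc_loss path instead of
    # raising after dispatch.
    return (
        log_probs.is_cuda
        and targets.device.type == "cpu"   # cuDNN CTC expects targets on CPU
        and targets.dtype == torch.int32   # ... and as int32
        and blank == 0                     # cuDNN only supports blank index 0
    )
```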

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115617
Approved by: https://github.com/janeyx99
2023-12-12 22:20:20 +00:00
c3ed9f65a0 Revert "[8/n] Update XNNPACK Version Part 8 Everything Remaining to get it to work (#115587)"
This reverts commit a8dc9d8e353ddcf7db0247349a3acd0dd37fcc6f.

Reverted https://github.com/pytorch/pytorch/pull/115587 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/115587#issuecomment-1852835898))
2023-12-12 21:28:09 +00:00
ac4f6beb00 [Dynamo] Make resume function name more explicit by adding lineno (#115608)
Add the line number to the resume function name for easy aggregation in the Scuba table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115608
Approved by: https://github.com/jansel, https://github.com/williamwen42
2023-12-12 21:08:41 +00:00
40ce9a4cfb [c10d] Create a python c10d API _set_pg_timeout to set timeout (#115453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115453
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2023-12-12 20:52:43 +00:00
8a58af2a9f [Reland][HigherOrderOp] make MapHigherOrder create map_impl (#115561)
This is a reland of #115205, which gets reverted due to internal test failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115561
Approved by: https://github.com/angelayi
2023-12-12 20:45:01 +00:00
8739d1e3f9 Fix a fast mode gradcheck bug where specified eps argument is ignored when switching to slow mode (#115634)
As in the title.

The reproducer for the bug is as follows:
```python
>>> import torch
>>> dtype = torch.bfloat16
>>> D1 = torch.tensor([[1, 2], [3, 4]], dtype=dtype, requires_grad=True)
>>> D2 = torch.tensor([[1, 2], [3, 4]], dtype=dtype, requires_grad=True)
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True)
```

<details>

```
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(0., dtype=torch.bfloat16)
analytical:tensor(4.9062, dtype=torch.bfloat16)

The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:

Numerical:
 tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]], dtype=torch.bfloat16)
Analytical:
tensor([[1., 2., 0., 0.],
        [3., 4., 0., 0.],
        [0., 0., 1., 2.],
        [0., 0., 3., 4.]], dtype=torch.bfloat16)

```
</details>

```
The max per-element difference (slow mode) is: 4.0.
```

```python
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True, eps=1e-1)
```

<details>

```
<snip>
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(5., dtype=torch.bfloat16)
analytical:tensor(4.9062, dtype=torch.bfloat16)

The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:

Numerical:
 tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]], dtype=torch.bfloat16)
Analytical:
tensor([[1., 2., 0., 0.],
        [3., 4., 0., 0.],
        [0., 0., 1., 2.],
        [0., 0., 3., 4.]], dtype=torch.bfloat16)
```

</details>

```
The max per-element difference (slow mode) is: 4.0.
```

Notice that changing the `eps` value has no effect on the max per-element difference.

With this PR, increasing the `eps` value leads to sensible results in the numerical jacobian:
```python
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True, eps=1e-1)
```

<details>

```
<snip>
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(5., dtype=torch.bfloat16)
analytical:tensor(4.9062, dtype=torch.bfloat16)

The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:

Numerical:
 tensor([[0.9375, 1.8750, 0.0000, 0.0000],
        [2.9688, 3.7500, 0.0000, 0.0000],
        [0.0000, 0.0000, 1.2500, 2.5000],
        [0.0000, 0.0000, 2.5000, 3.7500]], dtype=torch.bfloat16)
Analytical:
tensor([[1., 2., 0., 0.],
        [3., 4., 0., 0.],
        [0., 0., 1., 2.],
        [0., 0., 3., 4.]], dtype=torch.bfloat16)
```

</details>

```
The max per-element difference (slow mode) is: 0.5.
```

Finally:
```python
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True, eps=1e-1, atol=1)
True
```
This call would fail on the current main branch.
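For intuition, a small sketch (mine, not part of the PR) of why the default `eps` yields an all-zero numerical Jacobian in bfloat16: the finite-difference step underflows below bfloat16 resolution.

```python
import torch

x = torch.tensor([[1., 2.], [3., 4.]], dtype=torch.bfloat16)

default_eps = 1e-6  # gradcheck's default step size
usable_eps = 1e-1   # large enough to survive bfloat16 rounding

# bfloat16 carries ~3 significant decimal digits, so x + 1e-6 rounds back to x
print(torch.equal(x + default_eps, x))  # True  -> finite differences are all zero
print(torch.equal(x + usable_eps, x))   # False -> differences become meaningful
```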

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115634
Approved by: https://github.com/lezcano, https://github.com/soulitzer, https://github.com/albanD
ghstack dependencies: #115536
2023-12-12 20:00:56 +00:00
75ab294eb5 Enable builtin tests for ONNX Export with ExportedProgram models (#114762)
Fixed by https://github.com/pytorch/pytorch/pull/113982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114762
Approved by: https://github.com/BowenBao
2023-12-12 19:50:06 +00:00
d954ef208f [DCP][state_dict] DCP state_dict cannot correctly find FQN when the leaf module is wrapped by FSDP (#115592)
Summary: The original logic incorrectly assumes that there is at least one object name left when traversing the module tree. This does not hold when the leaf module is wrapped by FSDP.

Test Plan: CI

Differential Revision: D52049293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115592
Approved by: https://github.com/wz337
2023-12-12 19:22:23 +00:00
9252 changed files with 202122 additions and 89462 deletions

View File

@ -19,6 +19,7 @@ See `build.sh` for valid build environments (it's the giant switch).
* `ubuntu` -- Dockerfile for Ubuntu image for CPU build and test jobs
* `ubuntu-cuda` -- Dockerfile for Ubuntu image with CUDA support for nvidia-docker
* `ubuntu-rocm` -- Dockerfile for Ubuntu image with ROCm support
* `ubuntu-xpu` -- Dockerfile for Ubuntu image with XPU support
## Usage

View File

@ -71,6 +71,8 @@ if [[ "$image" == *cuda* && "$UBUNTU_VERSION" != "22.04" ]]; then
DOCKERFILE="${OS}-cuda/Dockerfile"
elif [[ "$image" == *rocm* ]]; then
DOCKERFILE="${OS}-rocm/Dockerfile"
elif [[ "$image" == *xpu* ]]; then
DOCKERFILE="${OS}-xpu/Dockerfile"
elif [[ "$image" == *cuda*linter* ]]; then
# Use a separate Dockerfile for linter to keep a small image size
DOCKERFILE="linter-cuda/Dockerfile"
@ -202,7 +204,7 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=5.6
ROCM_VERSION=5.7
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
@ -213,11 +215,21 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=5.7
ROCM_VERSION=6.0
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-xpu-2024.0-py3)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
BASEKIT_VERSION=2024.0.0-49522
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
;;
pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=11
@ -265,6 +277,7 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
DOCS=yes
UNINSTALL_DILL=yes
;;
pytorch-linux-jammy-py3-clang12-executorch)
ANACONDA_PYTHON_VERSION=3.10
@ -284,6 +297,15 @@ case "$image" in
CUDA_VERSION=11.8
CONDA_CMAKE=yes
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
ACL=yes
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
;;
*)
# Catch-all for builds that are not hardcoded.
PROTOBUF=yes
@ -337,7 +359,7 @@ if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
fi
# Build image
docker build \
DOCKER_BUILDKIT=1 docker build \
--no-cache \
--progress=plain \
--build-arg "BUILD_ENVIRONMENT=${image}" \
@ -374,6 +396,8 @@ docker build \
--build-arg "DOCS=${DOCS}" \
--build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \
--build-arg "EXECUTORCH=${EXECUTORCH}" \
--build-arg "BASEKIT_VERSION=${BASEKIT_VERSION}" \
--build-arg "ACL=${ACL:-}" \
-f $(dirname ${DOCKERFILE})/Dockerfile \
-t "$tmp_tag" \
"$@" \

View File

@ -1 +1 @@
ca6322dcfc51b209a06b76d160bd95d81d58f15c
7f96f5a852ba452670255d28d59f1e6398141fbb

View File

@ -1 +1 @@
6c26faa159b79a42d7fa46cb66e2d21523351987
243e186efbf7fb93328dd6b34927a4e8c8f24395

View File

@ -1 +1 @@
dafe1459823b9549417ed95e9720f1b594fab329
0a22a91d04c2b4a029a69a198eac390089c3e891

View File

@ -1 +1 @@
bcad9dabe15021c53b6a88296e9d7a210044f108
989adb9a29496c22a36ef82ca69cad5dad536b9c

View File

@ -0,0 +1,16 @@
set -euo pipefail
readonly version=v23.08
readonly src_host=https://review.mlplatform.org/ml
readonly src_repo=ComputeLibrary
# Clone ACL
[[ ! -d ${src_repo} ]] && git clone ${src_host}/${src_repo}.git
cd ${src_repo}
git checkout $version
# Build with scons
scons -j8 Werror=0 debug=0 neon=1 opencl=0 embed_kernels=0 \
os=linux arch=armv8a build=native multi_isa=1 \
fixed_format_kernels=1 openmp=1 cppthreads=0

View File

@ -61,6 +61,7 @@ install_ubuntu() {
${maybe_libiomp_dev} \
libyaml-dev \
libz-dev \
libjemalloc2 \
libjpeg-dev \
libasound2-dev \
libsndfile-dev \
@ -74,6 +75,7 @@ install_ubuntu() {
libtool \
vim \
unzip \
gpg-agent \
gdb
# Should resolve issues related to various apt package repository cert issues
@ -151,7 +153,7 @@ wget https://ossci-linux.s3.amazonaws.com/valgrind-${VALGRIND_VERSION}.tar.bz2
tar -xjf valgrind-${VALGRIND_VERSION}.tar.bz2
cd valgrind-${VALGRIND_VERSION}
./configure --prefix=/usr/local
make -j6
make -j$[$(nproc) - 2]
sudo make install
cd ../../
rm -rf valgrind_build

View File

@ -9,10 +9,19 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
MAJOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 1)
MINOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 2)
if [[ $(uname -m) == "aarch64" ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"
case "$MAJOR_PYTHON_VERSION" in
2)
CONDA_FILE="Miniconda2-latest-Linux-x86_64.sh"
3)
CONDA_FILE="Miniforge3-Linux-aarch64.sh"
;;
*)
echo "Unsupported ANACONDA_PYTHON_VERSION: $ANACONDA_PYTHON_VERSION"
exit 1
;;
esac
else
case "$MAJOR_PYTHON_VERSION" in
3)
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
;;
@ -21,6 +30,7 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
exit 1
;;
esac
fi
mkdir -p /opt/conda
chown jenkins:jenkins /opt/conda
@ -47,15 +57,39 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Uncomment the below when resolved to track the latest conda update
# as_jenkins conda update -y -n base conda
if [[ $(uname -m) == "aarch64" ]]; then
export SYSROOT_DEP="sysroot_linux-aarch64=2.17"
else
export SYSROOT_DEP="sysroot_linux-64=2.17"
fi
# Install correct Python version
as_jenkins conda create -n py_$ANACONDA_PYTHON_VERSION -y python="$ANACONDA_PYTHON_VERSION"
# Also ensure sysroot is using a modern GLIBC to match system compilers
as_jenkins conda create -n py_$ANACONDA_PYTHON_VERSION -y\
python="$ANACONDA_PYTHON_VERSION" \
${SYSROOT_DEP}
# libstdcxx from conda default channels are too old, we need GLIBCXX_3.4.30
# which is provided in libstdcxx 12 and up.
conda_install libstdcxx-ng=12.3.0 -c conda-forge
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ]; then
conda_install numpy=1.23.5 ${CONDA_COMMON_DEPS}
if [[ $(uname -m) == "aarch64" ]]; then
CONDA_COMMON_DEPS="astunparse pyyaml setuptools openblas==0.3.25=*openmp* ninja==1.11.1 scons==4.5.2"
if [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
conda_install numpy=1.24.4 ${CONDA_COMMON_DEPS}
else
conda_install numpy=1.26.2 ${CONDA_COMMON_DEPS}
fi
else
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ]; then
conda_install numpy=1.26.0 ${CONDA_COMMON_DEPS}
else
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
fi
fi
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
@ -89,14 +123,5 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
pip_install -r /opt/conda/requirements-docs.txt
fi
# HACK HACK HACK
# gcc-9 for ubuntu-18.04 from http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu
# Pulls llibstdc++6 13.1.0-8ubuntu1~18.04 which is too new for conda
# So remove libstdc++6.so.3.29 installed by https://anaconda.org/anaconda/libstdcxx-ng/files?version=11.2.0
# Same is true for gcc-12 from Ubuntu-22.04
if grep -e [12][82].04.[623] /etc/issue >/dev/null; then
rm /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/lib/libstdc++.so.6
fi
popd
fi

View File

@ -2,8 +2,8 @@
if [[ ${CUDNN_VERSION} == 8 ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
CUDNN_NAME="cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive"
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.9.2.26_cuda12-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz
@ -11,17 +11,14 @@ if [[ ${CUDNN_VERSION} == 8 ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.7.0.84_cuda11-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/${CUDNN_NAME}.tar.xz
else
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/${CUDNN_NAME}.tar.xz
print "Unsupported CUDA version ${CUDA_VERSION}"
exit 1
fi
tar xf ${CUDNN_NAME}.tar.xz
cp -a ${CUDNN_NAME}/include/* /usr/include/
cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/
cp -a ${CUDNN_NAME}/include/* /usr/include/x86_64-linux-gnu/
cp -a ${CUDNN_NAME}/lib/* /usr/local/cuda/lib64/
cp -a ${CUDNN_NAME}/lib/* /usr/lib/x86_64-linux-gnu/
cd ..
popd
rm -rf tmp_cudnn
ldconfig
fi

View File

@ -0,0 +1,21 @@
#!/bin/bash
set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.5.2.1-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.4.0.7-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
fi
tar xf ${CUSPARSELT_NAME}.tar.xz
cp -a ${CUSPARSELT_NAME}/include/* /usr/local/cuda/include/
cp -a ${CUSPARSELT_NAME}/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cusparselt
ldconfig

View File

@ -48,7 +48,6 @@ setup_executorch() {
install_flatc_from_source
pip_install .
build_executorch_runner "cmake"
# Make sure that all the newly generate files are owned by Jenkins
chown -R jenkins .

View File

@ -26,18 +26,19 @@ pip_install \
pytest-cov==4.0.0 \
pytest-subtests==0.10.0 \
tabulate==0.9.0 \
transformers==4.32.1
transformers==4.36.2
pip_install coloredlogs packaging
retry pip_install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ --no-cache-dir --no-input ort-nightly==1.17.0.dev20231005006
pip_install -i https://test.pypi.org/simple/ onnx==1.15.0rc2
pip_install onnxscript==0.1.0.dev20231128 --no-deps
pip_install onnxruntime==1.17.0
pip_install onnx==1.15.0
# pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@3e869ef8ccf19b5ebd21c10d3e9c267c9a9fa729" --no-deps
pip_install onnxscript==0.1.0.dev20240315 --no-deps
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/
IMPORT_SCRIPT_FILENAME="/tmp/onnx_import_script.py"
as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2");' > "${IMPORT_SCRIPT_FILENAME}"
as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3");' > "${IMPORT_SCRIPT_FILENAME}"
# Need a PyTorch version for transformers to work
pip_install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu

View File

@ -9,7 +9,8 @@ tar xf "${OPENSSL}.tar.gz"
cd "${OPENSSL}"
./config --prefix=/opt/openssl -d '-Wl,--enable-new-dtags,-rpath,$(LIBRPATH)'
# NOTE: openssl install errors out when built with the -j option
make -j6; make install_sw
NPROC=$[$(nproc) - 2]
make -j${NPROC}; make install_sw
# Link the ssl libraries to the /usr/lib folder.
sudo ln -s /opt/openssl/lib/lib* /usr/lib
cd ..

View File

@ -2,55 +2,18 @@
set -ex
# This function installs protobuf 3.17
install_protobuf_317() {
pb_dir="/usr/temp_pb_install_dir"
mkdir -p $pb_dir
pb_dir="/usr/temp_pb_install_dir"
mkdir -p $pb_dir
# On the nvidia/cuda:9-cudnn7-devel-centos7 image we need this symlink or
# else it will fail with
# g++: error: ./../lib64/crti.o: No such file or directory
ln -s /usr/lib64 "$pb_dir/lib64"
# On the nvidia/cuda:9-cudnn7-devel-centos7 image we need this symlink or
# else it will fail with
# g++: error: ./../lib64/crti.o: No such file or directory
ln -s /usr/lib64 "$pb_dir/lib64"
curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" --retry 3
tar -xvz -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz
# -j6 to balance memory usage and speed.
# naked `-j` seems to use too much memory.
pushd "$pb_dir" && ./configure && make -j6 && make -j6 check && sudo make -j6 install && sudo ldconfig
popd
rm -rf $pb_dir
}
curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" --retry 3
install_ubuntu() {
# Ubuntu 14.04 has cmake 2.8.12 as the default option, so we will
# install cmake3 here and use cmake3.
apt-get update
if [[ "$UBUNTU_VERSION" == 14.04 ]]; then
apt-get install -y --no-install-recommends cmake3
fi
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
install_protobuf_317
}
install_centos() {
install_protobuf_317
}
# Install base packages depending on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac
tar -xvz --no-same-owner -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz
NPROC=$[$(nproc) - 2]
pushd "$pb_dir" && ./configure && make -j${NPROC} && make -j${NPROC} check && sudo make -j${NRPOC} install && sudo ldconfig
popd
rm -rf $pb_dir

View File

@ -80,6 +80,14 @@ install_ubuntu() {
fi
fi
# ROCm 6.0 had a regression where journal_mode was enabled on the kdb files resulting in permission errors at runtime
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.0) ]]; then
for kdb in /opt/rocm/share/miopen/db/*.kdb
do
sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"
done
fi
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
@ -151,6 +159,14 @@ install_centos() {
fi
fi
# ROCm 6.0 had a regression where journal_mode was enabled on the kdb files resulting in permission errors at runtime
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.0) ]]; then
for kdb in /opt/rocm/share/miopen/db/*.kdb
do
sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"
done
fi
# Cleanup
yum clean all
rm -rf /var/cache/yum

View File

@ -7,7 +7,7 @@ git clone https://bitbucket.org/icl/magma.git
pushd magma
# Version 2.7.2 + ROCm related updates
git checkout 823531632140d0edcb7e77c3edc0e837421471c5
git checkout a1625ff4d9bc362906bd01f805dbbe12612953f6
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc

View File

@ -64,5 +64,6 @@ if [ -n "${CONDA_CMAKE}" ]; then
# latest numpy version, which fails ASAN tests with the following import error: Numba
# needs NumPy 1.20 or less.
conda_reinstall cmake="${CMAKE_VERSION}"
conda_reinstall numpy="${NUMPY_VERSION}"
# Note that we install numpy with pip as conda might not have the version we want
pip_install --force-reinstall numpy=="${NUMPY_VERSION}"
fi

View File

@ -36,7 +36,12 @@ function install_ucc() {
git submodule update --init --recursive
./autogen.sh
./configure --prefix=$UCC_HOME --with-ucx=$UCX_HOME --with-cuda=$with_cuda
# We only run distributed tests on Tesla M60 and A10G
NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86"
./configure --prefix=$UCC_HOME \
--with-ucx=$UCX_HOME \
--with-cuda=$with_cuda \
--with-nvcc-gencode="${NVCC_GENCODE}"
time make -j
sudo make install

View File

@ -0,0 +1,115 @@
#!/bin/bash
set -xe
# Intel® software for general purpose GPU capabilities.
# Refer to https://dgpu-docs.intel.com/releases/stable_647_21_20230714.html
# Intel® oneAPI Base Toolkit (version 2024.0.0) has been updated to include functional and security updates.
# Refer to https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html
# Users should update to the latest version as it becomes available
function install_ubuntu() {
apt-get update -y
apt-get install -y gpg-agent wget
# Set up the repository. To do this, download the key to the system keyring
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key \
| gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
# Add the signed entry to APT sources and configure the APT client to use the Intel repository
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/production/2328 unified" \
| tee /etc/apt/sources.list.d/intel-gpu-jammy.list
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \
| tee /etc/apt/sources.list.d/oneAPI.list
# Update the packages list and repository index
apt-get update
# The xpu-smi packages
apt-get install -y flex bison xpu-smi
# Compute and Media Runtimes
apt-get install -y \
intel-opencl-icd intel-level-zero-gpu level-zero \
intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
# Development Packages
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
# Install Intel® oneAPI Base Toolkit
if [ -n "$BASEKIT_VERSION" ]; then
apt-get install intel-basekit=$BASEKIT_VERSION -y
else
apt-get install intel-basekit -y
fi
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
}
function install_centos() {
dnf install -y 'dnf-command(config-manager)'
dnf config-manager --add-repo \
https://repositories.intel.com/gpu/rhel/8.6/production/2328/unified/intel-gpu-8.6.repo
# To add the EPEL repository needed for DKMS
dnf -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
# https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
# Create the YUM repository file in the /temp directory as a normal user
tee > /tmp/oneAPI.repo << EOF
[oneAPI]
name=Intel® oneAPI repository
baseurl=https://yum.repos.intel.com/oneapi
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
EOF
# Move the newly created oneAPI.repo file to the YUM configuration directory /etc/yum.repos.d
mv /tmp/oneAPI.repo /etc/yum.repos.d
# The xpu-smi packages
dnf install -y flex bison xpu-smi
# Compute and Media Runtimes
dnf install -y \
intel-opencl intel-media intel-mediasdk libmfxgen1 libvpl2\
level-zero intel-level-zero-gpu mesa-dri-drivers mesa-vulkan-drivers \
mesa-vdpau-drivers libdrm mesa-libEGL mesa-libgbm mesa-libGL \
mesa-libxatracker libvpl-tools intel-metrics-discovery \
intel-metrics-library intel-igc-core intel-igc-cm \
libva libva-utils intel-gmmlib libmetee intel-gsc intel-ocloc hwinfo clinfo
# Development packages
dnf install -y --refresh \
intel-igc-opencl-devel level-zero-devel intel-gsc-devel libmetee-devel \
level-zero-devel
# Install Intel® oneAPI Base Toolkit
dnf install intel-basekit -y
# Cleanup
dnf clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
}
# The installation depends on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac

View File

@ -15,7 +15,7 @@ click
#Pinned versions:
#test that import:
coremltools==5.0b5
coremltools==5.0b5 ; python_version < "3.12"
#Description: Apple framework for ML integration
#Pinned versions: 5.0b5
#test that import:
@ -25,6 +25,11 @@ coremltools==5.0b5
#Pinned versions:
#test that import:
dill==0.3.7
#Description: dill extends pickle with serializing and de-serializing for most built-ins
#Pinned versions: 0.3.7
#test that import: dynamo/test_replay_record.py test_dataloader.py test_datapipe.py test_serialization.py
expecttest==0.1.6
#Description: method for writing tests where test framework auto populates
# the expected output based on previous runs
@ -47,6 +52,11 @@ junitparser==2.1.1
#Pinned versions: 2.1.1
#test that import:
lark==0.12.0
#Description: parser
#Pinned versions: 0.12.0
#test that import:
librosa>=0.6.2 ; python_version < "3.11"
#Description: A python package for music and audio analysis
#Pinned versions: >=0.6.2
@ -66,7 +76,7 @@ librosa>=0.6.2 ; python_version < "3.11"
#Description: A testing library that allows you to replace parts of your
#system under test with mock objects
#Pinned versions:
#test that import: test_module_init.py, test_modules.py, test_nn.py,
#test that import: test_modules.py, test_nn.py,
#test_testing.py
#MonkeyType # breaks pytorch-xla-linux-bionic-py3.7-clang8
@ -75,10 +85,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.7.0
mypy==1.8.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.7.0
#Pinned versions: 1.8.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@ -137,9 +147,9 @@ optree==0.9.1
#test_pointwise_ops.py, test_dtensor_ops.py, test_torchinductor.py, test_fx.py,
#test_fake_tensor.py, test_mps.py
pillow==10.0.1
pillow==10.2.0
#Description: Python Imaging Library fork
#Pinned versions: 10.0.1
#Pinned versions: 10.2.0
#test that import:
protobuf==3.20.2
@ -162,11 +172,6 @@ pytest-xdist==3.3.1
#Pinned versions:
#test that import:
pytest-shard==0.1.2
#Description: plugin spliting up tests in pytest
#Pinned versions:
#test that import:
pytest-flakefinder==1.1.0
#Description: plugin for rerunning tests a fixed number of times in pytest
#Pinned versions: 1.1.0
@ -243,7 +248,8 @@ tb-nightly==2.13.0a20230426
#Pinned versions:
#test that import:
#typing-extensions
# needed by torchgen utils
typing-extensions
#Description: type hints for python
#Pinned versions:
#test that import:
@ -258,7 +264,8 @@ unittest-xml-reporting<=3.2.0,>=2.0.0
#Pinned versions:
#test that import:
lintrunner==0.10.7
#wheel not found on aarch64, and source build requires rust
lintrunner==0.10.7 ; platform_machine == "x86_64"
#Description: all about linters!
#Pinned versions: 0.10.7
#test that import:
@ -268,14 +275,14 @@ rockset==1.0.3
#Pinned versions: 1.0.3
#test that import:
ghstack==0.7.1
ghstack==0.8.0
#Description: ghstack tool
#Pinned versions: 0.7.1
#Pinned versions: 0.8.0
#test that import:
jinja2==3.1.2
jinja2==3.1.3
#Description: jinja2 template engine
#Pinned versions: 3.1.2
#Pinned versions: 3.1.3
#test that import:
pytest-cpp==2.3.0
@ -293,8 +300,14 @@ tensorboard==2.13.0
#Pinned versions:
#test that import: test_tensorboard
pywavelets==1.4.1
pywavelets==1.4.1 ; python_version < "3.12"
pywavelets==1.5.0 ; python_version >= "3.12"
#Description: This is a requirement of scikit-image, we need to pin
# it here because 1.5.0 conflicts with numpy 1.21.2 used in CI
#Pinned versions: 1.4.1
#test that import:
lxml==5.0.0
#Description: This is a requirement of unittest-xml-reporting
# Python-3.9 binaries

View File

@ -1 +1 @@
2.1.0
3.0.0

View File

@ -142,6 +142,12 @@ COPY ./common/install_cudnn.sh install_cudnn.sh
RUN if [ "${CUDNN_VERSION}" -eq 8 ]; then bash install_cudnn.sh; fi
RUN rm install_cudnn.sh
# Install CUSPARSELT
ARG CUDA_VERSION
COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash install_cusparselt.sh
RUN rm install_cusparselt.sh
# Delete /usr/local/cuda-11.X/cuda-11.X symlinks
RUN if [ -h /usr/local/cuda-11.6/cuda-11.6 ]; then rm /usr/local/cuda-11.6/cuda-11.6; fi
RUN if [ -h /usr/local/cuda-11.7/cuda-11.7 ]; then rm /usr/local/cuda-11.7/cuda-11.7; fi

View File

@ -0,0 +1,118 @@
ARG UBUNTU_VERSION
FROM ubuntu:${UBUNTU_VERSION}
ARG UBUNTU_VERSION
ENV DEBIAN_FRONTEND noninteractive
ARG CLANG_VERSION
# Install common dependencies (so that this step can be cached separately)
COPY ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh
# Install clang
ARG LLVMDEV
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# Install user
COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install katex
ARG KATEX
COPY ./common/install_docs_reqs.sh install_docs_reqs.sh
RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ARG DOCS
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
ENV DOCS=$DOCS
COPY requirements-ci.txt requirements-docs.txt /opt/conda/
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt
# Install gcc
ARG GCC_VERSION
COPY ./common/install_gcc.sh install_gcc.sh
RUN bash ./install_gcc.sh && rm install_gcc.sh
# Install lcov for C++ code coverage
COPY ./common/install_lcov.sh install_lcov.sh
RUN bash ./install_lcov.sh && rm install_lcov.sh
COPY ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
ENV OPENSSL_DIR /opt/openssl
RUN rm install_openssl.sh
ARG INDUCTOR_BENCHMARKS
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
ARG TRITON
# Install triton, this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
# TODO: will add triton xpu commit
COPY ci_commit_pins/triton.txt triton.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt
# (optional) Install database packages like LMDB and LevelDB
ARG DB
COPY ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV and ffmpeg
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# Install XPU Dependencies
ARG BASEKIT_VERSION
COPY ./common/install_xpu.sh install_xpu.sh
RUN bash ./install_xpu.sh && rm install_xpu.sh
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh
RUN if [ -n "${NINJA_VERSION}" ]; then bash ./install_ninja.sh; fi
RUN rm install_ninja.sh
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
RUN bash ./install_cache.sh && rm install_cache.sh
# Include BUILD_ENVIRONMENT environment variable in image
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
USER jenkins
CMD ["bash"]

View File

@ -37,6 +37,7 @@ COPY requirements-ci.txt requirements-docs.txt /opt/conda/
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt
RUN if [ -n "${UNINSTALL_DILL}" ]; then pip uninstall -y dill; fi
# Install gcc
ARG GCC_VERSION
@ -160,6 +161,13 @@ COPY ./common/install_onnx.sh ./common/common_utils.sh ./
RUN if [ -n "${ONNX}" ]; then bash ./install_onnx.sh; fi
RUN rm install_onnx.sh common_utils.sh
# (optional) Build ACL
ARG ACL
COPY ./common/install_acl.sh install_acl.sh
RUN if [ -n "${ACL}" ]; then bash ./install_acl.sh; fi
RUN rm install_acl.sh
ENV INSTALLED_ACL ${ACL}
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH

View File

@ -28,6 +28,8 @@ echo "Environment variables:"
env
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
# Use jemalloc during compilation to mitigate https://github.com/pytorch/pytorch/issues/116289
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
echo "NVCC version:"
nvcc --version
fi
@ -80,6 +82,19 @@ if ! which conda; then
fi
else
export CMAKE_PREFIX_PATH=/opt/conda
# Workaround required for MKL library linkage
# https://github.com/pytorch/pytorch/issues/119557
if [ "$ANACONDA_PYTHON_VERSION" = "3.12" ]; then
export CMAKE_LIBRARY_PATH="/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/lib/"
export CMAKE_INCLUDE_PATH="/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/include/"
fi
fi
if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
export USE_MKLDNN=1
export USE_MKLDNN_ACL=1
export ACL_ROOT_DIR=/ComputeLibrary
fi
if [[ "$BUILD_ENVIRONMENT" == *libtorch* ]]; then
@ -151,6 +166,12 @@ if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
python tools/amd_build/build_amd.py
fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
export USE_XPU=1
fi
# sccache will fail for CUDA builds if all cores are used for compiling
# gcc 7 with sccache seems to have intermittent OOM issue if all cores are used
if [ -z "$MAX_JOBS" ]; then
@ -202,6 +223,10 @@ if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi
WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
sudo chown -R jenkins /var/lib/jenkins/workspace
git config --global --add safe.directory /var/lib/jenkins/workspace
if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
set -e
@ -227,13 +252,17 @@ else
( ! get_exit_code python setup.py clean bad_argument )
if [[ "$BUILD_ENVIRONMENT" != *libtorch* ]]; then
# rocm builds fail when WERROR=1
# XLA test build fails when WERROR=1
# set only when building other architectures
# or building non-XLA tests.
if [[ "$BUILD_ENVIRONMENT" != *rocm* &&
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then
# Install numpy-2.0 release candidate for builds
# Which should be backward compatible with Numpy-1.X
python -mpip install --pre numpy==2.0.0b1
fi
WERROR=1 python setup.py bdist_wheel
else
python setup.py bdist_wheel
@ -334,3 +363,5 @@ if [[ "$BUILD_ENVIRONMENT" != *libtorch* && "$BUILD_ENVIRONMENT" != *bazel* ]];
fi
print_sccache_stats
sudo chown -R "$WORKSPACE_ORIGINAL_OWNER_ID" /var/lib/jenkins/workspace

View File

@ -158,6 +158,11 @@ function install_torchvision() {
fi
}
function install_tlparse() {
pip_install --user "tlparse==0.3.7"
PATH="$(python -m site --user-base)/bin:$PATH"
}
function install_torchrec_and_fbgemm() {
local torchrec_commit
torchrec_commit=$(get_pinned_commit torchrec)

View File

@ -9,7 +9,7 @@ sysctl -a | grep machdep.cpu
# These are required for both the build job and the test job.
# In the latter to test cpp extensions.
export MACOSX_DEPLOYMENT_TARGET=11.0
export MACOSX_DEPLOYMENT_TARGET=11.1
export CXX=clang++
export CC=clang

View File

@ -149,6 +149,8 @@ test_jit_hooks() {
assert_git_not_dirty
}
install_tlparse
if [[ $NUM_TEST_SHARDS -gt 1 ]]; then
test_python_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then

View File

@ -34,7 +34,6 @@ time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test
# functional collective tests
time python test/run_test.py --verbose -i distributed/test_functional_api
# DTensor tests
time python test/run_test.py --verbose -i distributed/_tensor/test_random_ops
time python test/run_test.py --verbose -i distributed/_tensor/test_dtensor_compile
@ -46,9 +45,11 @@ time python test/run_test.py --verbose -i distributed/test_device_mesh
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_ddp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_fsdp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state.py
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx
time python test/run_test.py --verbose -i test_optim -- -k optimizers_with_varying_tensors
time python test/run_test.py --verbose -i test_optim -- -k test_forloop_goes_right_direction_multigpu
time python test/run_test.py --verbose -i test_optim -- -k test_mixed_device_dtype
time python test/run_test.py --verbose -i test_foreach -- -k test_tensors_grouping
assert_git_not_dirty

View File

@ -18,6 +18,10 @@ BUILD_DIR="build"
BUILD_RENAMED_DIR="build_renamed"
BUILD_BIN_DIR="$BUILD_DIR"/bin
#Set Default values for these variables in case they are not set
SHARD_NUMBER="${SHARD_NUMBER:=1}"
NUM_TEST_SHARDS="${NUM_TEST_SHARDS:=1}"
export VALGRIND=ON
# export TORCH_INDUCTOR_INSTALL_GXX=ON
if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then
@ -124,6 +128,10 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* || "$BUILD_ENVIRONMENT" == *rocm* ]]; then
# mainly used so that we're not spending extra cycles testing cpu
# devices on expensive gpu machines
export PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda"
elif [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
export PYTORCH_TESTING_DEVICE_ONLY_FOR="xpu"
# setting PYTHON_TEST_EXTRA_OPTION
export PYTHON_TEST_EXTRA_OPTION="--xpu"
fi
if [[ "$TEST_CONFIG" == *crossref* ]]; then
@ -131,11 +139,22 @@ if [[ "$TEST_CONFIG" == *crossref* ]]; then
fi
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
# regression in ROCm 6.0 on MI50 CI runners due to hipblaslt; remove in 6.1
export VALGRIND=OFF
# Print GPU info
rocminfo
rocminfo | grep -E 'Name:.*\sgfx|Marketing'
fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# Source Intel oneAPI environment script to enable xpu runtime related libraries
# refer to https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/2024-0/use-the-setvars-and-oneapi-vars-scripts-with-linux.html
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
# Check XPU status before testing
xpu-smi discovery
fi
if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
# JIT C++ extensions require ninja.
pip_install --user "ninja==1.10.2"
@ -144,6 +163,8 @@ if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
export PATH="$HOME/.local/bin:$PATH"
fi
install_tlparse
# DANGER WILL ROBINSON. The LD_PRELOAD here could cause you problems
# if you're not careful. Check this if you made some changes and the
# ASAN test is not working
@ -235,14 +256,14 @@ test_python_shard() {
# Bare --include flag is not supported and quoting for lint ends up with flag not being interpreted correctly
# shellcheck disable=SC2086
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION
assert_git_not_dirty
}
test_python() {
# shellcheck disable=SC2086
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --verbose
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --verbose $PYTHON_TEST_EXTRA_OPTION
assert_git_not_dirty
}
@ -253,33 +274,13 @@ test_dynamo_shard() {
exit 1
fi
python tools/dynamo/verify_dynamo.py
# Temporarily disable test_fx for dynamo pending the investigation on TTS
# regression in https://github.com/pytorch/torchdynamo/issues/784
# PLEASE DO NOT ADD ADDITIONAL EXCLUDES HERE.
# Instead, use @skipIfTorchDynamo on your tests.
time python test/run_test.py --dynamo \
--exclude-inductor-tests \
--exclude-jit-executor \
--exclude-distributed-tests \
--exclude \
test_autograd \
test_jit \
test_proxy_tensor \
test_quantization \
test_public_bindings \
test_dataloader \
test_reductions \
test_namedtensor \
test_namedtuple_return_api \
profiler/test_profiler \
profiler/test_profiler_tree \
test_overrides \
test_python_dispatch \
test_fx \
test_package \
test_legacy_vmap \
test_custom_ops \
test_content_store \
export/test_db \
functorch/test_dims \
functorch/test_aotdispatch \
--exclude-torch-export-tests \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
assert_git_not_dirty
@ -291,8 +292,18 @@ test_inductor_distributed() {
pytest test/inductor/test_torchinductor.py -k test_multi_gpu
pytest test/inductor/test_aot_inductor.py -k test_non_default_cuda_device
pytest test/inductor/test_aot_inductor.py -k test_replicate_on_devices
pytest test/distributed/test_c10d_functional_native.py
pytest test/distributed/_tensor/test_dtensor_compile.py
pytest test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
pytest test/distributed/_composable/fsdp/test_fully_shard_comm.py
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_mlp
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_transformer_checkpoint_resume
pytest test/distributed/_composable/fsdp/test_fully_shard_frozen.py
pytest test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype
pytest test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype
# this runs on both single-gpu and multi-gpu instance. It should be smart about skipping tests that aren't supported
# with if required # gpus aren't available
@ -308,8 +319,18 @@ test_inductor() {
# docker build uses bdist_wheel which does not work with test_aot_inductor
# TODO: need a faster way to build
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aot_inductor
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aot_inductor
fi
}
test_inductor_cpp_wrapper_abi_compatible() {
export TORCHINDUCTOR_ABI_COMPATIBLE=1
echo "Testing Inductor cpp wrapper mode with TORCHINDUCTOR_ABI_COMPATIBLE=1"
# cpu stack allocation causes segfault and needs more investigation
TORCHINDUCTOR_STACK_ALLOCATION=0 python test/run_test.py --include inductor/test_cpu_cpp_wrapper
python test/run_test.py --include inductor/test_cuda_cpp_wrapper
}
# "Global" flags for inductor benchmarking controlled by TEST_CONFIG
@ -389,8 +410,8 @@ test_perf_for_dashboard() {
--output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_${mode}_cuda_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cppwrapper-true* ]] && [[ "$mode" == "inference" ]]; then
python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs --cpp-wrapper "$@" \
TORCHINDUCTOR_CPP_WRAPPER=1 python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_cpp_wrapper_${suite}_${dtype}_${mode}_cuda_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *freezing_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then
@ -404,7 +425,7 @@ test_perf_for_dashboard() {
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_cuda_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *aotinductor-true* ]] && [[ "$mode" == "inference" ]]; then
python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_ABI_COMPATIBLE=1 python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --export-aot-inductor --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_cuda_${target}.csv"
fi
@ -448,6 +469,11 @@ test_single_dynamo_benchmark() {
test_perf_for_dashboard "$suite" \
"${DYNAMO_BENCHMARK_FLAGS[@]}" "$@" "${partition_flags[@]}"
else
if [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then
# Test AOTInductor with the ABI-compatible mode on CI
# This can be removed once the ABI-compatible mode becomes default.
export TORCHINDUCTOR_ABI_COMPATIBLE=1
fi
python "benchmarks/dynamo/$suite.py" \
--ci --accuracy --timing --explain \
"${DYNAMO_BENCHMARK_FLAGS[@]}" \
@ -491,13 +517,20 @@ test_inductor_torchbench_smoketest_perf() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
# smoke test the cpp_wrapper mode
TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy --bfloat16 \
--inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_smoketest.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_smoketest.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \
--batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \
--output "$TEST_REPORTS_DIR/inductor_training_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_training_smoketest.csv" -t 1.4
python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference \
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference \
--export-aot-inductor --only nanogpt --output "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
# The perf number of nanogpt seems not very stable, e.g.
@ -518,6 +551,50 @@ test_inductor_torchbench_smoketest_perf() {
done
}
test_inductor_torchbench_cpu_smoketest_perf(){
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
#set jemalloc
JEMALLOC_LIB="/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"
export LD_PRELOAD="$JEMALLOC_LIB":"$IOMP_LIB":"$LD_PRELOAD"
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
end_core=$(( CORES-1 ))
MODELS_SPEEDUP_TARGET=benchmarks/dynamo/expected_ci_speedup_inductor_torchbench_cpu.csv
grep -v '^ *#' < "$MODELS_SPEEDUP_TARGET" | while IFS=',' read -r -a model_cfg
do
local model_name=${model_cfg[0]}
local data_type=${model_cfg[1]}
local speedup_target=${model_cfg[4]}
if [[ ${model_cfg[3]} == "cpp" ]]; then
export TORCHINDUCTOR_CPP_WRAPPER=1
else
unset TORCHINDUCTOR_CPP_WRAPPER
fi
local output_name="$TEST_REPORTS_DIR/inductor_inference_${model_cfg[0]}_${model_cfg[1]}_${model_cfg[2]}_${model_cfg[3]}_cpu_smoketest.csv"
if [[ ${model_cfg[2]} == "dynamic" ]]; then
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" --dynamic-shapes \
--dynamic-batch-only --freezing --timeout 9000 --backend=inductor --output "$output_name"
else
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" \
--freezing --timeout 9000 --backend=inductor --output "$output_name"
fi
cat "$output_name"
# The threshold value needs to be actively maintained to make this check useful.
python benchmarks/dynamo/check_perf_csv.py -f "$output_name" -t "$speedup_target"
done
}
test_python_gloo_with_tls() {
source "$(dirname "${BASH_SOURCE[0]}")/run_glootls_test.sh"
assert_git_not_dirty
@ -664,6 +741,19 @@ test_libtorch_api() {
fi
}
test_xpu_bin(){
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
for xpu_case in "${BUILD_BIN_DIR}"/*{xpu,sycl}*; do
if [[ "$xpu_case" != *"*"* && "$xpu_case" != *.so && "$xpu_case" != *.a ]]; then
case_name=$(basename "$xpu_case")
echo "Testing ${case_name} ..."
"$xpu_case" --gtest_output=xml:"$TEST_REPORTS_DIR"/"$case_name".xml
fi
done
}
test_aot_compilation() {
echo "Testing Ahead of Time compilation"
ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR"
@ -904,7 +994,8 @@ test_bazel() {
tools/bazel test --config=cpu-only --test_timeout=480 --test_output=all --test_tag_filters=-gpu-required --test_filter=-*CUDA :all_tests
else
tools/bazel test --test_output=errors \
# Increase the test timeout to 480 like CPU tests because modules_test frequently timeout
tools/bazel test --test_timeout=480 --test_output=errors \
//:any_test \
//:autograd_test \
//:dataloader_test \
@ -999,14 +1090,17 @@ test_docs_test() {
}
test_executorch() {
echo "Install torchvision and torchaudio"
install_torchvision
install_torchaudio
pushd /executorch
echo "Install torchvision and torchaudio"
# TODO(huydhn): Switch this to the pinned commits on ExecuTorch once they are
# there. These libraries need to be built here, and not part of the Docker
# image because they require the target version of torch to be installed first
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/audio.git"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/vision.git"
# NB: We need to build ExecuTorch runner here and not inside the Docker image
# because it depends on PyTorch
# shellcheck disable=SC1091
source .ci/scripts/utils.sh
build_executorch_runner "cmake"
echo "Run ExecuTorch regression tests for some models"
# NB: This is a sample model, more can be added here
@ -1075,6 +1169,11 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
if [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then
checkout_install_torchbench hf_Bert hf_Albert nanogpt timm_vision_transformer
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_gcn \
llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \
shufflenet_v2_x1_0 hf_GPT2
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf
else
checkout_install_torchbench
# Do this after checkout_install_torchbench to ensure we clobber any
@ -1084,24 +1183,29 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
fi
PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"
fi
elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper_abi_compatible* ]]; then
install_torchvision
test_inductor_cpp_wrapper_abi_compatible
elif [[ "${TEST_CONFIG}" == *inductor* && "${SHARD_NUMBER}" == 1 ]]; then
install_torchvision
test_inductor
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_without_numpy
install_torchvision
test_dynamo_shard 1
test_aten
elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 2 && $NUM_TEST_SHARDS -gt 1 ]]; then
elif [[ "${TEST_CONFIG}" == *dynamo* && $SHARD_NUMBER -gt 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
install_torchvision
test_dynamo_shard 2
test_dynamo_shard "${SHARD_NUMBER}"
elif [[ "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_without_numpy
install_torchvision
test_python_shard 1
test_aten
test_libtorch 1
if [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then
test_xpu_bin
fi
elif [[ "${SHARD_NUMBER}" == 2 && $NUM_TEST_SHARDS -gt 1 ]]; then
install_torchvision
test_python_shard 2
@ -1126,6 +1230,11 @@ elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then
install_torchvision
test_python
test_aten
elif [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then
install_torchvision
test_python
test_aten
test_xpu_bin
else
install_torchvision
install_monkeytype


@ -16,11 +16,6 @@ set PATH=C:\Program Files\CMake\bin;C:\Program Files\7-Zip;C:\ProgramData\chocol
set INSTALLER_DIR=%SCRIPT_HELPERS_DIR%\installation-helpers
call %INSTALLER_DIR%\install_mkl.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
call %INSTALLER_DIR%\install_magma.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
@ -35,6 +30,10 @@ call %INSTALLER_DIR%\activate_miniconda3.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
call pip install mkl-include==2021.4.0 mkl-devel==2021.4.0
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
:: Override VS env here
pushd .
if "%VC_VERSION%" == "" (
@ -89,8 +88,8 @@ set SCCACHE_IGNORE_SERVER_IO_ERROR=1
sccache --stop-server
sccache --start-server
sccache --zero-stats
set CC=sccache-cl
set CXX=sccache-cl
set CMAKE_C_COMPILER_LAUNCHER=sccache
set CMAKE_CXX_COMPILER_LAUNCHER=sccache
set CMAKE_GENERATOR=Ninja


@ -1,14 +0,0 @@
if "%REBUILD%"=="" (
if "%BUILD_ENVIRONMENT%"=="" (
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/mkl_2020.2.254.7z --output %TMP_DIR_WIN%\mkl.7z
) else (
aws s3 cp s3://ossci-windows/mkl_2020.2.254.7z %TMP_DIR_WIN%\mkl.7z --quiet
)
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
7z x -aoa %TMP_DIR_WIN%\mkl.7z -o%TMP_DIR_WIN%\mkl
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
)
set CMAKE_INCLUDE_PATH=%TMP_DIR_WIN%\mkl\include
set LIB=%TMP_DIR_WIN%\mkl\lib;%LIB%


@ -1,18 +1,13 @@
mkdir %TMP_DIR_WIN%\bin
if "%REBUILD%"=="" (
:check_sccache
%TMP_DIR_WIN%\bin\sccache.exe --show-stats || (
IF EXIST %TMP_DIR_WIN%\bin\sccache.exe (
taskkill /im sccache.exe /f /t || ver > nul
del %TMP_DIR_WIN%\bin\sccache.exe || ver > nul
del %TMP_DIR_WIN%\bin\sccache-cl.exe || ver > nul
if "%BUILD_ENVIRONMENT%"=="" (
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/sccache.exe --output %TMP_DIR_WIN%\bin\sccache.exe
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/sccache-cl.exe --output %TMP_DIR_WIN%\bin\sccache-cl.exe
) else (
aws s3 cp s3://ossci-windows/sccache.exe %TMP_DIR_WIN%\bin\sccache.exe
aws s3 cp s3://ossci-windows/sccache-cl.exe %TMP_DIR_WIN%\bin\sccache-cl.exe
)
goto :check_sccache
)
)
if "%BUILD_ENVIRONMENT%"=="" (
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/sccache-v0.7.4.exe --output %TMP_DIR_WIN%\bin\sccache.exe
) else (
aws s3 cp s3://ossci-windows/sccache-v0.7.4.exe %TMP_DIR_WIN%\bin\sccache.exe
)
)


@ -1,468 +1,4 @@
Warning
=======
Contents may be out of date. Our CircleCI workflows are gradually being migrated to GitHub Actions.
Structure of CI
===============
setup job:
1. Does a git checkout
2. Persists CircleCI scripts (everything in `.circleci`) into a workspace. Why?
We don't always do a Git checkout on all subjobs, but we usually
still want to be able to call scripts one way or another in a subjob.
Persisting files this way lets us have access to them without doing a
checkout. This workspace is conventionally mounted on `~/workspace`
(this is distinguished from `~/project`, which is the conventional
working directory that CircleCI will default to starting your jobs
in.)
3. Write out the commit message to `.circleci/COMMIT_MSG`. This is so
we can determine in subjobs if we should actually run the jobs or
not, even if there isn't a Git checkout.
CircleCI configuration generator
================================
One may no longer make changes to the `.circleci/config.yml` file directly.
Instead, one must edit these Python scripts or files in the `verbatim-sources/` directory.
Usage
----------
1. Make changes to these scripts.
2. Run the `regenerate.sh` script in this directory and commit the script changes and the resulting change to `config.yml`.
You'll see a build failure on GitHub if the scripts don't agree with the checked-in version.
Motivation
----------
These scripts establish a single, authoritative source of documentation for the CircleCI configuration matrix.
The documentation, in the form of diagrams, is automatically generated and cannot drift out of sync with the YAML content.
Furthermore, consistency is enforced within the YAML config itself, by using a single source of data to generate
multiple parts of the file.
* Facilitates one-off culling/enabling of CI configs for testing PRs on special targets
Also see https://github.com/pytorch/pytorch/issues/17038
Future direction
----------------
### Declaring sparse config subsets
See comment [here](https://github.com/pytorch/pytorch/pull/17323#pullrequestreview-206945747):
In contrast with a full recursive tree traversal of configuration dimensions,
> in the future I think we actually want to decrease our matrix somewhat and have only a few mostly-orthogonal builds that taste as many different features as possible on PRs, plus a more complete suite on every PR and maybe an almost full suite nightly/weekly (we don't have this yet). Specifying PR jobs in the future might be easier to read with an explicit list when we come to this.
----------------
----------------
# How do the binaries / nightlies / releases work?
### What is a binary?
A binary or package (used interchangeably) is a pre-built collection of c++ libraries, header files, python bits, and other files. We build these and distribute them so that users do not need to install from source.
A **binary configuration** is a collection of
* release or nightly
* releases are stable, nightlies are beta and built every night
* python version
* linux: 3.7m (mu is wide unicode or something like that. It usually doesn't matter but you should know that it exists)
* macos: 3.7, 3.8
* windows: 3.7, 3.8
* cpu version
* cpu, cuda 9.0, cuda 10.0
* The supported cuda versions occasionally change
* operating system
* Linux - these are all built on CentOS. There haven't been any problems in the past building on CentOS and using on Ubuntu
* MacOS
* Windows - these are built on Azure pipelines
* devtoolset version (gcc compiler version)
* This only matters on Linux because only Linux uses gcc. tl;dr is that gcc made a backwards-incompatible change from gcc 4.8 to gcc 5, because it had to change how it implemented std::vector and std::string
### Where are the binaries?
The binaries are built in CircleCI. There are nightly binaries built every night at 9pm PST (midnight EST) and release binaries corresponding to PyTorch releases, usually every few months.
We have 3 types of binary packages
* pip packages - nightlies are stored on s3 (pip install -f \<a s3 url\>). releases are stored in a pip repo (pip install torch) (ask Soumith about this)
* conda packages - nightlies and releases are both stored in a conda repo. Nightly packages have a '_nightly' suffix
* libtorch packages - these are zips of all the c++ libraries, header files, and sometimes dependencies. These are c++ only
* shared with dependencies (the only supported option for Windows)
* static with dependencies
* shared without dependencies
* static without dependencies
All binaries are built in CircleCI workflows except Windows. There are checked-in workflows (committed into the .circleci/config.yml) to build the nightlies every night. Releases are built by manually pushing a PR that builds the suite of release binaries (overwrite the config.yml to build the release)
# CircleCI structure of the binaries
Some quick vocab:
* A \**workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on https://github.com/pytorch/pytorch/blob/main/.circleci/config.yml to see the workflows.
* **jobs** are a sequence of '**steps**'
* **steps** are usually just a bash script or a builtin CircleCI command. *All steps run in new environments; environment variables declared in one script DO NOT persist to following steps*
* CircleCI has a **workspace**, which is essentially a cache between steps of the *same job* in which you can store artifacts between steps.
## How are the workflows structured?
The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build, test, and upload) per binary configuration
1. binary_builds
1. every day midnight EST
2. linux: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/linux-binary-build-defaults.yml
3. macos: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/macos-binary-build-defaults.yml
4. For each binary configuration, e.g. linux_conda_3.7_cpu there is a
1. binary_linux_conda_3.7_cpu_build
1. Builds the package. On linux jobs this uses the 'docker executor'.
2. Persists the package to the workspace
2. binary_linux_conda_3.7_cpu_test
1. Loads the package to the workspace
2. Spins up a docker image (on Linux), mapping the package and code repos into the docker
3. Runs some smoke tests in the docker
4. (Actually, for macos this is a step rather than a separate job)
3. binary_linux_conda_3.7_cpu_upload
1. Logs in to aws/conda
2. Uploads the package
2. update_s3_htmls
1. every day 5am EST
2. https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/binary_update_htmls.yml
3. See below for what these are for and why they're needed
4. Three jobs that each examine the current contents of aws and the conda repo and update some html files in s3
3. binarysmoketests
1. every day
2. https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/nightly-build-smoke-tests-defaults.yml
3. For each binary configuration, e.g. linux_conda_3.7_cpu there is a
1. smoke_linux_conda_3.7_cpu
1. Downloads the package from the cloud, e.g. using the official pip or conda instructions
2. Runs the smoke tests
## How are the jobs structured?
The jobs are in https://github.com/pytorch/pytorch/tree/main/.circleci/verbatim-sources. Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/main/.circleci/scripts .
* Linux jobs: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/linux-binary-build-defaults.yml
* binary_linux_build.sh
* binary_linux_test.sh
* binary_linux_upload.sh
* MacOS jobs: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/macos-binary-build-defaults.yml
* binary_macos_build.sh
* binary_macos_test.sh
* binary_macos_upload.sh
* Update html jobs: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/binary_update_htmls.yml
* These delegate from the pytorch/builder repo
* https://github.com/pytorch/builder/blob/main/cron/update_s3_htmls.sh
* https://github.com/pytorch/builder/blob/main/cron/upload_binary_sizes.sh
* Smoke jobs (both linux and macos): https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/nightly-build-smoke-tests-defaults.yml
* These delegate from the pytorch/builder repo
* https://github.com/pytorch/builder/blob/main/run_tests.sh
* https://github.com/pytorch/builder/blob/main/smoke_test.sh
* https://github.com/pytorch/builder/blob/main/check_binary.sh
* Common shared code (shared across linux and macos): https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/nightly-binary-build-defaults.yml
* binary_checkout.sh - checks out pytorch/builder repo. Right now this also checks out pytorch/pytorch, but it shouldn't. pytorch/pytorch should just be shared through the workspace. This can handle being run before binary_populate_env.sh
* binary_populate_env.sh - parses BUILD_ENVIRONMENT into the separate env variables that make up a binary configuration. Also sets lots of default values, the date, the version strings, the location of folders in s3, all sorts of things. This generally has to be run before other steps. A rough sketch of this parsing is shown after this list.
* binary_install_miniconda.sh - Installs miniconda, cross platform. Also hacks this for the update_binary_sizes job that doesn't have the right env variables
* binary_run_in_docker.sh - Takes a bash script file (the actual test code) from a hardcoded location, spins up a docker image, and runs the script inside the docker image
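A hedged sketch of the kind of parsing binary_populate_env.sh is described as doing; the example value and output line below are made up, and the real script also derives dates, version strings, and s3 locations:
```sh
# Hypothetical sketch only, not the real binary_populate_env.sh.
BUILD_ENVIRONMENT="manywheel 3.7m cu113"   # made-up example value
read -r PACKAGE_TYPE DESIRED_PYTHON DESIRED_CUDA <<< "$BUILD_ENVIRONMENT"
echo "package=$PACKAGE_TYPE python=$DESIRED_PYTHON cuda=$DESIRED_CUDA"
```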
### **Why do the steps all refer to scripts?**
CircleCI creates a final yaml file by inlining every <<* segment, so if we were to keep all the code in the config.yml itself then the config size would go over 4 MB and cause infra problems.
### **What is binary_run_in_docker for?**
So, CircleCI has several executor types: macos, machine, and docker are the ones we use. The 'machine' executor gives you two cores on some linux vm. The 'docker' executor gives you considerably more cores (nproc was 32 instead of 2 back when I tried in February). Since the dockers are faster, we try to run everything that we can in dockers. Thus
* linux build jobs use the docker executor. Running them on the docker executor was at least 2x faster than running them on the machine executor
* linux test jobs use the machine executor in order for them to properly interface with GPUs since docker executors cannot execute with attached GPUs
* linux upload jobs use the machine executor. The upload jobs are so short that it doesn't really matter what they use
* linux smoke test jobs use the machine executor for the same reason as the linux test jobs
binary_run_in_docker.sh is a way to share the docker start-up code between the binary test jobs and the binary smoke test jobs
### **Why does binary_checkout also checkout pytorch? Why shouldn't it?**
We want all the nightly binary jobs to run on the exact same git commit, so we wrote our own checkout logic to ensure that the same commit was always picked. Later circleci changed that to use a single pytorch checkout and persist it through the workspace (they did this because our config file was too big, so they wanted to take a lot of the setup code into scripts, but the scripts needed the code repo to exist to be called, so they added a prereq step called 'setup' to checkout the code and persist the needed scripts to the workspace). The changes to the binary jobs were not properly tested, so they all broke from missing pytorch code no longer existing. We hotfixed the problem by adding the pytorch checkout back to binary_checkout, so now there's two checkouts of pytorch on the binary jobs. This problem still needs to be fixed, but it takes careful tracing of which code is being called where.
# Code structure of the binaries (circleci agnostic)
## Overview
The code that runs the binaries lives in two places, in the normal [github.com/pytorch/pytorch](http://github.com/pytorch/pytorch), but also in [github.com/pytorch/builder](http://github.com/pytorch/builder), which is a repo that defines how all the binaries are built. The relevant code is
```
# All code needed to set-up environments for build code to run in,
# but only code that is specific to the current CI system
pytorch/pytorch
- .circleci/ # Folder that holds all circleci related stuff
- config.yml # GENERATED file that actually controls all circleci behavior
- verbatim-sources # Used to generate job/workflow sections in ^
- scripts/ # Code needed to prepare circleci environments for binary build scripts
- setup.py # Builds pytorch. This is wrapped in pytorch/builder
- cmake files # used in normal building of pytorch
# All code needed to prepare a binary build, given an environment
# with all the right variables/packages/paths.
pytorch/builder
# Given an installed binary and a proper python env, runs some checks
# to make sure the binary was built the proper way. Checks things like
# the library dependencies, symbols present, etc.
- check_binary.sh
# Given an installed binary, runs python tests to make sure everything
# is in order. These should be de-duped. Right now they both run smoke
# tests, but are called from different places. Usually just call some
# import statements, but also has overlap with check_binary.sh above
- run_tests.sh
- smoke_test.sh
# Folders that govern how packages are built. See paragraphs below
- conda/
- build_pytorch.sh # Entrypoint. Delegates to proper conda build folder
- switch_cuda_version.sh # Switches the active CUDA installation in Docker
- pytorch-nightly/ # Build-folder
- manywheel/
- build_cpu.sh # Entrypoint for cpu builds
- build.sh # Entrypoint for CUDA builds
- build_common.sh # Actual build script that ^^ call into
- wheel/
- build_wheel.sh # Entrypoint for wheel builds
- windows/
- build_pytorch.bat # Entrypoint for wheel builds on Windows
```
Every type of package has an entrypoint build script that handles all the important logic.
## Conda
Linux, MacOS and Windows use the same code flow for the conda builds.
Conda packages are built with conda-build, see https://conda.io/projects/conda-build/en/latest/resources/commands/conda-build.html
Basically, you pass `conda build` a build folder (pytorch-nightly/ above) that contains a build script and a meta.yaml. The meta.yaml specifies what python environment to build the package in and what dependencies the resulting package should have, and the build script gets called in that env to build the thing.
tl;dr on conda-build is
1. Creates a brand new conda environment, based off of deps in the meta.yaml
1. Note that environment variables do not get passed into this build env unless they are specified in the meta.yaml
2. If the build fails this environment will stick around. You can activate it for much easier debugging. The “General Python” section below explains what exactly a python “environment” is.
2. Calls build.sh in the environment
3. Copies the finished package to a new conda env, also specified by the meta.yaml
4. Runs some simple import tests (if specified in the meta.yaml)
5. Saves the finished package as a tarball
The build.sh we use is essentially a wrapper around `python setup.py build`, but it also manually copies in some of our dependent libraries into the resulting tarball and messes with some rpaths.
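For orientation, a minimal sketch of the underlying conda-build invocation this flow corresponds to; the python version is an illustrative assumption, and the real entrypoint is `builder/conda/build_conda.sh` (described next):
```sh
# Minimal sketch: conda-build reads pytorch-nightly/meta.yaml, creates the
# build env, runs build.sh inside it, and saves the finished package tarball.
cd builder/conda             # assumes the pytorch/builder layout shown above
conda install -y conda-build
conda build pytorch-nightly/ --python 3.8
```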
The entrypoint file `builder/conda/build_conda.sh` is complicated because
* It works for Linux, MacOS and Windows
* The mac builds used to create their own environments, since they all used to be on the same machine. There's now a lot of extra logic to handle conda envs. This extra machinery could be removed
* It used to handle testing too, which adds more logic for messing with python environments. This extra machinery could be removed.
## Manywheels (linux pip and libtorch packages)
Manywheels are pip packages for linux distros. Note that these manywheels are not actually manylinux compliant.
`builder/manywheel/build_cpu.sh` and `builder/manywheel/build.sh` (for CUDA builds) just set different env vars and then call into `builder/manywheel/build_common.sh`
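A hedged sketch of that delegation pattern (variable and file names are illustrative, not the scripts' exact contents):
```sh
# Sketch only: a build_cpu.sh-style wrapper pins the variant via env vars,
# then hands everything else to the shared script.
export DESIRED_CUDA=cpu                    # a CUDA wrapper would export a cuXYZ value instead
source "$(dirname "$0")/build_common.sh"   # shared build logic lives here
```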
The entrypoint file `builder/manywheel/build_common.sh` is really really complicated because
* This used to handle building for several different python versions at the same time. The loops have been removed, but there are still unnecessary folders and movements here and there.
* The script is never used this way anymore. This extra machinery could be removed.
* This used to handle testing the pip packages too. This is why there's testing code at the end that messes with python installations and stuff
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* This should really be separate. libtorch packages are c++ only and have no python. They should not share infra with all the python specific stuff in this file.
* There is a lot of messing with rpaths. This is necessary, but could be made much much simpler if the above issues were fixed.
## Wheels (MacOS pip and libtorch packages)
The entrypoint file `builder/wheel/build_wheel.sh` is complicated because
* The mac builds used to all run on one machine (we didn't have autoscaling mac machines until CircleCI). So this script handled siloing itself by setting up and tearing down its build env and siloing itself into its own build directory.
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* Ditto the comment above. This should definitely be separated out.
Note that the MacOS Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda.
## Windows Wheels (Windows pip and libtorch packages)
The entrypoint file `builder/windows/build_pytorch.bat` is complicated because
* This used to handle building for several different python versions at the same time. This is why there are loops everywhere
* The script is never used this way anymore. This extra machinery could be removed.
* This used to handle testing the pip packages too. This is why there's testing code at the end that messes with python installations and stuff
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* This should really be separate. libtorch packages are c++ only and have no python. They should not share infra with all the python specific stuff in this file.
Note that the Windows Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda.
## General notes
### Note on run_tests.sh, smoke_test.sh, and check_binary.sh
* These should all be consolidated
* These must run on all OS types: MacOS, Linux, and Windows
* These all run smoke tests at the moment. They inspect the packages some, maybe run a few import statements. They DO NOT run the python tests nor the cpp tests. The idea is that python tests on main and PR merges will catch all breakages. All these tests have to do is make sure the special binary machinery didn't mess anything up. A minimal sketch of such a check is shown after this list.
* There are separate run_tests.sh and smoke_test.sh because one used to be called by the smoke jobs and one used to be called by the binary test jobs (see circleci structure section above). This is still true actually, but these could be united into a single script that runs these checks, given an installed pytorch package.
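A minimal sketch of the kind of check these scripts boil down to (not their actual contents), assuming a pytorch package is already installed:
```sh
# Hypothetical smoke check: verify the installed package imports, reports
# sane metadata, and can run a trivial op.
python -c "import torch; print(torch.__version__)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import torch; x = torch.rand(2, 3); print((x @ x.t()).shape)"
```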
### Note on libtorch
Libtorch packages are built in the wheel build scripts: manywheel/build_*.sh for linux and build_wheel.sh for mac. There are several things wrong with this
* It's confusing. Most of those scripts deal with python specifics.
* The extra conditionals everywhere severely complicate the wheel build scripts
* The process for building libtorch is different from the official instructions (a plain call to cmake, or a call to a script)
### Note on docker images / Dockerfiles
All linux builds occur in docker images. The docker images are
* pytorch/conda-cuda
* Has ALL CUDA versions installed. The script pytorch/builder/conda/switch_cuda_version.sh sets /usr/local/cuda to a symlink to e.g. /usr/local/cuda-10.0 to enable different CUDA builds (a sketch of this switch is shown after this list)
* Also used for cpu builds
* pytorch/manylinux-cuda90
* pytorch/manylinux-cuda100
* Also used for cpu builds
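A sketch of the kind of symlink switch switch_cuda_version.sh performs; the version below is illustrative and the real script does more:
```sh
# Hypothetical sketch: point /usr/local/cuda at the desired toolkit so builds
# that reference /usr/local/cuda pick up that CUDA version.
CUDA_VERSION=10.0
ln -sfn "/usr/local/cuda-${CUDA_VERSION}" /usr/local/cuda
ls -l /usr/local/cuda   # now resolves to /usr/local/cuda-10.0
```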
The Dockerfiles are available in pytorch/builder, but there is no circleci job or script to build these docker images, and they cannot be run locally (unless you have the correct local packages/paths). Only Soumith can build them right now.
### General Python
* This is still a good explanation of python installations https://caffe2.ai/docs/faq.html#why-do-i-get-import-errors-in-python-when-i-try-to-use-caffe2
# How to manually rebuild the binaries
tl;dr make a PR that looks like https://github.com/pytorch/pytorch/pull/21159
Sometimes we want to push a change to main and then rebuild all of today's binaries after that change. As of May 30, 2019 there isn't a way to manually run a workflow in the UI. You can manually re-run a workflow, but it will use the exact same git commits as the first run and will not include any changes. So we have to make a PR and then force CircleCI to run the binary workflow instead of the normal tests. The above PR is an example of how to do this; essentially you copy-paste the binarybuilds workflow steps into the default workflow steps. If you need to point the builder repo to a different commit then you'd need to change https://github.com/pytorch/pytorch/blob/main/.circleci/scripts/binary_checkout.sh#L42-L45 to checkout what you want.
## How to test changes to the binaries via .circleci
Writing PRs that test the binaries is annoying, since the default circleci jobs that run on PRs are not the jobs that you want to run. Likely, changes to the binaries will touch something under .circleci/ and require that .circleci/config.yml be regenerated (.circleci/config.yml controls all .circleci behavior, and is generated using `.circleci/regenerate.sh` in python 3.7). But you also need to manually hardcode the binary jobs that you want to test into the .circleci/config.yml workflow, so you should actually make at least two commits, one for your changes and one to temporarily hardcode jobs. See https://github.com/pytorch/pytorch/pull/22928 as an example of how to do this.
```sh
# Make your changes
touch .circleci/verbatim-sources/nightly-binary-build-defaults.yml
# Regenerate the yaml, has to be in python 3.7
.circleci/regenerate.sh
# Make a commit
git add .circleci *
git commit -m "My real changes"
git push origin my_branch
# Now hardcode the jobs that you want in the .circleci/config.yml workflows section
# Also eliminate ensure-consistency and should_run_job checks
# e.g. https://github.com/pytorch/pytorch/commit/2b3344bfed8772fe86e5210cc4ee915dee42b32d
# Make a commit you won't keep
git add .circleci
git commit -m "[DO NOT LAND] testing binaries for above changes"
git push origin my_branch
# Now you need to make some changes to the first commit.
git rebase -i HEAD~2 # mark the first commit as 'edit'
# Make the changes
touch .circleci/verbatim-sources/nightly-binary-build-defaults.yml
.circleci/regenerate.sh
# Amend the commit and continue the rebase
git add .circleci
git commit --amend
git rebase --continue
# Update the PR, need to force since the commits are different now
git push origin my_branch --force
```
The advantage of this flow is that you can make new changes to the base commit and regenerate the .circleci without having to re-write which binary jobs you want to test on. The downside is that all updates will be force pushes.
## How to build a binary locally
### Linux
You can build Linux binaries locally easily using docker.
```sh
# Run the docker
# Use the correct docker image, pytorch/conda-cuda used here as an example
#
# -v path/to/foo:path/to/bar makes path/to/foo on your local machine (the
# machine that you're running the command on) accessible to the docker
# container at path/to/bar. So if you then run `touch path/to/bar/baz`
# in the docker container then you will see path/to/foo/baz on your local
# machine. You could also clone the pytorch and builder repos in the docker.
#
# If you know how, add ccache as a volume too and speed up everything
docker run \
-v your/pytorch/repo:/pytorch \
-v your/builder/repo:/builder \
-v where/you/want/packages/to/appear:/final_pkgs \
-it pytorch/conda-cuda /bin/bash
# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=conda
export DESIRED_PYTHON=3.7
export DESIRED_CUDA=cpu
# Call the entrypoint
# `|& tee foo.log` just copies all stdout and stderr output to foo.log
# The builds generate lots of output so you probably need this when
# building locally.
/builder/conda/build_pytorch.sh |& tee build_output.log
```
**Building CUDA binaries on docker**
You can build CUDA binaries on CPU-only machines, but you can only run CUDA binaries on CUDA machines. This means that you can build a CUDA binary on a docker on your laptop if you so choose (though it's going to take a long time).
For Facebook employees, ask about beefy machines that have docker support and use those instead of your laptop; it will be 5x as fast.
### MacOS
There's no easy way to generate reproducible hermetic MacOS environments. If you have a Mac laptop then you can try emulating the .circleci environments as much as possible, but you probably have packages in /usr/local/, possibly installed by brew, that will probably interfere with the build. If you're trying to repro an error on a Mac build in .circleci and you can't seem to repro locally, then my best advice is actually to iterate on .circleci :/
But if you want to try, then I'd recommend
```sh
# Create a new terminal
# Clear your LD_LIBRARY_PATH and trim as much out of your PATH as you
# know how to do
# Install a new miniconda
# First remove any other python or conda installation from your PATH
# Always install miniconda 3, even if building for Python <3
new_conda="$HOME/my_new_conda"
conda_sh="$HOME/install_miniconda.sh"
curl -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
chmod +x "$conda_sh"
"$conda_sh" -b -p "$new_conda"
rm -f "$conda_sh"
export PATH="$new_conda/bin:$PATH"
# Create a clean python env
# All MacOS builds use conda to manage the python env and dependencies
# that are built with, even the pip packages
conda create -yn binary python=2.7
conda activate binary
# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=conda
export DESIRED_PYTHON=3.7
export DESIRED_CUDA=cpu
# Call the entrypoint you want
path/to/builder/wheel/build_wheel.sh
```
N.B. installing a brand new miniconda is important. This has to do with how conda installations work. See the “General Python” section above, but tldr; is that
1. You make the conda command accessible by prepending `path/to/conda_root/bin` to your PATH.
2. You make a new env and activate it, which then also gets prepended to your PATH. Now you have `path/to/conda_root/envs/new_env/bin:path/to/conda_root/bin:$PATH`
3. Now say you (or some code that you ran) call python executable `foo`
1. if you installed `foo` in `new_env`, then `path/to/conda_root/envs/new_env/bin/foo` will get called, as expected.
2. But if you forgot to install `foo` in `new_env` but happened to previously install it in your root conda env (called base), then unix/linux will still find `path/to/conda_root/bin/foo`. This is dangerous, since `foo` can be a different version than you want; `foo` can even be for an incompatible python version!
Newer conda versions and proper python hygiene can prevent this, but just install a new miniconda to be safe.
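A small illustration of that PATH layering; the paths are made up and `foo` stands for any python-installed executable:
```sh
# Hypothetical illustration of the resolution order described above.
export PATH="path/to/conda_root/bin:$PATH"                # conda itself on PATH
export PATH="path/to/conda_root/envs/new_env/bin:$PATH"   # activating new_env prepends its bin
command -v foo   # -> .../envs/new_env/bin/foo if foo is installed in new_env,
                 #    otherwise falls back silently to .../conda_root/bin/foo (base)
```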
### Windows
TODO: fill in
The PyTorch migration from CircleCI to GitHub Actions has been completed. All continuous integration & deployment workflows are defined in the `.github/workflows` folder.


@ -1,198 +0,0 @@
"""
This module models the tree of configuration variants
for "smoketest" builds.
Each subclass of ConfigNode represents a layer of the configuration hierarchy.
These tree nodes encapsulate the logic for whether a branch of the hierarchy
should be "pruned".
"""
from collections import OrderedDict
import cimodel.data.dimensions as dimensions
from cimodel.lib.conf_tree import ConfigNode
LINKING_DIMENSIONS = [
"shared",
"static",
]
DEPS_INCLUSION_DIMENSIONS = [
"with-deps",
"without-deps",
]
def get_processor_arch_name(gpu_version):
return (
"cpu"
if not gpu_version
else (
"cu" + gpu_version.strip("cuda")
if gpu_version.startswith("cuda")
else gpu_version
)
)
CONFIG_TREE_DATA = OrderedDict()
# GCC config variants:
#
# All the nightlies (except libtorch with new gcc ABI) are built with devtoolset7,
# which can only build with old gcc ABI. It is better than devtoolset3
# because it understands avx512, which is needed for good fbgemm performance.
#
# Libtorch with new gcc ABI is built with gcc 5.4 on Ubuntu 16.04.
LINUX_GCC_CONFIG_VARIANTS = OrderedDict(
manywheel=["devtoolset7"],
conda=["devtoolset7"],
libtorch=[
"devtoolset7",
"gcc5.4_cxx11-abi",
],
)
WINDOWS_LIBTORCH_CONFIG_VARIANTS = [
"debug",
"release",
]
class TopLevelNode(ConfigNode):
def __init__(self, node_name, config_tree_data, smoke):
super().__init__(None, node_name)
self.config_tree_data = config_tree_data
self.props["smoke"] = smoke
def get_children(self):
return [
OSConfigNode(self, x, c, p) for (x, (c, p)) in self.config_tree_data.items()
]
class OSConfigNode(ConfigNode):
def __init__(self, parent, os_name, gpu_versions, py_tree):
super().__init__(parent, os_name)
self.py_tree = py_tree
self.props["os_name"] = os_name
self.props["gpu_versions"] = gpu_versions
def get_children(self):
return [PackageFormatConfigNode(self, k, v) for k, v in self.py_tree.items()]
class PackageFormatConfigNode(ConfigNode):
def __init__(self, parent, package_format, python_versions):
super().__init__(parent, package_format)
self.props["python_versions"] = python_versions
self.props["package_format"] = package_format
def get_children(self):
if self.find_prop("os_name") == "linux":
return [
LinuxGccConfigNode(self, v)
for v in LINUX_GCC_CONFIG_VARIANTS[self.find_prop("package_format")]
]
elif (
self.find_prop("os_name") == "windows"
and self.find_prop("package_format") == "libtorch"
):
return [
WindowsLibtorchConfigNode(self, v)
for v in WINDOWS_LIBTORCH_CONFIG_VARIANTS
]
else:
return [ArchConfigNode(self, v) for v in self.find_prop("gpu_versions")]
class LinuxGccConfigNode(ConfigNode):
def __init__(self, parent, gcc_config_variant):
super().__init__(parent, "GCC_CONFIG_VARIANT=" + str(gcc_config_variant))
self.props["gcc_config_variant"] = gcc_config_variant
def get_children(self):
gpu_versions = self.find_prop("gpu_versions")
# XXX devtoolset7 on CUDA 9.0 is temporarily disabled
# see https://github.com/pytorch/pytorch/issues/20066
if self.find_prop("gcc_config_variant") == "devtoolset7":
gpu_versions = filter(lambda x: x != "cuda_90", gpu_versions)
# XXX disabling conda rocm build since docker images are not there
if self.find_prop("package_format") == "conda":
gpu_versions = filter(
lambda x: x not in dimensions.ROCM_VERSION_LABELS, gpu_versions
)
# XXX libtorch rocm build is temporarily disabled
if self.find_prop("package_format") == "libtorch":
gpu_versions = filter(
lambda x: x not in dimensions.ROCM_VERSION_LABELS, gpu_versions
)
return [ArchConfigNode(self, v) for v in gpu_versions]
class WindowsLibtorchConfigNode(ConfigNode):
def __init__(self, parent, libtorch_config_variant):
super().__init__(
parent, "LIBTORCH_CONFIG_VARIANT=" + str(libtorch_config_variant)
)
self.props["libtorch_config_variant"] = libtorch_config_variant
def get_children(self):
return [ArchConfigNode(self, v) for v in self.find_prop("gpu_versions")]
class ArchConfigNode(ConfigNode):
def __init__(self, parent, gpu):
super().__init__(parent, get_processor_arch_name(gpu))
self.props["gpu"] = gpu
def get_children(self):
return [PyVersionConfigNode(self, v) for v in self.find_prop("python_versions")]
class PyVersionConfigNode(ConfigNode):
def __init__(self, parent, pyver):
super().__init__(parent, pyver)
self.props["pyver"] = pyver
def get_children(self):
package_format = self.find_prop("package_format")
os_name = self.find_prop("os_name")
has_libtorch_variants = package_format == "libtorch" and os_name == "linux"
linking_variants = LINKING_DIMENSIONS if has_libtorch_variants else []
return [LinkingVariantConfigNode(self, v) for v in linking_variants]
class LinkingVariantConfigNode(ConfigNode):
def __init__(self, parent, linking_variant):
super().__init__(parent, linking_variant)
def get_children(self):
return [
DependencyInclusionConfigNode(self, v) for v in DEPS_INCLUSION_DIMENSIONS
]
class DependencyInclusionConfigNode(ConfigNode):
def __init__(self, parent, deps_variant):
super().__init__(parent, deps_variant)
self.props["libtorch_variant"] = "-".join(
[self.parent.get_label(), self.get_label()]
)


@ -1,275 +0,0 @@
from collections import OrderedDict
import cimodel.data.binary_build_data as binary_build_data
import cimodel.data.simple.util.branch_filters as branch_filters
import cimodel.lib.conf_tree as conf_tree
import cimodel.lib.miniutils as miniutils
class Conf:
def __init__(
self,
os,
gpu_version,
pydistro,
parms,
smoke,
libtorch_variant,
gcc_config_variant,
libtorch_config_variant,
):
self.os = os
self.gpu_version = gpu_version
self.pydistro = pydistro
self.parms = parms
self.smoke = smoke
self.libtorch_variant = libtorch_variant
self.gcc_config_variant = gcc_config_variant
self.libtorch_config_variant = libtorch_config_variant
def gen_build_env_parms(self):
elems = (
[self.pydistro]
+ self.parms
+ [binary_build_data.get_processor_arch_name(self.gpu_version)]
)
if self.gcc_config_variant is not None:
elems.append(str(self.gcc_config_variant))
if self.libtorch_config_variant is not None:
elems.append(str(self.libtorch_config_variant))
return elems
def gen_docker_image(self):
if self.gcc_config_variant == "gcc5.4_cxx11-abi":
if self.gpu_version is None:
return miniutils.quote("pytorch/libtorch-cxx11-builder:cpu")
else:
return miniutils.quote(
f"pytorch/libtorch-cxx11-builder:{self.gpu_version}"
)
if self.pydistro == "conda":
if self.gpu_version is None:
return miniutils.quote("pytorch/conda-builder:cpu")
else:
return miniutils.quote(f"pytorch/conda-builder:{self.gpu_version}")
docker_word_substitution = {
"manywheel": "manylinux",
"libtorch": "manylinux",
}
docker_distro_prefix = miniutils.override(
self.pydistro, docker_word_substitution
)
# The cpu nightlies are built on the pytorch/manylinux-cuda102 docker image
# TODO cuda images should consolidate into tag-base images similar to rocm
alt_docker_suffix = (
"cuda102"
if not self.gpu_version
else (
"rocm:" + self.gpu_version.strip("rocm")
if self.gpu_version.startswith("rocm")
else self.gpu_version
)
)
docker_distro_suffix = (
alt_docker_suffix
if self.pydistro != "conda"
else ("cuda" if alt_docker_suffix.startswith("cuda") else "rocm")
)
return miniutils.quote(
"pytorch/" + docker_distro_prefix + "-" + docker_distro_suffix
)
def get_name_prefix(self):
return "smoke" if self.smoke else "binary"
def gen_build_name(self, build_or_test, nightly):
parts = [self.get_name_prefix(), self.os] + self.gen_build_env_parms()
if nightly:
parts.append("nightly")
if self.libtorch_variant:
parts.append(self.libtorch_variant)
if not self.smoke:
parts.append(build_or_test)
joined = "_".join(parts)
return joined.replace(".", "_")
def gen_workflow_job(self, phase, upload_phase_dependency=None, nightly=False):
job_def = OrderedDict()
job_def["name"] = self.gen_build_name(phase, nightly)
job_def["build_environment"] = miniutils.quote(
" ".join(self.gen_build_env_parms())
)
if self.smoke:
job_def["requires"] = [
"update_s3_htmls",
]
job_def["filters"] = branch_filters.gen_filter_dict(
branches_list=["postnightly"],
)
else:
filter_branch = r"/.*/"
job_def["filters"] = branch_filters.gen_filter_dict(
branches_list=[filter_branch],
tags_list=[branch_filters.RC_PATTERN],
)
if self.libtorch_variant:
job_def["libtorch_variant"] = miniutils.quote(self.libtorch_variant)
if phase == "test":
if not self.smoke:
job_def["requires"] = [self.gen_build_name("build", nightly)]
if not (self.smoke and self.os == "macos") and self.os != "windows":
job_def["docker_image"] = self.gen_docker_image()
# fix this. only works on cuda not rocm
if self.os != "windows" and self.gpu_version:
job_def["use_cuda_docker_runtime"] = miniutils.quote("1")
else:
if self.os == "linux" and phase != "upload":
job_def["docker_image"] = self.gen_docker_image()
if phase == "test":
if self.gpu_version:
if self.os == "windows":
job_def["executor"] = "windows-with-nvidia-gpu"
else:
job_def["resource_class"] = "gpu.medium"
os_name = miniutils.override(self.os, {"macos": "mac"})
job_name = "_".join([self.get_name_prefix(), os_name, phase])
return {job_name: job_def}
def gen_upload_job(self, phase, requires_dependency):
"""Generate binary_upload job for configuration
Output looks similar to:
- binary_upload:
name: binary_linux_manywheel_3_7m_cu113_devtoolset7_nightly_upload
context: org-member
requires: binary_linux_manywheel_3_7m_cu113_devtoolset7_nightly_test
filters:
branches:
only:
- nightly
tags:
only: /v[0-9]+(\\.[0-9]+)*-rc[0-9]+/
package_type: manywheel
upload_subfolder: cu113
"""
return {
"binary_upload": OrderedDict(
{
"name": self.gen_build_name(phase, nightly=True),
"context": "org-member",
"requires": [
self.gen_build_name(requires_dependency, nightly=True)
],
"filters": branch_filters.gen_filter_dict(
branches_list=["nightly"],
tags_list=[branch_filters.RC_PATTERN],
),
"package_type": self.pydistro,
"upload_subfolder": binary_build_data.get_processor_arch_name(
self.gpu_version,
),
}
)
}
def get_root(smoke, name):
return binary_build_data.TopLevelNode(
name,
binary_build_data.CONFIG_TREE_DATA,
smoke,
)
def gen_build_env_list(smoke):
root = get_root(smoke, "N/A")
config_list = conf_tree.dfs(root)
newlist = []
for c in config_list:
conf = Conf(
c.find_prop("os_name"),
c.find_prop("gpu"),
c.find_prop("package_format"),
[c.find_prop("pyver")],
c.find_prop("smoke")
and not (c.find_prop("os_name") == "macos_arm64"), # don't test arm64
c.find_prop("libtorch_variant"),
c.find_prop("gcc_config_variant"),
c.find_prop("libtorch_config_variant"),
)
newlist.append(conf)
return newlist
def predicate_exclude_macos(config):
return config.os == "linux" or config.os == "windows"
def get_nightly_uploads():
configs = gen_build_env_list(False)
mylist = []
for conf in configs:
phase_dependency = "test" if predicate_exclude_macos(conf) else "build"
mylist.append(conf.gen_upload_job("upload", phase_dependency))
return mylist
def get_post_upload_jobs():
return [
{
"update_s3_htmls": {
"name": "update_s3_htmls",
"context": "org-member",
"filters": branch_filters.gen_filter_dict(
branches_list=["postnightly"],
),
},
},
]
def get_nightly_tests():
configs = gen_build_env_list(False)
filtered_configs = filter(predicate_exclude_macos, configs)
tests = []
for conf_options in filtered_configs:
yaml_item = conf_options.gen_workflow_job("test", nightly=True)
tests.append(yaml_item)
return tests
def get_jobs(toplevel_key, smoke):
jobs_list = []
configs = gen_build_env_list(smoke)
phase = "build" if toplevel_key == "binarybuilds" else "test"
for build_config in configs:
# don't test for macos_arm64 as it's cross compiled
if phase != "test" or build_config.os != "macos_arm64":
jobs_list.append(build_config.gen_workflow_job(phase, nightly=True))
return jobs_list
def get_binary_build_jobs():
return get_jobs("binarybuilds", False)
def get_binary_smoke_test_jobs():
return get_jobs("binarysmoketests", True)


@ -1,19 +0,0 @@
PHASES = ["build", "test"]
CUDA_VERSIONS = [
"102",
"113",
"116",
"117",
]
ROCM_VERSIONS = [
"4.3.1",
"4.5.2",
]
ROCM_VERSION_LABELS = ["rocm" + v for v in ROCM_VERSIONS]
GPU_VERSIONS = [None] + ["cuda" + v for v in CUDA_VERSIONS] + ROCM_VERSION_LABELS
STANDARD_PYTHON_VERSIONS = ["3.7", "3.8", "3.9", "3.10"]


@ -1,296 +0,0 @@
from cimodel.lib.conf_tree import ConfigNode
CONFIG_TREE_DATA = []
def get_major_pyver(dotted_version):
parts = dotted_version.split(".")
return "py" + parts[0]
class TreeConfigNode(ConfigNode):
def __init__(self, parent, node_name, subtree):
super().__init__(parent, self.modify_label(node_name))
self.subtree = subtree
self.init2(node_name)
def modify_label(self, label):
return label
def init2(self, node_name):
pass
def get_children(self):
return [self.child_constructor()(self, k, v) for (k, v) in self.subtree]
class TopLevelNode(TreeConfigNode):
def __init__(self, node_name, subtree):
super().__init__(None, node_name, subtree)
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return DistroConfigNode
class DistroConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["distro_name"] = node_name
def child_constructor(self):
distro = self.find_prop("distro_name")
next_nodes = {
"xenial": XenialCompilerConfigNode,
"bionic": BionicCompilerConfigNode,
}
return next_nodes[distro]
class PyVerConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["pyver"] = node_name
self.props["abbreviated_pyver"] = get_major_pyver(node_name)
if node_name == "3.9":
self.props["abbreviated_pyver"] = "py3.9"
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return ExperimentalFeatureConfigNode
class ExperimentalFeatureConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["experimental_feature"] = node_name
def child_constructor(self):
experimental_feature = self.find_prop("experimental_feature")
next_nodes = {
"asan": AsanConfigNode,
"xla": XlaConfigNode,
"mps": MPSConfigNode,
"vulkan": VulkanConfigNode,
"parallel_tbb": ParallelTBBConfigNode,
"crossref": CrossRefConfigNode,
"dynamo": DynamoConfigNode,
"parallel_native": ParallelNativeConfigNode,
"onnx": ONNXConfigNode,
"libtorch": LibTorchConfigNode,
"important": ImportantConfigNode,
"build_only": BuildOnlyConfigNode,
"shard_test": ShardTestConfigNode,
"cuda_gcc_override": CudaGccOverrideConfigNode,
"pure_torch": PureTorchConfigNode,
"slow_gradcheck": SlowGradcheckConfigNode,
}
return next_nodes[experimental_feature]
class SlowGradcheckConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["is_slow_gradcheck"] = True
def child_constructor(self):
return ExperimentalFeatureConfigNode
class PureTorchConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PURE_TORCH=" + str(label)
def init2(self, node_name):
self.props["is_pure_torch"] = node_name
def child_constructor(self):
return ImportantConfigNode
class XlaConfigNode(TreeConfigNode):
def modify_label(self, label):
return "XLA=" + str(label)
def init2(self, node_name):
self.props["is_xla"] = node_name
def child_constructor(self):
return ImportantConfigNode
class MPSConfigNode(TreeConfigNode):
def modify_label(self, label):
return "MPS=" + str(label)
def init2(self, node_name):
self.props["is_mps"] = node_name
def child_constructor(self):
return ImportantConfigNode
class AsanConfigNode(TreeConfigNode):
def modify_label(self, label):
return "Asan=" + str(label)
def init2(self, node_name):
self.props["is_asan"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class ONNXConfigNode(TreeConfigNode):
def modify_label(self, label):
return "Onnx=" + str(label)
def init2(self, node_name):
self.props["is_onnx"] = node_name
def child_constructor(self):
return ImportantConfigNode
class VulkanConfigNode(TreeConfigNode):
def modify_label(self, label):
return "Vulkan=" + str(label)
def init2(self, node_name):
self.props["is_vulkan"] = node_name
def child_constructor(self):
return ImportantConfigNode
class ParallelTBBConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PARALLELTBB=" + str(label)
def init2(self, node_name):
self.props["parallel_backend"] = "paralleltbb"
def child_constructor(self):
return ImportantConfigNode
class CrossRefConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["is_crossref"] = node_name
def child_constructor(self):
return ImportantConfigNode
class DynamoConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["is_dynamo"] = node_name
def child_constructor(self):
return ImportantConfigNode
class ParallelNativeConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PARALLELNATIVE=" + str(label)
def init2(self, node_name):
self.props["parallel_backend"] = "parallelnative"
def child_constructor(self):
return ImportantConfigNode
class LibTorchConfigNode(TreeConfigNode):
def modify_label(self, label):
return "BUILD_TEST_LIBTORCH=" + str(label)
def init2(self, node_name):
self.props["is_libtorch"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class CudaGccOverrideConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["cuda_gcc_override"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class BuildOnlyConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["build_only"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class ShardTestConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["shard_test"] = node_name
def child_constructor(self):
return ImportantConfigNode
class ImportantConfigNode(TreeConfigNode):
def modify_label(self, label):
return "IMPORTANT=" + str(label)
def init2(self, node_name):
self.props["is_important"] = node_name
def get_children(self):
return []
class XenialCompilerConfigNode(TreeConfigNode):
def modify_label(self, label):
return label or "<unspecified>"
def init2(self, node_name):
self.props["compiler_name"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return (
XenialCompilerVersionConfigNode
if self.props["compiler_name"]
else PyVerConfigNode
)
class BionicCompilerConfigNode(TreeConfigNode):
def modify_label(self, label):
return label or "<unspecified>"
def init2(self, node_name):
self.props["compiler_name"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return (
BionicCompilerVersionConfigNode
if self.props["compiler_name"]
else PyVerConfigNode
)
class XenialCompilerVersionConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["compiler_version"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return PyVerConfigNode
class BionicCompilerVersionConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["compiler_version"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return PyVerConfigNode

@@ -1,382 +0,0 @@
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import List, Optional
import cimodel.data.dimensions as dimensions
import cimodel.lib.conf_tree as conf_tree
import cimodel.lib.miniutils as miniutils
from cimodel.data.pytorch_build_data import CONFIG_TREE_DATA, TopLevelNode
from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN
from cimodel.data.simple.util.docker_constants import gen_docker_image
@dataclass
class Conf:
distro: str
parms: List[str]
parms_list_ignored_for_docker_image: Optional[List[str]] = None
pyver: Optional[str] = None
cuda_version: Optional[str] = None
rocm_version: Optional[str] = None
# TODO expand this to cover all the USE_* that we want to test for
# tensorrt, leveldb, lmdb, redis, opencv, mkldnn, ideep, etc.
# (from https://github.com/pytorch/pytorch/pull/17323#discussion_r259453608)
is_xla: bool = False
is_vulkan: bool = False
is_pure_torch: bool = False
restrict_phases: Optional[List[str]] = None
gpu_resource: Optional[str] = None
dependent_tests: List = field(default_factory=list)
parent_build: Optional["Conf"] = None
is_libtorch: bool = False
is_important: bool = False
parallel_backend: Optional[str] = None
build_only: bool = False
@staticmethod
def is_test_phase(phase):
return "test" in phase
# TODO: Eliminate the special casing for docker paths
# In the short term, we *will* need to support special casing as docker images are merged for caffe2 and pytorch
def get_parms(self, for_docker):
leading = []
# We just don't run non-important jobs on pull requests;
# previously we also named them in a way to make it obvious
# if self.is_important and not for_docker:
# leading.append("AAA")
leading.append("pytorch")
if self.is_xla and not for_docker:
leading.append("xla")
if self.is_vulkan and not for_docker:
leading.append("vulkan")
if self.is_libtorch and not for_docker:
leading.append("libtorch")
if self.is_pure_torch and not for_docker:
leading.append("pure_torch")
if self.parallel_backend is not None and not for_docker:
leading.append(self.parallel_backend)
cuda_parms = []
if self.cuda_version:
cudnn = "cudnn8" if self.cuda_version.startswith("11.") else "cudnn7"
cuda_parms.extend(["cuda" + self.cuda_version, cudnn])
if self.rocm_version:
cuda_parms.extend([f"rocm{self.rocm_version}"])
result = leading + ["linux", self.distro] + cuda_parms + self.parms
if not for_docker and self.parms_list_ignored_for_docker_image is not None:
result = result + self.parms_list_ignored_for_docker_image
return result
def gen_docker_image_path(self):
parms_source = self.parent_build or self
base_build_env_name = "-".join(parms_source.get_parms(True))
image_name, _ = gen_docker_image(base_build_env_name)
return miniutils.quote(image_name)
def gen_docker_image_requires(self):
parms_source = self.parent_build or self
base_build_env_name = "-".join(parms_source.get_parms(True))
_, requires = gen_docker_image(base_build_env_name)
return miniutils.quote(requires)
def get_build_job_name_pieces(self, build_or_test):
return self.get_parms(False) + [build_or_test]
def gen_build_name(self, build_or_test):
return (
("_".join(map(str, self.get_build_job_name_pieces(build_or_test))))
.replace(".", "_")
.replace("-", "_")
)
def get_dependents(self):
return self.dependent_tests or []
def gen_workflow_params(self, phase):
parameters = OrderedDict()
build_job_name_pieces = self.get_build_job_name_pieces(phase)
build_env_name = "-".join(map(str, build_job_name_pieces))
parameters["build_environment"] = miniutils.quote(build_env_name)
parameters["docker_image"] = self.gen_docker_image_path()
if Conf.is_test_phase(phase) and self.gpu_resource:
parameters["use_cuda_docker_runtime"] = miniutils.quote("1")
if Conf.is_test_phase(phase):
resource_class = "large"
if self.gpu_resource:
resource_class = "gpu." + self.gpu_resource
if self.rocm_version is not None:
resource_class = "pytorch/amd-gpu"
parameters["resource_class"] = resource_class
if phase == "build" and self.rocm_version is not None:
parameters["resource_class"] = "xlarge"
if hasattr(self, "filters"):
parameters["filters"] = self.filters
if self.build_only:
parameters["build_only"] = miniutils.quote(str(int(True)))
return parameters
def gen_workflow_job(self, phase):
job_def = OrderedDict()
job_def["name"] = self.gen_build_name(phase)
if Conf.is_test_phase(phase):
# TODO When merging the caffe2 and pytorch jobs, it might be convenient for a while to make a
# caffe2 test job dependent on a pytorch build job. This way we could quickly dedup the repeated
# build of pytorch in the caffe2 build job, and just run the caffe2 tests off of a completed
# pytorch build job (from https://github.com/pytorch/pytorch/pull/17323#discussion_r259452641)
dependency_build = self.parent_build or self
job_def["requires"] = [dependency_build.gen_build_name("build")]
job_name = "pytorch_linux_test"
else:
job_name = "pytorch_linux_build"
job_def["requires"] = [self.gen_docker_image_requires()]
if not self.is_important:
job_def["filters"] = gen_filter_dict()
job_def.update(self.gen_workflow_params(phase))
return {job_name: job_def}
# TODO This is a hack to special case some configs just for the workflow list
class HiddenConf:
def __init__(self, name, parent_build=None, filters=None):
self.name = name
self.parent_build = parent_build
self.filters = filters
def gen_workflow_job(self, phase):
return {
self.gen_build_name(phase): {
"requires": [self.parent_build.gen_build_name("build")],
"filters": self.filters,
}
}
def gen_build_name(self, _):
return self.name
class DocPushConf:
def __init__(self, name, parent_build=None, branch="master"):
self.name = name
self.parent_build = parent_build
self.branch = branch
def gen_workflow_job(self, phase):
return {
"pytorch_doc_push": {
"name": self.name,
"branch": self.branch,
"requires": [self.parent_build],
"context": "org-member",
"filters": gen_filter_dict(
branches_list=["nightly"], tags_list=RC_PATTERN
),
}
}
def gen_docs_configs(xenial_parent_config):
configs = []
configs.append(
HiddenConf(
"pytorch_python_doc_build",
parent_build=xenial_parent_config,
filters=gen_filter_dict(
branches_list=["master", "main", "nightly"], tags_list=RC_PATTERN
),
)
)
configs.append(
DocPushConf(
"pytorch_python_doc_push",
parent_build="pytorch_python_doc_build",
branch="site",
)
)
configs.append(
HiddenConf(
"pytorch_cpp_doc_build",
parent_build=xenial_parent_config,
filters=gen_filter_dict(
branches_list=["master", "main", "nightly"], tags_list=RC_PATTERN
),
)
)
configs.append(
DocPushConf(
"pytorch_cpp_doc_push",
parent_build="pytorch_cpp_doc_build",
branch="master",
)
)
return configs
def get_root():
return TopLevelNode("PyTorch Builds", CONFIG_TREE_DATA)
def gen_tree():
root = get_root()
configs_list = conf_tree.dfs(root)
return configs_list
def instantiate_configs(only_slow_gradcheck):
config_list = []
root = get_root()
found_configs = conf_tree.dfs(root)
for fc in found_configs:
restrict_phases = None
distro_name = fc.find_prop("distro_name")
compiler_name = fc.find_prop("compiler_name")
compiler_version = fc.find_prop("compiler_version")
is_xla = fc.find_prop("is_xla") or False
is_asan = fc.find_prop("is_asan") or False
is_crossref = fc.find_prop("is_crossref") or False
is_dynamo = fc.find_prop("is_dynamo") or False
is_onnx = fc.find_prop("is_onnx") or False
is_pure_torch = fc.find_prop("is_pure_torch") or False
is_vulkan = fc.find_prop("is_vulkan") or False
is_slow_gradcheck = fc.find_prop("is_slow_gradcheck") or False
parms_list_ignored_for_docker_image = []
if only_slow_gradcheck ^ is_slow_gradcheck:
continue
python_version = None
if compiler_name == "cuda" or compiler_name == "android":
python_version = fc.find_prop("pyver")
parms_list = [fc.find_prop("abbreviated_pyver")]
else:
parms_list = ["py" + fc.find_prop("pyver")]
cuda_version = None
rocm_version = None
if compiler_name == "cuda":
cuda_version = fc.find_prop("compiler_version")
elif compiler_name == "rocm":
rocm_version = fc.find_prop("compiler_version")
restrict_phases = ["build", "test1", "test2", "caffe2_test"]
elif compiler_name == "android":
android_ndk_version = fc.find_prop("compiler_version")
# TODO: do we need clang to compile host binaries like protoc?
parms_list.append("clang5")
parms_list.append("android-ndk-" + android_ndk_version)
android_abi = fc.find_prop("android_abi")
parms_list_ignored_for_docker_image.append(android_abi)
restrict_phases = ["build"]
elif compiler_name:
gcc_version = compiler_name + (fc.find_prop("compiler_version") or "")
parms_list.append(gcc_version)
if is_asan:
parms_list.append("asan")
python_version = fc.find_prop("pyver")
parms_list[0] = fc.find_prop("abbreviated_pyver")
if is_crossref:
parms_list_ignored_for_docker_image.append("crossref")
if is_dynamo:
parms_list_ignored_for_docker_image.append("dynamo")
if is_onnx:
parms_list.append("onnx")
python_version = fc.find_prop("pyver")
parms_list[0] = fc.find_prop("abbreviated_pyver")
restrict_phases = ["build", "ort_test1", "ort_test2"]
if cuda_version:
cuda_gcc_version = fc.find_prop("cuda_gcc_override") or "gcc7"
parms_list.append(cuda_gcc_version)
is_libtorch = fc.find_prop("is_libtorch") or False
is_important = fc.find_prop("is_important") or False
parallel_backend = fc.find_prop("parallel_backend") or None
build_only = fc.find_prop("build_only") or False
shard_test = fc.find_prop("shard_test") or False
# TODO: fix pure_torch python test packaging issue.
if shard_test:
restrict_phases = ["build"] if restrict_phases is None else restrict_phases
restrict_phases.extend(["test1", "test2"])
if build_only or is_pure_torch:
restrict_phases = ["build"]
if is_slow_gradcheck:
parms_list_ignored_for_docker_image.append("old")
parms_list_ignored_for_docker_image.append("gradcheck")
gpu_resource = None
if cuda_version and cuda_version != "10":
gpu_resource = "medium"
c = Conf(
distro_name,
parms_list,
parms_list_ignored_for_docker_image,
python_version,
cuda_version,
rocm_version,
is_xla,
is_vulkan,
is_pure_torch,
restrict_phases,
gpu_resource,
is_libtorch=is_libtorch,
is_important=is_important,
parallel_backend=parallel_backend,
build_only=build_only,
)
# run docs builds on "pytorch-linux-xenial-py3.7-gcc5.4". Docs builds
# should run on a CPU-only build that runs on all PRs.
# XXX should this be updated to a more modern build?
if (
distro_name == "xenial"
and fc.find_prop("pyver") == "3.7"
and cuda_version is None
and parallel_backend is None
and not is_vulkan
and not is_pure_torch
and compiler_name == "gcc"
and fc.find_prop("compiler_version") == "5.4"
):
c.filters = gen_filter_dict(branches_list=r"/.*/", tags_list=RC_PATTERN)
c.dependent_tests = gen_docs_configs(c)
config_list.append(c)
return config_list
def get_workflow_jobs(only_slow_gradcheck=False):
config_list = instantiate_configs(only_slow_gradcheck)
x = []
for conf_options in config_list:
phases = conf_options.restrict_phases or dimensions.PHASES
for phase in phases:
# TODO why does this not have a test?
if Conf.is_test_phase(phase) and conf_options.cuda_version == "10":
continue
x.append(conf_options.gen_workflow_job(phase))
# TODO convert to recursion
for conf in conf_options.get_dependents():
x.append(conf.gen_workflow_job("test"))
return x
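
To make the naming scheme above concrete, here is a small illustrative sketch (not part of the original file), assuming the Conf dataclass above is importable; the distro and parms values are made up:

```
c = Conf(distro="xenial", parms=["py3.7", "gcc5.4"])
c.get_parms(for_docker=False)  # ['pytorch', 'linux', 'xenial', 'py3.7', 'gcc5.4']
c.gen_build_name("build")      # 'pytorch_linux_xenial_py3_7_gcc5_4_build'
c.gen_build_name("test")       # 'pytorch_linux_xenial_py3_7_gcc5_4_test'
```

Dots and dashes are folded into underscores so the result is a valid CircleCI job name.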

@@ -1,39 +0,0 @@
from collections import OrderedDict
from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN
from cimodel.lib.miniutils import quote
# NOTE: All hardcoded docker image builds have been migrated to GHA
IMAGE_NAMES = []
# This entry should be an element from the list above
# This should contain the image matching the "slow_gradcheck" entry in
# pytorch_build_data.py
SLOW_GRADCHECK_IMAGE_NAME = "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
def get_workflow_jobs(images=IMAGE_NAMES, only_slow_gradcheck=False):
"""Generates a list of docker image build definitions"""
ret = []
for image_name in images:
if image_name.startswith("docker-"):
# lstrip() strips a character set, not a prefix; slice off the literal "docker-" prefix instead
image_name = image_name[len("docker-"):]
if only_slow_gradcheck and image_name != SLOW_GRADCHECK_IMAGE_NAME:
continue
parameters = OrderedDict(
{
"name": quote(f"docker-{image_name}"),
"image_name": quote(image_name),
}
)
if image_name == "pytorch-linux-xenial-py3.7-gcc5.4":
# pushing documentation on tags requires CircleCI to also
# build all the dependencies on tags, including this docker image
parameters["filters"] = gen_filter_dict(
branches_list=r"/.*/", tags_list=RC_PATTERN
)
ret.append(OrderedDict({"docker_build_job": parameters}))
return ret
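
Since IMAGE_NAMES is now empty (all hardcoded image builds moved to GHA), the generator only emits jobs for images passed in explicitly. A hedged sketch with a hypothetical image name:

```
get_workflow_jobs(images=["pytorch-linux-xenial-py3-clang5-asan"])
# [OrderedDict([('docker_build_job',
#                OrderedDict([('name', '"docker-pytorch-linux-xenial-py3-clang5-asan"'),
#                             ('image_name', '"pytorch-linux-xenial-py3-clang5-asan"')]))])]
```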

@@ -1,100 +0,0 @@
import cimodel.lib.miniutils as miniutils
from cimodel.data.simple.util.branch_filters import gen_filter_dict_exclude
from cimodel.data.simple.util.versions import MultiPartVersion
XCODE_VERSION = MultiPartVersion([12, 5, 1])
class ArchVariant:
def __init__(self, name, custom_build_name=""):
self.name = name
self.custom_build_name = custom_build_name
def render(self):
extra_parts = (
[self.custom_build_name] if len(self.custom_build_name) > 0 else []
)
return "-".join([self.name] + extra_parts).replace("_", "-")
def get_platform(arch_variant_name):
return "SIMULATOR" if arch_variant_name == "x86_64" else "OS"
class IOSJob:
def __init__(
self, xcode_version, arch_variant, is_org_member_context=True, extra_props=None
):
self.xcode_version = xcode_version
self.arch_variant = arch_variant
self.is_org_member_context = is_org_member_context
self.extra_props = extra_props
def gen_name_parts(self):
version_parts = self.xcode_version.render_dots_or_parts("-")
build_variant_suffix = self.arch_variant.render()
return (
[
"ios",
]
+ version_parts
+ [
build_variant_suffix,
]
)
def gen_job_name(self):
return "-".join(self.gen_name_parts())
def gen_tree(self):
platform_name = get_platform(self.arch_variant.name)
props_dict = {
"name": self.gen_job_name(),
"build_environment": self.gen_job_name(),
"ios_arch": self.arch_variant.name,
"ios_platform": platform_name,
}
if self.is_org_member_context:
props_dict["context"] = "org-member"
if self.extra_props:
props_dict.update(self.extra_props)
props_dict["filters"] = gen_filter_dict_exclude()
return [{"pytorch_ios_build": props_dict}]
WORKFLOW_DATA = [
IOSJob(
XCODE_VERSION,
ArchVariant("x86_64"),
is_org_member_context=False,
extra_props={"lite_interpreter": miniutils.quote(str(int(True)))},
),
# IOSJob(XCODE_VERSION, ArchVariant("arm64"), extra_props={
# "lite_interpreter": miniutils.quote(str(int(True)))}),
# IOSJob(XCODE_VERSION, ArchVariant("arm64", "metal"), extra_props={
# "use_metal": miniutils.quote(str(int(True))),
# "lite_interpreter": miniutils.quote(str(int(True)))}),
# IOSJob(XCODE_VERSION, ArchVariant("arm64", "custom-ops"), extra_props={
# "op_list": "mobilenetv2.yaml",
# "lite_interpreter": miniutils.quote(str(int(True)))}),
IOSJob(
XCODE_VERSION,
ArchVariant("x86_64", "coreml"),
is_org_member_context=False,
extra_props={
"use_coreml": miniutils.quote(str(int(True))),
"lite_interpreter": miniutils.quote(str(int(True))),
},
),
# IOSJob(XCODE_VERSION, ArchVariant("arm64", "coreml"), extra_props={
# "use_coreml": miniutils.quote(str(int(True))),
# "lite_interpreter": miniutils.quote(str(int(True)))}),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]
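
For orientation, a few illustrative calls (not from the original file), assuming the definitions above:

```
ArchVariant("x86_64", "coreml").render()                     # 'x86-64-coreml'
get_platform("x86_64")                                       # 'SIMULATOR'
get_platform("arm64")                                        # 'OS'
IOSJob(XCODE_VERSION, ArchVariant("x86_64")).gen_job_name()  # 'ios-12-5-1-x86-64'
```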

@@ -1,54 +0,0 @@
class MacOsJob:
def __init__(self, os_version, is_build=False, is_test=False, extra_props=tuple()):
# extra_props is a tuple because mutable data structures are not recommended
# as argument defaults.
self.os_version = os_version
self.is_build = is_build
self.is_test = is_test
self.extra_props = dict(extra_props)
def gen_tree(self):
non_phase_parts = ["pytorch", "macos", self.os_version, "py3"]
extra_name_list = [name for name, exist in self.extra_props.items() if exist]
full_job_name_list = (
non_phase_parts
+ extra_name_list
+ [
"build" if self.is_build else None,
"test" if self.is_test else None,
]
)
full_job_name = "_".join(list(filter(None, full_job_name_list)))
test_build_dependency = "_".join(non_phase_parts + ["build"])
extra_dependencies = [test_build_dependency] if self.is_test else []
job_dependencies = extra_dependencies
# Yes we name the job after itself, it needs a non-empty value in here
# for the YAML output to work.
props_dict = {"requires": job_dependencies, "name": full_job_name}
return [{full_job_name: props_dict}]
WORKFLOW_DATA = [
MacOsJob("10_15", is_build=True),
MacOsJob("10_13", is_build=True),
MacOsJob(
"10_13",
is_build=False,
is_test=True,
),
MacOsJob(
"10_13",
is_build=True,
is_test=True,
extra_props=tuple({"lite_interpreter": True}.items()),
),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]
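
As a rough illustration of how a test job picks up its build dependency (values derived from the code above, not from the original file):

```
MacOsJob("10_13", is_test=True).gen_tree()
# [{'pytorch_macos_10_13_py3_test': {'requires': ['pytorch_macos_10_13_py3_build'],
#                                    'name': 'pytorch_macos_10_13_py3_test'}}]
```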

@@ -1,51 +0,0 @@
"""
PyTorch Mobile PR builds (use linux host toolchain + mobile build options)
"""
import cimodel.data.simple.util.branch_filters
import cimodel.lib.miniutils as miniutils
class MobileJob:
def __init__(
self, docker_image, docker_requires, variant_parts, is_master_only=False
):
self.docker_image = docker_image
self.docker_requires = docker_requires
self.variant_parts = variant_parts
self.is_master_only = is_master_only
def gen_tree(self):
non_phase_parts = [
"pytorch",
"linux",
"xenial",
"py3",
"clang5",
"mobile",
] + self.variant_parts
full_job_name = "_".join(non_phase_parts)
build_env_name = "-".join(non_phase_parts)
props_dict = {
"build_environment": build_env_name,
"build_only": miniutils.quote(str(int(True))),
"docker_image": self.docker_image,
"requires": self.docker_requires,
"name": full_job_name,
}
if self.is_master_only:
props_dict[
"filters"
] = cimodel.data.simple.util.branch_filters.gen_filter_dict()
return [{"pytorch_linux_build": props_dict}]
WORKFLOW_DATA = []
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]

@@ -1,96 +0,0 @@
import cimodel.data.simple.ios_definitions as ios_definitions
import cimodel.lib.miniutils as miniutils
class IOSNightlyJob:
def __init__(self, variant, is_full_jit=False, is_upload=False):
self.variant = variant
self.is_full_jit = is_full_jit
self.is_upload = is_upload
def get_phase_name(self):
return "upload" if self.is_upload else "build"
def get_common_name_pieces(self, sep):
extra_name_suffix = [self.get_phase_name()] if self.is_upload else []
extra_name = ["full_jit"] if self.is_full_jit else []
common_name_pieces = (
[
"ios",
]
+ extra_name
+ []
+ ios_definitions.XCODE_VERSION.render_dots_or_parts(sep)
+ [
"nightly",
self.variant,
"build",
]
+ extra_name_suffix
)
return common_name_pieces
def gen_job_name(self):
return "_".join(["pytorch"] + self.get_common_name_pieces(None))
def gen_tree(self):
build_configs = BUILD_CONFIGS_FULL_JIT if self.is_full_jit else BUILD_CONFIGS
extra_requires = (
[x.gen_job_name() for x in build_configs] if self.is_upload else []
)
props_dict = {
"build_environment": "-".join(
["libtorch"] + self.get_common_name_pieces(".")
),
"requires": extra_requires,
"context": "org-member",
"filters": {"branches": {"only": "nightly"}},
}
if not self.is_upload:
props_dict["ios_arch"] = self.variant
props_dict["ios_platform"] = ios_definitions.get_platform(self.variant)
props_dict["name"] = self.gen_job_name()
props_dict["use_metal"] = miniutils.quote(str(int(True)))
props_dict["use_coreml"] = miniutils.quote(str(int(True)))
if self.is_full_jit:
props_dict["lite_interpreter"] = miniutils.quote(str(int(False)))
template_name = "_".join(
[
"binary",
"ios",
self.get_phase_name(),
]
)
return [{template_name: props_dict}]
BUILD_CONFIGS = [
IOSNightlyJob("x86_64"),
IOSNightlyJob("arm64"),
]
BUILD_CONFIGS_FULL_JIT = [
IOSNightlyJob("x86_64", is_full_jit=True),
IOSNightlyJob("arm64", is_full_jit=True),
]
WORKFLOW_DATA = (
BUILD_CONFIGS
+ BUILD_CONFIGS_FULL_JIT
+ [
IOSNightlyJob("binary", is_full_jit=False, is_upload=True),
IOSNightlyJob("binary", is_full_jit=True, is_upload=True),
]
)
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]
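
A quick sketch of the names these jobs generate, assuming the classes above and XCODE_VERSION = 12.5.1 from ios_definitions:

```
IOSNightlyJob("x86_64").gen_job_name()
# 'pytorch_ios_12_5_1_nightly_x86_64_build'

"-".join(["libtorch"] + IOSNightlyJob("x86_64").get_common_name_pieces("."))
# 'libtorch-ios-12.5.1-nightly-x86_64-build'
```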

@@ -1,36 +0,0 @@
NON_PR_BRANCH_LIST = [
"main",
"master",
r"/ci-all\/.*/",
r"/release\/.*/",
]
PR_BRANCH_LIST = [
r"/gh\/.*\/head/",
r"/pull\/.*/",
]
RC_PATTERN = r"/v[0-9]+(\.[0-9]+)*-rc[0-9]+/"
MAC_IOS_EXCLUSION_LIST = ["nightly", "postnightly"]
def gen_filter_dict(branches_list=NON_PR_BRANCH_LIST, tags_list=None):
"""Generates a filter dictionary for use with CircleCI's job filter"""
filter_dict = {
"branches": {
"only": branches_list,
},
}
if tags_list is not None:
filter_dict["tags"] = {"only": tags_list}
return filter_dict
def gen_filter_dict_exclude(branches_list=MAC_IOS_EXCLUSION_LIST):
return {
"branches": {
"ignore": branches_list,
},
}
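
For illustration, the dictionaries these helpers produce (derived from the defaults above, not part of the original file):

```
gen_filter_dict()
# {'branches': {'only': ['main', 'master', '/ci-all\\/.*/', '/release\\/.*/']}}

gen_filter_dict(branches_list=["nightly"], tags_list=RC_PATTERN)
# {'branches': {'only': ['nightly']}, 'tags': {'only': '/v[0-9]+(\\.[0-9]+)*-rc[0-9]+/'}}

gen_filter_dict_exclude()
# {'branches': {'ignore': ['nightly', 'postnightly']}}
```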

@@ -1,35 +0,0 @@
AWS_DOCKER_HOST = "308535385114.dkr.ecr.us-east-1.amazonaws.com"
def gen_docker_image(container_type):
return (
"/".join([AWS_DOCKER_HOST, "pytorch", container_type]),
f"docker-{container_type}",
)
def gen_docker_image_requires(image_name):
return [f"docker-{image_name}"]
DOCKER_IMAGE_BASIC, DOCKER_REQUIREMENT_BASE = gen_docker_image(
"pytorch-linux-xenial-py3.7-gcc5.4"
)
DOCKER_IMAGE_CUDA_10_2, DOCKER_REQUIREMENT_CUDA_10_2 = gen_docker_image(
"pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
)
DOCKER_IMAGE_GCC7, DOCKER_REQUIREMENT_GCC7 = gen_docker_image(
"pytorch-linux-xenial-py3.7-gcc7"
)
def gen_mobile_docker(specifier):
container_type = "pytorch-linux-xenial-py3-clang5-" + specifier
return gen_docker_image(container_type)
DOCKER_IMAGE_ASAN, DOCKER_REQUIREMENT_ASAN = gen_mobile_docker("asan")
DOCKER_IMAGE_NDK, DOCKER_REQUIREMENT_NDK = gen_mobile_docker("android-ndk-r21e")
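
A couple of illustrative calls (outputs reconstructed from the constants above, not part of the original file):

```
gen_docker_image("pytorch-linux-xenial-py3.7-gcc5.4")
# ('308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc5.4',
#  'docker-pytorch-linux-xenial-py3.7-gcc5.4')

gen_mobile_docker("asan")
# ('308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan',
#  'docker-pytorch-linux-xenial-py3-clang5-asan')
```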

@@ -1,36 +0,0 @@
from typing import Optional
class MultiPartVersion:
def __init__(self, parts, prefix=""):
self.parts = parts
self.prefix = prefix
def prefixed_parts(self):
"""
Prepends the first element of the version list
with the prefix string.
"""
if self.parts:
return [self.prefix + str(self.parts[0])] + [
str(part) for part in self.parts[1:]
]
else:
return [self.prefix]
def render_dots_or_parts(self, sep: Optional[str] = None):
if sep is None:
return self.prefixed_parts()
else:
return [sep.join(self.prefixed_parts())]
class CudaVersion(MultiPartVersion):
def __init__(self, major, minor):
self.major = major
self.minor = minor
super().__init__([self.major, self.minor], "cuda")
def __str__(self):
return f"{self.major}.{self.minor}"

@@ -1,111 +0,0 @@
from dataclasses import dataclass, field
from typing import Dict, Optional
def X(val):
"""
Compact way to write a leaf node
"""
return val, []
def XImportant(name):
"""Compact way to write an important (run on PRs) leaf node"""
return (name, [("important", [X(True)])])
@dataclass
class Ver:
"""
Represents a product with a version number
"""
name: str
version: str = ""
def __str__(self):
return self.name + self.version
@dataclass
class ConfigNode:
parent: Optional["ConfigNode"]
node_name: str
props: Dict[str, str] = field(default_factory=dict)
def get_label(self):
return self.node_name
# noinspection PyMethodMayBeStatic
def get_children(self):
return []
def get_parents(self):
return (
(self.parent.get_parents() + [self.parent.get_label()])
if self.parent
else []
)
def get_depth(self):
return len(self.get_parents())
def get_node_key(self):
return "%".join(self.get_parents() + [self.get_label()])
def find_prop(self, propname, searched=None):
"""
Checks whether its own dictionary has
the property; otherwise asks the parent node.
"""
if searched is None:
searched = []
searched.append(self.node_name)
if propname in self.props:
return self.props[propname]
elif self.parent:
return self.parent.find_prop(propname, searched)
else:
# raise Exception('Property "%s" does not exist anywhere in the tree! Searched: %s' % (propname, searched))
return None
def dfs_recurse(
node,
leaf_callback=lambda x: None,
discovery_callback=lambda x, y, z: None,
child_callback=lambda x, y: None,
sibling_index=0,
sibling_count=1,
):
discovery_callback(node, sibling_index, sibling_count)
node_children = node.get_children()
if node_children:
for i, child in enumerate(node_children):
child_callback(node, child)
dfs_recurse(
child,
leaf_callback,
discovery_callback,
child_callback,
i,
len(node_children),
)
else:
leaf_callback(node)
def dfs(toplevel_config_node):
config_list = []
def leaf_callback(node):
config_list.append(node)
dfs_recurse(toplevel_config_node, leaf_callback)
return config_list
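
A minimal, hypothetical two-node tree (not part of the original file) showing how find_prop walks up to the parent and how dfs collects the leaf configurations; it assumes the ConfigNode and dfs definitions above:

```
class DistroNode(ConfigNode):
    def __init__(self):
        super().__init__(None, "xenial")
        self.props["distro_name"] = "xenial"

    def get_children(self):
        return [PyVerNode(self)]


class PyVerNode(ConfigNode):
    def __init__(self, parent):
        super().__init__(parent, "3.7")
        self.props["pyver"] = "3.7"


leaves = dfs(DistroNode())
assert leaves[0].find_prop("distro_name") == "xenial"  # inherited from the parent node
assert leaves[0].get_node_key() == "xenial%3.7"
```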

@@ -1,10 +0,0 @@
def quote(s):
return sandwich('"', s)
def sandwich(bread, jam):
return bread + jam + bread
def override(word, substitutions):
return substitutions.get(word, word)
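
For reference, a few illustrative calls (not from the original file):

```
quote("docker-foo")                 # '"docker-foo"'
sandwich("*", "emphasis")           # '*emphasis*'
override("gcc", {"gcc": "gcc7"})    # 'gcc7'
override("clang", {"gcc": "gcc7"})  # 'clang' (no substitution, returned unchanged)
```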

@@ -1,51 +0,0 @@
from collections import OrderedDict
import cimodel.lib.miniutils as miniutils
LIST_MARKER = "- "
INDENTATION_WIDTH = 2
def is_dict(data):
return type(data) in [dict, OrderedDict]
def is_collection(data):
return is_dict(data) or type(data) is list
def render(fh, data, depth, is_list_member=False):
"""
PyYAML does not allow precise control over the quoting
behavior, especially for merge references.
Therefore, we use this custom YAML renderer.
"""
indentation = " " * INDENTATION_WIDTH * depth
if is_dict(data):
tuples = list(data.items())
if type(data) is not OrderedDict:
tuples.sort()
for i, (k, v) in enumerate(tuples):
if not v:
continue
# If this dict is itself a list member, the first key gets prefixed with a list marker
list_marker_prefix = LIST_MARKER if is_list_member and not i else ""
trailing_whitespace = "\n" if is_collection(v) else " "
fh.write(indentation + list_marker_prefix + k + ":" + trailing_whitespace)
render(fh, v, depth + 1 + int(is_list_member))
elif type(data) is list:
for v in data:
render(fh, v, depth, True)
else:
# use empty quotes to denote an empty string value instead of blank space
modified_data = miniutils.quote(data) if data == "" else data
list_member_prefix = indentation + LIST_MARKER if is_list_member else ""
fh.write(list_member_prefix + str(modified_data) + "\n")
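
A small sketch of the output format, assuming render() above is importable; the input data is hypothetical:

```
import sys
from collections import OrderedDict

data = OrderedDict(
    [("workflows", OrderedDict([("build", OrderedDict([("jobs", ["pytorch_linux_build"])]))]))]
)
render(sys.stdout, data, 0)
# workflows:
#   build:
#     jobs:
#       - pytorch_linux_build
```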

.circleci/config.yml (generated, 1386 changed lines): diff suppressed because it is too large.

@@ -1,41 +0,0 @@
#!/usr/bin/env python3
import os
import subprocess
import sys
import tempfile
import generate_config_yml
CHECKED_IN_FILE = "config.yml"
REGENERATION_SCRIPT = "regenerate.sh"
PARENT_DIR = os.path.basename(os.path.dirname(os.path.abspath(__file__)))
README_PATH = os.path.join(PARENT_DIR, "README.md")
ERROR_MESSAGE_TEMPLATE = """
The checked-in CircleCI "%s" file does not match what was generated by the scripts.
Please re-run the "%s" script in the "%s" directory and commit the result. See "%s" for more information.
"""
def check_consistency():
_, temp_filename = tempfile.mkstemp("-generated-config.yml")
with open(temp_filename, "w") as fh:
generate_config_yml.stitch_sources(fh)
try:
subprocess.check_call(["cmp", temp_filename, CHECKED_IN_FILE])
except subprocess.CalledProcessError:
sys.exit(
ERROR_MESSAGE_TEMPLATE
% (CHECKED_IN_FILE, REGENERATION_SCRIPT, PARENT_DIR, README_PATH)
)
finally:
os.remove(temp_filename)
if __name__ == "__main__":
check_consistency()

@@ -1,196 +0,0 @@
#!/usr/bin/env python3
"""
This script is the source of truth for config.yml.
Please see README.md in this directory for details.
"""
import os
import shutil
import sys
from collections import namedtuple
import cimodel.data.simple.docker_definitions
import cimodel.data.simple.mobile_definitions
import cimodel.data.simple.nightly_ios
import cimodel.lib.miniutils as miniutils
import cimodel.lib.miniyaml as miniyaml
class File:
"""
Verbatim copy the contents of a file into config.yml
"""
def __init__(self, filename):
self.filename = filename
def write(self, output_filehandle):
with open(os.path.join("verbatim-sources", self.filename)) as fh:
shutil.copyfileobj(fh, output_filehandle)
class FunctionGen(namedtuple("FunctionGen", "function depth")):
__slots__ = ()
class Treegen(FunctionGen):
"""
Insert the content of a YAML tree into config.yml
"""
def write(self, output_filehandle):
miniyaml.render(output_filehandle, self.function(), self.depth)
class Listgen(FunctionGen):
"""
Insert the content of a YAML list into config.yml
"""
def write(self, output_filehandle):
miniyaml.render(output_filehandle, self.function(), self.depth)
def horizontal_rule():
return "".join("#" * 78)
class Header:
def __init__(self, title, summary=None):
self.title = title
self.summary_lines = summary or []
def write(self, output_filehandle):
text_lines = [self.title] + self.summary_lines
comment_lines = ["# " + x for x in text_lines]
lines = miniutils.sandwich([horizontal_rule()], comment_lines)
for line in filter(None, lines):
output_filehandle.write(line + "\n")
def _for_all_items(items, functor) -> None:
if isinstance(items, list):
for item in items:
_for_all_items(item, functor)
if isinstance(items, dict) and len(items) == 1:
item_type, item = next(iter(items.items()))
functor(item_type, item)
def filter_master_only_jobs(items):
def _is_main_or_master_item(item):
filters = item.get("filters", None)
branches = filters.get("branches", None) if filters is not None else None
branches_only = branches.get("only", None) if branches is not None else None
return (
("main" in branches_only or "master" in branches_only)
if branches_only is not None
else False
)
master_deps = set()
def _save_requires_if_master(item_type, item):
requires = item.get("requires", None)
item_name = item.get("name", None)
if not isinstance(requires, list):
return
if _is_main_or_master_item(item) or item_name in master_deps:
master_deps.update([n.strip('"') for n in requires])
def _do_filtering(items):
if isinstance(items, list):
rc = [_do_filtering(item) for item in items]
return [item for item in rc if len(item if item is not None else []) > 0]
assert isinstance(items, dict) and len(items) == 1
item_type, item = next(iter(items.items()))
item_name = item.get("name", None)
item_name = item_name.strip('"') if item_name is not None else None
if not _is_main_or_master_item(item) and item_name not in master_deps:
return None
if "filters" in item:
item = item.copy()
item.pop("filters")
return {item_type: item}
# Scan the dependencies twice to pick up nested required jobs,
# i.e. jobs depending on jobs that main-only jobs depend on
_for_all_items(items, _save_requires_if_master)
_for_all_items(items, _save_requires_if_master)
return _do_filtering(items)
def generate_required_docker_images(items):
required_docker_images = set()
def _requires_docker_image(item_type, item):
requires = item.get("requires", None)
if not isinstance(requires, list):
return
for requirement in requires:
requirement = requirement.replace('"', "")
if requirement.startswith("docker-"):
required_docker_images.add(requirement)
_for_all_items(items, _requires_docker_image)
return required_docker_images
def gen_build_workflows_tree():
build_workflows_functions = [
cimodel.data.simple.mobile_definitions.get_workflow_jobs,
cimodel.data.simple.nightly_ios.get_workflow_jobs,
]
build_jobs = [f() for f in build_workflows_functions]
build_jobs.extend(
cimodel.data.simple.docker_definitions.get_workflow_jobs(
# sort for consistency
sorted(generate_required_docker_images(build_jobs))
)
)
master_build_jobs = filter_master_only_jobs(build_jobs)
rc = {
"workflows": {
"build": {
"when": r"<< pipeline.parameters.run_build >>",
"jobs": build_jobs,
},
}
}
if len(master_build_jobs) > 0:
rc["workflows"]["master_build"] = {
"when": r"<< pipeline.parameters.run_master_build >>",
"jobs": master_build_jobs,
}
return rc
# Order of this list matters to the generated config.yml.
YAML_SOURCES = [
File("header-section.yml"),
File("commands.yml"),
File("nightly-binary-build-defaults.yml"),
Header("Build parameters"),
File("build-parameters/pytorch-build-params.yml"),
File("build-parameters/binary-build-params.yml"),
Header("Job specs"),
File("job-specs/binary-job-specs.yml"),
File("job-specs/job-specs-custom.yml"),
File("job-specs/binary_update_htmls.yml"),
File("job-specs/binary-build-tests.yml"),
File("job-specs/docker_jobs.yml"),
Header("Workflows"),
Treegen(gen_build_workflows_tree, 0),
]
def stitch_sources(output_filehandle):
for f in YAML_SOURCES:
f.write(output_filehandle)
if __name__ == "__main__":
stitch_sources(sys.stdout)
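
To illustrate filter_master_only_jobs above, a tiny hypothetical job list; only the job restricted to master survives, and its filters key is stripped:

```
jobs = [
    {"pytorch_linux_build": {"name": "master_only_build",
                             "filters": {"branches": {"only": ["master"]}}}},
    {"docker_build_job": {"name": "docker-image", "requires": []}},
]
filter_master_only_jobs(jobs)
# [{'pytorch_linux_build': {'name': 'master_only_build'}}]
```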

@@ -1,5 +0,0 @@
cd $PSScriptRoot;
$NewFile = New-TemporaryFile;
python generate_config_yml.py > $NewFile.name
(Get-Content $NewFile.name -Raw).TrimEnd().Replace("`r`n","`n") | Set-Content config.yml -Force
Remove-Item $NewFile.name

@@ -1,17 +0,0 @@
#!/bin/bash -e
# Allows this script to be invoked from any directory:
cd "$(dirname "$0")"
UNCOMMIT_CHANGE=$(git status -s | grep " config.yml" | wc -l | xargs)
if [[ $UNCOMMIT_CHANGE != 0 ]]; then
OLD_FILE=$(mktemp)
cp config.yml "$OLD_FILE"
echo "Uncommitted change detected in .circleci/config.yml"
echo "It has been backed up to $OLD_FILE"
fi
NEW_FILE=$(mktemp)
./generate_config_yml.py > "$NEW_FILE"
cp "$NEW_FILE" config.yml
echo "New config generated in .circleci/config.yml"

@@ -1,69 +0,0 @@
#!/bin/bash
set -eux -o pipefail
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# This step runs on multiple executors with different envfile locations
if [[ "$(uname)" == Darwin ]]; then
# macos executor (builds and tests)
workdir="/Users/distiller/project"
elif [[ "$OSTYPE" == "msys" ]]; then
# windows executor (builds and tests)
rm -rf /c/w
ln -s "/c/Users/circleci/project" /c/w
workdir="/c/w"
elif [[ -d "/home/circleci/project" ]]; then
# machine executor (binary tests)
workdir="/home/circleci/project"
else
# docker executor (binary builds)
workdir="/"
fi
# It is very important that this stays in sync with binary_populate_env.sh
if [[ "$OSTYPE" == "msys" ]]; then
# We need to make the paths as short as possible on Windows
export PYTORCH_ROOT="$workdir/p"
export BUILDER_ROOT="$workdir/b"
else
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
fi
# Try to extract PR number from branch if not already set
if [[ -z "${CIRCLE_PR_NUMBER:-}" ]]; then
CIRCLE_PR_NUMBER="$(echo ${CIRCLE_BRANCH} | sed -E -n 's/pull\/([0-9]*).*/\1/p')"
fi
# Clone the Pytorch branch
retry git clone https://github.com/pytorch/pytorch.git "$PYTORCH_ROOT"
pushd "$PYTORCH_ROOT"
if [[ -n "${CIRCLE_PR_NUMBER:-}" ]]; then
# "smoke" binary build on PRs
git fetch --force origin "pull/${CIRCLE_PR_NUMBER}/head:remotes/origin/pull/${CIRCLE_PR_NUMBER}"
git reset --hard "$CIRCLE_SHA1"
git checkout -q -B "$CIRCLE_BRANCH"
git reset --hard "$CIRCLE_SHA1"
elif [[ -n "${CIRCLE_SHA1:-}" ]]; then
# Scheduled workflows & "smoke" binary build on trunk on PR merges
DEFAULT_BRANCH="$(git remote show $CIRCLE_REPOSITORY_URL | awk '/HEAD branch/ {print $NF}')"
git reset --hard "$CIRCLE_SHA1"
git checkout -q -B $DEFAULT_BRANCH
else
echo "Can't tell what to checkout"
exit 1
fi
retry git submodule update --init --recursive
echo "Using Pytorch from "
git --no-pager log --max-count 1
popd
# Clone the Builder main repo
retry git clone -q https://github.com/pytorch/builder.git "$BUILDER_ROOT"
pushd "$BUILDER_ROOT"
echo "Using builder from "
git --no-pager log --max-count 1
popd

@@ -1,44 +0,0 @@
#!/bin/bash
set -eux -o pipefail
# This step runs on multiple executors with different envfile locations
if [[ "$(uname)" == Darwin ]]; then
envfile="/Users/distiller/project/env"
elif [[ -d "/home/circleci/project" ]]; then
# machine executor (binary tests)
envfile="/home/circleci/project/env"
else
# docker executor (binary builds)
envfile="/env"
fi
# TODO this is super hacky and ugly. Basically, the binary_update_html job does
# not have an env file, since it does not call binary_populate_env.sh, since it
# does not have a BUILD_ENVIRONMENT. So for this one case, which we detect by a
# lack of an env file, we manually export the environment variables that we
# need to install miniconda
if [[ ! -f "$envfile" ]]; then
MINICONDA_ROOT="/home/circleci/project/miniconda"
workdir="/home/circleci/project"
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
export -f retry
else
source "$envfile"
fi
conda_sh="$workdir/install_miniconda.sh"
if [[ "$(uname)" == Darwin ]]; then
curl --retry 3 --retry-all-errors -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-MacOSX-x86_64.sh
else
curl --retry 3 --retry-all-errors -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
fi
chmod +x "$conda_sh"
"$conda_sh" -b -p "$MINICONDA_ROOT"
rm -f "$conda_sh"
# We can't actually add miniconda to the PATH in the envfile, because that
# breaks 'unbuffer' in Mac jobs. This is probably because conda comes with
# a tclsh, which then gets inserted before the tclsh needed in /usr/bin

@@ -4,10 +4,6 @@ set -eux -o pipefail
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR"
if [[ -z "${GITHUB_ACTIONS:-}" ]]; then
export PATH="${workdir:-${HOME}}/miniconda/bin:${PATH}"
fi
# Build
export USE_PYTORCH_METAL_EXPORT=1
export USE_COREML_DELEGATE=1

@@ -3,17 +3,9 @@ set -eux -o pipefail
export TZ=UTC
tagged_version() {
# Grabs version from either the env variable CIRCLE_TAG
# or the pytorch git described version
if [[ "$OSTYPE" == "msys" && -z "${GITHUB_ACTIONS:-}" ]]; then
GIT_DIR="${workdir}/p/.git"
else
GIT_DIR="${workdir}/pytorch/.git"
fi
GIT_DIR="${workdir}/pytorch/.git"
GIT_DESCRIBE="git --git-dir ${GIT_DIR} describe --tags --match v[0-9]*.[0-9]*.[0-9]*"
if [[ -n "${CIRCLE_TAG:-}" ]]; then
echo "${CIRCLE_TAG}"
elif [[ ! -d "${GIT_DIR}" ]]; then
if [[ ! -d "${GIT_DIR}" ]]; then
echo "Abort, abort! Git dir ${GIT_DIR} does not exists!"
kill $$
elif ${GIT_DESCRIBE} --exact >/dev/null; then
@@ -58,8 +50,8 @@ fi
PIP_UPLOAD_FOLDER='nightly/'
# We put this here so that OVERRIDE_PACKAGE_VERSION below can read from it
export DATE="$(date -u +%Y%m%d)"
#TODO: We should be pulling semver version from the base version.txt
BASE_BUILD_VERSION="2.2.0.dev$DATE"
BASE_BUILD_VERSION="$(cat ${PYTORCH_ROOT}/version.txt|cut -da -f1).dev${DATE}"
# Change BASE_BUILD_VERSION to git tag when on a git tag
# Use 'git -C' to make doubly sure we're in the correct directory for checking
# the git tag
@@ -79,6 +71,35 @@ fi
export PYTORCH_BUILD_NUMBER=1
# Set triton version as part of PYTORCH_EXTRA_INSTALL_REQUIREMENTS
TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for the all the wheel builds hence append TRITON_CONSTRAINT
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
# Only linux Python < 3.12 are supported wheels for triton
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.12'"
TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)
TRITON_REQUIREMENT="pytorch-triton==${TRITON_VERSION}+${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"
fi
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS} | ${TRITON_REQUIREMENT}"
fi
# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton rocm package
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*rocm.* && $(uname) == "Linux" && "$DESIRED_PYTHON" != "3.12" ]]; then
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-rocm.txt)
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}"
fi
if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"
else
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS} | ${TRITON_REQUIREMENT}"
fi
fi
JAVA_HOME=
BUILD_JNI=OFF
if [[ "$PACKAGE_TYPE" == libtorch ]]; then
@@ -124,12 +145,13 @@ if [[ "${OSTYPE}" == "msys" ]]; then
else
export DESIRED_DEVTOOLSET="${DESIRED_DEVTOOLSET:-}"
fi
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}"
export DATE="$DATE"
export NIGHTLIES_DATE_PREAMBLE=1.14.0.dev
export PYTORCH_BUILD_VERSION="$PYTORCH_BUILD_VERSION"
export PYTORCH_BUILD_NUMBER="$PYTORCH_BUILD_NUMBER"
export OVERRIDE_PACKAGE_VERSION="$PYTORCH_BUILD_VERSION"
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}"
# TODO: We don't need this anymore IIUC
export TORCH_PACKAGE_NAME='torch'
@@ -162,28 +184,6 @@ if [[ "$(uname)" != Darwin ]]; then
EOL
fi
if [[ -z "${GITHUB_ACTIONS:-}" ]]; then
cat >>"$envfile" <<EOL
export workdir="$workdir"
export MAC_PACKAGE_WORK_DIR="$workdir"
if [[ "$OSTYPE" == "msys" ]]; then
export PYTORCH_ROOT="$workdir/p"
export BUILDER_ROOT="$workdir/b"
else
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
fi
export MINICONDA_ROOT="$workdir/miniconda"
export PYTORCH_FINAL_PACKAGE_DIR="$workdir/final_pkgs"
export CIRCLE_TAG="${CIRCLE_TAG:-}"
export CIRCLE_SHA1="$CIRCLE_SHA1"
export CIRCLE_PR_NUMBER="${CIRCLE_PR_NUMBER:-}"
export CIRCLE_BRANCH="$CIRCLE_BRANCH"
export CIRCLE_WORKFLOW_ID="$CIRCLE_WORKFLOW_ID"
EOL
fi
echo 'retry () {' >> "$envfile"
echo ' $* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)' >> "$envfile"
echo '}' >> "$envfile"

@@ -1,29 +0,0 @@
#!/bin/bash
# This section is used in the binary_test and smoke_test jobs. It expects
# 'binary_populate_env' to have populated /home/circleci/project/env and it
# expects another section to populate /home/circleci/project/ci_test_script.sh
# with the code to run in the docker
# Expect all needed environment variables to be written to this file
source /home/circleci/project/env
echo "Running the following code in Docker"
cat /home/circleci/project/ci_test_script.sh
echo
echo
set -eux -o pipefail
# Expect actual code to be written to this file
chmod +x /home/circleci/project/ci_test_script.sh
VOLUME_MOUNTS="-v /home/circleci/project/:/circleci_stuff -v /home/circleci/project/final_pkgs:/final_pkgs -v ${PYTORCH_ROOT}:/pytorch -v ${BUILDER_ROOT}:/builder"
# Run the docker
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus all ${VOLUME_MOUNTS} -t -d "${DOCKER_IMAGE}")
else
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined ${VOLUME_MOUNTS} -t -d "${DOCKER_IMAGE}")
fi
# Execute the test script that was populated by an earlier section
export COMMAND='((echo "source /circleci_stuff/env && /circleci_stuff/ci_test_script.sh") | docker exec -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts

@@ -1,111 +0,0 @@
#!/usr/bin/env bash
set -ex -o pipefail
# Remove unnecessary sources
sudo rm -f /etc/apt/sources.list.d/google-chrome.list
sudo rm -f /etc/apt/heroku.list
sudo rm -f /etc/apt/openjdk-r-ubuntu-ppa-xenial.list
sudo rm -f /etc/apt/partner.list
# To increase the network reliability, let apt decide which mirror is best to use
sudo sed -i -e 's/http:\/\/.*archive/mirror:\/\/mirrors/' -e 's/\/ubuntu\//\/mirrors.txt/' /etc/apt/sources.list
retry () {
$* || $* || $* || $* || $*
}
# Method adapted from here: https://askubuntu.com/questions/875213/apt-get-to-retry-downloading
# (with use of tee to avoid permissions problems)
# This is better than retrying the whole apt-get command
echo "APT::Acquire::Retries \"3\";" | sudo tee /etc/apt/apt.conf.d/80-retries
retry sudo apt-get update -qq
retry sudo apt-get -y install \
moreutils \
expect-dev
echo "== DOCKER VERSION =="
docker version
if ! command -v aws >/dev/null; then
retry sudo pip3 -q install awscli==1.19.64
fi
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
DRIVER_FN="NVIDIA-Linux-x86_64-515.76.run"
wget "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"
sudo /bin/bash "$DRIVER_FN" -s --no-drm || (sudo cat /var/log/nvidia-installer.log && false)
nvidia-smi
# Taken directly from https://github.com/NVIDIA/nvidia-docker
# Add the package repositories
distribution=$(. /etc/os-release;echo "$ID$VERSION_ID")
curl -s -L --retry 3 --retry-all-errors https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L --retry 3 --retry-all-errors "https://nvidia.github.io/nvidia-docker/${distribution}/nvidia-docker.list" | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
retry sudo apt-get update -qq
# Necessary to get the `--gpus` flag to function within docker
retry sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
else
# Explicitly remove nvidia docker apt repositories if not building for cuda
sudo rm -rf /etc/apt/sources.list.d/nvidia-docker.list
fi
add_to_env_file() {
local name=$1
local value=$2
case "$value" in
*\ *)
# BASH_ENV should be set by CircleCI
echo "${name}='${value}'" >> "${BASH_ENV:-/tmp/env}"
;;
*)
echo "${name}=${value}" >> "${BASH_ENV:-/tmp/env}"
;;
esac
}
add_to_env_file CI_MASTER "${CI_MASTER:-}"
add_to_env_file COMMIT_SOURCE "${CIRCLE_BRANCH:-}"
add_to_env_file BUILD_ENVIRONMENT "${BUILD_ENVIRONMENT}"
add_to_env_file CIRCLE_PULL_REQUEST "${CIRCLE_PULL_REQUEST}"
if [[ "${BUILD_ENVIRONMENT}" == *-build ]]; then
add_to_env_file SCCACHE_BUCKET ossci-compiler-cache-circleci-v2
SCCACHE_MAX_JOBS=$(( $(nproc) - 1 ))
MEMORY_LIMIT_MAX_JOBS=8 # the "large" resource class on CircleCI has 32 CPU cores; if we use all of them we'll OOM
MAX_JOBS=$(( ${SCCACHE_MAX_JOBS} > ${MEMORY_LIMIT_MAX_JOBS} ? ${MEMORY_LIMIT_MAX_JOBS} : ${SCCACHE_MAX_JOBS} ))
add_to_env_file MAX_JOBS "${MAX_JOBS}"
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
add_to_env_file TORCH_CUDA_ARCH_LIST 5.2
fi
if [[ "${BUILD_ENVIRONMENT}" == *xla* ]]; then
# This IAM user allows write access to S3 bucket for sccache & bazels3cache
set +x
add_to_env_file XLA_CLANG_CACHE_S3_BUCKET_NAME "${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}"
add_to_env_file AWS_ACCESS_KEY_ID "${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}"
add_to_env_file AWS_SECRET_ACCESS_KEY "${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}"
set -x
else
# This IAM user allows write access to S3 bucket for sccache
set +x
add_to_env_file XLA_CLANG_CACHE_S3_BUCKET_NAME "${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}"
add_to_env_file AWS_ACCESS_KEY_ID "${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}"
add_to_env_file AWS_SECRET_ACCESS_KEY "${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}"
set -x
fi
fi
# This IAM user only allows read-write access to ECR
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_ECR_READ_WRITE_V4:-}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_ECR_READ_WRITE_V4:-}
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\")
export AWS_REGION=us-east-1
aws ecr get-login-password --region $AWS_REGION|docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
set -x

@@ -1,50 +0,0 @@
#!/usr/bin/env bash
set -eux -o pipefail
# Set up CircleCI GPG keys for apt, if needed
curl --retry 3 --retry-all-errors -s -L https://packagecloud.io/circleci/trusty/gpgkey | sudo apt-key add -
# Stop background apt updates. Hypothetically, the kill should not
# be necessary, because stop is supposed to send a kill signal to
# the process, but we've added it for good luck. Also
# hypothetically, it's supposed to be unnecessary to wait for
# the process to block. We also have that line for good luck.
# If you like, try deleting them and seeing if it works.
sudo systemctl stop apt-daily.service || true
sudo systemctl kill --kill-who=all apt-daily.service || true
sudo systemctl stop unattended-upgrades.service || true
sudo systemctl kill --kill-who=all unattended-upgrades.service || true
# wait until `apt-get update` has been killed
while systemctl is-active --quiet apt-daily.service
do
sleep 1;
done
while systemctl is-active --quiet unattended-upgrades.service
do
sleep 1;
done
# See if we actually were successful
systemctl list-units --all | cat
# For good luck, try even harder to kill apt-get
sudo pkill apt-get || true
# For even better luck, purge unattended-upgrades
sudo apt-get purge -y unattended-upgrades || true
cat /etc/apt/sources.list
# For the bestest luck, kill again now
sudo pkill apt || true
sudo pkill dpkg || true
# Try to detect if apt/dpkg is stuck
if ps auxfww | grep '[a]pt'; then
echo "WARNING: There are leftover apt processes; subsequent apt update will likely fail"
fi
if ps auxfww | grep '[d]pkg'; then
echo "WARNING: There are leftover dpkg processes; subsequent apt update will likely fail"
fi

@@ -1,65 +0,0 @@
binary_linux_build_params: &binary_linux_build_params
parameters:
build_environment:
type: string
default: ""
docker_image:
type: string
default: ""
libtorch_variant:
type: string
default: ""
resource_class:
type: string
default: "2xlarge+"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
LIBTORCH_VARIANT: << parameters.libtorch_variant >>
ANACONDA_USER: pytorch
resource_class: << parameters.resource_class >>
docker:
- image: << parameters.docker_image >>
binary_linux_test_upload_params: &binary_linux_test_upload_params
parameters:
build_environment:
type: string
default: ""
docker_image:
type: string
default: ""
libtorch_variant:
type: string
default: ""
resource_class:
type: string
default: "medium"
use_cuda_docker_runtime:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
DOCKER_IMAGE: << parameters.docker_image >>
USE_CUDA_DOCKER_RUNTIME: << parameters.use_cuda_docker_runtime >>
LIBTORCH_VARIANT: << parameters.libtorch_variant >>
resource_class: << parameters.resource_class >>
binary_mac_params: &binary_mac_params
parameters:
build_environment:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
binary_windows_params: &binary_windows_params
parameters:
build_environment:
type: string
default: ""
executor:
type: string
default: "windows-xlarge-cpu-with-nvidia-cuda"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
JOB_EXECUTOR: <<parameters.executor>>

@@ -1,105 +0,0 @@
pytorch_params: &pytorch_params
parameters:
build_environment:
type: string
default: ""
docker_image:
type: string
default: ""
resource_class:
type: string
default: "large"
use_cuda_docker_runtime:
type: string
default: ""
build_only:
type: string
default: ""
ci_master:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
DOCKER_IMAGE: << parameters.docker_image >>
USE_CUDA_DOCKER_RUNTIME: << parameters.use_cuda_docker_runtime >>
BUILD_ONLY: << parameters.build_only >>
CI_MASTER: << pipeline.parameters.run_master_build >>
resource_class: << parameters.resource_class >>
pytorch_ios_params: &pytorch_ios_params
parameters:
build_environment:
type: string
default: ""
ios_arch:
type: string
default: ""
ios_platform:
type: string
default: ""
op_list:
type: string
default: ""
use_metal:
type: string
default: "0"
lite_interpreter:
type: string
default: "1"
use_coreml:
type: string
default: "0"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
IOS_ARCH: << parameters.ios_arch >>
IOS_PLATFORM: << parameters.ios_platform >>
SELECTED_OP_LIST: << parameters.op_list >>
USE_PYTORCH_METAL: << parameters.use_metal >>
BUILD_LITE_INTERPRETER: << parameters.lite_interpreter >>
USE_COREML_DELEGATE: << parameters.use_coreml >>
pytorch_windows_params: &pytorch_windows_params
parameters:
executor:
type: string
default: "windows-xlarge-cpu-with-nvidia-cuda"
build_environment:
type: string
default: ""
test_name:
type: string
default: ""
cuda_version:
type: string
default: "10.1"
python_version:
type: string
default: "3.8"
vs_version:
type: string
default: "16.8.6"
vc_version:
type: string
default: "14.16"
vc_year:
type: string
default: "2019"
vc_product:
type: string
default: "BuildTools"
use_cuda:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: <<parameters.build_environment>>
SCCACHE_BUCKET: "ossci-compiler-cache"
CUDA_VERSION: <<parameters.cuda_version>>
PYTHON_VERSION: <<parameters.python_version>>
VS_VERSION: <<parameters.vs_version>>
VC_VERSION: <<parameters.vc_version>>
VC_YEAR: <<parameters.vc_year>>
VC_PRODUCT: <<parameters.vc_product>>
USE_CUDA: <<parameters.use_cuda>>
TORCH_CUDA_ARCH_LIST: "5.2 7.5"
JOB_BASE_NAME: <<parameters.test_name>>
JOB_EXECUTOR: <<parameters.executor>>

@@ -1,134 +0,0 @@
commands:
calculate_docker_image_tag:
description: "Calculates the docker image tag"
steps:
- run:
name: "Calculate docker image hash"
command: |
DOCKER_TAG=$(git rev-parse HEAD:.ci/docker)
echo "DOCKER_TAG=${DOCKER_TAG}" >> "${BASH_ENV}"
designate_upload_channel:
description: "inserts the correct upload channel into ${BASH_ENV}"
steps:
- run:
name: adding UPLOAD_CHANNEL to BASH_ENV
command: |
our_upload_channel=nightly
# On tags upload to test instead
if [[ -n "${CIRCLE_TAG}" ]]; then
our_upload_channel=test
fi
echo "export UPLOAD_CHANNEL=${our_upload_channel}" >> ${BASH_ENV}
# This system setup script is meant to run before the CI-related scripts, e.g.,
# installing Git client, checking out code, setting up CI env, and
# building/testing.
setup_linux_system_environment:
steps:
- run:
name: Set Up System Environment
no_output_timeout: "1h"
command: .circleci/scripts/setup_linux_system_environment.sh
setup_ci_environment:
steps:
- run:
name: Set Up CI Environment After attach_workspace
no_output_timeout: "1h"
command: .circleci/scripts/setup_ci_environment.sh
brew_update:
description: "Update Homebrew and install base formulae"
steps:
- run:
name: Update Homebrew
no_output_timeout: "10m"
command: |
set -ex
# Update repositories manually.
# Running `brew update` produces a comparison between the
# current checkout and the updated checkout, which takes a
# very long time because the existing checkout is 2y old.
for path in $(find /usr/local/Homebrew -type d -name .git)
do
cd $path/..
git fetch --depth=1 origin
git reset --hard origin/master
done
export HOMEBREW_NO_AUTO_UPDATE=1
# Install expect and moreutils so that we can call `unbuffer` and `ts`.
# moreutils installs a `parallel` executable by default, which conflicts
# with the executable from the GNU `parallel`, so we must unlink GNU
# `parallel` first, and relink it afterwards.
brew unlink parallel
brew install moreutils
brew link parallel --overwrite
brew install expect
brew_install:
description: "Install Homebrew formulae"
parameters:
formulae:
type: string
default: ""
steps:
- run:
name: Install << parameters.formulae >>
no_output_timeout: "10m"
command: |
set -ex
export HOMEBREW_NO_AUTO_UPDATE=1
brew install << parameters.formulae >>
run_brew_for_macos_build:
steps:
- brew_update
- brew_install:
formulae: libomp
run_brew_for_ios_build:
steps:
- brew_update
- brew_install:
formulae: libtool
optional_merge_target_branch:
steps:
- run:
name: (Optional) Merge target branch
no_output_timeout: "10m"
command: |
if [[ -n "$CIRCLE_PULL_REQUEST" && "$CIRCLE_BRANCH" != "nightly" ]]; then
PR_NUM=$(basename $CIRCLE_PULL_REQUEST)
CIRCLE_PR_BASE_BRANCH=$(curl -s https://api.github.com/repos/$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME/pulls/$PR_NUM | jq -r '.base.ref')
if [[ "${BUILD_ENVIRONMENT}" == *"xla"* || "${BUILD_ENVIRONMENT}" == *"gcc5"* ]] ; then
set -x
git config --global user.email "circleci.ossci@gmail.com"
git config --global user.name "CircleCI"
git config remote.origin.url https://github.com/pytorch/pytorch.git
git config --add remote.origin.fetch +refs/heads/master:refs/remotes/origin/master
git fetch --tags --progress https://github.com/pytorch/pytorch.git +refs/heads/master:refs/remotes/origin/master --depth=100 --quiet
# PRs generated from ghstack has format CIRCLE_PR_BASE_BRANCH=gh/xxx/1234/base
if [[ "${CIRCLE_PR_BASE_BRANCH}" == "gh/"* ]]; then
CIRCLE_PR_BASE_BRANCH=master
fi
export GIT_MERGE_TARGET=`git log -n 1 --pretty=format:"%H" origin/$CIRCLE_PR_BASE_BRANCH`
echo "GIT_MERGE_TARGET: " ${GIT_MERGE_TARGET}
export GIT_COMMIT=${CIRCLE_SHA1}
echo "GIT_COMMIT: " ${GIT_COMMIT}
git checkout -f ${GIT_COMMIT}
git reset --hard ${GIT_COMMIT}
git merge --allow-unrelated-histories --no-edit --no-ff ${GIT_MERGE_TARGET}
echo "Merged $CIRCLE_PR_BASE_BRANCH branch before building in environment $BUILD_ENVIRONMENT"
set +x
else
echo "No need to merge with $CIRCLE_PR_BASE_BRANCH, skipping..."
fi
else
echo "This is not a pull request, skipping..."
fi


@ -1,41 +0,0 @@
# WARNING: DO NOT EDIT THIS FILE DIRECTLY!!!
# See the README.md in this directory.
# IMPORTANT: To update Docker image version, please follow
# the instructions at
# https://github.com/pytorch/pytorch/wiki/Docker-image-build-on-CircleCI
version: 2.1
parameters:
run_binary_tests:
type: boolean
default: false
run_build:
type: boolean
default: true
run_master_build:
type: boolean
default: false
run_slow_gradcheck_build:
type: boolean
default: false
executors:
windows-with-nvidia-gpu:
machine:
resource_class: windows.gpu.nvidia.medium
image: windows-server-2019-nvidia:previous
shell: bash.exe
windows-xlarge-cpu-with-nvidia-cuda:
machine:
resource_class: windows.xlarge
image: windows-server-2019-vs2019:stable
shell: bash.exe
windows-medium-cpu-with-nvidia-cuda:
machine:
resource_class: windows.medium
image: windows-server-2019-vs2019:stable
shell: bash.exe


@ -1,14 +0,0 @@
# There is currently no testing for libtorch TODO
# binary_linux_libtorch_3.6m_cpu_test:
# environment:
# BUILD_ENVIRONMENT: "libtorch 3.6m cpu"
# resource_class: gpu.nvidia.small
# <<: *binary_linux_test
#
# binary_linux_libtorch_3.6m_cu90_test:
# environment:
# BUILD_ENVIRONMENT: "libtorch 3.6m cu90"
# resource_class: gpu.nvidia.small
# <<: *binary_linux_test
#


@ -1,44 +0,0 @@
jobs:
binary_ios_build:
<<: *pytorch_ios_params
macos:
xcode: "12.5.1"
steps:
- attach_workspace:
at: ~/workspace
- checkout
- run_brew_for_ios_build
- run:
name: Build
no_output_timeout: "1h"
command: |
script="/Users/distiller/project/.circleci/scripts/binary_ios_build.sh"
cat "$script"
source "$script"
- run:
name: Test
no_output_timeout: "30m"
command: |
script="/Users/distiller/project/.circleci/scripts/binary_ios_test.sh"
cat "$script"
source "$script"
- persist_to_workspace:
root: /Users/distiller/workspace/
paths: ios
binary_ios_upload:
<<: *pytorch_ios_params
macos:
xcode: "12.5.1"
steps:
- attach_workspace:
at: ~/workspace
- checkout
- run_brew_for_ios_build
- run:
name: Upload
no_output_timeout: "1h"
command: |
script="/Users/distiller/project/.circleci/scripts/binary_ios_upload.sh"
cat "$script"
source "$script"


@ -1,53 +0,0 @@
# update_s3_htmls job
# These jobs create html files for every cpu/cu## folder in s3. The html
# files just store the names of all the files in that folder (which are
# binary files (.whl files)). This is to allow pip installs of the latest
# version in a folder without having to know the latest date. Pip has a -f flag
# to which you can pass an html file listing a bunch of packages, and pip will
# then install the one with the most recent version.
update_s3_htmls: &update_s3_htmls
machine:
image: ubuntu-2004:202104-01
resource_class: medium
steps:
- checkout
- setup_linux_system_environment
- run:
<<: *binary_checkout
# N.B. we do not run binary_populate_env. The only variable we need is
# PIP_UPLOAD_FOLDER (which is 'nightly/' for the nightlies and '' for
# releases, and sometimes other things for special cases). Instead we
# expect PIP_UPLOAD_FOLDER to be passed directly in the env. This is
# because, unlike all the other binary jobs, these jobs only get run once,
# in a separate workflow. They are not a step in other binary jobs like
# build, test, upload.
#
# You could attach this to every job, or include it in the upload step if
# you wanted. You would need to add binary_populate_env in this case to
# make sure it has the same upload folder as the job it's attached to. This
# function is idempotent, so it won't hurt anything; it's just a little
# unnecessary
- run:
name: define PIP_UPLOAD_FOLDER
command: |
our_upload_folder=nightly/
# On tags upload to test instead
if [[ -n "${CIRCLE_TAG}" ]]; then
our_upload_folder=test/
fi
echo "export PIP_UPLOAD_FOLDER=${our_upload_folder}" >> ${BASH_ENV}
- run:
name: Update s3 htmls
no_output_timeout: "1h"
command: |
set +x
echo "declare -x \"AWS_ACCESS_KEY_ID=${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}\"" >> /home/circleci/project/env
echo "declare -x \"AWS_SECRET_ACCESS_KEY=${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}\"" >> /home/circleci/project/env
source /home/circleci/project/env
set -eux -o pipefail
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
retry pip install awscli==1.6
"/home/circleci/project/builder/cron/update_s3_htmls.sh"


@ -1,56 +0,0 @@
docker_build_job:
parameters:
image_name:
type: string
default: ""
machine:
image: ubuntu-2004:202104-01
resource_class: large
environment:
IMAGE_NAME: << parameters.image_name >>
# Enable 'docker manifest'
DOCKER_CLI_EXPERIMENTAL: "enabled"
DOCKER_BUILDKIT: 1
steps:
- checkout
- calculate_docker_image_tag
- run:
name: Check if image should be built
command: |
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\")
export AWS_REGION=us-east-1
aws ecr get-login-password --region $AWS_REGION|docker login --username AWS \
--password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
set -x
# Check if image already exists, if it does then skip building it
if docker manifest inspect "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${IMAGE_NAME}:${DOCKER_TAG}"; then
circleci-agent step halt
# circleci-agent step halt doesn't actually halt the step so we need to
# explicitly exit the step here ourselves before it causes too much trouble
exit 0
fi
# Covers the case where a previous tag doesn't exist for the tree
# this is only really applicable on trees that don't have `.ci/docker` at its merge base, i.e. nightly
if ! git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.ci/docker"; then
echo "Directory '.ci/docker' not found in tree << pipeline.git.base_revision >>, you should probably rebase onto a more recent commit"
exit 1
fi
PREVIOUS_DOCKER_TAG=$(git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.ci/docker")
# If no image exists but the hash is the same as the previous hash then we should error out here
if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then
echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
echo " contact the PyTorch team to restore the original images"
exit 1
fi
- run:
name: build_docker_image_<< parameters.image_name >>
no_output_timeout: "1h"
command: |
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
set -x
cd .ci/docker && ./build_docker.sh


@ -1,745 +0,0 @@
pytorch_doc_push:
resource_class: medium
machine:
image: ubuntu-2004:202104-01
parameters:
branch:
type: string
default: "main"
steps:
- attach_workspace:
at: /tmp/workspace
- run:
name: Generate netrc
command: |
# set credentials for https pushing
cat > ~/.netrc \<<DONE
machine github.com
login pytorchbot
password ${GITHUB_PYTORCHBOT_TOKEN}
DONE
- run:
name: Docs push
command: |
pushd /tmp/workspace
git push -u origin "<< parameters.branch >>"
pytorch_macos_10_15_py3_build:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.15-py3-arm64-build
macos:
xcode: "12.3.0"
steps:
- checkout
- run_brew_for_macos_build
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
export CROSS_COMPILE_ARM64=1
export JOB_BASE_NAME=$CIRCLE_JOB
# Install sccache
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2
# This IAM user allows write access to S3 bucket for sccache
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}
set -x
chmod a+x .ci/pytorch/macos-build.sh
unbuffer .ci/pytorch/macos-build.sh 2>&1 | ts
- persist_to_workspace:
root: /Users/distiller/workspace/
paths:
- miniconda3
- store_artifacts:
path: /Users/distiller/project/dist
pytorch_macos_10_13_py3_build:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-build
macos:
xcode: "12.0"
steps:
- checkout
- run_brew_for_macos_build
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
export JOB_BASE_NAME=$CIRCLE_JOB
# Install sccache
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2
# This IAM user allows write access to S3 bucket for sccache
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}
set -x
chmod a+x .ci/pytorch/macos-build.sh
unbuffer .ci/pytorch/macos-build.sh 2>&1 | ts
- persist_to_workspace:
root: /Users/distiller/workspace/
paths:
- miniconda3
mac_build:
parameters:
build-environment:
type: string
description: Top-level label for what's being built/tested.
xcode-version:
type: string
default: "13.3.1"
description: What xcode version to build with.
build-generates-artifacts:
type: boolean
default: true
description: if the build generates build artifacts
python-version:
type: string
default: "3.8"
macos:
xcode: << parameters.xcode-version >>
resource_class: medium
environment:
BUILD_ENVIRONMENT: << parameters.build-environment >>
AWS_REGION: us-east-1
steps:
- checkout
- run_brew_for_macos_build
- run:
name: Install sccache
command: |
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
echo "export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${BASH_ENV}"
echo "export SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${BASH_ENV}"
set +x
echo "export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}"
echo "export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}"
set -x
- run:
name: Get workflow job id
command: |
echo "export OUR_GITHUB_JOB_ID=${CIRCLE_WORKFLOW_JOB_ID}" >> "${BASH_ENV}"
- run:
name: Build
command: |
set -x
git submodule sync
git submodule update --init --recursive --depth 1 --jobs 0
export PATH="/usr/local/bin:$PATH"
export WORKSPACE_DIR="${HOME}/workspace"
mkdir -p "${WORKSPACE_DIR}"
MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-MacOSX-x86_64.sh"
if [ << parameters.python-version >> == 3.9.12 ]; then
MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-MacOSX-x86_64.sh"
fi
# If a local installation of conda doesn't exist, we download and install conda
if [ ! -d "${WORKSPACE_DIR}/miniconda3" ]; then
mkdir -p "${WORKSPACE_DIR}"
curl --retry 3 ${MINICONDA_URL} -o "${WORKSPACE_DIR}"/miniconda3.sh
bash "${WORKSPACE_DIR}"/miniconda3.sh -b -p "${WORKSPACE_DIR}"/miniconda3
fi
export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH"
# shellcheck disable=SC1091
source "${WORKSPACE_DIR}"/miniconda3/bin/activate
brew link --force libomp
echo "export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${BASH_ENV}"
.ci/pytorch/macos-build.sh
- when:
condition: << parameters.build-generates-artifacts >>
steps:
- run:
name: Archive artifacts into zip
command: |
zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .additional_ci_files
cp artifacts.zip /Users/distiller/workspace
- persist_to_workspace:
root: /Users/distiller/workspace/
paths:
- miniconda3
- artifacts.zip
- store_artifacts:
path: /Users/distiller/project/artifacts.zip
mac_test:
parameters:
build-environment:
type: string
shard-number:
type: string
num-test-shards:
type: string
xcode-version:
type: string
test-config:
type: string
default: 'default'
macos:
xcode: << parameters.xcode-version >>
environment:
GIT_DEFAULT_BRANCH: 'master'
BUILD_ENVIRONMENT: << parameters.build-environment >>
TEST_CONFIG: << parameters.test-config >>
SHARD_NUMBER: << parameters.shard-number >>
NUM_TEST_SHARDS: << parameters.num-test-shards >>
steps:
- checkout
- attach_workspace:
at: ~/workspace
- run_brew_for_macos_build
- run:
name: Test
no_output_timeout: "2h"
command: |
set -x
git submodule sync --recursive
git submodule update --init --recursive
mv ~/workspace/artifacts.zip .
unzip artifacts.zip
export IN_CI=1
COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}")
export PATH="/usr/local/bin:$PATH"
export WORKSPACE_DIR="${HOME}/workspace"
mkdir -p "${WORKSPACE_DIR}"
export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH"
source "${WORKSPACE_DIR}"/miniconda3/bin/activate
# sanitize the input commit message and PR body here:
# trim all new lines from commit messages to avoid issues with batch environment
# variable copying. see https://github.com/pytorch/pytorch/pull/80043#issuecomment-1167796028
COMMIT_MESSAGES="${COMMIT_MESSAGES//[$'\n\r']}"
# then trim all special characters like single and double quotes to avoid unescaped inputs to
# wreak havoc internally
export COMMIT_MESSAGES="${COMMIT_MESSAGES//[\'\"]}"
python3 -mpip install dist/*.whl
.ci/pytorch/macos-test.sh
- run:
name: Copy files for uploading test stats
command: |
# copy into a parent folder test-reports because we can't use CIRCLE_BUILD_NUM in path when persisting to workspace
mkdir -p test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports
cp -r test/test-reports test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports
- store_test_results:
path: test/test-reports
- persist_to_workspace:
root: /Users/distiller/project/
paths:
- test-reports
upload_test_stats:
machine: # executor type
image: ubuntu-2004:202010-01 # # recommended linux image - includes Ubuntu 20.04, docker 19.03.13, docker-compose 1.27.4
steps:
- checkout
- attach_workspace:
at: ~/workspace
- run:
name: upload
command: |
set -ex
if [ -z ${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} ]; then
echo "No credentials found, cannot upload test stats (are you on a fork?)"
exit 0
fi
cp -r ~/workspace/test-reports/* ~/project
pip3 install requests==2.26 rockset==1.0.3 boto3==1.19.12
export AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD}
export AWS_SECRET_ACCESS_KEY=${AWS_SECRET_KEY_FOR_OSSCI_ARTIFACT_UPLOAD}
# I don't know how to get the run attempt number for reruns, so default to 1
python3 -m tools.stats.upload_test_stats --workflow-run-id "${CIRCLE_WORKFLOW_JOB_ID}" --workflow-run-attempt 1 --head-branch << pipeline.git.branch >> --circleci
pytorch_macos_10_13_py3_test:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test
macos:
xcode: "12.0"
steps:
- checkout
- attach_workspace:
at: ~/workspace
- run_brew_for_macos_build
- run:
name: Test
no_output_timeout: "1h"
command: |
set -e
export JOB_BASE_NAME=$CIRCLE_JOB
chmod a+x .ci/pytorch/macos-test.sh
unbuffer .ci/pytorch/macos-test.sh 2>&1 | ts
- store_test_results:
path: test/test-reports
pytorch_macos_10_13_py3_lite_interpreter_build_test:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test
macos:
xcode: "12.0"
steps:
- checkout
- attach_workspace:
at: ~/workspace
- run_brew_for_macos_build
- run:
name: Test
no_output_timeout: "1h"
command: |
set -e
export BUILD_LITE_INTERPRETER=1
export JOB_BASE_NAME=$CIRCLE_JOB
chmod a+x ${HOME}/project/.ci/pytorch/macos-lite-interpreter-build-test.sh
unbuffer ${HOME}/project/.ci/pytorch/macos-lite-interpreter-build-test.sh 2>&1 | ts
- store_test_results:
path: test/test-reports
pytorch_android_gradle_build:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.7"
resource_class: large
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
name: pytorch android gradle build
no_output_timeout: "1h"
command: |
set -eux
docker_image_commit=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1}
docker_image_libtorch_android_x86_32=${docker_image_commit}-android-x86_32
docker_image_libtorch_android_x86_64=${docker_image_commit}-android-x86_64
docker_image_libtorch_android_arm_v7a=${docker_image_commit}-android-arm-v7a
docker_image_libtorch_android_arm_v8a=${docker_image_commit}-android-arm-v8a
echo "docker_image_commit: "${docker_image_commit}
echo "docker_image_libtorch_android_x86_32: "${docker_image_libtorch_android_x86_32}
echo "docker_image_libtorch_android_x86_64: "${docker_image_libtorch_android_x86_64}
echo "docker_image_libtorch_android_arm_v7a: "${docker_image_libtorch_android_arm_v7a}
echo "docker_image_libtorch_android_arm_v8a: "${docker_image_libtorch_android_arm_v8a}
# x86_32
time docker pull ${docker_image_libtorch_android_x86_32} >/dev/null
export id_x86_32=$(docker run --env-file "${BASH_ENV}" -e GRADLE_OFFLINE=1 --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32})
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# arm-v7a
time docker pull ${docker_image_libtorch_android_arm_v7a} >/dev/null
export id_arm_v7a=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_arm_v7a})
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v7a" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_install_arm_v7a
docker cp $id_arm_v7a:/var/lib/jenkins/workspace/build_android/install ~/workspace/build_android_install_arm_v7a
# x86_64
time docker pull ${docker_image_libtorch_android_x86_64} >/dev/null
export id_x86_64=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_64})
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_64" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_install_x86_64
docker cp $id_x86_64:/var/lib/jenkins/workspace/build_android/install ~/workspace/build_android_install_x86_64
# arm-v8a
time docker pull ${docker_image_libtorch_android_arm_v8a} >/dev/null
export id_arm_v8a=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_arm_v8a})
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v8a" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_install_arm_v8a
docker cp $id_arm_v8a:/var/lib/jenkins/workspace/build_android/install ~/workspace/build_android_install_arm_v8a
docker cp ~/workspace/build_android_install_arm_v7a $id_x86_32:/var/lib/jenkins/workspace/build_android_install_arm_v7a
docker cp ~/workspace/build_android_install_x86_64 $id_x86_32:/var/lib/jenkins/workspace/build_android_install_x86_64
docker cp ~/workspace/build_android_install_arm_v8a $id_x86_32:/var/lib/jenkins/workspace/build_android_install_arm_v8a
# run gradle buildRelease
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_artifacts
docker cp $id_x86_32:/var/lib/jenkins/workspace/android/artifacts.tgz ~/workspace/build_android_artifacts/
output_image=$docker_image_libtorch_android_x86_32-gradle
docker commit "$id_x86_32" ${output_image}
time docker push ${output_image}
- store_artifacts:
path: ~/workspace/build_android_artifacts/artifacts.tgz
destination: artifacts.tgz
pytorch_android_publish_snapshot:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-publish-snapshot
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.7"
resource_class: large
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
name: pytorch android gradle build
no_output_timeout: "1h"
command: |
set -eux
docker_image_commit=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1}
docker_image_libtorch_android_x86_32_gradle=${docker_image_commit}-android-x86_32-gradle
echo "docker_image_commit: "${docker_image_commit}
echo "docker_image_libtorch_android_x86_32_gradle: "${docker_image_libtorch_android_x86_32_gradle}
# x86_32
time docker pull ${docker_image_libtorch_android_x86_32_gradle} >/dev/null
export id_x86_32=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32_gradle})
export COMMAND='((echo "sudo chown -R jenkins workspace" && echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export SONATYPE_NEXUS_USERNAME=${SONATYPE_NEXUS_USERNAME}" && echo "export SONATYPE_NEXUS_PASSWORD=${SONATYPE_NEXUS_PASSWORD}" && echo "export ANDROID_SIGN_KEY=${ANDROID_SIGN_KEY}" && echo "export ANDROID_SIGN_PASS=${ANDROID_SIGN_PASS}" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/publish_android_snapshot.sh") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
output_image=${docker_image_libtorch_android_x86_32_gradle}-publish-snapshot
docker commit "$id_x86_32" ${output_image}
time docker push ${output_image}
pytorch_android_gradle_build-x86_32:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-only-x86_32
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.7"
resource_class: large
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- checkout
- setup_ci_environment
- run:
name: pytorch android gradle build only x86_32 (for PR)
no_output_timeout: "1h"
command: |
set -e
docker_image_libtorch_android_x86_32=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1}-android-x86_32
echo "docker_image_libtorch_android_x86_32: "${docker_image_libtorch_android_x86_32}
# x86
time docker pull ${docker_image_libtorch_android_x86_32} >/dev/null
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32})
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GRADLE_OFFLINE=1" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_x86_32_artifacts
docker cp $id:/var/lib/jenkins/workspace/android/artifacts.tgz ~/workspace/build_android_x86_32_artifacts/
output_image=${docker_image_libtorch_android_x86_32}-gradle
docker commit "$id" ${output_image}
time docker push ${output_image}
- store_artifacts:
path: ~/workspace/build_android_x86_32_artifacts/artifacts.tgz
destination: artifacts.tgz
pytorch_ios_build:
<<: *pytorch_ios_params
macos:
xcode: "12.5.1"
steps:
- run:
name: checkout with retry
command: |
checkout() {
set -ex
# Workaround old docker images with incorrect $HOME
# check https://github.com/docker/docker/issues/2968 for details
if [ "${HOME}" = "/" ]
then
export HOME=$(getent passwd $(id -un) | cut -d: -f6)
fi
mkdir -p ~/.ssh
echo 'github.com ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6TbQa+PXYPCPy6rbTrTtw7PHkccKrpp0yVhp5HdEIcKr6pLlVDBfOLX9QUsyCOV0wzfjIJNlGEYsdlLJizHhbn2mUjvSAHQqZETYP81eFzLQNnPHt4EVVUh7VfDESU84KezmD5QlWpXLmvU31/yMf+Se8xhHTvKSCZIFImWwoG6mbUoWf9nzpIoaSjB+weqqUUmpaaasXVal72J+UX2B+2RPW3RcT0eOzQgqlJL3RKrTJvdsjE3JEAvGq3lGHSZXy28G3skua2SmVi/w4yCE6gbODqnTWlg7+wC604ydGXA8VJiS5ap43JXiUFFAaQ==
' >> ~/.ssh/known_hosts
# use git+ssh instead of https
git config --global url."ssh://git@github.com".insteadOf "https://github.com" || true
git config --global gc.auto 0 || true
echo 'Cloning git repository'
mkdir -p '/Users/distiller/project'
cd '/Users/distiller/project'
git clone "$CIRCLE_REPOSITORY_URL" .
echo 'Checking out branch'
git checkout --force -B "$CIRCLE_BRANCH" "$CIRCLE_SHA1"
git --no-pager log --no-color -n 1 --format='HEAD is now at %h %s'
}
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
retry checkout
- run_brew_for_ios_build
- run:
name: Setup Fastlane
no_output_timeout: "1h"
command: |
set -e
PROJ_ROOT=/Users/distiller/project
cd ${PROJ_ROOT}/ios/TestApp
# install fastlane
sudo gem install bundler && bundle install
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
WORKSPACE=/Users/distiller/workspace
PROJ_ROOT=/Users/distiller/project
export TCLLIBPATH="/usr/local/lib"
# Install conda
curl --retry 3 -o ~/conda.sh https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-MacOSX-x86_64.sh
chmod +x ~/conda.sh
/bin/bash ~/conda.sh -b -p ~/anaconda
export PATH="~/anaconda/bin:${PATH}"
source ~/anaconda/bin/activate
# Install dependencies
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
retry conda install numpy ninja pyyaml mkl mkl-include setuptools cmake requests typing-extensions --yes
# sync submodules
cd ${PROJ_ROOT}
git submodule sync
git submodule update --init --recursive --depth 1 --jobs 0
# export
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
# run build script
chmod a+x ${PROJ_ROOT}/scripts/build_ios.sh
echo "IOS_ARCH: ${IOS_ARCH}"
echo "IOS_PLATFORM: ${IOS_PLATFORM}"
echo "USE_PYTORCH_METAL": "${USE_METAL}"
echo "BUILD_LITE_INTERPRETER": "${BUILD_LITE_INTERPRETER}"
echo "USE_COREML_DELEGATE": "${USE_COREML_DELEGATE}"
#check the custom build flag
echo "SELECTED_OP_LIST: ${SELECTED_OP_LIST}"
if [ -n "${SELECTED_OP_LIST}" ]; then
export SELECTED_OP_LIST="${PROJ_ROOT}/ios/TestApp/custom_build/${SELECTED_OP_LIST}"
fi
export IOS_ARCH=${IOS_ARCH}
export IOS_PLATFORM=${IOS_PLATFORM}
export USE_COREML_DELEGATE=${USE_COREML_DELEGATE}
if [ ${IOS_PLATFORM} != "SIMULATOR" ]; then
export USE_PYTORCH_METAL=${USE_METAL}
fi
unbuffer ${PROJ_ROOT}/scripts/build_ios.sh 2>&1 | ts
- run:
name: Run Build Test
no_output_timeout: "30m"
command: |
set -e
PROJ_ROOT=/Users/distiller/project
# run the ruby build script
if ! [ -x "$(command -v xcodebuild)" ]; then
echo 'Error: xcodebuild is not installed.'
exit 1
fi
ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM}
if ! [ "$?" -eq "0" ]; then
echo 'xcodebuild failed!'
exit 1
fi
- run:
name: Run Simulator Tests
no_output_timeout: "2h"
command: |
set -e
if [ ${IOS_PLATFORM} != "SIMULATOR" ]; then
echo "not SIMULATOR build, skip it."
exit 0
fi
WORKSPACE=/Users/distiller/workspace
PROJ_ROOT=/Users/distiller/project
source ~/anaconda/bin/activate
# use the pytorch nightly build to generate models
pip3 install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
# generate models for different backends
cd ${PROJ_ROOT}/ios/TestApp/benchmark
mkdir -p ../models
if [ ${USE_COREML_DELEGATE} == 1 ]; then
pip install coremltools==5.0b5 protobuf==3.20.1
python coreml_backend.py
else
cd "${PROJ_ROOT}"
python test/mobile/model_test/gen_test_model.py ios-test
fi
cd "${PROJ_ROOT}/ios/TestApp/benchmark"
if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then
echo "Setting up the TestApp for LiteInterpreter"
ruby setup.rb --lite 1
else
echo "Setting up the TestApp for Full JIT"
ruby setup.rb
fi
cd "${PROJ_ROOT}/ios/TestApp"
# instruments -s -devices
if [ "${BUILD_LITE_INTERPRETER}" == 1 ]; then
if [ "${USE_COREML_DELEGATE}" == 1 ]; then
fastlane scan --only_testing TestAppTests/TestAppTests/testCoreML
else
fastlane scan --only_testing TestAppTests/TestAppTests/testLiteInterpreter
fi
else
fastlane scan --only_testing TestAppTests/TestAppTests/testFullJIT
fi
pytorch_linux_bazel_build:
<<: *pytorch_params
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Bazel Build
no_output_timeout: "1h"
command: |
set -e
# Pull Docker image and run build
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}:${DOCKER_TAG}
time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG})
echo "Do NOT merge main branch into $CIRCLE_BRANCH in environment $BUILD_ENVIRONMENT"
git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0
docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .ci/pytorch/build.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Push intermediate Docker image for next phase to use
if [ -z "${BUILD_ONLY}" ]; then
# Augment our output image name with bazel to avoid collisions
output_image=${DOCKER_IMAGE}:build-${DOCKER_TAG}-bazel-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=$output_image
docker commit "$id" ${COMMIT_DOCKER_IMAGE}
time docker push ${COMMIT_DOCKER_IMAGE}
fi
pytorch_linux_bazel_test:
<<: *pytorch_params
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Test
no_output_timeout: "90m"
command: |
set -e
output_image=${DOCKER_IMAGE}:build-${DOCKER_TAG}-bazel-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=$output_image
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
if [ -n "${USE_CUDA_DOCKER_RUNTIME}" ]; then
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus all -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
else
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
fi
retrieve_test_reports() {
echo "retrieving test reports"
docker cp -L $id:/var/lib/jenkins/workspace/bazel-testlogs ./ || echo 'No test reports found!'
}
trap "retrieve_test_reports" ERR
if [[ ${BUILD_ENVIRONMENT} == *"multigpu"* ]]; then
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .ci/pytorch/multigpu-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
else
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .ci/pytorch/test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
fi
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
retrieve_test_reports
docker stats --all --no-stream
- store_test_results:
path: bazel-testlogs
pytorch_windows_test_multigpu:
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- run:
name: Test
no_output_timeout: "90m"
command: |
set -e
python3 -m pip install requests
python3 ./.circleci/scripts/trigger_azure_pipeline.py


@ -1,18 +0,0 @@
promote_s3:
<<: *promote_common
steps:
- checkout
- run:
name: Running promote script
command: |
scripts/release/promote/wheel_to_s3.sh
promote_conda:
<<: *promote_common
steps:
- checkout
- run:
name: Running promote script
command: |
scripts/release/promote/conda_to_conda.sh


@ -1,29 +0,0 @@
setup:
docker:
- image: circleci/python:3.7.3
steps:
- checkout
- run:
name: Save commit message
command: git log --format='%B' -n 1 HEAD > .circleci/scripts/COMMIT_MSG
# Note [Workspace for CircleCI scripts]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# In the beginning, you wrote your CI scripts in a
# .circleci/config.yml file, and life was good. Your CI
# configurations flourished and multiplied.
#
# Then one day, CircleCI cometh down high and say, "Your YAML file
# is too biggeth, it stresses our servers so." And thus they
# asketh us to smite the scripts in the yml file.
#
# But you can't just put the scripts in the .circleci folder,
# because in some jobs, you don't ever actually checkout the
# source repository. Where you gonna get the scripts from?
#
# Here's how you do it: you persist .circleci/scripts into a
# workspace, attach the workspace in your subjobs, and run all
# your scripts from there.
- persist_to_workspace:
root: .
paths: .circleci/scripts
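A minimal sketch of the other half of the pattern the note describes, assuming a hypothetical downstream job name; the attach_workspace step mirrors the ones used by other jobs in this config, and binary_checkout.sh stands in for any script persisted by the setup job.
example_subjob:  # hypothetical job, for illustration only
  steps:
    - attach_workspace:
        at: ~/workspace
    - run:
        name: Run a script persisted by the setup job
        command: ~/workspace/.circleci/scripts/binary_checkout.sh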


@ -1,51 +0,0 @@
##############################################################################
# Binary build (nightlies nightly build) defaults
# The binary builds use the docker executor b/c at time of writing the machine
# executor is limited to only two cores and is painfully slow (4.5+ hours per
# GPU build). But the docker executor cannot be run with --runtime=nvidia, and
# so the binary test/upload jobs must run on a machine executor. The package
# built in the build job is persisted to the workspace, which the test jobs
# expect. The test jobs just run a few quick smoke tests (very similar to the
# second-round-user-facing smoke tests above) and then upload the binaries to
# their final locations. The upload part requires credentials that should only
# be available to org-members.
#
# binary_checkout MUST be run before other commands here. This is because the
# other commands are written in .circleci/scripts/*.sh , so the pytorch source
# code must be downloaded on the machine before they can be run. We cannot
# inline all the code into this file, since that would cause the yaml size to
# explode past 4 MB (all the code in the command section is just copy-pasted
# everywhere in the .circleci/config.yml file where it appears).
##############################################################################
# Checks out the Pytorch and Builder repos (always both of them), and places
# them in the right place depending on what executor we're running on. We curl
# our .sh file from the interweb to avoid yaml size bloat. Note that many jobs
# do not need both the pytorch and builder repos, so this is a little wasteful
# (smoke tests and upload jobs do not need the pytorch repo).
binary_checkout: &binary_checkout
name: Checkout pytorch/builder repo
no_output_timeout: "30m"
command: .circleci/scripts/binary_checkout.sh
# Parses circleci arguments in a consistent way, essentially routing to the
# correct pythonXgccXcudaXos build we want
binary_populate_env: &binary_populate_env
name: Set up binary env variables
command: .circleci/scripts/binary_populate_env.sh
binary_install_miniconda: &binary_install_miniconda
name: Install miniconda
no_output_timeout: "1h"
command: .circleci/scripts/binary_install_miniconda.sh
# This section is used in the binary_test and smoke_test jobs. It expects
# 'binary_populate_env' to have populated /home/circleci/project/env and it
# expects another section to populate /home/circleci/project/ci_test_script.sh
# with the code to run in the docker
binary_run_in_docker: &binary_run_in_docker
name: Run in docker
# This step only runs on circleci linux machine executors that themselves
# need to start docker images
command: .circleci/scripts/binary_run_in_docker.sh
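As a hedged illustration of how these anchors compose into a job (the job name is hypothetical; the `<<: *binary_checkout` merge pattern matches the update_s3_htmls job shown earlier in this section):
binary_test_example:  # hypothetical job, for illustration only
  steps:
    - run:
        <<: *binary_checkout
    - run:
        <<: *binary_populate_env
    # a real binary test job also populates /home/circleci/project/ci_test_script.sh
    # before this step, as the comment above notes
    - run:
        <<: *binary_run_in_docker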


@ -1,8 +0,0 @@
#- binary_linux_libtorch_3.6m_cpu_test:
# requires:
# - binary_linux_libtorch_3.6m_cpu_build
#- binary_linux_libtorch_3.6m_cu90_test:
# requires:
# - binary_linux_libtorch_3.6m_cu90_build
# Nightly uploads


@ -42,7 +42,6 @@ misc-*,
-misc-non-private-member-variables-in-classes,
-misc-confusable-identifiers,
modernize-*,
-modernize-concat-nested-namespaces,
-modernize-macro-to-enum,
-modernize-return-braced-init-list,
-modernize-use-auto,
@ -52,6 +51,13 @@ modernize-*,
-modernize-use-nodiscard,
performance-*,
readability-container-size-empty,
readability-delete-null-pointer,
readability-duplicate-include,
readability-misplaced-array-index,
readability-redundant-function-ptr-dereference,
readability-redundant-smartptr-get,
readability-simplify-subscript-expr,
readability-string-compare,
'
HeaderFilterRegex: '^(aten/|c10/|torch/).*$'
AnalyzeTemporaryDtors: false


@ -30,5 +30,5 @@ RUN if [ -n "$CLANG_VERSION" ]; then \
# Install cuda if version is specified
ARG CUDA_VERSION
RUN if [ -n "$CUDA_VERSION" ]; then \
conda install cuda -c "nvidia/label/cuda-${CUDA_VERSION}"; \
conda install -y cuda -c "nvidia/label/cuda-${CUDA_VERSION}"; \
fi


@ -46,7 +46,7 @@ If you are using [Visual Studio Code Remote - SSH](https://code.visualstudio.com
## Step 6: Open in DevContainer
1. In VSCode, use the Command Palette (`Ctrl+Shift+P` or `Cmd+Shift+P` on macOS) to run the "Remote-Containers: Open Folder in Container..." command.
1. In VSCode, use the Command Palette (`Ctrl+Shift+P` or `Cmd+Shift+P` on macOS) to run the "Dev Containers: Open Folder in Container..." command.
2. You will be prompted with two options: CPU dev container or CUDA dev container. Choose the one you want to run.
## Step 7: Wait for Building the Environment

.flake8

@ -2,14 +2,12 @@
# NOTE: **Mirror any changes** to this file to the [tool.ruff] config in pyproject.toml
# before we can fully move to use ruff
enable-extensions = G
select = B,C,E,F,G,P,SIM1,T4,W,B9,TOR0,TOR1,TOR2
select = B,C,E,F,G,P,SIM1,T4,W,B9,TOR0,TOR1,TOR2,TOR9
max-line-length = 120
# C408 ignored because we like the dict keyword argument syntax
# E501 is not flexible enough, we're using B950 instead
ignore =
E203,E305,E402,E501,E721,E741,F405,F821,F841,F999,W503,W504,C408,E302,W291,E303,
# fix these lints in the future
E275,
E203,E305,E402,E501,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,
# shebang has extra meaning in fbcode lints, so I think it's not worth trying
# to line this up with executable bit
EXE001,
@ -29,11 +27,33 @@ ignore =
# TODO(kit1980): fix all TOR102 issues
# `torch.load` without `weights_only` parameter is unsafe
TOR102,
# TODO(kit1980): resolve all TOR003 issues
# pass `use_reentrant` explicitly to `checkpoint`.
TOR003
per-file-ignores =
__init__.py: F401
test/**: F821
test/**/__init__.py: F401,F821
torch/utils/cpp_extension.py: B950
torchgen/api/types/__init__.py: F401,F403
torchgen/executorch/api/types/__init__.py: F401,F403
test/dynamo/test_higher_order_ops.py: B950
torch/testing/_internal/dynamo_test_failures.py: B950
# TOR901 is only for test, we want to ignore it for everything else.
# It's not easy to configure this without affecting other per-file-ignores,
# so we explicitly list every file where it's violated outside of test.
torch/__init__.py: F401,TOR901
torch/_custom_op/impl.py: TOR901
torch/_export/serde/upgrade.py: TOR901
torch/_functorch/vmap.py: TOR901
torch/_inductor/test_operators.py: TOR901
torch/_library/abstract_impl.py: TOR901
torch/_meta_registrations.py: TOR901
torch/_prims/__init__.py: F401,TOR901
torch/_prims/rng_prims.py: TOR901
torch/ao/quantization/fx/_decomposed.py: TOR901
torch/distributed/_functional_collectives.py: TOR901
torch/distributed/_spmd/data_parallel.py: TOR901
optional-ascii-coding = True
exclude =
./.git,


@ -38,3 +38,5 @@ f70844bec783bfce43c950ccf180dc494e86f2bf
e6ec0efaf87703c5f889cfc20b29be455885d58d
# 2023-07-31 [optim][BE] split test file into logical parts: SWA, LR, optim
a53cda1ddc15336dc1ff0ce1eff2a49cdc5f882e
# 2024-01-02 clangformat: fused adam #116583
9dc68d1aa9e554d09344a10fff69f7b50b2d23a0


@ -8,7 +8,7 @@ body:
value: >
#### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the
existing and past issues](https://github.com/pytorch/pytorch/issues)
It's likely that your bug will be resolved by checking our FAQ or troubleshooting guide [documentation](https://pytorch.org/docs/master/dynamo/index.html)
It's likely that your bug will be resolved by checking our FAQ or troubleshooting guide [documentation](https://pytorch.org/docs/main/dynamo/index.html)
- type: textarea
attributes:
label: 🐛 Describe the bug
@ -33,7 +33,7 @@ body:
label: Minified repro
description: |
Please run the minifier on your example and paste the minified code below
Learn more here https://pytorch.org/docs/master/compile/troubleshooting.html
Learn more here https://pytorch.org/docs/main/torch.compiler_troubleshooting.html
placeholder: |
env TORCHDYNAMO_REPRO_AFTER="aot" python your_model.py
or


@ -19,7 +19,7 @@ self-hosted-runner:
- windows.g5.4xlarge.nvidia.gpu
- bm-runner
- linux.rocm.gpu
- macos-m1-12
- macos-m1-stable
- macos-m1-13
- macos-12-xl
- macos-12


@ -9,6 +9,10 @@ inputs:
use-gha:
description: If set to any value, use GHA to download the artifact. Otherwise use s3.
required: false
s3-bucket:
description: S3 bucket to download builds
required: false
default: "gha-artifacts"
runs:
using: composite
@ -18,6 +22,7 @@ runs:
uses: seemethere/download-artifact-s3@v4
with:
name: ${{ inputs.name }}
s3-bucket: ${{ inputs.s3-bucket }}
- name: Download PyTorch Build Artifacts from GHA
if: inputs.use-gha
@ -29,6 +34,10 @@ runs:
shell: bash
run: unzip -o artifacts.zip
- name: Remove artifacts.zip
shell: bash
run: rm artifacts.zip
- name: Output disk space left
shell: bash
run: df -H


@ -0,0 +1,29 @@
name: Download TD Artifacts
description: Download artifacts from target_determination.yml
inputs:
use-gha:
description: If set to any value, use GHA to download the artifact. Otherwise use s3.
required: false
runs:
using: composite
steps:
- name: Download TD Artifacts from S3
if: ${{ !inputs.use-gha }}
uses: seemethere/download-artifact-s3@v4
with:
name: td_results
- name: Download TD Artifacts from GHA
if: inputs.use-gha
uses: actions/download-artifact@v3
with:
name: td_results.json
- name: Move artifacts to .additional_ci_files folder
shell: bash
run: |
mkdir -p .additional_ci_files
mv td_results.json .additional_ci_files/td_results.json


@ -26,11 +26,20 @@ outputs:
description: True if the filtered test configs matrix is empty. False otherwise.
value: ${{ steps.filter.outputs.is-test-matrix-empty }}
keep-going:
description: True if keep-going label was on PR.
description: True if keep-going label was on PR or [keep-going] in PR body.
value: ${{ steps.filter.outputs.keep-going }}
reenabled-issues:
description: Comma separated list of issue numbers that should correspond to disable test issues that the PR fixes
value: ${{ steps.filter.outputs.reenabled-issues }}
ci-verbose-test-logs:
description: True if ci-verbose-test-logs label was on PR or [ci-verbose-test-logs] in PR body.
value: ${{ steps.filter.outputs.ci-verbose-test-logs }}
ci-no-test-timeout:
description: True if ci-no-test-timeout label was on PR or [ci-no-test-timeout] in PR body.
value: ${{ steps.filter.outputs.ci-no-test-timeout }}
ci-no-td:
description: True if ci-no-td label was on PR or [ci-no-td] in PR body.
value: ${{ steps.filter.outputs.ci-no-td }}
runs:
using: composite
@ -46,7 +55,8 @@ runs:
retry_wait_seconds: 30
command: |
set -eux
python3 -m pip install requests==2.26.0 pyyaml==6.0
# PyYAML 6.0 doesn't work with MacOS x86 anymore
python3 -m pip install requests==2.26.0 pyyaml==6.0.1
- name: Parse ref
id: parse-ref

.github/actions/linux-build/action.yml

@ -0,0 +1,207 @@
name: linux-build
inputs:
build-environment:
required: true
description: Top-level label for what's being built/tested.
docker-image-name:
required: true
description: Name of the base docker image to build with.
build-generates-artifacts:
required: false
default: "true"
description: If set, upload generated build artifacts.
build-with-debug:
required: false
default: "false"
description: If set, build in debug mode.
sync-tag:
required: false
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
cuda-arch-list:
required: false
default: "5.2"
description: List of CUDA architectures CI build should target
runner:
required: false
default: "linux.2xlarge"
description: Runner label to select worker type
test-matrix:
required: false
type: string
description: |
An optional JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
s3-bucket:
description: S3 bucket to download artifact
required: false
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
default: ""
GITHUB_TOKEN:
description: GitHub token
required: true
HUGGING_FACE_HUB_TOKEN:
description: Hugging Face Hub token
required: false
default: ""
outputs:
docker-image:
value: ${{ steps.calculate-docker-image.outputs.docker-image }}
description: The docker image containing the built PyTorch.
test-matrix:
value: ${{ steps.filter.outputs.test-matrix }}
description: An optional JSON description of what test configs to run later on.
runs:
using: composite
steps:
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: configure aws credentials
uses: aws-actions/configure-aws-credentials@v3
if: ${{ inputs.aws-role-to-assume != '' }}
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-build
role-duration-seconds: 10800
aws-region: us-east-1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: ${{ inputs.docker-image-name }}
- name: Use following to pull public copy of the image
id: print-ghcr-mirror
env:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Parse ref
id: parse-ref
shell: bash
run: .github/scripts/parse_ref.py
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
if: always()
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
# Apply the filter logic to the build step too if the test-config label is already there
- name: Select all requested test configurations (if the test matrix is available)
id: filter
uses: ./.github/actions/filter-test-configs
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
test-matrix: ${{ inputs.test-matrix }}
job-name: ${{ steps.get-job-id.outputs.job-name }}
- name: Download pytest cache
uses: ./.github/actions/pytest-cache-download
continue-on-error: true
with:
cache_dir: .pytest_cache
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
s3_bucket: ${{ inputs.s3-bucket }}
- name: Build
if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == ''
id: build
env:
BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
BRANCH: ${{ steps.parse-ref.outputs.branch }}
# TODO duplicated
AWS_DEFAULT_REGION: us-east-1
PR_NUMBER: ${{ github.event.pull_request.number }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla
PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }}
TORCH_CUDA_ARCH_LIST: ${{ inputs.cuda-arch-list }}
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
DEBUG: ${{ inputs.build-with-debug == 'true' && '1' || '0' }}
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}
shell: bash
run: |
# detached container should get cleaned up by teardown_ec2_linux
container_name=$(docker run \
-e BUILD_ENVIRONMENT \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e AWS_DEFAULT_REGION \
-e PR_NUMBER \
-e SHA1 \
-e BRANCH \
-e SCCACHE_BUCKET \
-e SCCACHE_S3_KEY_PREFIX \
-e XLA_CUDA \
-e XLA_CLANG_CACHE_S3_BUCKET_NAME \
-e SKIP_SCCACHE_INITIALIZATION=1 \
-e TORCH_CUDA_ARCH_LIST \
-e PR_LABELS \
-e OUR_GITHUB_JOB_ID \
-e HUGGING_FACE_HUB_TOKEN \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--tty \
--detach \
--user jenkins \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}"
)
docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'
- name: Archive artifacts into zip
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'
shell: bash
run: |
zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .additional_ci_files
- name: Store PyTorch Build Artifacts on S3
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
if-no-files-found: error
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Upload sccache stats
if: steps.build.outcome != 'skipped'
uses: seemethere/upload-artifact-s3@v5
with:
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 365
if-no-files-found: warn
path: sccache-stats-*.json
s3-bucket: ${{ inputs.s3-bucket }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()


@ -9,6 +9,10 @@ inputs:
job_identifier:
description: Text that uniquely identifies a given job type within a workflow. All shards of a job should share the same job identifier.
required: true
s3_bucket:
description: S3 bucket to upload/download PyTest cache
required: false
default: ""
runs:
using: composite
@ -30,6 +34,7 @@ runs:
CACHE_DIR: ${{ inputs.cache_dir }}
JOB_IDENTIFIER: ${{ inputs.job_identifier }}
REPO: ${{ github.repository }}
BUCKET: ${{ inputs.s3_bucket }}
run: |
python3 .github/scripts/pytest_cache.py \
--download \
@ -38,3 +43,4 @@ runs:
--job_identifier $JOB_IDENTIFIER \
--temp_dir $RUNNER_TEMP \
--repo $REPO \
--bucket $BUCKET \


@ -26,8 +26,14 @@ runs:
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: Check if in a ARC runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> $GITHUB_OUTPUT
- name: Start docker if docker daemon is not running
shell: bash
if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
run: |
if systemctl is-active --quiet docker; then
echo "Docker daemon is running...";


@ -9,6 +9,16 @@ runs:
shell: bash
run: echo "DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock" >> "${GITHUB_ENV}"
- name: Remove leftover Docker config file
shell: bash
continue-on-error: true
run: |
set -ex
cat ~/.docker/config.json || true
# https://stackoverflow.com/questions/64455468/error-when-logging-into-ecr-with-docker-login-error-saving-credentials-not
rm -f ~/.docker/config.json
- name: Stop all running docker containers
if: always()
shell: bash

.github/actions/setup-xpu/action.yml

@ -0,0 +1,67 @@
name: Setup XPU host
description: Set up XPU host for CI
runs:
using: composite
steps:
- name: Clean all stopped docker containers
if: always()
shell: bash
run: |
# Prune all stopped containers.
# If another runner is already pruning on this node, this will skip.
nprune=$(ps -ef | grep -c "docker container prune")
if [[ $nprune -eq 1 ]]; then
docker container prune -f
fi
- name: Runner health check system info
if: always()
shell: bash
run: |
cat /etc/os-release || true
cat /etc/apt/sources.list.d/oneAPI.list || true
cat /etc/apt/sources.list.d/intel-gpu-jammy.list || true
whoami
- name: Runner health check xpu-smi
if: always()
shell: bash
run: |
xpu-smi discovery
- name: Runner health check GPU count
if: always()
shell: bash
run: |
ngpu=$(xpu-smi discovery | grep -c -E 'Device Name')
msg="Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified"
if [[ $ngpu -eq 0 ]]; then
echo "Error: Failed to detect any GPUs on the runner"
echo "$msg"
exit 1
fi
- name: Runner diskspace health check
uses: ./.github/actions/diskspace-cleanup
if: always()
- name: Runner health check disconnect on failure
if: ${{ failure() }}
shell: bash
run: |
killall runsvc.sh
- name: Preserve github env variables for use in docker
shell: bash
run: |
env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
- name: XPU set GPU_FLAG
shell: bash
run: |
# Add render group for container creation.
render_gid=`cat /etc/group | grep render | cut -d: -f3`
echo "GPU_FLAG=--device=/dev/mem --device=/dev/dri --group-add video --group-add $render_gid" >> "${GITHUB_ENV}"

.github/actions/teardown-xpu/action.yml

@ -0,0 +1,20 @@
name: Teardown XPU host
description: Tear down XPU host for CI
runs:
using: composite
steps:
- name: Teardown XPU
if: always()
shell: bash
run: |
# Prune all stopped containers.
# If another runner is already pruning on this node, this will skip.
nprune=$(ps -ef | grep -c "docker container prune")
if [[ $nprune -eq 1 ]]; then
docker container prune -f
fi
- name: Runner diskspace health check
uses: ./.github/actions/diskspace-cleanup
if: always()


@ -1,59 +0,0 @@
name: Update commit hash
inputs:
repo-owner:
required: false
type: string
description: Name of repository's owner.
default: pytorch
repo-name:
required: true
type: string
description: Name of the repository we're updating commit hash for.
branch:
required: true
type: string
description: Branch to fetch commit of
pin-folder:
type: string
description: Path to folder with commit pin
required: false
default: .github/ci_commit_pins
updatebot-token:
required: true
type: string
description: update bot token
pytorchbot-token:
required: true
type: string
description: pytorchbot token
description: update commit hash
runs:
using: composite
steps:
- name: Checkout repo
uses: actions/checkout@v3
with:
fetch-depth: 1
submodules: false
token: ${{ inputs.updatebot-token }}
- name: Checkout
shell: bash
run: |
git clone https://github.com/${{ inputs.repo-owner }}/${{ inputs.repo-name }}.git --quiet
- name: Check if there already exists a PR
shell: bash
env:
REPO_NAME: ${{ inputs.repo-name }}
BRANCH: ${{ inputs.branch }}
PIN_FOLDER: ${{ inputs.pin-folder }}
UPDATEBOT_TOKEN: ${{ inputs.updatebot-token }}
PYTORCHBOT_TOKEN: ${{ inputs.pytorchbot-token }}
NEW_BRANCH_NAME: update-${{ inputs.repo-name }}-commit-hash/${{ github.run_id }}-${{ github.run_number }}-${{ github.run_attempt }}
run: |
# put this here instead of the script to prevent accidentally changing the config when running the script locally
git config --global user.name "PyTorch UpdateBot"
git config --global user.email "pytorchupdatebot@users.noreply.github.com"
python .github/scripts/update_commit_hashes.py --repo-name "${REPO_NAME}" --branch "${BRANCH}" --pin-folder "${PIN_FOLDER}"


@ -11,6 +11,10 @@ inputs:
Suffix to add to the filename of the artifacts. This should include the
workflow job id, see [Job id in artifacts].
required: true
s3-bucket:
description: S3 bucket to download builds
required: false
default: "gha-artifacts"
runs:
using: composite
@ -87,6 +91,7 @@ runs:
uses: seemethere/upload-artifact-s3@v5
if: ${{ !inputs.use-gha }}
with:
s3-bucket: ${{ inputs.s3-bucket }}
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 14
@ -97,6 +102,7 @@ runs:
uses: seemethere/upload-artifact-s3@v5
if: ${{ !inputs.use-gha }}
with:
s3-bucket: ${{ inputs.s3-bucket }}
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 14
@ -108,6 +114,7 @@ runs:
if: ${{ !inputs.use-gha }}
continue-on-error: true
with:
s3-bucket: ${{ inputs.s3-bucket }}
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 14


@ -6,7 +6,6 @@ reviewers:
- albanD
- miladm
- bdhirsh
- voznesenskym
per_author:
symbolic-shapes:


@ -1 +1 @@
6518fa9b2c74e84d7eb1fc6e3eb51e43213f0c05
17a70815259222570feb071034acd7bae2adc019


@ -1 +1 @@
99944a2fb8624947f9c0e2edc898ff42a16124da
d6015d42d9a1834bc7595c4bd6852562fb80b30b


@ -1 +1 @@
e12d200c97d7aab668b976e92b46513c9ca7a0d8
a0c79b399b75368208464b2c638708165cca7ef1


@ -1 +1 @@
a80c1e7f958e7d8e8f92319db70876940e67ad9b
707a632930bfde19ffb361cdf5c31a7682af4e67

.github/labeler.yml

@ -8,10 +8,6 @@
- torch/_inductor/**
- test/inductor/**
"module: export":
- torch/_export/**
- test/export/**
"ciflow/inductor":
- torch/_decomp/**
- torch/_dynamo/**
@ -30,6 +26,11 @@
- .github/ci_commit_pins/**
- c10/core/Sym*
- torch/fx/experimental/symbolic_shapes.py
- torch/fx/experimental/recording.py
- torch/fx/experimental/sym_node.py
- torch/fx/experimental/validator.py
- torch/fx/experimental/_sym_dispatch_mode.py
- torch/fx/experimental/proxy_tensor.py
- test/distributed/_tensor/test_dtensor_compile.py
- test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
- torch/distributed/_tensor/**
@ -43,6 +44,7 @@
- aten/src/ATen/native/mkldnn/**
- torch/cpu/**
- torch/utils/mkldnn.py
- torch/utils/_sympy/**
- test/test_mkldnn.py
"module: mkldnn":
@ -79,3 +81,7 @@
- torch/nn/parallel/**
- test/distributed/**
- torch/testing/_internal/distributed/**
"module: distributed_checkpoint":
- torch/distributed/checkpoint/**
- test/distributed/checkpoint/**

Some files were not shown because too many files have changed in this diff.