`metal::isnan` is only defined for floating-point types, so provide a generic wrapper
that is false for integral types
TODO: Figure out why type propagation is not working (or should it?)
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144665
Approved by: https://github.com/dcci
Now the error message looks as follows:
```
% python ../test/inductor/test_torchinductor.py -v -k test_cat_unbacked_2d_mps
test_cat_unbacked_2d_mps (__main__.GPUTests) ... inline_call []
stats [('calls_captured', 6)]
inductor [('extern_calls', 2), ('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('autograd_cache_bypass', 1), ('not_ok', 1)]
ERROR
======================================================================
ERROR: test_cat_unbacked_2d_mps (__main__.GPUTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3126, in wrapper
method(*args, **kwargs)
File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 12254, in new_test
return value(self)
File "/Users/malfet/miniconda3/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 5885, in test_cat_unbacked_2d
self.common(
File "/Users/malfet/miniconda3/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 620, in check_model_gpu
check_model(
File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 461, in check_model
actual = run(*example_inputs, **kwargs)
File "/Users/malfet/git/pytorch/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner
raise InductorError(e, currentframe()).with_traceback(
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 689, in _compile_fx_inner
mb_compiled_graph = fx_codegen_and_compile(
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1149, in fx_codegen_and_compile
return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1064, in codegen_and_compile
compiled_fn = graph.compile_to_module().call
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/graph.py", line 1977, in compile_to_module
return self._compile_to_module()
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/graph.py", line 2018, in _compile_to_module
mod = PyCodeCache.load_by_key_path(
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/codecache.py", line 2768, in load_by_key_path
mod = _reload_python_module(key, path)
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/runtime/compile_tasks.py", line 51, in _reload_python_module
exec(code, mod.__dict__, mod.__dict__)
File "/var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmpmyfz2ju8/lt/cltm34ognlgcc6oxoe6bexvtbwcdtdfgnkjj5miz7vhkemitacp7.py", line 40, in <module>
File "/var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmpmyfz2ju8/lt/cltm34ognlgcc6oxoe6bexvtbwcdtdfgnkjj5miz7vhkemitacp7.py", line 32, in _compile_mps_shader
torch._inductor.exc.InductorError: SyntaxError: failed to compile
kernel void generated_kernel(
device float* out_ptr0,
constant float* in_ptr0,
uint xindex [[thread_position_in_grid]]
) {
long x1 = (xindex) / (3);
auto tmp0 = x1;
auto tmp1 = static_cast<long>(tmp0);
auto tmp2 = 0;
auto tmp3 = tmp1 >= tmp2;
auto tmp4 = 2;
auto tmp5 = tmp1 < tmp4;
long x0 = (xindex) % (3);
auto tmp6 = in_ptr0[x0 + 3*(x1)];
auto tmp7 = tmp5 ? tmp6 : 0.0;
auto tmp8 = tmp1 >= tmp4;
auto tmp9 = 2 + ks0;
auto tmp10 = static_cast<long>(tmp9);
auto tmp11 = tmp1 < tmp10;
auto tmp12 = 1.0;
auto tmp13 = tmp8 ? tmp12 : 0.0;
auto tmp14 = tmp5 ? tmp7 : tmp13;
long x2 = xindex;
out_ptr0[x2] = static_cast<float>(tmp14);
}
with program_source:18:25: error: use of undeclared identifier 'ks0'
auto tmp9 = 2 + ks0;
^
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
To execute this test, run the following from the base repo dir:
python test/inductor/test_torchinductor.py GPUTests.test_cat_unbacked_2d_mps
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.472s
FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144649
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #144647, #144648
Just pass the symbolic size variables (like `ks0` in the error above) as kernel arguments
After this change `pytest test/inductor/test_torchinductor.py -v -k _mps` reports 330 failed, 429 passed, compared to 335 failed, 424 passed before
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144662
Approved by: https://github.com/jansel
Summary: Introduce an `is_hop_single_tensor_return` field to the `Node` class in serialization so that during deserialization, when there is a single return, we know whether it is a single-element tuple or a bare single element.
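To illustrate the ambiguity this field resolves, consider these two HOP body returns (a hypothetical sketch; both have "a single return", yet the deserializer must reconstruct different shapes):
```python
import torch

# Hypothetical illustration: both bodies have a single return, but one is a
# one-element tuple and the other a bare tensor; without the new flag the
# serialized form of the two is indistinguishable.
def body_tuple(x):
    return (x.sin(),)  # tuple containing a single tensor

def body_bare(x):
    return x.sin()     # a single bare tensor
```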
Test Plan:
```
buck2 run @mode/dev-nosan sigmoid/inference/test:e2e_test_cpu -- -r E2ETestCPUCond
buck2 run @mode/dev-nosan sigmoid/inference/test:test_passes -- -r test_const_folding2
```
Differential Revision: D66991624
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143227
Approved by: https://github.com/zhxchen17
Before this change
```python
>>> import torch
>>> torch.mps._compile_shader('What')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/malfet/miniconda3/envs/py311/lib/python3.11/site-packages/torch/mps/__init__.py", line 157, in _compile_shader
return torch._C._mps_compileShader(source)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Failed to create metal library, error: Error Domain=MTLLibraryErrorDomain Code=3 "program_source:1:1: error: unknown type name 'What'
What
^
program_source:1:5: error: expected unqualified-id
What
^
" UserInfo={NSLocalizedDescription=program_source:1:1: error: unknown type name 'What'
What
^
program_source:1:5: error: expected unqualified-id
What
^
}
```
After this change
```python
>>> import torch
>>> torch.mps._compile_shader('What')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/malfet/git/pytorch/pytorch/torch/mps/__init__.py", line 157, in _compile_shader
return torch._C._mps_compileShader(source)
SyntaxError: program_source:1:1: error: unknown type name 'What'
What
^
program_source:1:5: error: expected unqualified-id
What
^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144648
Approved by: https://github.com/Skylion007
ghstack dependencies: #144647
If pytorch is installed systemwide (via the OS package manager) or by an alternative package manager like `uv`, pip is not available, causing an error in `collect_env`.
However, it is still possible to collect exactly the same list using the `importlib` API, which is always available, as sketched below.
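A minimal sketch of what such collection can look like (the keyword filter is illustrative, not the exact list used by `collect_env`):
```python
from importlib import metadata

def collect_relevant_packages(keywords=("torch", "numpy", "mkl")):
    """List installed packages matching keywords, without shelling out to pip."""
    found = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name and any(k in name.lower() for k in keywords):
            found.append(f"{name}=={dist.version}")
    return sorted(found)

print("\n".join(collect_relevant_packages()))
```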
Fixes #144615
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144616
Approved by: https://github.com/malfet
This tool makes it easy to search through the config state-space with a minimal reproduction or test. It presents a similar interface to the config bisector, taking a test_function that should either raise an Exception or return False upon failure.
It has two entry points: `fuzz_n_tuple`, which tries every combination of n configs, and `bisect`, which randomly flips configs and tries to find the minimal reproduction upon failure. `bisect` is a much more efficient way to search the space, but `fuzz_n_tuple` can give you peace of mind that a new config will compose with every other config.
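For illustration, a test function handed to either entry point might look like the sketch below; only the contract (raise or return False on failure) comes from this description, the body is hypothetical:
```python
import torch

def check_compile_still_correct() -> bool:
    """Hypothetical fuzzer target: raises or returns False when a config breaks."""
    @torch.compile
    def f(x):
        return (x + 1).sin()

    x = torch.randn(8)
    return torch.allclose(f(x), (x + 1).sin())  # False signals a failing config
```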
It's been used to find three bugs so far in the inductor config:
https://github.com/pytorch/pytorch/issues/140220
https://github.com/pytorch/pytorch/issues/140219
https://github.com/pytorch/pytorch/issues/143524
This PR also adds a bunch of missing types to the inductor config to get them to play nice with the fuzzer, so it can be a good forcing function for adding types to config.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139736
Approved by: https://github.com/eellison
# Issue
https://github.com/pytorch/pytorch/pull/137243 introduced a feature where the ND tiling algorithm analyzes memory dependencies. It iterates over all `Dep`'s of the kernel. However, the analysis is only applicable to `MemoryDep` instances, which are a subclass of `Dep`. In particular, it doesn't work for `StarDep`'s, for the reasons described here: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/simd.py#L1653
# Fix
This PR changes the algorithm to only iterate over `MemoryDep` instances.
# Testing
Parameterized an existing test for `torch.bucketize` to also run with ND tiling. This test emits a node with `StarDep`'s. Without this PR, the compiler would crash on this test case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144497
Approved by: https://github.com/eellison
This bug could cause gradient corruption, as a race condition exists
between FSDP's reduce-scatter and any operations reading .grad on the
main stream. The root cause is that the pipelining stage's .backward
implementation was modified to support zero-bubble and, in doing so,
invoked .grad() instead of .backward(), performing manual gradient
accumulation and manually calling into hooks for FSDP. But one key FSDP
hook was missed: the '_root_post_backward_final_callback' hook, which is
responsible for syncing the grad reduction ops after the last layer's
backward completes.
Note: this fix applies to both zero-bubble and non-zero-bubble schedules. This caused some confusion initially, as non-zero-bubble schedules do use torch.autograd.backward() which would have called into fsdp's hooks and synced, unlike zero-bubble which uses .grad() which does not invoke hooks. However, this difference was already taken into consideration as FSDP's hooks are manually disabled before invoking either type of backward, and then the hooks are manually triggered.
A better fix as a follow up PR would be to invoke .backward() for the
weight grad, so that we never have to disable or manually invoke hooks.
Modified test_pp_dp to intentionally race against FSDP's reduce by
modifying the parameters inplace in a mathematically identical way, and
confirmed it fails intermittently when the FSDP sync is not applied and
passes with the FSDP sync added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144535
Approved by: https://github.com/awgu
ghstack dependencies: #144534
Some refactoring, but important changes include
- initializing the weights properly so there are more nonzero gradients
flowing, which helped catch the DDP+PP+ZB bug
- make the DDP+ZB+PP bug skip for now and file an issue
- tighten the tolerances to defaults
- use separate targets instead of same inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144534
Approved by: https://github.com/H-Huang
I.e. when the `MTL_CAPTURE_ENABLED` environment variable is set to 1, one should be able to wrap the code with `torch.mps.profiler.metal_capture` to generate a gputrace for shaders invoked inside the context manager.
For example, code below:
```python
import torch
import os
def foo(x):
    return x[:, ::2].sin() + x[:, 1::2].cos()

if __name__ == "__main__":
    os.environ["MTL_CAPTURE_ENABLED"] = "1"
    x = torch.rand(32, 1024, device="mps")
    with torch.mps.profiler.metal_capture("compiled_shader"):
        torch.compile(foo)(x)
```
should capture the execution of a `torch.compile` generated shader.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144561
Approved by: https://github.com/manuelcandales
ghstack dependencies: #144559, #144560
Summary:
A reland of https://github.com/pytorch/pytorch/pull/142426.
Copying the description over here:
For torch.export (strict and non-strict), we don't do functional decomposition. Instead, we preserve the custom triton ops as custom ops. This is because we want the exported program to be high-level and serializable.
The alternative:
If we decompose the custom op to a functional hop and make it a node in the exported program, we need to figure out ways of serializing the hop and its arguments, which can be triton.jit-ed python functions and triton dtypes. This is undesirable because:
- it can be tedious to maintain a layer that serializes the jitted function (e.g. with a string) and dtypes.
- changes to triton or the serialization logic for triton arguments can be BC-breaking
- the exported program will expose the implementation detail (i.e. triton source code) for a specific backend (GPU) to users, which mixes levels of abstraction.
Future plans:
After this PR, in the short term, we expect users to have a separate aot_compile stage that compiles the exported program into a Cubin file on the same machine where they call export, which does autotuning, removes the triton dependency, and serves the model with the Cubin. This guarantees that triton changes won't break BC.
In the long term, we may export multiple cubins for the triton op directly.
Test Plan: see new tests.
Differential Revision: D67879685
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144284
Approved by: https://github.com/zou3519
_ReaderView doesn't work correctly when a read extends past the end of the view.
read(-1) would call read(-1) on the base_stream, which would consume the entire underlying stream, even if the view ended before that.
read(n) would read n bytes, even if the view ended before that.
The new implementation clamps the size read to the size of the view.
readinto(b) would read len(b) bytes, even if the view ended before that.
Since the interface depends on the size of b, we use a (potentially) shortened view into b to avoid a copy. If the _ReaderView doesn't contain enough data to fill the buffer, this will appear as end of stream to the caller, which is the desired behavior.
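A minimal sketch of the clamping logic described above, assuming the view is defined by absolute start/end offsets into the base stream (the real helper differs in detail):
```python
import io

class ClampedReaderView(io.RawIOBase):
    """Exposes bytes [start, end) of base_stream as its own stream (sketch)."""

    def __init__(self, base_stream, start, end):
        self.base_stream = base_stream
        self.pos = start
        self.end = end
        base_stream.seek(start)

    def read(self, size=-1):
        remaining = self.end - self.pos
        # Clamp so read(-1) or a large read(n) never consumes bytes past the view.
        size = remaining if size < 0 else min(size, remaining)
        data = self.base_stream.read(size)
        self.pos += len(data)
        return data

    def readinto(self, b):
        remaining = self.end - self.pos
        # A shortened memoryview avoids a copy; a partial fill then looks
        # like end-of-stream to the caller, as described above.
        n = self.base_stream.readinto(memoryview(b)[: min(len(b), remaining)])
        self.pos += n
        return n
```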
This fix should not be user facing, since the bug is in an internal helper, and is only visible with new code down the stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143357
Approved by: https://github.com/saumishr
This snippet used to cause segfault on GPU due to incorrect input order when invoking the kernel
```
import os
import torch
import torch.nn as nn
from torch._inductor import config as inductor_config
from torch._inductor.utils import fresh_inductor_cache
M, N, K = 128, 128, 4096
dtype = torch.float16
X = torch.randn(M, N, dtype=dtype).cuda()
A = torch.randn(M, K, dtype=dtype).cuda()
B = torch.randn(K, N, dtype=dtype).cuda()
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, b, x, y):
        return torch.addmm(b, x, y)

import ck4inductor

ck_dir = os.path.dirname(ck4inductor.__file__)

with fresh_inductor_cache():
    with inductor_config.patch(
        {
            "max_autotune_gemm_backends": "CK",
            "autotune_fallback_to_aten": False,
            "compile_threads": 144,
            "rocm.ck_dir": ck_dir,
        }
    ):
        compiled_model = torch.compile(SimpleModel(), mode="max-autotune")
        res = compiled_model(X, A, B)
        res_eager = torch.addmm(X, A, B)
        torch.testing.assert_close(res, res_eager)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144519
Approved by: https://github.com/chenyang78
The regex-based parser would erroneously split on commas inside nested brackets; for example, it would do the following parse, which is wrong:
'M: [(32, 16), (64, 32)], ZPB: 2' -> ['M: [(32, 16)', ' (64, 32)]', 'ZPB: 2']
The new manual parser handles this situation the right way:
'M: [(32, 16), (64, 32)], ZPB: 2' -> ['M: [(32, 16), (64, 32)]', 'ZPB: 2']
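The core idea is to track bracket depth and only split at depth zero; a minimal sketch (not the actual parser):
```python
def split_top_level(s: str) -> list:
    """Split on commas that are not nested inside (), [], or {}."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(s[start:i].strip())
            start = i + 1
    parts.append(s[start:].strip())
    return parts

assert split_top_level("M: [(32, 16), (64, 32)], ZPB: 2") == ["M: [(32, 16), (64, 32)]", "ZPB: 2"]
```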
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144297
Approved by: https://github.com/XuehaiPan, https://github.com/jeffdaily
Summary: Use existing collective_comm (currently used for nccl traces) for hccl traces as well. Only init the nccl profiler when KINETO_HAS_NCCL_PROFILER is defined so as to not init it when the build is for MTIA/HCCL
Test Plan: CIs
Differential Revision: D67285333
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144490
Approved by: https://github.com/sraikund16
Periodically run testsuite for s390x
**Dependencies update**
Package z3-solver is updated from version 4.12.2.0 to version 4.12.6.0. This is a minor version update, so no functional change is expected.
The reason for the update is the build on s390x: pypi doesn't provide a binary build of z3-solver 4.12.2.0 or 4.12.6.0 for s390x. Version 4.12.2.0 fails to build with the newer gcc used on s390x builders, but those errors are fixed in version 4.12.6.0, so this minor version bump fixes the build on s390x.
```
# pip3 install z3-solver==4.12.2.0
...
In file included from /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:53:
/tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp: In member function ‘void* region::allocate(size_t)’:
/tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/tptr.h:29:62: error: ‘uintptr_t’ does not name a type
29 | #define ALIGN(T, PTR) reinterpret_cast<T>(((reinterpret_cast<uintptr_t>(PTR) >> PTR_ALIGNMENT) + \
| ^~~~~~~~~
/tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:82:22: note: in expansion of macro ‘ALIGN’
82 | m_curr_ptr = ALIGN(char *, new_curr_ptr);
| ^~~~~
/tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:57:1: note: ‘uintptr_t’ is defined in header ‘<cstdint>’; did you forget to ‘#include <cstdint>’?
56 | #include "util/page.h"
+++ |+#include <cstdint>
57 |
```
**Python paths update**
On AlmaLinux 8 s390x, old paths:
```
python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())'
/usr/lib/python3.12/site-packages
```
Total result is `/usr/lib/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages`
New paths:
```
python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))'
/usr/local/lib64/python3.12/site-packages;/usr/local/lib/python3.12/site-packages;/usr/lib64/python3.12/site-packages;/usr/lib/python3.12/site-packages;/usr/local/lib64/python3.12/site-packages/torch;/usr/local/lib/python3.12/site-packages/torch;/usr/lib64/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages/torch
```
```
# python -c 'import torch ; print(torch)'
<module 'torch' from '/usr/local/lib64/python3.12/site-packages/torch/__init__.py'>
```
`pip3 install dist/*.whl` installs torch into `/usr/local/lib64/python3.12/site-packages`, and later it's not found by cmake with old paths:
```
CMake Error at CMakeLists.txt:9 (find_package):
By not providing "FindTorch.cmake" in CMAKE_MODULE_PATH this project has
asked CMake to find a package configuration file provided by "Torch", but
CMake did not find one.
```
https://github.com/pytorch/pytorch/actions/runs/10994060107/job/30521868178?pr=125401
**Builders availability**
Build took 60 minutes
Tests took: 150, 110, 65, 55, 115, 85, 50, 70, 105, 110 minutes (split into 10 shards)
60 + 150 + 110 + 65 + 55 + 115 + 85 + 50 + 70 + 105 + 110 = 975 minutes used. Let's double it. It would be 1950 minutes.
We have 20 machines * 24 hours = 20 * 24 * 60 = 20 * 1440 = 28800 minutes
We currently run 5 nightly binaries builds, each on average 90 minutes build, 15 minutes test, 5 minutes upload, 110 minutes total for each, 550 minutes total. Doubling would be 1100 minutes.
That leaves 28800 - 1100 = 27700 minutes total. Periodic tests would use 1950 minutes, leaving 25750 minutes.
Nightly binaries build + nightly tests = 3050 minutes.
25750 / 3050 = 8.44. So we could do both 8 more times for additional CI runs for any reason. And that is with pretty good safety margin.
**Skip test_tensorexpr**
On s390x, pytorch is built without llvm.
Even if it were built with llvm, llvm currently doesn't support the features used on s390x, and the test fails with errors like:
```
JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer
unknown file: Failure
C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) }
```
**Disable cpp/static_runtime_test on s390x**
Quantization is not fully supported on s390x in pytorch yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125401
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pytest considers all symbols starting with `test_` as test cases/functions and runs them.
`test_compiled_fsdp` is a decorator but, being imported, it is discovered by pytest.
Rename it to avoid this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144544
Approved by: https://github.com/Skylion007
# Motivation
Remove the redundant error message.
Without this PR:
```python
>>> import torch
>>> torch.xpu.get_device_name(1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 215, in get_device_name
return get_device_properties(device).name
File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 258, in get_device_properties
raise AssertionError("Invalid device index")
AssertionError: Invalid device index
```
With this PR:
```python
>>> import torch
>>> torch.xpu.get_device_name(1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 215, in get_device_name
return get_device_properties(device).name
File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 257, in get_device_properties
return _get_device_properties(device) # type: ignore[name-defined] # noqa: F821
RuntimeError: The device index is out of range. It must be in [0, 1), but got 1.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144379
Approved by: https://github.com/EikanWang
Summary:
When there is a `torch._check()` that checks if a sym_int is equal to some constant, it will generate 3 nodes in the graph with targets `operator.ge`, `operator.le` and `operator.eq`. These operators belong to `_SYM_BOOL_OPS` but the `meta_val` of these nodes is `bool` instead of `torch.SymBool`.
Similar things can happen to `torch.SymInt`, where a `node.target` belongs to `_SYM_INT_OPS` but `node.meta["val"]` is an `int` instead of `torch.SymInt`.
Therefore, we need to check both `meta_val` type and `node.target` type during serialization.
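A condensed sketch of that dual check (the names mirror the description above; the real serializer logic is more involved):
```python
import operator
import torch

# Stand-in for the serializer's set of sym-bool operators.
_SYM_BOOL_OPS = {operator.eq, operator.le, operator.ge, operator.lt, operator.gt}

def should_serialize_as_sym_bool(node) -> bool:
    val = node.meta.get("val")
    # torch._check() can fold meta["val"] to a plain Python bool even though
    # the target is a sym-bool op, so the value type alone is not enough.
    return node.target in _SYM_BOOL_OPS and isinstance(val, (bool, torch.SymBool))
```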
Test Plan:
```
buck2 run @mode/dev-nosan caffe2/test:test_export -- -r test_sym_bool_torch_check_equal
buck2 run @mode/dev-nosan caffe2/test:test_export -- -r test_sym_int_torch_check_equal
```
Differential Revision: D67883754
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144295
Approved by: https://github.com/avikchaudhuri, https://github.com/angelayi
# Motivation
We propose to support Python with statement on `torch.Stream`. This is a benefit for all accelerators when writing device-agnostic code. The device-specific stream will also be supported because they are generally derived from `torch.Stream`.
With this PR, we can do the following:
```python
s1 = torch.Stream()
# Set s1 to the current stream
torch.accelerator.set_stream(s1)
with torch.Stream() as s2:
    # Inside the with statement, s2 is the current stream
    assert torch.accelerator.current_stream() == s2
# Here the current stream should be s1 again
assert torch.accelerator.current_stream() == s1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140138
Approved by: https://github.com/albanD
Fixes
```
In file included from /Users/nshulga/git/pytorch/pytorch/build/aten/src/ATen/native/cpu/int4mm_kernel.cpp.DEFAULT.cpp:1:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/cpu/int4mm_kernel.cpp:998:2: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
};
^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144510
Approved by: https://github.com/kit1980
Move the constant-to-value logic into a `value_to_metal` function (similar to `value_to_cpp`).
Call it from the `constant` as well as `where` ops (the latter is in turn called from the `masked` op).
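Purely as an illustration of the helper's role, a sketch of what mapping a Python constant to a Metal source literal can look like (not the actual implementation):
```python
import math

def value_to_metal(val) -> str:
    """Map a Python constant to a Metal source literal (illustrative sketch)."""
    if isinstance(val, bool):
        return "true" if val else "false"
    if isinstance(val, float):
        if math.isnan(val):
            return "NAN"
        if math.isinf(val):
            return "HUGE_VALF" if val > 0 else "-HUGE_VALF"
    return repr(val)
```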
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144500
Approved by: https://github.com/dcci
Summary:
- use GraphTransformObserver + replace_node hooks to track node sources when they are replaced
- add pre_grad_graph tracking to tlparse
- add the node provenance information to post_grad_graph tlparse. This is for the frontend to create a mapping between pre_grad and post_grad graph. See an example frontend (this is just a prototype) here: https://drive.google.com/file/d/1cMHH_0y4FJUSS9tATwGQvA72O0Lth8eh/view?usp=sharing
- change "action" of NodeSource from a single action to a list of actions.
- It's BC-breaking because we removed some of `GraphTransformObserver`'s class methods, including `on_node_erase`.
https://docs.google.com/document/d/1dGh9myqNhywmbfP0Quzx_f04bghDFlj8cawj8MopiO8/edit?tab=t.0
The front-end code that takes in the tlparse result is in https://github.com/yushangdi/compiler_explorer.
ghstack-source-id: 260390519
Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer
buck run mode/dev-nosan fbcode//caffe2/test:fx -- -r node_source
buck run mode/dev-nosan fbcode//caffe2/test:fx -- -r graph_provenance
```
Front-end example screenshots on a real model, 93% coverage rate between pre_grad_graph and post_grad_graph
```
buck2 build --show-output mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true -c fbcode.nvcc_arch=a100,h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark
MODEL_ENTITY_ID=644688112
SNAPSHOT_ID=32
MODULE=merge
TORCH_COMPILE_DEBUG=1 CUDA_VISIBLE_DEVICES=7 TORCH_LOGS="+inductor,+schedule,output_code,graph_code" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 ../buck-out/v2/gen/fbcode/ec86b05dd59e84db/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark.par --local-model /home/bahuang/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR_EP --gpu-trace --aot-inductor-config="{'max_autotune':
True}"
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:auto_functionalize
```
Differential Revision: D65006709
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144277
Approved by: https://github.com/desertfire
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: palmje
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144371
Approved by: https://github.com/Skylion007
Sometimes a job is cancelled during nested docker container creation.
This leads to the nested docker container not being stopped and the worker hanging forever in the job.
Improve nested docker container cleanup for these cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144149
Approved by: https://github.com/seemethere
### Fix bug for exporting torch.cdist and support 'compute_mode'
In [cdist](https://github.com/pytorch/pytorch/blob/main/torch/onnx/symbolic_opset9.py#L6181), the 'compute_mode' was ignored, which leads to a big difference in the computation flow between the original torch.cdist and the exported onnx file when computing Euclidean distance (p=2). When computing Euclidean distance, running the exported onnx model is 10x slower than running torch.cdist directly, and is also very likely to cause CUDA OOM for larger matrices unnecessarily.
This change exports the same onnx computation flow as the forward of torch.cdist defined in the [forward implementation](9225f149eb/aten/src/ATen/native/Distance.cpp (L66-L149)) under every compute_mode.
Fixes #144212
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144213
Approved by: https://github.com/justinchuby
Fixes https://github.com/pytorch/pytorch/issues/144433
Test with some debug statements added:
```
>>> import torch
trying to load libcublas.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cublas/lib/libcublas.so.12']
trying to load libcublas.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cublas/lib/libcublas.so.12
trying to load libcudnn.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn.so.9']
trying to load libcudnn.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn.so.9
trying to load libnvrtc.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so.12']
trying to load libnvrtc.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so.12
trying to load libcudart.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12']
trying to load libcudart.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12
trying to load libcupti.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_cupti/lib/libcupti.so.12']
trying to load libcupti.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_cupti/lib/libcupti.so.12
trying to load libcufft.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cufft/lib/libcufft.so.11']
trying to load libcufft.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cufft/lib/libcufft.so.11
trying to load libcurand.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10']
trying to load libcurand.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10
trying to load libnvJitLink.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12']
trying to load libnvJitLink.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12
trying to load libcusparse.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cusparse/lib/libcusparse.so.12']
trying to load libcusparse.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cusparse/lib/libcusparse.so.12
trying to load libcusparseLt.so.*[0-9] from []
trying to load libcusparseLt.so.*[0-9] from /usr/local/lib/python3.9/site-packages/cusparselt/lib/libcusparseLt.so.0
trying to load libcusolver.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cusolver/lib/libcusolver.so.11']
trying to load libcusolver.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cusolver/lib/libcusolver.so.11
trying to load libnccl.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2']
trying to load libnccl.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2
trying to load libnvToolsExt.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nvtx/lib/libnvToolsExt.so.1']
trying to load libnvToolsExt.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nvtx/lib/libnvToolsExt.so.1
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:275: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
cpu = _conversion_method_template(device=torch.device("cpu"))
>>> exit()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144477
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia
397 -> 395 tests failing. The `static_cast<>` is needed because there are several overloads of `fmod()` that are otherwise ambiguous. I wonder if we should take NaN propagation into account (maybe it's not tested).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144449
Approved by: https://github.com/malfet
This PR
* makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners
* skips and fixes several tests, failed on MI300, observed in https://github.com/pytorch/pytorch/pull/140989
Skipped due to unsupported Float8_e4m3fn data type on MI300 (need to update test code to use datatypes supported by MI300):
- distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_\*_gather_dim_\* (24 tests across inductor/distributed configs)
- distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_\*_scatter_dim_\* (12 tests across inductor/distributed configs)
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2
Skipped due to AssertionError on MI300:
- inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16
- distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1
Skipped:
- test_cuda.py::TestCudaMallocAsync::test_clock_speed
- test_cuda.py::TestCudaMallocAsync::test_power_draw
- test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda
Skipped flaky tests on MI300:
- distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda
- inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests)
Fixed:
- test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda
Features:
- inductor/test_fp8.py - declare a new function to convert FP8 datatypes to ROCm-supported FP8 datatypes. It keeps test names for CUDA and ROCm and allows enabling Inductor FP8 tests on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143673
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony
Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode
This PR is one of a series of PRs which separate post op fusion and lowering for quantized linear and convolution. It moves binary post op fusion of qlinear out of the lowering pass.
This PR moves the fusion pass from the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - linear` patterns are replaced by `onednn.qlinear_pointwise`
2. Fuse `onednn.qlinear_pointwise` and post ops
3. Lower to cpp backend
This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.
**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144224
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
ghstack dependencies: #143903
We do not need `install_aotriton.sh` and `aotriton_version.txt` any more since `aotriton.cmake` now installs the best binary release package as the default option when building pytorch.
This should resolve the issue of needing a pre-installed aotriton package when building PyTorch for ROCm from source, which is not feasible if building PyTorch *outside* a CI docker image. With this change, a user can have a pre-installed AOTriton in their environment, if desired, and have the build pick it up by specifying the `AOTRITON_INSTALLED_PREFIX` env var, or have the build automatically detect and install the compatible version. As a third option, the user can also force AOTriton to build from source instead, using the `AOTRITON_INSTALL_FROM_SOURCE` env var.
Also, with the changes in this PR, the cmake build process handles the tasks of copying aotriton .so and images directory from `torch/lib` to the installation path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137443
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
* Using the MultiProcessContinuousTest base class is faster (60s vs 279s for
the full run of `test_manual_with_data_parallel` and all its
parametrizations)
* Have to move to a new file to use MPTC since it requires a different
launcher style in `__main__`
* Propose to reorganize the composability tests anyway, since
`test/_composable/test_composability/test_pp_composability` is an
annoyingly long path
* rename `test_manual_with_data_parallel` to `test_pp_dp` for
simplicity/consistency with newer test names. (manual refers to not
using tracer frontend, but that's not so important to call out in the
test name)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144345
Approved by: https://github.com/H-Huang, https://github.com/mori360
Fixes #ISSUE_NUMBER
Similar to #143682: for large maximum values we were sampling integers via %, which doesn't provide a uniform distribution. Here we limit the max skew to approx 1% (random32 is used for max values `<= 2**32 / 128`)
This comes with a significant perf penalty, especially for cuda, but it's a pretty bad bug, so we'll have to figure out what can be done to improve it.
`torch.compile` has always been producing correct results for this, and its performance is also significantly better than current eager (eager is ~660 GB/s on H100, torch.compile 1200 GB/s), so we have to figure out why torch.compile is better.
`__launch_bounds__` slightly regress perf, so perhaps we can figure out how to specify them better, but it's only 20-30 GB/s, so the big difference is still unexplained.
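A tiny scaled-down demonstration of the modulo skew (an 8-bit generator with max=200 instead of 32 bits with a huge max):
```python
import collections

# With an 8-bit generator (256 values) and max=200, residues 0..55 can be
# produced by two inputs (v and v+200) while 56..199 have only one, so
# small values are sampled twice as often.
counts = collections.Counter(v % 200 for v in range(256))
assert counts[0] == 2 and counts[199] == 1
```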
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143787
Approved by: https://github.com/eqy
Summary: if we have a.to(b), and b has a different dtype than a, then it must be a copy. In this case, we do not need to freeze the tensor. Instead, we use torch.ops.aten._assert_tensor_metadata.default to ensure that a does not have the same dtype as b.
Fixes https://github.com/pytorch/pytorch/issues/139718
Update executorch pin to include https://github.com/pytorch/executorch/pull/7277.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_float_conversion
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_device_to_mutation_float
```
Differential Revision: D66988295
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142420
Approved by: https://github.com/bdhirsh
When calling `torch.masked.mean(...)` with a boolean tensor, the dtype is inferred to be bool. When the mean is being computed, the sum operator is used. When the sum operator is used with dtype=torch.bool, the result is clamped to True (1) leading to an incorrect mean being calculated.
The below example shows how the incorrect result occurs:
```
a = torch.tensor([True, True])
count = torch.sum(torch.ones(a.shape, dtype=torch.int64)) # 2
total = torch.sum(a, dtype=torch.bool) # True (1)
mean = total / count # 0.5
```
This PR upcasts the dtype used for the summation to int32 in the case of bool tensors, allowing the correct result to be computed.
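Redoing the example above with the upcast shows the effect of the fix (a sketch of the behavior, not the internal code path):
```python
import torch

a = torch.tensor([True, True])
count = torch.sum(torch.ones(a.shape, dtype=torch.int64))  # 2
total = torch.sum(a, dtype=torch.int32)                    # 2, no longer clamped to 1
mean = total / count                                       # 1.0, as expected
```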
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139999
Approved by: https://github.com/cpuhrsch
`prop_kind` of MKLDNN convolution is always `dnnl_forward`, i.e., `dnnl_forward_training` , regardless of whether grad is needed. Setting `prop_kind` to `dnnl_forward_inference` for mkldnn_convolution_pointwise could have better performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142855
Approved by: https://github.com/jgong5
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode
This PR is the first of a series of PRs which separate post op fusion and lowering for quantized linear and convolution. It moves unary post op fusion of qlinear out of the lowering pass.
This PR moves the fusion pass from the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - linear` patterns are replaced by `onednn.qlinear_pointwise`
2. Fuse `onednn.qlinear_pointwise` and post ops
3. Lower to cpp backend
This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.
**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143903
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
This PR essentially introduces two new APIs
* torch.compiler.save_cache_artifacts
* torch.compiler.load_cache_artifacts
which aim to create a mega cache experience where the user can start collecting cache artifacts, and later call the save API to fetch them. In the next attempt, the user can "hot load" the cache artifacts via the load function.
This bundling approach reduces the need to rely on porting individual files one by one, or relying on many network requests.
Note that these APIs CANNOT log to structured logging as these functions will be called before and after compilation, as opposed to during compilation. Due to this limitation, the API returns a struct that the user can log with.
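A rough usage sketch (the file path and surrounding plumbing are illustrative; save is assumed to return the serialized blob plus the loggable struct mentioned above):
```python
import torch

def fn(x):
    return x.sin() + x.cos()

x = torch.randn(8)

# Run 1: compile as usual, then bundle whatever the caches produced.
torch.compile(fn)(x)
artifacts = torch.compiler.save_cache_artifacts()
if artifacts is not None:
    blob, info = artifacts  # serialized bytes + a struct the user can log
    with open("/tmp/pt2_cache_bundle.bin", "wb") as f:
        f.write(blob)

# Run 2 (e.g. a later job): hot-load the bundle before compiling.
with open("/tmp/pt2_cache_bundle.bin", "rb") as f:
    torch.compiler.load_cache_artifacts(f.read())
torch.compile(fn)(x)  # should now hit the bundled caches
```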
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143341
Approved by: https://github.com/jansel
This error started popping up in HUD CA benchmarks:
```python
File "/data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py", line 371, in dce
self.fx_tracer.graph.eliminate_dead_code(is_impure)
File "/data/users/xmfan/core/b/pytorch/torch/fx/graph.py", line 1862, in eliminate_dead_code
self.lint()
File "/data/users/xmfan/core/b/pytorch/torch/fx/graph.py", line 1753, in lint
raise RuntimeError(f"Node redefined name {node.name}!")
RuntimeError: Node redefined name aot0_expand!
```
We added CA initial capture's renaming (https://github.com/pytorch/pytorch/pull/133148) to help debug issues with the AOT backward, but it errors out when we have multiple instances of the same AOT backward. This likely only showed up now because of increased hierarchical graph reuse. I fixed it by adding a postfix counter to the node name, as sketched below.
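The de-duplication idea, reduced to a sketch (illustrative, not the actual compiled_autograd code):
```python
from collections import Counter

seen = Counter()

def dedup_node_name(name: str) -> str:
    """Suffix repeated graph node names with a postfix counter."""
    seen[name] += 1
    return name if seen[name] == 1 else f"{name}_{seen[name] - 1}"

assert dedup_node_name("aot0_expand") == "aot0_expand"
assert dedup_node_name("aot0_expand") == "aot0_expand_1"
```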
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144202
Approved by: https://github.com/bdhirsh, https://github.com/jansel
Summary:
- When a user specifies the `TORCHINDUCTOR_MAX_AUTOTUNE=1` env variable, we add `config.max_autotune=True` to the generated minifier_launcher
- We should do this for other inductor configs as well in a followup Diff
Currently in the dynamo and aoti minifiers, if a config is overwritten by an env variable, the config will not show up in the config list in the minifier_launcher.py file. As a result, when running the minifier_launcher, users need to re-apply the same env variable.
This is:
1) not convenient for the users
2) if they copy-paste the minifier_launcher.py to us without including the env variable, we could be confused and not able to reproduce the error.
Underlying implementation change:
- Add `env_default` parameter to `codegen_config()`. If set, configs overriden by the env are not considered default.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:utils -- -r test_codegen_config
```
Differential Revision: D67299312
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143330
Approved by: https://github.com/jansel, https://github.com/eellison
### 1. Synopsis
Adds `cache_modifier='.cg'` optional argument into `tl.load` instructions in the inductor-generated triton code for selected buffers.
It makes the `tl.load` instruction skip the L1 cache for short-lived / non-reused data.
### 2. Using the feature
This feature is experimental and disabled by default. It can be enabled by setting the environment variable `TORCHINDUCTOR_SKIP_L1` to `1`.
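With the flag on, the emitted loads carry the modifier; a hand-written triton kernel showing the same call (the inductor-generated code differs in its details):
```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    # cache_modifier=".cg" bypasses L1 for these short-lived operands
    x = tl.load(x_ptr + offs, mask=mask, cache_modifier=".cg")
    y = tl.load(y_ptr + offs, mask=mask, cache_modifier=".cg")
    tl.store(out_ptr + offs, x + y, mask=mask)
```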
### 3. Results
For a simple pointwise addition kernel:
```python
@torch.compile
def add_dummy(x: torch.Tensor, y: torch.Tensor):
    return x + y
```
we get the following bandwidth numbers (in GB/s; the comparison plots are omitted from this text):
(a) feature DISABLED: (bandwidth plot)
(b) feature ENABLED: (bandwidth plot)
### 4. Caveats
The perf boost is only available when using
```python
torch._dynamo.config.cache_size_limit = 64 # or any other sufficiently big number..
torch._dynamo.config.automatic_dynamic_shapes = False # use static shapes
```
When using (the default) dynamic shapes, only 1-2 triton kernels are generated with non-optimal block-sizes for
*all* the cases (vector sizes), hiding any perf benefit from skipping the L1 cache.
In the static case, as an optimal block size is generated for each vector size, the perf benefit of skipping the L1 cache becomes visible.
This block-size optimization issue is a larger problem in pytorch inductor and is outside the scope of this feature.
### 5. References
- [tl.load](https://triton-lang.org/main/python-api/generated/triton.language.load.html#triton.language.load)
- [cache operators](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143115
Approved by: https://github.com/jansel
Fixes #76772, #144196
Extends #144106
- added type annotations to `lazy_property`.
- added type annotation to all `@property` and `@lazy_property` inside `torch.distributions` module.
- added simply type-check unit test to ensure type inference is working.
- replaced deprecated annotations like `typing.List` with the corresponding counterpart.
- simplified `torch.Tensor` hints with plain `Tensor`, otherwise signatures can become very verbose.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144110
Approved by: https://github.com/Skylion007
Fixes #136291
This PR is to fix the `invalid configuration argument` problem happened on ROCm when input is a large tensor when calling `torch.layer_norm`.
```
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/functional.py", line 2573, in layer_norm
return torch.layer_norm
RuntimeError: HIP error: invalid configuration argument
```
After investigation, I found that the reason why this error happened is: The amd compute language runtime checks whether `gridDim.x * blockDim.x` is greater than `std::numeric_limits<uint32_t>::max()` or not. If yes, it will error out with the "invalid configuration argument" message.
The fix is to split the whole task to several chunks so that each chunk will not trigger the failure condition. This will ensure the correctness and completeness given the current kernel implementation logic of `vectorized_layer_norm_kernel`.
Also added a large-tensor layer_norm unit test `test_layer_norm_large_tensor` with the same shape `[16, 3000, 3000, 16]` as the one used by pytorch issue #136291, so that the unit test can check the expected output value to ensure correctness.
The future work may include performance optimization of layer_norm and CK layer_norm integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144007
Approved by: https://github.com/eqy
Currently our nightly aarch64 binaries have the correct suffixes +cpu or +cu126, but release binaries are missing these suffixes. Hence, to correct this and make sure nightly and release binaries are consistent, I propose this change.
I see that override is already set correctly in release workflow:
https://github.com/pytorch/pytorch/actions/runs/12383179841/job/34565381200
For CPU:
```
OVERRIDE_PACKAGE_VERSION="2.6.0+cpu"
```
For CUDA:
```
OVERRIDE_PACKAGE_VERSION="2.6.0+cu126"
```
The removed code would set OVERRIDE_PACKAGE_VERSION="2.6.0" for both cuda and cpu builds of release binaries.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144285
Approved by: https://github.com/malfet, https://github.com/tinglvv
Summary:
Reuse the templatized implementation of the box_cox caffe2 operator.
* Duplicate .cc file of AVX2
* change intrinsics functions to use AVX512 instructions
* override templates
* extend the caller to use new methods
* guard AVX512 with a gflag to allow smooth transition
Differential Revision: D67433457
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143627
Approved by: https://github.com/hl475
Motivation: Generalize unit tests so that they can be executed on cuda and non-cuda devices.
Dependency: #133209, merged now.
There was a previous PR #135242 for these changes, closed due to incorrect commits. I have incorporated the changes as suggested in the comments.
@kwen2501 @zeshengzong Please review the changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139184
Approved by: https://github.com/kwen2501
Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/144186. For the test case reported in the issue, we saw some nodes with `LoopNest`
- `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc724426680>)`
- `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc75c2cae60>)`
Although these 2 `LoopNest`s have the same `range` and `var`, they have different `steps` (1 and 16), so they fail to be merged with outer loops. And since we removed the global buffers when localizing the buffer, we need to restore the status of `V.graph.removed_buffers` before falling back to codegen without outer loop fusion.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_outer_loop_fusion_buffer_remove
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144243
Approved by: https://github.com/jgong5
This is the much safer change compared to https://github.com/pytorch/pytorch/pull/144283
Before:
```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/test_optim.py -k TestDifferentiableOptimizer.test_sgd
/data/users/janeyx/pytorch/torch/autograd/gradcheck.py:1156: FutureWarning: Please use torch.vmap instead of torch._vmap_internals.vmap.
result = vmap(vjp)(torch.stack(grad_outputs))
/data/users/janeyx/pytorch/torch/autograd/gradcheck.py:1156: FutureWarning: Please use torch.vmap instead of torch._vmap_internals.vmap.
result = vmap(vjp)(torch.stack(grad_outputs))
.
----------------------------------------------------------------------
Ran 1 test in 0.028s
```
(the env vars aren't necessary)
After:
```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/test_optim.py -k TestDifferentiableOptimizer.test_sgd
.
----------------------------------------------------------------------
Ran 1 test in 0.028s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144287
Approved by: https://github.com/cyyever, https://github.com/soulitzer
Not sure why CI did not complain about it, but in my local runs it clearly says
```
Advice (FLAKE8) E226
missing whitespace around arithmetic operator
See https://www.flake8rules.com/rules/E226.html
268 | with code.indent():
269 | if len(idx_var_names) > 1:
270 | for idx, name in enumerate(idx_var_names):
>>> 271 | code.writeline(f"auto {name} = thread_pos.{chr(120+idx)};")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144282
Approved by: https://github.com/Skylion007
Summary:
1. Added more details for some of the assert statements.
2. Moved assert statements to use tgif_assert
Test Plan: all unit tests should pass
Reviewed By: jingsh
Differential Revision: D67608251
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143771
Approved by: https://github.com/jingsh
* Update CUDNN Frontend to 1.9.0, which includes some API improvements, new features, and bugfixes. This is a header-only lib fix, so it should be pretty straightforward.
* The nicest feature is that it now logs/prints warnings when the CUDNN compiled version does not match the dynamically loaded one
* Fixes corrupted / truncated log lines from being printed by CUDNN Frontend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144200
Approved by: https://github.com/cyyever, https://github.com/albanD
Thanks @clee2000 for bringing the memleak to my attention: https://github.com/pytorch/pytorch/actions/runs/12549765082/job/34996244798.
This memleak in the test was caused by the differentiable flavors. Because we had param.clone() and param persisted outside the for loop, the autograd graph would continue growing for each optimizer.step instead of being deleted after the optim input was used up.
To clarify, I had still expected (and still do expect) the test to fully clean everything up once the test is over, but I didn't get the chance to look into why that's not the case. This change would preliminarily unblock this particular test from failing the memleak CI.
Use detach instead of clone, which is... cheaper anyway :D since a detach, as I've learned from @soulitzer, is a view with requires_grad=False.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144154
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/huydhn, https://github.com/albanD
This is done by copying the one for a regular mm, and enforcing that the scales have the same sharding scheme as their respective operands. This works because scales are 2-d tensors that must "broadcast" to the operands. This broadcasting is trivial when scales have dimensions of 1 or N, which is the only options we currently support.
Note, however, that after this PR scales will be allowed to have the mesh's world size as a dimension (in certain cases). This works because, when mapped to the local shard, it becomes a dimension of 1, which can be handled by the operator. Note that when using row-wise _scaled_mm for tensor (sequence) parallelism, this situation arises naturally!
Because of these specificities, the test is rather complex, as it specifically tests all these behaviors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143760
Approved by: https://github.com/tianyu-l
vllm hit an error because we were incorrectly stating that two patterns are duplicates. See comment inline:
For a particular generated pattern repr, store all the equivalent graphs that were used to generate it. Because we ignore certain patterns in searching, but not in matching, use the graph to distinguish whether two equivalent searches are actually different.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139321
Approved by: https://github.com/shunting314
Fixes #143738. Currently the scale factor used for averaging is rounded to 0 if the dtype is an integer, resulting in an all-zero output. This fix uses `truediv` instead for integer cases.
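The failure mode in miniature (a standalone arithmetic sketch, not the actual inductor codegen):
```py
window_sum = 9                 # sum over an avg_pool window of size 3
print(window_sum * (1 // 3))   # 0 -- the integer scale truncates to zero
print(window_sum / 3)          # 3.0 -- truediv preserves the average
```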
## Test
```bash
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_avg_pool1d_cpu_int64
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_avg_pool2d_cpu_int64
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_avg_pool3d_cpu_int64
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_local_response_norm_cpu_int64
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144059
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
Fixes #143102
Addresses 2 problems relating to dynamic batch size in BMM autotuner:
1. With dynamic batch size, the input may be a sympy Mult expression, such as `s0*8`, which occurs in many dynamo benchmark models. We address this by using `size_hints` to solve for any such expressions (see the sketch after this list). This is safe since this section of the code is only called to generate inputs for benchmarking.
2. Some epilogue nodes may use the dynamic batch size as part of the codegen, for example when an input to the epilogue node is transposed and has the dynamic batch size in the stride of other dimensions. When these epilogue nodes exist, if the sizevar is not already present in the `kernel.args`, codegen will create a new sizevar with a name. It is possible that subsequent calls to `def_kernel` could overwrite this variable name, so to avoid this we pass all the sizevars as `extra_sizevars` to the calls to `def_kernel` for the GEMM functions, so no variable renaming happens later in the BMM definition.
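A sketch of what resolving such an expression looks like (illustrative sympy usage, not the actual inductor code):
```py
import sympy

s0 = sympy.Symbol("s0")     # dynamic batch size symbol, as in `s0*8`
batch = s0 * 8
# A size hint substitutes a concrete value, giving a usable benchmark size.
print(batch.subs({s0: 4}))  # 32
```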
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143141
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5
**Problem statement**: I want to be able to centralize and simplify the process by which people add columns/data to existing spans. We have MetricsContext and ChromiumEventLogger, and there are various choices you can make to decide where and when to log different levels of observability for your events. To resolve this, I want a central API for "adding to events under dynamo_timed".
**CompileEventLogger** is intended as a frontend for MetricsContext and ChromiumEventLogger so we can use the same class for handling everything.
CompileEventLogger is intended to be used within a `dynamo_timed()` context. Its purpose is to 1. log to existing events that are in progress (i.e. within dynamo_timed), and 2. log instant events to chromium that are independent of any specific span.
CompileEventLogger has three log levels:
- CHROMIUM: Log only to chromium events, visible via tlparse.
- PT2_COMPILE: Log to chromium_events + pt2_compile_events
- COMPILATION_METRIC: Log to compilation metrics in addition to the toplevel chromium and pt2_compile_event.
In addition, we have a function CompileEventLogger.add() that automagically chooses the correct log level. For now, it is conservative, and will never automagically choose to log CompilationMetrics (though I could imagine it figuring out that the metadata are all keys in CompilationMetrics and therefore loggable there).
The goal here is to make one single interface to log stuff for observability reasons, and make it as easy as possible.
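A hypothetical usage sketch of the intended shape (the import path, call form, and metadata keys here are all assumptions, not the final API):
```py
# Hypothetical sketch only: CompileEventLogger's location and the exact
# signature of .add() depend on the final diff.
from torch._dynamo.utils import dynamo_timed

with dynamo_timed("inductor_compile"):
    # Logs metadata onto the in-progress span; .add() chooses the level.
    CompileEventLogger.add(cache_state="miss", num_graph_breaks=0)
```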
Not included in this diff:
- V1 of this diff will not have implementations of `increment` and `add_to_set` which MetricsContext has, so those usages are not replaced yet. But I'll add those in a followup.
- We don't handle `RuntimeMetricsContext`. It's unclear if I want that to be part of this, because under RuntimeMetricsContext there might not be a toplevel event to log to, so chromium events doesn't make sense in that context. So I might leave that separate for now.
Differential Revision: [D67346203](https://our.internmc.facebook.com/intern/diff/D67346203/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143420
Approved by: https://github.com/aorenste
For correct import and export of functions when the dynamic linkage is used for HIP libraries on windows, the appropriate export/import macros need to be put in place. This Pull Request utilizes existing CUDA import/export macros by converting them to corresponding HIP macros during the hipification process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144098
Approved by: https://github.com/jeffdaily
Otherwise, invoking with torch.half inputs but float weights will result in
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<*xf32>
2025-01-03 14:13:18.747151-0800 python[87772:4027380] /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm, line 975: error 'original module failed verification'
/AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:975: failed assertion `original module failed verification'
```
Test plan: `python -mpytest test/inductor/test_torchinductor.py -k test_nll_loss_backward_mps` should not crash
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144170
Approved by: https://github.com/kit1980, https://github.com/Skylion007
ghstack dependencies: #144167, #144162, #144083, #144084
Summary:
Dump the active proxyOp status per rank and per communicator when WatchDog times out or aborts.
Added a
`#if defined(USE_ROCM) && defined(NCCL_COMM_DUMP)` guard in the print function, so only rcclexp users will see this dump in the console.
These are the PTD-side changes.
Test Plan:
Job with A2A hang due to receiver failing to post receive operations https://fburl.com/mlhub/95vg12r3
Reviewed By: c-p-i-o
Differential Revision: D67036093
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143678
Approved by: https://github.com/c-p-i-o
Summary:
The Triton compiler does not automatically promote fp16/bf16 reductions to fp32 accumulation. This can result in significant accuracy issues.
This diff upcasts the input to FP32 for all math reductions `["welford_reduce", "welford_combine", "prod", "sum", "xor_sum"]`.
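The accuracy issue is easy to reproduce in eager by forcing fp16 accumulation (a standalone sketch, unrelated to the generated Triton code):
```py
import torch

x = torch.full((4096,), 0.1, dtype=torch.float16)

# Naive fp16 accumulation: once the partial sum reaches 256, the fp16
# spacing (0.25) exceeds the rounded addend, so the sum stops growing.
acc = torch.tensor(0.0, dtype=torch.float16)
for v in x:
    acc = acc + v
print(acc)              # ~256, far from the true value

# Upcasting to fp32 before reducing gives the expected result.
print(x.float().sum())  # ~409.5
```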
Test Plan:
CI
```
python test/inductor/test_torchinductor.py TritonCodeGenTests.test_low_precision_reduction
```
Differential Revision: D65965032
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141052
Approved by: https://github.com/blaine-rister
Summary:
Fix https://github.com/pytorch/pytorch/issues/142035 and https://github.com/pytorch/pytorch/issues/143621
When Linear module params are tied to another parameter, like this:
```
class SimpleLinearModel(nn.Module):
    def __init__(self, input_size, output_size):
        super(SimpleLinearModel, self).__init__()
        # Define a linear layer
        self.linear = nn.Linear(input_size, output_size)
        self.tied_weight = self.linear.weight

    def forward(self, x):
        # Forward pass through the linear layer
        b = self.tied_weight + 1
        return self.linear(x), b
```
We get a graph like below:
```
graph():
    %p_tied_weight : [num_users=0] = placeholder[target=p_tied_weight]
    %p_linear_weight : [num_users=2] = placeholder[target=p_linear_weight]
    %p_linear_bias : [num_users=1] = placeholder[target=p_linear_bias]
    %x : [num_users=1] = placeholder[target=x]
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%p_linear_weight, 1), kwargs = {})
    %linear : [num_users=1] = call_function[target=torch.ops.aten.linear.default](args = (%x, %p_linear_weight, %p_linear_bias), kwargs = {})
    return (linear, add)
```
Notice that ` %p_linear_weight : [num_users=2]`.
When we get source partitions, we should exclude attribute nodes like `p_linear_weight` from outputs.
A real world example where people do something like this is in https://github.com/pytorch/pytorch/issues/142035.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r test_module_partitioner_weight_tied
```
Differential Revision: D66998592
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142446
Approved by: https://github.com/angelayi
Replace https://github.com/pytorch/pytorch/pull/138947 for re-import.
Replaces https://github.com/ROCm/pytorch/pull/1592
This PR contains the initial implementation of SDPA with the composable_kernel backend. The CK path can be forced by simply calling torch.backends.cuda.preferred_rocm_fa_library("ck"). Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton being used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math, etc.). The CK path only gets called when flash attention is both enabled (via USE_FLASH_ATTENTION) and selected at runtime by the existing heuristics.
Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of the hard work of @tridao, who is a co-author.
NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.
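For reference, selecting the backend looks like this (the call is quoted above):
```py
import torch

# "ck" forces the composable_kernel path; "aotriton" or "default" keep
# the incumbent implementation.
torch.backends.cuda.preferred_rocm_fa_library("ck")
```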
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143695
Approved by: https://github.com/malfet
Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
As titled, this PR exposes this dunder method as a public API in the doc, so that different checkpoint implementations can leverage this protocol instead of exposing a separate API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144100
Approved by: https://github.com/awgu
ghstack dependencies: #144099
# Issue
This PR cleans up an edge case that wasn't handled by https://github.com/pytorch/pytorch/pull/137243. The existing tiling code assumes that `node.get_ranges()` is a reliable source of pointwise and reduction numels. This is true for pointwise kernels, but the situation is more complicated with reductions. Since reductions change the number of elements in a tensor, not all ops within a reduction kernel will have the same number of iterations. For example, `var_mean` fuses pointwise division with the output of reduction sum, and the division lacks the corresponding reduction ranges.
# Fix
Instead of getting numels from `node.get_ranges()`, explicitly pass the global pointwise and reduction numels to the relevant tiling functions. In `SIMDKernel.complete_partial_tiling`, we solve for the missing numel by dividing the global numel by the partial tiling's numel. This ensures all tilings have the correct global numel.
Also, in `SIMDKernel.is_compatible`, add the global reduction numel to node ranges that are missing it. For example, `{"x": 8, "r0_": 8}` is compatible with a node of ranges `([8], [])` when we have `reduction_numel=8`.
Finally, this PR generalizes some of the existing codegen to handle multiple reduction dims. We already had code to ignore reduction splits for pointwise kernels, but it only worked for 1D reductions. Now it can handle ND.
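A toy illustration of that numel bookkeeping (names and structure are illustrative, not Inductor's actual code):
```py
import math

global_reduction_numel = 64
partial_tiling = {"r0_": 8}                  # known reduction split
missing = global_reduction_numel // math.prod(partial_tiling.values())
tiling = {**partial_tiling, "r1_": missing}  # now has the correct global numel
print(tiling)                                # {'r0_': 8, 'r1_': 8}
```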
# Test plan
This PR parametrizes the existing CI test for `var_mean` to also run with tiled reductions. It also adds a new test checking that `var_mean` generates 2D tilings (with tiled reduction enabled). These new tests would fail on the current main branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144041
Approved by: https://github.com/jansel
`autotune_at_compile_time` is a separate codegen file specifically for autotuning Triton kernels. We can skip it for non-Triton kernels (like CUTLASS).
This test (test_aoti_workspace_ptr) checks that `workspace_0.data_ptr()` is codegen-ed correctly in AOTI.
```
// in AOTI codegen
kernels.cuda_fused_0(
(const half*)arg0_1.data_ptr(), (const half*)arg1_1.data_ptr(), (half*)buf0.data_ptr(),
(int)200, (int)5216, (int)10432, (int)10432, (int)5216, (int)0, (int)5216,
(size_t*)nullptr, (uint8_t*)workspace_0.data_ptr(), stream);
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143990
Approved by: https://github.com/henrylhtsang, https://github.com/chenyang78, https://github.com/desertfire
CUDA 12.4 is the default now and we don't build nightly 12.1 anymore, so it's time to move the rest of CI jobs to 12.4. I also clean up some redundant CI jobs on periodic and inductor-periodic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144118
Approved by: https://github.com/atalman
This is an attempt to improve cache usage for jobs in non-pull workflows like periodic, slow, or inductor, as we are seeing build timeouts there from time to time, for example https://github.com/pytorch/pytorch/actions/runs/12553928804. The build timeout never happens in pull or trunk AFAICT because they are more up to date with the cache content coming from the PR itself.
Logically, the same build should use the same cache regardless of the workflows. We have many examples where the same build, for example [linux-focal-cuda12.4-py3.10-gcc9-sm86](https://github.com/search?q=repo%3Apytorch%2Fpytorch+linux-focal-cuda12.4-py3.10-gcc9-sm86&type=code), is split between different workflows and, thus, uses different caches.
I could gather some sccache stats from CH in the meantime to try to prove the improvement before and after this lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144112
Approved by: https://github.com/malfet
Removes 4 fallback ops that are currently impossible to codegen; removing them does not break ABI compatibility.
1. `_cudnn_rnn_backward` and `_histogramdd_bin_edges` both return `Tensor[]`, which we cannot codegen with the current design.
2. `_sparse_coo_tensor_with_dims_and_tensors` only supplies a Sparse operator, which we don't support.
3. `zeros.names` requires a `Dimname` input, which we can't currently codegen.
Removing these ops from the list will improve test performance, since the fallback op generation will use the Python proxy executor instead of calling non-existent C functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143421
Approved by: https://github.com/desertfire
ghstack dependencies: #141371, #143223
When calling a fallback op in cpp_wrapper mode, where any of the inputs are complex numbers, utilize the runtime dispatched fallback mode. This properly handles the Conjugate and Negative dispatch keys, if present, in exchange for a performance pessimization in complex arithmetic.
This PR additionally fixes some cascading failure modes exposed in our `aot_inductor` tests by this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143223
Approved by: https://github.com/desertfire
ghstack dependencies: #141371
Additionally, enable torchinductor opinfo tests exercising all
previously fixed bugs in this stack.
Note: I've manually sharded the cpp_wrapper CI checks into 2 shards.
Once all OpInfo tests are enabled we should switch back to automatic
sharding, but until then the pipeline doesn't have appropriate timing
stats. More shards would be helpful given the compilation slowdown
associated with cpp_wrapper, but 2 will do for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141371
Approved by: https://github.com/desertfire
### Description
During the support of INT8 SDPA https://github.com/pytorch/ao/pull/1372, we found that `at::vec::vec_reduce_all<int32_t>` would fall into a slow scalar path when doing sum and max. So here, we support the two reduce-related ops `reduce_add` and `reduce_max` for `vec512` and `vec256`, using the Sequence instructions.
### Details
- Support vectorized `reduce_add` and `reduce_max` for dtypes `int32` and `float32`, using the Sequence instructions;
- Implement the scalar version for fallback path in vec base;
- Add the operator `reduce` in vec base, in order to simplify the codes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144065
Approved by: https://github.com/mingfeima
Fixes #105203 and is a follow-up PR to #141833
When `in_order` is True (the default), tasks are given out to workers in a round-robin fashion. When `in_order` is False this is no longer needed: since we give up guarantees of reproducibility, tasks should instead go to whichever workers are able to perform work.
In this PR I've added tracking of the number of outstanding tasks for each worker (updated when tasks are added to their queue and when data is returned to the main thread). When finding the next queue to add a task to, if `in_order` is False we only add the task to a worker's queue if it has fewer than `_prefetch_factor` tasks outstanding.
The current default behaviour is left as is.
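A minimal usage sketch (dataset and worker count are illustrative):
```py
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(64))
# With in_order=False, a worker only receives a new task while it has
# fewer than prefetch_factor tasks outstanding, so slow workers no longer
# stall the round-robin assignment used when in_order=True.
loader = DataLoader(ds, num_workers=2, prefetch_factor=2, in_order=False)
for (batch,) in loader:
    pass
```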
Tests are also updated to assert on the worker IDs for each sample of data returned.
I've run the following to confirm they aren't flaky
```bash
for i in {1..20}; do python test/test_dataloader.py TestOutOfOrderDataLoader; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142324
Approved by: https://github.com/andrewkho
I experienced an error while doing a DEBUG build of pytorch on rocm:
```
additional relocation overflows omitted from the output
/usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax
```
Based on discussions on the similar issue #138427, I fixed it by adding `--offload-compress` to the HIP_HIPCC_FLAGS, after which the DEBUG build succeeded on my node.
Further updated the PR to enable the flag for non-DEBUG builds as well due to the size reduction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143986
Approved by: https://github.com/jeffdaily
Currently, the roctracer for Windows is not available. This PR disables any mentions of its usage for Windows and creates dummy functions to keep compatibility with existing code, which warn the user that roctracer is unavailable on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143329
Approved by: https://github.com/sraikund16
Add networkx as a dependency for test_bazel
Example failure: https://github.com/pytorch/pytorch/actions/runs/12551752021/job/34996706301
```
INFO: From Testing //:test_bazel:
==================== Test output for //:test_bazel:
Traceback (most recent call last):
File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/test/_test_bazel.py", line 33, in <module>
test_simple_compile_eager()
File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/test/_test_bazel.py", line 27, in test_simple_compile_eager
opt_foo1 = torch.compile(foo, backend="eager")
File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/__init__.py", line 2533, in compile
backend = _TorchCompileWrapper(backend, mode, options, dynamic)
File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/__init__.py", line 2342, in __init__
self.compiler_fn = lookup_backend(backend)
File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/registry.py", line 66, in lookup_backend
_lazy_import()
File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/registry.py", line 102, in _lazy_import
import_submodule(backends)
File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/utils.py", line 2797, in import_submodule
importlib.import_module(f"{mod.__name__}.{filename[:-3]}")
File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/execroot/pytorch/external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/common.py", line 12, in <module>
from torch._functorch.aot_autograd import (
File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/aot_autograd.py", line 147, in <module>
from .partitioners import default_partition
File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/partitioners.py", line 31, in <module>
from ._activation_checkpointing.graph_info_provider import GraphInfoProvider
File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/_activation_checkpointing/graph_info_provider.py", line 3, in <module>
import networkx as nx
ModuleNotFoundError: No module named 'networkx'
```
No periodic runs on this PR or its main-branch commit, but I'm pretty sure it started with https://togithub.com/pytorch/pytorch/pull/143539
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143995
Approved by: https://github.com/huydhn
# Summary:
This also makes updates to different repositories throughout FB code to roll any updates needed for this new release.
I was not able to get AsyncMM.cu to build (still trying); Yfiu suggested that I just skip it for now.
Test Plan:
Have run various build commands to try and expose errors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143515
Approved by: https://github.com/eqy, https://github.com/Skylion007
As titled, this PR propagates the src_data_rank in the TP API, so that module-level APIs can leverage the flexibility to choose src_data_rank and avoid the communication when it is not needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144005
Approved by: https://github.com/tianyu-l
ghstack dependencies: #143883
As titled, this PR adds a kwarg src_data_rank to the distribute_tensor API, to allow the user to specify a specific rank as the source of the full tensor data. Previously we by default used group_rank=0 as the source of truth for single-device semantics; this new option:
* gives advanced users the flexibility to choose the source data rank
* allows the user to specify None explicitly, which means we will skip the communications needed (scatter/broadcast) for cases that do not care about single-device semantics (i.e. loading from a checkpoint); see the sketch below
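A minimal sketch of the new kwarg (assumes torch.distributed is initialized; import paths per the public DTensor API):
```py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

mesh = init_device_mesh("cpu", (2,))  # illustrative 1-D mesh
dtensor = distribute_tensor(
    torch.randn(8, 8),
    mesh,
    [Shard(0)],
    src_data_rank=None,  # skip scatter/broadcast; e.g. weights loaded later
)
```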
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143883
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l
See #144006
```py
__________________________________________ CudaReproTests.test_repeated_masked_load __________________________________________
RuntimeError: First class dim doesn't work with python 3.12
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
yield
File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 634, in run
self._callTestMethod(testMethod)
File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
if method() is not None:
^^^^^^^^
File "/home/jansel/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
method(*args, **kwargs)
File "/home/jansel/pytorch/test/inductor/test_cuda_repro.py", line 1678, in test_repeated_masked_load
from functorch.einops import rearrange
File "/home/jansel/pytorch/functorch/einops/__init__.py", line 1, in <module>
from .rearrange import rearrange
File "/home/jansel/pytorch/functorch/einops/rearrange.py", line 7, in <module>
from functorch._C import dim as _C
ImportError: initialization failed
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144006
Approved by: https://github.com/Skylion007
Fixes #134277 and https://github.com/pytorch/pytorch/issues/142317.
Sub-PRs containing refactors from this one:
- https://github.com/pytorch/pytorch/pull/141733
- https://github.com/pytorch/pytorch/pull/141738
- https://github.com/pytorch/pytorch/pull/141751 (based off the former)
- https://github.com/pytorch/pytorch/pull/142249
- https://github.com/pytorch/pytorch/pull/142020
- https://github.com/pytorch/pytorch/pull/143135
These refactor PRs should land before the main one.
# Feature
*Note: to minimize risk, multi-dimensional reductions are gated by the flag `config.triton.tile_reductions`, which defaults to False.*
Instead of having a single reduction dimension called `"r"`, we can now support 2D reductions with `"r0_"` and `"r1_"` dimensions. 2D reductions generate two nested loops, with different block pointer advancements in each loop body. Most of the implementation is generic to ND reductions, but for now the tiling algorithm sets a hard limit at 2D.
Here's an example of a 2D persistent reduction kernel:
```
@triton.jit
def triton_per_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr):
    xnumel = 1
    r0_numel = 15
    R0_BLOCK: tl.constexpr = 16
    r1_numel = 15
    R1_BLOCK: tl.constexpr = 16
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], True, tl.int1)
    r0_index = tl.arange(0, R0_BLOCK)[None, :, None]
    r0_offset = 0
    r0_mask = r0_index < r0_numel
    r1_index = tl.arange(0, R1_BLOCK)[None, None, :]
    r1_offset = 0
    r1_mask = r1_index < r1_numel
    rnumel = r0_numel * r1_numel
    RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK
    roffset = r1_offset + (r0_offset*r1_numel)
    rindex = r1_index + (r0_index*r1_numel)
    r0_0 = r0_index
    r1_1 = r1_index
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[15, 15], strides=[30, 1], block_shape=[R0_BLOCK, R1_BLOCK], order=[1, 0], offsets=[r0_offset, r1_offset]), boundary_check=[0, 1], padding_option='zero')[None, :, :]
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK])
    tmp3 = tl.where(r0_mask & r1_mask, tmp1, 0)
    tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK])
    tmp5 = tl.sum(tmp4, 1)[:, None, None]
    tl.store(out_ptr0 + (tl.full([XBLOCK, 1, 1], 0, tl.int32)), tmp5, None)
''', device_str='cuda')
```
There are a few main differences between this kernel and what Inductor would generate without this PR.
- Instead of an `r`/`RBLOCK` dimension, we have two reduction dimensions: `r0_`/`R0_BLOCK` and `r1_`/`R1_BLOCK`.
- There are special size and indexing variables for reductions, which don't directly correspond to any kernel dimension (`rindex`, `rnumel`, `RBLOCK`, and `roffset`). These collapse N-D reduction sizes and indices into 1D. This simplifies the codegen for reductions, which sometimes want to access linear indices instead of N-dimensional ones. Doing things this way allows us to generate N-D loads and stores, but access this data as if it were 1D, minimizing the blast radius of this PR. Although this makes the code more verbose, it shouldn't have a perf impact because the triton compiler eliminates dead code.
- We generate the line `tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK])` before performing the actual reduction. This reshapes N reduction dimensions into 1D. This allows us to reduce over all N dimensions at once, simplifying the codegen and allowing the Triton compiler to decide the order of processing under the hood.
Here's an example of a looped reduction:
```
@triton.jit
def triton_red_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr, R1_BLOCK : tl.constexpr):
    xnumel = 3
    r0_numel = 43
    r1_numel = 129
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = xindex < xnumel
    r0_base = tl.arange(0, R0_BLOCK)[None, :, None]
    r1_base = tl.arange(0, R1_BLOCK)[None, None, :]
    rnumel = r0_numel * r1_numel
    RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK
    rbase = r1_base + (r0_base*r1_numel)
    x0 = xindex
    block_ptr0 = tl.make_block_ptr(in_ptr0, shape=[3, 43, 129], strides=[11094, 258, 1], block_shape=[XBLOCK, R0_BLOCK, R1_BLOCK], order=[2, 1, 0], offsets=[xoffset, 0, 0])
    _tmp2 = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], 0, tl.float32)
    for r0_offset in range(0, r0_numel, R0_BLOCK):
        r0_index = r0_offset + r0_base
        r0_mask = r0_index < r0_numel
        for r1_offset in range(0, r1_numel, R1_BLOCK):
            r1_index = r1_offset + r1_base
            r1_mask = r1_index < r1_numel
            roffset = r1_offset + (r0_offset*r1_numel)
            rindex = r1_index + (r0_index*r1_numel)
            r0_1 = r0_index
            r1_2 = r1_index
            tmp0 = tl.load(block_ptr0, boundary_check=[0, 1, 2], padding_option='zero', eviction_policy='evict_first')
            tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK])
            tmp3 = _tmp2 + tmp1
            _tmp2 = tl.where(r0_mask & r1_mask & xmask, tmp3, _tmp2)
            block_ptr0 = tl.advance(block_ptr0, [0, 0, R1_BLOCK])
        block_ptr0 = tl.advance(block_ptr0, [0, R0_BLOCK, (-1)*R1_BLOCK*((128 + R1_BLOCK) // R1_BLOCK)])
    tmp4 = tl.reshape(_tmp2, [XBLOCK, RBLOCK])
    tmp2 = tl.sum(tmp4, 1)[:, None, None]
    tl.store(tl.make_block_ptr(out_ptr0, shape=[3], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.reshape(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```
In addition to the aforementioned changes to the persistent reduction, multidimensional looped reductions have a few more lines of code:
- They calculate indices inside the loop using `r0_base` and `r1_base`. For compatibility with existing codegen, these are collapsed to the 1D variant `rbase`.
- Block pointer advancements are more nuanced for multidimensional loops. At the end of each loop body, we emit a `tl.advance` line which not only increments the pointer in its own dimension, but also undoes the cumulative increments of the previous loop level. This is equivalent to the usual practice in nested loops of starting with a fresh iteration variable at each level. Implementing this required refactoring the way we generate pointer advancements into a new `self.pointer_advancements` field of the kernel, which categorizes advancements by dimension.
The biggest difficulty in implementing this feature was that we represented tiling with a tuple like `(5,2)`. In the existing codebase, the compiler can infer that the reduction dimension of `(5,2)` is `2`, since reductions are always the last dimension. This became cumbersome now that we have to support multiple reduction dimensions, so I refactored tiling into a dict like `{"x": 5, "r0_": 2, "r1_": 4}`. This required quite a few code changes, but I don't think it makes the underlying logic much more complex. This will also make it easier to eventually support simultaneous pointwise and reduction tiling, like `{"x": 5, "y": 5, "r0_": 2, "r1_": 4}`. (This is not supported today, but we might want to do it eventually.)
The existing tiling algorithm generalized naturally to support reductions. For pointwise kernels, we tile the pointwise dimensions (`"x"`, `"y"`) as is. For reduction kernels, we never tile the `"x"` dimension, and only tile the reduction dimensions (`"r0_"`, `"r1_"`). Thus we only ever tile pointwise OR reduction dimensions, but not both. In principle it seems possible to support both, but it would likely require changes to the kernel fusion and autotuning logic. I thought it best to keep this PR as minimal as possible since it already touched a lot of different files.
Unfortunately, these changes weren't enough to get block pointers in some seemingly simple test cases. In some tests for `argmax` and `var_mean`, we already collapse reduction dimensions into 1D and generate modular indexing expressions, prior to tiling. So it's not trivial to figure out how to expand the collapsed reduction dimension back to a shape that would simplify the indexing.
To address these cases, this PR adds a new feature to the `config.prefer_nd_tiling` option, which analyzes reads and writes in the kernel, using the same mod-div pattern matching logic that generates block pointers later on. By matching this pattern, we can solve for the tiling splits which *would* simplify the indexing expression, and then use that tiling to eliminate the modular indexing and emit a block pointer. This tiling mode is still off by default, but it's important for certain applications where we need to get as many block pointers as possible.
# Test plan
This touches pretty much anything that uses the Triton and Halide backends, so the existing CI provides good coverage. However, 2D reductions are gated behind a few feature flags like `config.prefer_nd_tiling` and `config.tile_reductions`, so this really only checks that the PR doesn't break 1D reductions.
In addition to existing CI tests, this PR also adds some new tests that specifically stress 2D reductions:
- `test_2d_reduction_odd_shapes`: test 2D reductions with a variety of ops and sizes. This covers the typical persistent and looped reductions.
- `test_2d_reduce_no_x_dim`: test 2D reductions with no x dimension.
- `test_2d_welford_reduction`: test 2D welford reductions with block pointers.
- `test_welford_non_block_pointer`: test a 2D welford reduction when block pointer analysis fails.
- `test_reduction_multiple_discontiguous_dims`: test reducing over more than one discontiguous dimension. We won't get a block pointer for this case, since that would require 3D tiling, but we're currently limited to 2D.
- `test_2d_reduction_multi_kernel`: test multi kernel autotuning on a 2D softmax kernel.
- `test_enable_tiled_reductions`: test that `config.triton.tile_reductions` enables/disables this feature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137243
Approved by: https://github.com/jansel
Co-authored-by: Yueming Hao <yhao@meta.com>
Co-authored-by: Jason Ansel <jansel@meta.com>
Alas, PythonPrinter would not work here, nor would CppPrinter, so start building a MetalPrinter.
`pytest test/inductor/test_torchinductor.py -k _mps` score is 474 failed, 277 passed, 32 skipped
Before this change:
`pytest test/inductor/test_torchinductor.py -k _mps` reported 506 failed, 245 passed, 32 skipped
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143973
Approved by: https://github.com/jansel
ghstack dependencies: #143948, #143949
Summary:
Attempt to fix the following exception, which occurred when profiling a PyTorch model (a Meta-internal LLM) that also involved a ThreadPoolExecutor in the background:
```
Exception Found: !stack.empty() INTERNAL ASSERT FAILED at "fbcode/caffe2/torch/csrc/autograd/profiler_python.cpp":987, please report a bug to PyTorch. Python replay stack is empty.
```
The root cause of this issue seems to be that a thread's call stack can be empty, while the code asserts that it is not.
I fixed this with some minimal changes to profiler_python.cpp
Approach:
* Ensuring that the stack in question is not empty before trying to pop from it.
Test Plan:
* Tested manually on a reproducible scenario where the assertion failure was otherwise triggered ( repro too large to include here ). The assertion failure disappears.
* CI
Differential Revision: D67691558
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143940
Approved by: https://github.com/Skylion007, https://github.com/sraikund16
This PR adds max-autotune support for XPU. The current triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also, the `mm_plus_mm` template has accuracy issues in some cases. We will address these issues in the next PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266
Approved by: https://github.com/EikanWang, https://github.com/jansel
`"compile_id"` had slipped into our generated Triton code (in the metadata), which will defeat caching because the same kernels generated in a different order would not cache-hit with each other.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143951
Approved by: https://github.com/oulgen
Only include the parts of `pybind11` that handle GIL management within `cpp_wrapper`. This dramatically improves compilation times by reducing the number of headers we compile. Improvements on my local system are on the order of 2x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143772
Approved by: https://github.com/Skylion007
Third PR in a series of PRs to broaden differentiable optimizer support w/ @janeyx99 (sorry for pinging over the holidays! I just wanted to put this one out but I am definitely not asking for review or anything like that rn)
This is also going to probably be my last PR before the holidays!
Note: This is a branch of #143710 -- I've never worked on a branch of a branch before so I wasn't sure about the protocol so I thought I'd just made the PR and wait until that one gets merged.
This is adding support for differentiable lr, weight_decay, and betas to Adam and AdamW (but after refactoring AdamW into an Adam subclass, it's really just changing code in torch/optim/adam.py)
I had one main thing I was wondering about, which is that adam already has a differentiable flag built in, so I have code like this
```py
if differentiable and isinstance(beta2, Tensor):
    if beta2.requires_grad:
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
    else:
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
else:
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
That I could definitely simplify to just
```py
if differentiable and isinstance(beta2, Tensor):
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
else:
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
It would definitely be a little slower in the case where it's differentiable but doesn't need a grad for beta2, but the code would also be a lot clearer, and I'm debating speed vs. future code usability.
Also the line in the above example:
```py
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
```
was concerning to me because it is considerably more expensive than `value=1 - beta2`, but I couldn't think of a better way to do it.
Further work on #141832
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143726
Approved by: https://github.com/janeyx99
The codebase has a few locations where callable parameter type information is lost because the unpacked *args and **kwargs are typed as Any. Refactor these instances to retain type information using typing_extensions.ParamSpec.
Also, in these functions, enforce return type with TypeVar.
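A minimal sketch of the pattern (an illustrative decorator, not one of the actual refactored functions):
```py
from typing import Callable, TypeVar
from typing_extensions import ParamSpec

P = ParamSpec("P")
R = TypeVar("R")

def with_logging(fn: Callable[P, R]) -> Callable[P, R]:
    # *args/**kwargs keep fn's parameter types instead of decaying to Any,
    # and the TypeVar R preserves the return type.
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        print(f"calling {fn.__name__}")
        return fn(*args, **kwargs)
    return wrapper
```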
Addresses #142306
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143797
Approved by: https://github.com/Skylion007
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
Changes by apply order:
1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.
`.parent{...}.absolute()` -> `.absolute().parent{...}`
4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)
`.parent.parent.parent.parent` -> `.parents[3]`
5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~
~`.parents[3]` -> `.parents[4 - 1]`~
6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
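As a standalone illustration of rules 3 and 4 above:
```py
from pathlib import Path

p = Path("scripts/run.py").absolute()   # resolve first (rule 3)...
print(p.parents[1])                     # ...then index parents (rule 4)
assert p.parent.parent == p.parents[1]  # equivalent, but easier to read
```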
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/143729. `frexp` has 1 input but 2 output tensors with different data types; the current `deduce_dtype_for_cpp_cse_variable` can't deduce the data type for each output correctly because the output index is missing. In this PR, we set the data type of the cse var in the codegen of `frexp` and avoid it being overridden in the following flow.
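For reference, `frexp` is one op with two outputs of different dtypes:
```py
import torch

mantissa, exponent = torch.frexp(torch.tensor([4.0, 0.5]))
print(mantissa, mantissa.dtype)  # tensor([0.5000, 0.5000]) torch.float32
print(exponent, exponent.dtype)  # tensor([3, 0], dtype=torch.int32) torch.int32
```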
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_frexp
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143746
Approved by: https://github.com/jgong5
A few small things in this PR:
- fixed a bug where `workspace.data_ptr().data_ptr()` showed up
- for SM80 CUTLASS kernels, the size symbol for W.size(1) was never created
- for addmm kernels, the ldc bias symbol never showed up
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143528
Approved by: https://github.com/henrylhtsang
(Actual) second PR in a larger project to broaden support for differentiable optimizers with @janeyx99!
In this PR, I did a lot of pattern matching from the previous PR to add support for differentiable weight_decay.
And also added a single new line on line 359 (previously line 352) to make the code from the last PR a little easier to read
Continuation of progress on #141832
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143679
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
In hindsight, we never needed a DICT_SUBCLASS_GUARD_MANAGER, because Dynamo would inline through the overridden keys method. In this PR, we ensure that while creating guards and constructing variable trackers, we get the `d.keys()` value by using `dict.keys(d)`. This ensures that we do not call the overridden keys method, and therefore the C++ guard can use `PyDict_Next` directly to check the guards.
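An illustration of why `dict.keys(d)` matters (standalone Python, not Dynamo code):
```py
class SneakyDict(dict):
    def keys(self):                # overridden method hides the real keys
        return ["only_this_key"]

d = SneakyDict(a=1, b=2)
print(list(d.keys()))      # ['only_this_key'] -- overridden keys()
print(list(dict.keys(d)))  # ['a', 'b'] -- the real underlying keys
```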
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143722
Approved by: https://github.com/jansel
Fixes #ISSUE_NUMBER
Similar to #143682: for large maximum values we were sampling integers via %, which does not produce a uniform distribution. Here we limit the max skew to approx 1% (random32 is used for max values `<= 2**32 / 128`).
This comes with a significant perf penalty, especially for cuda, but it's a pretty bad bug, so we'll have to figure out what can be done to improve it.
`torch.compile` has always been producing correct results for this, and its performance is also significantly better than current eager (eager is ~660 GB/s on H100, torch.compile 1200 GB/s), so we have to figure out why torch.compile is better.
`__launch_bounds__` slightly regresses perf, so perhaps we can figure out how to specify them better, but it's only 20-30 GB/s, so the big difference is still unexplained.
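The modulo bias in miniature (pure-Python arithmetic, unrelated to the CUDA kernel):
```py
from collections import Counter

# Reducing a uniform 3-bit value into [0, 3) with % over-represents
# 0 and 1 (three source values each) versus 2 (only two).
print(Counter(v % 3 for v in range(8)))  # Counter({0: 3, 1: 3, 2: 2})
```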
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143787
Approved by: https://github.com/eqy
This PR is to add `torch._scaled_mm` for CPU backend.
`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are new added and included in `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. And this PR also updates the various UTs related to FP8 to support CPU tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #139974
The resources directory lets the ET observer dump any additional data, like Triton kernels, while capturing the ET.
This allows us to use the ET trace to replay PT2 workloads and get visibility into data like generated kernels and their usage in a model, index tensor data, etc.
We also added a few ways to enable ET and ET Resources through the OS environment variables.
Setting `ENABLE_PYTORCH_EXECUTION_TRACE` will enable default Execution Tracing in Pytorch.
Additionally setting `ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS` will enable ET to collect extra resources from the ET run like Triton Kernels.
Differential Revision: [D67610542](https://our.internmc.facebook.com/intern/diff/D67610542/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D67610542/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143775
Approved by: https://github.com/shengfukevin, https://github.com/wdvr
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.4 to 3.1.5.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/pallets/jinja/releases">jinja2's releases</a>.</em></p>
<blockquote>
<h2>3.1.5</h2>
<p>This is the Jinja 3.1.5 security fix release, which fixes security issues and bugs but does not otherwise change behavior and should not result in breaking changes compared to the latest feature release.</p>
<p>PyPI: <a href="https://pypi.org/project/Jinja2/3.1.5/">https://pypi.org/project/Jinja2/3.1.5/</a>
Changes: <a href="https://jinja.palletsprojects.com/changes/#version-3-1-5">https://jinja.palletsprojects.com/changes/#version-3-1-5</a>
Milestone: <a href="https://github.com/pallets/jinja/milestone/16?closed=1">https://github.com/pallets/jinja/milestone/16?closed=1</a></p>
<ul>
<li>The sandboxed environment handles indirect calls to <code>str.format</code>, such as by passing a stored reference to a filter that calls its argument. <a href="https://github.com/pallets/jinja/security/advisories/GHSA-q2x7-8rv6-6q7h">GHSA-q2x7-8rv6-6q7h</a></li>
<li>Escape template name before formatting it into error messages, to avoid issues with names that contain f-string syntax. <a href="https://redirect.github.com/pallets/jinja/issues/1792">#1792</a>, <a href="https://github.com/pallets/jinja/security/advisories/GHSA-gmj6-6f8f-6699">GHSA-gmj6-6f8f-6699</a></li>
<li>Sandbox does not allow <code>clear</code> and <code>pop</code> on known mutable sequence types. <a href="https://redirect.github.com/pallets/jinja/issues/2032">#2032</a></li>
<li>Calling sync <code>render</code> for an async template uses <code>asyncio.run</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1952">#1952</a></li>
<li>Avoid unclosed <code>auto_aiter</code> warnings. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>Return an <code>aclose</code>-able <code>AsyncGenerator</code> from <code>Template.generate_async</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>Avoid leaving <code>root_render_func()</code> unclosed in <code>Template.generate_async</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>Avoid leaving async generators unclosed in blocks, includes and extends. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>The runtime uses the correct <code>concat</code> function for the current environment when calling block references. <a href="https://redirect.github.com/pallets/jinja/issues/1701">#1701</a></li>
<li>Make <code>|unique</code> async-aware, allowing it to be used after another async-aware filter. <a href="https://redirect.github.com/pallets/jinja/issues/1781">#1781</a></li>
<li><code>|int</code> filter handles <code>OverflowError</code> from scientific notation. <a href="https://redirect.github.com/pallets/jinja/issues/1921">#1921</a></li>
<li>Make compiling deterministic for tuple unpacking in a <code>{% set ... %}</code> call. <a href="https://redirect.github.com/pallets/jinja/issues/2021">#2021</a></li>
<li>Fix dunder protocol (<code>copy</code>/<code>pickle</code>/etc) interaction with <code>Undefined</code> objects. <a href="https://redirect.github.com/pallets/jinja/issues/2025">#2025</a></li>
<li>Fix <code>copy</code>/<code>pickle</code> support for the internal <code>missing</code> object. <a href="https://redirect.github.com/pallets/jinja/issues/2027">#2027</a></li>
<li><code>Environment.overlay(enable_async)</code> is applied correctly. <a href="https://redirect.github.com/pallets/jinja/issues/2061">#2061</a></li>
<li>The error message from <code>FileSystemLoader</code> includes the paths that were searched. <a href="https://redirect.github.com/pallets/jinja/issues/1661">#1661</a></li>
<li><code>PackageLoader</code> shows a clearer error message when the package does not contain the templates directory. <a href="https://redirect.github.com/pallets/jinja/issues/1705">#1705</a></li>
<li>Improve annotations for methods returning copies. <a href="https://redirect.github.com/pallets/jinja/issues/1880">#1880</a></li>
<li><code>urlize</code> does not add <code>mailto:</code> to values like <code>@a@b</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1870">#1870</a></li>
<li>Tests decorated with <code>@pass_context</code> can be used with the <code>|select</code> filter. <a href="https://redirect.github.com/pallets/jinja/issues/1624">#1624</a></li>
<li>Using <code>set</code> for multiple assignment (<code>a, b = 1, 2</code>) does not fail when the target is a namespace attribute. <a href="https://redirect.github.com/pallets/jinja/issues/1413">#1413</a></li>
<li>Using <code>set</code> in all branches of <code>{% if %}{% elif %}{% else %}</code> blocks does not cause the variable to be considered initially undefined. <a href="https://redirect.github.com/pallets/jinja/issues/1253">#1253</a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/pallets/jinja/blob/main/CHANGES.rst">jinja2's changelog</a>.</em></p>
<blockquote>
<h2>Version 3.1.5</h2>
<p>Released 2024-12-21</p>
<ul>
<li>The sandboxed environment handles indirect calls to <code>str.format</code>, such as
by passing a stored reference to a filter that calls its argument.
:ghsa:<code>q2x7-8rv6-6q7h</code></li>
<li>Escape template name before formatting it into error messages, to avoid
issues with names that contain f-string syntax.
:issue:<code>1792</code>, :ghsa:<code>gmj6-6f8f-6699</code></li>
<li>Sandbox does not allow <code>clear</code> and <code>pop</code> on known mutable sequence
types. :issue:<code>2032</code></li>
<li>Calling sync <code>render</code> for an async template uses <code>asyncio.run</code>.
:pr:<code>1952</code></li>
<li>Avoid unclosed <code>auto_aiter</code> warnings. :pr:<code>1960</code></li>
<li>Return an <code>aclose</code>-able <code>AsyncGenerator</code> from
<code>Template.generate_async</code>. :pr:<code>1960</code></li>
<li>Avoid leaving <code>root_render_func()</code> unclosed in
<code>Template.generate_async</code>. :pr:<code>1960</code></li>
<li>Avoid leaving async generators unclosed in blocks, includes and extends.
:pr:<code>1960</code></li>
<li>The runtime uses the correct <code>concat</code> function for the current environment
when calling block references. :issue:<code>1701</code></li>
<li>Make <code>|unique</code> async-aware, allowing it to be used after another
async-aware filter. :issue:<code>1781</code></li>
<li><code>|int</code> filter handles <code>OverflowError</code> from scientific notation.
:issue:<code>1921</code></li>
<li>Make compiling deterministic for tuple unpacking in a <code>{% set ... %}</code>
call. :issue:<code>2021</code></li>
<li>Fix dunder protocol (<code>copy</code>/<code>pickle</code>/etc) interaction with <code>Undefined</code>
objects. :issue:<code>2025</code></li>
<li>Fix <code>copy</code>/<code>pickle</code> support for the internal <code>missing</code> object.
:issue:<code>2027</code></li>
<li><code>Environment.overlay(enable_async)</code> is applied correctly. :pr:<code>2061</code></li>
<li>The error message from <code>FileSystemLoader</code> includes the paths that were
searched. :issue:<code>1661</code></li>
<li><code>PackageLoader</code> shows a clearer error message when the package does not
contain the templates directory. :issue:<code>1705</code></li>
<li>Improve annotations for methods returning copies. :pr:<code>1880</code></li>
<li><code>urlize</code> does not add <code>mailto:</code> to values like <code>@a@b</code>. :pr:<code>1870</code></li>
<li>Tests decorated with <code>@pass_context</code> can be used with the <code>|select</code> filter. :issue:<code>1624</code></li>
<li>Using <code>set</code> for multiple assignment (<code>a, b = 1, 2</code>) does not fail when the
target is a namespace attribute. :issue:<code>1413</code></li>
<li>Using <code>set</code> in all branches of <code>{% if %}{% elif %}{% else %}</code> blocks
does not cause the variable to be considered initially undefined.
:issue:<code>1253</code></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="877f6e51be"><code>877f6e5</code></a> release version 3.1.5</li>
<li><a href="8d58859265"><code>8d58859</code></a> remove test pypi</li>
<li><a href="eda8fe86fd"><code>eda8fe8</code></a> update dev dependencies</li>
<li><a href="c8fdce1e03"><code>c8fdce1</code></a> Fix bug involving calling set on a template parameter within all branches of ...</li>
<li><a href="66587ce989"><code>66587ce</code></a> Fix bug where set would sometimes fail within if</li>
<li><a href="fbc3a696c7"><code>fbc3a69</code></a> Add support for namespaces in tuple parsing (<a href="https://redirect.github.com/pallets/jinja/issues/1664">#1664</a>)</li>
<li><a href="b8f4831d41"><code>b8f4831</code></a> more comments about nsref assignment</li>
<li><a href="ee832194cd"><code>ee83219</code></a> Add support for namespaces in tuple assignment</li>
<li><a href="1d55cddbb2"><code>1d55cdd</code></a> Triple quotes in docs (<a href="https://redirect.github.com/pallets/jinja/issues/2064">#2064</a>)</li>
<li><a href="8a8eafc6b9"><code>8a8eafc</code></a> edit block assignment section</li>
<li>Additional commits viewable in <a href="https://github.com/pallets/jinja/compare/3.1.4...3.1.5">compare view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143844
Approved by: https://github.com/Skylion007
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Before #143552
```py
Traceback (most recent call last):
File "/home/jansel/pytorch/repro.py", line 51, in <module>
fp32_compiled = optimized_model(low_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 576, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1381, in __call__
return self._torchdynamo_orig_callable(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1165, in __call__
result = self._inner_convert(
^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 547, in __call__
return _compile(
^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 987, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 715, in compile_inner
return _compile_inner(code, one_graph, hooks, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_utils_internal.py", line 95, in wrapper_function
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 750, in _compile_inner
out_code = transform_code_object(code, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/bytecode_transformation.py", line 1361, in transform_code_object
transformations(instructions, code_options)
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 231, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 662, in transform
tracer.run()
File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 2870, in run
super().run()
File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 1053, in run
while self.step():
^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 963, in step
self.dispatch_table[inst.opcode](self, inst)
File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3050, in RETURN_VALUE
self._return(inst)
File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3035, in _return
self.output.compile_subgraph(
File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1101, in compile_subgraph
self.compile_and_call_fx_graph(
File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1382, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1432, in call_user_compiler
return self._call_user_compiler(gm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1483, in _call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1462, in _call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
compiled_gm = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/__init__.py", line 2314, in __call__
return compile_fx(model_, inputs_, config_patches=self.config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1880, in compile_fx
return aot_autograd(
^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/backends/common.py", line 83, in __call__
cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1145, in aot_module_simplified
compiled_fn = AOTAutogradCache.load(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 754, in load
compiled_fn = dispatch_and_compile()
^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile
compiled_fn, _ = create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function
return _create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function
compiled_fn, fw_metadata = compiler_fn(
^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 676, in aot_dispatch_autograd
compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 489, in __call__
return self.compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1758, in fw_compiler_base
return inner_compile(
^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 572, in compile_fx_inner
return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 686, in _compile_fx_inner
mb_compiled_graph = fx_codegen_and_compile(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile
return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile
compiled_fn = graph.compile_to_module().call
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
return self._compile_to_module()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1912, in codegen
self.scheduler = Scheduler(self.operations)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1880, in __init__
self._init(nodes)
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1955, in _init
self.nodes = self.fuse_nodes(self.nodes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2461, in fuse_nodes
nodes = self.fuse_nodes_once(nodes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2773, in fuse_nodes_once
assert False, "a fake error during fusion"
^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: a fake error during fusion
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```
Before this PR
```py
Traceback (most recent call last):
File "/home/jansel/pytorch/repro.py", line 51, in <module>
fp32_compiled = optimized_model(low_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1484, in _call_user_compiler
raise BackendCompilerFailed(
File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1463, in _call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
compiled_gm = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/__init__.py", line 2314, in __call__
return compile_fx(model_, inputs_, config_patches=self.config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1880, in compile_fx
return aot_autograd(
^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/backends/common.py", line 83, in __call__
cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1145, in aot_module_simplified
compiled_fn = AOTAutogradCache.load(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 754, in load
compiled_fn = dispatch_and_compile()
^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile
compiled_fn, _ = create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function
return _create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function
compiled_fn, fw_metadata = compiler_fn(
^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 676, in aot_dispatch_autograd
compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 489, in __call__
return self.compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1758, in fw_compiler_base
return inner_compile(
^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 572, in compile_fx_inner
return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 686, in _compile_fx_inner
mb_compiled_graph = fx_codegen_and_compile(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile
return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile
compiled_fn = graph.compile_to_module().call
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
return self._compile_to_module()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1912, in codegen
self.scheduler = Scheduler(self.operations)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1880, in __init__
self._init(nodes)
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1955, in _init
self.nodes = self.fuse_nodes(self.nodes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2461, in fuse_nodes
nodes = self.fuse_nodes_once(nodes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2773, in fuse_nodes_once
assert False, "a fake error during fusion"
^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: a fake error during fusion
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```
After this PR
```py
Traceback (most recent call last):
File "/home/jansel/pytorch/repro.py", line 51, in <module>
fp32_compiled = optimized_model(low_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner
raise InductorError(e, currentframe()).with_traceback(
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 689, in _compile_fx_inner
mb_compiled_graph = fx_codegen_and_compile(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1138, in fx_codegen_and_compile
return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1053, in codegen_and_compile
compiled_fn = graph.compile_to_module().call
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
return self._compile_to_module()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1912, in codegen
self.scheduler = Scheduler(self.operations)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1880, in __init__
self._init(nodes)
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1955, in _init
self.nodes = self.fuse_nodes(self.nodes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2461, in fuse_nodes
nodes = self.fuse_nodes_once(nodes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2773, in fuse_nodes_once
assert False, "a fake error during fusion"
^^^^^
torch._inductor.exc.InductorError: AssertionError: a fake error during fusion
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```
A large number of frames are removed between:
```py
File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner
raise InductorError(e, currentframe()).with_traceback(
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143610
Approved by: https://github.com/eellison
ghstack dependencies: #143552
Fixes #143406
After this PR the error for missing Triton is:
```py
Traceback (most recent call last):
File "/home/jansel/pytorch/repro.py", line 51, in <module>
fp32_compiled = optimized_model(low_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3624, in create_backend
raise TritonMissing(inspect.currentframe())
torch._dynamo.exc.TritonMissing: Cannot find a working triton installation. Either the package is not installed or it is too old. More information on installing Triton can be found at: https://github.com/triton-lang/triton
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
```
Setting `TORCHDYNAMO_VERBOSE=1` yields something like the old error:
```py
Traceback (most recent call last):
File "/home/jansel/pytorch/repro.py", line 51, in <module>
fp32_compiled = optimized_model(low_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 576, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1383, in __call__
return self._torchdynamo_orig_callable(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1167, in __call__
result = self._inner_convert(
^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 548, in __call__
return _compile(
^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 988, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 716, in compile_inner
return _compile_inner(code, one_graph, hooks, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_utils_internal.py", line 95, in wrapper_function
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 751, in _compile_inner
out_code = transform_code_object(code, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/bytecode_transformation.py", line 1361, in transform_code_object
transformations(instructions, code_options)
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 232, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 663, in transform
tracer.run()
File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 2870, in run
super().run()
File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 1053, in run
while self.step():
^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 963, in step
self.dispatch_table[inst.opcode](self, inst)
File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3050, in RETURN_VALUE
self._return(inst)
File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3035, in _return
self.output.compile_subgraph(
File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1102, in compile_subgraph
self.compile_and_call_fx_graph(
File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1383, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1433, in call_user_compiler
return self._call_user_compiler(gm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1463, in _call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
compiled_gm = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/__init__.py", line 2314, in __call__
return compile_fx(model_, inputs_, config_patches=self.config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1880, in compile_fx
return aot_autograd(
^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/backends/common.py", line 83, in __call__
cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1145, in aot_module_simplified
compiled_fn = AOTAutogradCache.load(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 754, in load
compiled_fn = dispatch_and_compile()
^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile
compiled_fn, _ = create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function
return _create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function
compiled_fn, fw_metadata = compiler_fn(
^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 676, in aot_dispatch_autograd
compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 489, in __call__
return self.compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1758, in fw_compiler_base
return inner_compile(
^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 572, in compile_fx_inner
return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 686, in _compile_fx_inner
mb_compiled_graph = fx_codegen_and_compile(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile
return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile
compiled_fn = graph.compile_to_module().call
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
return self._compile_to_module()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1916, in codegen
self.scheduler.codegen()
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3667, in codegen
return self._codegen()
^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3761, in _codegen
if device is not None and self.get_backend(device).ready_to_flush():
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3631, in get_backend
self.backends[device] = self.create_backend(device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3624, in create_backend
raise TritonMissing(inspect.currentframe())
torch._dynamo.exc.TritonMissing: Cannot find a working triton installation. Either the package is not installed or it is too old. More information on installing Triton can be found at: https://github.com/triton-lang/triton
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
```
This PR also strips dynamo stack frames from other types of backend compile errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143552
Approved by: https://github.com/yanboliang
Changes:
1. Bump `ruff` from 0.7.4 to 0.8.4
2. Change `%`-formatted strings to f-strings
3. Change arguments with the `__`-prefix to positional-only arguments with the `/` separator in function signatures (see the sketch below).
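A tiny illustration of change 3 on a hypothetical function (PEP 570 syntax):
```python
# Before: `__value` marks "positional-only" by naming convention only.
def clamp_old(__value, __low, __high):
    return max(__low, min(__value, __high))

# After: the `/` separator makes the parameters truly positional-only.
def clamp_new(value, low, high, /):
    return max(low, min(value, high))

print(clamp_new(5, 0, 3))  # 3
```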
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753
Approved by: https://github.com/Skylion007
Some tests fail for the ROCm build on Navi arch because of this check: f83361b274/torch/_inductor/fx_passes/pad_mm.py (L211)
There is no need to determine whether the mm is compute-bound for most of the padding tests, since they don't specifically test compute-bound behavior. We don't have enough empirical data to fine-tune this check for AMD GPUs yet, so I propose forcing the shape padding for the tests that we had trouble with, to avoid this unnecessary logic path.
Please correct me if I missed other tests that can potentially fail with this issue, or if I added a test that depends on logic below the `force_shape_pad` check here: f83361b274/torch/_inductor/fx_passes/pad_mm.py (L444)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141768
Approved by: https://github.com/jeffdaily
* Will enable us to target `periodic`/distributed CI jobs to 4-GPU runners using a different label `linux.rocm.gpu.4`
* Use 2-GPU runners for `trunk`, `pull` and `slow` (in addition to `inductor-rocm`) as well (although this currently will not change anything, since all our MI2xx runners have both `linux.rocm.gpu` and `linux.rocm.gpu.2` labels... but this will change in the future: see next point)
* Continue to use the `linux.rocm.gpu` label for any job that doesn't need more than 1 GPU, e.g. binary test jobs in `workflows/generated-linux-binary-manywheel-nightly.yml`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143769
Approved by: https://github.com/jeffdaily
This PR adds functional support for max-autotune on XPU. The current Triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. The `mm_plus_mm` template also has accuracy issues in some cases. We will address these issues in follow-up PRs.
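A hedged usage sketch of what this enables; the `xpu` device assumes an XPU-enabled build, and the shapes are made up:
```python
import torch

# Max-autotune is opted into via the compile mode; on XPU this exercises
# the Triton templates this PR wires up.
model = torch.nn.Linear(64, 64).to("xpu")
compiled = torch.compile(model, mode="max-autotune")
out = compiled(torch.randn(8, 64, device="xpu"))
```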
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266
Approved by: https://github.com/EikanWang, https://github.com/jansel
In hindsight, we never needed a DICT_SUBCLASS_GUARD_MANAGER, because Dynamo would inline through the overridden keys method. In this PR, we ensure that while creating guards and constructing variable trackers, we get the `d.keys()` value by using `dict.keys(d)`. This ensures that we do not call the overridden keys method, so the C++ guard can use `PyDict_Next` directly to check the guards.
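A minimal illustration of why `dict.keys(d)` sidesteps an overridden `keys` method:
```python
class MyDict(dict):
    def keys(self):  # overridden: guards must not observe this
        return ["bogus"]

d = MyDict(a=1, b=2)
print(list(d.keys()))      # ['bogus']  (the override)
print(list(dict.keys(d)))  # ['a', 'b'] (what the C++ guard should check)
```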
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143722
Approved by: https://github.com/jansel
Fixes #143071
Operations performed on tensors with `requires_grad=True` such as
```python
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
```
and
```python
x = torch.tensor(2.0, requires_grad=True)
y = torch.pow(x,3)
```
are valid operations.
While an operation using `numpy` like
```python
import numpy as np
x = torch.tensor(2.0, requires_grad=True)
y = np.pow(x,3)
# > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
```
leads to an error.
However, an operation that uses `math` like
```python
import math
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
```
does not cause an error, and `y` is no longer a tensor with a gradient!
This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models.
To prevent future undesired behavior, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a `tensor`, we get a single warning with:
```python
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
# > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
# Consider using tensor.detach() first.
```
Please let me know if you have any questions 👍
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261
Approved by: https://github.com/albanD
Fixes #104899
Refactors AdamW into Adam by making AdamW a subclass of Adam. Additionally adds a test to assert that the added parameter `decoupled_weight_decay` is True in AdamW and also updates test_defaults_changed_to_foreach to account for the differences in module location for AdamW.
Heavily inspired by #118857 by @tfsingh.
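A minimal sketch of the refactor's shape; the real `torch.optim.AdamW` forwards the full argument list, this only shows the subclassing idea:
```python
from torch.optim import Adam

class AdamW(Adam):
    def __init__(self, params, lr=1e-3, weight_decay=1e-2, **kwargs):
        # AdamW is Adam with decoupled weight decay permanently enabled.
        super().__init__(
            params,
            lr=lr,
            weight_decay=weight_decay,
            decoupled_weight_decay=True,
            **kwargs,
        )
```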
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143710
Approved by: https://github.com/janeyx99
Summary: Mostly cosmetic, but one bug fix:
* Bug fix: Make sure compile_id is converted to a string in the compilation metrics so it's printed as, e.g., "0/1" instead of "[0, 1]"
* Sort collections in `collection_to_str`
* Print non-string elements as `"<unknown>"` instead of None (since we don't expect non-strings)
* Move the population of the legacy metrics and any pre-processing to a new factory method in CompilationMetrics
Test Plan:
```
python test/dynamo/test_structured_trace.py
python test/dynamo/test_utils.py
```
Internal testing: https://fburl.com/scuba/dynamo_compile/sandbox/l0me8auf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143332
Approved by: https://github.com/ppanchalia
In some situations we want to profile calls coming from all threads (similar to on-demand), not just the thread that started profiling and the spawned threads that would inherit KinetoThreadLocal state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143659
Approved by: https://github.com/sraikund16
# Motivation
Fix https://github.com/pytorch/pytorch/issues/143543
# Solution
We should raise a Python exception instead of aborting.
# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
what(): device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143550
Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
Changes by apply order (a small before/after sketch follows the list):
1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.
`.parent{...}.absolute()` -> `.absolute().parent{...}`
4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)
`.parent.parent.parent.parent` -> `.parents[3]`
5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~
~`.parents[3]` -> `.parents[4 - 1]`~
6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
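A hypothetical before/after sketch of these rewrites (the file layout and the number of levels are made up):
```python
import os
from pathlib import Path

# Step 1: ".." / os.pardir segments -> os.path.dirname(...)
old_root = os.path.join(os.path.dirname(__file__), "..", "..")
step1 = os.path.dirname(os.path.dirname(os.path.dirname(__file__)))

# Step 2: nested os.path.dirname(...) calls -> pathlib
step2 = str(Path(__file__).parent.parent.parent)

# Steps 3 and 4: resolve the path first, then walk up with .parents[N - 1]
step34 = str(Path(__file__).absolute().parents[2])
```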
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
Summary:
If a module being quantized contains some meta tensors and some tensors on an actual device, we should not fail quantization.
Quantization should also not fail if the new quantized module is created on a meta device.
Differential Revision: D66895899
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142262
Approved by: https://github.com/iamzainhuda
# Summary:
Full Context: https://docs.google.com/document/d/1-j5KSbfGFJQcH4sYh7BIeJXso3zYzl5G5yFQqXdKx_o/edit?usp=sharing
tl;dr
This change introduces classes which help determine a dynamic memory budget. This will mostly be helpful for models with many implicit graph breaks.
---
New Classes:
*GraphInfoProvider*
* Takes the joint_graph as well as the input memories and runtimes and parses the graph + values into usable forms for the SolverEvaluator.
*KnapsackEvaluator*
* Provides a function: given all four inputs (the solver function as a callable, max_dynamic_memory_budget, min_dynamic_memory_budget, dynamic_memory_budget_pareto_granularity), it returns an approximation of the knee point of the Pareto distribution.
# Test Plan:
### LintRunner
LintRunner Output: P1700445547
### Unit Tests
```
$ buck test @mode/opt //caffe2/test/functorch:test_ac_knapsack
`@mode/opt` was specified, but not found. Using file at `//mode/opt`.
This behavior is being deprecated. Please use `"@//mode/opt"` instead
File changed: fbcode//caffe2/.ruff_cache/0.7.4/.tmpB6PmDS
File changed: fbsource//xplat/caffe2/test/functorch/test_ac_knapsack.py
File changed: fbcode//caffe2/.ruff_cache/0.7.4/.tmpyjCiPn
20 additional file change events
Buck UI: https://www.internalfb.com/buck2/414ead46-9ede-4192-8e1a-5d3c52bdb9cc
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6473924710342830
Network: Up: 0B Down: 0B (reSessionID-159794b9-9d61-477e-8e63-9bdeaa537dca)
Analyzing targets. Remaining 0/214
Executing actions. Remaining 0/6933 0.1s exec time total
Command: test. Finished 1 local
Time elapsed: 18.5s
Tests finished: Pass 15. Fail 0. Fatal 0. Skip 0. Build failure 0
```
### Test Run
Updated the config:
```
activation_memory_budget_solver: DYNAMIC_MEMORY_BUDGET_DP
```
Confirming proper execution via: [aps-fb_fm_v4_768_01_dynamic-2a792ba8af](https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-fb_fm_v4_768_01_dynamic-2a792ba8af?job_attempt=0&version=0&env=PRODUCTION)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143539
Approved by: https://github.com/jansel
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143639
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/cyyever
Reuse partial reductions for complete reductions. We could expand this to cover more types of reductions, although we'd have to be a bit more careful about keeping the intermediary partial reduction in higher precision.
For now, we only handle the ops that do not depend on a higher compute_dtype_precision, to cover the relevant use case initially.
Fix for https://github.com/pytorch/pytorch/issues/136267. Longer term, we should make sure cooperative reductions fuse partial and complete reductions.
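A toy illustration of the pattern (amax is one of the reductions that does not need a higher-precision intermediate, so the partial result can be reused as-is):
```python
import torch

@torch.compile
def f(x):
    row_max = x.amax(dim=-1)  # partial reduction over the last dim
    global_max = x.amax()     # complete reduction; can reuse row_max
    return row_max, global_max

f(torch.randn(1024, 1024))
```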
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143600
Approved by: https://github.com/vkuzo
Summary: Emit a CMakeLists.txt with compile and link options when package_cpp_only is specified. After unzipping the AOTI-generated .pt2 package file, users can manually build the generated model code in their local environment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143680
Approved by: https://github.com/huydhn
When we unflatten, the submodules we generate (`InterpreterModule` or `InterpreterModuleDispatcher`) are not related by type to the original submodules `N`. This makes `isinstance(mod, N)` checks fail. Since we do not have the original types after export, the best we can do is expose a `type_name()` method that carries the original type name, which we record in `nn_module_stack` entries.
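A hedged sketch of the resulting contract; the module names are made up, and the exact string `type_name()` returns (e.g. whether it is fully qualified) may differ:
```python
import torch

class Sub(torch.nn.Module):
    def forward(self, x):
        return x + 1

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.sub = Sub()

    def forward(self, x):
        return self.sub(x)

ep = torch.export.export(M(), (torch.randn(2),))
unflat = torch.export.unflatten(ep)
print(isinstance(unflat.sub, Sub))  # False: it is an InterpreterModule
print(unflat.sub.type_name())       # carries the original class name instead
```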
Differential Revision: [D67526542](https://our.internmc.facebook.com/intern/diff/D67526542/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143664
Approved by: https://github.com/tugsbayasgalan
`py::call_guard<py::gil_scoped_release>` is not safe when using multiple threads. This instead moves it into the init function which is safe.
For more details see #143593 and https://github.com/pybind/pybind11/issues/5473
Test plan:
```
python setup.py develop
```
CI
```py
import time
from concurrent.futures import ThreadPoolExecutor
from torch import distributed as dist
def run():
store = dist.TCPStore(
host_name="localhost",
port=0,
is_master=True,
wait_for_workers=False,
)
# this sleep is required to trigger the crash
time.sleep(0.1)
del store
futures = []
with ThreadPoolExecutor(
max_workers=100,
) as executor:
for i in range(100000):
print(i)
futures.append(executor.submit(run))
if len(futures) > 100:
futures.pop(0).result()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143598
Approved by: https://github.com/c-p-i-o
The resources directory lets the ET observer dump additional data, like Triton kernels, while capturing the ET.
This allows us to use the ET trace to replay PT2 workloads and get visibility into data like generated kernels and their usage in a model, index tensor data, etc.
We also added a few ways to enable ET and ET Resources through OS environment variables.
Setting `ENABLE_PYTORCH_EXECUTION_TRACE` will enable default Execution Tracing in Pytorch.
Additionally setting `ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS` will enable ET to collect extra resources from the ET run like Triton Kernels.
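A hedged example of enabling both from Python before profiling (treating "1" as the truthy value is an assumption):
```python
import os

# Enable default Execution Trace collection.
os.environ["ENABLE_PYTORCH_EXECUTION_TRACE"] = "1"
# Also collect extra resources, e.g. generated Triton kernels.
os.environ["ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS"] = "1"
```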
Differential Revision: [D58707846](https://our.internmc.facebook.com/intern/diff/D58707846/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143430
Approved by: https://github.com/shengfukevin, https://github.com/sraikund16
Consolidate
- get/set_default_load_endianness
- get/set_default_mmap_options
- get/set_crc32_options
into one global dynamo-style config, and allow setting mmap globally. The existing APIs are not removed and will get/set from the config (they can't be removed for backward compatibility).
In #143459 I add the local (argument style) config
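A hedged sketch of what using the consolidated config could look like; the module path and field names here are assumptions based on this description, not confirmed API:
```python
# Assumed module path and fields; illustration only.
from torch.utils.serialization import config

config.load.mmap = True            # global mmap default for torch.load
config.save.compute_crc32 = False  # replaces set_crc32_options(...)
```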
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143324
Approved by: https://github.com/albanD
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.
2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.
3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).
4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.
API Usage: https://github.com/pytorch/pytorch/issues/143289
Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode : 40 t/s
2B Transformer model
Prefill : 747 t/s
Decode : 80 t/s
Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s
OK
python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s
OK
python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s
Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
We don't support assigning to objects or numeric constants at the top level in
config modules, so there is no need to test for them.
(This specifically breaks later sorting refactoring, since it requires <
to be implemented).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143535
Approved by: https://github.com/ppanchalia
Fixes #130559
* Intro
This PR adds support for `@contextmanager` in Dynamo. We chose to limit the
scope of this work to only `@contextmanager` and plan to handle generators fully
in #141055 (still in draft).
* Motivation
Dynamo lacks support for generator functions. When it encounters one, it traces
it as if it were a regular function. This is problematic because it can lead to
incorrect behavior. To illustrate, consider the test case below:
```python
import torch
import contextlib
@contextlib.contextmanager
def set_default_dtype(dtype):
old_dtype = torch.get_default_dtype()
try:
torch.set_default_dtype(dtype)
yield
finally:
torch.set_default_dtype(old_dtype)
@torch.compile(backend="eager", fullgraph=True)
def fn():
with set_default_dtype(torch.float64):
x = torch.tensor([3.0, 3.0 + 5.0j])
return x
```
Before this work, Dynamo would not stop at the `yield`, and the graph produced
would contain both calls to `set_default_dtype` executed one after the other.
This is incorrect because the context manager should execute code before and
after the `yield`.
* List of changes
`YIELD_VALUE` now raises an exception (`YieldValueOp`) to signal that control
flow must be suspended and returned to the caller. Additionally, `RETURN_VALUE`
behaves differently in a generator function. Unlike regular functions, where
`RETURN_VALUE` indicates the final result, in generators it signifies that the
generator is exhausted and implicitly raises `StopIteration`.
A new `VariableTracker` named `FunctionDecoratedByContextlibContextManagerVariable`
was introduced to handle `@contextmanager`. This variable tracker acts not just
as a wrapper for the original function but also maintains an internal `tx`
(InstructionTranslator) object to suspend and return control flow to the parent
tracer when a `yield` is encountered.
* Corner cases
Returning a context manager from a compiled function is not supported. This
would require PyTorch to synchronize the generator state between Dynamo and the
interpreter. Any attempt to return it will result in an `IncorrectUsage`
exception.
Graph breaks require special handling as well. In the event of a graph break,
the frame associated with the context manager is skipped, and the context
manager runs in eager mode.
* This PR is breaking my code
There is a configuration flag (`enable_trace_contextlib`) that can be set to
`False` to disable tracing context managers. If this still causes crashes,
please revert this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136033
Approved by: https://github.com/zou3519
I discovered this issue when trying to search for the accuracy results on the database and couldn't find any. It turns out that the results are there in the JSON file, for example `"metric": {"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]}`, but inserting them into the database fails because the benchmark values are a list of strings here while the expectation is a list of numbers.
ClickHouse doesn't support mixed types atm. It has a Variant type https://clickhouse.com/docs/en/sql-reference/data-types/variant, but this isn't recommended by the CH team themselves. So, the remaining option is to store this in the `extra_info` field. This field is a dictionary, so it can go there.
### Testing
https://github.com/pytorch/pytorch/actions/runs/12421747715
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143611
Approved by: https://github.com/kit1980
We support running a single Autotuner for each Triton kernel. Currently,
if there are multiple autotuning decorators, the subsequent ones will be
silently ignored.
Instead, we should raise an error here to avoid silent incorrectness.
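A sketch of the now-rejected pattern, with made-up config values:
```python
import triton
import triton.language as tl

@triton.autotune(configs=[triton.Config({"BLOCK": 64})], key=["n"])
@triton.autotune(configs=[triton.Config({"BLOCK": 128})], key=["n"])  # second autotuner: now an error
@triton.jit
def add_one(x_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(x_ptr + offs, tl.load(x_ptr + offs, mask=mask) + 1, mask=mask)
```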
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143519
Approved by: https://github.com/aakhundov
Summary: No functional changes in this diff; the code is moved into a separate file to be reused by the AVX-512 version in a follow-up diff.
Test Plan: buck build //caffe2/caffe2/perfkernels:perfkernels
Differential Revision: D67433115
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143556
Approved by: https://github.com/hl475
Related: #125914 (specifically see [comment](https://github.com/pytorch/pytorch/issues/125914#issuecomment-2513044125))
This PR addresses two broken things involving the usage of unbacked SymInts for calls to `slice()` with data-dependent bounds. These issues are encountered in practice for `narrow()` operating on the batch dim with an NJT input, but apply to other subclasses as well. The test in this PR uses a purpose-built subclass.
There are two different issues here, depending on whether `torch.compile()` is called with `dynamic=True`. In practice, these only occur when the unbacked SymInts are created within the torch_dispatch implementation of a subclass, because the unbacked symbols are considered "freshly created" when the output subclass instance is handled in Dynamo.
**Error 1 (dynamic=False):**
```
LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(-Min(22, Max(0, u0)) + Min(22, Max(u0 + u1, Max(0, u0))), 0) (unhinted: Eq(-Min(s0, Max(0, u0)) + Min(s0, Max(u0 + u1, Max(0, u0))), 0)). (Size-like symbols: u1, u0)
```
The expression comes from the use of `clamp()` logic for `SliceView` in Inductor:
41e59754b4/torch/_inductor/ir.py (L3014)
If the (start, end) bounds for the `slice()` are statically known to be in range for the given dim (e.g. provided via `torch._check()` calls), we can avoid this `clamp()` logic and the error. This PR implements this fix.
**Error 2 (dynamic=True):**
```
torch._dynamo.exc.InternalTorchDynamoError: PendingUnbackedSymbolNotFound: Pending unbacked symbols {u0} not in returned outputs NestedTensor(size=(2, s16, s1), offsets=FakeTensor(..., device='cuda:0', size=(3,), dtype=torch.int64), grad_fn=<NarrowBackwardAutogradNestedTensor0 object at 0x7f1f8603cfd0>, contiguous=True) ((s1*s16, s1, 1), s1*u0)
```
The storage offset of the values component of the returned NJT is `s1*u0` where `s1` is known to be an integer. This PR expands the special logic handling the `constant * u0` case to handle SymInts as well:
314e08eb52/torch/fx/experimental/symbolic_shapes.py (L1013-L1031)
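A simplified, non-subclass sketch of the contract the first fix relies on: once `torch._check()` asserts the data-dependent bounds are in range, Inductor can skip the clamp:
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True  # allow .item() in the graph

@torch.compile(fullgraph=True)
def f(x, start_t, length_t):
    start = start_t.item()    # unbacked SymInt
    length = length_t.item()  # unbacked SymInt
    torch._check(start >= 0)
    torch._check(start + length <= x.size(0))
    return x.narrow(0, start, length)

f(torch.randn(10), torch.tensor(2), torch.tensor(3))
```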
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142062
Approved by: https://github.com/ezyang
ghstack dependencies: #143526
Second PR in a larger project to broader support for differentiable optimizers with @janeyx99 ! The first one had an issue near the end so this is the second PR on that subject. See #143122 for the development up until this point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143510
Approved by: https://github.com/janeyx99
As title, this patch prevents developers from importing third party
libraries to patch things in Dynamo, unless there's no other easy
workaround (in which case one would add the library to the allowlist in
`import_linter.py`, as instructed by the lint error).
For instance, if we remove `einops` from the allowlist, we'd get this
```verbatim
>>> Lint for torch/_dynamo/decorators.py:
Error (IMPORT) Disallowed import
importing from einops is not allowed, if you believe there's a valid
reason, please add it to import_linter.py
608 |# Note: this carefully avoids eagerly import einops.
609 |# TODO: we should delete this whole _allow_in_graph_einops logic by approximately 2024 Q2
610 |def _allow_in_graph_einops():
>>> 611 | import einops
612 |
613 | try:
614 | # requires einops > 0.6.1, torch >= 2.0
Error (IMPORT) Disallowed import
importing from einops is not allowed, if you believe there's a valid
reason, please add it to import_linter.py
612 |
613 | try:
614 | # requires einops > 0.6.1, torch >= 2.0
>>> 615 | from einops._torch_specific import ( # type: ignore[attr-defined] # noqa: F401
616 | _ops_were_registered_in_torchdynamo,
617 | )
618 |
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143312
Approved by: https://github.com/zou3519
Fixes #141652
This PR contains:
- Fix for `matmul_offline_mgpu_tunableop`
- Modifications to _checking_tuning_assertions to enable TunableOp if it is disabled. Also moved it into the concurrent futures initializer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143507
Approved by: https://github.com/jeffdaily
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.
2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.
3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).
4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights, groupsize, and the in_features and out_features.
API Usage: https://github.com/pytorch/pytorch/issues/143289
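Putting the steps together, a hedged end-to-end sketch; `quantize_4bit_symmetric` is a hypothetical placeholder for the user's quantizer from steps 1-2 (e.g. torchao's), not an op introduced here:
```python
import torch

in_features, out_features, groupsize = 64, 32, 32
linear = torch.nn.Linear(in_features, out_features)

# Hypothetical helper: returns the uint8-packed 4-bit weights and per-group
# scales/zeros produced by steps 1-2 above.
weight, scales_and_zeros = quantize_4bit_symmetric(linear.weight, groupsize)

packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(
    weight, scales_and_zeros, linear.bias, groupsize, in_features, out_features
)
x = torch.randn(8, in_features)
output = torch.ops.aten._dyn_quant_matmul_4bit(
    x, packed_weights, groupsize, in_features, out_features
)
```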
Model Perf:
7B Transformer model:
Prefill: 340 t/s
Decode: 40 t/s
2B Transformer model:
Prefill: 747 t/s
Decode: 80 t/s
Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s
OK
python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s
OK
python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s
Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
Summary:
fix forward for S477887
leaking the C++ singleton specifically
When C++ shuts down, it tries to destruct the singleton and acquire the GIL; at that moment the Python runtime has already exited, causing undefined behavior.
We leak the singleton here specifically so that we won't try to destroy it during the shutdown phase.
Test Plan: n/a
Differential Revision: D67400633
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143509
Approved by: https://github.com/c-p-i-o
For the distributed state dict API [migration](https://github.com/pytorch/torchtune/pull/2138), make the following changes here:
1. `load_from_full_model_state_dict` in TorchTune calls `set_model_state_dict` with options controlling whether to cpu_offload. Add cpu_offload to _load_model_state_dict to process on CPU if the config is True.
2. Change the device check, as lora_finetune might have 2 device types; accept that as valid.
3. Some changes to optimize memory performance:
3.1 use `.detach().clone()` instead of the view directly
3.2 if local_state is not meta, copy `full_tensor[slices]` to `ret.to_local()`
4. add related unit tests
Memory performance calling from TorchTune with llama2/7B_full:
1. cpu_offload = True
<img width="555" alt="Screenshot 2024-12-18 at 1 36 47 PM" src="https://github.com/user-attachments/assets/429261f5-1107-4592-b295-de3944a2614b" />
2. cpu_offload = False
<img width="555" alt="Screenshot 2024-12-18 at 1 36 52 PM" src="https://github.com/user-attachments/assets/40bf281a-236a-4218-826b-b1192a10c806" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142845
Approved by: https://github.com/fegin
Summary:
This PR is generated from a meta internal Diff, aiming to resolve a crash from a race condition on the dictionary.
Test Plan:
Build and run
Print out the count/name/value of the dictionary and see if the values are get/set/removed correctly.
Observe the print statement on app start within IG
@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143418
Approved by: https://github.com/shoumikhin
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: palmje
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143517
Approved by: https://github.com/mhorowitz
With https://github.com/pytorch/pytorch/pull/133620 introducing Dim.AUTO, we can now automatically convert dynamic_axes to dynamic_shapes without specifying min and max. However, exporting could still crash when the same specs are shared between inputs, and there is no guarantee that the axes will be dynamic (see PR description).
~~Therefore, a~~ follow-up PR should create a post-processing ONNX side pass to ~~enable the missed dynamic axes~~ rename the dynamic shapes (s0, s1, ...) to dynamic_axes (user setting names).
This PR does:
(1) Apply torch.export.Dim.AUTO to dynamic_axes when dynamic_shapes is not provided.
(2) Convert args/kwargs to tuple inputs, which follows the generated dynamic_shapes format to avoid errors during torch.export.export.
(3) Avoid KeyError in the _rename_dynamic_shapes_with_model_inputs function.
(4) Add real world case of a HF model with kv_cache to test on ONNX exporter.
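A hedged sketch of what (1) amounts to on the torch.export side (the model and shapes here are illustrative):
```python
import torch
from torch.export import Dim

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

# A legacy ONNX-style dynamic_axes={"x": {0: "batch"}} spec is mapped to
# Dim.AUTO, letting export infer whether the axis is actually dynamic.
dynamic_shapes = {"x": {0: Dim.AUTO}}
ep = torch.export.export(M(), (torch.randn(4, 8),), dynamic_shapes=dynamic_shapes)
```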
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143158
Approved by: https://github.com/xadupre, https://github.com/shubhambhokare1
Summary: The test was added by D67376995 and is failing in fbcode
Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:mkldnn_pattern_matcher_cpu -- --exact 'caffe2/test/inductor:mkldnn_pattern_matcher_cpu - test_conv2d_linear_add_broadcast_shapes_cpu (caffe2.test.inductor.test_mkldnn_pattern_matcher.TestPatternMatcher)'`
Differential Revision: D67413687
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143530
Approved by: https://github.com/jansel
For torch.export (strict and non-strict), we don't do functional decomposition. Instead, we preserve the custom triton ops as custom ops. This is because we want the exported program to be high-level and serializable.
#### The alternative:
If we decompose the custom op to a functional HOP and make it a node in the exported program, we need to figure out ways of serializing the HOP and its arguments, which can be triton.jit-ed Python functions and triton dtypes. This is undesirable because:
- it can be tedious to maintain a layer that serializes the jit-ed function (e.g. as a string) and the dtypes.
- changes to triton or the serialization logic for triton arguments can be BC-breaking
- the exported program would expose an implementation detail (i.e. triton source code) for a specific backend (GPU) to users, which mixes levels of abstraction.
#### Future plans:
After this PR, in the short term, we expect users to have a separate aot_compile stage that compiles the exported program into a cubin file **on the same machine that calls export**, which does autotuning, removes the triton dependency, and serves the model with the cubin. This guarantees that triton changes won't break BC.
In the long term, we may export multiple cubins for the triton op directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142426
Approved by: https://github.com/zou3519
ghstack dependencies: #142425
We added an is_export flag under torch.compiler.is_exporting. This comes in handy when we want special logic at the user level and system level (e.g. higher up in the stack).
In increasing-scope:
- `_is_fx_tracing` is set to True when we use under symbolic_trace or make_fx.
- `is_exporting` is set to True when we're doing strict or non-strict export, which internally has a step that calls make_fx and set _is_fx_tracing to be True.
- `is_compiling` is set to True when we're either doing strict, non-strict export or torch.compile.
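A minimal usage sketch of the new flag alongside the existing one:
```python
import torch

def maybe_specialize(x):
    if torch.compiler.is_exporting():
        # strict or non-strict export only
        return x.clone()
    if torch.compiler.is_compiling():
        # covers export as well as torch.compile
        return x + 0
    return x  # plain eager
```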
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142425
Approved by: https://github.com/avikchaudhuri
Fixes #137936
The PR contains:
* Fix for `matmul_offline_tunableop`
* Clean-up try-finally blocks in UTs that don't use environment variables (`test_validator_tunableop_rocm`, `test_minimum_tuning_iteration_tunableop`, `test_disable_tuning_tunableop`)
* Avoid the use of environment variables in `minimum_tuning_iteration_tunableop`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143322
Approved by: https://github.com/jeffdaily
This folder is a tutorial, not packaged with PyTorch, that shows how to use the now-deprecated Lite Interpreter.
People should be using ExecuTorch instead, and there is already good documentation on it all over our tutorials and main homepage.
Testing to see what breaks in CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143398
Approved by: https://github.com/albanD
Hermite polynomials diverge to NaN at high orders due to numerical overflow. The proposal is to return NaN early when it is known that the result at this value will be NaN.
According to my short test
```Python
import torch
device = "cuda"
dtype = torch.float32
x = torch.linspace(-1000, 1000, 100000, device=device, dtype=dtype)
for n in range(1024):
if torch.special.hermite_polynomial_h(x, n).isnan().sum().item() == x.shape[0]:
print(f"hermite_polynomial_h: all outputs are nans! n = {n}")
break
for n in range(1024):
if torch.special.hermite_polynomial_he(x, n).isnan().sum().item() == x.shape[0]:
print(f"hermite_polynomial_he: all outputs are nans! n = {n}")
break
```
The output values become NaNs at these orders:
```
hermite_polynomial_h: all outputs are nans! n = 53, dtype=torch.float32
hermite_polynomial_he: all outputs are nans! n = 61, dtype=torch.float32
hermite_polynomial_h: all outputs are nans! n = 272, dtype=torch.float64
hermite_polynomial_he: all outputs are nans! n = 304, dtype=torch.float64
```
Surely, it makes sense to increase the limit as a safety margin.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141955
Approved by: https://github.com/malfet, https://github.com/eqy
Summary:
In et-replay, random data is used to run the operators. However, this does not work well for ops that use indices to access tensors, e.g. embedding ops, which use indices to look up the embedding table. If random data is used for these index ops, et-replay usually runs into invalid memory access issues.
To fix this, ET provides an environment variable "ENABLE_PYTORCH_EXECUTION_TRACE_INTEGRAL_TENSOR_RANGE"; if it is set, ET will capture the min/max values of flattened integral tensors. Then in et_replay, the min/max is used to generate random tensors within that range, which fixes the invalid memory access issue.
Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_record_integral_tensor_range_cuda
Differential Revision: D66666931
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143088
Approved by: https://github.com/sanrise
A bunch of auto dynamic shape tests would fail non-strict retraceability because when checking input constraints, we'd compare non-trivial expressions, which would require / affect shape env.
```
... is not tracked with proxy for <torch.fx.experimental.proxy_tensor._ModuleStackTracer object ...
```
I've also observed this bug internally.
This PR does an early check on whether args passed have concrete shapes, and only then proceeds: as before, we
1. try to unify / solve with the arg dim when the corresponding placeholder node dim is symbolic in one symbol
2. check directly if the placeholder node dim is concrete
3. otherwise defer to run time.
Differential Revision: [D67359596](https://our.internmc.facebook.com/intern/diff/D67359596/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143442
Approved by: https://github.com/tugsbayasgalan
Summary:
Support garbage collection after pt2 compilation.
Add jk to control the global rollout / rollback of this functionality
Add env var to control individual job's rollout
Test Plan:
Test the model training job with / without these changes
Reviewers: @yuxihu, @ezyang, @Yuzhen11
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143364
Approved by: https://github.com/ezyang
Summary:
This diff mainly adds code changes to dump `inductor_triton_kernel_to_post_grad_nodes.json` artifact which contains mapping info from post_grad -> inductor kernel code:
`{"inductor_triton_kernel_name": [post_grad_node_0, post_grad_node_1, ..., ], "..."}.`
Example paste: P1695235000 verified on the test model. See "Test Plan":
We use this artifact to demonstrate provenance tracking in the frontend 3-tab highlighter tool:
https://github.com/YUNQIUGUO/compiler_explorer (copy/pasted the input files for demo purpose for now and will integrate with Shangdi's tool to 4-tab)
https://pxl.cl/66BzK
Note: Currently this only supports mapping for inductor's `TritonKernel` type. TODO: enhance support for `ExternKernel` and other inductor-generated kernel types, etc.
Test Plan:
test_model_coverage.sh:
```
#!/bin/sh
MODEL_ENTITY_ID=644688112
SNAPSHOT_ID=32
MODULE=merge
# buck2 build --show-output mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true -c fbcode.nvcc_arch=a100,h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark
TORCH_COMPILE_DEBUG=1 CUDA_VISIBLE_DEVICES=0 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCH_LOGS="+inductor, schedule, fusion, output_code" TORCH_TRACE="tmp/guorachel_tt" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 ../buck-out/v2/gen/fbcode/d29ee94b913014f1/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark.par --model-path manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR_EP --gpu-trace --aot-inductor-config="{'max_autotune': True}" 2>&1 | tee output.txt
```
{F1973765026}
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:provenance_tracing -- --exact 'caffe2/test/inductor:provenance_tracing - test_triton_kernel_post_grad_mapping_aot_inductor (caffe2.test.inductor.test_provenance_tracing.TestProvenanceTracingArtifact)'
```
```
TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_post_grad_mapping_aot_inductor
```
Differential Revision: D66967510
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143055
Approved by: https://github.com/chenyang78
Remove erroneous assert assuming a dependent (user) node to be in the partition. This partially reverts #136616 by removing the assert.
Tested locally with a failing ExecuTorch Arm test using
```
$ python -m examples.arm.aot_arm_compiler --model_name mv2 --target ethos-u55-128 --delegate --quantize
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143376
Approved by: https://github.com/tarun292
Summary:
We have seen cases where some workers don't receive stop signals, meaning the watchdog isn't stopped accordingly. This diff introduces logic to kill the current pid alongside the worker pid.
Something to note: there is a case where the worker pid to be killed either doesn't exist or cannot be killed for some reason, which will result in the current pid also not being killed. This seems okay, since the watchdog loop will just attempt to kill the worker pid on the next iteration, but wanted to point this out.
Test Plan: experiment in next diff shows this works
Differential Revision: D65837085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141060
Approved by: https://github.com/gag1jain
Adds a tool to build bdist_wheels sequentially for multiple different
python versions (if specified).
The goal of this tool is to eventually be able to utilize it in our binary
build runs to significantly reduce the time we take to build packages,
by utilizing a local ccache from the first build.
Tested locally using the following:
```
$ ccache -C # clear cache
# -p could actually reference any python interpreter
$ python tools/packaging/build_wheel.py \
-p /home/eliuriegas/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/bin/python3.12 \
-p /home/eliuriegas/.local/share/uv/python/cpython-3.13.0-linux-x86_64-gnu/bin/python3.13 \
-d dist-multi/
...
2024-12-17 10:48:11,365 - INFO - Build time (3.12.7): 571.440689s
2024-12-17 10:48:11,365 - INFO - Build time (3.13.0): 191.147503s
```
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143361
Approved by: https://github.com/malfet, https://github.com/atalman
Fixes #143188
The fifo server binds from a thread -- in rare cases the client connects before the server thread starts. This adds a retry when opening the fifo socket in non-blocking mode. It will wait up to 1s for the server to start, which balances fast error messages while still providing some wiggle room on the server side.
Test plan:
```
pytest --minutes 10 test/distributed/elastic/timer/file_based_local_timer_test.py -k test_watchdog_call_count -x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143318
Approved by: https://github.com/fegin
These two APIs are being used internally in some projects and need to be exposed, as their build is done using the OSS toolchain.
af8789c056 - this change hid most APIs in torch python, barring the ones explicitly specified, breaking that build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143380
Approved by: https://github.com/suo
Summary:
D67068632 introduced a better profiling name for barrier operations to be able to distinguish various ops.
Unfortunately, this broke Flight Recorder Analysis with the following error as reported by dmwu
```
fr_trace -m torchx-param_bench_16g_mi300x-all_to_all -a 0 --mast_job_version 98 -w 16
Traceback (most recent call last):
File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 86, in _run_code
```
Test Plan: Test manually.
Differential Revision: D67305997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143354
Approved by: https://github.com/wconstab
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: palmje
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143381
Approved by: https://github.com/malfet
The paths are almost the same as ciflow/inductor. The only differences I could spot were that ciflow/inductor also has `test/dynamo/**` and `torch/csrc/dynamo/**`.
This is to prevent failures like https://github.com/pytorch/pytorch/actions/runs/12304985383/job/34345585535 which fails due to running on a fork, which cannot set the id token.
The other option to prevent this is to stop the job from running when on a fork.
If someone adds both labels, one will be cancelled because they have the same concurrency group
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143205
Approved by: https://github.com/huydhn
1. Add `num_warps` and `num_stages` to kernel parameters of `flex_attention`. This allows performance tuning when the default parameters of `flex_attention` is suboptimal, for example for `document_masks`.
2. Update how flex decoding splits are assigned to threadblocks. The first split of full blocks are assigned to the first threadblock, and the first split of partial blocks are assigned to the last threadblock.
3. Update `get_split_k` to assign 2 splits per SM before we have runtime workload balancing based on BlockMask.
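A hedged sketch of (1), assuming the new knobs are passed through `flex_attention`'s kernel_options dict with the key names from this PR's description:
```python
import torch
from torch.nn.attention.flex_attention import flex_attention

q, k, v = (torch.randn(2, 8, 2048, 64, device="cuda") for _ in range(3))

out = torch.compile(flex_attention)(
    q, k, v,
    kernel_options={"num_warps": 4, "num_stages": 3},  # tuning knobs per this PR
)
```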
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139639
Approved by: https://github.com/drisspg
Summary: This function call causes OOM issues on ARM machines with multi-threaded predictors (the reason behind this is still being investigated), so we replace it with the standard partition instead.
Differential Revision: D66904296
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142394
Approved by: https://github.com/frank-wei
Rewrite _compute_symbolic_stride() to make it simpler and faster.
The existing code involves several inner loops in an attempt to process the common case faster - but in reality this effort is actually slower than the simpler code.
Testing:
The initial version of this PR (which passed all tests) ran both the old algorithm and new algorithm and compared the results to make sure that results were substantially the same (they weren't the same simply because the algorithm allocates new dynamic symbols as part of it).
I also measured the timing of both methods and from the cases I checked the simpler algorithm was generally about 30% faster (which was usually the "fast path" of the old algorithm).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138844
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #138843
Implements https://github.com/pytorch/pytorch/issues/93753 - move frame local guard accessors to C++.
Before, we used dict accessors on a Python dict representing the frame's fastlocals that we manually build. We move this accessor to C++ and additionally use the fastlocal index whenever possible.
Some implementation notes:
- `FrameLocalsMapping` is now initialized as a C++ vector of `PyObject`s. We do not just use the frame's localsplus/fastlocals buffer because we also unbox cells.
- `FrameLocalsMapping` can still be converted into a Python dict representing the frame's fastlocals, but it is done lazily.
- We update `LeafGuard`, `GuardAccessor`, and `GuardManager`'s `check_nopybind` methods to accept `FrameLocalsMapping`. By default, we convert the `FrameLocalsMapping` to a Python dict and run the original `check_nopybind` on it, but in some cases, conversion is not needed.
- We add a new guard accessor `FrameLocalsGuardAccessor`, which is similar to `DictGetItemGuardAccessor` but has special handling for `FrameLocalsMapping`. We create a separate class to emphasize different use cases, but we could probably combine these two (can do in a follow up)
dynamo_guard_eval.py microbenchmark update:
- 713.2us -> 630.0us (3.10)
- 598.8us -> 530.7us (3.12)
Other followups:
- Add `FrameLocalsMapping` version for `check_verbose_nopybind` in order to match behavior between `check_nopybind` and `check_verbose_nopybind`. This can prevent difficult debugging situations where guards fail (`check_nopybind` returns false) but no guard error message is generated (`check_verbose_nopybind` succeeds).
- Rewrite the `SHAPE_ENV` guard into C++ - it is a fairly common guard that results in `FrameLocalsMapping` needing to convert to a dict
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140063
Approved by: https://github.com/jansel
ghstack dependencies: #142117, #142430
NonOwningLayout is always constructed to a FixedLayout. We should handle it the same way as FixedLayout. Note - this case is very rare, I added an assertion here and no test/model failed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143315
Approved by: https://github.com/zou3519
Reverts a change in #121337. All dataclass members must be serialized, even default-valued members, because downstream code often implicitly assumes their presence.
This PR fixes a segfault when running `test_custom_op_all_inputs` from `test/inductor/test_aot_inductor_custom_ops.py`. This segfault was caused by querying for an "index" field for the `Device` type (see `torch/csrc/inductor/aoti_torch/oss_proxy_executor.cpp:136`), which was previously skipped when serializing if the device index was unspecified. A number of other structs which are deserialized in this file also contain optional fields, and presumably could experience the same bug.
Fixes #138955. Fixes #134793.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142286
Approved by: https://github.com/zhxchen17
ghstack dependencies: #142175
conda packages for `cuda-driver-dev=12.4.127` use a "stubs" subdirectory to contain `libcuda.so`. This was previously only handled by cpp_builder in some cases, but now needs to be potentially handled more generally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142175
Approved by: https://github.com/desertfire
The previous tiling implementation worked for up to 2^32 total elements per single batch entry. This extends the functionality to support the dimensions encountered in ComfyUI (output shape: 1,72250,72250).
Fixes#141909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143095
Approved by: https://github.com/kulinseth
Previously, the same kernel source with different autotuning configs would generate the same cache key, which can lead to a wrong cache hit and silent incorrectness. Here we add the configs to the cache key in `FxGraphHashDetails`.
Test Plan:
```
python3 test/inductor/test_codecache.py -k test_triton_higher_order_op_different_configs
...
----------------------------------------------------------------------
Ran 2 tests in 3.590s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143353
Approved by: https://github.com/oulgen
Summary:
**Re-land the pr**. The previous one was reverted because of a test failure on SM89. The fix is just removing `xfailIfSM89`.
```
_____________________ LoopOrderingTest.test_fp8_pattern_2 ______________________
Unexpected success
```
------
(Since I am trying the other solution for https://github.com/pytorch/pytorch/pull/141082, I moved out the test case fixes from that pr to a separate pr to land first.)
-----
Testing float8 dynamic scaling case with `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1` didn't make any difference.
The test case for fp8 (https://github.com/pytorch/pytorch/blob/main/test/inductor/test_loop_ordering.py#L425) is also failing, https://www.internalfb.com/intern/test/844425111960859?ref_report_id=0
-------
The main change here is to modify the condition of calling `loop_reordering` from `shared_data_score == 0` to `shared_data_score < config.score_fusion_memory_threshold`.
Before the change:
`shared_data_score > 0 -> won't loop_reorder -> can't fused because of shared_data_score < config.score_fusion_memory_threshold`
After the change:
`shared_data_score > 0 -> loop_reorder (shared_data_score < config.score_fusion_memory_threshold) -> get a larger shared_data_score -> fused`
----
It's the same issue as fixed in https://github.com/pytorch/pytorch/pull/136782. But the condition to call loop_reorder might be changed later, causing the test case to fail again.
Test Plan:
```
buck2 test 'fbcode//mode/opt' caffe2/test/inductor:loop_ordering
```
-----
Ran a float8 dynamic scaling training script to verify it e2e
Differential Revision: D67012816
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142474
Approved by: https://github.com/eellison, https://github.com/sijiac, https://github.com/shunting314
Replaces https://github.com/ROCm/pytorch/pull/1592
This PR contains the initial implementation of SDPA with the composable_kernel backend. The CK path can be forced by simply calling `torch.backends.cuda.preferred_rocm_fa_library("ck")`. Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton being used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math, etc.). It only gets called when flash attention is both enabled (via `USE_FLASH_ATTENTION`) and selected at runtime by the existing heuristics.
Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work who is the co-author
NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138947
Approved by: https://github.com/pruthvistony, https://github.com/xw285cornell, https://github.com/leitian
Co-authored-by: Xiaodong Wang <xw285@cornell.edu>
Summary: The original diff got reverted because its base commit was on a broken version of pytorch that was failing rocm tests. There is no indication that this diff had any effect on rocm. Had trouble rebasing the GH PR after the revert and accidentally closed it, so submitting again.
Test Plan: See original PR with same name
Differential Revision: D67293040
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143314
Approved by: https://github.com/leitian, https://github.com/aaronenyeshi
This PR adds persistent+TMA versions (Triton template + the corresponding infra) for the `tuned_mm` and `tuned_addmm` lowerings. The persistent+TMA choices are added to the GEMM autotuning if (checked by the `use_triton_tma_template` helper):
1. The min. hardware and Triton version requirements are met for the TMA support.
2. The GEMM inputs are compatible with the Triton TMA API (i.e., 16-byte aligned and contiguous).
3. The `config.triton.enable_persistent_tma_matmul` is set to `True`.
Additional notes:
1. As added in this PR, the TMA uses are not compatible with prolog / epilogue fusion. To this end, in the new Triton template we currently support: TMA-based loads of A/B, but no prologue fusion; epilogue fusion, but no TMA-based stores of C. TMA + fusion compatibility can be added as a follow-up.
2. The current Triton TMA API (`experimental_device_tensormap_create2d`) does not support strides. Due to this, we limit the applicability of the new Triton template to the cases where the inputs are contiguous.
3. The transposed layouts of A and / or B are supported by passing the constexpr flags to the kernel and adjusting the ordering of the block sizes accordingly in the kernel code (this should have no effect on the kernel perf, as decided at the Triton compilation time).
4. After the next Triton pin update, we can switch to the tensor descriptor API (landed recently in https://github.com/triton-lang/triton/pull/5290) in the new Triton template, which should allow lifting 2 and 3 above.
5. The configs for the new Triton template in `persistent_mm_kernel_configs` are preliminary. We should do more perf exploration and possibly augment the config in a follow-up.
6. This PR is rebased onto and unifies with two related PRs landed previously: https://github.com/pytorch/pytorch/pull/142045 (some infra unification with the persistent+TMA template for _scaled_mm) and https://github.com/pytorch/pytorch/pull/134532 (add possibility to disable prolog fusion for selected choices).
7. The current Triton TMA API only supports 1D and 2D descriptors (even after https://github.com/triton-lang/triton/pull/5290, see [here](9829ce87cc/python/triton/language/core.py (L1957))). For now, this blocks adding persistent+TMA template for `torch.bmm`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142101
Approved by: https://github.com/drisspg, https://github.com/eellison
# Motivation
Support `torch.accelerator.synchronize()` on MPS. The root cause is that MPS doesn't support lazy initialization, so we must check whether the current accelerator supports device lazy initialization rather than returning early.
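With this change, a hedged example like the following should work on an MPS build:
```python
import torch

if torch.accelerator.is_available():
    dev = torch.accelerator.current_accelerator()  # e.g. device(type='mps')
    torch.randn(4, device=dev).sum()
    torch.accelerator.synchronize()  # no longer assumes lazy initialization
```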
# Additional Context
Add an MPS UT to test the code change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143171
Approved by: https://github.com/albanD
The access to lazy init callbacks (`_lazy_seed_tracker` and `_queued_calls`) is not synchronized with the initialization lock.
This exposes us to the following race:
1. start `_lazy_init`
2. take `_initialization_lock`
3. flush `_queued_calls` and run them all
4. another thread comes in and uses `_lazy_call` to put something on the queue (in our case, the `manual_seed`)
5. original thread finishes initializing, but never runs that call
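A minimal sketch of the fix, with simplified names: enqueueing and flushing must happen under the same lock, so a concurrent `_lazy_call` can never be dropped.
```python
import threading

_initialization_lock = threading.Lock()
_initialized = False
_queued_calls = []

def _lazy_call(fn):
    with _initialization_lock:
        if _initialized:
            fn()                      # init already finished: run now
        else:
            _queued_calls.append(fn)  # guaranteed to be seen by the flush

def _lazy_init():
    global _initialized
    with _initialization_lock:
        if _initialized:
            return
        for fn in _queued_calls:      # flush while still holding the lock
            fn()
        _queued_calls.clear()
        _initialized = True
```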
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143238
Approved by: https://github.com/ngimel
When we create a Config[T], we actually dynamically unbox it in the module, so let's have the type checker believe that Config[T] creates a T. This enables proper typechecking support.
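One common way to express that to the type checker, as a hedged sketch (not the actual torch code):
```python
from typing import Generic, TypeVar, TYPE_CHECKING

T = TypeVar("T")

if TYPE_CHECKING:
    # To the type checker, constructing a Config yields a plain T,
    # mirroring the dynamic unboxing the module performs at runtime.
    def Config(default: T) -> T: ...
else:
    class Config(Generic[T]):
        def __init__(self, default):
            self.default = default

threshold = Config(default=8)  # type checkers now see `threshold: int`
```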
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143229
Approved by: https://github.com/aorenste
This PR moves `decide_global_ordering_of_comms` to run first before all other Inductor scheduler passes, so that downstream passes have the correct dependency tracking info. It also moves peak memory pass and overlap pass to the end of all passes, because they need to be the final decision maker on the node order to achieve the desired peak memory and overlap.
This PR fixes hard-to-debug peak memory pass errors caused by incorrect tracking in `.unmet_dependencies` during the enablement of SimpleFSDP on internal models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142822
Approved by: https://github.com/eellison
Rationale: While Numpy doesn't support `bfloat16` and therefore there's no official typestr for `bfloat16` in `__array_interface__` (https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.interface.html#__array_interface__), JAX/ml_dtypes uses "<V2":
```
>>> from jax import numpy as jnp
>>> jnp.bfloat16.dtype.str
'<V2'
```
Using the same in PyTorch has the upside of making the typestrs returned by `__cuda_array_interface__` identify the torch dtype uniquely.
### Misc notes
(1) JAX itself just refuses to do `__cuda_array_interface__` for `bfloat16`:
```
>>> from jax import numpy as jnp
>>> jnp.arange(10, dtype=jnp.bfloat16).__cuda_array_interface__
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
jaxlib.xla_extension.XlaRuntimeError: INVALID_ARGUMENT: __cuda_array_interface__ is not supported for bfloat16 buffers.
```
(2) The "official" description of `__cuda_array_interface__` doesn't mention bfloat16, it just references `__array_interface__`: https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html
(3) Ongoing issue for numpy to support bfloat16: https://github.com/numpy/numpy/issues/19808
(4) Tweet that triggered this: https://x.com/HeinrichKuttler/status/1866761979349844211, with @ezyang responding.
(5) "<V2" is kinda weird, as it's a "little-endian void" type. When given to Numpy, it gets turned into endian-agnostic:
```
>>> import numpy as np
>>> import ml_dtypes
>>> np.dtype("bfloat16").str
'<V2'
>>> np.dtype("<V2").str
'|V2'
```
Still, it makes sense to have a unique string for `bfloat16` and since Google chose "<V2" we might as well use that.
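After this change, the typestr identifies the dtype uniquely (hedged example; requires a CUDA build):
```python
import torch

t = torch.arange(4, dtype=torch.bfloat16, device="cuda")
print(t.__cuda_array_interface__["typestr"])  # '<V2'
```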
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143042
Approved by: https://github.com/ezyang
TunableOp's rotating buffer feature cannot be properly tested because the environment variable that controls this feature is sticky. A Python API is introduced to modify this value.
Additional items in this PR:
* UT for rotating buffer API
* Clean up UTs that were setting the rotating buffer via the environment variable
* Align behavior of environment variable and Python API when a negative value (< 0) is set.
* Update documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143172
Approved by: https://github.com/jeffdaily
Was a bit too fast with my earlier PR: `sharedMemPerMultiprocessor` includes some memory that is reserved for the system. The amount a kernel can actually use is limited by `sharedMemPerBlockOptin`.
I also expose `sharedMemPerBlock` for completeness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143226
Approved by: https://github.com/ezyang
related to https://github.com/pytorch/pytorch/issues/34130
When pytorch attempts to re-raise an exception from a worker process (e.g. the multiprocessing dataloader) and it can't reconstruct the original exception message due to a TypeError, it instead raises it as a RuntimeError. However, if it can't reconstruct the exception for some other reason, it throws an error with a stacktrace pointing to the `ExceptionWrapper` code rather than the original underlying issue.
One case in which I run into this is with boto3's [HTTPClientError](66dc1f8d52/botocore/exceptions.py (L94))s. They must be constructed with a keyword argument `error`, but if `error` isn't passed, a `KeyError` is thrown instead of a `TypeError`, due to the particular way it is implemented:
* [HTTPClientError](66dc1f8d52/botocore/exceptions.py (L94))'s constructor accepts variable keyword arguments that it passes to `super` (BotoCoreError)
* [it also defines a field `fmt` with `error`](66dc1f8d52/botocore/exceptions.py (L95))
* BotoCoreError [expects to be able to format that string with the kwargs](66dc1f8d52/botocore/exceptions.py (L41))
So in this case, if a HTTPClientError occurs on a worker process, you simply get a `KeyError: error` with a stacktrace pointing to [this line](3e2f276a14/torch/_utils.py (L710)) which is unhelpful.
Instead, I propose to reraise the error as a `RuntimeError` unconditionally.
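A simplified, hedged sketch of the proposed behavior (the real code lives in torch._utils.ExceptionWrapper):
```python
class ExceptionWrapper:
    """Simplified sketch; reconstructs a worker exception in the main process."""

    def __init__(self, exc_type, msg):
        self.exc_type = exc_type
        self.msg = msg

    def reraise(self):
        try:
            exception = self.exc_type(self.msg)
        except Exception:
            # Previously only TypeError triggered the fallback; now any
            # reconstruction failure (e.g. boto3's KeyError) falls back
            # to RuntimeError, keeping the original message visible.
            exception = RuntimeError(self.msg)
        raise exception
```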
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140911
Approved by: https://github.com/vmoens
This pull request adds the ability to write the output of the operator benchmark to an optionally specified JSON file. The output is still printed in the terminal as before, but the user now has the option of saving it to a JSON file as well.
The main part of the functionality is implemented by the function _perf_result_to_dict, which outputs a dictionary to be put inside the JSON file; each dictionary corresponds to a single test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142809
Approved by: https://github.com/albanD
When investigating #142703, I found that the build with -march=armv8.6 on my M1 mac was hitting a clang ICE. When looking at the blame code, I finally noticed that this constructor was nonsense, apparently in a way that the compiler frontend accepted but the backend choked on.
example ICE error message:
```
fatal error: error in backend: Cannot select: 0x12689c260: bf16 = uint_to_fp 0x1258324a0
0x1258324a0: i32 = AssertZext 0x125822d90, ValueType:ch:i16
0x125822d90: i32,ch = CopyFromReg 0x1238dddc0, Register:i32 %22
0x12689c6c0: i32 = Register %22
In function: _ZN2at6native7DEFAULTL12logit_kernelERNS_18TensorIteratorBaseERKN3c106ScalarE
c++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Apple clang version 16.0.0 (clang-1600.0.26.3)
Target: arm64-apple-darwin24.1.0
Thread model: posix
```
Unbreaks `env CFLAGS=-march=armv8.6-a CXXFLAGS=-march=armv8.6-a python setup.py develop --cmake` on M1 Mac.
Differential Revision: [D67102953](https://our.internmc.facebook.com/intern/diff/D67102953/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142879
Approved by: https://github.com/malfet
**Summary**
Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we used `asinh(x) = log(x + sqrt(1 + x**2))` to calculate the result of `asinh`. The issue happens for an input like `-10000.1`, which makes `x + sqrt(1 + x**2)` close to 0, and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue.
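A quick illustration of the cancellation (values approximate):
```python
import torch

x = torch.tensor([-10000.1])  # float32
# Naive formula: sqrt(1 + x**2) ≈ 10000.1, so x + sqrt(1 + x**2)
# cancels to ~0 in float32 and the log blows up.
naive = torch.log(x + torch.sqrt(1 + x * x))
print(naive)           # -inf or nan instead of the true value
print(torch.asinh(x))  # tensor([-9.9035])
```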
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360
Approved by: https://github.com/jgong5
Fixes a bug introduced in https://github.com/pytorch/pytorch/pull/137267
While the test ensures the finalizer did run to make sure things are cleared, the objects are not properly collected by the gc due to the faulty tp_clear implementation. So, while the finalizer did run, the object was still alive.
Fixing this by giving tp_clear the same treatment as tp_traverse and tp_dealloc on Tensor: make it a unique function that handles the full subclass hierarchy in one place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143203
Approved by: https://github.com/ezyang, https://github.com/colesbury
ghstack dependencies: #143202
For prologues which only do either loads like gathers or dtype conversions, and no actual arithmetic on lower-precision types, we can codegen them without upcasting to fp32 without changing numerics.
Prologues that actually do arithmetic will need to use invoke quant, but I would like to support upcasts/gathers out of the box.
We could potentially extend this in the future to avoid upcasting max pooling operations as well, if there were perf benefits to be had (less likely).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142402
Approved by: https://github.com/jansel
ghstack dependencies: #142401
We load inputs to prologue fusion with a mask. That mask must still be zero before we run `tl.dot`. Previously, we would always apply the mask:
```
tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
tmp1 = tmp0.to(tl.float32)
a = tl.where(a_mask, tmp1, 0.0)
```
now we do not need to ->
```
tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
tmp1 = tmp0.to(tl.float32)
a = tmp1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142401
Approved by: https://github.com/jansel
Summary:
As title
This is a BC-breaking change because a graph produced by "capture_pre_autograd_graph" cannot be input to quantization anymore. But this is OK, since this API has been deprecated for a while and is going to be deleted. We have removed all call sites of it.
We remove the deprecated API references in code, docs, and tests.
We also removed two tests that specific to capture_pre_autograd_graph API.
Test Plan: CI
Differential Revision: D65351887
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139505
Approved by: https://github.com/tugsbayasgalan, https://github.com/andrewor14, https://github.com/jerryzh168
Summary:
Add new structured logging "inductor_pre_grad_graph"
This is for inductor provenance tracking front-end to load this graph from tlparse.
ghstack-source-id: 257581974
exported-using-ghexport
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' //caffe2/test/dynamo:test_dynamo -- -r StructuredTraceTest
```
Differential Revision: D67150288
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143126
Approved by: https://github.com/desertfire
A series of directed perf improvements to drive down the dynamo tracing cost of
the given test. Before this PR stack the compile took about 60s, and after takes
30s. Individual improvements are listed below along with the approximate
improvement of that change.
Tested with this model:
```
@torch.compile(backend="eager")
def model_add(x, y):
out = x
for i in range(5000):
out = torch.add(out, y)
return out
```
This PR: Stop importing builder in the inner loop of `VariableTracker.build()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143056
Approved by: https://github.com/jansel
ghstack dependencies: #143066
From the [docs](https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html) for index_put_:
> If accumulate is True, the elements in values are added to self. If accumulate is False, the behavior is undefined if indices contain duplicate elements.
Currently the sample inputs for `index_put` generate 2 indices. Because they are generated randomly, they could be the same, leading to undefined behaviour if `accumulate=False`.
This PR changes the input generation to only generate a single index if `accumulate=False` preventing duplicate indices and undefined behaviour.
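For reference, a small example of the difference:
```python
import torch

x = torch.zeros(3)
idx = (torch.tensor([0, 0]),)              # duplicate indices
vals = torch.tensor([1.0, 2.0])

x.index_put_(idx, vals, accumulate=True)   # well-defined: x[0] == 3.0
x.zero_()
x.index_put_(idx, vals, accumulate=False)  # undefined: x[0] may be 1.0 or 2.0
```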
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143116
Approved by: https://github.com/albanD
Resolves issue #140464 by adding an option to not specialize int from nn.Modules (False by default to maintain existing behavior).
Test Plan: `buck2 test mode/opt caffe2/test/dynamo:test_dynamo -- test_modules.py::NNModuleTests::test_nn_module_unspec_int_attr`
Differential Revision: D66837042
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142829
Approved by: https://github.com/ezyang, https://github.com/yanboliang
This is sort of subtle - because we were doing `V.ops.mul` at binding time, we don't redispatch later when we invoke the epilogue, and we then run into the assertion checking in the PR above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143127
Approved by: https://github.com/drisspg
ghstack dependencies: #143048
FIXES https://github.com/pytorch/pytorch/issues/142313
So with previous HOPs, compiled autograd could just inline into their body and get their post-dispatch aten representation. You can't do that with this flex attention HOP, which just wants any proxy tracing mechanism to insert it into its graph. Okay, compiled autograd does use proxy tracing, so we can do that.
This is safe because other than the reenter_make_fx call, there were no other make_fx internals usage in the HOP. And compiled autograd specializes on the AOT backward's saved symints which should cover any changes in shapes to the inputs of the HOP.
However, there's still an issue: Dynamo doesn't know how to handle `FlexAttentionBackwardHOP` and will graph break, so the flex attention backward is running in eager as of this PR. The tlparse looks really scuffed after the compiled autograd capture: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpMMHBEH/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143155
Approved by: https://github.com/drisspg
This patch applies a local and practical workaround for custom dict
construction when multiple inheritance is involved.
Handling multiple inheritance in general could be a lot more involved,
so I created #142414 to track that.
Fixes #141118.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142416
Approved by: https://github.com/jansel
### Summary
Extends #142036 for Inductor pattern-matching pattern covered for torchao API `int8_dynamic_activation_int8_weight` in the following scenario (inference-only, freezing enabled) -
- int8 quantized (symmetrically) activation (per token quantized).
- Statically (so, scales are also constant. But then they would have been constant even in case of dynamic quantization due to constant weights, anyway) per-channel int8 quantized (symmetrically) weights (which are also constant because freezing is enabled).
The pattern that's matched is `torch._intmm` -> convert to FP32/BF16 -> [optional expand for activation scale] ->`mul` -> `mul`.
We don't check whether the activation is dynamically quantized or whether the weights are statically quantized, though (since the implementation won't have any side-effects even if that weren't true).
In practice, it also matches the smooth-quant int8 quantized linear pattern if its output is not reshaped (if activation is 2D).
### More details
oneDNN int8 matmul supports application of per-channel weight scale but not a vector activation scale, which could be applied as a post op, but is currently unsupported in ATen. Bias addition (which could be supported with an add post-op) is also unfused.
The fusion pattern used in this PR is `torch._intmm` -> convert to FP32/BF16 ->`mul`, which will be replaced by oneDNN qlinear op.
The speedup over eager-mode is due to 2 reasons -
1. fusion of int8xint8 -> int32 GEMM, conversion to FP32/BF16 & application of weight scale. (In case of BF16, many intermediate conversions are also avoided).
2. weight is pre-packed & cached by Inductor, so a reorder is avoided at run-time.
But, in the future, the whole pattern (including application of activation scale, which would be a mul post-op) + bias could be fused if corresponding support would be enabled in ATen.
### Verification
Added UT in this PR
```
python test/inductor/test_mkldnn_pattern_matcher.py -v -k test_da8w8_sym_act_sym_wgt_with_int_mm
```
#### Corresponding torchao UTs
1. int8 Smoothquant legacy API - `TORCHINDUCTOR_FREEZING=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" python test/integration/test_integration.py -v -k test_non_dynamically_quantizable_linear`.
The difference from #139595 is that there are no reshapes of the linear output in this pattern.
2. int8 da8w8 - symmetrically quantized activation (dynamically) & statically quantized weights - ` TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" TORCHINDUCTOR_FREEZING=1 python test/integration/test_integration.py -v -k test_int8_dynamic_quant_subclass_api_0_cpu`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142110
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #142036
This PR extends our ability to fuse pointwise nodes onto triton templates with the ability to fuse pointwise nodes into triton templates - prologue fusion.
Similar to the store_output api:
`{{store_output(("idx_m", "idx_n"), "acc", "mask")}}`
And the modification api:
```
{{ modification(
subgraph_number=0,
output_name="post_mod_scores",
score="qk",
out="qk"
) | indent_except_first(1) }}
```
We have:
```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}```
Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)) on every iteration, and instead compute indices from the k_idx of each loop iteration. This did not have any perf difference.
There are a couple main use cases for prologue fusion:
- Fusing dequants into a matmul. particularly for more bandwidth bound scenarios.
- Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details.
Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read inside the triton template, multiplied by a small factor to allow gathers. This restricts reliably unprofitable fusions like fp32->fp16 inside the kernel. In a future PR we could potentially add an api for being more aggressive if we know we are in a bandwidth-bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066
Other notes:
By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls.
With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also updated the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs this could get cleaned up a bit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532
Approved by: https://github.com/jansel
GraphTask holds metadata needed for a single execution of backward(); it is 1:1 with backward calls, at least for compiled autograd. It is used for certain torch._C global autograd state APIs.
In SAC, we use torch._C._current_graph_task_id() as a dict key to store information during unpack hook execution: a5fb07af27/torch/utils/checkpoint.py (L1128)
If we don't set an active task, it will randomize the key, and will do its logic as if each unpacked tensor was from a different graph task
a5fb07af27/torch/utils/checkpoint.py (L1112-L1115)
The sketchy part of this PR is that in eager autograd, GraphTask is mutated during execution. But inspecting the struct, the mutation seems to only be used to communicate between autograd threads (created when multiple devices are involved) or for deprecated uses. We shouldn't run into the mutation case at all in compiled autograd. Also, only the graph task id is accessible from python hooks.
FIXES https://github.com/pytorch/pytorch/issues/142862
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143108
Approved by: https://github.com/jansel, https://github.com/albanD
Reopen of https://github.com/pytorch/pytorch/pull/139595
**About the PR**
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between.
This PR adds a pass to fuse the corresponding patterns:
- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`
The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants.
Note that `onednn.qlinear_pointwise` only supports a scalar activation scale, which is a limitation of oneDNN library, so in that case we set activation scale to 1 and bias to none and apply scales and add bias after `onednn.qlinear_pointwise`.
**Validation results**
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`
**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```
Differential Revision: [D66796966](https://our.internmc.facebook.com/intern/diff/D66796966)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142036
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
Co-authored-by: sanchitintel <sanchit.jain@intel.com>
Summary: Adds a post-optimizer hook for fbcode so that we can run iterative tracing on demand without having to use a frontend profiler interface. Since this is being used more frequently, it would be convenient for users to be able to trigger this on-demand feature without having to worry about being within some timing window.
Test Plan: Ran iterative tracing without profiler.profile
Differential Revision: D66734119
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142077
Approved by: https://github.com/briancoutinho
Follow Up after: https://github.com/pytorch/pytorch/pull/142282
Remove Checkout pytorch/builder for Linux Binary Builds
I believe we were no longer using builder. Hence, remove this checkout.
We should be using scripts from this folder:
```
/pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh
```
TODO: Will follow up with removing BUILDER_ROOT everywhere from the PyTorch repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143125
Approved by: https://github.com/kit1980
Summary:
This diff makes it possible to migrate to PyTorch's OSS export schema from sigmoid. Basically, we add a new field called "methods" to ExportedProgram in the Model definition, which contains the thrift schema generated based on schema.py from OSS. This way, we can keep writing the old fields while double-writing a new format in equivalent form. Since thrift doesn't support inlining type definitions, we do it manually here, and it shouldn't break on-wire compatibility. As long as every sigmoid user is using sigmoid.frontend.serialization.serialize, we always guarantee to have the new format saved as well.
Eventually we will use JSON deserialization from OSS, so we only need to keep this double writing for a couple of months; after that, we will migrate every serialization path to the OSS workflow.
Test Plan:
buck test mode/opt sigmoid/frontend:serialization_test
buck test mode/opt sigmoid/frontend/test_gpu:serializer_test
Differential Revision: D67044185
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142511
Approved by: https://github.com/desertfire
This PR fixes some issues with NJT backward / compile backward tests:
1. `requires_grad` was not being propagated appropriately during `SampleInput` generation, so a LOT of backward cases were untested before (sad times). This PR utilizes a helper function `_clone()` to clone() / detach() NJTs for SampleInputs while preserving `requires_grad` status (see the sketch after this list). Note: the clone() / detach() stuff is for autograd; can't have two SampleInputs as part of the same autograd graph.
2. Per-sample skips weren't *fully* working; the op logic would still be invoked even with a skip. I found this out thanks to `split_with_sizes`, which segfaults during backwards because it tries to use an NST-specific formula. As annoying as it is, I tried a ton of things but ultimately had to split `subtest_ctx` into that plus a separate `skip_xfail_ctx` to run the subtests within.
* Updated all uses of per-sample skips / xfails: 4 in `test_nestedtensor.py` and 1 in `test_vmap.py`
3. Added the appropriate skips / xfails to get everything passing. There are a shitton of bugs to fix!
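Here is a minimal sketch of the `_clone()` behavior from item 1; the name comes from this PR, but the body is illustrative:
```python
import torch

# Detach from any existing graph, then restore requires_grad so each
# SampleInput owns an independent autograd graph.
def _clone(t):
    out = t.clone().detach()
    out.requires_grad_(t.requires_grad)
    return out

x = torch.randn(3, requires_grad=True)
y = _clone(x)
assert y.requires_grad and y.grad_fn is None
```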
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143072
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.
# Feature
This PR changes the `RINDEX` / `"r"` symbol type to `(R0_INDEX, R1_INDEX)` and `("r0_", "r1_")`, respectively. This allows the relevant code to support 2D (often ND) reductions. Unlike the parent PR, this one does not change the tiling algorithm, so `"r1_"` is never used. However, it prepares other parts of the system to handle `"r1_"` once we start using it. This should significantly reduce the chances of hitting merge conflicts, making the parent PR much easier to land.
The only change to the generated triton code is to rename `"rindex"` -> `"r0_index"`, `"RBLOCK"` -> `"R0_BLOCK"`, etc. To maintain compatibility with existing codegen, this also generates aliases to the old reduction variables like `rindex = r0_index`. If we generated 2D reductions (which this PR will not do), the aliases would be more complicated and would collapse 2D multi-indices to linear indices. See some example kernels in the parent PR.
These aliases can be eliminated by the Triton compiler, and should not impact the final machine code running on the GPU. See the perf testing in the parent PR which confirms the aliases do not impact perf.
# Test plan
The existing CI provides good coverage. This PR modifies the expected code in a few places, renaming reduction variables from `r.*` to `r0_.*`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142020
Approved by: https://github.com/jansel
Co-authored-by: Jason Ansel <jansel@meta.com>
For functional compiled autograd, we're having dynamo trace through the aot backward implementation. To avoid graph breaks without imposing too many restrictions, we allow_in_graph the prologue and epilogue (see the sketch after this list). This adds 2 restrictions:
- code must be available in the global context
- inputs other than tensors/symnodes must be const foldable
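A minimal sketch of the allow_in_graph mechanism (the decorated function is made up for illustration):
```python
import torch

# allow_in_graph keeps the call as a single node in the dynamo graph
# instead of tracing into its body, so the body never graph-breaks.
@torch._dynamo.allow_in_graph
def prologue_like(x):
    return x.contiguous()

@torch.compile(fullgraph=True)
def f(x):
    return prologue_like(x) + 1

print(f(torch.randn(4)))
```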
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142415
Approved by: https://github.com/bdhirsh
**Summary**
1. Move the model init tests from `DistTensorRandomOpTest` to `DistTensorRandomInitTest`
2. Added a HSDP+TP meta init test to show the correct model init result in this use case. Note that this test requires 8 GPUs to run and our CI doesn't have that capacity, so this test will be skipped in CI. A local run shows that the test passes on an 8-GPU host. A single-device sketch of the meta-init pattern follows below.
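A hedged single-device sketch of the meta-init pattern (the module choice is illustrative; the real test uses HSDP+TP on a device mesh):
```python
import torch
import torch.nn as nn

# Allocate parameters on the meta device (no storage), materialize real
# storage, then re-run reset_parameters() so RNG-based init fills in the
# intended values.
with torch.device("meta"):
    model = nn.Linear(8, 8)
model = model.to_empty(device="cpu")
model.reset_parameters()
```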
**Test**
`pytest test/distributed/_tensor/test_random_ops.py -s -k test_hsdp_tp_model_meta_init`
<details>
<summary> Test Result </summary>
<img width="3343" alt="image" src="https://github.com/user-attachments/assets/a960c5e6-37bc-49be-9e36-ecc29ed47eb0" />
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143077
Approved by: https://github.com/weifengpy
Fixes #142380
I have added `-o pipefail` in all bash scripts in pytorch/.ci/pytorch. Sorry I didn't double-check the submodule in my last PR. Thanks for the correction! Please contact me again if there are any problems with this fix^^. (Actually, contributing to the open source community is an assignment for one of my courses, and today is the deadline, so I rushed to revise it when I saw an email early in the morning. Haha.)
@ezyang @malfet @huydhn @zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143050
Approved by: https://github.com/ezyang, https://github.com/huydhn
Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
**Summary**
Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we used `asinh(x) = log(x + sqrt(1 + x**2))` to calculate the result of `asinh`. The issue happens with an input like `-10000.1`, which makes `x + sqrt(1 + x**2)` close to 0, and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue.
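A minimal repro of the cancellation in float32:
```python
import torch

# In float32, sqrt(1 + x**2) for x = -10000.1 rounds to (nearly) |x|, so
# x + sqrt(1 + x**2) cancels to ~0 and the naive formula returns garbage
# (or -inf), while the true asinh(-10000.1) is about -9.9035.
x = torch.tensor(-10000.1, dtype=torch.float32)
naive = torch.log(x + torch.sqrt(1 + x * x))
print(naive, torch.asinh(x))
```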
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360
Approved by: https://github.com/jgong5
Add a 3D (pp+tp+fsdp) test `test_3d_with_tp_dp_pp` in test_pp_composability
Currently provides `@parametrize` on:
- "ScheduleClass" for pp in [ScheduleGPipe, Schedule1F1B, ScheduleInterleaved1F1B, ScheduleLoopedBFS, ScheduleInterleavedZeroBubble]
- "MixedPrecisionParam" for fsdp in [torch.bfloat16, torch.float32]
Future work:
1. add fp8
2. add cp (context parallelism) to enable a 4D test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141398
Approved by: https://github.com/wconstab, https://github.com/kwen2501
This is the initial foreach map HOP for pointwise ops which will be extended in the future to support grouped GEMMs and other ops.
This PR utilizes the PrimHOPBase class to represent foreach_map as a HOP with a single subgraph. The way this is implemented is that the user API `foreach_map` provides a single pointwise torch op, and internally this function calls a polyfill which has the same semantics as a foreach op (i.e., it iterates over lists of operands, applying the op elementwise). The higher order op is passed through the stack down to inductor, where a lowering in essence inlines the subgraph into the main graph. This is done by interpreting it with a pointwise subgraph lowering, grouping the outputs by device, and registering the output buffers as foreach groups as applicable. For testing, I was able to reuse the existing foreach tests by creating a wrapper function which matches the foreach op interface for those tests, and then ran all of the existing foreach tests on foreach_map.
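A minimal sketch of the polyfill semantics (names are illustrative; the real API lowers through the HOP machinery described above):
```python
import torch

# Same semantics as a foreach op: apply one pointwise op elementwise
# across lists of operands, like torch._foreach_add does natively.
def foreach_map_polyfill(op, xs, ys):
    return [op(x, y) for x, y in zip(xs, ys)]

xs = [torch.randn(3) for _ in range(4)]
ys = [torch.randn(3) for _ in range(4)]
out = foreach_map_polyfill(torch.add, xs, ys)
```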
TODO before landing:
* Add tests for general functions
* Test warning if unsupported op will block fusion
Followups:
* I need to add tests for backwards (this will be a followup PR because backwards will require other work as well)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142098
Approved by: https://github.com/eellison
Fix: #141251
This PR adds a few static guard checks when decomposing and lowering the `slice`
operation, so that we avoid adding unnecessary guards, specifically when clamping the end
values.
In summary, the changes are:
- `slice` dynamo decomposition: checks `end >= sizes[dim]` statically. If we don't know
that, the following guard ensures that we (don't) need clamping.
- `evaluate_min` inductor `sizevar` function: checks whether we can solve it statically or
not, before actually creating a new guard.
The latter had to be changed because `evaluate_min` (called by the `ir.SliceView` constructor)
would always try to create a guard based on the result of the hint comparison. However, if both
the `left` and `right` hints were true, it would default to a `left <= right` guard. By checking
statically beforehand, we can avoid that.
```python
N = 16
@torch.compile(backend="inductor", dynamic=False, fullgraph=True)
def fn(x):
    splits = torch.ops.aten.split.Tensor(x, N)
    first = splits[0]
    return torch.ops.aten.slice.Tensor(first, 0, 0, N)
x = torch.arange(N)
torch._dynamo.mark_dynamic(x, 0)
fn(x)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142372
Approved by: https://github.com/ezyang
Fixes #141770
The decomposition table in `torch.export.default_decompositions().items()` is overwritten by `torch._decomp.decomposition_table`. From the `torch.onnx.export()` perspective, we should instead respect the table of decompositions in `torch.export.default_decompositions().items()` and avoid overwriting it with `torch._decomp.decomposition_table`.
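A hedged sketch of the intended table selection (the exact plumbing inside the exporter differs):
```python
import torch

# Respect the export-default table instead of clobbering it with
# torch._decomp's global table.
decomp_table = dict(torch.export.default_decompositions().items())
# ...this table, not torch._decomp.decomposition_table, should drive the
# decompositions applied during torch.onnx.export().
```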
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142831
Approved by: https://github.com/justinchuby
Fixes https://github.com/pytorch/pytorch/issues/141305.
```python
class M(torch.nn.Module):
    def forward(self, x, y, z):
        a = y.shape[0]
        b = z.shape[0]

        def true_fn(x):
            return x + a

        def false_fn(x):
            return x + b * z

        # When exporting with non-strict: a and b are symints,
        # so torch.compile needs to wrap and trace symint inputs.
        return torch.cond(x.shape[0] > 5, true_fn, false_fn, (x,))
```
In non-strict export, when inputs are annotated with dynamic shapes, a and b in the above example are of torch.SymInt type, so true_fn and false_fn have closures over torch.SymInt values. The error is triggered because we didn't handle SymInt inputs in dynamo and ended up using a UserDefinedObjectVariable for them, which doesn't have a proxy. We added support by following how we previously handled SymBool inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141524
Approved by: https://github.com/zou3519
ghstack dependencies: #142185
This was getting tested with ao, but now there is a real test I added.
## What does this PR do?
We want to allow custom PyTorch extensions to be able to build one wheel for multiple Python versions, in other words, achieve python agnosticism. It turns out that setuptools/Python already provides such a way! Namely, if the user promises to use only the Python limited API in their extension, they can pass `py_limited_api` to their Extension class and to the bdist_wheel command (with a min Python version) in order to build one wheel that will suffice across multiple Python versions.
Sounds lovely! Why don't people do that already with PyTorch? Well, two things. This workflow is hardly documented (even searching for python agnostic specifically does not reveal many answers), so I'd expect that people simply don't know about it. But even if they did, _PyTorch_ custom Extensions would still not work, because we always link torch_python, which does not abide by py_limited_api rules.
So this is where this PR comes in! We respect when the user specifies py_limited_api and skip linking torch_python under that condition, allowing users to enroll in the provided functionality I just described.
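A minimal setup.py sketch of the enrollment described above; the package name, sources, and minimum version tag are illustrative:
```python
# setup.py for a hypothetical extension opting into the limited API.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_extension",
    ext_modules=[
        CppExtension(
            "my_extension._C",
            ["extension.cpp"],
            py_limited_api=True,  # with this PR, torch_python is not linked
        )
    ],
    cmdclass={"build_ext": BuildExtension},
    # one wheel usable from the minimum Python version upward
    options={"bdist_wheel": {"py_limited_api": "cp39"}},
)
```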
## How do I know this PR works?
I manually tested my silly little ultra_norm locally (with `import python_agnostic`) and wrote a test case for the extension showing that
- torch_python doesn't show up in the ldd tree
- no Py- symbols show up
It may be a little confusing that our test case is actually python-free (cleaner than python-agnostic), but it is sufficient (though not necessary) for showing that this change works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138088
Approved by: https://github.com/ezyang, https://github.com/albanD