pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-24 15:44:58 +08:00

Author	SHA1	Message	Date
Edward Z. Yang	a2d2a30311	Add torch._dynamo.config.fail_on_cache_limit_hit (#136767 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136767 Approved by: https://github.com/albanD, https://github.com/jansel ghstack dependencies: #136533	2024-09-27 03:58:00 +00:00
Mu-Chu Lee	2521cd3874	Skip kernel saving if already existed. (#136389 ) Summary: We skip save_gpu_kernel if the kernel has already been saved. This gives a more accurate Triton profiling result. The following traces show a benchmark of a trivial addmm before and after the change: Before: <img width="1255" alt="Screenshot 2024-09-23 at 10 26 53 AM" src="https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a"> After: <img width="910" alt="Screenshot 2024-09-23 at 10 27 03 AM" src="https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118"> Before the change, the benchmark includes two parts: (1) the overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation, and (2) the computation of the Triton kernel itself. Part (1) accounts for >50% of the time, which often makes kernel selection during profiling choose aten kernels over Triton kernels. Test Plan: Existing OSS CI [Redacted, Some internal model results in D63441430] Pull Request resolved: https://github.com/pytorch/pytorch/pull/136389 Approved by: https://github.com/desertfire	2024-09-27 03:03:28 +00:00
Fuzzkatt	d1382aaf3d	skip test_out_of_memory for jetson (#133270 ) Skip test_out_of_memory in test/test_cuda.py on Jetson as OOM reporting in Jetson has issues due to partially missing NVML support. cc @eqy Pull Request resolved: https://github.com/pytorch/pytorch/pull/133270 Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/seemethere	2024-09-27 02:36:48 +00:00
Bin Bao	26869d38e1	[Inductor] Further solve missing aoti_torch_check symbol issue (#136775 ) Summary: https://github.com/pytorch/pytorch/pull/136669 didn't resolve all the internal test failures. Add more tests to OSS CI to catch the remaining issues, and fix some internal TARGETS dependency. Differential Revision: D63473744 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136775 Approved by: https://github.com/henrylhtsang	2024-09-27 02:26:49 +00:00
CaoE	66340e6751	Fix numerical instability for norm (#129352 ) Fixes #123645 When the reduce size is large, reducing directly may exceed the range that FP32 can represent, resulting in incorrect results. Reducing in groups and using double as the intermediate accumulation type avoids exceeding the representable range. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129352 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-09-27 00:51:31 +00:00
Sahan Paliskara	adc77a9b7f	[lintrunner] auto apply formatting changes as suggestions (#136239 ) (Remove spurious cc) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136239 Approved by: https://github.com/huydhn, https://github.com/eqy Co-authored-by: Huy Do <huydhn@gmail.com>	2024-09-27 00:51:21 +00:00
Ruben Rodriguez Buchillon	faedee12fa	[test] enable test_triton_wrapper again (#136721 ) Summary: Reenable the `test_triton_wrapper.py` test again # Why We want this to run internally # What - fix python path issue on the test - reenable the test # Background It appears that the parent process does not pass the entire path down to the child process. Namely, if there is some setup that makes the sys.path effectively look different than, say, PYTHONPATH or something like this, the child will not inherit this setup. To avoid needing to keep track of specific setups, we pass the effective `sys.path` from the parent to the child through the PYTHONPATH env variable Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:triton_wrapper Differential Revision: D63438186 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136721 Approved by: https://github.com/henrylhtsang	2024-09-27 00:44:40 +00:00
ankurneog	22a4129a76	Generalization of FSDP common for non-cuda execution (#133209 ) ## Motivation The FSDP common code for FSDP UT execution is mostly written with the CUDA device in mind. However, other devices such as Intel Gaudi support most of the functionality. We are generalizing the base content so that the UT content can be used for non-CUDA device execution. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133209 Approved by: https://github.com/kwen2501	2024-09-27 00:38:10 +00:00
Sergii Dymchenko	a619ced5ed	Revert "Update run_test.py" This reverts commit 193073b4914a7f80758541d391eacbe21194ecdf.	2024-09-26 17:34:52 -07:00
Sergii Dymchenko	193073b491	Update run_test.py	2024-09-26 16:56:29 -07:00
eellison	aa56f80ec1	Dont pairwise check unfusable nodes in scheduler (#136682 ) Gives 8% wall time speedup on n=1000 benchmark in https://github.com/pytorch/pytorch/pull/136429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136682 Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/shunting314	2024-09-26 23:46:52 +00:00
Nikita Shulga	0b62ebfeaa	[CI] Populate `JOB_ID` for MPS tests (#136791 ) Move `get-job-id` steps before running the tests and copy-n-paste environment variables from `_mac-test.yml` added in https://github.com/pytorch/pytorch/pull/113099 Should fix the following warning during MPS test run: ``` /Users/ec2-user/runner/_work/pytorch/pytorch/tools/stats/upload_metrics.py:147: UserWarning: Not emitting metrics for td_test_failure_stats_v2. Missing job_id. Please set the JOB_ID environment variable to pass in this value. warn(f"Not emitting metrics for {metric_name}. {e}") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136791 Approved by: https://github.com/albanD, https://github.com/izaitsevfb	2024-09-26 23:00:52 +00:00
Bin Bao	da5c7b6f4e	[AOTI] Set CUDA device for torch._export.aot_load (#136715 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/136369. When a CUDA device with index is specified when calling torch._export.aot_load, we need to specify the CUDA device when running model.so. Differential Revision: [D63438335](https://our.internmc.facebook.com/intern/diff/D63438335) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136715 Approved by: https://github.com/angelayi	2024-09-26 22:20:12 +00:00
Joel Schlosser	991f8f8ec3	Bias gradient calculation for NJT linear backward (#136660 ) Previously NYI - @mikaylagawarecki needs it for Transformers. Fixes #136652 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136660 Approved by: https://github.com/soulitzer	2024-09-26 21:38:10 +00:00
eqy	c0e98a485b	[FP8][CUDA] Fix stale expected error message (#136581 ) CC @nWEIdia as I think we have seen internal failures on this Pull Request resolved: https://github.com/pytorch/pytorch/pull/136581 Approved by: https://github.com/mikaylagawarecki	2024-09-26 20:57:38 +00:00
Roy Hvaara	5789f8d5dc	[MPS] Add regression test for large inputs to `F.linear` (#136084 ) This PR adds a regression test for the issue reported in #122045. I was not able to reproduce on macOS > 13. ~Expect the first iteration of the tests to fail for macOS 13, but pass for 14 and 15.~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/136084 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-26 20:46:14 +00:00
Sergii Dymchenko	9656a603b2	Fix lint (#136781 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136781 Approved by: https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/malfet	2024-09-26 19:13:56 +00:00
Sergii Dymchenko	c878ea2c4e	Add info about "release tracker" label for cherry-picking bot (#136777 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136777 Approved by: https://github.com/seemethere, https://github.com/atalman	2024-09-26 18:45:45 +00:00
Jithun Nair	851b9732aa	Download pre-compiled AOTriton from GitHub unless AOTRITON_INSTALL_FROM_SOURCE=1 is set (#136603 ) PyTorch community members have reported issues with building PyTorch from source for ROCm in an environment that doesn't have aotriton pre-installed, because aotriton is only installed in the [CI](`a8ed873ba2/.ci/docker/manywheel/Dockerfile (L197)`) docker images. Building aotriton from source can take ~45 minutes. This PR fixes the issue by downloading the aotriton tarball in such scenarios, unless the user explicitly wants to build aotriton from source using the AOTRITON_INSTALL_FROM_SOURCE=1 env var Pull Request resolved: https://github.com/pytorch/pytorch/pull/136603 Approved by: https://github.com/atalman Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>	2024-09-26 18:05:51 +00:00
Pian Pawakapan	f0a92541fe	[export] fix lifted constants order for 0-input graphs (#136658 ) Summary: With empty graphs, the `graph.inserting_before(first_user_input = None)` call turns into a `graph.inserting_after(root)` call, inverting the order of constant input nodes being inserted. This fixes the issue by initializing to the first node in the graph (still valid if not a user input - only used for insertion). Test Plan: test_export Differential Revision: D63403514 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136658 Approved by: https://github.com/avikchaudhuri	2024-09-26 17:44:24 +00:00
fduwjj	40c825d773	[reland] [torchelastic][c10d] Fix store prefix race in rendezvous (#136768 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136768 Approved by: https://github.com/kwen2501, https://github.com/atalman	2024-09-26 17:37:07 +00:00
Rachel Guo	da09984c0d	[AOTI][Tooling][9/n] Add debug printer support for cpp kernel type (#136465 ) Summary: As title. Cpp kernel has a different codegen path: https://www.internalfb.com/code/fbsource/[6df946858879dd9bcefa18710dd79095a957f0dd]/fbcode/caffe2/torch/_inductor/codegen/cpp.py?lines=4643 Previously it is not wrapped/supported by the debug printer manager. This diff adds this support. It can be useful for cpu models. See this for a use case: https://www.internalfb.com/phabricator/paste/view/P1598561051?lines=927 Test Plan: ``` AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run 'fbcode//mode/opt' fbcode//accelerators/workloads/models/slimdsnn:slimdsnn -- aot --batch-size 1 ``` Differential Revision: D63053101 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136465 Approved by: https://github.com/hl475	2024-09-26 17:30:43 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	e4e83a4ac4	Remove aten.item hack (#136663 ) Summary: Title Test Plan: CI Differential Revision: D63404353 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136663 Approved by: https://github.com/bdhirsh	2024-09-26 17:14:48 +00:00
albanD	2421344d8f	Update current maintainers (#136672 ) This file hadn't had an overhaul in a few years, so this is long overdue. Most of the credit goes to @orionr for gathering all of this info. The main rules we followed: - No code contributor is removed; they're all placed as emeritus - Break down categories that are too big, to make this document useful for knowing who to ping - No category where the code is still in the codebase is removed - We did not rework the categories (for example to be closer to module: labels) and leave that for later - All non-emeritus names are ordered by their number of comments on issues related to their topic Pull Request resolved: https://github.com/pytorch/pytorch/pull/136672 Approved by: https://github.com/eqy, https://github.com/ezyang, https://github.com/seemethere, https://github.com/malfet	2024-09-26 17:13:16 +00:00
Edward Z. Yang	beb46de342	Correctly convert Python float to float64 when passing argument as Tensor (#136413 ) I can't actually test the Dynamo codegen fix as it is impossible to directly use the Tensor at the moment. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136413 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #136599	2024-09-26 16:50:13 +00:00
Edward Z. Yang	11fd55827d	Make CLOSURE_VARS construction lazy (#136599 ) This makes us less likely to hit import cycle problems with torch Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136599 Approved by: https://github.com/anijain2305	2024-09-26 16:50:13 +00:00
drisspg	ff2360c733	[FlexAttention] Reduce expensive test time by 10x (#136677 ) Now that we support sequence lengths that are not divisible by 128, this cuts the time of the expensive tests by about 10x. Before: ```Shell 46.32s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod1 45.61s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod2 44.45s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod3 43.61s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod0 ``` After: ```Shell 4.25s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod5 4.20s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod4 4.19s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod1 4.04s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod2 3.99s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod0 3.98s call test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod3 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136677 Approved by: https://github.com/Chillee ghstack dependencies: #136673	2024-09-26 16:40:21 +00:00
drisspg	840c6b7a68	[FlexAttention] Add Better error message for cpu tensors (#136673 ) Partially address: #136525 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136673 Approved by: https://github.com/Chillee	2024-09-26 16:40:21 +00:00
Thanh Ha	ddab704b28	Use wildcard for portion of AMI version number (#136764 ) Rather than specifying a specific version number for the AMIs, use wildcards for the date section. Issue: pytorch/pytorch#136762 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136764 Approved by: https://github.com/ZainRizvi	2024-09-26 16:39:25 +00:00
cyy	59e8f8228f	[3/N] Fix clang-tidy warnings in torch/csrc/lazy (#136705 ) Follows #136634 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136705 Approved by: https://github.com/Skylion007	2024-09-26 16:29:43 +00:00
Jez Ng	31c0467594	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet	2024-09-26 15:35:26 +00:00
Nikita Shulga	68579ef665	[EZ][MPS] Extend `arange` to bfloat16 (#136754 ) RangeFactories class is the only one that uses `AT_DISPATCH_MPS_TYPES` Fixes https://github.com/pytorch/pytorch/issues/136624 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136754 Approved by: https://github.com/Skylion007	2024-09-26 15:33:45 +00:00
Nikita Shulga	73ec76ed50	[MPS] Implement `isposinf` and `isneginf` (#136689 ) Not sure why `isinf` is a composite op, but those need to be implemented by hand. Implementation is a trivial call to ```objc [mpsGraph equalWithPrimaryTensor:input secondaryTensor:[mpsGraph constantWithScalar:std::numeric_limits<T>::infinity() dataType:input.dataType]] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136689 Approved by: https://github.com/Skylion007	2024-09-26 15:33:20 +00:00
drisspg	d05645841e	Update get_device_properties to take in optional device (#136683 ) Aligns behavior with the rest of cuda's device info query methods Pull Request resolved: https://github.com/pytorch/pytorch/pull/136683 Approved by: https://github.com/eqy	2024-09-26 15:07:31 +00:00
PyTorch MergeBot	d5e4a20c17	Revert "Introduce _ArglessActivation base class for parameterless activation functions (#136296 )" This reverts commit dda0e4de32b29098f25f9b2889423c9446680cc1. Reverted https://github.com/pytorch/pytorch/pull/136296 on behalf of https://github.com/atalman due to Breaks Internal CI. Error: Too many arguments [19]: Call `nn.modules.activation._ArglessActivation.__init__` expects 0 positional arguments, 1 was provided. ([comment](https://github.com/pytorch/pytorch/pull/136296#issuecomment-2377091280))	2024-09-26 14:12:12 +00:00
Joel Schlosser	4150ab44a4	Fix composite op redispatch for NJT in inference mode (#134683 ) Prior to this PR, calling `reshape()` under `inference_mode()` would throw a `NotImplementedError`. This is because `inference_mode()` disables autograd key dispatch, incidentally preventing the decomposition of reshape for NJT. This PR fixes this by redispatching on the `CompositeImplicitAutogradNestedTensor` key whenever a composite implicit op is encountered in `NJT.__torch_dispatch__()`. This fixes reshape and any other composite implicit ops underneath `inference_mode()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134683 Approved by: https://github.com/soulitzer, https://github.com/albanD ghstack dependencies: #136566	2024-09-26 14:10:53 +00:00
Joel Schlosser	f8debd5d83	Fix wrapper subclass reentrant dispatch + TorchDispatchMode (#136566 ) Fixes #136565 This PR makes the python fallback robust to the case where there are no active modes & no tensors with the Python key. In this case, simply redispatch with the Python key disabled. This was found when trying to use reentrant dispatch for NJT to get decompositions under `inference_mode()` when the autograd key is disabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136566 Approved by: https://github.com/bdhirsh	2024-09-26 14:06:51 +00:00
leslie-fang-intel	963e793e1b	[Inductor][CPP] Optimize WOQ INT8 wgt dequant in AMX GEMM template (#136630 ) Summary Optimize the WOQ int8 AMX performance by changing the int8 -> bf16 conversion. Earlier, 16 int8 elements were being loaded at a time & converted to 16 BF16 elements. With this change, 32 int8 elements will be loaded at a time, and converted to a cache-line of 32 BF16 elements more efficiently. Performance before ``` AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096) cpp_packed_gemm_0 38.0439 ms 100.0% _weight_int8pack_mm 50.2524 ms 75.7% SingleProcess AUTOTUNE benchmarking takes 1.1087 seconds and 1.9791 seconds precompiling AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008) cpp_packed_gemm_4 78.2038 ms 100.0% _weight_int8pack_mm 119.1962 ms 65.6% SingleProcess AUTOTUNE benchmarking takes 1.9274 seconds and 1.9949 seconds precompiling AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096) cpp_packed_gemm_6 79.2368 ms 100.0% _weight_int8pack_mm 118.3212 ms 67.0% SingleProcess AUTOTUNE benchmarking takes 1.9200 seconds and 2.0015 seconds precompiling AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000) cpp_packed_gemm_224 225.7201 ms 100.0% _weight_int8pack_mm 388.5588 ms 58.1% ``` Performance after this PR ``` AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096) cpp_packed_gemm_0 11.0086 ms 100.0% _weight_int8pack_mm 50.2918 ms 21.9% SingleProcess AUTOTUNE benchmarking takes 1.0837 seconds and 2.0301 seconds precompiling AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008) cpp_packed_gemm_4 24.3528 ms 100.0% _weight_int8pack_mm 119.8492 ms 20.3% SingleProcess AUTOTUNE benchmarking takes 1.8303 seconds and 1.8195 seconds precompiling AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096) cpp_packed_gemm_6 24.6148 ms 100.0% _weight_int8pack_mm 119.1908 ms 20.7% SingleProcess AUTOTUNE benchmarking takes 1.8315 seconds and 1.8352 seconds precompiling AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000) cpp_packed_gemm_224 78.1369 ms 100.0% _weight_int8pack_mm 387.6289 ms 20.2% SingleProcess AUTOTUNE benchmarking takes 4.5059 seconds and 1.8010 seconds precompiling ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136630 Approved by: https://github.com/jgong5 ghstack dependencies: #136353	2024-09-26 08:41:58 +00:00
Menglu Yu	77fba0c407	[PT2][Optimus] Fix a group batch fusion corner case (#136650 ) Summary: A user reported that the BA model raised "AttributeError: 'SymFloat' object has no attribute 'shape'", so we add a type check for the meta node. See more context in the post https://fb.workplace.com/groups/1075192433118967/permalink/1510477489590457/ Test Plan: # local reproduce ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split-batch-decompose --flow_id 646303196 ``` P1609807876 # E2E before fix f646303196 after fix Differential Revision: D63399959 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136650 Approved by: https://github.com/ezyang	2024-09-26 06:35:11 +00:00
Kurt Mohler	d1bb8e828f	Add deterministic path for CUDA `cumsum` (#136224 ) Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA. Fixes #89492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224 Approved by: https://github.com/ezyang, https://github.com/justinchuby	2024-09-26 04:52:05 +00:00
PyTorch MergeBot	b408591b53	Revert "[Flex Attention] fix block size order (#136657 )" This reverts commit 529b6ab0bb9f8800ed795ec8e4fa1f0e8042bb0a. Reverted https://github.com/pytorch/pytorch/pull/136657 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some test_flex_attention is failing in trunk after this change `529b6ab0bb` ([comment](https://github.com/pytorch/pytorch/pull/136657#issuecomment-2375824802))	2024-09-26 04:06:41 +00:00
cyy	3c542ce831	[Reland] Check function declarations of COREML code (#136070 ) Reland of #135467 by fixing periodic workflows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136070 Approved by: https://github.com/ezyang	2024-09-26 03:52:06 +00:00
Roy Hvaara	042af7ec53	[BE] [MPS] Use validation helper for input tensors (#134609 ) Small refactor to use already existing helper with equivalent behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134609 Approved by: https://github.com/malfet	2024-09-26 03:47:30 +00:00
rzou	e4d32d2194	Improve data-dependent-output meta kernel error message (#136671 ) Test Plan: - code reading Pull Request resolved: https://github.com/pytorch/pytorch/pull/136671 Approved by: https://github.com/williamwen42	2024-09-26 03:46:04 +00:00
xinan.lin	190e09d8b6	[Inductor UT] Generalize device-bias code introduced from #134874 and (#136596 ) [Inductor UT] Generalize device-bias code introduced from #134874 and fix unexpected success test cases. Fix #136595 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136596 Approved by: https://github.com/EikanWang, https://github.com/jansel Co-authored-by: Yu, Guangye <guangye.yu@intel.com>	2024-09-26 02:56:59 +00:00
eugenekoran	dda0e4de32	Introduce _ArglessActivation base class for parameterless activation functions (#136296 ) Fixes #133683 Fixes #133684 Fixes #133688 This PR introduces a new base class `_ArglessActivation` and refactors five existing activation functions to inherit from it. This change aims to improve documentation consistency and also API consistency with other activation functions that do have parameters and explicitly call `super().__init__()` Key changes and considerations: 1. Added new class `_ArglessActivation`: 2. Refactored the following classes to inherit from `_ArglessActivation`: - Sigmoid - Tanh - Softsign - Tanhshrink - Softmax2d 3. Performance consideration: - This change introduces a slight overhead for creating a new stack frame and handling an additional function call on every instance creation - The impact is expected to be minimal in most use cases Docs view before: <img width="425" alt="Screen Shot 2024-09-18 at 3 00 22 PM" src="https://github.com/user-attachments/assets/ca0d1000-44c5-4c52-b344-68f7e170bafe"> Docs view after: <img width="431" alt="Screen Shot 2024-09-18 at 3 00 52 PM" src="https://github.com/user-attachments/assets/f7ceb8f3-a2a2-4fd6-a2b8-39105a02bcbd"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136296 Approved by: https://github.com/mikaylagawarecki	2024-09-26 02:45:05 +00:00
rzou	d0456b4274	noop on torch.library APIs under torch::deploy (multipy) (#136645 ) Fixes https://github.com/pytorch/pytorch/issues/136177 The motivation is that torch::deploy doesn't handle this well. The workaround for users is to use C++ custom ops. All torch.library APIs ultimately go through the torch.library.Library object, so we add checks to noop for torch::deploy there. Test Plan: - new test - going to test this internally and hope nothing breaks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136645 Approved by: https://github.com/ezyang	2024-09-26 02:34:34 +00:00
Bin Bao	5c78c6b05a	[CI] Switch aarch64 dashboard run back to nightly (#136643 ) Summary: Reduce the frequency of the aarch64 dashboard CI run since we don't need to monitor its instability anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136643 Approved by: https://github.com/huydhn	2024-09-26 01:26:05 +00:00
Howard Huang	141cae2eb8	[pipelining] Fix more leaks and check leaks in tests (#136584 ) Fix two more leaks of the same variety as #136507 (see that PR desc and attached gdoc for debug details). This time, also add a test-time check that helped to discover new leaks and ensures we won't accidentally regress. Adds a `check_tensor_leak` util which internally asserts that no tensors are being kept alive by other objects involved in py ref cycles. Uses objgraph for a nice debug utility when a leak is found. Credit to @H-Huang for pointing out objdump and helping debug the `param_group["intermediates"]` leak. I manually confirmed that all 3 of the leaks identified/fixed so far are caught by the unit test and checker. Sample output, if I re-introduce a leak by commenting out `del param_group["intermediates"]` in _backward.py, and run `python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`: ``` warnings.warn( /data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5341: UserWarning: 34 tensors were found in the garbage. Did you introduce a reference cycle? warnings.warn( /data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5347: UserWarning: Dumping first 1 objgraphs of leaked tensors rendered to png Graph written to /tmp/objgraph-ztz642h3.dot (19 nodes) Graph viewer (xdot) not found, generating a png instead Image generated as /tmp/objgraph-ztz642h3.png ``` rendering of `/tmp/objgraph-ztz642h3.png`: <img width="1671" alt="image" src="https://github.com/user-attachments/assets/9098ff29-224c-4533-935b-83c210ac2e22"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136584 Approved by: https://github.com/kwen2501, https://github.com/H-Huang ghstack dependencies: #136507 Co-authored-by: Howard Huang <howardhuang@fb.com>	2024-09-26 01:10:40 +00:00
Nichols A. Romero	e8f1dd6ba0	Fix hardcoded ROCm paths in `Caffe2Targets.cmake` (#136283 ) Fixes #131701 Use CMake imported targets more consistently to eliminate hardcoded paths. Here are the relevant new sections of Caffe2Targets.cmake: ``` set_target_properties(c10_hip PROPERTIES INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include" INTERFACE_LINK_LIBRARIES "c10;hip::amdhip64" ) ``` ``` set_target_properties(torch_hip PROPERTIES INTERFACE_COMPILE_DEFINITIONS "USE_C10D_NCCL" INTERFACE_COMPILE_OPTIONS "-fPIC;-D__HIP_PLATFORM_AMD__=1;-DCUDA_HAS_FP16=1;-DUSE_ROCM;-D__HIP_NO_HALF_OPERATORS__=1;-D__HIP_NO_HALF_CONVERSIONS__=1;-DTORCH_HIP_VERSION=602;-Wno-shift-count-negative;-Wno-shift-count-overflow;-Wno-duplicate-decl-specifier;-DCAFFE2_USE_MIOPEN;-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP;-std=c++17;-DHIPBLAS_V2;-DHIP_NEW_TYPE_ENUMS" INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include" INTERFACE_LINK_LIBRARIES "c10_hip;torch_cpu_library;hip::amdhip64;MIOpen;hiprtc::hiprtc;roc::hipblaslt;roc::hipblas;hip::hipfft;hip::hiprand;roc::hipsparse;roc::hipsolver" ) ``` The HIPCUB dependency was not actually used, which is why it is removed here, as the imported target had undesirable side effects. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136283 Approved by: https://github.com/jeffdaily, https://github.com/Skylion007, https://github.com/jithunnair-amd, https://github.com/atalman	2024-09-26 00:34:43 +00:00
Zheng, Zhaoqiong	f3dd1721f4	[Update] Update note for Getting Started with PyTorch on Intel GPUs (#129946 ) Remove the hardware and software prerequisites and the environment setup part. Keep the prerequisites section and link to the PyTorch prerequisites for Intel GPUs page for driver installation, Intel support package installation, and environment setup: https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html Update the support for Intel Client GPU MTL-H. Update inference & training examples. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129946 Approved by: https://github.com/seemethere	2024-09-26 00:22:05 +00:00
PyTorch MergeBot	9223c16208	Revert "Fix constant propagation in builtins and UserClasses (#131354 )" This reverts commit dd4a51b39aa02cba23b3a387b41c5026770d9220. Reverted https://github.com/pytorch/pytorch/pull/131354 on behalf of https://github.com/atalman due to Breaks torchrec tests ([comment](https://github.com/pytorch/pytorch/pull/131354#issuecomment-2375417145))	2024-09-25 23:01:03 +00:00
Bin Bao	ecc15c4f89	[AOTI] Fix a missing aoti_torch_check symbol issue (#136669 ) Summary: When Inductor generates cpp kernels, they should be pure cpp loops which are independent to libtorch as much as possible. Differential Revision: D63403473 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136669 Approved by: https://github.com/henrylhtsang	2024-09-25 22:56:10 +00:00
Huy Do	b7a5c7d331	Do not XFAIL test_segfault in fbcode (#136661 ) https://github.com/pytorch/pytorch/pull/136252 silenced the failure on OSS, but the test actually passed on fbcode [T202241133](https://www.internalfb.com/intern/tasks/?t=202241133) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136661 Approved by: https://github.com/malfet	2024-09-25 22:26:24 +00:00
ratnampa	8d65d9f11b	Constraint setuptools to 72.1.0 or older in requirements.txt (#136489 ) FIXES: https://github.com/pytorch/pytorch/issues/136541 Setuptools>=74.0.0 has deprecated support for some functions in distutils, so the builds run into errors such as ```AttributeError: module 'distutils' has no attribute '_msvccompiler'```. Also, the pytorch builds have setuptools pinned to 72.1.0 according to these PRs: https://github.com/pytorch/builder/pull/1995 and `89d9a8cf6f`. So, until there is a fix that changes the function usage in accordance with the latest setuptools, the 72.1.0 version works fine. Also observed in CI jobs: https://github.com/pytorch/pytorch/actions/runs/10979326524 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136489 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-25 22:06:05 +00:00
Xuan Zhang	c9d12f6360	[inductor][memory] add signpost event for memory pass (#136538 ) Add logging to scuba table for internal models. For verification, I triggered a sample workflow internally and checked the scuba table logging to make sure the `Paramaters` column has the expected loggings, see [here](https://fburl.com/scuba/workflow_signpost/39h7qo9s). Pull Request resolved: https://github.com/pytorch/pytorch/pull/136538 Approved by: https://github.com/yf225	2024-09-25 21:47:46 +00:00
rzou	b5c2a657ae	Add zou3519 to CODEOWNERS for HOPs (#136679 ) There are some tricky things that I want to guard against Pull Request resolved: https://github.com/pytorch/pytorch/pull/136679 Approved by: https://github.com/Chillee	2024-09-25 21:29:48 +00:00
Animesh Jain	289df45cee	Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422 )" (#136590 ) This reverts commit 7743149b2be4a9eba7e0997ccdc6abe552bec266. Reverts * https://github.com/pytorch/pytorch/pull/135503 * https://github.com/pytorch/pytorch/pull/135502 * https://github.com/pytorch/pytorch/pull/135422 This revert passes the following test. Earlier, the getitem would stay as a getitem in the FX graph. But now fake tensor propagation fails, saying that .item is called. It seems that torch function is not triggered during fake tensor propagation. ``` import torch from torch.nn.attention.flex_attention import BlockMask, _mask_mod_signature, _score_mod_signature, flex_attention from torch._inductor.lowering import make_pointwise, register_lowering from torch._inductor.virtualized import ops from torch.nn.attention.flex_attention import create_block_mask torch.set_default_device('cuda') flex_attention = torch.compile(flex_attention, dynamic=False) prefix_lengths = torch.arange(8) def prefix_lm(b, h, q, kv): return prefix_lengths[b] >= kv mask = create_block_mask(prefix_lm, 8, None, 512, 512, _compile=True) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136590 Approved by: https://github.com/Chillee	2024-09-25 21:10:43 +00:00
Boyuan Feng	529b6ab0bb	[Flex Attention] fix block size order (#136657 ) `create_block_mask` currently gives wrong BLOCK_SIZE and shape when using non-default block size `(128,128)`. This PR fixes the issue by using BLOCK_SIZE order `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136657 Approved by: https://github.com/Chillee, https://github.com/drisspg	2024-09-25 21:08:40 +00:00
Edward Yang	76b044d7cb	Don't actually import module when checking if its valid (#136548 ) Summary: If you actually import the module, you might end up with some import cycle situation where a module is imported too early and accesses things that are not initialized yet. Test Plan: sandcastle and ossci ``` TORCH_LOGS=+torch._inductor.codecache buck run mode/opt caffe2/benchmarks/dynamo:torchbench ``` Differential Revision: D63330224 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136548 Approved by: https://github.com/Skylion007	2024-09-25 20:47:32 +00:00
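The idea of validating a module name without executing the module can be sketched with the standard library. This is only an illustration: the helper name `module_exists` is hypothetical, and the commit message does not say which mechanism the PR actually uses.

```python
import importlib.util


def module_exists(name: str) -> bool:
    """Return True if `name` can be located, without executing that module.

    Hypothetical helper for illustration; not taken from the PR.
    Caveat: for a dotted name such as "pkg.sub", find_spec still imports
    the parent package "pkg" in order to search its __path__.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # ModuleNotFoundError (an ImportError) is raised when a parent
        # package is missing; ValueError when a loaded module has no spec.
        return False
```

Because only the import machinery's finders run, the top-level code of the target module is not executed, which is what avoids the "imported too early" cycle described above.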
atalman	11c5f9ac3b	Use amazon linux 2023 runners for Docker builds (#136544 ) Migrate these builds to linux 2023. We want to build and test the Docker images in CD. Looks like we are hitting this issue: https://github.com/docker/buildx/issues/379 when trying to build Docker on Amazon Linux 2023. The Conda Docker build is timing out, while the Manywheel build executes but fails because BUILDKIT is turned off: https://github.com/pytorch/pytorch/actions/runs/11036043157/job/30653543264?pr=136544 The proposed solution is to fix it in user_data. Please see: https://github.com/pytorch/test-infra/issues/5712 I see docker builds are executed successfully here: https://github.com/pytorch/pytorch/actions/runs/11040149229/job/30667448668?pr=136544 Work around the timeout problem (reported in https://bugzilla.redhat.com/show_bug.cgi?id=1537564 ) by configuring the number of open files per container to 1048576 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136544 Approved by: https://github.com/ZainRizvi Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-25 20:39:56 +00:00
Xinran / Allan Rui	13b0baf2a1	[FX] Update _inline_module util function to work with both args and kwargs (#136631 ) Summary: Previously, the `_inline_module` helper function only worked with submodules that have args specified. This diff updates the util function to look up input arguments in the submodule kwargs first, using placeholder node names, and then fall back to the list of args if the node name is not found. Test Plan: ``` buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_connected_fusions ``` Differential Revision: D63347675 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136631 Approved by: https://github.com/jfix71	2024-09-25 20:20:57 +00:00
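The kwargs-first lookup described in this commit can be sketched in plain Python. The function `resolve_inputs` and its signature are hypothetical, and the fallback rule (take the next unused positional argument) is an assumption; the commit message does not spell out how the positional fallback is indexed.

```python
from typing import Any, Dict, Sequence


def resolve_inputs(
    placeholder_names: Sequence[str],
    args: Sequence[Any],
    kwargs: Dict[str, Any],
) -> Dict[str, Any]:
    """Map each placeholder name to the value passed by the caller.

    Hypothetical sketch: keyword arguments are matched by placeholder
    name first; names not found there consume positional args in order
    (an assumption, not confirmed by the commit message).
    """
    remaining_args = iter(args)
    resolved = {}
    for name in placeholder_names:
        if name in kwargs:
            resolved[name] = kwargs[name]
        else:
            resolved[name] = next(remaining_args)
    return resolved


# A call like submodule(x_val, bias=b_val) with placeholders x and bias:
example = resolve_inputs(["x", "bias"], args=("x_val",), kwargs={"bias": "b_val"})
```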
Sunishchal Dev	a8ed873ba2	Add missing input "eps" to adam docs (#135191 ) Minor fix for missing input argument in the Adam optimizer docs page. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135191 Approved by: https://github.com/janeyx99	2024-09-25 20:17:23 +00:00
cyy	6aa6bd4ca5	[Distributed] [12/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136528 ) Follows #136439. A dangling reference to qualifiedName was found and fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136528 Approved by: https://github.com/kwen2501	2024-09-25 20:12:08 +00:00
Xiaozhu Meng	5a29a06aa3	[AMD][inductor] do not use float64 on AMD internally (#136441 ) Summary: Internal AMD triton seems to have issue with float64 constant: ``` ### Most recent error lines found on the logs: E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] ^ E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp8 = tl.broadcast_to((libdevice.llrint((tl.full([1], 1.00000000000000, tl.float64))(ks3.to(tl.float64)))) / ks1, [XBLOCK, RBLOCK]) E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp7 = tmp5 + tmp6 E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp6 = 0.5 E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp5 = tmp4.to(tl.float32) E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp4 = (((r3 + (x0((17 + (16ks0ks1)) // 18))) % ks2) // ks0) % ks1 E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp3 = tmp2.to(tl.int1) E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp2 = tmp0 < tmp1 E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp1 = 16ks0ks1 E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] tmp0 = r3 + (x0((17 + (16ks0*ks1)) // 18)) E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] r3 = rindex E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] rmask = rindex < rnumel E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] rindex = roffset + rbase E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] triton.compiler.errors.CompilationError: at 26:15: E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] return ast_to_ttir(self.fn, self, context=context, options=options, 
codegen_fns=codegen_fns) ``` Bisecting shows this error was introduced by D62465575. This diff avoids converting constants to float64 on AMD, and the emu1.4 predictor can now run on AMD with rocm6.0. Test Plan: works with rocm6.0 ``` TORCHINDUCTOR_AUTOTUNE_REMOTE_CACHE=1 HIP_FORCE_DEV_KERNARG=1 HIP_GRAPH=--use-cuda-graph PYTORCH_MIOPEN_SUGGEST_NHWC=1 TORCHINDUCTOR_LAYOUT_OPTIMIZATION=1 CUDA_VISIBLE_DEVICES="2" TORCH_LOGS="recompiles,cudagraphs" buck2 run @//mode/opt-amd-gpu -c fbcode.rocm_ck_rtz=true -m rocm60 fblearner/predictor/py/applications/photogen:ip_python_predictor_photogen_cm -- --model=photogen_v1p4_9b --thrift_server_port=15008 --max_predict_calls=1 --enable_tunable_op --load_from_torch_package=genai:937233660_1 ``` The emu1.4 predictor on AMD fails with rocm6.2 due to some other triton errors (https://www.internalfb.com/phabricator/paste/view/P1603842354) Differential Revision: D63263806 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136441 Approved by: https://github.com/houseroad	2024-09-25 19:13:17 +00:00
Zain Rizvi	37f340c1e5	[EZ] Remove remaining amz2023 runner variant references (#136540 ) Validated no jobs use the amz2023 runner variant anymore ([proof](https://github.com/search?type=code&q=org%3Apytorch+%2F%5Cbamz2023%5Cb%2F+&p=1)) so removing all references to it Explicit references to the amz2023 runner type variants were removed in the following PRs: - https://github.com/pytorch/ignite/pull/3285 - https://github.com/pytorch/ao/pull/887 - https://github.com/pytorch/fbscribelogger/pull/1 - https://github.com/pytorch/pytorch/pull/134355 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136540 Approved by: https://github.com/huydhn, https://github.com/malfet	2024-09-25 19:01:00 +00:00
David Berard	9c2c61d2dd	[inductor] ELEMENTS_PER_WARP_32 -> ONE_ELEMENT_PER_THREAD (#136472 ) AMD devices have 64 threads per warp; this PR makes the handling of "ELEMENTS_PER_WARP_32" generic and uses DeviceProperties.warp_size to determine the warp size instead of hard-coding it as 32. It also renames the enum value. Added a unit test for this. Note: I left the old enum option (ELEMENTS_PER_WARP_32) as is instead of renaming it. I'm not sure whether we should expect caches to get invalidated here; if this concern is valid, then there's a risk that this would get updated, but some model could use the cached inductor code, which would reference "ELEMENTS_PER_WARP_32", which would no longer exist. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136472 Approved by: https://github.com/jansel	2024-09-25 18:21:09 +00:00
cyy	a259fbf72c	[2/N] Fix clang-tidy warnings in torch/csrc/lazy (#136634 ) Follows #134655 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136634 Approved by: https://github.com/Skylion007	2024-09-25 18:08:29 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	0b38fa154a	Fix meta registry in export (#136492 ) Summary: Title Test Plan: CI This fixes some breaking tests in executorch. I think the root cause is that when we have aten::matmul, which we are not preserving, we register the meta implementation from the C++ side. It seems like the C++ kernel doesn't work well with a mix of FakeTensor and real tensor. This PR sidesteps the problem by always preferring the Python CIA decomp over the C++ CIA decomp Differential Revision: D63297050 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136492 Approved by: https://github.com/bdhirsh	2024-09-25 17:53:02 +00:00
Justin Chu	8582835499	[ONNX] Remove the operators test (#136335 ) The tests are obsolete and hard to maintain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136335 Approved by: https://github.com/xadupre, https://github.com/cyyever Co-authored-by: Edward Z. Yang <ezyang@meta.com>	2024-09-25 17:44:18 +00:00
Edward Z. Yang	7cb6d31567	Dump partially traced make_fx graph in event of error to tlparse (#136508 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136508 Approved by: https://github.com/zou3519, https://github.com/bdhirsh, https://github.com/malfet ghstack dependencies: #136533	2024-09-25 17:44:15 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	9409274bc1	Fix bug in functional tensor decomp (#136600 ) Summary: Previously we had a very bad bug where we didn't allow any decomp on CIA. This never mattered before because we never had to actually push the CIA decomp to the Python key level in export. Test Plan: CI Differential Revision: D63363749 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136600 Approved by: https://github.com/bdhirsh	2024-09-25 17:37:50 +00:00
David Berard	5d7ed02f52	[user-written triton kernels] specialize exprs if they are expected to be tl.constexpr (#136512 ) Fixes #136504 If you have a tl.constexpr parameter to a triton kernel, and you pass in a SymNode, then, right now, you run into failures (see under 'constants'): ``` File "/tmp/torchinductor_dberard/na/cnax67r5zmslz7bvdfizteaepj7fajpjallb3bu2gyetjcdqtbzj.py", line 14, in <module> triton_meta={'signature': {0: 'fp32', 1: 'fp32'}, 'device': DeviceProperties(type='cuda', index=0, cc=90, major=9, regs_per_multiprocessor=65536, max_threads_per_multi_processor=2048, multi_processor_count=132, warp_size=32), 'constants': {2: s0, 3: 256}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1), equal_to_1=())]}, torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: NameError: name 's0' is not defined ``` To fix this, we specialize on the value during dynamo tracing, so that we have a real integer when we do codegen. Alternatives: specialize somewhere else (e.g. inductor); or figure out how to actually pass the value dynamically into the user-written kernel. However, if we try to pass a dynamic value, then we wouldn't be able to precompile the triton kernels in inductor or use AOTI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136512 Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/eellison	2024-09-25 17:12:11 +00:00
Pian Pawakapan	7c6d543a5b	[export] fix _get_non_persistent_buffers for duplicates (#136552 ) Summary: Export's method _get_non_persistent_buffers doesn't check duplicate submodules, so we run into state_dict related issues if non-persistent buffers exist on shared submodules. Test Plan: test_export Differential Revision: D63332976 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136552 Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan	2024-09-25 16:46:31 +00:00
Sahan Paliskara	aa80b82cea	[hygiene] Delete dead alerting code (#136583 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136583 Approved by: https://github.com/clee2000	2024-09-25 15:44:46 +00:00
Sergii Dymchenko	0232278b33	Fix comment posting permissions for check-labels.yml (#136610 ) Currently it fails with Error fetching https://api.github.com/repos/pytorch/pytorch/issues/136607/comments HTTP Error 403: Forbidden (see https://github.com/pytorch/pytorch/actions/runs/11026434368/job/30622960113?pr=136607) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136610 Approved by: https://github.com/malfet	2024-09-25 15:43:19 +00:00
Huy Do	34711fe8c9	Fix test_skip_data_serialization pickle exception match (#136617 ) The test is failing in trunk atm with the following error: ``` test_serialization.py::TestSerialization::test_skip_data_serialization_materialize_fake_False - AssertionError: "Can't pickle local object 'WeakValueDictionary.__init__.<locals>.remove'" does not match "Can't get local object 'WeakValueDictionary.__init__.<locals>.remove'" ``` for example, `36f0e61166` This comes from this cpython commit `a3076c734d`, and manifests in python 3.12.5 currently used in CI. The failure doesn't happen when I try it out with 3.12.3 and 3.12.4. Looking at the commit logs of https://github.com/python/cpython/commits/main/Lib/pickle.py, it looks like the exception message is changing back and forth, so I guess a regex match would capture both.	2024-09-25 08:35:46 -07:00
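A pattern that accepts both wordings quoted in the failure above might look like the following. The constant name `LOCAL_OBJECT_RE` is made up for illustration, and the exact regex the test ended up using is not shown in this log.

```python
import re

# Accept either wording of the pickling error, since the message differs
# between Python patch releases (both strings are quoted in the failure above).
LOCAL_OBJECT_RE = re.compile(r"Can't (get|pickle) local object")

messages = [
    "Can't pickle local object 'WeakValueDictionary.__init__.<locals>.remove'",
    "Can't get local object 'WeakValueDictionary.__init__.<locals>.remove'",
]
matches = [bool(LOCAL_OBJECT_RE.search(m)) for m in messages]
```

The same pattern string can be passed to `assertRaisesRegex` in a `unittest` test, which applies `re.search` to the exception message.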
Catherine Lee	deb820602a	viable/strict update: log push to s3 (#136470 ) As stated in https://github.com/pytorch/test-infra/pull/5686, I cannot figure out a way to determine the push time from webhooks (other than when the webhook was sent, but that isn't super accurate either). Instead, manually save a json file to s3 that contains information for the sha and date so that we can still get this information. Relies on https://github.com/pytorch/test-infra/pull/5690 tested in https://github.com/pytorch/pytorch/pull/136387 (but I squashed so it's kinda hard to find now) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136470 Approved by: https://github.com/huydhn	2024-09-25 15:28:53 +00:00
PyTorch MergeBot	e3b89ca124	Revert "Add deterministic path for CUDA `cumsum` (#136224 )" This reverts commit b1a02bf70824a4802411ddd5be1d3610e7a2e269. Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/ezyang due to Failing internal CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2374201626))	2024-09-25 14:11:01 +00:00
Bin Bao	20a855bf01	[AOTI] Move stack_allocation logic from PythonWrapperCodegen (#136463 ) Summary: Move stack_allocation logic from PythonWrapperCodegen to CppWrapperCpuArrayRef Differential Revision: [D63319970](https://our.internmc.facebook.com/intern/diff/D63319970) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136463 Approved by: https://github.com/chenyang78 ghstack dependencies: #136062, #136461, #136462	2024-09-25 14:06:33 +00:00
PyTorch MergeBot	5171b0e3c6	Revert "[ONNX] Remove the operators test (#136335 )" This reverts commit 9629835b1ccce8e72fc93bf95be13e3d53cb4871. Reverted https://github.com/pytorch/pytorch/pull/136335 on behalf of https://github.com/ezyang due to I'll reland this, bear with me ([comment](https://github.com/pytorch/pytorch/pull/136335#issuecomment-2374183435))	2024-09-25 14:06:03 +00:00
Bin Bao	070952aca5	[AOTI] Move stack_allocation logic from CppWrapperCpu (#136462 ) Summary: Move stack_allocation logic from CppWrapperCpu to CppWrapperCpuArrayRef Differential Revision: [D63300359](https://our.internmc.facebook.com/intern/diff/D63300359) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136462 Approved by: https://github.com/chenyang78 ghstack dependencies: #136062, #136461	2024-09-25 14:03:03 +00:00
Bin Bao	5ad5f40283	[AOTI][reland] Create another wrapper class to handle ArrayRef (#136461 ) Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code. Test Plan: CI Differential Revision: [D63300361](https://our.internmc.facebook.com/intern/diff/D63300361) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136461 Approved by: https://github.com/angelayi, https://github.com/chenyang78 ghstack dependencies: #136062	2024-09-25 14:00:09 +00:00
Edward Z. Yang	25ab87c09b	Add lint rule META_NO_CREATE_UNBACKED (#135870 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135870 Approved by: https://github.com/albanD	2024-09-25 13:33:56 +00:00
Tom Ritchford	dd4a51b39a	Fix constant propagation in builtins and UserClasses (#131354 ) * Fixes https://github.com/pytorch/pytorch/issues/118675 * Replaces https://github.com/pytorch/pytorch/pull/118994 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131354 Approved by: https://github.com/jansel, https://github.com/anijain2305	2024-09-25 13:03:40 +00:00
Jez Ng	a0c76ea853	Make test_skip_data_serialization regex more flexible (#136580 ) Some CI machines seem to throw "Can't get local object" rather than "Can't pickle local object". Pull Request resolved: https://github.com/pytorch/pytorch/pull/136580 Approved by: https://github.com/mikaylagawarecki	2024-09-25 11:27:23 +00:00
IvanKobzarev	370c1c4297	[aotd] Fix rrelu compilation (#136008 ) Issues: https://github.com/pytorch/pytorch/issues/135083 https://github.com/pytorch/pytorch/issues/120292 The rrelu decomposition contains a mutation, copy_. Decompositions are executed below Functionalization; as a result, AOT produces a non-functional graph. Also, that decomposition is registered as a python_dispatch kernel for AutogradCUDA. Autograd dispatch happens above Functionalization, so registering it for Autograd to handle all backends makes functionalization run after it. Testing: ``` python test/functorch/test_aotdispatch.py -k test_rrelu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136008 Approved by: https://github.com/bdhirsh	2024-09-25 11:26:19 +00:00
Wu, Chunyuan	c3fdf587b5	[inductor] [cpp] fix the check of template_buffer_has_other_users if no epilogue_nodes (#136518 ) The `template_buffer_has_other_users` function checks the case where there are epilogue nodes and the template output has users other than these epilogue nodes. When there are no epilogue nodes, the function can return `False` directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136518 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5 ghstack dependencies: #136418	2024-09-25 10:25:07 +00:00
Jokeren	cabfbef6cf	[pytorch][PR] [inductor] More fixes on the keys of `constants` and `signature` dictionaries (#136514 ) Summary: Previous PR forgets to change two other places that also create `constants` and `signature`. Test Plan: Imported from GitHub, without a `Test Plan:` line. {F1884584338} Differential Revision: D63027728 Pulled By: Myrthan Co-authored-by: Jokeren <robinho364@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136514 Approved by: https://github.com/jansel Co-authored-by: Jokeren <robinho364@gmail.com>	2024-09-25 09:34:14 +00:00
Wu, Chunyuan	2e30c160ef	[inductor] [cpp] fix max-autotune for single-thread dynamic shapes (#136418 ) Fixes the compilation error of max-autotune for `maml_omniglot` (AMP and FP32) and `soft_actor_critic` (AMP) in Torchbench for single-thread dynamic shapes case: ``` /tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp: In function ‘void kernel(const bfloat16, const bfloat16, const bfloat16, bfloat16, int64_t)’: /tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:279:41: error: the value of ‘Mr_blocks’ is not usable in a constant expression 279 \| constexpr int64_t m_block_end = Mr_blocks; \| ^~~~~~~~~ /tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:237:19: note: ‘Mr_blocks’ was not initialized with a constant expression 237 \| const int64_t Mr_blocks = (M + Mr - 1) / Mr; \| ^~~~~~~~~ ``` The PR also updates the UT to add a test for `BS`=512 in single thread. The previous case has `BS`=1024 equal to the `K` and `N` value. The generated code does not have symbolic shapes thus fails to capture the above issue. By adding a case of `BS`=512, the generated code will have symbolic shape for the M dim and is able to reproduce the issue that this PR is addressing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136418 Approved by: https://github.com/jgong5	2024-09-25 09:24:05 +00:00
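The compile error above comes down to `Mr_blocks = (M + Mr - 1) / Mr` being a ceiling division over `M`, which is only known at run time when the M dimension is symbolic. A small arithmetic sketch of that dependency follows; the register-block size `Mr = 32` is an assumed value for illustration, since the commit message does not state it.

```python
def ceil_div(m: int, mr: int) -> int:
    """Number of blocks of size `mr` needed to cover `m` rows."""
    return (m + mr - 1) // mr


MR = 32  # assumed block size, for illustration only

# The block count changes with M, so it cannot be a compile-time constant
# when M is a dynamic (symbolic) dimension.
blocks_for_512 = ceil_div(512, MR)
blocks_for_513 = ceil_div(513, MR)
```

With a static shape every operand is a literal and the result folds to a constant, which is consistent with the note that the earlier `BS`=1024 test case did not expose the issue.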
Anatoly Myachev	a0a1873148	[Inductor] Fix Triton tests after updating pybind11 to 2.13.6 (#136280 ) https://github.com/pytorch/pytorch/pull/136087 updated pybind11 to 2.13.6, and that release adds a feature exposed through [a new function](https://pybind11.readthedocs.io/en/latest/changelog.html#version-2-13-6-september-13-2024) `_pybind11_conduit_v1_`. The presence of this function breaks the serialization mechanisms used by Triton and in PyTorch itself. Possible errors that have been noticed due to this change: <details> <summary> the first error </summary> ```bash _________ KernelTests.test_layout_constraint_needs_fixed_stride_order __________ Traceback (most recent call last): File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1072, in test_layout_constraint_needs_fixed_stride_order eager_out = f(x) File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1068, in f arange_out(x, y) File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1059, in arange_out kernel[grid](x, out, n_elements, BLOCK_SIZE=4) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in <lambda> return lambda args, kwargs: self.run(grid=grid, warmup=False, args, *kwargs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/runtime/jit.py", line 657, in run kernel = self.compile( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/compiler/compiler.py", line 315, in compile metadata_group[metadata_filename] = fn_cache_manager.put(json.dumps(metadata, default=vars), metadata_filename, File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/__init__.py", line 234, in dumps return cls( File 
"/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/encoder.py", line 199, in encode chunks = self.iterencode(o, _one_shot=True) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/encoder.py", line 257, in iterencode return _iterencode(o, 0) TypeError: vars() argument must have __dict__ attribute ``` </details> <details> <summary> the second error </summary> ```bash ________________ TestTritonWrapper.test_wrapper_using_gpu_seed _________________ Traceback (most recent call last): File "/cache/pytorch-c5e9d03a2da4b93481737594cbe2f5931fa569aa833f206a638189cad2c36d3c-11/test/inductor/test_triton_wrapper.py", line 40, in test_wrapper_using_gpu_seed out = f(x, y) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn return fn(args, *kwargs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1292, in __call__ return self._torchdynamo_orig_callable( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1087, in __call__ result = self._inner_convert( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 530, in __call__ return _compile( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 933, in _compile guarded_code = compile_inner(code, one_graph, hooks, transform) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 675, in compile_inner return _compile_inner(code, one_graph, hooks, transform) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_utils_internal.py", line 87, in wrapper_function return function(args, *kwargs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 708, in _compile_inner out_code = 
transform_code_object(code, transform) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object transformations(instructions, code_options) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 220, in _fn return fn(args, kwargs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 643, in transform tracer.run() File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2776, in run super().run() File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 979, in run while self.step(): File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 891, in step self.dispatch_table[inst.opcode](self, inst) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2967, in RETURN_VALUE self._return(inst) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2952, in _return self.output.compile_subgraph( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1117, in compile_subgraph self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1369, in compile_and_call_fx_graph compiled_fn = self.call_user_compiler(gm) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1416, in call_user_compiler return self._call_user_compiler(gm) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1465, in 
_call_user_compiler raise BackendCompilerFailed(self.compiler_fn, e).with_traceback( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1446, in _call_user_compiler compiled_fn = compiler_fn(gm, self.example_inputs()) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__ compiled_gm = compiler_fn(gm, example_inputs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/__init__.py", line 2235, in __call__ return compile_fx(model_, inputs_, config_patches=self.config) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1528, in compile_fx return aot_autograd( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/backends/common.py", line 72, in __call__ cg = aot_module_simplified(gm, example_inputs, self.kwargs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1071, in aot_module_simplified compiled_fn = dispatch_and_compile() File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1056, in dispatch_and_compile compiled_fn, _ = create_aot_dispatcher_function( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 522, in create_aot_dispatcher_function return _create_aot_dispatcher_function( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 759, in _create_aot_dispatcher_function compiled_fn, fw_metadata = compiler_fn( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 179, in aot_dispatch_base compiled_fw = compiler(fw_module, updated_flat_args) File 
"/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1357, in fw_compiler_base return _fw_compiler_base(model, example_inputs, is_inference) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1428, in _fw_compiler_base return inner_compile( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 479, in compile_fx_inner return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/repro/after_aot.py", line 85, in debug_wrapper inner_compiled_fn = compiler_fn(gm, example_inputs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 665, in _compile_fx_inner compiled_graph = FxGraphCache.load( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1341, in load compiled_graph = compile_fx_fn( File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 574, in codegen_and_compile compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 882, in fx_codegen_and_compile compiled_fn = graph.compile_to_fn() File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1952, in compile_to_fn return self.compile_to_module().call File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1878, in compile_to_module return self._compile_to_module() File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1906, in _compile_to_module mod = PyCodeCache.load_by_key_path( File 
"/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2866, in load_by_key_path mod = _reload_python_module(key, path) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module exec(code, mod.__dict__, mod.__dict__) File "/tmp/tmps59zkbew/kg/ckgkb4gt5fs5pll4o7fqawppsmdezu5h52cq6nmrvi3yy6j7ddq4.py", line 45, in <module> File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/async_compile.py", line 198, in triton kernel = TritonCodeCache.load(kernel_name, source_code) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2916, in load return _module_to_triton_kernel(PyCodeCache.load(source_code), kernel_name) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2853, in load return cls.load_by_key_path(key, path, linemap, attrs) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2866, in load_by_key_path mod = _reload_python_module(key, path) File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 39, in _reload_python_module raise RuntimeError( torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: RuntimeError: Failed to import /tmp/tmps59zkbew/g3/cg3zgxsidsjhdlz2lzvajvubdq6kg2x2hzd2kznfj43qwvlv33du.py SyntaxError: invalid syntax (cg3zgxsidsjhdlz2lzvajvubdq6kg2x2hzd2kznfj43qwvlv33du.py, line 14) ``` </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136280 Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/EikanWang Co-authored-by: Henry Schreiner <HenrySchreinerIII@gmail.com>	2024-09-25 08:09:46 +00:00
Pei-Hsuan Wu	1cb265fafa	[AILab][attempt2] Add TryExcept when decoding healthcheck port (#136574 ) Summary: ## Context The first attempt has lint error in OSS https://hud.pytorch.org/pr/pytorch/pytorch/136438#30553902641 {F1886895223} ## This Diff Fix error message with try catch Error Message: ``` File "/packages/aps_models.examples.dlrm.lite/dlrm_train_app-inplace#link-tree/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 224, in _setup_healthcheck port=int(healthcheck_port), ValueError: invalid literal for int() with base 10: \'%port.thrift%\' ``` Test Plan: ``` arc lint ``` Reviewed By: felixsu2006 Differential Revision: D63343041 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136574 Approved by: https://github.com/atalman	2024-09-25 04:43:51 +00:00
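The `ValueError` quoted in #136574 can be reproduced outside the agent. The following is a minimal sketch and not the code from the PR: `parse_healthcheck_port` is a hypothetical helper name, and falling back to `None` is an assumption about what a caller might want when the port cannot be decoded.

```python
from typing import Optional


def parse_healthcheck_port(raw: Optional[str]) -> Optional[int]:
    """Return the port as an int, or None if it is missing or not numeric.

    Hypothetical helper: the name and the None fallback are illustrative only.
    """
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        # An unexpanded template value such as '%port.thrift%' ends up here
        # instead of propagating out of the setup code.
        return None


print(parse_healthcheck_port("8080"))
print(parse_healthcheck_port("%port.thrift%"))
```

Catching only `ValueError` keeps unrelated failures visible, which is usually preferable to a bare `except`.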
Nikita Shulga	561cd5a0a6	[BE] Use C++17 convetion methods in CUDA kernels (#136575 ) - `std::is_same<X, Y>::value` -> `std::is_same_v<X, Y>` - `std::enable_if<C, T>::type` -> `std::enable_if_t<C, T>` And so on Pull Request resolved: https://github.com/pytorch/pytorch/pull/136575 Approved by: https://github.com/Skylion007, https://github.com/eqy	2024-09-25 04:30:01 +00:00
Nikita Shulga	5340feb8aa	Disable iOS workflow (#136571 ) See https://github.com/pytorch/pytorch/issues/136284 It's been broken for more than a week and it does not seem like anyone cares about fixing it. Once it's landed I'll reassign the issue to `oncall: mobile` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136571 Approved by: https://github.com/huydhn, https://github.com/kit1980	2024-09-25 04:29:34 +00:00
Bin Bao	1c9a1a2a19	[AOTI] Support MKL linear ops in cpp wrapper (#134974 ) Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support mkl linear in the ABI-compatible mode for cpp-wrapper Inductor. Differential Revision: [D63322202](https://our.internmc.facebook.com/intern/diff/D63322202) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134974 Approved by: https://github.com/chenyang78, https://github.com/leslie-fang-intel Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>	2024-09-25 03:53:11 +00:00
chilli	0200ad3457	Turn on unique kernel names (#136503 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136503 Approved by: https://github.com/ezyang, https://github.com/eellison ghstack dependencies: #136509	2024-09-25 03:39:45 +00:00
Nichols A. Romero	482fe186b9	Add ROCm documentation to libtorch (C++) reST. (#136378 ) Fixes #126640 Added ROCm support section to libtorch (C++) reST. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136378 Approved by: https://github.com/ezyang	2024-09-25 02:30:56 +00:00
leslie-fang-intel	3c7edf1ec0	[Inductor][CPP] Fix int8 cvt half (#136353 ) Fix the correctness issue of https://github.com/pytorch/ao/pull/884/. The current implementation for converting between `Half/BFloat16` and `int8/uint8` incorrectly assumes that 1/4 of the int8/uint8 vector lane maps to 1/2 of the Half/BFloat16 vector lane. This assumption leads to accuracy issues after the full bit-width vectorization of the Half data type was introduced. When converting between int8 weights and the half data type, the generated code is as follows: ``` #include "/tmp/torchinductor_leslie/xw/cxww3s7wxrujoyxna7mlcjktid2uu6nntixqwm542xfkd756gl3x.h" extern "C" void kernel(const int8_t* in_ptr0, half* out_ptr0) { { for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2048L); x0+=static_cast<int64_t>(32L)) { auto tmp0 = at::vec::Vectorized<int8_t>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(32)); auto tmp1 = at::vec::convert<half>(tmp0); tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(32)); } } } ``` In this PR, we address the issue by changing the implementation to convert 1/2 of the int8/uint8 vector lane into a full vector lane of Half/BFloat16. Test Plan * AO: `python test/integration/test_integration.py -k test_int8_weight_only_quant_subclass_api` * `python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_convert_int8_to_half_vec` * Due to the CPP backend legalization pass, we are unable to create a unit test to simulate the conversion from `Half` to `int8`. Instead, we rely on a C++ test case. * `./build/bin/vec_test_all_types_AVX512 --gtest_filter="VecConvertTestsReducedFloat/*.ConvertReduced"` `./build/bin/vec_test_all_types_AVX2 --gtest_filter="VecConvertTestsReducedFloat/*.ConvertReduced"` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136353 Approved by: https://github.com/jgong5, https://github.com/jerryzh168	2024-09-25 02:23:43 +00:00
eqy	8225e7706e	[CUDA][Expandable Segments] Account for non-gc'able memory in expandable segments tests (#136496 ) Seems like some other tests are holding onto memory that is not gc'able (e.g., cuBLAS workspaces), so these tests while working in isolation fail when run as e.g., `python test/test_cuda.py -k able` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136496 Approved by: https://github.com/ezyang	2024-09-25 01:14:45 +00:00
Will Cromar	5233b5a448	Update PyTorch/XLA CI image to Python 3.10 (#135278 ) The old image used Python 3.8. Corresponding XLA PR: https://github.com/pytorch/xla/pull/7953 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135278 Approved by: https://github.com/JackCaoG, https://github.com/atalman	2024-09-25 00:53:39 +00:00
eqy	670d64a802	[SDPA][Nested Tensor] Bump `grad_query` fudge factor for small GPUs (#135715 ) Similar to #135711, here we see a ~1/1000 mismatch with absolute value ~0.0016 when 0.001 is allowed Pull Request resolved: https://github.com/pytorch/pytorch/pull/135715 Approved by: https://github.com/drisspg	2024-09-25 00:36:10 +00:00
Pearu Peterson	8f2a4cc4b1	Tune bsr_dense_addmm for int8 inputs on A100 (#136088 ) As in the title. The tuning is done for dimensions 1280 and 5120 that are used in Vit-H. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136088 Approved by: https://github.com/cpuhrsch	2024-09-25 00:24:12 +00:00
Justin Chu	9629835b1c	[ONNX] Remove the operators test (#136335 ) The tests are obsolete and hard to maintain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136335 Approved by: https://github.com/xadupre	2024-09-24 23:08:48 +00:00
Edward Z. Yang	b57d67e263	Add isuruf to core reviewers (#136554 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136554 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-09-24 23:06:46 +00:00
angelayi	210b136c07	[export] Add experimental swap API (#136190 ) Prototyped the following API which takes in an ExportedProgram, a dictionary of fqn to modules to swap, and returns a (unlifted) GraphModule ``` _swap_modules( ep: ExportedProgram, modules_to_swap: Dict[str, torch.nn.Module] ) -> torch.fx.GraphModule: ``` Differential Revision: [D62879819](https://our.internmc.facebook.com/intern/diff/D62879819) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136190 Approved by: https://github.com/avikchaudhuri	2024-09-24 22:50:44 +00:00
PyTorch MergeBot	706eda5cd8	Revert "[RFC][torchelastic][c10d] Fix store prefix race in rendezvous (#135957 )" This reverts commit 5033a1ca0dd22dae34a8939add33dbebfe0fd31d. Reverted https://github.com/pytorch/pytorch/pull/135957 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/135957#issuecomment-2372493186))	2024-09-24 22:24:26 +00:00
William Wen	ae80bce496	[dynamo] refactor resume_execution.py to use bytecode templates (#136483 ) Use bytecode from template instead of hardcoding bytecode in resume_execution.py. Gets rid of a lot of Python-version dependent bytecode generation. Also makes resume_execution.py easier to support in future Python version updates. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136483 Approved by: https://github.com/jansel, https://github.com/anijain2305	2024-09-24 22:20:26 +00:00
Nikita Shulga	36f0e61166	[BE] Use nested namespace in ATen/native/cuda (#136570 ) It's a nice C++17 feature Pull Request resolved: https://github.com/pytorch/pytorch/pull/136570 Approved by: https://github.com/Skylion007	2024-09-24 22:19:10 +00:00
Jeff Daily	1d3af68202	[ROCm] install_miopen.sh exit for ROCm >= 6.3 (#136436 ) Follow up to #132555. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136436 Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/atalman	2024-09-24 22:15:26 +00:00
Justin Chu	780f4debdb	[ONNX] Remove _optimize_graph from public init (#136279 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136279 Approved by: https://github.com/xadupre ghstack dependencies: #136281	2024-09-24 22:00:55 +00:00
Edward Z. Yang	00bc17555a	Don't try to evaluate sympy.Eq in replacement; we knew this wouldn't simplify since we are here (#136533 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136533 Approved by: https://github.com/isuruf, https://github.com/pianpwk	2024-09-24 21:52:25 +00:00
Kurt Mohler	b1a02bf708	Add deterministic path for CUDA `cumsum` (#136224 ) Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA. Fixes #89492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224 Approved by: https://github.com/ezyang, https://github.com/justinchuby	2024-09-24 21:34:43 +00:00
PyTorch MergeBot	0133fbcfe7	Revert "Correctly convert Python float to float64 when passing argument as Tensor (#136413 )" This reverts commit f0f79dd8f1df6cf6342c9c23ae3a9be0f74eb9f5. Reverted https://github.com/pytorch/pytorch/pull/136413 on behalf of https://github.com/ezyang due to forward fix is stuck, revert this ([comment](https://github.com/pytorch/pytorch/pull/136413#issuecomment-2372404873))	2024-09-24 21:20:37 +00:00
Bin Bao	95c0f7493f	[Inductor] Rename WrapperCodeGen to PythonWrapperCodegen (#136062 ) Summary: Rename WrapperCodeGen to PythonWrapperCodegen to make its meaning more explicit. Differential Revision: [D63300358](https://our.internmc.facebook.com/intern/diff/D63300358) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136062 Approved by: https://github.com/angelayi, https://github.com/chenyang78	2024-09-24 21:02:51 +00:00
Yifu Wang	da1560c49f	[SymmetricMemory] add support for cuStreamWriteValue32 (#136488 ) cuStreamWriteValue efficiently combines the issuing of a system-level fence with the update of a single memory location. It is highly suitable for inter-stream progress sharing (e.g., all_gather_with_progress). Exposing it via SymmetricMemory allows users to more easily implement efficient progress-aware matmuls in triton ([xformers example](https://github.com/facebookresearch/xformers/blob/main/xformers/ops/_triton/sequence_parallel_fused_kernels.py)). Pull Request resolved: https://github.com/pytorch/pytorch/pull/136488 Approved by: https://github.com/eqy, https://github.com/Chillee	2024-09-24 20:56:29 +00:00
Justin Chu	7c777dd587	[ONNX] Unify ONNXProgram and remove the old one (#136281 ) ## Note `test_fx_to_onnx_with_onnxruntime.py` is removed for now (it has a lot of xfails anyways). A better version will be added back. Fixes https://github.com/pytorch/pytorch/issues/136274 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136281 Approved by: https://github.com/xadupre, https://github.com/albanD	2024-09-24 20:52:19 +00:00
Will Constable	dbc3356655	[pipelining] fix py ref cycle in stage_backward (#136507 ) TLDR; found forward activation tensors were being kept alive "forever" (or until GC ran), and tracked it down to a cycle involving `stage_backward.<locals>.extract_tensors_with_grads`. The reference cycle in question is below. (constructed using gc.get_referrers after doing a gc.collect in gc debug mode) tensor is kept alive by `[(<class 'cell'>, '0x7f7360234400')]` tuple of cell objects `(<cell at 0x7f73602343d0: function object at 0x7f734fff0ee0>, <cell at 0x7f7360234400: list object at 0x7f734e4d9a80>, <cell at 0x7f73602a4190: list object at 0x7f734eff8b00>)` is kept alive by `[(<class 'function'>, '0x7f734fff0ee0')]` `<function stage_backward.<locals>.extract_tensors_with_grads at 0x7f734fff0ee0>` is kept alive by `[(<class 'cell'>, '0x7f73602343d0')]` Put into more plain terms, ``` def stage_backward(...): ... stage_output_tensors = [] # a cell object will exist that contains the variables defined in stage_backward and used by # both stage_backward and nested functions # in this case, the cell object contains 'stage_output_tensors' but # this function object will hold a reference to a 'cell' that contains any vars from # the parent scope not explicitly passed into the function as args. def extract_tensors_with_grads(...): ... # extract_tensors_with_grads refers to stage_output_tensors, so stage_output_tensors # is in the cell stage_output_tensors.append(output_val) ... # but extract_tensors_with_grads ALSO refers to itself (extract_tensors_with_grads), # so `extract_tensors_with_grads` will be in the cell extract_tensors_with_grads(...) 
``` More debug details: https://docs.google.com/document/d/1QPH1Lz0tnieIFPM2tyHrjVB-bjlnHuDgjx1p2am3cmE/edit?usp=sharing In pdb: ``` gc.collect() g = gc.garbage g[-1] [rank0]:(Pdb) [rank0]:<function stage_backward.<locals>.extract_tensors_with_grads at 0x7fee5c3392d0> g[-2] [rank0]:(Pdb) [rank0]:(<cell at 0x7fee7abbcf40: function object at 0x7fee5c3392d0>, <cell at 0x7fee7abbcf70: list object at 0x7fee7ab68940>, <cell at 0x7fee5c3210c0: list object at 0x7fee5e1d6340>) g[-3] [rank0]:(Pdb) [rank0]:[tensor([[[-4.1127e-06, -3.3826e-06, 2.6226e-06, ..., 6.4969e-06, [rank0]: -4.4405e-06, -4.7684e-06], ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136507 Approved by: https://github.com/awgu, https://github.com/kwen2501	2024-09-24 20:46:37 +00:00
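The cycle described in #136507 (a nested function that refers to itself, and therefore sits in its own closure) can be shown with the standard library alone. This is a minimal sketch under stated assumptions: `Payload`, `build_cycle`, and `walk` are hypothetical names standing in for the activation tensor, `stage_backward`, and `extract_tensors_with_grads`, and the behaviour shown relies on CPython's reference counting.

```python
import gc
import weakref


class Payload:
    """Stand-in for an activation tensor; plain instances support weakrefs."""


def build_cycle():
    held = [Payload()]

    def walk(depth):
        # `walk` reads `held` and calls itself, so its closure holds cells
        # for both names: function -> closure cell -> the same function.
        if depth > 0:
            walk(depth - 1)
        return len(held)

    walk(1)
    return weakref.ref(held[0])


gc.disable()  # keep automatic collection from interfering with the demo
ref = build_cycle()
alive_before = ref() is not None  # True if the cycle still pins the payload
gc.collect()
alive_after = ref() is not None   # False once the cycle collector has run
gc.enable()
print(alive_before, alive_after)
```

Passing the list and the callee in as arguments, or breaking the self-reference before returning, would let reference counting free the payload as soon as the outer function exits.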
chilli	7ff8e66140	Fix flexattention sympy expr printer issue (#136509 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136509 Approved by: https://github.com/yanboliang	2024-09-24 20:10:29 +00:00
Henry Tsang	02ef5dd327	[inductor][test] Check if mkl dnn bf16 is supported when using bf16 (#136290 ) Sometimes the test is run on an older CPU, e.g. Intel(R) Xeon(R) CPU E5-2680 v4. If we inspect its `lscpu` output, the flags do not include `avx512_bf16`. That probably means bf16 is not supported on such hardware, and hence the unit test can fail. So we add the check in the code. Context: https://github.com/pytorch/pytorch/pull/135038 Differential Revision: D62984129 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136290 Approved by: https://github.com/XuehaiPan, https://github.com/chenyang78	2024-09-24 19:32:48 +00:00
Joel Schlosser	888744bd36	NJT binary pointwise broadcasting support via jagged <-> padded dense conversion (#133021 ) Related: #132695 This PR uses padded dense <-> jagged conversions to handle binary pointwise broadcasting of (NT, T) and (T, NT). This includes: * `(B, j0, D) + (1, 1, 1)` * `(B, j0, D) + (B, 1, 1)` * `(B, j0, D) + (B, 1, D)` * etc. This PR also adds (hacky) support for bool inputs to the jagged <-> padded dense conversions. The underlying CUDA kernels do not support integer / bool inputs; so the following workaround is employed: `convert input -> half, run conversion kernel, convert output -> bool`. Note that this bool support is needed specifically for the backward formula of `fmax`, and likely others. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133021 Approved by: https://github.com/cpuhrsch	2024-09-24 19:11:49 +00:00
David Berard	8ecc5f1a8f	[TorchScript][tensorexpr] imbue locale for IRPrinter (#136458 ) We had an internal report where the NNC-generated CUDA code had thousands separators in integer literals. Although I wasn't able to cleanly repro, I did come up with a hacky repro and verified that this fix works (see #136459). Differential Revision: [D63278771](https://our.internmc.facebook.com/intern/diff/D63278771) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136458 Approved by: https://github.com/eellison	2024-09-24 19:00:57 +00:00
Nikita Shulga	c6192f32f1	[MPS] Add upsample_bicubic2d as Metal op (#136123 ) More or less literal copy-n-paste of `c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L24)` and `c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L99)` Missing `uint8` implementation mimics CUDA behavior Initial version coded live in https://www.youtube.com/watch?v=shi6Kb5xxvk Later refinements: - Switch from 2D dispatch to 1D one (to match CUDA behavior) - Added batch + channel loops - Fixed scale computation to match align corners behavior - Added backward implementation The backward implementation again mimics CUDA, so it has precision issues for `torch.half` as well as a somewhat slow simulation of atomic adds using atomic compare and exchange of the pair of adjacent values, i.e. ```metal template <typename T> static inline void atomic_add_helper( device atomic<int>* data, long offset, float value) { auto ptr = data + (offset >> 1); auto old = atomic_load_explicit(ptr, memory_order_relaxed); union { int i; T t[2]; } val; do { val.i = old; val.t[offset & 1] += static_cast<T>(value); } while (!atomic_compare_exchange_weak_explicit( ptr, &old, val.i, memory_order_relaxed, memory_order_relaxed)); } ``` Bump basic Metal language version to 3.0, as it's supported on MacOS13 and that's the first version that has `atomic_float` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136123 Approved by: https://github.com/albanD	2024-09-24 18:58:11 +00:00
Animesh Jain	dacf0c4884	[dynamo] Do not treat user defined nn module attributes static for dynamic shape infra (#136516 ) Fixes https://github.com/pytorch/pytorch/issues/136254 The regression was introduced in https://github.com/pytorch/pytorch/pull/132736 where originally we were trying to fix another regression. This PR and the offending PR together say - "treat user defined nn module attributes as automatic dynamic, but for cudagraphs they will be considered static". This avoids recompilations. This can lead to a cudagraph recording, which is ok. This also maintains the state before inline_inbuilt_nn_modules flag was introduced. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136516 Approved by: https://github.com/williamwen42	2024-09-24 18:26:12 +00:00
Sam Larsen	1028cedf71	[inductor] Enable parallel compile by default in fbcode (#136246 ) Summary: Now that we have subprocess parallel compile on by default, we can change the internal compile_threads default to > 1 with a killswitch. Some jankiness so we can avoid evaluating the justknob at import. Test Plan: Ran codecache tests with JK on, then canaried locally with JK off Differential Revision: D62913998 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136246 Approved by: https://github.com/eellison	2024-09-24 18:10:01 +00:00
Oguz Ulgen	9abdc62065	Allow fx graph caching higher order operators (opt-in) (#135877 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135877 Approved by: https://github.com/zou3519	2024-09-24 17:23:09 +00:00
ankurneog	efed357ef5	Add dtypes support in opinfo for Intel Gaudi (#132840 ) ## Motivation This follows up on the changes introduced in https://github.com/pytorch/pytorch/pull/128584. We are adding the dtype information to be picked up while executing the UTs for Intel Gaudi/HPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132840 Approved by: https://github.com/albanD	2024-09-24 17:17:15 +00:00
PyTorch MergeBot	064093a4d6	Revert "Increase update_hint_regression problem size to 1000 (#136434 )" This reverts commit 3116fbda0fcf9af0c3dfe1280fb7e05e30e6ad5f. Reverted https://github.com/pytorch/pytorch/pull/136434 on behalf of https://github.com/ezyang due to whoops, this is too slow ([comment](https://github.com/pytorch/pytorch/pull/136434#issuecomment-2371847842))	2024-09-24 17:05:20 +00:00
Shangdi Yu	ebfcbe0822	Move print_export_warning so lru_cache works (#136491 ) Summary: As in the title, move print_export_warning() out of the function so that `lru_cache` actually works. Test Plan: CI Differential Revision: D63297083 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136491 Approved by: https://github.com/pianpwk	2024-09-24 16:52:22 +00:00
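Why moving a cached function out of its enclosing function matters can be seen in a small standalone case. The sketch below is illustrative only and is not the code touched by #136491: all function names are hypothetical, and it assumes the original helper was decorated with `functools.lru_cache` while being defined inside another function.

```python
import functools

calls = []


def export_nested():
    # A fresh function object, with a fresh empty cache, is created on every
    # call of export_nested, so the cache never gets a hit.
    @functools.lru_cache(maxsize=None)
    def warn_once():
        calls.append("nested")

    warn_once()


@functools.lru_cache(maxsize=None)
def warn_once_module_level():
    # Defined once at import time, so one cache is shared by all callers.
    calls.append("module")


def export_hoisted():
    warn_once_module_level()


for _ in range(3):
    export_nested()
    export_hoisted()

print(calls.count("nested"), calls.count("module"))
```

The nested variant runs its body on each of the three iterations, while the module-level variant runs its body only on the first.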
Fuzzkatt	44ec706789	add tolerance changes for test_sdpa_autocast in test_nestedtensor.py (#136485 ) Upstreaming minor unit test fix from nvidia internal CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/136485 Approved by: https://github.com/soulitzer	2024-09-24 16:31:32 +00:00
Robert Hardwick	eac04fe72a	Increase bf32 tolerances for some cdist tests in test_torch (#136315 ) - Set the new tolerances ~= N * eps(bfloat16), where N is the inner dimension of the matmul; this should be a comfortable upper bound for tolerances. Logic behind choice of tolerance: The maximum error of the summation of a series of N numbers in bfloat16 should be `N * epsilon(bfloat16)`; I confirmed by sampling different random seeds that the maximum observed error doesn't exceed this value and is usually much less. Fixes test failures on Arm® Neoverse™ V1 ( not raised as an issue as this hardware type is not currently covered by linux-aarch64 workflow ) ``` Traceback (most recent call last): File "/var/lib/jenkins/workspace/test/test_torch.py", line 2478, in test_cdist_large self.assertEqual(expected, actual) File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual raise error_metas.pop()[0].to_error( AssertionError: Tensor-likes are not close! Mismatched elements: 134118 / 1000000 (13.4%) Greatest absolute difference: 0.03829193115234375 at index (291, 726) (up to 0.005 allowed) Greatest relative difference: 0.03519868478178978 at index (291, 726) (up to 1.3e-06 allowed) ``` @malfet @jondea Pull Request resolved: https://github.com/pytorch/pytorch/pull/136315 Approved by: https://github.com/albanD	2024-09-24 16:10:11 +00:00
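The `N * eps(bfloat16)` rule from #136315 is simple enough to evaluate by hand. The sketch below assumes the usual definition of machine epsilon for bfloat16 (8 bits of significand precision, so eps = 2^-7); the helper name and the sample values of N are hypothetical and are not taken from the tests.

```python
# Spacing between 1.0 and the next representable bfloat16 value.
BF16_EPS = 2.0 ** -7


def summation_tolerance(n, eps=BF16_EPS):
    """Upper bound used in the commit message: N * eps(bfloat16)."""
    return n * eps


# Sample inner dimensions, chosen only for illustration.
for n in (4, 64, 1024):
    print(n, summation_tolerance(n))
```

Because the bound grows linearly with N, a tolerance that is comfortable for a small inner dimension can be far too tight for a large one.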
Ma Jian	0b667c073e	Disable compiled autograd for re-entrant autograd (#135795 ) Fixes #135298 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135795 Approved by: https://github.com/xmfan	2024-09-24 15:09:16 +00:00
gaopengff	33e10803c8	Fix ut in internal distributed_test.py (#136251 ) The test case test_new_subgroups_by_enumeration_input_rank_exceeds_world_size failed for me and passes with this small change. The expected exception is supposed to be "ValueError" rather than "RuntimeError" according to [code](https://github.com/pytorch/pytorch/blob/v2.4.1/torch/distributed/distributed_c10d.py#L4190). Pull Request resolved: https://github.com/pytorch/pytorch/pull/136251 Approved by: https://github.com/kwen2501	2024-09-24 15:06:20 +00:00
Justin Chu	58274e4655	Remove onnx imports in dynamo (#136334 ) Remove imports of the ``torch.onnx.operators`` module in dynamo. Since ONNX depends on dynamo, this import line causes a circular dependency. Judging from the source they are not actually needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136334 Approved by: https://github.com/xadupre, https://github.com/jansel, https://github.com/titaiwangms	2024-09-24 14:54:23 +00:00
Isuru Fernando	2a178a6982	Avoid changing FTZ/DAZ flags in CPP builder (#136466 ) Fixes https://github.com/pytorch/pytorch/issues/136273 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136466 Approved by: https://github.com/ezyang	2024-09-24 14:39:17 +00:00
Fuzzkatt	6300eb1dc7	tf32 off for test_noncontiguous_samples in test_ops.py (#136484 ) Upstreaming minor unit test fix from nvidia internal CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/136484 Approved by: https://github.com/soulitzer	2024-09-24 14:26:47 +00:00
Amadeusz Skrzypczak	47ebb5856e	Make avoid_device_init() aware of hpu device (#136194 ) Added hpu to devices handled by avoid_device_init() in FakeTensorMode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136194 Approved by: https://github.com/eellison	2024-09-24 14:13:45 +00:00
enkilee	54fc4f56ff	[Docs fix] fix syntax error in docs :torch.blackman_window (#136354 ) Fixes #ISSUE_NUMBER The page https://pytorch.org/docs/stable/generated/torch.blackman_window.html has an error at: equal to torch.blackman_window(L + 1, periodic=False)[:-1]). The last ) should be deleted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136354 Approved by: https://github.com/soulitzer	2024-09-24 14:00:26 +00:00
Aaron Orenstein	9fc721d22b	Add cache logs + other minor caching cleanup (#136456 ) Summary: - Added TORCH_LOGS=cache to dump cache stats on exit - supported by RemoteCache. - Split REMOTE_CACHE_VERSION - it was used for both JKs fx_graph_memcache_version and autotune_memcache_version but they really should be separate (just in case we need to change one but not the other) - Prepare `_ManifoldCache` for use with other subpath keys - Move create_cache to be more public and use it in codecache - Add _InductorMetaTy alias (still just a dict) - Cleaned up some common cached_autotune calls in triton_heuristics Test Plan: unit tests Reviewed By: oulgen Differential Revision: D62648249 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136456 Approved by: https://github.com/oulgen	2024-09-24 14:00:23 +00:00
IvanKobzarev	342c031f0e	[aotd] Fix freezing API for subclasses (#136265 ) Original issue: https://github.com/pytorch/ao/issues/890 The problem: TracingContext.flat_params contains the original params, with subclasses not desugared, while the inductor.freezing API works on AOT graphs, in which subclasses are already desugared. flat_params is used only for this logic, and storing desugared subclasses in it fixes the issue. Testing: ``` python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses ``` Torch AO original failure: ``` python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265 Approved by: https://github.com/bdhirsh	2024-09-24 13:15:01 +00:00
cyy	f048569c24	[Distributed] [11/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136439 ) Follows #131671 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136439 Approved by: https://github.com/kwen2501	2024-09-24 13:05:15 +00:00
PyTorch MergeBot	538ee7bf60	Revert "Fix tensor.data_ptr() representation overflow (#135567 )" This reverts commit 2e8d431a8fbfdbdb07448195f16afa9e101188ac. Reverted https://github.com/pytorch/pytorch/pull/135567 on behalf of https://github.com/etaf due to Block XPU, let's re-land with triton update. ([comment](https://github.com/pytorch/pytorch/pull/135567#issuecomment-2371200549))	2024-09-24 12:59:14 +00:00
Bob Ren	32727b9859	Add types to _dynamo/testing.py (#136402 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136402 Approved by: https://github.com/jansel	2024-09-24 10:23:54 +00:00
Xuehai Pan	73c10a04f6	[dynamo][easy] support `sys.intern` (#136081 ) Closes #134023 - #134023 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136081 Approved by: https://github.com/anijain2305	2024-09-24 09:12:34 +00:00
Amin Alam	1266be21f4	deprecated datetime.utcnow() fix and _RendezvousJoinOp module initiation bug fix (#136141 ) Fix to #136140 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136141 Approved by: https://github.com/kwen2501	2024-09-24 07:26:10 +00:00
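`datetime.utcnow()` is deprecated as of Python 3.12 in favour of timezone-aware objects. The sketch below shows the replacement the Python documentation recommends; whether #136141 uses exactly this form is an assumption, and `utc_now` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def utc_now():
    # Timezone-aware replacement for the deprecated datetime.utcnow().
    return datetime.now(timezone.utc)


stamp = utc_now()
print(stamp.tzinfo)
print(stamp.isoformat())
```

One caveat when migrating: ordering comparisons between an aware value like this one and a naive `datetime` raise `TypeError`, so every timestamp taking part in a comparison needs to be converted together.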
Jianyu Huang	0a35986cdb	Add option to configure reduced precision math backend for SDPA (#135964 ) Summary: Address https://github.com/pytorch/pytorch/issues/135778 by adding a global flag to configure whether using high precision or low precision for math backend of SDPA. Test Plan: buck2 run mode/opt //scripts/feikou/llm:run_attn_kernels Differential Revision: D62625515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135964 Approved by: https://github.com/jbschlosser	2024-09-24 07:11:38 +00:00
Wu, Chunyuan	44c871c34b	[inductor] [cpp] add index check when fusing epilogue with GEMM template (#135661 ) ## Description Fixes the accuracy failure of FP32 `jx_nest_base` of max-autotune. The current epilogue fusion implementation in GEMM template assumes that the read of template buffer and the write of epilogue output in the epilogue node have the same index (the layout could be different but the index should be the same). If the condition is not satisfied, the computation is wrong, leading to a correctness issue for FP32 `jx_nest_base`. This PR disables the epilogue fusion with GEMM template when the above condition is not satisfied. ### Unsupported epilogue: `buf1` is the template buffer and `buf2` is the epilogue output buffer. The store of `buf2`: 401408 * d0 + 100352 * d1 + **7168 * d2 + 1792 * d3** + 128 * d4 + d5 The load of `buf1` in the epilogue node: 401408 * d0 + 100352 * d1 + **1792 * d2 + 25088 * d3** + 128 * d4 + d5 The above two indexes are different. ``` CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[25088, 128], stride=[128, 1])) ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[8, 4, 14, 4, 14, 128], stride=[401408, 100352, 7168, 1792, 128, 1]), data=Pointwise( 'cpu', torch.float32, def inner_fn(index): i0, i1, i2, i3, i4, i5 = index tmp0 = ops.load(arg5_1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0) tmp1 = ops.load(buf0, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0) tmp2 = tmp0 + tmp1 tmp3 = ops.load(buf1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0) tmp4 = tmp2 + tmp3 return tmp4 , ranges=[8, 4, 14, 4, 14, 128], origin_node=clone, origins=OrderedSet([clone]) )) ``` ### Supported epilogue: `buf1` is the template buffer and `buf2` is the epilogue output buffer. The store of `buf2`: d0 + 576 * d1 + 32 * d2 The load of `buf1` in the epilogue node: d0 + 576 * d1 + 32 * d2 The above two indexes are the same. 
The layout of `buf2` and `buf1` are different though which is handled by the reindexer: `buf1`: `size=[324, 32], stride=[32, 1]` `buf2`: `size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]` ``` CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.bfloat16, size=[324, 32], stride=[32, 1])) ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.bfloat16, size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]), data=Pointwise( 'cpu', torch.bfloat16, def inner_fn(index): _, i1, i2, i3 = index tmp0 = ops.load(buf1, i1 + 32 * i3 + 576 * i2) tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16) tmp2 = ops.load(_frozen_param4, i1) tmp3 = tmp1 * tmp2 tmp4 = ops.load(arg7_1, i1 + 32 * i3 + 576 * i2) tmp5 = tmp3 + tmp4 tmp6 = ops.to_dtype(tmp5, torch.bfloat16, src_dtype=torch.float32) return tmp6 , ranges=[1, 32, 18, 18], origin_node=convert_element_type_4, origins=OrderedSet([add, mul, convert_element_type_4]) )) ``` ## TODO Add the support for fusions when the indexes are different in a follow-up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135661 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5	2024-09-24 05:25:28 +00:00
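The index condition described in this commit can be illustrated with a small sketch. This is not the Inductor implementation: the helper `same_index` and the dictionary representation (loop variable mapped to its coefficient) are hypothetical, and the coefficients are read off the strides and `inner_fn` bodies quoted in the commit message.

```python
# Hypothetical sketch: an affine index is modelled as {loop variable: coefficient}.
def same_index(store_index, load_index):
    """Return True when both index expressions have identical coefficients."""
    return store_index == load_index


# Unsupported case: store of buf2 vs. load of buf1 (d2 and d3 coefficients differ).
store_buf2 = {"d0": 401408, "d1": 100352, "d2": 7168, "d3": 1792, "d4": 128, "d5": 1}
load_buf1 = {"d0": 401408, "d1": 100352, "d2": 1792, "d3": 25088, "d4": 128, "d5": 1}

# Supported case: both expressions are d0 + 576 * d1 + 32 * d2.
store_ok = {"d0": 1, "d1": 576, "d2": 32}
load_ok = {"d0": 1, "d1": 576, "d2": 32}

# Fusion would only be allowed when the two indexes match.
can_fuse_unsupported = same_index(store_buf2, load_buf1)
can_fuse_supported = same_index(store_ok, load_ok)
```

Under this model the first pair is rejected and the second is accepted, which mirrors the "unsupported" and "supported" epilogues above.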
Max Podkorytov	7283530db2	[ROCm][Inductor][CK] FP8 gemm (#136337 ) At the moment, lowering torch._scaled_mm with tensorwise scaling and rowwise scaling for both A and B We probably also want to support either combination of tensorwise and rowwise for A and B, as well as bias support Pull Request resolved: https://github.com/pytorch/pytorch/pull/136337 Approved by: https://github.com/chenyang78	2024-09-24 05:19:45 +00:00
Aaron Orenstein	7f98781f84	Fix autodeps from D62049222 that pyfmt broke (#136455 ) Summary: `arc lint` changed the formatting which then caused autodeps to be confused. Test Plan: this passes: ``` arc lint --skip AUTODEPS fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/test/inductor/test_memory_planning.py ``` Differential Revision: D63277059 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136455 Approved by: https://github.com/bobrenjc93, https://github.com/oulgen	2024-09-24 05:06:12 +00:00
blzheng	797c7e2802	[Quant][PT2E]change flatten recipe for X86InductorQuantizer (#136298 ) This PR modifies the flatten recipe: if none of the users of the flatten node are quantizable ops, int8 flatten will be disabled to avoid unnecessary dtype conversions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136298 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5	2024-09-24 04:30:12 +00:00
Riley Dulin	3be150653c	[torch][ao] Add customizable loss function to NodeAccuracySummary (#136282 ) Summary: Add a customizable loss function callback to NodeAccuracySummary to allow users to pass in their own loss function. Also, fix some type errors and propagate better exception messages when unexpected tensor comparisons occur. Finally, enhance the robustness of `generate_numeric_debug_handle` in the case where it is called multiple times on the same model, by avoiding reuse of the same IDs. Test Plan: Added a test for this case in `test_numeric_debugger`. Reviewed By: jerryzh168 Differential Revision: D62898297 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282 Approved by: https://github.com/jerryzh168	2024-09-24 03:28:12 +00:00
Guilherme Leobas	e09c5b6046	Remove `vt` argument in `raise_observed_exception` (#136037 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136037 Approved by: https://github.com/zou3519	2024-09-24 02:36:57 +00:00
fduwjj	9372692c7b	[FR] Make OSS fr_trace function available for internal script and improve pg filtering (#136473 ) Differential Revision: [D63287384](https://our.internmc.facebook.com/intern/diff/D63287384/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136473 Approved by: https://github.com/c-p-i-o	2024-09-24 02:34:43 +00:00
Nikita Shulga	4fd16dd8aa	Clarify that `libtorch` API is C++17 compatible (#136471 ) As it relies on some common C++17 primitives, such as `std::optional` Replace all docs references from C++14 to C++17 Fixes https://github.com/pytorch/pytorch/issues/133205 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136471 Approved by: https://github.com/kit1980, https://github.com/atalman	2024-09-24 02:03:33 +00:00
Jez Ng	e4d294221b	[inductor] Log precompilation time (#136395 ) This has been useful for diagnosing the long compile time issues I've seen in the Triton CPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136395 Approved by: https://github.com/eellison	2024-09-24 01:47:54 +00:00
Edward Z. Yang	802ba79121	Inherit all secrets to inductor workflow (#135354 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135354 Approved by: https://github.com/desertfire, https://github.com/atalman, https://github.com/malfet	2024-09-24 01:30:40 +00:00
Aaron Orenstein	06909803cc	Existing mypy issues (#136236 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136236 Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007	2024-09-24 01:02:07 +00:00
Xuan Zhang	a14f57b126	fix the inductor tests (#136474 ) Fixes https://github.com/pytorch/pytorch/issues/136464 introduced in https://github.com/pytorch/pytorch/pull/134874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136474 Approved by: https://github.com/malfet	2024-09-24 00:59:22 +00:00
Nikita Shulga	9d9bc65b5e	Make `FlashAttentionKernel.cpp` compilable for SVE with GCC-11 (#136477 ) Extends https://github.com/pytorch/pytorch/pull/132434 to all minor revisions of GCC-11, as they are all likely affected by https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95528 Hat tip to @abhishek-iitmadras for the investigation Fixes https://github.com/pytorch/pytorch/issues/136432 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136477 Approved by: https://github.com/atalman, https://github.com/kit1980	2024-09-24 00:54:26 +00:00
Ke Wen	e0f84f40f7	[Pipelining] Allow non-0 stages to accept kwargs (#136416 ) To support a use case in torchchat: all non-0 stages require `input_pos` and `cache_lane`. ``` kwargs = {"input_pos": input_pos, "cache_lane": lane} if pp_rank == first_pp_rank: output = decorder.step(new_token, kwargs) elif pp_rank == last_pp_rank: output = decorder.step(kwargs) else: # middle pp ranks decorder.step(**kwargs) ``` The `forward_one_chunk` code today hard-codes `{}` as the kwargs for non-0 stages, hence cannot support the above use case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136416 Approved by: https://github.com/wconstab	2024-09-23 23:50:59 +00:00
Guilherme Leobas	52c917b0ba	Optimize dict reconstruct to not codegen untouched values (#134876 ) PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follow: (1) codegen(...) each pair of key/value (2) create a new dictionary to hold the new items (3) clear the original dictionary (4) update the original dict with the one created in (2) We do a micro optimization in the generated bytecode to: - Only codegen the items that changed. - Only clear the original dictionary if a key was removed. Fixes: #133487 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876 Approved by: https://github.com/zou3519	2024-09-23 21:45:44 +00:00
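The optimization described in this commit can be sketched in plain Python. This is only an illustration of the idea on ordinary dicts, not the Dynamo bytecode codegen; `plan_dict_update` and `apply_plan` are hypothetical helper names, and value comparison with `!=` stands in for "the item changed".

```python
def plan_dict_update(original, new):
    """Return (items that changed or were added, whether any key was removed)."""
    changed = {k: v for k, v in new.items() if k not in original or original[k] != v}
    needs_clear = any(k not in new for k in original)
    return changed, needs_clear


def apply_plan(target, new, changed, needs_clear):
    """Mutate `target` in place so that it ends up equal to `new`."""
    if needs_clear:
        # A key was removed: fall back to clearing and re-adding everything.
        target.clear()
        target.update(new)
    else:
        # No key was removed: only touch the items that changed.
        target.update(changed)


original = {"a": 1, "b": 2, "c": 3}
new = {"a": 1, "b": 20, "c": 3, "d": 4}
changed, needs_clear = plan_dict_update(original, new)
apply_plan(original, new, changed, needs_clear)
```

In this example only `b` and `d` are written and the dictionary is never cleared, because no key was removed.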
fduwjj	5033a1ca0d	[RFC][torchelastic][c10d] Fix store prefix race in rendezvous (#135957 ) 1. We want to take option 3 as discussed in https://github.com/pytorch/pytorch/issues/135712, so every time we retry, we first create a new TCPStore server; that way we don't need to append the attempt count as a prefix, and we avoid an eventual TCPStore sync failure. (This is only for the case where TCPStore sharing is enabled.) 2. We start a new server bound to an ephemeral port (i.e. 0) so it gets assigned a free port. We then pass that downstream (trainer or c10d). By doing so, TCPStore is managed by the elastic agent rather than having a race condition on binding to a specific port in the trainer. 3. The port is then broadcast for dynamic_rendezvous. One remaining question: what do we do about the store created by `_create_tcp_store` in torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py? Are we OK with creating a duplicate TCPStore server? Pull Request resolved: https://github.com/pytorch/pytorch/pull/135957 Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o	2024-09-23 20:32:24 +00:00
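Step 2 above relies on the operating system picking a free port when a server binds to port 0. A minimal sketch of that mechanism with the standard `socket` module follows; it does not use TCPStore, and `bind_ephemeral_port` is a hypothetical helper name.

```python
import socket


def bind_ephemeral_port(host="127.0.0.1"):
    """Bind a listening TCP socket to port 0 and return (socket, assigned port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS for any free port
    sock.listen()
    port = sock.getsockname()[1]  # the port the OS actually assigned
    return sock, port


server, port = bind_ephemeral_port()
# `port` is what would be passed downstream (for example, broadcast to other
# participants) instead of agreeing on a fixed port in advance.
server.close()
```

Because the server is already bound when the port number is read, there is no window in which another process can take that port first.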
PyTorch MergeBot	fd182b90a7	Revert "Add deterministic path for CUDA `cumsum` (#136224 )" This reverts commit d45b0151e5d9a9358368b9fbd7fa454edd5d9709. Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Failing internal CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2369244135))	2024-09-23 19:57:13 +00:00
Nikita Shulga	08dba25775	[BE] Do not use deprecated APIs in SparseCsrTensorMath.cu (#136449 ) - `Tensor::type()` -> `Tensor::scalar_type()` - `Tensor::data<T>()` -> `Tensor::data_ptr<T>()` Should fix following warnings during the compilation: ``` caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/cutlassB_f32_notaligned_k128_dropout.cu.o[0m /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In function ‘void at::native::_GLOBAL__N__496f0b0c_22_SparseCsrTensorMath_cu_868dd545::_apply_sparse_csr_linear_solve(const at::Tensor&, const at::Tensor&, bool, const at::Tensor&)’: /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:739:36: error: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 739 \| int* rowOffsets = crow.data<int>(); \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:740:35: error: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 740 \| int* colIndices = col.data<int>(); \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function: /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:44: error: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. 
If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:225:1: note: declared here 225 \| DeprecatedTypeProperties & type() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:159: error: ‘c10::ScalarType detail::scalar_type(const at::DeprecatedTypeProperties&)’ is deprecated: passing at::DeprecatedTypeProperties to an AT_DISPATCH macro is deprecated, pass an at::ScalarType instead [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/aten/src/ATen/Dispatch.h:109:1: note: declared here 109 \| inline at::ScalarType scalar_type(const at::DeprecatedTypeProperties& t) { \| ^~~~~~~~~~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:159: error: ‘c10::ScalarType detail::scalar_type(const at::DeprecatedTypeProperties&)’ is deprecated: passing at::DeprecatedTypeProperties to an AT_DISPATCH macro is deprecated, pass an at::ScalarType instead [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/aten/src/ATen/Dispatch.h:109:1: note: declared here 109 \| inline at::ScalarType scalar_type(const at::DeprecatedTypeProperties& t) { \| ^~~~~~~~~~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function: /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1014: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. 
[-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1054: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1094: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function: /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. 
[-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136449 Approved by: https://github.com/huydhn	2024-09-23 19:20:34 +00:00
Xiaodong Wang	9a1dc41de7	[AMD] Skipping 0 byte send/recv for AMD GPU (#136362 ) Summary: We found jobs getting stuck by send/recv zero bytes with RDMA on AMD GPUs. So just skipping them. Reviewed By: danzimm Differential Revision: D63075000 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136362 Approved by: https://github.com/malfet, https://github.com/houseroad	2024-09-23 19:14:12 +00:00
Edward Z. Yang	3116fbda0f	Increase update_hint_regression problem size to 1000 (#136434 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136434 Approved by: https://github.com/laithsakka	2024-09-23 18:51:44 +00:00
PyTorch MergeBot	274883083d	Revert "[AOTI] Create another wrapper class to handle ArrayRef (#136318 )" This reverts commit d21841d077b00350d5e621e7b74dace71849c701. Reverted https://github.com/pytorch/pytorch/pull/136318 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136318#issuecomment-2368957264))	2024-09-23 17:47:49 +00:00
Aleksei Nikiforov	d859fcbc61	s390x: build s390x binaries on each pull request (#125399 ) Ensure that s390x keeps building for each PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/125399 Approved by: https://github.com/huydhn	2024-09-23 17:39:48 +00:00
Joel Schlosser	83a3ee0699	Support embedding_bag() with NJT input (#135888 ) Fixes #93843 `EmbeddingBag()` / `embedding_bag()` support 1D inputs with offsets to handle raggedness. NJT is a natural fit here as it already maintains offsets of the same form. This PR updates the python-side to support NJT and adds corresponding OpInfo-based NJT tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135888 Approved by: https://github.com/cpuhrsch	2024-09-23 17:35:19 +00:00
James Wu	4649aeaebf	Make AOTAutogradCache support remote FXGraphCache (#136173 ) Summary: After the previous refactor, we can now call load_with_key directly from AOTAutogradCache to use the remote FXGraphCache. This does not implement a remote AOTAutogradCache. It just allows AOTAutogradCache to work with remote FXGraphCache. Test Plan: (Meta only tests) Reviewed By: aorenste Differential Revision: D62384944 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136173 Approved by: https://github.com/oulgen	2024-09-23 17:24:27 +00:00
Nikita Shulga	c3e678382b	Fix addmm silent correctness on aarch64 (#136371 ) Do not dispatch to fast gemmv functions when alpha is not equal to 1 Add regression test to address the problem Fixes https://github.com/pytorch/pytorch/issues/136299 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136371 Approved by: https://github.com/swolchok	2024-09-23 17:10:34 +00:00
Edward Z. Yang	f0f79dd8f1	Correctly convert Python float to float64 when passing argument as Tensor (#136413 ) I can't actually test the Dynamo codegen fix as it is impossible to directly use the Tensor at the moment. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136413 Approved by: https://github.com/bobrenjc93	2024-09-23 16:48:08 +00:00
wz337	637d5c4b7e	[DSD] Fix loading uneven full tensor into sharded state dict (#136365 ) Fix #136228. This is a follow up on https://github.com/pytorch/pytorch/pull/135725. We need to pass shape and stride from the original dtensor, since for uneven case, `from_local` would calculate shape and stride assuming the tensor is evenly-sharded based on the local tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136365 Approved by: https://github.com/fegin	2024-09-23 16:35:58 +00:00
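The arithmetic behind the uneven case can be sketched without any distributed setup. The sketch assumes shards are sized the way `torch.chunk` sizes them (ceil-sized chunks first); `chunk_sizes` is a hypothetical helper, not a DTensor API.

```python
import math


def chunk_sizes(dim_size, num_shards):
    """Per-rank shard sizes along one dimension, torch.chunk-style (assumption)."""
    full = math.ceil(dim_size / num_shards)
    sizes = []
    remaining = dim_size
    for _ in range(num_shards):
        size = min(full, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


global_size = 10
world_size = 4
local_sizes = chunk_sizes(global_size, world_size)

# Inferring the global size as local_size * world_size (the "evenly sharded"
# assumption) gives a wrong and rank-dependent answer in the uneven case.
inferred_sizes = [size * world_size for size in local_sizes]
```

Here the local sizes are 3, 3, 3 and 1, so no rank recovers the true size of 10 from its local tensor alone, which is why the original shape and stride have to be passed explicitly.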
fduwjj	da51fe1c42	[FR] Fix errors in all2all check, improve some log output (#136399 ) We found that we show the hashed pg name in our script output, which is not user-friendly. We also found a bug in our all2all check, and we changed a number of error messages to make them more readable. Differential Revision: [D63206469](https://our.internmc.facebook.com/intern/diff/D63206469) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136399 Approved by: https://github.com/c-p-i-o	2024-09-23 16:31:31 +00:00
PyTorch MergeBot	df6a8fa1eb	Revert "[aotd] Fix freezing API for subclasses (#136265 )" This reverts commit cdef760560049ebda5fb7e30b1703f345fe05cfa. Reverted https://github.com/pytorch/pytorch/pull/136265 on behalf of https://github.com/atalman due to Breaks internal CI sorry, need to revert ([comment](https://github.com/pytorch/pytorch/pull/136265#issuecomment-2368772574))	2024-09-23 16:25:05 +00:00
Andrew Gu	9992084f38	[FSDP2] Fixed `test_all_gather_extensions_monkey_patch` (#136130 ) I messed up the test before. The extensions were not running :/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/136130 Approved by: https://github.com/weifengpy ghstack dependencies: #136129	2024-09-23 15:12:44 +00:00
Andrew Gu	b9f53c0dce	[FSDP2] Added module, mp policy to `fsdp_pre_all_gather` (#136129 ) - Sometimes having access to the `MixedPrecisionPolicy` in the `fsdp_pre_all_gather` is useful. See [here](https://github.com/pytorch/ao/pull/748/files#r1760375325) in the torchao INT8 mixed precision training PR. - Sometimes having access to the owning `nn.Module` allows for using it for saving state. See [here](https://github.com/pytorch/pytorch/issues/114299#issuecomment-2298692762) for an example. The major pain point here is how to deal with backward compatibility. For now, we use `inspect.signature` to check if the user subclass follows the old vs. new signature. However, for the new signature, the `param_dtype` in the post-all-gather is redundant, since a user who needs it can now save it from the `mp_policy` passed in the pre-all-gather. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136129 Approved by: https://github.com/weifengpy	2024-09-23 15:12:36 +00:00
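The backward-compatibility check mentioned in this commit can be sketched with Python's standard `inspect.signature`. The two classes and their parameter lists below are hypothetical stand-ins (the real FSDP2 hook takes more arguments), and `uses_new_signature` is an illustrative helper, not the FSDP2 code.

```python
import inspect


class OldStyleTensor:
    # Hypothetical "old" hook: no module or mp_policy parameters.
    def fsdp_pre_all_gather(self, mesh):
        return (), None


class NewStyleTensor:
    # Hypothetical "new" hook: also receives the owning module and the mp policy.
    def fsdp_pre_all_gather(self, mesh, module, mp_policy):
        return (), None


def uses_new_signature(hook):
    """Return True if the hook declares the newer `module` and `mp_policy` params."""
    params = inspect.signature(hook).parameters
    return "module" in params and "mp_policy" in params


old_is_new = uses_new_signature(OldStyleTensor().fsdp_pre_all_gather)
new_is_new = uses_new_signature(NewStyleTensor().fsdp_pre_all_gather)
```

A caller could branch on the result to decide which arguments to pass, so subclasses written against the old signature keep working.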
Bin Bao	d21841d077	[AOTI] Create another wrapper class to handle ArrayRef (#136318 ) Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code. Test Plan: CI Differential Revision: D62961885 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136318 Approved by: https://github.com/frank-wei	2024-09-23 15:10:27 +00:00
PyTorch MergeBot	0e19522122	Revert "Adds support for accelerated sorting with x86-simd-sort (#127936 )" This reverts commit 239a9ad65eebf93dcf9bb108a5129d4160b12c86. Reverted https://github.com/pytorch/pytorch/pull/127936 on behalf of https://github.com/atalman due to test/test_sort_and_select.py::TestSortAndSelectCPU::test_sort_discontiguous_slow_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10994904767/job/30525578456) [HUD commit link](`239a9ad65e`) ([comment](https://github.com/pytorch/pytorch/pull/127936#issuecomment-2368522316))	2024-09-23 14:52:23 +00:00
Edward Z. Yang	bae427e4b1	Refactor maybe_evaluate_static into a worker function off of ShapeEnv (#135107 ) By refactoring this way, I can put a non-expiring LRU cache here. Splitting also will make it easier for me to tell who is using up all the time. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135107 Approved by: https://github.com/aorenste	2024-09-23 14:39:20 +00:00
PyTorch MergeBot	e9bfbf78d5	Revert "Allow fx graph caching higher order operators (opt-in) (#135877 )" This reverts commit 66d5eb64e0be91680a8531ccb24f098554610d46. Reverted https://github.com/pytorch/pytorch/pull/135877 on behalf of https://github.com/jeanschmidt due to seems to have introduced regressions on rocm signals ([comment](https://github.com/pytorch/pytorch/pull/135877#issuecomment-2367616653))	2024-09-23 09:04:24 +00:00
cyy	75f141be62	Avoid unnecessary CMake warnings on Windows (#136393 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136393 Approved by: https://github.com/ezyang	2024-09-23 06:42:59 +00:00
Yuxin Wu	663e760065	add unittest for OOM message (#129671 ) Add unittest for the bug in #123984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129671 Approved by: https://github.com/eqy	2024-09-23 04:48:01 +00:00
Yiming Zhou	068fdd602f	[export] enable custom tag metadata re-export test (#136048 ) Improves and enables a commented out test originally introduced in #131912 In `test_custom_tag_metadata_re_export()`, we check the added "custom" metadata to given nodes is preserved and not copied to other nodes after re-exporting Pull Request resolved: https://github.com/pytorch/pytorch/pull/136048 Approved by: https://github.com/zhxchen17	2024-09-23 04:37:58 +00:00
Oguz Ulgen	66d5eb64e0	Allow fx graph caching higher order operators (opt-in) (#135877 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135877 Approved by: https://github.com/zou3519	2024-09-23 04:33:27 +00:00
cyy	a38e4c5e1e	Enable clang-tidy warnings on aten/src/ATen/cuda/*.cpp (#134547 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134547 Approved by: https://github.com/ezyang	2024-09-23 03:44:55 +00:00
Isuru Fernando	f276da7f98	Remove prims.slice_in_dim and prims.slice (#136150 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136150 Approved by: https://github.com/ezyang	2024-09-23 01:27:22 +00:00
Xilun Wu	3406ac24d9	[BE] fix circular import in torch/distributed/utils.py (#136286 ) Summary: Fix a circular import in `torch/distributed/utils.py` found when running an internal test (see D62901023). Curious why this wasn't causing any issues. Is the relevant code deprecated and no longer used? Pull Request resolved: https://github.com/pytorch/pytorch/pull/136286 Approved by: https://github.com/Skylion007	2024-09-22 20:54:12 +00:00
Shangdi Yu	3bc073d728	[aoti] Fix workspace generation for triton (#135552 ) Fixes #131337 - add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`. - do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead. - add workspace allocation generation code to `kernel_autotune_calls`. e.g. ```python workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8) workspace.zero_() ..... triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0) del buf2, arg0_1, arg1_1, workspace ``` - add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code. The generated cpp has lines like below, so we also implement a `zero_()` for ` AtenTensorHandle `. ```cpp static constexpr int64_t int_array_0[] = {1280L, }; static constexpr int64_t int_array_1[] = {1L, }; AtenTensorHandle workspace_handle; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda, 0, &workspace_handle)); RAIIAtenTensorHandle workspace(workspace_handle); workspace.zero_(); ``` - Fix handle grid_fn for grid computation. Pass in "RBLOCK" to `split_scan_grid` - Fix dynamic shapes: Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined. The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code. - We also generate slightly different cpp code depending on if `abi_compatible` is turned on. 
```cpp RAIIAtenTensorHandle workspace(workspace_handle); AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get())); ``` vs ```cpp at::Tensor workspace = at::detail::empty_strided_cuda({8L(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA); workspace.zero_(); ``` Test Plan: ``` TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552 Approved by: https://github.com/desertfire	2024-09-22 04:51:37 +00:00
Zhou, Lingzhi	35532fc477	[Partitioner] Reuse partition to check whether nodes exist (#135317 ) Checking whether a node is in a NodeList takes O(n) time. Reuse the partition to speed this up, since partition.nodes is a hash table that holds the same elements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135317 Approved by: https://github.com/ezyang	2024-09-21 23:52:02 +00:00
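The complexity argument in this commit is the usual list-versus-hash-table membership trade-off. A minimal sketch with plain Python containers follows; the string node names and the variable names are illustrative, not the partitioner's data structures.

```python
# A list must be scanned element by element, so `in` is O(n).
node_list = [f"node_{i}" for i in range(1000)]

# A dict keyed by the same elements answers `in` in O(1) on average.
partition_nodes = dict.fromkeys(node_list)

queries = ["node_999", "missing"]
slow_results = [q in node_list for q in queries]        # linear scan per query
fast_results = [q in partition_nodes for q in queries]  # hash lookup per query
```

Both lookups give the same answers; only the cost per query differs, which matters when the check runs once per node.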
cyy	e4cdc31227	[14/N] Fix clang-tidy warnings in aten/src/ATen (#133988 ) Follows #133807 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133988 Approved by: https://github.com/ezyang	2024-09-21 22:41:40 +00:00
Bob Ren	9731ccb9e0	Type _dynamo/variables/lazy.py (#136376 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136376 Approved by: https://github.com/Skylion007	2024-09-21 22:18:02 +00:00
Jovian Anthony Jaison	09715638ab	Add _dynamo.config.suppress_errors logging (#136379 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136379 Approved by: https://github.com/ezyang	2024-09-21 21:00:26 +00:00
Aaron Orenstein	3176966732	update cache tests (#136215 ) Summary: - Clean up cache test code a bit. - Removed patch_fbcode() - it turned out to cause flaky issues (imagine if it set fbcode=False and then loaded a module for the first time which had a top-level fbcode check). Test Plan: unit tests Reviewed By: oulgen Differential Revision: D62648248 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136215 Approved by: https://github.com/bobrenjc93	2024-09-21 20:36:22 +00:00
Ramana Sundararaman	be4b7e8131	Param fixes in docstring (#136097 ) Fixes wrong param names in docstrings. cc: @kit1980 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136097 Approved by: https://github.com/ezyang	2024-09-21 18:56:34 +00:00
Aaron Gokaslan	b6ffa381e1	[BE]: Add half CUDA support nextafter (#136373 ) Making CUDA support match CPU support for nextafter Pull Request resolved: https://github.com/pytorch/pytorch/pull/136373 Approved by: https://github.com/ezyang	2024-09-21 17:13:45 +00:00
PyTorch MergeBot	cc17d58809	Revert "S390x update builder image (#132983 )" This reverts commit 080a249fc2290602402e01bf5864d9d9a416e5b6. Reverted https://github.com/pytorch/pytorch/pull/132983 on behalf of https://github.com/atalman due to Authenticate With PUSH is failing. Error: no registries found in registries.conf, a registry must be provided. Error: Process completed with exit code 125. ([comment](https://github.com/pytorch/pytorch/pull/132983#issuecomment-2365249249))	2024-09-21 16:46:54 +00:00
Xuan Zhang	03957efa5d	[inductor][scheduler] reorder scheduler nodes after fusion to reduce peak memory (#134874 ) Motivations: A topological order of the scheduler nodes that optimizes the liveness of buffers can reduce the peak memory utilization. This has been observed and studied e.g., [here](https://arxiv.org/pdf/1910.02653) and [here](https://proceedings.mlr.press/v202/steiner23a/steiner23a.pdf). Solutions: 1. implement a peak memory estimator via liveness analysis 2. implement a few memory aware topological sorting algorithms and pick the one with the lowest peak memory Results: On some models we can reduce the peak memory significantly: \| model \| batch size \| peak_memory baseline \| peak_memory new \| ratio \| \|:-----------------------------:\|:----------:\|:--------------------:\|:---------------:\|:-----:\| \| alexnet \| 128 \| 1.17 \| 0.99 \| 1.19 \| \| vgg16 \| 64 \| 4.10 \| 3.57 \| 1.15 \| \| DebertaV2ForQuestionAnswering \| 1 \| 11.60 \| 10.56 \| 1.10 \| In the presence of compiler based AC, peak memory can be further reduced: \| model \| batch size \| peak_memory baseline \| peak_memory new \| ratio \| \|:------------------------------:\|:----------:\|:--------------------:\|:---------------:\|:-----:\| \| AlbertForMaskedLM \| 4 \| 6.87 \| 6.43 \| 1.07 \| \| AlbertForQuestionAnswering \| 4 \| 8.69 \| 7.76 \| 1.12 \| \| MobileBertForQuestionAnswering \| 128 \| 4.67 \| 3.90 \| 1.20 \| [Here](https://fb.workplace.com/groups/1075192433118967/posts/1499920537312819/?comment_id=1499938843977655&reply_comment_id=1499951630643043) is an internal use case. Other infos: * neutral model runtime, because the reordering happens after fusion. So memory saving is _for free_. * minimal compile time overhead as the algorithm is linear in the number of edges of the inductor graph. For all huggingface benchmark models, the additional compile time is less than 1 second. 
* no peak memory regression since we only adopt a new order if the peak memory is reduced based on the estimator. However, the model is unaware of operators' working memories, but for large models, the working memory should be negligible. We haven't observed any significant regressions on all of our tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134874 Approved by: https://github.com/yf225	2024-09-21 16:28:38 +00:00
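The two steps named in the commit above (estimate peak memory via liveness analysis, then adopt a new order only if the estimate drops) can be sketched in plain Python. This is an illustrative sketch, not the Inductor implementation: the toy graph, the names `estimate_peak_memory` and `pick_order`, and the rule that every node allocates exactly one output buffer are all assumptions made for the example.

```python
from typing import Dict, List, Sequence

# Hypothetical toy graph: each node allocates one output buffer of the given
# size and reads the outputs of the nodes listed in DEPS.
SIZES: Dict[str, int] = {"a": 4, "b": 4, "c": 1, "d": 1, "e": 1}
DEPS: Dict[str, List[str]] = {"a": [], "b": [], "c": ["a"], "d": ["b"], "e": ["c", "d"]}

BASELINE = ["a", "b", "c", "d", "e"]
ALTERNATIVE = ["a", "c", "b", "d", "e"]


def estimate_peak_memory(order: Sequence[str],
                         sizes: Dict[str, int],
                         deps: Dict[str, List[str]]) -> int:
    """Liveness-based estimate: a buffer is live from the step that produces
    it until the last step that reads it. Buffers nobody reads stay live."""
    last_use: Dict[str, int] = {}
    for step, node in enumerate(order):
        for dep in deps[node]:
            last_use[dep] = step

    live = peak = 0
    for step, node in enumerate(order):
        live += sizes[node]            # allocate this node's output
        peak = max(peak, live)
        for dep in set(deps[node]):    # free inputs whose last reader is this node
            if last_use[dep] == step:
                live -= sizes[dep]
    return peak


def pick_order(baseline: Sequence[str],
               candidates: Sequence[Sequence[str]],
               sizes: Dict[str, int],
               deps: Dict[str, List[str]]) -> List[str]:
    """Keep the baseline unless a candidate (assumed to be a valid
    topological order) has a strictly lower estimated peak."""
    best, best_peak = list(baseline), estimate_peak_memory(baseline, sizes, deps)
    for cand in candidates:
        peak = estimate_peak_memory(cand, sizes, deps)
        if peak < best_peak:
            best, best_peak = list(cand), peak
    return best
```

In this toy graph, scheduling `c` right after `a` lets the large buffer `a` be freed before `b` is allocated, which is the kind of reordering the commit message describes. Falling back to the baseline when no candidate improves the estimate mirrors the "no peak memory regression" point above.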
DavidGu-Datong	fb4670a1f9	fix mean_out: op does not update parameter out for BF16/FP16 dtype on CPU (#135174 ) Fixes #134848 For BF16/FP16, when a tensor is specified in `out` parameter of mean, the mean kernel should use its storage for output, but that doesn't happen, since an `at::to` in the current code causes storage to be allocated again, but the `out` parameter tensor's storage doesn't get updated, resulting in it not holding the mean output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135174 Approved by: https://github.com/soulitzer	2024-09-21 14:21:42 +00:00
Will Constable	ea737e4e5d	[Pipelining] Make PipelineStage support meta initialization (#136243 ) Avoid allocating memory or dry-running the submodule during stage init. Save user-provided input/output metadata during stage init, to allow lazily initializing the buffers before the first step call. Later, we plan to build on top of this to add lazy shape inference (#130856) so that no input/output shapes are required at stage init. For now, we require input/output tensors for stage init, but these should be on meta device and stage should not allocate any real memory. Note: this needs more thorough testing and review, but it worked on the torchtitan 3d test. TODO: - delete 'device' arg from PipelineStage ctor? (Infer it from the arg tensors passed to the first step call instead? Separate PR.) - delete 'output_args' from PipelineStage ctor? We don't actually need it, but we use it to do shape validation, which is why I didn't remove it in this PR. Proposal: leave it until we add lazy shape inference? Fixes #136225, #136226 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136243 Approved by: https://github.com/H-Huang, https://github.com/kwen2501	2024-09-21 09:47:22 +00:00
cyy	c459430558	Pass Werror to CUDA host compiler (#130213 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/130213 Approved by: https://github.com/ezyang	2024-09-21 08:01:06 +00:00
Menglu Yu	e18439113e	[PT2][Inductor][Optmus] fix test_pad_mm_bf16 and reland to fix long computation kernel (#136349 ) Summary: see D62220158 Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pad_mm -- --exact 'caffe2/test/inductor:pad_mm - test_pad_mm_bf16 (caffe2.test.inductor.test_pad_mm.PadMMTest)' --run-disabled ``` ### H100 Buck UI: https://www.internalfb.com/buck2/e5d85802-cab7-41a5-aacc-95f541796a99 Test UI: https://www.internalfb.com/intern/testinfra/testrun/9570149258587374 Network: Up: 9.1KiB Down: 0B (reSessionID-b339b51b-6a0e-4347-9414-1ba38f26a5d0) Jobs completed: 9. Time elapsed: 1:15.7s. Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3) Tests finished: Pass 1. Fail 0. Fatal 0. Skip 1. Build failure 0 ### A100 Buck UI: https://www.internalfb.com/buck2/1082ad6e-56b0-4eb5-8092-ce507ca9a70e Test UI: https://www.internalfb.com/intern/testinfra/testrun/8444249533824784 Network: Up: 9.2KiB Down: 0B (reSessionID-2b3056ac-f29e-4de4-b6f5-9d994acf566b) Jobs completed: 9. Time elapsed: 1:36.9s. Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3) Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E see D62220158 Differential Revision: D63040455 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136349 Approved by: https://github.com/dshi7	2024-09-21 06:35:50 +00:00
cyy	02871461f7	Fix clang-tidy warnings in torch/csrc/lazy (#134655 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134655 Approved by: https://github.com/ezyang	2024-09-21 02:59:35 +00:00
Laith Sakka	0b91e7e2dc	Remove duplicate line (#136383 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136383 Approved by: https://github.com/kit1980, https://github.com/malfet	2024-09-21 01:35:13 +00:00
eqy	29f7b8d483	[TF32] Account for TF32 in `test_conv_double_backward` (#135716 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135716 Approved by: https://github.com/Skylion007	2024-09-21 01:06:22 +00:00
Nikita Shulga	7936584a88	Fix `Vectorized<double>::next_after` SVE compilation (#136388 ) Should have called [`Sleef_nextafterdx_sve`](https://sleef.org/2-references/libm/aarch64#vectorized-double-precision-function-for-obtaining-the-next-representable-fp-value) rather than [`Sleef_nextafterfx_sve`](https://sleef.org/2-references/libm/aarch64#vectorized-single-precision-function-for-obtaining-the-next-representable-fp-value) to get vectorized `nextafter` for double precision rather than single precision values This fixes a compilation issue introduced by https://github.com/pytorch/pytorch/pull/119571 and exposed by https://github.com/pytorch/pytorch/pull/133339 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136388 Approved by: https://github.com/kit1980	2024-09-20 23:54:17 +00:00
albanD	067d203b22	Upgrade pybind11 API calls for 3.13t (#136370 ) This is a modified version of https://github.com/pytorch/pytorch/pull/130341 that preserves support for older pybind11 versions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136370 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-09-20 23:09:55 +00:00
Colin Peppler	1a10751731	[AOTI][Tooling] Filter out kernels based off lowercase names (#135395 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135395 Approved by: https://github.com/YUNQIUGUO	2024-09-20 21:56:08 +00:00
Isuru Fernando	0c936c3ecb	Add decomps for max_unpool (#133146 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133146 Approved by: https://github.com/amjames, https://github.com/eellison	2024-09-20 21:35:25 +00:00
侯奇	293fccf86d	add TORCH_CUDA_CPP_API for AutoNcclGroup (#130012 ) `torch::cuda::nccl` lets developers depend only on torch rather than on nccl directly. However, using `torch::cuda::nccl::send`/`torch::cuda::nccl::recv` requires `ncclGroupStart()`/`ncclGroupEnd()`, for which `torch::cuda::nccl::AutoNcclGroup` can be used. But `torch::cuda::nccl::AutoNcclGroup` is not exported and is a LOCAL symbol, so it can't be used from outside of libtorch. <img width="1618" alt="image" src="https://github.com/pytorch/pytorch/assets/1913192/25b0bd54-2da6-480f-876d-b05acfecfe62"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130012 Approved by: https://github.com/kwen2501, https://github.com/eqy	2024-09-20 21:20:25 +00:00
Matthew Sterrett	239a9ad65e	Adds support for accelerated sorting with x86-simd-sort (#127936 ) Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available. For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads. <details> <summary><b>Contiguous Benchmarks</b></summary> ``` float32, normally distributed (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 7.150844336 6.886271477 7.132277489 1.038420335 1.002603214 128 9.208030939 8.478154898 7.846915245 1.086089019 1.173458697 1024 37.79037627 23.60707456 16.44122627 1.600807257 2.298513241 10000 714.7355628 203.9921844 105.5683001 3.503739934 6.770361577 100000 8383.074408 721.6333354 465.3709247 11.61680593 18.01374766 1000000 97124.31945 5632.054572 3920.148401 17.24491803 24.77567416 10000000 1161974.907 86070.48988 71533.82301 13.50027063 16.24371323 int32_t, uniformly distributed (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 7.203208685 6.92212224 7.014458179 1.040606975 1.026908779 128 8.972388983 8.195516348 7.592543125 1.094792396 1.18173698 1024 32.77489477 23.6874548 15.36617105 1.383639359 2.132925285 10000 607.8824128 193.3402024 99.25090471 3.144107667 6.124703997 100000 523.9384684 608.1836536 442.3166784 0.861480682 1.184532472 1000000 5211.348627 5271.598405 3518.861883 0.988570871 1.480975611 10000000 133853.6263 81463.05084 67852.97394 1.643120714 1.972700952 ``` </details> Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction. 
<details> <summary><b>Discontiguous Benchmarks</b></summary> ``` float, normal distributed, discontiguous in sorted dimension (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 3.836543679 4.011214256 3.84376061 0.956454439 0.99812243 128 5.755310194 5.755723127 4.820394962 0.999928257 1.193949923 1024 49.46946019 24.78790785 15.47874362 1.995709379 3.195960952 10000 665.2505291 236.6165959 143.9490662 2.811512551 4.621429974 100000 4328.002203 1329.001212 818.3516414 3.256582586 5.288682743 1000000 47651.5018 16693.72045 11827.39551 2.854456677 4.028909133 10000000 556655.1288 236252.6258 184215.9828 2.356185998 3.021752621 int32_t, uniformly distributed, discontiguous in sorted dimension (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 3.817994356 3.878117442 3.770039797 0.984496837 1.012719908 128 5.578731397 5.577152082 4.716770534 1.000283176 1.182743862 1024 43.3412619 23.61275801 14.55446819 1.835501887 2.977866408 10000 634.3997478 224.4322851 133.9518324 2.826686667 4.736028889 100000 4084.358152 1292.363303 781.7867576 3.16037924 5.22438902 1000000 46262.20465 16608.35284 11367.51817 2.785478192 4.06968381 10000000 541231.9104 235185.1861 180249.9294 2.301301028 3.002674742 ``` </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-09-20 21:19:33 +00:00
cyy	d2455b99fb	Use cpython declaration of _PyWeakref_ClearRef (#136300 ) To avoid the DLL inconsistency warning by MSVC: ``` torch/csrc/utils/python_compat.h(38): warning C4273: '_PyWeakref_ClearRef': inconsistent dll linkage ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136300 Approved by: https://github.com/Skylion007	2024-09-20 18:58:58 +00:00
Bob Ren	7f9c06462f	fix mypi in utils/_sympy/functions.py (#136339 ) Signed-off-by: Bob Ren <bobren@fb.com> It turns out that older versions of Python, in particular 3.8, show errors that 3.12 doesn't. For posterity, these are the steps I took to reproduce: ``` conda create -n py38 python=3.8 conda activate py38 pip install -r requirements.txt lintrunner init dmypy restart && lintrunner --all-files --take MYPY ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136339 Approved by: https://github.com/Skylion007 ghstack dependencies: #136205	2024-09-20 18:39:16 +00:00
Bin Bao	f53a0f9cc1	[Inductor] Fix test_profiler_mark_wrapper_call_cuda_cuda_wrapper (#136356 ) Summary: Internal profiler behaves differently after turning on triton.autotune_at_compile_time. Needs more investigation but turning it off for this test for now. Reviewed By: henrylhtsang Differential Revision: D63035855 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136356 Approved by: https://github.com/henrylhtsang	2024-09-20 18:35:09 +00:00
Xu Song	5997354151	Add more distributed examples (#130427 ) 1. Add `gather` example 2. Add device to `scatter` example Pull Request resolved: https://github.com/pytorch/pytorch/pull/130427 Approved by: https://github.com/kwen2501	2024-09-20 18:27:27 +00:00
PyTorch MergeBot	df1eef9779	Revert "[torch][ao] Add customizable loss function to NodeAccuracySummary (#136282 )" This reverts commit f3c54ccf8f6139807f4623037c0174964a286652. Reverted https://github.com/pytorch/pytorch/pull/136282 on behalf of https://github.com/huydhn due to This breaks OSS, let revert it and land the revert internally then ([comment](https://github.com/pytorch/pytorch/pull/136282#issuecomment-2364219252))	2024-09-20 17:49:06 +00:00
Jeff Daily	15dba021bb	[ROCm][CI] upgrade CI to ROCm 6.2 (#132555 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132555 Approved by: https://github.com/pruthvistony, https://github.com/malfet	2024-09-20 17:39:31 +00:00
Chirag Pandya	29affa6b95	return instead of using skipTest (#136244 ) Summary: Return from functions instead of using `skipTest`. This is mostly to make our test report happier. Skipped tests still show up in our Broken test report. ``` OK (skipped=1) I0917 16:14:24.749060 1018907 StorageDemandControl.cpp:572] Flushing Demand Control ODS counters Skipped: Store doesn't support extended APIs ``` Test Plan: Tested locally. Test shows up as passed instead of skipped. ``` Cache hits: 99%. Commands: 125048 (cached: 124961, remote: 10, local: 77) Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Differential Revision: D62912065 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136244 Approved by: https://github.com/XilunWu	2024-09-20 17:36:28 +00:00
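The difference between the two styles in the commit above can be seen with the standard `unittest` module. This is a minimal sketch: the `StoreTest` class, its `supports_extended_api` flag, and `run_suite` are hypothetical names for illustration and are not taken from the PyTorch test suite.

```python
import unittest


class StoreTest(unittest.TestCase):
    # Hypothetical capability flag standing in for "store supports extended APIs".
    supports_extended_api = False

    def test_with_skip(self):
        if not self.supports_extended_api:
            # Reported as "skipped" by the test runner.
            self.skipTest("Store doesn't support extended APIs")
        self.assertTrue(self.supports_extended_api)

    def test_with_return(self):
        if not self.supports_extended_api:
            # Early return: the test is reported as passed, not skipped.
            return
        self.assertTrue(self.supports_extended_api)


def run_suite() -> unittest.TestResult:
    """Run both tests and return the collected result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(StoreTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

The trade-off is visible in the result: the early return keeps the skip out of reports, but it also hides the fact that the test body never ran.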
David Berard	d7a6980078	[inductor] Make DtypeView work with cpp_wrapper without abi_compatible (#136233 ) Fixes #136159 Prior to this PR, using cpp_wrapper without abi_compatible could result in incorrect dtypes. The following block of code implements cpp_wrapper codegen for reinterpret_view for abi_compatible mode, but not for non-abi_compatible mode. `f6f1504d39/torch/_inductor/codegen/cpp_wrapper_cpu.py (L1678-L1814)` Added a test that verifies that we keep the view behavior, but returned tensors also have correct dtypes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136233 Approved by: https://github.com/FindHao, https://github.com/eellison, https://github.com/jansel	2024-09-20 17:30:35 +00:00
Aleksei Nikiforov	080a249fc2	S390x update builder image (#132983 ) S390x update builder image Pull Request resolved: https://github.com/pytorch/pytorch/pull/132983 Approved by: https://github.com/huydhn, https://github.com/malfet	2024-09-20 17:26:26 +00:00
PyTorch MergeBot	783c5ba80a	Revert "[PT2/Profiler] Add Context Info to Torch-Compiled Regions (#132765 )" This reverts commit 0b81f700aa7eb20d4b9f20e9627dd1208e50ea58. Reverted https://github.com/pytorch/pytorch/pull/132765 on behalf of https://github.com/ezyang due to implementation is not correct, needs full rewrite ([comment](https://github.com/pytorch/pytorch/pull/132765#issuecomment-2364160452))	2024-09-20 17:10:27 +00:00
IvanKobzarev	cdef760560	[aotd] Fix freezing API for subclasses (#136265 ) Original issue: https://github.com/pytorch/ao/issues/890 The problem: TracingContext.flat_params contains the original params, with subclasses not desugared, while the inductor.freezing API works on AOT graphs, in which subclasses are already desugared. flat_params is used only for this logic, so storing desugared subclasses in it fixes the issue. Testing: ``` python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses ``` Torch AO original failure: ``` python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265 Approved by: https://github.com/bdhirsh	2024-09-20 16:32:49 +00:00
Aditya Tewari	4842f0fac6	Enable torch build with SLEEF on ARM by default (#133339 ) Scope: Enable PyTorch build with SLEEF on Arm by default. Enable codegen kernels compilation with SLEEF on ARM platform. Enabling the build with SLEEF by default and setting `AT_BUILD_ARM_VEC256_WITH_SLEEF` as the default for Arm improves performance for some models. I have benchmarked several networks on `Neoverse-V1` using `torch.compile` with the `inductor` backend. On models like `hf_Bert_Large` , `hf_GPT_fast`, we're seeing a ~1.2x speedup (with 16 threads). The below results are run with `Batch_Size=1` and `Cores=8, 16` ![Screenshot 2024-08-27 at 17 04 23](https://github.com/user-attachments/assets/319c7ef7-1202-4145-a51a-7a80dfd5f1f6) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133339 Approved by: https://github.com/malfet, https://github.com/kimishpatel Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-20 16:02:32 +00:00
Riley Dulin	f3c54ccf8f	[torch][ao] Add customizable loss function to NodeAccuracySummary (#136282 ) Summary: Add a customizable loss function callback to NodeAccuracySummary to allow users to pass in their own loss function. Also, fix some type errors and propagate better exception messages when unexpected tensor comparisons occur. Finally, enhance the robustness of `generate_numeric_debug_handle` in the case where it is called multiple times on the same model, by avoiding reuse of the same IDs. Test Plan: Added a test for this case in `test_numeric_debugger`. Reviewed By: jerryzh168 Differential Revision: D62898297 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282 Approved by: https://github.com/jerryzh168	2024-09-20 07:34:52 +00:00
Sun, Jiayi	687e5cf8c5	[inductor] Relax the conditions for loop split (#135335 ) Summary: This PR relaxes the conditions for loop split to support dynamic shape cases. Now the conditions that need to be met to apply loop split optimization are as follows: 1. No reduction and no modular index for all nodes. 2. The indexing_exprs of all nodes contain only one (or more, but all the same) division, where the divisor is an integer, the dividend is one of the iter_vars, and this var, i.e. the dimension that needs to be split, is contiguous in all other indexing_exprs. Example: ``` import torch import torch.nn as nn class GN(torch.nn.Module): def __init__(self, num_groups, num_channels): super(GN, self).__init__() self.gn = nn.GroupNorm(num_groups, num_channels) def forward(self, x): return self.gn(x) input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last) m = GN(32, 960).eval() compiled_m = torch.compile(m, dynamic=True) with torch.no_grad(): compiled_m(input) ``` Before loop split, the node's var_ranges: `{z0: s0, z1: s2, z2: s2, z3: 960}` and indexing_exprs: `{'index0': 960*s2**2*z0 + 960*s2*z1 + 960*z2 + z3, 'index1': 32*z0 + (z3//30), 'index2': 30*s2**2, 'index3': z3, 'index4': 960*s2*z0*((s2**2//s2)) + 960*z1*((s2**2//s2)) + 960*z2 + z3}`. 
After loop split `z3` will split to `30*z3 + z4`, then the node's var_ranges will be changed to `{z0: s0, z1: s2, z2: s2, z3: 32, z4: 30}` and indexing_exprs will be changed to `{'index0': 960*s2**2*z0 + 960*s2*z1 + 960*z2 + 30*z3 + z4, 'index1': 32*z0 + z3, 'index2': 30*s2**2, 'index3': 30*z3 + z4, 'index4': 960*s2*z0*((s2**2//s2)) + 960*z1*((s2**2//s2)) + 960*z2 + 30*z3 + z4}` Generated code: - Before: ``` cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'const int64_t', 'const int64_t'], ''' #include "/tmp/torchinductor_jiayisun/32/c32dcqa3qidvmunis4lucp3dhoicleq5qjfjfgvpiadbbzfp6ofy.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2, const int64_t ks0, const int64_t ks1) { #pragma omp parallel num_threads(112) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(c10::div_floor_integer(static_cast<int64_t>((15L*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(8L)))); for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(static_cast<int64_t>(ks1*ks1)); x2+=static_cast<int64_t>(1L)) { for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(16L); x3+=static_cast<int64_t>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16)); tmp_acc0_vec = 
welford_combine(tmp_acc0_vec, tmp0, &wrecps0); } for(int64_t x3=static_cast<int64_t>(16L); x3<static_cast<int64_t>(30L); x3+=static_cast<int64_t>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L)); masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, static_cast<int64_t>(14L), &wrecps0); } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2); } } } } { #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(ks1); x1+=static_cast<int64_t>(1L)) { #pragma GCC ivdep for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(ks1); x2+=static_cast<int64_t>(1L)) { #pragma GCC ivdep for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(960L); x3+=static_cast<int64_t>(1L)) { auto tmp0 = in_ptr0[static_cast<int64_t>(x3 + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1))))]; auto tmp1 = out_ptr0[static_cast<int64_t>((32L*x0) + (c10::div_floor_integer(static_cast<int64_t>(x3), static_cast<int64_t>(30L))))]; auto tmp3 = out_ptr1[static_cast<int64_t>((32L*x0) + (c10::div_floor_integer(static_cast<int64_t>(x3), static_cast<int64_t>(30L))))]; auto tmp11 = in_ptr1[static_cast<int64_t>(x3)]; auto tmp13 = in_ptr2[static_cast<int64_t>(x3)]; auto tmp2 = decltype(tmp0)(tmp0 - tmp1); auto tmp4 = 30L*(static_cast<int64_t>(ks1*ks1)); auto tmp5 = c10::convert<float>(tmp4); auto tmp6 = tmp3 / tmp5; auto tmp7 = static_cast<float>(1e-05); auto tmp8 = decltype(tmp6)(tmp6 + tmp7); auto tmp9 = 1 / std::sqrt(tmp8); auto tmp10 = 
decltype(tmp2)(tmp2 * tmp9); auto tmp12 = decltype(tmp10)(tmp10 * tmp11); auto tmp14 = decltype(tmp12)(tmp12 + tmp13); out_ptr2[static_cast<int64_t>(x3 + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))))] = tmp14; } } } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args args.clear() s0 = arg2_1 s2 = arg3_1 assert_size_stride(arg0_1, (960, ), (1, )) assert_size_stride(arg1_1, (960, ), (1, )) assert_size_stride(arg4_1, (s0, 960, s2, s2), (960*(s2*s2), 1, 960*s2, 960)) buf0 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32) buf1 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32) buf3 = empty_strided_cpu((s0, 960, s2, s2), (960*s2*((s2*s2) // s2), 1, 960*((s2*s2) // s2), 960), torch.float32) cpp_fused_native_group_norm_0(arg4_1, arg0_1, arg1_1, buf0, buf1, buf3, s0, s2) del arg0_1 del arg1_1 del arg4_1 return (buf3, ) ``` After: ``` cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'const int64_t', 'const int64_t'], ''' #include "/tmp/torchinductor_jiayisun/32/c32dcqa3qidvmunis4lucp3dhoicleq5qjfjfgvpiadbbzfp6ofy.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2, const int64_t ks0, const int64_t ks1) { #pragma omp parallel num_threads(112) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = 
Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(c10::div_floor_integer(static_cast<int64_t>((15L*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(8L)))); for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(static_cast<int64_t>(ks1*ks1)); x2+=static_cast<int64_t>(1L)) { for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(16L); x3+=static_cast<int64_t>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16)); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0); } for(int64_t x3=static_cast<int64_t>(16L); x3<static_cast<int64_t>(30L); x3+=static_cast<int64_t>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L)); masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, static_cast<int64_t>(14L), &wrecps0); } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2); } } } } { #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(ks1); x1+=static_cast<int64_t>(1L)) { #pragma GCC ivdep for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(ks1); x2+=static_cast<int64_t>(1L)) { #pragma GCC ivdep for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(1L)) { for(int64_t 
x4=static_cast<int64_t>(0L); x4<static_cast<int64_t>(16L); x4+=static_cast<int64_t>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16)); auto tmp1 = out_ptr0[static_cast<int64_t>(x3 + (32L*x0))]; auto tmp4 = out_ptr1[static_cast<int64_t>(x3 + (32L*x0))]; auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(16)); auto tmp15 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(16)); auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 - tmp2; auto tmp5 = 30L*(static_cast<int64_t>(ks1*ks1)); auto tmp6 = c10::convert<float>(tmp5); auto tmp7 = tmp4 / tmp6; auto tmp8 = static_cast<float>(1e-05); auto tmp9 = decltype(tmp7)(tmp7 + tmp8); auto tmp10 = 1 / std::sqrt(tmp9); auto tmp11 = at::vec::Vectorized<float>(tmp10); auto tmp12 = tmp3 * tmp11; auto tmp14 = tmp12 * tmp13; auto tmp16 = tmp14 + tmp15; tmp16.store(out_ptr2 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))))); } for(int64_t x4=static_cast<int64_t>(16L); x4<static_cast<int64_t>(30L); x4+=static_cast<int64_t>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L)); auto tmp1 = out_ptr0[static_cast<int64_t>(x3 + (32L*x0))]; auto tmp4 = out_ptr1[static_cast<int64_t>(x3 + (32L*x0))]; auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(14L)); auto tmp15 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x4 + (30L*x3)), 
static_cast<int64_t>(14L)); auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 - tmp2; auto tmp5 = 30L*(static_cast<int64_t>(ks1*ks1)); auto tmp6 = c10::convert<float>(tmp5); auto tmp7 = tmp4 / tmp6; auto tmp8 = static_cast<float>(1e-05); auto tmp9 = decltype(tmp7)(tmp7 + tmp8); auto tmp10 = 1 / std::sqrt(tmp9); auto tmp11 = at::vec::Vectorized<float>(tmp10); auto tmp12 = tmp3 * tmp11; auto tmp14 = tmp12 * tmp13; auto tmp16 = tmp14 + tmp15; tmp16.store(out_ptr2 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1))))), static_cast<int64_t>(14L)); } } } } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args args.clear() s0 = arg2_1 s2 = arg3_1 assert_size_stride(arg0_1, (960, ), (1, )) assert_size_stride(arg1_1, (960, ), (1, )) assert_size_stride(arg4_1, (s0, 960, s2, s2), (960*(s2*s2), 1, 960*s2, 960)) buf0 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32) buf1 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32) buf3 = empty_strided_cpu((s0, 960, s2, s2), (960*s2*((s2*s2) // s2), 1, 960*((s2*s2) // s2), 960), torch.float32) cpp_fused_native_group_norm_0(arg4_1, arg0_1, arg1_1, buf0, buf1, buf3, s0, s2) del arg0_1 del arg1_1 del arg4_1 return (buf3, ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135335 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel	2024-09-20 05:42:52 +00:00
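The index rewrite behind the loop split in the commit above can be checked with plain integer arithmetic: replacing the channel variable (range 960) by an outer variable (range 32) and an inner variable (range 30) removes the floor division from the group-statistics index while still addressing the same channel. The sketch below is only an illustration of that arithmetic, not Inductor code; the function names are hypothetical, and the constants 30, 32 and 960 come from the GroupNorm(32, 960) example in the commit message.

```python
from typing import Tuple


def split_var(z3: int, divisor: int = 30) -> Tuple[int, int]:
    """Split the original channel index into (outer, inner) with
    z3 == divisor * outer + inner and 0 <= inner < divisor."""
    return divmod(z3, divisor)


def index1_before(z0: int, z3: int) -> int:
    # Group-statistics index before the split: needs a floor division per element.
    return 32 * z0 + z3 // 30


def index1_after(z0: int, z3_outer: int) -> int:
    # After the split the outer variable is the group id; no division is left.
    return 32 * z0 + z3_outer


def index3_after(z3_outer: int, z4: int) -> int:
    # Channel index after the split; contiguous in the inner variable z4.
    return 30 * z3_outer + z4


def check_equivalence(num_batches: int = 2, channels: int = 960) -> bool:
    """Verify that the split indices address the same elements as before."""
    for z0 in range(num_batches):
        for z3 in range(channels):
            outer, inner = split_var(z3)
            if index3_after(outer, inner) != z3:
                return False
            if index1_after(z0, outer) != index1_before(z0, z3):
                return False
    return True
```

Because the inner variable only appears with stride 1, the innermost loop can be vectorized, which is what the "After" kernel in the commit message shows with its `x4` loop.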
albanD	cf31724db7	Fix and improvements to toward 3.13t (#136319 ) Small part of https://github.com/pytorch/pytorch/pull/130689 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136319 Approved by: https://github.com/malfet, https://github.com/Skylion007	2024-09-20 04:22:18 +00:00
Tom Ritchford	e3ea5429f2	Implement GetAttrVariable.as_python_constant() (#134216 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134216 Approved by: https://github.com/amjames, https://github.com/williamwen42	2024-09-20 03:44:43 +00:00
Sergii Dymchenko	d9aca9914b	Remove duplicated words in library.rst (#136340 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136340 Approved by: https://github.com/svekars	2024-09-20 03:30:54 +00:00
Huy Do	fe0e9fb385	Fix flaky SIGSEGV crash in test_profile_memory (#136304 ) Fixes https://github.com/pytorch/pytorch/issues/132331 We need another barrier here to ensure that the main thread doesn't stop the profiler while other threads are still using it (and crash). I can reliably reproduce the issue with `pytest -v test/profiler/test_cpp_thread.py -k test_profile_memory --flake-finder`. ### Testing `pytest -v test/profiler/test_cpp_thread.py --flake-finder` all passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136304 Approved by: https://github.com/briancoutinho	2024-09-20 02:56:49 +00:00
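The synchronization idea in the commit above (the main thread must not stop a shared resource while worker threads are still using it) can be illustrated with `threading.Barrier` from the Python standard library. This is a sketch under stated assumptions: `FakeProfiler` and `run_workers` are hypothetical stand-ins, and the actual fix lives in the C++ test code, which this example does not reproduce.

```python
import threading
from typing import List, Tuple


class FakeProfiler:
    """Hypothetical stand-in for a shared resource that must outlive its users."""

    def __init__(self) -> None:
        self.active = True
        self.events: List[str] = []
        self._lock = threading.Lock()

    def record(self, name: str) -> None:
        if not self.active:
            raise RuntimeError("profiler used after stop()")
        with self._lock:
            self.events.append(name)

    def stop(self) -> None:
        self.active = False


def run_workers(num_workers: int = 4) -> Tuple[List[str], List[Exception]]:
    profiler = FakeProfiler()
    # One party per worker plus one for the main thread.
    done = threading.Barrier(num_workers + 1)
    errors: List[Exception] = []

    def worker(i: int) -> None:
        try:
            profiler.record(f"worker-{i}")
        except RuntimeError as exc:
            errors.append(exc)
        finally:
            done.wait()  # signal that this worker no longer needs the profiler

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()

    done.wait()      # main thread blocks until every worker has finished recording
    profiler.stop()  # only now is it safe to stop
    for t in threads:
        t.join()
    return sorted(profiler.events), errors
```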
Kurt Mohler	d45b0151e5	Add deterministic path for CUDA `cumsum` (#136224 ) Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA. Fixes #89492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224 Approved by: https://github.com/ezyang, https://github.com/justinchuby	2024-09-20 02:41:56 +00:00
Felix Su	1dfa07e885	passing FileTimerRequests.to_json() to log_debug_info_for_expired_timers for a better debugging experience (#135913 ) Summary: The change involves passing the expired timers to the log_debug_info_for_expired_timers function after to_json() has been applied. This change is made to provide a better debugging experience for the user. Test Plan: unit tests Reviewed By: gag1jain Differential Revision: D62408767 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135913 Approved by: https://github.com/gag1jain	2024-09-20 00:54:02 +00:00
Tristan Rice	bebf5302ba	TCPStoreLibUvBackend: trace operations (#136320 ) Summary: This logs all operations when tracing log level is enabled for the `TCPStoreLibUvBackend`. This is very useful for debugging collective operations when issues occur as it logs all hosts and the keys that they're modifying. To minimize total data we only log the keys and not the values This changes the C10D_* macros to be much more efficient -- previously we would always format the log string even if they would never be printed which is very wasteful for detailed tracing. This now gates them with an if statement to achieve the same behavior with no overhead Test Plan: ``` TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c "echo foo" ``` ``` I0919 09:26:52.352013 34271 TCPStore.cpp:285] [c10d - debug] The server has started on port = 29500. I0919 09:26:52.352246 34271 socket.cpp:783] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500). I0919 09:26:52.352241 36903 TCPStoreLibUvBackend.cpp:1173] [c10d - debug] Uv main loop running I0919 09:26:52.352308 34271 socket.cpp:854] [c10d - trace] The client socket is attempting to connect to [localhost]:29500. I0919 09:26:52.353633 34271 socket.cpp:945] [c10d] The client socket has connected to [localhost]:29500 on SocketImpl(fd=41, addr=[localhost]:45646, remote=[localhost]:29500). 
I0919 09:26:52.354422 34271 TCPStore.cpp:321] [c10d - debug] TCP client connected to host 127.0.0.1:29500 I0919 09:26:52.354558 36903 TCPStoreLibUvBackend.cpp:774] [c10d - trace] validate magic:1015412686 address:[localhost]:45646 I0919 09:26:52.354638 36903 TCPStoreLibUvBackend.cpp:789] [c10d - trace] ping nonce:34271 address:[localhost]:45646 I0919 09:26:52.356122 36903 TCPStoreLibUvBackend.cpp:866] [c10d - trace] add key:init/ val:1 address:[localhost]:45646 I0919 09:26:52.356308 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.356410 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:init/ address:[localhost]:45646 I0919 09:26:52.358688 36903 TCPStoreLibUvBackend.cpp:808] [c10d - trace] set key:/none/torchelastic/role_info/0 address:[localhost]:45646 I0919 09:26:52.360177 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.360296 36903 TCPStoreLibUvBackend.cpp:1004] [c10d - trace] multi_get key_count:1 address:[localhost]:45646 I0919 09:26:52.362076 36903 TCPStoreLibUvBackend.cpp:1036] [c10d - trace] multi_set key_count:1 address:[localhost]:45646 I0919 09:26:52.364001 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.364091 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:/none/torchelastic/assigned_ranks/0 address:[localhost]:45646 ``` Differential Revision: D62924454 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136320 Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu	2024-09-20 00:53:21 +00:00
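The macro change described in the entry above (skip formatting when the message would never be printed) has a direct analogue in Python's standard `logging` module. The sketch below is an analogy only: the real change is in the C++ `C10D_*` macros, Python has no built-in trace level so `DEBUG` stands in for it, and the logger name and helper functions are hypothetical.

```python
import logging

logger = logging.getLogger("c10d_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)              # DEBUG messages are disabled


class CountingRepr:
    """Counts how often it is formatted, to make the cost visible."""

    def __init__(self):
        self.calls = 0

    def __str__(self):
        self.calls += 1
        return "expensive"


def log_ungated(value):
    # The f-string is built even though DEBUG is disabled.
    logger.debug(f"set key:{value}")


def log_gated(value):
    # Formatting happens only if the message would actually be emitted.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(f"set key:{value}")
```

With `DEBUG` disabled, `log_ungated` still pays for the string formatting while `log_gated` does not.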
Wei Wang	9b424aac1d	[CI][CUSPARSELT] Extend cusparselt installation script to support cuda 12.6 (#136321 ) To prepare for future cuda updates. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136321 Approved by: https://github.com/Skylion007, https://github.com/eqy	2024-09-19 23:45:57 +00:00
Brian Hirsh	172ecf78b7	DTensor: dont hash symint tensor input in propagate_tensor_meta (#136266 ) This fixes a subset of issues for dynamic shapes + DTensor. It's pretty easy to run into other issues - it's likely that we need https://github.com/pytorch/pytorch/pull/125941 to land for DTensor + dynamic shapes to work more generally. I ended up writing a test that had dynamic shape inputs but not dynamic shape outputs in order to properly test this fix Pull Request resolved: https://github.com/pytorch/pytorch/pull/136266 Approved by: https://github.com/ezyang, https://github.com/yf225	2024-09-19 20:39:36 +00:00
cyy	7bbdf87517	[22/N] Fix clang-tidy warnings in jit (#134829 ) Follows #134537 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134829 Approved by: https://github.com/ezyang	2024-09-19 19:24:42 +00:00
Laith Sakka	b71802fa79	add basic_modules_ListOfLinears_inductor_gpu_force_shape_pad (#136175 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136175 Approved by: https://github.com/ezyang	2024-09-19 19:15:50 +00:00
Rachel Guo	8cba0ec958	[AOTI][Tooling][8/n] Add option to pinpoint kernel names in debug printer (#136182 ) Summary: Add a third mode where we only print kernel names without dumping any intermediate tensor values. It can help quickly identify the troublesome kernels in CUDA IMA issues. Thanks to ColinPeppler and henrylhtsang for this "feature request". Test Plan: The output can look like this if you set `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3`: {F1871629091} Differential Revision: D62791371 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136182 Approved by: https://github.com/henrylhtsang	2024-09-19 18:51:57 +00:00
Shan19900305	49723a8ff3	fix stride compare failed when size value equal to one in ForeachUtils.h (#134546 ) When a size value equals one, the corresponding stride value must be skipped in the comparison. @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/134546 Approved by: https://github.com/janeyx99	2024-09-19 18:43:41 +00:00
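The rule in the entry above follows from the fact that a size-1 dimension is only ever indexed at 0, so its stride never affects which element is addressed. A minimal pure-Python sketch of such a comparison is given below; `strides_match` is a hypothetical helper and is not the code in `ForeachUtils.h`.

```python
def strides_match(sizes, strides_a, strides_b):
    """Compare two stride tuples, ignoring dimensions of size 1.

    Hypothetical helper for illustration only.
    """
    if not (len(sizes) == len(strides_a) == len(strides_b)):
        return False
    return all(
        sa == sb
        for size, sa, sb in zip(sizes, strides_a, strides_b)
        if size != 1  # a size-1 dim is always indexed at 0, so its stride is irrelevant
    )
```

For example, shapes `(2, 1, 3)` with strides `(3, 3, 1)` and `(3, 1, 1)` describe the same memory layout even though the middle strides differ.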
Jerry Mannil	ccca3de0cd	[ROCm] Enable Flex attention tests on AMD gpus (#136245 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136245 Approved by: https://github.com/malfet	2024-09-19 18:02:41 +00:00
Bob Ren	8d9c42735a	Type _sympy/functions.py [1/n] (#136205 ) Signed-off-by: Bob Ren <bobren@fb.com> I was chatting with @jamesjwu about strategies to learn the code and he suggested adding types to some files. This stack of PRs adds types to _sympy/functions.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/136205 Approved by: https://github.com/Skylion007, https://github.com/jamesjwu	2024-09-19 17:15:53 +00:00
James Wu	803ce507f1	Log structured logging overhead to dynamo compile (kinda) (#136142 ) Summary: X-link: https://github.com/pytorch/benchmark/pull/2454 This adds structured logging overhead at a per compile basis to compilation metrics. To do so, we track the frame_id_frame_compile_id that trace_structured uses to categorize compiles, and use that as the key in our timing table. Implementation notes: - If there's times we call trace_structured without a compile id, the time won't be measured. Not really a good way around that today given the compile id framework of compilation metrics. Strobelight is still the best way to measure on a per job basis. - We don't actually measure the time it takes to log the compilation metrics itself. Fundamentally, it's not possible to log this properly if we're storing the logging number in compilation metrics, since there's no way to measure it before we do it(unless we want discrepancies between dynamo_compile and tlparse, which seems suboptimal). Hopefully for a large job, the cost of structured_logging compilation metrics itself is small. - I wanted to use frame_phase_timing here, but there's a bunch of ids to iron out, and I don't really want to deal with that headache. compilation_time_metrics is sort of what I want, but that isn't by frame/compile id, so it's also a bit off. Putting it into torch.logging as a separate thing so logging tracks its own overhead seems fine, though. Test Plan: Run benchmarks/nanogpt and staging logger. See that the new compilation metric is logged to the staged dynamo_compile table: https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/xazjg5xq Note that the sum(structured_logging_overhead_s) / sum(entire_frame_compile_time) = 8.387 / 124.278 = 6%, which seems reasonable as the overhead for a small compilation like this. You can also look at samples for a more detailed log of this. 
Reviewed By: oulgen Differential Revision: D62643611 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136142 Approved by: https://github.com/bobrenjc93	2024-09-19 16:11:38 +00:00
Andrew Gu	65df26f615	[FSDP2] Fixed 2D mismatched grad placements (#136237 ) ``` CUDA_VISIBLE_DEVICES=2,3,6,7 pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_train_parity_2d_transformer ``` Differential Revision: [D62964658](https://our.internmc.facebook.com/intern/diff/D62964658) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136237 Approved by: https://github.com/weifengpy	2024-09-19 14:35:15 +00:00
PyTorch MergeBot	4ea741d24f	Revert "Reland D62220158 (#136213 )" This reverts commit 083c9149b75cd918b6fb2795050d7173923a3629. Reverted https://github.com/pytorch/pytorch/pull/136213 on behalf of https://github.com/jeanschmidt due to Seems to have introduced regressions in rocm signals ([comment](https://github.com/pytorch/pytorch/pull/136213#issuecomment-2360885064))	2024-09-19 12:44:54 +00:00
Igor Sugak	bce52d0b60	[CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annotations (#136288 ) Summary: To facilitate PSS-2 upgrade, this uses `npt.NDArray` instead of `np.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `np.ndarray` -- a noop. In Numpy-1.24, `npt.NDArray` is a proper generic type, and without this change uses of `np.ndarray` generate this Pyre type error: ```counterexample Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters. ``` Test Plan: Sandcastle plus visual inspection Differential Revision: D62977370 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288 Approved by: https://github.com/kit1980	2024-09-19 12:40:36 +00:00
Jan Wieczorek	908a5689eb	Return unsafe_view instead of view from matmul when folding occurs (#134568 ) When tensor folding occurs during a matmul operation, the returned tensor is a view. This can cause issues when matmul is used inside a custom function and that view is then returned as output: it cannot be modified in place, which causes errors. It is especially problematic when an in-place allreduce is performed after such a function. The issue is resolved when unsafe_view is returned from matmul instead. This solution aligns the matmul decomposition with the eager implementation, so that a non-view tensor is returned. The test included in this PR reproduces the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134568 Approved by: https://github.com/zou3519	2024-09-19 11:52:16 +00:00
Huy Do	db80b98ec4	XFAIL test_segfault (#136252 ) Fixes https://github.com/pytorch/pytorch/issues/128551 This has been failing in trunk for a while, and there is no owner yet to fix it properly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136252 Approved by: https://github.com/andrewkho	2024-09-19 04:17:06 +00:00
Duygu Altinok	775517693a	Add type checks for Tensor.add_ (#135864 ) Fixes #127049 There's already a meta func in `meta_registrations.py` for `add_` and `sub_` methods. I added a second meta function for error checking, i.e. `int.add/sub_(float)` and `bool.add/sub_(other types)`. Also, the corresponding test with Dynamo passes, so I removed `@xfailIfTorchDynamo`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135864 Approved by: https://github.com/williamwen42	2024-09-19 03:09:36 +00:00
William Wen	e037bb326f	[dynamo] fix crash in InspectSignatureVariable (#136010 ) Fix crash that was happening in https://github.com/pytorch/pytorch/issues/128095, because we were trying to extract a constant incorrectly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136010 Approved by: https://github.com/yanboliang, https://github.com/anijain2305, https://github.com/jansel	2024-09-19 00:23:00 +00:00
Jerry Zhang	f2b0fc89f2	Add uint16 support for observer (#136238 ) Summary: att Test Plan: python test/test_quantization.py -k TestObserver Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D62909821](https://our.internmc.facebook.com/intern/diff/D62909821) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136238 Approved by: https://github.com/tarun292	2024-09-18 23:52:18 +00:00
Nikita Shulga	068c80e6b6	[BE][MPS] Fix deprecation warnings on MacOS 15.0 (#136292 ) [reverseSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reversesquareroot(with:name:)?changes=__8&language=objc) were deprecated in favor of [reciprocalSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reciprocalsquareroot(_:name:)?changes=__8&language=objc) Without it, following warnings are generated if compiled on recently released MacOS Sequoia: ``` /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:720:35: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations] 720 \| rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil]; \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~ \| reciprocalSquareRootWithTensor /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:341:10: note: in instantiation of function template specialization 'at::native::batch_norm_backward_mps(const Tensor &, const Tensor &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, bool, double, std::array<bool, 3>)::(anonymous class)::operator()<MPSGraph , CachedGraph >' requested here 341 \| decltype(std::declval<_Fp>()(std::declval<_Args>()...)) \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:351:19: note: while substituting deduced template arguments into function template '__invoke' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _Args = <MPSGraph , CachedGraph >] 351 \| static decltype(std::__invoke(std::declval<_XFp>(), 
std::declval<_XArgs>()...)) __try_call(int); \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:357:28: note: while substituting deduced template arguments into function template '__try_call' [with _XFp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _XArgs = (no value)] 357 \| using _Result = decltype(__try_call<_Fp, _Args...>(0)); \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:27:32: note: in instantiation of template class 'std::__invokable_r<void, (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, MPSGraph , CachedGraph >' requested here 27 \| __expand_to_true<__enable_if_t<_Pred::value>...> __and_helper(int); \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:38:39: note: while substituting explicitly-specified template arguments into function template '__and_helper' 38 \| using _And _LIBCPP_NODEBUG = decltype(std::__and_helper<_Pred...>(0)); \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:828:20: note: (skipping 1 context in backtrace; use -ftemplate-backtrace-limit=0 to see all) 828 \| bool = _And< _IsNotSame<__remove_cvref_t<_Fp>, function>, __invokable<_Fp, _ArgTypes...> >::value> \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:841:49: note: in instantiation of default argument for '__callable<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &>' required here 841 \| using 
_EnableIfLValueCallable = __enable_if_t<__callable<_Fp&>::value>; \| ^~~~~~~~~~~~~~~~ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:851:32: note: in instantiation of template type alias '_EnableIfLValueCallable' requested here 851 \| template <class _Fp, class = _EnableIfLValueCallable<_Fp>> \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:852:25: note: in instantiation of default argument for 'function<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68)>' required here 852 \| _LIBCPP_HIDE_FROM_ABI function(_Fp); \| ^~~~~~~~~~~~~ /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68: note: while substituting deduced template arguments into function template 'function' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68), $1 = (no value)] 623 \| auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) { \| ^ /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:24: note: while substituting deduced template arguments into function template 'LookUpOrCreateCachedGraph' [with T = CachedGraph] 623 \| auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) { \| ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here 123 \| -(MPSGraphTensor ) reverseSquareRootWithTensor:(MPSGraphTensor ) tensor \| ^ 
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:745:37: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations] 745 \| rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil]; \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~ \| reciprocalSquareRootWithTensor /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here 123 \| -(MPSGraphTensor ) reverseSquareRootWithTensor:(MPSGraphTensor ) tensor \| ^ 2 warnings generated. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136292 Approved by: https://github.com/kit1980	2024-09-18 23:38:31 +00:00
Nikita Shulga	b9a197df77	[BE][MPS] Delete duplicated code in `View.mm` (#136295 ) After https://github.com/pytorch/pytorch/pull/135706 `getGatherScatterScalarType` returns exactly the same results as `scalarToMetalTypeString` , so delete the function and call `scalarToMetalTypeString` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136295 Approved by: https://github.com/kit1980	2024-09-18 22:44:43 +00:00
Siju Samuel	f1ad680818	[dynamo]Remove stream hardcoding in dynamo VariableBuilder (#131763 ) Fixes #ISSUE_NUMBER Recent change from PR#123487 used torch.cuda.Stream directly and this causes failure for other backends. This PR will generalize the stream handling for all backends like cuda/hpu/xpu Pull Request resolved: https://github.com/pytorch/pytorch/pull/131763 Approved by: https://github.com/yanboliang, https://github.com/yf225	2024-09-18 22:32:34 +00:00
Will Feng	bc9597b7d8	[Traceable FSDP2] Minor refactor to traceable FSDP2 unit tests (#136219 ) Changes in this PR: - Monkey-patching `F.scaled_dot_product_attention` with a lambda seems to not work in some cases. This PR avoids using a lambda. - Running `fullgraph=True` and `fullgraph=False` in the same unit test seems to cause the two cases to interfere with each other and causes error. This PR splits them into two separate unit tests. - The checks in the unit tests might not work with compile cache. This PR turns off the cache in order to have a more predictable compile behavior to do unit test on. Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_True` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_False` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_False` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136219 Approved by: https://github.com/yifuwang	2024-09-18 22:30:23 +00:00
Isuru Fernando	1a86d8aa29	Fix calling Add._from_args and Mul._from_args (#136143 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136143 Approved by: https://github.com/ezyang	2024-09-18 20:51:04 +00:00
Atul Jangra	aae68e2976	Add wait counter for nccl abort (#136067 ) Summary: Quite a few times, we see the NCCL PG abort taking too long. There's no easy way to measure this, so let's add a counter to measure this across the stack. This will help us measure how much time the NCCL abort takes. Test Plan: Unit tests Reviewed By: c-p-i-o Differential Revision: D62675010 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136067 Approved by: https://github.com/fduwjj	2024-09-18 20:14:10 +00:00
eqy	68a7246f13	[cuDNN][conv][A100] Bump tolerances for `vmap_autograd_grad` `conv2d` on A100 (#136178 ) Likely due to a cuDNN heuristics update Pull Request resolved: https://github.com/pytorch/pytorch/pull/136178 Approved by: https://github.com/Skylion007	2024-09-18 19:42:13 +00:00
maajidkhann	5a6ddbcc3b	Extending the Pytorch vec backend for SVE (ARM) (#119571 ) Motivation: In PyTorch, ATen vectorization supports multiple platforms, including x86 and Arm, as well as multiple data types. It provides a generic implementation of the Vector (Vec) type that allows the programmer to write code packing various primitives (such as floats) into 256-bit and 512-bit registers. It can be extended to support other ISAs easily by adding more VecISA sub-classes. Reference Link: https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cpu/vec This PR: * Our goal with this contribution is to add an SVE backend for Vec in the ATen vectorization for the CPU backend, which benefits any ARM CPU that supports SVE. * More about SVE ISA for ARM: [https://developer.arm.com/Architectures/Scalable Vector Extensions](https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions) * We are using the ARM C Language Extensions for SVE (https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics ) to accelerate performance for various operators in the SVE backend for Vec. * Currently we are adding support only for SVE ISA with the vector length of 256 bits (SVE 256). In the future, we plan to extend this SVE support to other vector lengths as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119571 Approved by: https://github.com/malfet, https://github.com/snadampal Co-authored-by: Divya Kotadiya <divya.kotadiya@fujitsu.com>	2024-09-18 18:59:10 +00:00
Jack Taylor	bad69044d8	[ROCm] upgrade ROCm CI builds to py3.10 (#134108 ) Upgrade ROCm CI builds to py3.10 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134108 Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/atalman	2024-09-18 17:39:34 +00:00
fduwjj	3efaa016b1	[c10d] Make test compatible for new pytest (#136158 ) Temporary fix to the issue in https://github.com/pytorch/pytorch/issues/127517. Short-term fix following CPython: `51aefc5bf9/Lib/unittest/case.py (L419-L426)` Differential Revision: [D62878083](https://our.internmc.facebook.com/intern/diff/D62878083) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136158 Approved by: https://github.com/fegin	2024-09-18 17:10:55 +00:00
Scott Wolchok	605f2d802a	[PyTorch] Remove unnecessary include of c10/util/Exception.h in irange.h (#136202 ) Manually audited and can't figure out why this would be needed. Differential Revision: [D62879500](https://our.internmc.facebook.com/intern/diff/D62879500/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136202 Approved by: https://github.com/malfet	2024-09-18 16:57:15 +00:00
CaoE	6a6f5b20c5	Add _addmm_activation to lower precision cast policy on AutocastCPU (#135936 ) Fixes #132613. Add `_addmm_activation` to lower precision cast policy on AutocastCPU. `_addmm_activation` https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/transformer.cpp#L39 of `transformer_encoder_layer_forward` may throw `RuntimeError: mat1 and mat2 must have the same dtype, but got BFloat16 and Float` when autocast is enabled, as `_native_multi_head_attention` is put in lower data type cast policy https://github.com/pytorch/pytorch/pull/107674 and `_addmm_activation` may encounter mixed data types. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135936 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-09-18 16:31:27 +00:00
Isuru Fernando	c8d152cb0e	Fix fast_expand recursion error (#136163 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136163 Approved by: https://github.com/ezyang	2024-09-18 13:58:45 +00:00
Sun, Jiayi	701ba5203f	[Inductor] Increase multiplier to 3 for Inductor AMP FP16 benchmark correctness check (#135932 ) Fix https://github.com/pytorch/pytorch/issues/135657. Aligned with AMP BF16, using multiplier 3 for Inductor AMP FP16 benchmark correctness check Pull Request resolved: https://github.com/pytorch/pytorch/pull/135932 Approved by: https://github.com/CaoE, https://github.com/jgong5, https://github.com/jansel	2024-09-18 13:03:45 +00:00
Prachi Gupta	b5be4d8c05	Fix ROCm skip decorator for test_ddp_tp and multiprocess UTs (#136161 ) skip_if_rocm is used only in multiprocess case (when UT test class is a child of MultiProcessTestCase). Each individual process can exit with a skip code. If used for single process UT, it will cause the UT to fail as the process returns a non-zero exit code. Use skipIfRocm in single process UTs. To avoid the above confusion, this PR renamed skip_if_rocm to skip_if_rocm_multiprocess. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/136161 Approved by: https://github.com/jithunnair-amd, https://github.com/kwen2501, https://github.com/fegin	2024-09-18 11:01:23 +00:00
Menglu Yu	083c9149b7	Reland D62220158 (#136213 ) Summary: We fix the unit test test_pad_mm and reland the diff Test Plan: See in D62220158 Differential Revision: D62891584 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136213 Approved by: https://github.com/dshi7	2024-09-18 07:33:41 +00:00
Jason Ansel	a0207c8471	[dynamo] Fix support for classmethod(property(...)) (#134968 ) Fixes #134451 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968 Approved by: https://github.com/yanboliang	2024-09-18 04:47:51 +00:00
Nikita Shulga	9aa22eabe7	[CI] Make linux-aarch64 shards actually running different tests (#136208 ) Non-functional sharding was introduced in https://github.com/pytorch/pytorch/pull/125255 but each shard in that case was running the same tests... Pull Request resolved: https://github.com/pytorch/pytorch/pull/136208 Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/atalman	2024-09-18 03:10:21 +00:00
Kiuk Chung	8895f69d12	[torch/numpy][numpy2.0 compat] Additional changes for tests to run under numpy-2.0 (#136152 ) Continuation of https://github.com/pytorch/pytorch/pull/131909. This PR makes numpy tests compatible with numpy>=2.0.0. Specifically it deals with APIs that have been removed from numpy-2.0. Changes in this PR: 1. Use `numpy.exceptions.ComplexWarning` if `numpy.exceptions` namespace is present. In numpy-2.0 `numpy.ComplexWarning` has been removed in favor of using `numpy.exceptions.ComplexWarning` (see [numpy-2.0 migration guide](https://numpy.org/devdocs/numpy_2_0_migration_guide.html#changes-to-namespaces)). Note that `numpy.exceptions` was introduced in numpy-1.25.0 hence does not exist in numpy<=1.24.x. 2. Do the same for `numpy.exceptions.VisibleDeprecationWarning` 3. Use `np.sort(...,axis=0)` over `np.msort()`(`np.msort()` removed in numpy-2.0) 4. Use `np.pad()` over `np.lib.pad()` (`np.lib` removed in numpy-2.0) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136152 Approved by: https://github.com/atalman	2024-09-18 02:11:22 +00:00
Nikita Shulga	6682327c75	[BE] Make `NestedTensorTransformerFunctions.cu` compilable without warnings (#136222 ) Before the change compilation produced following warnings: ``` /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In function ‘std::tuple<dim3, dim3, at::native::StackArray<long int> > at::native::check_shape_and_partition_(const at::Tensor&, const std::vector<at::Tensor>&, const at::Tensor&)’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:584:22: warning: comparison of integer expressions of different signedness: ‘const int’ and ‘const size_t’ {aka ‘const long unsigned int’} [-Wsign-compare] 584 \| TORCH_CHECK(num_jagged_dim <= kStackArrayMaxDims); \| ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In lambda function: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1224:1061: warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int’ [-Wsign-compare] 1224 \| AT_DISPATCH_INDEX_TYPES( \| ^ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In lambda function: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1224:1985: warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int’ [-Wsign-compare] 1224 \| AT_DISPATCH_INDEX_TYPES( \| ^ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In instantiation of ‘void at::native::jagged_dense_elementwise_jagged_output_opt_(const at::Tensor&, const std::vector<at::Tensor>&, const at::Tensor&, const at::Tensor&, F) [with scalar_t = c10::Half; F = __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<at::Tensor (*)(const at::Tensor&, 
c10::ArrayRef<at::Tensor>, std::optional<c10::SymInt>), at::native::_fbgemm_dense_to_jagged_forward_symint, c10::Half, 1> >]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1515:1: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1336:2006: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘int’ [-Wsign-compare] 1336 \| AT_DISPATCH_INDEX_TYPES( \| ^ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1336:2113: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘int’ [-Wsign-compare] 1336 \| AT_DISPATCH_INDEX_TYPES( \| ^ ``` after it compiled without a warning Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/136222 Approved by: https://github.com/PaliC, https://github.com/kit1980	2024-09-18 01:24:05 +00:00
leslie-fang-intel	b18ba9419e	[AO][Inductor] Enable WOQ fusion pattern with permute (#135928 ) Summary Fix https://github.com/pytorch/pytorch/issues/135831 and https://github.com/pytorch/ao/issues/890. The root cause of the numerical failure was that the customized woq-int8 kernel was not triggered due to changes in the pattern. After re-adding the fusion pattern, the accuracy check now passes. I will open a separate TorchAO PR to enable these unit tests in TorchAO. Test Plan ``` python test/inductor/test_mkldnn_pattern_matcher.py -k test_woq_int8 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135928 Approved by: https://github.com/jgong5, https://github.com/eellison	2024-09-18 00:56:16 +00:00
Chirag Pandya	cccf500193	[c10d] remove sleep from watchdogHandler (#135760 ) Summary: Remove sleep from the `watchdogHandler` function. This sleep unnecessarily slows things down during an NCCL timeout. Flight recorder is configured to take a minute, at most, to dump out its buffer. This sleep ends up waiting for `8` minutes before destroy is called. Test Plan: Unit tests. Differential Revision: D62529875 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135760 Approved by: https://github.com/fduwjj, https://github.com/shuqiangzhang	2024-09-18 00:55:01 +00:00
Nikita Shulga	f6f1504d39	[MPS] Fix 5D+ reductions over negative dimensions (#136198 ) This fixes a bug introduced by https://github.com/pytorch/pytorch/pull/99856, which attempts to speed up reductions for 5D+ tensors if trailing dimensions are all ones, but introduces crashes/off-by-one errors for wrapped dimensions. Added a regression test case to `TestMPS.test_sum`. Fixes https://github.com/pytorch/pytorch/issues/136132 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136198 Approved by: https://github.com/albanD	2024-09-17 21:53:31 +00:00
Banit Agrawal	a575ce0dc6	[PyTorch Pinned Allocator] Add support of background thread to process events (#135524 ) Summary: Currently we process events in the regular allocation path, calling cudaEventQuery to check on the events, and this path can take some locks in the libcuda driver. It is not strictly necessary to process events in the allocation path; we could move this to a background thread that keeps processing events regularly and puts the freed blocks on the free list. Differential Revision: D62396585 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135524 Approved by: https://github.com/zyan0	2024-09-17 21:08:10 +00:00
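The idea in the commit above can be illustrated with a small, CPU-only sketch. This is not the PyTorch implementation: `PinnedBlockPool` and all of its methods are hypothetical names, `threading.Event` stands in for a CUDA event, and `event.is_set()` stands in for `cudaEventQuery`. It only shows the shape of the change, where the allocation path stops polling events and a background thread returns finished blocks to the free list.

```python
import threading
from collections import deque


class PinnedBlockPool:
    """Toy stand-in for a pinned-memory allocator; every name here is hypothetical."""

    def __init__(self, poll_interval_s=0.001):
        self._lock = threading.Lock()
        self._pending = deque()  # (event, block) pairs whose work may still be in flight
        self._free = []          # blocks that are safe to hand out again
        self._stop = threading.Event()
        self._poll_interval_s = poll_interval_s
        self._worker = threading.Thread(target=self._process_events, daemon=True)
        self._worker.start()

    def release(self, block, event):
        # The caller hands back a block together with the event guarding its last use.
        with self._lock:
            self._pending.append((event, block))

    def allocate(self):
        # The allocation path only pops from the free list; it never polls events.
        with self._lock:
            return self._free.pop() if self._free else None

    def _drain_once(self):
        # Move every block whose event has completed onto the free list.
        with self._lock:
            still_pending = deque()
            for event, block in self._pending:
                if event.is_set():
                    self._free.append(block)
                else:
                    still_pending.append((event, block))
            self._pending = still_pending

    def _process_events(self):
        # Background loop: poll events regularly until asked to stop.
        while not self._stop.is_set():
            self._drain_once()
            self._stop.wait(self._poll_interval_s)

    def shutdown(self):
        self._stop.set()
        self._worker.join()
        self._drain_once()  # final pass so completed blocks are not left pending
```

A block released with an unfinished event stays unavailable to `allocate()`. Once the event is set, the background thread (or the final drain in `shutdown()`) makes it reusable.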
Banit Agrawal	48d18fbd4c	[PyTorch CUDA Allocator] Allow reuse of non-split blocks with better rounding (#136174 ) Summary: This diff adds an option to round the non-split blocks in the caching allocator so that they can be reused without causing lots of fragmentation for large memory segments. For example, if we specify the max_split memory size as 400MB, then all allocations larger than 400MB will not be split. Let's say we allocated some 1024MB blocks and these are cached in the allocator. If we request a new 500MB block, we round it to the nearest power-2-division, that's 512MB, and add the default kLargeBuffer of 20MB, which gives 532MB. Since 532MB is less than the existing 1024MB block, the 1024MB block will not be used for this allocation; instead a new 512MB block will be created. In this diff, we make the kLargeBuffer used for rounding configurable: if 512MB + max_non_split_rounding_size is greater than 1024MB, we will use the 1024MB block and won't create a new 512MB block using cudaMalloc. This option is added so that we can pre-allocate some large blocks and reuse them as much as possible, so we don't stall on calling cudaMalloc. Differential Revision: D62758758 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136174 Approved by: https://github.com/zyan0	2024-09-17 19:08:44 +00:00
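The arithmetic in the commit above can be checked with a short sketch. It is an illustration under assumptions, not the allocator's code: `can_reuse_non_split_block` is a hypothetical helper that encodes only the rule described in the message (a cached non-split block is reused when it is large enough and smaller than the rounded request plus the allowed slack), and the request is taken as already rounded to 512MB.

```python
MB = 1024 * 1024
K_LARGE_BUFFER = 20 * MB  # default slack quoted in the commit message


def can_reuse_non_split_block(block_size, rounded_request, rounding_slack=K_LARGE_BUFFER):
    """Hypothetical helper: accept a cached block only if it fits the request
    and is smaller than the request plus the allowed slack."""
    return rounded_request <= block_size < rounded_request + rounding_slack


cached_block = 1024 * MB
rounded_request = 512 * MB  # the 500MB request after rounding, per the example

# Default slack: 512MB + 20MB = 532MB is below 1024MB, so the block is rejected.
default_ok = can_reuse_non_split_block(cached_block, rounded_request)

# Larger slack (an assumed value for max_non_split_rounding_size): 512MB + 1024MB
# exceeds 1024MB, so the cached block is reused instead of calling cudaMalloc.
relaxed_ok = can_reuse_non_split_block(cached_block, rounded_request, rounding_slack=1024 * MB)
```

With the default slack the cached 1024MB block is skipped; with the larger slack it is accepted, which is the reuse behavior the option is meant to enable.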
eqy	e3aa5e2f64	[NCCL] Don't override `waitUntilInitialized`'s setting of `comm->initialized_` (#136155 ) #133630 sets `initialized_` to `true`, which causes previous wait codepaths to skip necessary waits; see also https://github.com/pytorch/pytorch/issues/136151 CC @shuqiangzhang @wconstab Pull Request resolved: https://github.com/pytorch/pytorch/pull/136155 Approved by: https://github.com/fduwjj, https://github.com/kwen2501, https://github.com/c-p-i-o, https://github.com/shuqiangzhang	2024-09-17 18:50:12 +00:00
Huanyu He	a4e9a1c90b	[TorchRec][PT2 IR][APF] short circuit the flatten/unflatten between EBC and KTRegroupAsDict modules (#136045 ) Summary: # context * for the root cause and background please refer to this [post](https://fb.workplace.com/groups/1028545332188949/permalink/1042204770823005/) * basic idea of this diff is to short circuit the pytree flatten-unflatten function pairs between two preserved modules, i.e., EBC/fpEBC and KTRegroupAsDict. NOTE: There could be multiple EBCs and one single KTRegroupAsDict as shown in the [pic](https://fburl.com/gslide/lcyt8eh3) {F1864810545} * short-circuiting the EBC-KTRegroupAsDict pairs is very special and a must in most of the cases due to the EBC key-order issue with distributed table lookup. * hide all the operations behind a control flag `short_circuit_pytree_ebc_regroup` to the torchrec main api call `decapsulate_ir_modules`, which should only be visible to the infra layer, not to the users. # details * The `_short_circuit_pytree_ebc_regroup` function finds all the EBCs/fpEBC and KTRegroupAsDict modules in an unflattened module. Retrieve their fqns and sort to in_fqns (regroup_fqns) and out_fqns (ebc_fqns). Because the fpEBC is currently swapped as a whole, we do some extra fqn logic to filter out the EBC that belongs to an up-level fpEBC. * a util function `prune_pytree_flatten_unflatten` removes the in-coming and out-going pytree flatten/unflatten function calls in the graph module, based on the given fqns. WARNING: The flag `short_circuit_pytree_ebc_regroup` should be turned on if EBCs are used and EBC sharding is needed. Assertions are also added for the cases where a `KTRegroupAsDict` module can't be found or `finalize_interpreter_modules` is not `True`. # additional changes * absorb the `finalize_interpreter_modules` process inside the torchrec main api `decapsulate_ir_modules`. 
* set `graph.owning_module` in export.unflatten as required by the graph modification * add one more layer of `sparse_module` for closely mimicking the APF model structure. Test Plan: # run test * serializer ``` buck2 run fbcode//mode/opt fbcode//torchrec/ir/tests:test_serializer ``` * apf ``` buck2 run fbcode//mode/opt fbcode//aps_models/ads/gmp/tests/ne/e2e_deterministic_tests:gmp_e2e_ne_tests -- --filter-text 'test_mtml_instagram_model_562438350_single_gpu_with_ir' ``` * local mp run ``` ==== Finished E2E deterministic test for mtml_instagram_model_gmp_474023725_non_kjt_unary ==== finished test_mtml_instagram_model_562438350_single_gpu_with_ir Imports took: 6.0s! Profile with --import-profiler. --_ \|""---__ Executed 1 example in 203.1s: \|'.\| \|\| . """\| Successful: 1 \| \|\| \|\| /\|\""-. \| Failed: 0 \| \|\| \|\| \| \| \| Skipped: 0 \| \|\| \|\| \| \\|/ \| Not executed: 8 \|."\| \|\| --"" '__\| https://testslide.readthedocs.io/ --" \|__---""" ``` Differential Revision: D62606738 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136045 Approved by: https://github.com/angelayi	2024-09-17 18:42:56 +00:00
angelayi	ea10c072f3	[export] Deserialize args with python keyword names (#136036 ) Currently when we deserialize inputs to nodes, we deserialize arguments with default values as kwargs. So deserializing `aten.uniform`, which has the signature `uniform(Tensor(a!) self, float from=0, float to=1, *, Generator? generator=None) -> Tensor(a!)`, will become `uniform(x, from=0, to=1)`. However, this fails when running in python because `from` is a python keyword. So the solution here is to not deserialize it as a kwarg. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136036 Approved by: https://github.com/zhxchen17	2024-09-17 18:13:14 +00:00
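The problem described in the commit above can be reproduced with the standard library alone. The sketch below is one possible way to keep keyword-named arguments positional. `split_call_args` is a hypothetical helper, not the function used in `torch.export`, and it assumes the argument names arrive in schema order.

```python
import keyword


def split_call_args(names, values):
    """Hypothetical helper: pass arguments positionally up to and including the
    last one whose name is a Python keyword, and pass the rest as kwargs."""
    last_kw = max((i for i, name in enumerate(names) if keyword.iskeyword(name)), default=-1)
    positional = list(values[: last_kw + 1])
    kwargs = dict(zip(names[last_kw + 1:], values[last_kw + 1:]))
    return positional, kwargs


# `from` is a reserved word, so `uniform(x, from=0.0, to=1.0)` is a SyntaxError
# when written as source code; keeping it positional avoids the problem.
positional, kwargs = split_call_args(["from", "to"], [0.0, 1.0])
```

Here `positional` holds the value for `from` and `kwargs` holds `to`, so the call can be emitted as `uniform(x, 0.0, to=1.0)`.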
Joel Schlosser	a8382847f4	Support rms_norm() for NJT (#135872 ) `rms_norm()` is a nice-to-have for ViT :) This PR: * SymInt-ifies `rms_norm()`, allowing NJT to use the same decomp. * Adds torch_function-based input validation logic for nested-specific stuff (no normalization supported over the ragged dim for now) on the python NJT side. * Adds multi-dim support (on non-ragged, non-batch dims) to `mean()` for NJT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135872 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #125947	2024-09-17 18:09:20 +00:00
Nikita Shulga	785e98783b	Delete links to non-existing `run_plan_mpi.cc` (#136204 ) That were deleted by https://github.com/pytorch/pytorch/pull/125092 Fixes https://github.com/pytorch/pytorch/issues/136199 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136204 Approved by: https://github.com/albanD, https://github.com/seemethere	2024-09-17 17:51:56 +00:00
Trung Truong	cc365fdd7b	[MTIA] Support torch.cuda.get_device_capability equivalent API on MTIA (#135889 ) Summary: Mirror `get_device_capability` on MTIA per https://fburl.com/gdoc/p4lo5avn At the moment, both the major and minor version are just 0 Test Plan: Unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api` https://www.internalfb.com/intern/testinfra/testconsole/testrun/1688850109958190/ Differential Revision: D62595296 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135889 Approved by: https://github.com/egienvalue	2024-09-17 17:42:56 +00:00
Xintong Hu	8e5bb356e0	[PT2] Port merge_concats_pass to PT2 pre_grad passes (#135527 ) Summary: as title Test Plan: new UT Differential Revision: D62398390 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135527 Approved by: https://github.com/frank-wei	2024-09-17 17:26:53 +00:00
Nikhil Gupta	63dc5dff10	[Fix]: Update CPUINFO submodule to fix support for NON-SVE ARM Hardware (#135857 ) Regression PR : https://github.com/pytorch/cpuinfo/pull/255 Change-Id: I56cec061072be11ec33ccb661114360b979fc7aa Pull Request resolved: https://github.com/pytorch/pytorch/pull/135857 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-09-17 16:50:17 +00:00
Justin Chu	67b14ce8bd	[ONNX] Fix numpy method to return the correct type (#136162 ) Previous implementation of the `numpy()` method returns `fp64` when the tensor is `fp32`. This is unexpected but seems to be caused by calling `__array__(dtype=None)` on the numpy array. I updated the implementation to implement the `numpy()` method explicitly and added tests to guard the behavior. This needs to be cherry-picked into torch 2.5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136162 Approved by: https://github.com/gramalingam, https://github.com/xadupre	2024-09-17 15:51:00 +00:00
Mauricio Villegas	ece8267d2c	Add back optim type hints that were lost when .pyi files were removed (#136185 ) When stub files (`.pyi`) were removed from `optim` (#125556, #125452), some types that existed are no longer available. This pull request adds them back. Just for reference, these types are used in `pytorch-lightning`'s `LightningCLI`. Command line interfaces are created automatically, and having type hints makes them nicer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136185 Approved by: https://github.com/janeyx99	2024-09-17 15:45:15 +00:00
Edward Z. Yang	913f97e878	Don't run reshape pattern match on dynamic shape size tensor (#136100 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136100 Approved by: https://github.com/mengluy0125	2024-09-17 15:08:55 +00:00
PyTorch MergeBot	462b727d1e	Revert "Add decomposition for permute_copy (#130944 )" This reverts commit ab9a7eadd34aee59fc67e29237610b7562cc4ff0. Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/jeanschmidt due to Broke internal signal executorch.backends.xnnpack.test.ops.permute.TestPermute, more details on D62737086. @eellison could you please help get this PR merged to main? ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2355846394))	2024-09-17 13:42:55 +00:00
PyTorch MergeBot	2c4ae81494	Revert "Add decomposition for squeeze_copy (#130941 )" This reverts commit c33b0580e6a702be0cd5be691b3b465da012aa34. Reverted https://github.com/pytorch/pytorch/pull/130941 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/130944, after fixing any merge conflicts, feel free to merge it back ([comment](https://github.com/pytorch/pytorch/pull/130941#issuecomment-2355831480))	2024-09-17 13:39:07 +00:00
PyTorch MergeBot	3b5e2689a1	Revert "Optimize dict reconstruct to not codegen untouched values (#134876 )" This reverts commit a1a57a424dc992f4dc2d44bdc1e4e7e500881a9c. Reverted https://github.com/pytorch/pytorch/pull/134876 on behalf of https://github.com/jeanschmidt due to new introduced test test_reconstruct.py::ReconstructTest::test_functional_call_reconstruct is breaking internally. @zou3519 may you help get those changes merged back to main? ([comment](https://github.com/pytorch/pytorch/pull/134876#issuecomment-2355697685))	2024-09-17 13:00:01 +00:00
ankurneog	e248c1d7eb	Update real device in FSDP state_dict_utils (#134994 ) ## Motivation The default device for tensor.device both for sharded as well as non sharded is set to cuda by default. Hence while checking the FSDP UTs we see the following errors. This change updates the actual device type based on the created tensor. ``` [rank3] File "/root/repos/pytorch-training-tests/tests/pytorch/v2.4.0/distributed_hpu/fsdp/test_fsdp_dtensor_state_dict.py", line 143, in test_dtensor_sharded_tensor_state_dict_identical [rank3] sharded_tensor_sd = ref_model.state_dict() [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1944, in state_dict [rank3] hook_result = hook(self, destination, prefix, local_metadata) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context [rank3] return func(args, kwargs) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_state_dict_utils.py", line 752, in _post_state_dict_hook [rank3] tensor.device, [rank3] File "/usr/local/lib/python3.10/dist-packages/typing_extensions.py", line 2853, in wrapper [rank3] return arg(args, **kwargs) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1152, in __torch_function__ [rank3] return dispatch(st_instance, func) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1134, in dispatch [rank3] return _SHARDED_OPS[func](types, args, kwargs, st._process_group) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/op_registry_utils.py", line 33, in wrapper [rank3] return wrapped_func(types, args, kwargs, process_group) [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/_ops/tensor_ops.py", line 52, in tensor_device [rank3] dev = torch.device(torch.cuda.current_device()) [rank3] File 
"/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 878, in current_device [rank3] _lazy_init() [rank3] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 305, in _lazy_init [rank3] raise AssertionError("Torch not compiled with CUDA enabled") [rank3] AssertionError: Torch not compiled with CUDA enabled ```` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134994 Approved by: https://github.com/fegin	2024-09-17 04:39:08 +00:00
wz337	408fe41a45	[DSD][EZ] Minor update in _state_dict_utils.py (#136165 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136165 Approved by: https://github.com/kwen2501 ghstack dependencies: #135725, #135763	2024-09-17 04:32:43 +00:00
Brian Hirsh	dc82d274e6	make view.dtype always return an alias (#136074 ) Fixes https://github.com/pytorch/pytorch/issues/136064 In the linked repro, the issue was that there was some code like this: ``` # x has dtype torch.float32 def f(x): y = x.view(torch.float32) y.copy_(...) ``` Where because `view.dtype` is implemented today to potentially directly return its input, we would end up directly clobbering the proxy for our graph input (replacing its FX proxy value from `arg0_1` to `view_1`). This is not desirable, because we have careful assertions in AOTDispatcher that mutations only ever happen on graph inputs - but this clobbering caused the mutation to appear, from the perspective of the FX graph, like it was happening on a view of the input. Why is this normally not a problem? Ordinarily, the `ADInplaceOrView` kernel for `view.dtype` will take the output of the view kernel, [and detach() it](https://github.com/pytorch/pytorch/blob/main/tools/autograd/gen_inplace_or_view_type.py#L466) (properly creating a fresh `TensorImpl`). This does not happen, though, if you are executing the kernel from within a `__torch_dispatch__` region: the `ADInplaceOrView` logic has already run above you, so that key will be in the TLS exclude set. This PR changes eager behavior - at first I considered trying to only change behavior under compile. But this problem isn't technically specific to PT2: if you ever rely on tensor identity from inside of a `__torch_dispatch__` call, then we need to make sure the raw `view.dtype` kernel doesn't directly return the input. I am also making the assumption that "`view.dtype` no-op'ing when the dtype is the same" is not a case worth optimizing in eager mode, and that the overhead of the `TensorImpl` creation is relatively negligible. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136074 Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/albanD ghstack dependencies: #136041	2024-09-17 03:40:54 +00:00
Brian Hirsh	d463a81c27	inductor: dont use default_dtype during rng functionalization (#136041 ) Fixes https://github.com/pytorch/pytorch/issues/119162 See context at https://github.com/pytorch/pytorch/issues/119162#issuecomment-2349849469 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136041 Approved by: https://github.com/eellison	2024-09-17 03:40:54 +00:00
Zhijing Li (Accelerator Enablement)	3f74310784	Back out "Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#135581 )" (#136160 ) Test Plan: make train-hstu-cint-publish-bf16-tgif-local Differential Revision: D62766335 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136160 Approved by: https://github.com/muchulee8	2024-09-17 01:06:10 +00:00
PyTorch MergeBot	37a08b33bb	Revert "fix compiled_autograd deadlock throw (#135795 )" This reverts commit 00dc7d435652ad66e9d2feb2660928b632281a98. Reverted https://github.com/pytorch/pytorch/pull/135795 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/135795#issuecomment-2354233619))	2024-09-16 23:59:56 +00:00
Laith Sakka	071da87cd7	use csv extension for test report in order for it to be uploaded to s3 (#136128 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136128 Approved by: https://github.com/clee2000	2024-09-16 21:47:46 +00:00
Justin Chu	c12536b3c0	[ONNX] Treat CompositeImplicitAutograd ops as normal ops in decomp (#136153 ) Since https://github.com/pytorch/pytorch/pull/135080, the CompositeImplicitAutograd (CIA) ops are only decomposed when a decomp function is provided in a table. There is no longer a need to distinguish CIA ops like Upsample and preserve them explicitly. On the ONNX Script torchlib side I will unregister some ops from the following list to make sure some CIA ops are still decomposed. ``` <OpOverload(op='aten.__and__', overload='Scalar')>, <OpOverload(op='aten.__and__', overload='Tensor')>, <OpOverload(op='aten.__or__', overload='Scalar')>, <OpOverload(op='aten.__or__', overload='Tensor')>, <OpOverload(op='aten.__xor__', overload='Scalar')>, <OpOverload(op='aten.__xor__', overload='Tensor')>, <OpOverload(op='aten._add_batch_dim', overload='default')>, <OpOverload(op='aten._assert_tensor_metadata', overload='default')>, <OpOverload(op='aten._backward', overload='default')>, <OpOverload(op='aten._batch_norm_impl_index_backward', overload='default')>, <OpOverload(op='aten._cast_Byte', overload='default')>, <OpOverload(op='aten._cast_Char', overload='default')>, <OpOverload(op='aten._cast_Double', overload='default')>, <OpOverload(op='aten._cast_Float', overload='default')>, <OpOverload(op='aten._cast_Half', overload='default')>, <OpOverload(op='aten._cast_Int', overload='default')>, <OpOverload(op='aten._cast_Long', overload='default')>, <OpOverload(op='aten._cast_Short', overload='default')>, <OpOverload(op='aten._choose_qparams_per_tensor', overload='default')>, <OpOverload(op='aten._convolution', overload='deprecated')>, <OpOverload(op='aten._convolution_double_backward', overload='default')>, <OpOverload(op='aten._convolution_mode', overload='default')>, <OpOverload(op='aten._cufft_clear_plan_cache', overload='default')>, <OpOverload(op='aten._cufft_get_plan_cache_max_size', overload='default')>, <OpOverload(op='aten._cufft_get_plan_cache_size', overload='default')>, 
<OpOverload(op='aten._cufft_set_plan_cache_max_size', overload='default')>, <OpOverload(op='aten._debug_has_internal_overlap', overload='default')>, <OpOverload(op='aten._dim_arange', overload='default')>, <OpOverload(op='aten._embedding_bag_sparse_backward', overload='default')>, <OpOverload(op='aten._gather_sparse_backward', overload='default')>, <OpOverload(op='aten._grid_sampler_2d_cpu_fallback_backward', overload='default')>, <OpOverload(op='aten._has_compatible_shallow_copy_type', overload='default')>, <OpOverload(op='aten._is_zerotensor', overload='default')>, <OpOverload(op='aten._lu_with_info', overload='default')>, <OpOverload(op='aten._nnpack_available', overload='default')>, <OpOverload(op='aten._pack_padded_sequence_backward', overload='default')>, <OpOverload(op='aten._pad_circular', overload='default')>, <OpOverload(op='aten._pad_enum', overload='default')>, <OpOverload(op='aten._pad_packed_sequence', overload='default')>, <OpOverload(op='aten._propagate_xla_data', overload='default')>, <OpOverload(op='aten._remove_batch_dim', overload='default')>, <OpOverload(op='aten._reshape_from_tensor', overload='default')>, <OpOverload(op='aten._rowwise_prune', overload='default')>, <OpOverload(op='aten._saturate_weight_to_fp16', overload='default')>, <OpOverload(op='aten._scaled_dot_product_attention_math', overload='default')>, <OpOverload(op='aten._shape_as_tensor', overload='default')>, <OpOverload(op='aten._sobol_engine_draw', overload='default')>, <OpOverload(op='aten._sparse_bsc_tensor_unsafe', overload='default')>, <OpOverload(op='aten._sparse_bsr_tensor_unsafe', overload='default')>, <OpOverload(op='aten._sparse_compressed_tensor_unsafe', overload='default')>, <OpOverload(op='aten._sparse_coo_tensor_unsafe', overload='default')>, <OpOverload(op='aten._sparse_csc_tensor_unsafe', overload='default')>, <OpOverload(op='aten._sparse_csr_tensor_unsafe', overload='default')>, <OpOverload(op='aten._sparse_log_softmax', overload='Dimname')>, 
<OpOverload(op='aten._sparse_log_softmax', overload='int')>, <OpOverload(op='aten._sparse_mm', overload='default')>, <OpOverload(op='aten._sparse_mm', overload='reduce')>, <OpOverload(op='aten._sparse_softmax', overload='Dimname')>, <OpOverload(op='aten._sparse_softmax', overload='int')>, <OpOverload(op='aten._sparse_sum', overload='default')>, <OpOverload(op='aten._sparse_sum', overload='dim_dtype')>, <OpOverload(op='aten._sparse_sum', overload='dtype')>, <OpOverload(op='aten._test_ambiguous_defaults', overload='a')>, <OpOverload(op='aten._test_ambiguous_defaults', overload='b')>, <OpOverload(op='aten._test_autograd_multiple_dispatch', overload='ntonly')>, <OpOverload(op='aten._test_check_tensor', overload='default')>, <OpOverload(op='aten._test_serialization_subcmul', overload='default')>, <OpOverload(op='aten._test_string_default', overload='default')>, <OpOverload(op='aten._thnn_differentiable_gru_cell_backward', overload='default')>, <OpOverload(op='aten._thnn_differentiable_lstm_cell_backward', overload='default')>, <OpOverload(op='aten._thnn_fused_lstm_cell_backward', overload='default')>, <OpOverload(op='aten._to_cpu', overload='default')>, <OpOverload(op='aten._upsample_bicubic2d_aa', overload='vec')>, <OpOverload(op='aten._upsample_bilinear2d_aa', overload='vec')>, <OpOverload(op='aten._upsample_nearest_exact1d', overload='default')>, <OpOverload(op='aten._upsample_nearest_exact1d', overload='vec')>, <OpOverload(op='aten._upsample_nearest_exact2d', overload='default')>, <OpOverload(op='aten._upsample_nearest_exact2d', overload='vec')>, <OpOverload(op='aten._upsample_nearest_exact3d', overload='default')>, <OpOverload(op='aten._upsample_nearest_exact3d', overload='vec')>, <OpOverload(op='aten._use_cudnn_rnn_flatten_weight', overload='default')>, <OpOverload(op='aten._validate_sparse_bsc_tensor_args', overload='default')>, <OpOverload(op='aten._validate_sparse_bsr_tensor_args', overload='default')>, 
<OpOverload(op='aten._validate_sparse_compressed_tensor_args', overload='default')>, <OpOverload(op='aten._validate_sparse_coo_tensor_args', overload='default')>, <OpOverload(op='aten._validate_sparse_csc_tensor_args', overload='default')>, <OpOverload(op='aten._validate_sparse_csr_tensor_args', overload='default')>, <OpOverload(op='aten._version', overload='default')>, <OpOverload(op='aten._weight_norm', overload='default')>, <OpOverload(op='aten._weight_norm_differentiable_backward', overload='default')>, <OpOverload(op='aten.absolute', overload='default')>, <OpOverload(op='aten.adaptive_avg_pool1d', overload='default')>, <OpOverload(op='aten.adaptive_avg_pool2d', overload='default')>, <OpOverload(op='aten.adaptive_avg_pool3d', overload='default')>, <OpOverload(op='aten.adaptive_max_pool1d', overload='default')>, <OpOverload(op='aten.affine_grid_generator_backward', overload='default')>, <OpOverload(op='aten.align_as', overload='default')>, <OpOverload(op='aten.align_tensors', overload='default')>, <OpOverload(op='aten.all', overload='dimname')>, <OpOverload(op='aten.any', overload='dimname')>, <OpOverload(op='aten.arccos', overload='default')>, <OpOverload(op='aten.arccosh', overload='default')>, <OpOverload(op='aten.arcsin', overload='default')>, <OpOverload(op='aten.arcsinh', overload='default')>, <OpOverload(op='aten.arctan', overload='default')>, <OpOverload(op='aten.arctan2', overload='default')>, <OpOverload(op='aten.arctanh', overload='default')>, <OpOverload(op='aten.argsort', overload='default')>, <OpOverload(op='aten.argsort', overload='dimname')>, <OpOverload(op='aten.argsort', overload='stable')>, <OpOverload(op='aten.argwhere', overload='default')>, <OpOverload(op='aten.atleast_1d', overload='Sequence')>, <OpOverload(op='aten.atleast_2d', overload='Sequence')>, <OpOverload(op='aten.atleast_3d', overload='Sequence')>, <OpOverload(op='aten.avg_pool1d', overload='default')>, <OpOverload(op='aten.bilinear', overload='default')>, 
<OpOverload(op='aten.broadcast_tensors', overload='default')>, <OpOverload(op='aten.can_cast', overload='default')>, <OpOverload(op='aten.cat', overload='names')>, <OpOverload(op='aten.cdist', overload='default')>, <OpOverload(op='aten.chain_matmul', overload='default')>, <OpOverload(op='aten.chalf', overload='default')>, <OpOverload(op='aten.choose_qparams_optimized', overload='default')>, <OpOverload(op='aten.clip', overload='Tensor')>, <OpOverload(op='aten.clip', overload='default')>, <OpOverload(op='aten.column_stack', overload='default')>, <OpOverload(op='aten.combinations', overload='default')>, <OpOverload(op='aten.concat', overload='default')>, <OpOverload(op='aten.concat', overload='names')>, <OpOverload(op='aten.concatenate', overload='default')>, <OpOverload(op='aten.concatenate', overload='names')>, <OpOverload(op='aten.conv1d', overload='default')>, <OpOverload(op='aten.conv1d', overload='padding')>, <OpOverload(op='aten.conv2d', overload='default')>, <OpOverload(op='aten.conv2d', overload='padding')>, <OpOverload(op='aten.conv3d', overload='default')>, <OpOverload(op='aten.conv3d', overload='padding')>, <OpOverload(op='aten.conv_tbc_backward', overload='default')>, <OpOverload(op='aten.conv_transpose1d', overload='default')>, <OpOverload(op='aten.conv_transpose2d', overload='input')>, <OpOverload(op='aten.conv_transpose3d', overload='input')>, <OpOverload(op='aten.corrcoef', overload='default')>, <OpOverload(op='aten.cosine_embedding_loss', overload='default')>, <OpOverload(op='aten.cosine_similarity', overload='default')>, <OpOverload(op='aten.cov', overload='default')>, <OpOverload(op='aten.cross', overload='default')>, <OpOverload(op='aten.cross_entropy_loss', overload='default')>, <OpOverload(op='aten.ctc_loss', overload='IntList')>, <OpOverload(op='aten.ctc_loss', overload='Tensor')>, <OpOverload(op='aten.cudnn_is_acceptable', overload='default')>, <OpOverload(op='aten.cummax', overload='dimname')>, <OpOverload(op='aten.cummaxmin_backward', 
overload='default')>, <OpOverload(op='aten.cummin', overload='dimname')>, <OpOverload(op='aten.cumprod', overload='dimname')>, <OpOverload(op='aten.cumprod_backward', overload='default')>, <OpOverload(op='aten.cumsum', overload='dimname')>, <OpOverload(op='aten.cumulative_trapezoid', overload='dx')>, <OpOverload(op='aten.cumulative_trapezoid', overload='x')>, <OpOverload(op='aten.data', overload='default')>, <OpOverload(op='aten.det', overload='default')>, <OpOverload(op='aten.diag', overload='default')>, <OpOverload(op='aten.diagflat', overload='default')>, <OpOverload(op='aten.diff', overload='default')>, <OpOverload(op='aten.divide', overload='Scalar')>, <OpOverload(op='aten.divide', overload='Scalar_mode')>, <OpOverload(op='aten.divide', overload='Tensor')>, <OpOverload(op='aten.divide', overload='Tensor_mode')>, <OpOverload(op='aten.dstack', overload='default')>, <OpOverload(op='aten.einsum', overload='default')>, <OpOverload(op='aten.embedding_backward', overload='default')>, <OpOverload(op='aten.embedding_bag', overload='default')>, <OpOverload(op='aten.embedding_bag', overload='padding_idx')>, <OpOverload(op='aten.embedding_sparse_backward', overload='default')>, <OpOverload(op='aten.fake_quantize_per_channel_affine', overload='default')>, <OpOverload(op='aten.fake_quantize_per_channel_affine_cachemask_backward', overload='default')>, <OpOverload(op='aten.fake_quantize_per_tensor_affine', overload='default')>, <OpOverload(op='aten.fake_quantize_per_tensor_affine', overload='tensor_qparams')>, <OpOverload(op='aten.fake_quantize_per_tensor_affine_cachemask_backward', overload='default')>, <OpOverload(op='aten.fbgemm_linear_fp16_weight', overload='default')>, <OpOverload(op='aten.fbgemm_linear_fp16_weight_fp32_activation', overload='default')>, <OpOverload(op='aten.fbgemm_linear_int8_weight', overload='default')>, <OpOverload(op='aten.fbgemm_linear_int8_weight_fp32_activation', overload='default')>, <OpOverload(op='aten.fbgemm_linear_quantize_weight', 
overload='default')>, <OpOverload(op='aten.fbgemm_pack_gemm_matrix_fp16', overload='default')>, <OpOverload(op='aten.fbgemm_pack_quantized_matrix', overload='KN')>, <OpOverload(op='aten.fbgemm_pack_quantized_matrix', overload='default')>, <OpOverload(op='aten.fft_fft', overload='default')>, <OpOverload(op='aten.fft_fft2', overload='default')>, <OpOverload(op='aten.fft_fftn', overload='default')>, <OpOverload(op='aten.fft_fftshift', overload='default')>, <OpOverload(op='aten.fft_hfft', overload='default')>, <OpOverload(op='aten.fft_hfft2', overload='default')>, <OpOverload(op='aten.fft_hfftn', overload='default')>, <OpOverload(op='aten.fft_ifft', overload='default')>, <OpOverload(op='aten.fft_ifft2', overload='default')>, <OpOverload(op='aten.fft_ifftn', overload='default')>, <OpOverload(op='aten.fft_ifftshift', overload='default')>, <OpOverload(op='aten.fft_ihfft', overload='default')>, <OpOverload(op='aten.fft_ihfft2', overload='default')>, <OpOverload(op='aten.fft_ihfftn', overload='default')>, <OpOverload(op='aten.fft_irfft', overload='default')>, <OpOverload(op='aten.fft_irfft2', overload='default')>, <OpOverload(op='aten.fft_irfftn', overload='default')>, <OpOverload(op='aten.fft_rfft', overload='default')>, <OpOverload(op='aten.fft_rfft2', overload='default')>, <OpOverload(op='aten.fft_rfftn', overload='default')>, <OpOverload(op='aten.fix', overload='default')>, <OpOverload(op='aten.flatten_dense_tensors', overload='default')>, <OpOverload(op='aten.fliplr', overload='default')>, <OpOverload(op='aten.flipud', overload='default')>, <OpOverload(op='aten.float_power', overload='Scalar')>, <OpOverload(op='aten.float_power', overload='Tensor_Scalar')>, <OpOverload(op='aten.float_power', overload='Tensor_Tensor')>, <OpOverload(op='aten.frobenius_norm', overload='dim')>, <OpOverload(op='aten.gather', overload='dimname')>, <OpOverload(op='aten.gather_backward', overload='default')>, <OpOverload(op='aten.ger', overload='default')>, <OpOverload(op='aten.gradient', 
overload='array')>, <OpOverload(op='aten.gradient', overload='scalararray')>, <OpOverload(op='aten.gradient', overload='scalarint')>, <OpOverload(op='aten.gradient', overload='scalarrayarray')>, <OpOverload(op='aten.gradient', overload='scalarrayint')>, <OpOverload(op='aten.gradient', overload='tensorarray')>, <OpOverload(op='aten.gradient', overload='tensorarrayint')>, <OpOverload(op='aten.greater', overload='Scalar')>, <OpOverload(op='aten.greater', overload='Tensor')>, <OpOverload(op='aten.greater_equal', overload='Scalar')>, <OpOverload(op='aten.greater_equal', overload='Tensor')>, <OpOverload(op='aten.grid_sampler', overload='default')>, <OpOverload(op='aten.group_norm', overload='default')>, <OpOverload(op='aten.gru', overload='data')>, <OpOverload(op='aten.gru', overload='input')>, <OpOverload(op='aten.gru_cell', overload='default')>, <OpOverload(op='aten.hinge_embedding_loss', overload='default')>, <OpOverload(op='aten.histogramdd', overload='TensorList_bins')>, <OpOverload(op='aten.histogramdd', overload='default')>, <OpOverload(op='aten.histogramdd', overload='int_bins')>, <OpOverload(op='aten.hstack', overload='default')>, <OpOverload(op='aten.index_add', overload='dimname')>, <OpOverload(op='aten.index_copy', overload='dimname')>, <OpOverload(op='aten.index_fill', overload='Dimname_Scalar')>, <OpOverload(op='aten.index_fill', overload='Dimname_Tensor')>, <OpOverload(op='aten.index_select', overload='dimname')>, <OpOverload(op='aten.index_select_backward', overload='default')>, <OpOverload(op='aten.infinitely_differentiable_gelu_backward', overload='default')>, <OpOverload(op='aten.inner', overload='default')>, <OpOverload(op='aten.instance_norm', overload='default')>, <OpOverload(op='aten.inverse', overload='default')>, <OpOverload(op='aten.is_complex', overload='default')>, <OpOverload(op='aten.is_conj', overload='default')>, <OpOverload(op='aten.is_distributed', overload='default')>, <OpOverload(op='aten.is_floating_point', overload='default')>, 
<OpOverload(op='aten.is_inference', overload='default')>, <OpOverload(op='aten.is_leaf', overload='default')>, <OpOverload(op='aten.is_neg', overload='default')>, <OpOverload(op='aten.is_nonzero', overload='default')>, <OpOverload(op='aten.is_signed', overload='default')>, <OpOverload(op='aten.is_vulkan_available', overload='default')>, <OpOverload(op='aten.isclose', overload='default')>, <OpOverload(op='aten.isfinite', overload='default')>, <OpOverload(op='aten.isreal', overload='default')>, <OpOverload(op='aten.istft', overload='default')>, <OpOverload(op='aten.item', overload='default')>, <OpOverload(op='aten.kl_div', overload='default')>, <OpOverload(op='aten.kron', overload='default')>, <OpOverload(op='aten.kthvalue', overload='dimname')>, <OpOverload(op='aten.l1_loss', overload='default')>, <OpOverload(op='aten.layer_norm', overload='default')>, <OpOverload(op='aten.ldexp', overload='Tensor')>, <OpOverload(op='aten.less', overload='Scalar')>, <OpOverload(op='aten.less', overload='Tensor')>, <OpOverload(op='aten.less_equal', overload='Scalar')>, <OpOverload(op='aten.less_equal', overload='Tensor')>, <OpOverload(op='aten.linalg_cholesky', overload='default')>, <OpOverload(op='aten.linalg_cond', overload='default')>, <OpOverload(op='aten.linalg_cond', overload='p_str')>, <OpOverload(op='aten.linalg_det', overload='default')>, <OpOverload(op='aten.linalg_eigh', overload='default')>, <OpOverload(op='aten.linalg_eigvals', overload='default')>, <OpOverload(op='aten.linalg_eigvalsh', overload='default')>, <OpOverload(op='aten.linalg_inv', overload='default')>, <OpOverload(op='aten.linalg_ldl_factor', overload='default')>, <OpOverload(op='aten.linalg_lu_factor', overload='default')>, <OpOverload(op='aten.linalg_matmul', overload='default')>, <OpOverload(op='aten.linalg_matrix_norm', overload='default')>, <OpOverload(op='aten.linalg_matrix_norm', overload='str_ord')>, <OpOverload(op='aten.linalg_matrix_power', overload='default')>, 
<OpOverload(op='aten.linalg_matrix_rank', overload='atol_rtol_float')>, <OpOverload(op='aten.linalg_matrix_rank', overload='atol_rtol_tensor')>, <OpOverload(op='aten.linalg_matrix_rank', overload='default')>, <OpOverload(op='aten.linalg_matrix_rank', overload='tol_tensor')>, <OpOverload(op='aten.linalg_multi_dot', overload='default')>, <OpOverload(op='aten.linalg_norm', overload='default')>, <OpOverload(op='aten.linalg_norm', overload='ord_str')>, <OpOverload(op='aten.linalg_pinv', overload='atol_rtol_float')>, <OpOverload(op='aten.linalg_pinv', overload='default')>, <OpOverload(op='aten.linalg_pinv', overload='rcond_tensor')>, <OpOverload(op='aten.linalg_slogdet', overload='default')>, <OpOverload(op='aten.linalg_solve', overload='default')>, <OpOverload(op='aten.linalg_solve_ex', overload='default')>, <OpOverload(op='aten.linalg_svd', overload='default')>, <OpOverload(op='aten.linalg_svdvals', overload='default')>, <OpOverload(op='aten.linalg_tensorinv', overload='default')>, <OpOverload(op='aten.linalg_tensorsolve', overload='default')>, <OpOverload(op='aten.linalg_vander', overload='default')>, <OpOverload(op='aten.linalg_vecdot', overload='default')>, <OpOverload(op='aten.linear', overload='default')>, <OpOverload(op='aten.log_sigmoid', overload='default')>, <OpOverload(op='aten.log_softmax', overload='Dimname')>, <OpOverload(op='aten.log_softmax', overload='int')>, <OpOverload(op='aten.logcumsumexp', overload='dimname')>, <OpOverload(op='aten.logdet', overload='default')>, <OpOverload(op='aten.logsumexp', overload='names')>, <OpOverload(op='aten.lstm', overload='data')>, <OpOverload(op='aten.lstm', overload='input')>, <OpOverload(op='aten.lstm_cell', overload='default')>, <OpOverload(op='aten.lu_solve', overload='default')>, <OpOverload(op='aten.margin_ranking_loss', overload='default')>, <OpOverload(op='aten.masked_select_backward', overload='default')>, <OpOverload(op='aten.matmul', overload='default')>, <OpOverload(op='aten.matrix_exp', 
overload='default')>, <OpOverload(op='aten.matrix_exp_backward', overload='default')>, <OpOverload(op='aten.matrix_power', overload='default')>, <OpOverload(op='aten.max', overload='names_dim')>, <OpOverload(op='aten.max', overload='other')>, <OpOverload(op='aten.max_pool1d', overload='default')>, <OpOverload(op='aten.max_pool1d_with_indices', overload='default')>, <OpOverload(op='aten.max_pool2d', overload='default')>, <OpOverload(op='aten.max_pool3d', overload='default')>, <OpOverload(op='aten.mean', overload='names_dim')>, <OpOverload(op='aten.median', overload='names_dim')>, <OpOverload(op='aten.meshgrid', overload='default')>, <OpOverload(op='aten.meshgrid', overload='indexing')>, <OpOverload(op='aten.min', overload='names_dim')>, <OpOverload(op='aten.min', overload='other')>, <OpOverload(op='aten.mish_backward', overload='default')>, <OpOverload(op='aten.mode', overload='dimname')>, <OpOverload(op='aten.msort', overload='default')>, <OpOverload(op='aten.multilabel_margin_loss', overload='default')>, <OpOverload(op='aten.multiply', overload='Scalar')>, <OpOverload(op='aten.multiply', overload='Tensor')>, <OpOverload(op='aten.nanmean', overload='default')>, <OpOverload(op='aten.nanmedian', overload='names_dim')>, <OpOverload(op='aten.nanquantile', overload='default')>, <OpOverload(op='aten.nanquantile', overload='scalar')>, <OpOverload(op='aten.native_channel_shuffle', overload='default')>, <OpOverload(op='aten.negative', overload='default')>, <OpOverload(op='aten.nested_to_padded_tensor', overload='default')>, <OpOverload(op='aten.nll_loss', overload='default')>, <OpOverload(op='aten.nll_loss2d', overload='default')>, <OpOverload(op='aten.nll_loss_nd', overload='default')>, <OpOverload(op='aten.nonzero_numpy', overload='default')>, <OpOverload(op='aten.norm', overload='names_ScalarOpt_dim')>, <OpOverload(op='aten.norm', overload='names_ScalarOpt_dim_dtype')>, <OpOverload(op='aten.norm_except_dim', overload='default')>, <OpOverload(op='aten.not_equal', 
overload='Scalar')>, <OpOverload(op='aten.not_equal', overload='Tensor')>, <OpOverload(op='aten.nuclear_norm', overload='default')>, <OpOverload(op='aten.nuclear_norm', overload='dim')>, <OpOverload(op='aten.one_hot', overload='default')>, <OpOverload(op='aten.orgqr', overload='default')>, <OpOverload(op='aten.outer', overload='default')>, <OpOverload(op='aten.output_nr', overload='default')>, <OpOverload(op='aten.pad', overload='default')>, <OpOverload(op='aten.pad_sequence', overload='default')>, <OpOverload(op='aten.pairwise_distance', overload='default')>, <OpOverload(op='aten.pdist', overload='default')>, <OpOverload(op='aten.pinverse', overload='default')>, <OpOverload(op='aten.poisson_nll_loss', overload='default')>, <OpOverload(op='aten.prelu', overload='default')>, <OpOverload(op='aten.prod', overload='dim_Dimname')>, <OpOverload(op='aten.promote_types', overload='default')>, <OpOverload(op='aten.qr', overload='default')>, <OpOverload(op='aten.quantile', overload='default')>, <OpOverload(op='aten.quantile', overload='scalar')>, <OpOverload(op='aten.quantized_gru_cell', overload='default')>, <OpOverload(op='aten.quantized_lstm_cell', overload='default')>, <OpOverload(op='aten.quantized_rnn_relu_cell', overload='default')>, <OpOverload(op='aten.quantized_rnn_tanh_cell', overload='default')>, <OpOverload(op='aten.relu6', overload='default')>, <OpOverload(op='aten.repeat_interleave', overload='self_Tensor')>, <OpOverload(op='aten.repeat_interleave', overload='self_int')>, <OpOverload(op='aten.result_type', overload='Scalar')>, <OpOverload(op='aten.result_type', overload='Scalar_Scalar')>, <OpOverload(op='aten.result_type', overload='Scalar_Tensor')>, <OpOverload(op='aten.result_type', overload='Tensor')>, <OpOverload(op='aten.retains_grad', overload='default')>, <OpOverload(op='aten.rms_norm', overload='default')>, <OpOverload(op='aten.rnn_relu', overload='data')>, <OpOverload(op='aten.rnn_relu', overload='input')>, <OpOverload(op='aten.rnn_relu_cell', 
overload='default')>, <OpOverload(op='aten.rnn_tanh', overload='data')>, <OpOverload(op='aten.rnn_tanh', overload='input')>, <OpOverload(op='aten.rnn_tanh_cell', overload='default')>, <OpOverload(op='aten.row_stack', overload='default')>, <OpOverload(op='aten.rrelu', overload='default')>, <OpOverload(op='aten.scaled_dot_product_attention', overload='default')>, <OpOverload(op='aten.scatter', overload='dimname_src')>, <OpOverload(op='aten.scatter', overload='dimname_value')>, <OpOverload(op='aten.scatter_add', overload='dimname')>, <OpOverload(op='aten.selu', overload='default')>, <OpOverload(op='aten.silu_backward', overload='default')>, <OpOverload(op='aten.size', overload='Dimname')>, <OpOverload(op='aten.size', overload='int')>, <OpOverload(op='aten.slogdet', overload='default')>, <OpOverload(op='aten.slow_conv3d', overload='default')>, <OpOverload(op='aten.smm', overload='default')>, <OpOverload(op='aten.softmax', overload='Dimname')>, <OpOverload(op='aten.softmax', overload='int')>, <OpOverload(op='aten.sort', overload='dimname')>, <OpOverload(op='aten.sort', overload='dimname_stable')>, <OpOverload(op='aten.sparse_bsc_tensor', overload='ccol_row_value')>, <OpOverload(op='aten.sparse_bsc_tensor', overload='ccol_row_value_size')>, <OpOverload(op='aten.sparse_bsr_tensor', overload='crow_col_value')>, <OpOverload(op='aten.sparse_bsr_tensor', overload='crow_col_value_size')>, <OpOverload(op='aten.sparse_coo_tensor', overload='indices')>, <OpOverload(op='aten.sparse_coo_tensor', overload='indices_size')>, <OpOverload(op='aten.sparse_csc_tensor', overload='ccol_row_value')>, <OpOverload(op='aten.sparse_csc_tensor', overload='ccol_row_value_size')>, <OpOverload(op='aten.sparse_csr_tensor', overload='crow_col_value')>, <OpOverload(op='aten.sparse_csr_tensor', overload='crow_col_value_size')>, <OpOverload(op='aten.special_digamma', overload='default')>, <OpOverload(op='aten.special_erf', overload='default')>, <OpOverload(op='aten.special_erfc', overload='default')>, 
<OpOverload(op='aten.special_erfinv', overload='default')>, <OpOverload(op='aten.special_exp2', overload='default')>, <OpOverload(op='aten.special_expit', overload='default')>, <OpOverload(op='aten.special_expm1', overload='default')>, <OpOverload(op='aten.special_gammainc', overload='default')>, <OpOverload(op='aten.special_gammaincc', overload='default')>, <OpOverload(op='aten.special_gammaln', overload='default')>, <OpOverload(op='aten.special_i0', overload='default')>, <OpOverload(op='aten.special_log1p', overload='default')>, <OpOverload(op='aten.special_log_softmax', overload='default')>, <OpOverload(op='aten.special_logit', overload='default')>, <OpOverload(op='aten.special_logsumexp', overload='default')>, <OpOverload(op='aten.special_multigammaln', overload='default')>, <OpOverload(op='aten.special_ndtr', overload='default')>, <OpOverload(op='aten.special_polygamma', overload='default')>, <OpOverload(op='aten.special_psi', overload='default')>, <OpOverload(op='aten.special_round', overload='default')>, <OpOverload(op='aten.special_sinc', overload='default')>, <OpOverload(op='aten.special_softmax', overload='default')>, <OpOverload(op='aten.special_xlogy', overload='default')>, <OpOverload(op='aten.special_xlogy', overload='other_scalar')>, <OpOverload(op='aten.special_xlogy', overload='self_scalar')>, <OpOverload(op='aten.square', overload='default')>, <OpOverload(op='aten.sspaddmm', overload='default')>, <OpOverload(op='aten.std', overload='correction_names')>, <OpOverload(op='aten.std', overload='default')>, <OpOverload(op='aten.std', overload='dim')>, <OpOverload(op='aten.std', overload='names_dim')>, <OpOverload(op='aten.std_mean', overload='correction_names')>, <OpOverload(op='aten.std_mean', overload='default')>, <OpOverload(op='aten.std_mean', overload='dim')>, <OpOverload(op='aten.std_mean', overload='names_dim')>, <OpOverload(op='aten.stft', overload='center')>, <OpOverload(op='aten.stft', overload='default')>, <OpOverload(op='aten.stride', 
overload='Dimname')>, <OpOverload(op='aten.stride', overload='int')>, <OpOverload(op='aten.subtract', overload='Scalar')>, <OpOverload(op='aten.subtract', overload='Tensor')>, <OpOverload(op='aten.sum', overload='dim_DimnameList')>, <OpOverload(op='aten.sum_to_size', overload='default')>, <OpOverload(op='aten.svd', overload='default')>, <OpOverload(op='aten.sym_size', overload='int')>, <OpOverload(op='aten.sym_stride', overload='int')>, <OpOverload(op='aten.take_along_dim', overload='default')>, <OpOverload(op='aten.tensordot', overload='default')>, <OpOverload(op='aten.thnn_conv2d', overload='default')>, <OpOverload(op='aten.tile', overload='default')>, <OpOverload(op='aten.to_dense', overload='default')>, <OpOverload(op='aten.to_dense_backward', overload='default')>, <OpOverload(op='aten.to_mkldnn_backward', overload='default')>, <OpOverload(op='aten.to_sparse', overload='default')>, <OpOverload(op='aten.to_sparse', overload='sparse_dim')>, <OpOverload(op='aten.to_sparse_bsc', overload='default')>, <OpOverload(op='aten.to_sparse_bsr', overload='default')>, <OpOverload(op='aten.to_sparse_csc', overload='default')>, <OpOverload(op='aten.to_sparse_csr', overload='default')>, <OpOverload(op='aten.trace_backward', overload='default')>, <OpOverload(op='aten.trapezoid', overload='dx')>, <OpOverload(op='aten.trapezoid', overload='x')>, <OpOverload(op='aten.trapz', overload='dx')>, <OpOverload(op='aten.trapz', overload='x')>, <OpOverload(op='aten.triplet_margin_loss', overload='default')>, <OpOverload(op='aten.true_divide', overload='Scalar')>, <OpOverload(op='aten.true_divide', overload='Tensor')>, <OpOverload(op='aten.type_as', overload='default')>, <OpOverload(op='aten.unflatten_dense_tensors', overload='default')>, <OpOverload(op='aten.upsample_bicubic2d', overload='vec')>, <OpOverload(op='aten.upsample_bilinear2d', overload='vec')>, <OpOverload(op='aten.upsample_linear1d', overload='vec')>, <OpOverload(op='aten.upsample_nearest1d', overload='default')>, 
<OpOverload(op='aten.upsample_nearest1d', overload='vec')>, <OpOverload(op='aten.upsample_nearest2d', overload='default')>, <OpOverload(op='aten.upsample_nearest2d', overload='vec')>, <OpOverload(op='aten.upsample_nearest3d', overload='default')>, <OpOverload(op='aten.upsample_nearest3d', overload='vec')>, <OpOverload(op='aten.upsample_trilinear3d', overload='vec')>, <OpOverload(op='aten.value_selecting_reduction_backward', overload='default')>, <OpOverload(op='aten.vander', overload='default')>, <OpOverload(op='aten.var', overload='correction_names')>, <OpOverload(op='aten.var', overload='default')>, <OpOverload(op='aten.var', overload='dim')>, <OpOverload(op='aten.var', overload='names_dim')>, <OpOverload(op='aten.var_mean', overload='correction_names')>, <OpOverload(op='aten.var_mean', overload='default')>, <OpOverload(op='aten.var_mean', overload='dim')>, <OpOverload(op='aten.var_mean', overload='names_dim')>, <OpOverload(op='aten.vstack', overload='default')>, <OpOverload(op='aten.where', overload='Scalar')>, <OpOverload(op='aten.where', overload='ScalarOther')>, <OpOverload(op='aten.where', overload='ScalarSelf')>, <OpOverload(op='aten.where', overload='default')>, <OpOverload(op='aten.wrapped_linear_prepack', overload='default')>, <OpOverload(op='aten.wrapped_quantized_linear_prepacked', overload='default')> ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136153 Approved by: https://github.com/xadupre, https://github.com/gramalingam	2024-09-16 21:28:54 +00:00
Pearu Peterson	b76d1b79e6	Add scaling arguments to bsr_dense_addmm (#136104 ) As in the title. Tackles https://github.com/pytorch/ao/pull/821/files#r1759821413 The PR assumes that the existing tuning parameters are also good when using scaling arguments. This needs to be verified as a follow-up task. Also, this PR redefines triton-contiguous tensors: the tensor must have strides not larger than 1. This will now allow zero strides that previously triggered a `contiguous` call although the underlying memory buffer was contiguous. Re: "a considerable slow-down occurs because tensor data is copied element-wise rather than chunk-wise" - this note should refer to the code (torch or triton?) that implements the element/chunk-wise copy so that we could verify that allowing zero strides indeed would not trigger element-wise copies. Atm, the performance increase in ViT-H benchmarks (that involve using 0 strides) is evidence that allowing zero strides does not lead to slow-downs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136104 Approved by: https://github.com/cpuhrsch	2024-09-16 20:26:54 +00:00
PyTorch MergeBot	bfbcdf4967	Revert "[dynamo] Fix support for classmethod(property(...)) (#134968 )" This reverts commit c64ae601ba9eb3ad2cd3402a14f6ac83c0ab7eba. Reverted https://github.com/pytorch/pytorch/pull/134968 on behalf of https://github.com/jeanschmidt due to Breaking internal signals, we need to skip the new tests on py3.10 ([comment](https://github.com/pytorch/pytorch/pull/134968#issuecomment-2353909010))	2024-09-16 20:26:35 +00:00
Dan Johnson	3c97b0ab00	Use ncclAlltoAllv and ncclAlltoAll API when supported (#134499 ) NCCL does not have an API for ncclAllToAll and ncclAllToAllv, so PyTorch does point-to-point send/recv. Expose this API if it is supported. Differential Revision: [D61683836](https://our.internmc.facebook.com/intern/diff/D61683836/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134499 Approved by: https://github.com/shuqiangzhang, https://github.com/eqy	2024-09-16 20:08:06 +00:00
Kiuk Chung	abd16a8c64	[torch/multiprocessing] Use multiprocessing.reduction.register ForkingPickler.register to register custom tensor and storage reductions (#135030 ) Right now `multiprocessing.reduction.register()` is simply an alias to `multiprocessing.reduction.ForkingPickler.register()` https://github.com/python/cpython/blame/main/Lib/multiprocessing/reduction.py#L56, but the top-level `register()` function exposes less of the internal details of `multiprocessing.reduction` module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135030 Approved by: https://github.com/albanD	2024-09-16 20:07:29 +00:00
fduwjj	a0c7029a75	[c10d][Reland] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931 ) (#135653 ) We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup that removes the option of a ProcessGroup and asks users to either set the timeout or backend later on, or directly create the backend after creating a PG. Also, PGNCCL is using the option class from ProcessGroup, but we actually should use Option from the backend class. So this PR aligns the type and name with what we are doing on the cpp side. I don't change the signature for the public API, so they still use args named "pg_options". We need to change the tests to align them with this change. This is an attempt to reland D62008954 by fixing internal errors. Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135653 Approved by: https://github.com/wz337, https://github.com/H-Huang	2024-09-16 19:56:42 +00:00
James Wu	7537f74277	Refactor FxGraphCache.load into separate functions, so that AOTAutogradCache may access it correctly later (#135491 ) Summary: We refactor FxGraphCache.load into three phases: - prepare_key, which checks that an inductor input is cacheable and bypasses otherwise - load_with_key, which tries to lookup the key in the cache - post compile, where we do some logging and run post compile steps Splitting it along these lines will allow AOTAutogradCache to use load_with_key and still get access to all of the observability + remote cache logic when accessing FxGraphCache, without needing to pass key components, etc. Differential Revision: D62314862 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135491 Approved by: https://github.com/oulgen	2024-09-16 19:48:08 +00:00
Aaron Gokaslan	31715be72a	[BE]: Update mypy to 1.11.2 (#133816 ) Updates mypy to 1.11.2 to improve type inference Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816 Approved by: https://github.com/ezyang	2024-09-16 19:44:11 +00:00
Nikita Shulga	38caf10411	[EZ] Fix spelling typo (#136157 ) s/toosl/tools/ (spotted by @louie-tsai) Also, capitalize CUDA Pull Request resolved: https://github.com/pytorch/pytorch/pull/136157 Approved by: https://github.com/kit1980	2024-09-16 19:30:30 +00:00
Ke Wen	c977bb7d03	[Distributed] fix FileSystemWriter __init__ (#136135 ) Fixes #135608. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136135 Approved by: https://github.com/Skylion007	2024-09-16 19:11:08 +00:00
eugenekoran	717fca2cac	Drop outdated section 'Running clang-tidy' in CONTRIBUTING.md (#136146 ) Fixes #125920 [Running clang-tidy](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#running-clang-tidy) section is misleading and outdated. C++ lint is done with lintrunner and covered in [local-linting](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#local-linting) section. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136146 Approved by: https://github.com/janeyx99	2024-09-16 19:02:21 +00:00
Alexander Kurakin	f89ce4dfbb	`torch.nn.MultiheadAttention`: docs: improvement (#136111 ) `torch.nn.MultiheadAttention`: docs: improvement Pull Request resolved: https://github.com/pytorch/pytorch/pull/136111 Approved by: https://github.com/janeyx99	2024-09-16 18:52:20 +00:00
Nikita Shulga	d3647d15e6	Remove accidentally committed code (#136154 ) Accidentally left out during rebase Pull Request resolved: https://github.com/pytorch/pytorch/pull/136154 Approved by: https://github.com/kit1980, https://github.com/albanD	2024-09-16 18:34:20 +00:00
PyTorch MergeBot	d0cebedb31	Revert "Add Triton CPU as an Inductor backend (#133408 )" This reverts commit e498b02b472e45cfd6b7a08db0d6c1babec655c5. Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))	2024-09-16 18:33:33 +00:00
PyTorch MergeBot	7fe004f7cf	Revert "Add CI for Triton CPU backend (#135342 )" This reverts commit 426580a67db15ec17b2b861a09667bf59927e033. Reverted https://github.com/pytorch/pytorch/pull/135342 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))	2024-09-16 18:33:33 +00:00
Aaron Gokaslan	23c0d2689e	[BE][Ez]: Fix missing float16 coverage for adaptive_pool3d_cpu (#136091 ) Testing if op info coverage has issues Pull Request resolved: https://github.com/pytorch/pytorch/pull/136091 Approved by: https://github.com/ezyang	2024-09-16 18:22:16 +00:00
Suresh Babu Kolla	5193f23469	[Pytorch] Cleanup Strobelight URL and shorten for readability (#136102 ) Summary: - Converted strobelight URL prefix to more readable and editable json - Dump shortened URLs when possible for easier readability Test Plan: ``` python ./torch/_strobelight/examples/compile_time_profile_example.py python torch/_strobelight/examples/cli_function_profiler_example.py ``` Differential Revision: D62690292 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136102 Approved by: https://github.com/laithsakka	2024-09-16 18:10:33 +00:00
PyTorch MergeBot	0199fd4d7e	Revert "[inductor] More fixes on the keys of `constants` and `signature` dictionaries (#135406 )" This reverts commit e54b559e8860e343692bb5534777b2384a57a613. Reverted https://github.com/pytorch/pytorch/pull/135406 on behalf of https://github.com/jeanschmidt due to Reverting as it is breaking triton_mtia internal signals @jansel could you have a look and help get those changes merged? ([comment](https://github.com/pytorch/pytorch/pull/135406#issuecomment-2353557481))	2024-09-16 17:58:02 +00:00
Aaron Gokaslan	b491e2974c	[BE][Ez]: Add full half/bfloat16 dtype for `unique` and `isin` (#136114 ) Fixes #136090 * Add support for isin to tensor half dtypes for CPU (just add a few extra dispatches). * Seems like the CUDA implementation for bfloat16 was mostly compiled and available all along (it just calls sort internally AND unique). To enable it, we just need to remove an assert to access it (since sort's functionality was updated since the assert was added) and add missing dtype support to unique. * This unlocks more GPU functionality with minimal code bloat. I also added CPU kernels for the dtypes for parity. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136114 Approved by: https://github.com/malfet	2024-09-16 17:49:12 +00:00
Justin Chu	0aa41eb52f	[ONNX] Run type promotion test in CI and update the table (#135915 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135915 Approved by: https://github.com/gramalingam, https://github.com/xadupre	2024-09-16 16:46:13 +00:00
IvanKobzarev	090046b936	[effects] Turn off dtype promotion for with_effects lowering (#136039 ) By default inductor promotes arguments to the common highest dtype. Having empty token with dtype=torch.float32 results in dtype promotion for effectful ops during lowering of with_effects. Disabling dtype promotion for this lowering. Removing previous workaround making token dtype torch.bool. Testing: ``` python test/distributed/test_c10d_functional_native.py -k test_inductor_dtypeview_memory_lea ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136039 Approved by: https://github.com/bdhirsh, https://github.com/eellison, https://github.com/zou3519	2024-09-16 16:14:05 +00:00
Tom Ritchford	c33b0580e6	Add decomposition for squeeze_copy (#130941 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941 Approved by: https://github.com/amjames, https://github.com/eellison	2024-09-16 15:46:57 +00:00
Jon Janzen	13bd1256f9	Delete stable prototype (#135911 ) This project ended up going in an entirely different direction, so we can close out all this Pull Request resolved: https://github.com/pytorch/pytorch/pull/135911 Approved by: https://github.com/izaitsevfb, https://github.com/malfet	2024-09-16 15:32:17 +00:00
Bin Bao	d833f49602	[reland][Inductor] Rename `cpp_wrapper_cuda.py` as `cpp_wrapper_gpu.py` (#136046 ) Summary: Reland https://github.com/pytorch/pytorch/pull/135313 after fixing internal build issues Test Plan: CI Differential Revision: D62658837 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136046 Approved by: https://github.com/chenyang78, https://github.com/etaf, https://github.com/jansel	2024-09-16 14:35:19 +00:00
Bin Bao	a803cb0531	[AOTI] Refactor how cpp_wrapper specific options are set (#136035 ) Summary: 1) When cpp-wrapper is turned on, certain triton specific options need to be set, both for forward and backward. This PR consolidates the settings in one place. 2) Change config.triton.autotune_at_compile_time to default to None. If the flag is not explicitly set by the user, default it to True for cpp-wrapper. Differential Revision: [D62689940](https://our.internmc.facebook.com/intern/diff/D62689940) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136035 Approved by: https://github.com/chenyang78	2024-09-16 14:32:13 +00:00
atalman	bbc3fdbbde	Add python 3.13.0t build to Docker images (#136001 ) Adds 3.13t python to Docker images Pull Request resolved: https://github.com/pytorch/pytorch/pull/136001 Approved by: https://github.com/albanD	2024-09-16 12:49:36 +00:00
PyTorch MergeBot	3117f2cf67	Revert "[BE]: Update mypy to 1.11.2 (#133816 )" This reverts commit 55299cfc223fa838aadd8d6d6fa3ed541fa5acd1. Reverted https://github.com/pytorch/pytorch/pull/133816 on behalf of https://github.com/jeanschmidt due to seems to have broken https://github.com/pytorch/pytorch/actions/runs/10865710499/job/30155699792 on main ([comment](https://github.com/pytorch/pytorch/pull/133816#issuecomment-2352377684))	2024-09-16 09:11:16 +00:00
Xuehai Pan	951c21d679	[dynamo] simplify implementation for `builtins.sum` (#133779 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779 Approved by: https://github.com/jansel, https://github.com/anijain2305 ghstack dependencies: #133778	2024-09-16 04:53:06 +00:00
Xuehai Pan	9961aaa601	[dynamo] simplify implementation for `functools.reduce` (#133778 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778 Approved by: https://github.com/jansel, https://github.com/anijain2305	2024-09-16 04:53:06 +00:00
Ke Wen	d2207c57f7	[Distributed] add pack-check method for float8_e5m2 (#136115 ) Add support for Float8_e5m2, following similar algorithm used for Float8_e4m3fn (i.e. overflow check). Made `HasNanFP8x8` a template so that it is extendable based on dtype. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136115 Approved by: https://github.com/Skylion007 ghstack dependencies: #135891, #135961	2024-09-15 21:37:43 +00:00
Howard Huang	e501ed71d4	Update link in distributed.tensor.parallel.rst (#136103 ) dtensor folder was moved Pull Request resolved: https://github.com/pytorch/pytorch/pull/136103 Approved by: https://github.com/kwen2501, https://github.com/fegin	2024-09-15 19:36:29 +00:00
Tom Ritchford	ab9a7eadd3	Add decomposition for permute_copy (#130944 ) * Extracted from #129476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944 Approved by: https://github.com/amjames, https://github.com/eellison	2024-09-15 19:35:14 +00:00
Andrii Grynenko	a141c6bb0d	[pytorch][monitoring] Dynamic backend for WaitCounter (#135967 ) Summary: This implements a default backend proxy that tries to look up a backend via dlsym. What this enables is dynamically loading a module with a backend implementation without having it statically linked with the application. Differential Revision: D62549295 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135967 Approved by: https://github.com/c-p-i-o	2024-09-15 18:07:49 +00:00
Tugsbayasgalan Manlaibaatar	dec3403b24	Add some doc for export_for_training (#135918 ) Differential Revision: [D62610491](https://our.internmc.facebook.com/intern/diff/D62610491) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135918 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #135080, #135912	2024-09-15 17:08:12 +00:00
Tugsbayasgalan Manlaibaatar	1904b09e61	Create export_for_inference API and expose core_aten as public facing API (#135912 ) Differential Revision: [D62606908](https://our.internmc.facebook.com/intern/diff/D62606908) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135912 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #135080	2024-09-15 17:05:07 +00:00
Tugsbayasgalan Manlaibaatar	382fad58b3	Deprecate _preserve_ops and consolidate with decomp_table (#135080 ) In this PR, we deprecate the _preserve_ops feature in the run_decompositions API. We can't remove it completely because the Executorch team depends on it. As syncing between the two repos is non-trivial, I am leaving this argument as deprecated for now and will remove it in the next PR. After this PR, run_decompositions will only decompose what's inside the decomp table and preserve the rest by default. Note that this feature is only rolled out to OSS for now. The old code path is protected under the IS_FBCODE flag. Differential Revision: [D62163161](https://our.internmc.facebook.com/intern/diff/D62163161/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135080 Approved by: https://github.com/justinchuby, https://github.com/avikchaudhuri, https://github.com/bdhirsh	2024-09-15 17:01:58 +00:00
PyTorch MergeBot	357b7fb579	Revert "[Pytorch] Consolidate Strobelight compile time profiler between OSS and fbcode (#135953 )" This reverts commit b8637503c036abb898f6b880b325aeffe6f09c03. Reverted https://github.com/pytorch/pytorch/pull/135953 on behalf of https://github.com/kollasb due to Broke internal module factory compatibility, revert from Phabricator failed ([comment](https://github.com/pytorch/pytorch/pull/135953#issuecomment-2351381777))	2024-09-15 05:32:38 +00:00
cyy	31e42a45dd	Fix redundant move warnings by g++ (#134987 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134987 Approved by: https://github.com/ezyang	2024-09-15 05:28:19 +00:00
PyTorch UpdateBot	e1abd346a3	[audio hash update] update the pinned audio hash (#136106 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136106 Approved by: https://github.com/pytorchbot	2024-09-15 04:31:35 +00:00
Will Feng	386884e553	[Traceable FSDP2] Ignore FSDP2 forward hook side-effects in AC; Support FSDP2 + AC (#134997 ) > Ignore FSDP2 forward hook side-effects in AC Under AC, FSDP2 does not rely on forward hook to all-gather weights to do recomputation, instead it relies on pre-backward hook to do this job: `451eaf0ff2/torch/distributed/_composable/fsdp/_fsdp_state.py (L219-L220)` So when we use `speculate_subgraph` to trace the utils.checkpoint AC region, we don't actually need to worry about FSDP2 forward hook's side effects and can safely ignore it, because we are not and we don't expect to re-run the FSDP2 forward hook during backward recomputation. ---- Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134997 Approved by: https://github.com/zou3519 ghstack dependencies: #135727	2024-09-15 02:00:17 +00:00
leslie-fang-intel	8072ebc36c	SKIP llama for dynamic size testing (#135960 ) Running Torchbench llama with dynamic size failed with ``` File "/localdisk/leslie/torch_inductor_community/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4182, in produce_guards raise ConstraintViolationError( torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs'][0].size()[0])! For more information, run with TORCH_LOGS="+dynamic". - Not all values of RelaxedUnspecConstraint(L['inputs'][0].size()[0]) are valid because L['inputs'][0].size()[0] was inferred to be a constant (32). ``` Skip this model for marking dynamic dim. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135960 Approved by: https://github.com/ezyang	2024-09-15 00:06:49 +00:00
Guilherme Leobas	a1a57a424d	Optimize dict reconstruct to not codegen untouched values (#134876 ) This PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follows: (1) codegen(...) each pair of key/value (2) create a new dictionary to hold the new items (3) clear the original dictionary (4) update the original dict with the one created in (2) We do a micro optimization in the generated bytecode to: - Only codegen the items that changed. - Only clear the original dictionary if a key was removed. Fixes: #133487 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876 Approved by: https://github.com/zou3519	2024-09-14 23:25:28 +00:00
Bob Ren	a5eb43d8b4	Add TensorReferenceAnalysis and some tests (#135886 ) Split out and modified from https://github.com/pytorch/pytorch/pull/130228. There were a bunch of subtle bugs, e.g. sometimes we need to use torch.ops.aten.{operator}.Tensor and other times torch.ops.aten.{operator}.default. Or in the case of pow we need to use Tensor_Tensor. I figured it'd be easier to split out adding TensorReferenceAnalysis with some tests and do the actual integration in a separate diff. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135886 Approved by: https://github.com/ezyang	2024-09-14 23:09:40 +00:00
Isuru Fernando	391f2d6d50	use a fast expand algorithm (#135999 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135999 Approved by: https://github.com/ezyang	2024-09-14 23:09:34 +00:00
Isuru Fernando	5b21d91197	Fix dividing Mul by factor (#136079 ) Fixes https://github.com/pytorch/pytorch/issues/136032 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136079 Approved by: https://github.com/ezyang	2024-09-14 22:14:27 +00:00
Jez Ng	426580a67d	Add CI for Triton CPU backend (#135342 ) Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips. Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342 Approved by: https://github.com/jansel ghstack dependencies: #133408	2024-09-14 21:45:19 +00:00
Jez Ng	e498b02b47	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel	2024-09-14 21:45:19 +00:00
Aaron Gokaslan	55299cfc22	[BE]: Update mypy to 1.11.2 (#133816 ) Updates mypy to 1.11.2 to improve type inference Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816 Approved by: https://github.com/ezyang	2024-09-14 21:40:36 +00:00
Jason Ansel	c64ae601ba	[dynamo] Fix support for classmethod(property(...)) (#134968 ) Fixes #134451 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968 Approved by: https://github.com/yanboliang	2024-09-14 21:00:41 +00:00
Aaron Gokaslan	7f5abb44af	[BE][Ez]: Update pybind11 to 2.13.6. Exposes new conduit cross-compat API (#136087 ) Updates pybind11 submodule. The major patchnote is an experimental new function that is added to all pybind11 objects that will make them more compatible across pybind11 version, settings, and frameworks (such as nanobind) called cpp_conduit. No code changes needed on our end except to update Pull Request resolved: https://github.com/pytorch/pytorch/pull/136087 Approved by: https://github.com/malfet	2024-09-14 20:48:44 +00:00
Michael Lazos	8df01c8258	[Dynamo] Remove ignored modes from torch function mode stack guard (#135503 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502	2024-09-14 18:52:22 +00:00
Michael Lazos	860838e9be	[Dynamo] Remove ignored modes workaround (#135502 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422	2024-09-14 18:52:22 +00:00
Michael Lazos	1b9daeb240	[Dynamo] Trace enter/exit of TorchFunctionModes (#135422 ) This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode) Typically the bytecode for a context manager looks like this during a graph break: 1. graph call 2. enter context 3. unsupported code 4. exit context 5. resume call resume fn structure: 1. enter context 2. jump ... 3. exit context The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack). So for torch function modes the structure of our output code is this: 1. graph call 2. mutate tf mode stack to replay mutations 4. unsupported code 5. on exception restore stack 6. resume function Then our resume fn looks like this: 1. no-op enter torch function mode 2. jump 3. exit tf mode To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context). Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. 
However once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422 Approved by: https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443, #135444	2024-09-14 18:52:22 +00:00
Michael Lazos	06caa2d560	[Dynamo] Simplify torch function mode stack guard (#135444 ) The semantics of ignored modes previously had edge cases, this eliminates these by in essence filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422. The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444 Approved by: https://github.com/anijain2305, https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443	2024-09-14 18:52:22 +00:00
Michael Lazos	14cabdf626	[Dynamo] Support thread local setattr (#135443 ) In preparation for tracing through DeviceContext (`defb515306/torch/utils/_device.py (L66)`) This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137	2024-09-14 18:52:22 +00:00
Michael Lazos	5c5c33ac32	[Dynamo] Trace torch function modes entered outside of torch.compile (#133137 ) This PR adds initial tracing for torch function modes. Details: In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call. This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers. Previously landed: https://github.com/pytorch/pytorch/pull/133135 https://github.com/pytorch/pytorch/pull/133136 https://github.com/pytorch/pytorch/pull/133134 https://github.com/pytorch/pytorch/pull/133133 https://github.com/pytorch/pytorch/pull/133132 https://github.com/pytorch/pytorch/pull/133131 https://github.com/pytorch/pytorch/pull/133729 https://github.com/pytorch/pytorch/pull/133130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137 Approved by: https://github.com/jansel, https://github.com/zou3519 ghstack dependencies: #134732	2024-09-14 18:52:22 +00:00
Michael Lazos	228760b945	[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732 ) For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732 Approved by: https://github.com/ydwu4	2024-09-14 18:52:22 +00:00
Bin Bao	b4c84c3167	[AOTI] Fix a fallback op returning None issue (#135997 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/135781. In some cases, a fallback can return None in the place of a tensor. Differential Revision: [D62659039](https://our.internmc.facebook.com/intern/diff/D62659039) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135997 Approved by: https://github.com/chenyang78	2024-09-14 18:12:06 +00:00
Laith Sakka	b82122beef	Only keep ListOfLinears module in basic_modules_benchmarks and add gpu version. (#135730 ) All of the previous benchmarks are similar, ListOfLinears should be representative enough. I copied the previous benchmarks from unit tests without an intention, was just trying to create a large number of benchmarks to better observe noise. This PR keeps only one, we can add more as we see value and regressions in the future. Also this diff adds a GPU version. ``` collecting compile time instruction count for basic_modules_ListOfLinears_eager compile time instruction count for iteration 0 is 6479525851 compile time instruction count for iteration 1 is 1024432680 compile time instruction count for iteration 2 is 1019417317 compile time instruction count for iteration 3 is 1013603566 compile time instruction count for iteration 4 is 1008853980 compile time instruction count for iteration 5 is 1009541481 compile time instruction count for iteration 6 is 1005025533 compile time instruction count for iteration 7 is 1004116323 compile time instruction count for iteration 8 is 1000828633 compile time instruction count for iteration 9 is 999788323 collecting compile time instruction count for basic_modules_ListOfLinears_inductor compile time instruction count for iteration 0 is 40837529730 compile time instruction count for iteration 1 is 18411921909 compile time instruction count for iteration 2 is 18383665161 compile time instruction count for iteration 3 is 18348983522 compile time instruction count for iteration 4 is 18349276590 compile time instruction count for iteration 5 is 18353046274 compile time instruction count for iteration 6 is 18346818581 compile time instruction count for iteration 7 is 18340057998 compile time instruction count for iteration 8 is 18331267320 compile time instruction count for iteration 9 is 18328381338 collecting compile time instruction count for basic_modules_ListOfLinears_inductor_gpu compile time instruction count for 
iteration 0 is 15408870979 compile time instruction count for iteration 1 is 10949520859 compile time instruction count for iteration 2 is 11058786167 compile time instruction count for iteration 3 is 11003606719 compile time instruction count for iteration 4 is 10896406770 compile time instruction count for iteration 5 is 10982875189 compile time instruction count for iteration 6 is 10931848275 compile time instruction count for iteration 7 is 10956345008 compile time instruction count for iteration 8 is 11045384499 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135730 Approved by: https://github.com/ezyang, https://github.com/anijain2305	2024-09-14 16:45:52 +00:00
Suresh Babu Kolla	b8637503c0	[Pytorch] Consolidate Strobelight compile time profiler between OSS and fbcode (#135953 ) Summary: Move towards consolidating strobelight profiler implementations between OSS and fbcode. This change is a first step towards that. - Created a new function to abstract out compile time profiling enablement. This function allows profiler to switch between different function profilers (e.g. Thrift based or CLI based) - Both OSS and Fbcode now use one compile time profiler in torch/_strobelight Test Plan: Tested OSS with following commands: ``` python torch/_strobelight/examples/compile_time_profile_example.py python torch/_strobelight/examples/cli_function_profiler_example.py TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp --only XLNetLMHeadModel ``` See test commands for fbcode in comments. Differential Revision: D62444551 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135953 Approved by: https://github.com/laithsakka	2024-09-14 16:35:22 +00:00
William Wen	f97cccf62a	[3.13] fix 3.13 pickle error in torch/package (#136049 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136049 Approved by: https://github.com/albanD ghstack dependencies: #136034	2024-09-14 14:28:09 +00:00
CaoE	db393fb95e	Add Half support for reflection and replication padding on CPU (#135931 ) Fixes #135680 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135931 Approved by: https://github.com/Skylion007	2024-09-14 14:18:55 +00:00
PyTorch MergeBot	23dec79cef	Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732 )" This reverts commit 731b178b56c83966d6e8cdfb0015d22d8f91b4d2. Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))	2024-09-14 10:02:55 +00:00
PyTorch MergeBot	8c8a3086a7	Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137 )" This reverts commit 4528777e034b157a8329d1879daf52290eea199a. Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))	2024-09-14 10:02:55 +00:00
PyTorch MergeBot	46f5037007	Revert "[Dynamo] Support thread local setattr (#135443 )" This reverts commit 149d0b716173787df4543186ff74b605aca54e3e. Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))	2024-09-14 10:02:55 +00:00
PyTorch MergeBot	7975ec3a29	Revert "[Dynamo] Simplify torch function mode stack guard (#135444 )" This reverts commit ce3c74f2744cbc134b95cf8bd53ae5e3fbc67c29. Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))	2024-09-14 10:02:55 +00:00
PyTorch MergeBot	f3180f0088	Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422 )" This reverts commit 7743149b2be4a9eba7e0997ccdc6abe552bec266. Reverted https://github.com/pytorch/pytorch/pull/135422 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))	2024-09-14 10:02:55 +00:00
PyTorch MergeBot	838c912502	Revert "[Dynamo] Remove ignored modes workaround (#135502 )" This reverts commit 5c67cf180ee53d696f95d7c45dd99a35399e4450. Reverted https://github.com/pytorch/pytorch/pull/135502 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))	2024-09-14 10:02:55 +00:00
PyTorch MergeBot	72b868d034	Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503 )" This reverts commit e77bd0ebd20e96990ccd40518e68bbcfe7fda855. Reverted https://github.com/pytorch/pytorch/pull/135503 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))	2024-09-14 10:02:54 +00:00
Zhenbin Lin	41b58a1bec	OpenReg: Fix issue when copying on the same device (#135956 ) Current copy gets wrong value when src and dst are both openreg. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135956 Approved by: https://github.com/albanD	2024-09-14 09:57:45 +00:00
CaoE	f96a073c9d	Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232 ) Use `_amp_foreach_non_finite_check_and_unscale_` instead of the fallback version for CPU grads of `ShardedGradScaler`, as `_amp_foreach_non_finite_check_and_unscale_` is supported on CPU (https://github.com/pytorch/pytorch/pull/109281). Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232 Approved by: https://github.com/ezyang	2024-09-14 09:53:17 +00:00
Will Feng	a815611db9	[Traceable FSDP2][Partitioner] Must save AC output if output has a backward hook (#135727 ) If node is AC region output and has a backward hook on it, we intentionally choose to save it. This is to work around circular dependencies in Traceable FSDP2+AC. Example: ``` out = fully_shard(utils.checkpoint(module))(x) norm_out = layer_norm(out) ``` and there is a circular dependency: 1. In backward, grad_input of layer_norm aka. `out_grad` is actually dependent on `out`. 2. `out` depends on `out`'s backward hook created by FSDP2 (which does all-gather for `module` weights) in order to be recomputed. 3. `out`'s FSDP2 backward hook, as is the case for all eager backward hooks, depends on `out_grad` -> circular dependency with (1)! Solution: check whether `out` has a backward hook, and if so, intentionally save `out` in forward graph outputs. With this, we can break the above circular dependency. ---- Pull Request resolved: https://github.com/pytorch/pytorch/pull/135727 Approved by: https://github.com/Chillee	2024-09-14 08:45:58 +00:00
Oguz Ulgen	3352c9ac94	Add higher order operator name to the cache bypass exception (#135876 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135876 Approved by: https://github.com/jamesjwu, https://github.com/zou3519	2024-09-14 07:05:29 +00:00
Will Feng	5a2be192d1	[Traceable FSDP2] Don't register RegisterPostBackwardFunction if user intends to use Traceable FSDP2, and assert that compiled autograd is not used when entering RegisterPostBackwardFunction (#135824 ) During enablement of Traceable FSDP2 on internal models, sometimes the user only applies torch.compile to some of the FSDP2 instances but not all of them. Such a mixed usage pattern is not supported by compiled autograd. Here we try to catch this usage pattern and raise an error, so that the user can fix the usage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135824 Approved by: https://github.com/awgu	2024-09-14 06:30:12 +00:00
Nikita Shulga	a9bef85263	[CI] Increase open file handles limit to 16K on MacOS (#136061 ) Maybe it will help with flaky failures tracked in https://github.com/pytorch/pytorch/issues/135885 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136061 Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/huydhn, https://github.com/ZainRizvi	2024-09-14 06:16:12 +00:00
Laith Sakka	44dd218a61	Disable garbage collection during compile_time_instructions count in benchmark base by default. (#135768 ) When we measure compile time instruction count, in most cases we probably do not want to measure GC instructions, so this disables GC by default. If it is needed, we can add an option to allow it, or someone can use the regular total instruction count instead of the compile time instruction count. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135768 Approved by: https://github.com/ezyang, https://github.com/anijain2305	2024-09-14 06:15:28 +00:00
Nikita Shulga	1a67e2b680	[MPS] Add native im2col (#135706 ) It's called from `torch.unfold` and is one of the few remaining vestiges in `MPSFallback.mm`. Strongly inspired by the CUDA implementation from `09519eb195/aten/src/ATen/native/cuda/im2col.cuh (L40-L61)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135706 Approved by: https://github.com/albanD	2024-09-14 06:09:36 +00:00
Jack Taylor	b9b6094793	[ROCm] Skip pointwise associative scan tests due to regression (#135995 ) https://github.com/pytorch/pytorch/pull/133012 caused a regression on ROCm causing pointwise scan tests to fail ``` ERROR: test_pointwise_associative_scan_tuple_reverse_True_combine_mode_pointwise_cuda ERROR: test_pointwise_associative_scan_tuple_reverse_False_combine_mode_pointwise_cuda ERROR: test_pointwise_associative_scan_complex_pytree_reverse_True_combine_mode_pointwise_cuda ERROR: test_pointwise_associative_scan_complex_pytree_reverse_False_combine_mode_pointwise_cuda ERROR: test_pointwise_associative_scan_binary_operator_reverse_True_combine_mode_pointwise_cuda ERROR: test_pointwise_associative_scan_binary_operator_reverse_False_combine_mode_pointwise_cuda ``` Skipping temporarily while triage is underway. Full log: https://ossci-raw-job-status.s3.amazonaws.com/log/30067645445 ``` File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/graph.py", line 1020, in call_function out = lowerings[target](args, kwargs) # type: ignore[index] File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/lowering.py", line 363, in wrapped out = decomp_fn(args, **kwargs) File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/lowering.py", line 6245, in associative_scan raise RuntimeError("Unable to generate code for associative_scan op") torch._inductor.exc.LoweringException: RuntimeError: Unable to generate code for associative_scan op ``` NOTE: even "eager" backend fails ``` File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_higher_order_ops/associative_scan.py", line 338, in associative_scan_op_dense raise NotImplementedError("associative_scan is not implemented for eager") NotImplementedError: associative_scan is not implemented for eager ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135995 Approved by: https://github.com/malfet	2024-09-14 05:40:10 +00:00
fduwjj	911a43f930	[TCPStore] Remove deprecated constructor (#136004 ) While looking at the TCPStore code again, I found it confusing that we still keep the deprecated constructor for TCPStore in cpp even though we no longer expose it in python via pybind. I checked both internal and external use cases: all of those in cpp (aside from a unit test fixed in this PR) have already moved to using options. So let's remove this legacy constructor to avoid confusion. Differential Revision: [D62653634](https://our.internmc.facebook.com/intern/diff/D62653634) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136004 Approved by: https://github.com/Skylion007, https://github.com/XilunWu	2024-09-14 04:25:47 +00:00
Michael Lazos	e77bd0ebd2	[Dynamo] Remove ignored modes from torch function mode stack guard (#135503 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502	2024-09-14 02:41:16 +00:00
Michael Lazos	5c67cf180e	[Dynamo] Remove ignored modes workaround (#135502 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422	2024-09-14 02:41:16 +00:00
Michael Lazos	7743149b2b	[Dynamo] Trace enter/exit of TorchFunctionModes (#135422 ) This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode) Typically the bytecode for a context manager looks like this during a graph break: 1. graph call 2. enter context 3. unsupported code 4. exit context 5. resume call resume fn structure: 1. enter context 2. jump ... 3. exit context The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack). So for torch function modes the structure of our output code is this: 1. graph call 2. mutate tf mode stack to replay mutations 4. unsupported code 5. on exception restore stack 6. resume function Then our resume fn looks like this: 1. no-op enter torch function mode 2. jump 3. exit tf mode To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context). Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. 
However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422 Approved by: https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443, #135444	2024-09-14 02:41:08 +00:00
Michael Lazos	ce3c74f274	[Dynamo] Simplify torch function mode stack guard (#135444 ) The semantics of ignored modes previously had edge cases, this eliminates these by in essence filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422. The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444 Approved by: https://github.com/anijain2305, https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443	2024-09-14 02:40:59 +00:00
Michael Lazos	149d0b7161	[Dynamo] Support thread local setattr (#135443 ) In preparation for tracing through DeviceContext (`defb515306/torch/utils/_device.py (L66)`) This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137	2024-09-14 02:40:52 +00:00
Michael Lazos	4528777e03	[Dynamo] Trace torch function modes entered outside of torch.compile (#133137 ) This PR adds initial tracing for torch function modes. Details: In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call. This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers. Previously landed: https://github.com/pytorch/pytorch/pull/133135 https://github.com/pytorch/pytorch/pull/133136 https://github.com/pytorch/pytorch/pull/133134 https://github.com/pytorch/pytorch/pull/133133 https://github.com/pytorch/pytorch/pull/133132 https://github.com/pytorch/pytorch/pull/133131 https://github.com/pytorch/pytorch/pull/133729 https://github.com/pytorch/pytorch/pull/133130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137 Approved by: https://github.com/jansel, https://github.com/zou3519 ghstack dependencies: #134732	2024-09-14 02:40:43 +00:00
Michael Lazos	731b178b56	[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732 ) For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732 Approved by: https://github.com/ydwu4	2024-09-14 02:40:32 +00:00
PyTorch MergeBot	1786a17fed	Revert "Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232 )" This reverts commit 51c52061339069a2162e921e5b464fad5a411522. Reverted https://github.com/pytorch/pytorch/pull/135232 on behalf of https://github.com/CaoE due to wrong commit ([comment](https://github.com/pytorch/pytorch/pull/135232#issuecomment-2350792806))	2024-09-14 02:31:06 +00:00
CaoE	51c5206133	Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232 ) Use `_amp_foreach_non_finite_check_and_unscale_` instead of the fallback version for CPU grads of `ShardedGradScaler`, as `_amp_foreach_non_finite_check_and_unscale_` is supported on CPU (see https://github.com/pytorch/pytorch/pull/109281). Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232 Approved by: https://github.com/ezyang	2024-09-14 02:20:58 +00:00
Yu, Guangye	2e8d431a8f	Fix tensor.data_ptr() representation overflow (#135567 ) # Motivation fix https://github.com/pytorch/pytorch/issues/135550 In PyTorch, [`tensor.data_ptr()`](`e889252493/tools/autograd/templates/python_variable_methods.cpp (L204)`) is reinterpreted by a [signed int64](`e889252493/torch/csrc/autograd/utils/wrap_outputs.h (L50)`) data type, which could result in an overflow issue, like below: ```python import torch a = torch.randn(2).to('xpu') a.data_ptr() # one possible output is -23453392437248 # this is inconsistent with storage.data_ptr() a.untyped_storage().data_ptr() # one possible output is 18446720620317114368 ``` This PR aims to fix this representation overflow issue to make `tensor.data_ptr()` consistent with [`tensor.untyped_storage().data_ptr()`](`c0d2f991b1/torch/csrc/StorageMethods.cpp (L62)`). With this PR, the output will become: ```python import torch a = torch.randn(2).to('xpu') a.data_ptr() # one possible output is 18446720620317114368 # this is consistent with storage.data_ptr() a.untyped_storage().data_ptr() # one possible output is 18446720620317114368 ``` # Solution Use `PyLong_FromVoidPtr` to prevent the overflow issue and fit the semantic of `wrap`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135567 Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/albanD	2024-09-14 01:52:04 +00:00
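The sign flip described in #135567 above can be reproduced without a GPU. The sketch below is plain Python, not the C++ code touched by the PR: it uses `ctypes` to reinterpret the unsigned address quoted in the message as a signed 64-bit value and back. The helper names are made up for illustration.

```python
import ctypes


def as_signed_int64(address: int) -> int:
    """Reinterpret a 64-bit address as a signed integer (the overflowing view)."""
    # ctypes integer types do no overflow checking; the value simply wraps.
    return ctypes.c_int64(address).value


def as_unsigned_int64(value: int) -> int:
    """Reinterpret a signed 64-bit integer as the unsigned address it encodes."""
    return ctypes.c_uint64(value).value


# Address taken from the commit message above.
address = 18446720620317114368
signed_view = as_signed_int64(address)
print(signed_view)                     # -23453392437248
print(as_unsigned_int64(signed_view))  # 18446720620317114368
```

Both numbers match the "possible outputs" quoted in the message, which is consistent with the two values being the same pointer seen through a signed and an unsigned 64-bit type.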
Nikita Shulga	95496e4855	[CI] Check that PyTorch is built with OpenMP (#136060 ) The restriction to x86-only builds should have been removed a long time ago. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136060 Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/ZainRizvi	2024-09-14 01:51:36 +00:00
Li, Xingyuan	5de4cb8cd8	[Inductor UT] Generalize inductor UT for intel GPU (Part 3) (#135827 ) [Inductor UT] Reuse Inductor test case for Intel GPU. Reuse `test/inductor/test_compiled_autograd.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135827 Approved by: https://github.com/etaf, https://github.com/desertfire	2024-09-14 01:43:05 +00:00
Joel Schlosser	06bc717410	Fix sum() forward for NJT (#131945 ) This PR solves two problems with `sum()` support in NJT: * `sum()` over a dim with `keepdim=True` returns the wrong shape (i.e. it'll keep the wrong dim). This is a long-standing bug from way back in #112519. * Historically, we've only supported `sum()` over a dim and not a full reduction. This PR adds the full reduction form (forward only, backward still fails). Pull Request resolved: https://github.com/pytorch/pytorch/pull/131945 Approved by: https://github.com/davidberard98, https://github.com/jananisriram	2024-09-14 00:58:03 +00:00
Nikita Shulga	081c4a966d	[BE] Use squeeze/unsqueeze in im2col (#136006 ) Also move unsqueeze out of the dispatch, as it is dtype-agnostic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136006 Approved by: https://github.com/Skylion007, https://github.com/eqy	2024-09-14 00:35:37 +00:00
Ke Wen	4237592b8f	[Distributed] add pack-check method for float8_e4m3fn (#135961 ) We check 8 x FP8 simultaneously, at size of 8 bytes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135961 Approved by: https://github.com/yifuwang, https://github.com/Skylion007 ghstack dependencies: #135891	2024-09-14 00:32:27 +00:00
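One way to read "check 8 x FP8 simultaneously, at size of 8 bytes" in #135961 above is a packed test on one 64-bit word. The sketch below assumes the check looks for NaN, and relies on the float8_e4m3fn encoding in which NaN is the bit pattern S.1111.111 (bytes 0x7F and 0xFF). It is a Python illustration of the bit trick, not the kernel code from the PR, and all names are hypothetical.

```python
import struct

LOW7_MASK = 0x7F7F7F7F7F7F7F7F  # clears the sign bit of each of the 8 bytes
ONES = 0x0101010101010101       # adds 1 to every byte
HIGH_BITS = 0x8080808080808080  # selects the top bit of every byte


def pack8(raw: bytes) -> int:
    """Pack exactly 8 FP8 bytes into one little-endian 64-bit word."""
    (word,) = struct.unpack("<Q", raw)
    return word


def any_e4m3fn_nan(packed: int) -> bool:
    """Return True if any of the 8 packed bytes is an e4m3fn NaN."""
    magnitudes = packed & LOW7_MASK
    # Every byte is now <= 0x7F, so adding 1 cannot carry into its neighbour;
    # the top bit of a byte ends up set only if that byte was exactly 0x7F.
    return ((magnitudes + ONES) & HIGH_BITS) != 0


finite = pack8(bytes([0x00, 0x7E, 0xFE, 0x38, 0x80, 0x01, 0x40, 0x77]))
with_nan = pack8(bytes([0x00, 0x00, 0x00, 0xFF, 0x00, 0x00, 0x00, 0x00]))
print(any_e4m3fn_nan(finite), any_e4m3fn_nan(with_nan))
```

The point of packing is that one mask, one add and one compare cover eight values, instead of eight separate byte comparisons.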
William Wen	a00faf4408	[3.13] fix 3.13 pickle error in serialization.py (#136034 ) Error encountered when adding dynamo 3.13 support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136034 Approved by: https://github.com/albanD	2024-09-14 00:02:40 +00:00
eellison	b608ff3bea	[Easy] Dont match to mm_plus_mm if not in max autotune (#135929 ) It's only an optimization when we tune the triton template. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135929 Approved by: https://github.com/FindHao	2024-09-13 23:38:02 +00:00
Jerry Zhang	b8eef500a6	Fix attr check for quantization spec (#135736 ) Summary: Previously we only checked dtype and is_dynamic to decide whether two quantization specs are equivalent. This may not work in some cases, e.g. when people use a different qscheme or quant_min/quant_max. This PR adds checks for the other fields as well. Test Plan: regression tests Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D62530974](https://our.internmc.facebook.com/intern/diff/D62530974) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135736 Approved by: https://github.com/sxu	2024-09-13 23:01:22 +00:00
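The difference between the two equality checks in #135736 above can be sketched with a toy dataclass. `ToySpec` and both comparison helpers are hypothetical stand-ins that only borrow the field names mentioned in the message; they are not the quantization spec class or the code changed by the PR.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical stand-in carrying the fields named in the commit message."""
    dtype: str
    is_dynamic: bool = False
    qscheme: Optional[str] = None
    quant_min: Optional[int] = None
    quant_max: Optional[int] = None


def equal_on_two_fields(a: ToySpec, b: ToySpec) -> bool:
    """The narrow check: only dtype and is_dynamic are compared."""
    return a.dtype == b.dtype and a.is_dynamic == b.is_dynamic


def equal_on_all_fields(a: ToySpec, b: ToySpec) -> bool:
    """The wider check: every declared field must match."""
    return all(getattr(a, f.name) == getattr(b, f.name) for f in fields(ToySpec))


full_range = ToySpec("int8", quant_min=-128, quant_max=127)
narrow_range = ToySpec("int8", quant_min=-127, quant_max=127)
print(equal_on_two_fields(full_range, narrow_range))  # True
print(equal_on_all_fields(full_range, narrow_range))  # False
```

The two specs differ only in `quant_min`, so the narrow check treats them as interchangeable while the wider one does not.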
Menglu Yu	aad556a0b5	[PT2][Inductor][Optimus] Fix a corner case in remove_split_with_size_one (#135962 ) Summary: see context in https://fb.workplace.com/groups/1075192433118967/permalink/1501768230461383/ Test Plan: # local reproduce ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "mai" --flow_id 642153776 ``` P1586356950 # e2e before fix f642153776 after fix Differential Revision: D62625318 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135962 Approved by: https://github.com/jackiexu1992	2024-09-13 22:53:08 +00:00
Zain Rizvi	3c5d44dda5	Cleanup unused runner variants (#136058 ) Cleaning up unused runner variants, leaving behind only the few that are actually referenced by workflows For more details see description in the PR that generated these code changes: - https://github.com/pytorch/test-infra/pull/5665 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136058 Approved by: https://github.com/wdvr, https://github.com/malfet	2024-09-13 22:50:07 +00:00
Justin Chu	e2d3af405f	[ONNX] Remove logging apis from public (#133825 ) Remove - torch.onnx.enable_log - torch.onnx.disable_log - torch.onnx.set_log_stream - torch.onnx.log because they are not meant for public consumption and have been marked for deprecation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133825 Approved by: https://github.com/titaiwangms	2024-09-13 22:19:52 +00:00
Jessica Vandebon	baff86dafb	[MTIA tensor] allow shallow copy between CPU and MTIA tensors (#135871 ) Reviewed By: egienvalue, hanzlfs Differential Revision: D61662214 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135871 Approved by: https://github.com/egienvalue, https://github.com/nautsimon	2024-09-13 22:13:58 +00:00
Huy Do	db5e1b44d2	Fix inductor-micro-benchmark results upload (take 2) (#136052 ) I had a brain freeze when I wrote the original fix. The parameters were in the wrong order. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136052 Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/malfet	2024-09-13 22:05:10 +00:00
Nikita Shulga	a30d5ba16c	Fix bug in split-build workflows codegen (#136043 ) By just deleting a few rogue lines left behind in https://github.com/pytorch/pytorch/pull/135510 If a file in the workflows folder does not have a `.yml` extension, it will not be launched at all, will it? Pull Request resolved: https://github.com/pytorch/pytorch/pull/136043 Approved by: https://github.com/kit1980, https://github.com/atalman	2024-09-13 21:29:06 +00:00
Laith Sakka	46935c8241	Reduce default iterations to 5 . (#135773 ) Running all benchmarks currently takes around 15 minutes; this is the data: https://www.internalfb.com/phabricator/paste/view/P1583590240 The data looks mostly stable, and 5 iterations should be enough, especially with our 1.5% threshold. That said, the diff also adds a way to increase the number of iterations for a specific benchmark. Results after the change: https://www.internalfb.com/phabricator/paste/view/P1583618969 Time is down to half (7 minutes). Pull Request resolved: https://github.com/pytorch/pytorch/pull/135773 Approved by: https://github.com/ezyang, https://github.com/anijain2305	2024-09-13 21:16:38 +00:00
Laith Sakka	4f407c1884	Only measure compile time instruction count for sum_floordiv benchmark (#135785 ) There was recent unexplained noise of +5%/-5%. Measuring only compile time 1) avoids GC time and 2) avoids other operations that are not what this benchmark is trying to measure, so noise becomes less likely. ``` collecting compile time instruction count for sum_floordiv_regression compile time instruction count for iteration 0 is 8899290248 compile time instruction count for iteration 1 is 1188830489 compile time instruction count for iteration 2 is 1180579615 compile time instruction count for iteration 3 is 1176263131 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135785 Approved by: https://github.com/avikchaudhuri, https://github.com/anijain2305	2024-09-13 21:14:10 +00:00
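To put the instruction counts logged in #135785 above into perspective, the sketch below computes the relative spread of the steady-state iterations and compares it with the 1.5% threshold mentioned in #135773. Treating iteration 0 as warm-up and defining the spread as (max - min) / min are assumptions made for illustration; the benchmark harness may compare results differently.

```python
def relative_spread(counts, warmup=1):
    """Relative spread (max - min) / min of the counts after the warm-up iterations."""
    steady = counts[warmup:]
    return (max(steady) - min(steady)) / min(steady)


# Instruction counts for sum_floordiv_regression, copied from the log above.
counts = [8899290248, 1188830489, 1180579615, 1176263131]

THRESHOLD = 0.015  # the 1.5% threshold mentioned in #135773

spread = relative_spread(counts)
print(f"steady-state spread: {spread:.2%}")
print("within threshold" if spread < THRESHOLD else "exceeds threshold")
```

Iteration 0 is several times larger than the rest, which is why it is excluded here before measuring the spread.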
Laith Sakka	2e461e54e8	Add gpu and gpu_dynamic versions of add_loop (#135809 ) I am thinking maybe 3 iterations are enough for this one? - so I am keeping eager and inductor since inductor is 2X eager time - Eager dynamic is 2X eager so keeping this as well. - inductor have three tests. (dynamic gpu, gpu and cpu) I am unsure if am over profiling here happy to trim if anyone have suggestions. ``` collecting compile time instruction count for add_loop_eager compile time instruction count for iteration 0 is 8213664211 compile time instruction count for iteration 1 is 2798628246 compile time instruction count for iteration 2 is 2796811362 compile time instruction count for iteration 3 is 2794438188 compile time instruction count for iteration 4 is 2794634117 collecting compile time instruction count for add_loop_eager_dynamic compile time instruction count for iteration 0 is 5724108021 compile time instruction count for iteration 1 is 5499908609 compile time instruction count for iteration 2 is 5569101366 compile time instruction count for iteration 3 is 5493806364 compile time instruction count for iteration 4 is 5493169851 collecting compile time instruction count for add_loop_inductor compile time instruction count for iteration 0 is 49789381222 compile time instruction count for iteration 1 is 25769347393 compile time instruction count for iteration 2 is 25772594322 compile time instruction count for iteration 3 is 25768695952 compile time instruction count for iteration 4 is 25768032314 collecting compile time instruction count for add_loop_inductor_gpu compile time instruction count for iteration 0 is 23966942581 compile time instruction count for iteration 1 is 23771950919 compile time instruction count for iteration 2 is 23770784286 compile time instruction count for iteration 3 is 23780160875 compile time instruction count for iteration 4 is 23774634465 collecting compile time instruction count for add_loop_inductor_dynamic_gpu compile time instruction count 
for iteration 0 is 41505055086 compile time instruction count for iteration 1 is 41293654089 compile time instruction count for iteration 2 is 41301016100 compile time instruction count for iteration 3 is 41306056207 compile time instruction count for iteration 4 is 41308171566 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135809 Approved by: https://github.com/ezyang, https://github.com/anijain2305	2024-09-13 20:42:31 +00:00
atalman	a3d827a28c	Use python 3.11 for Large Wheel build (#136042 ) Use Python 3.11 in nightly Large wheel builds. Required for Colab testing Pull Request resolved: https://github.com/pytorch/pytorch/pull/136042 Approved by: https://github.com/kit1980, https://github.com/malfet Co-authored-by: Sergii Dymchenko <kit1980@gmail.com>	2024-09-13 20:27:11 +00:00
Yiming Zhou	4312794b92	[reland][export] fix re-export custom metadata (#135720 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/134778 The previous D62304294 broke some executorch tests. It has already been reverted. In this diff, `_collect_param_buffer_metadata()` is modified in a way that when a `call_function` node is encountered and its input nodes include `get_attr`. We skip the fields that have been collected previously and only collect rest of the fields. This prevents over-writing. Test Plan: ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//executorch/backends/xnnpack/test:test_xnnpack_ops buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_re_export_preserve_handle buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_run_decompositions_preserve_handle ``` Differential Revision: D62514208 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135720 Approved by: https://github.com/zhxchen17, https://github.com/jerryzh168	2024-09-13 20:15:15 +00:00
Sergii Dymchenko	b856f3539b	Fix script name in the comments (#135507 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135507 Approved by: https://github.com/atalman	2024-09-13 19:59:47 +00:00
Jing Xu	835e7bb077	fix requirements.txt installation failure issue on Windows (#134567 ) Fixes #134564 Root cause: The `lintrunner` wheel released on [pypi.org](https://pypi.org/project/lintrunner/#files) only supports Windows 32bit and Linux 64bit. Since compilation of pytorch requires a 64bit env, on windows, the `lintrunner` has to be compiled from source distribution. `Rust` is its dependency for compilation, as indicated in the error message. Meanwhile, Visual Studio environment is needed for linking libraries.. ![image](https://github.com/user-attachments/assets/180cd899-8886-43b5-b42f-031f41e81683) Issue when performing `pip install lintrunner` without a Visual Studio environment activated is shown below. ```bash >python -m pip install lintrunner Collecting lintrunner Downloading lintrunner-0.12.5.tar.gz (62 kB) Installing build dependencies ... done Getting requirements to build wheel ... done Preparing metadata (pyproject.toml) ... done Building wheels for collected packages: lintrunner Building wheel for lintrunner (pyproject.toml) ... error error: subprocess-exited-with-error × Building wheel for lintrunner (pyproject.toml) did not run successfully. 
│ exit code: 1 ╰─> [137 lines of output] Running `maturin pep517 build-wheel -i C:\Users\\miniforge3\envs\py310\python.exe --compatibility off` 📡 Using build options bindings from pyproject.toml Compiling proc-macro2 v1.0.79 Compiling unicode-ident v1.0.12 Compiling version_check v0.9.4 Compiling windows_x86_64_msvc v0.52.4 Compiling winapi v0.3.9 Compiling serde v1.0.197 Compiling autocfg v1.2.0 Compiling syn v1.0.109 Compiling lazy_static v1.4.0 Compiling libc v0.2.153 Compiling equivalent v1.0.1 Compiling hashbrown v0.14.3 Compiling memchr v2.7.2 Compiling yansi v1.0.1 Compiling unicode-width v0.1.11 Compiling regex-syntax v0.8.3 Compiling encode_unicode v0.3.6 Compiling cfg-if v1.0.0 Compiling winnow v0.6.5 Compiling cc v1.0.92 error: could not compile `windows_x86_64_msvc` (build script) due to 2 previous errors warning: build failed, waiting for other jobs to finish... error: could not compile `serde` (build script) due to 2 previous errors error: could not compile `proc-macro2` (build script) due to 2 previous errors error: could not compile `syn` (build script) due to 2 previous errors error: could not compile `libc` (build script) due to 2 previous errors error: could not compile `winapi` (build script) due to 2 previous errors 💥 maturin failed Caused by: Failed to build a native library through cargo Caused by: Cargo build finished with "exit code: 101": `cargo rustc --manifest-path Cargo.toml --message-format json --release --bins --` 📦 Including license file "LICENSE" 🔗 Found bin bindings error: linker `link.exe` not found \| = note: program not found note: the msvc targets depend on the msvc linker but `link.exe` was not found note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option. note: VS Code is a different product, and is not sufficient. 
error: aborting due to 1 previous error error: linker `link.exe` not found \| = note: program not found note: the msvc targets depend on the msvc linker but `link.exe` was not found note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option. note: VS Code is a different product, and is not sufficient. error: aborting due to 1 previous error error: linker `link.exe` not found \| = note: program not found note: the msvc targets depend on the msvc linker but `link.exe` was not found note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option. note: VS Code is a different product, and is not sufficient. error: aborting due to 1 previous error error: linker `link.exe` not found \| = note: program not found note: the msvc targets depend on the msvc linker but `link.exe` was not found note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option. note: VS Code is a different product, and is not sufficient. error: aborting due to 1 previous error error: linker `link.exe` not found \| = note: program not found note: the msvc targets depend on the msvc linker but `link.exe` was not found note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option. note: VS Code is a different product, and is not sufficient. error: aborting due to 1 previous error error: linker `link.exe` not found \| = note: program not found note: the msvc targets depend on the msvc linker but `link.exe` was not found note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option. note: VS Code is a different product, and is not sufficient. 
error: aborting due to 1 previous error Error: command ['maturin', 'pep517', 'build-wheel', '-i', 'C:\\Users\\\\miniforge3\\envs\\py310\\python.exe', '--compatibility', 'off'] returned non-zero exit status 1 [end of output] note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for lintrunner Failed to build lintrunner ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (lintrunner) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134567 Approved by: https://github.com/malfet	2024-09-13 18:43:55 +00:00
PyTorch MergeBot	b6d6aa49b8	Revert "Validate input types for `torch.nn.Linear` and `torch.nn.Bilinear` (#135596 )" This reverts commit e157ce3ebbb3f30d008c15914e82eb74217562f0. Reverted https://github.com/pytorch/pytorch/pull/135596 on behalf of https://github.com/malfet due to It's too restrictive, should allow other int-like types, such as `numpy.int64` ([comment](https://github.com/pytorch/pytorch/pull/135596#issuecomment-2349714104))	2024-09-13 18:06:56 +00:00
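The revert reason above ("should allow other int-like types, such as `numpy.int64`") points at a general Python idiom: accept anything that implements `__index__` rather than testing for `int` exactly. The sketch below uses `operator.index` for that. The function name, parameter name and `FakeInt64` class are hypothetical, and this is not the validation code from #135596 or any follow-up to it.

```python
import operator


def as_feature_count(value, name="in_features"):
    """Convert an int-like value to a plain int, rejecting everything else."""
    try:
        # operator.index accepts int and any object that defines __index__.
        return operator.index(value)
    except TypeError:
        raise TypeError(
            f"{name} must be an integer-like value, got {type(value).__name__}"
        ) from None


class FakeInt64:
    """Stand-in for an int-like scalar type that is not a subclass of int."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        return self._value


print(as_feature_count(16))             # 16
print(as_feature_count(FakeInt64(32)))  # 32
```

A float such as `2.0` is still rejected, because `float` does not implement `__index__`.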
PyTorch MergeBot	deee21cb78	Revert "[Inductor] Rename `cpp_wrapper_cuda.py` as `cpp_wrapper_gpu.py` (#135313 )" This reverts commit 16b37b309f64ddd4e498c57a99191e1d9b3dfdac. Reverted https://github.com/pytorch/pytorch/pull/135313 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/135313#issuecomment-2349662091))	2024-09-13 17:53:21 +00:00
Daohang Shi	3f69410976	[gpu-profiler] Expose active and repeat in os env var (#135757 ) Summary: https://fb.workplace.com/groups/ai.efficiency.tools.users/permalink/1855136444971825/ Test Plan: `buck2 test mode/opt caffe2/test:profiler -- -r test_kineto_profiler_api ` eyes Differential Revision: D62529249 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135757 Approved by: https://github.com/Yuzhen11	2024-09-13 17:48:27 +00:00
PyTorch MergeBot	18f9331e5d	Revert "[aoti] Fix workspace generation for triton (#135552 )" This reverts commit d3833253928f29ed760b2dccac2b730028a868ca. Reverted https://github.com/pytorch/pytorch/pull/135552 on behalf of https://github.com/izaitsevfb due to blocks revert of #135313, internal failures, see D62511427 ([comment](https://github.com/pytorch/pytorch/pull/135552#issuecomment-2349641372))	2024-09-13 17:47:36 +00:00
Catherine Lee	bc0f330169	[trymerge] Manually close merged PR when Github fails (#135890 ) Manually close the merged PR when GitHub fails to do it. Consequences of the current design: sleeping for 1 min uses up the machine, might result in race conditions, results in the merging label being removed a bit later, and the PR is still left open if this API fails too (i.e. there is no async clean-up job). Tested in https://github.com/malfet/deleteme/pull/92 by removing the part of the commit message that has "resolved #pr num" Pull Request resolved: https://github.com/pytorch/pytorch/pull/135890 Approved by: https://github.com/malfet, https://github.com/huydhn	2024-09-13 17:29:24 +00:00
Rachel Guo	7834c0bb2c	[AOTI][Tooling] Add stats summary (mean/min/max, etc) for jit inductor tensor value printing (#135887 ) Summary: As title. Follow up to add stats summary (mean/min/max, etc) for jit inductor tensor value printing as well. The inductor python wrapper code level printing would look something like this: {F1859224287} Test Plan: CI Reviewed By: chenyang78 Differential Revision: D62415575 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135887 Approved by: https://github.com/chenyang78	2024-09-13 17:19:25 +00:00
PyTorch MergeBot	6ef49fe8f1	Revert "Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058 )" This reverts commit 3d2431380999252d5401f83d5010b398a32e7597. Reverted https://github.com/pytorch/pytorch/pull/135058 on behalf of https://github.com/malfet due to It regresses x86 performance ([comment](https://github.com/pytorch/pytorch/pull/135058#issuecomment-2349480861))	2024-09-13 17:09:45 +00:00
Jack Taylor	a15774563b	[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663 ) As of ROCm 6.1 [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4) is now available allowing us to enable this attribute on ROCm. ``` >>> torch.cuda.get_device_properties(0) _CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104) >>> torch.cuda.get_device_properties(0).regs_per_multiprocessor 65536 ``` With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spills from a triton binary with the AMD backend, allowing us to enable inductor's dynamic_rblock_scaling on ROCm, initially implemented in https://github.com/pytorch/pytorch/pull/115094 Leaving this in draft until the following PRs have landed: - https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin - https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663 Approved by: https://github.com/jansel, https://github.com/shunting314	2024-09-13 16:45:39 +00:00
PyTorch MergeBot	564d00f364	Revert "Fix clang-tidy warnings in Caffe2 code (#134935 )" This reverts commit 7cfd23636c8fa6fcbb8bf3ea34e15b847ec9ad9d. Reverted https://github.com/pytorch/pytorch/pull/134935 on behalf of https://github.com/izaitsevfb due to breaks internal builds, caffe2 is still used internally ([comment](https://github.com/pytorch/pytorch/pull/134935#issuecomment-2349368152))	2024-09-13 16:42:37 +00:00
drisspg	ae02d663cd	[FlexAttention] Fix output layout (#135882 ) We previously only supported the same v_head dim and qk_head dim. When we allowed different head dims, I accidentally kept the same query strides for the output. This PR fixes this bug and also ensures that we always produce output in the same stride order as the input query. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135882 Approved by: https://github.com/yanboliang, https://github.com/Chillee	2024-09-13 16:36:05 +00:00
James Wu	ad2f0e9f81	Add remote cache time saved to compilation metrics (#135490 ) Summary: Record remote cache time saved via frame_phase_timing. We add to the "phase" when remote cache hits and saves us time, so that we have a 1:1 correspondence between a frame and time saved. Test Plan: Internally run benchmark, see that it's populated in sandbox table after previous diff lands and logger config is actualized. Show that column exists in table: https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/fp2te0ff Note that an earlier version of D62105258 had the column as a string, so the staging table is a bit messed up. But you can see the most recent samples have the column populated as a float. Reviewed By: aorenste Differential Revision: D62106921 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135490 Approved by: https://github.com/aorenste	2024-09-13 16:35:51 +00:00
Edward Z. Yang	21ffa18ad1	Fix "expand: SymIntArrayRef expected to contain only concrete integers" in AOTInductor (#135933 ) Internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1501860707118802/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135933 Approved by: https://github.com/angelayi	2024-09-13 15:23:42 +00:00
eqy	2519e5a8de	[CUDA][FP8] Skip rowwise scaling test on sm89 (#135718 ) Same reason as https://github.com/pytorch/pytorch/pull/133612: the rowwise scaling implementation is sm90+ specific (e.g., it uses TMA). Pull Request resolved: https://github.com/pytorch/pytorch/pull/135718 Approved by: https://github.com/Skylion007	2024-09-13 15:07:20 +00:00
Laith Sakka	ba6e0f31ab	Remove cycle dependency by localizing the import. (#135926 ) Summary: Since https://www.internalfb.com/diff/D62215095 landed there have been many silent errors due to the dependency between functional_tensor and config. ``` File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/__init__.py", line 64, in <module> File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/dynamic_shapes.py", line 23, in <module> File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/exported_program.py", line 26, in <module> File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/__init__.py", line 1, in <module> File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/cond.py", line 6, in <module> File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_subclasses/functional_tensor.py", line 9, in <module> File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_inductor/config.py", line 44, in <module> ``` https://fburl.com/logarithm/ol5kx0ee The errors complain about a cyclic dependency; this change fixes it. Test Plan: buck test multipy/runtime:test_deploy_embedded_cuda_interp_without_cuda_available -- --run-disabled TorchpyTest.AcquireMultipleSessionsInDifferentPackages Reviewed By: aorenste Differential Revision: D62616765 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135926 Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/Skylion007	2024-09-13 15:05:41 +00:00
PyTorch MergeBot	7ed0563cad	Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732 )" This reverts commit e504fb70693d4a3741c3380b6a989d441e84f737. Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))	2024-09-13 12:52:58 +00:00
PyTorch MergeBot	eb7dd91dd1	Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137 )" This reverts commit fafdd588f27e1d56090c6d260d0382c255eaf9eb. Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))	2024-09-13 12:52:58 +00:00
PyTorch MergeBot	3f30360d05	Revert "[Dynamo] Support thread local setattr (#135443 )" This reverts commit 30b007bea329f512af3dc4fd4e6c7d145e807b71. Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))	2024-09-13 12:52:58 +00:00
PyTorch MergeBot	4734e356d6	Revert "[Dynamo] Simplify torch function mode stack guard (#135444 )" This reverts commit 0c080cb2c78a85a5320fbeadbbb9a2cc640fd89d. Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))	2024-09-13 12:52:57 +00:00
PyTorch MergeBot	ac169795a9	Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422 )" This reverts commit 2af3b8ffd84e36b91279174e9106f84b2d2a11f2. Reverted https://github.com/pytorch/pytorch/pull/135422 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))	2024-09-13 12:52:57 +00:00
PyTorch MergeBot	fca58bfda1	Revert "[Dynamo] Remove ignored modes workaround (#135502 )" This reverts commit 7d5e0dd4b1a8d20fc8624b3085a6f5ddedd89a2e. Reverted https://github.com/pytorch/pytorch/pull/135502 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))	2024-09-13 12:52:57 +00:00
PyTorch MergeBot	dc71e7a7d4	Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503 )" This reverts commit c56728b643e2b7d796abd7ec45803319e1c5967d. Reverted https://github.com/pytorch/pytorch/pull/135503 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))	2024-09-13 12:52:57 +00:00
PyTorch MergeBot	1cdf658f4a	Revert "[PT2][inductor][Optimus] Add pad_aten_mm_pass pattern to resolve long computation kernel in LCE (#135167 )" This reverts commit eb0fe029337b31bcb3d4b2d1e539895393975d68. Reverted https://github.com/pytorch/pytorch/pull/135167 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI eg. https://github.com/pytorch/pytorch/actions/runs/10845542664/job/30097957154 ([comment](https://github.com/pytorch/pytorch/pull/135167#issuecomment-2348847595))	2024-09-13 12:35:05 +00:00
PyTorch MergeBot	b5c52e96e8	Revert "[dynamo] Fix support for classmethod(property(...)) (#134968 )" This reverts commit bf68e16e94fc05f10d434cdc162a14d02c6ad23c. Reverted https://github.com/pytorch/pytorch/pull/134968 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI: eg. https://github.com/pytorch/pytorch/actions/runs/10845542664/job/30097956613 ([comment](https://github.com/pytorch/pytorch/pull/134968#issuecomment-2348837553))	2024-09-13 12:29:03 +00:00
Bin Bao	ea2ecab15b	[AOTI][reland] Fix assert_function call in cpu autotune template (#135920 ) Summary: Reland https://github.com/pytorch/pytorch/pull/135086. In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK. Test Plan: CI Differential Revision: D62500592 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135920 Approved by: https://github.com/chenyang78	2024-09-13 12:21:57 +00:00
CaoE	2f53d570fe	Update document for autocast on CPU (#135299 ) Update document for autocast on CPU due to the support of float16 and changes in the operator list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135299 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/svekars	2024-09-13 09:11:47 +00:00
Ke Wen	31007cf200	[Distributed] add FP8 support to NaN checker (#135891 ) Adding support for `torch.float8_e4m3fn` and `torch.float8_e5m2` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135891 Approved by: https://github.com/wconstab	2024-09-13 08:43:54 +00:00
Michael Lazos	c56728b643	[Dynamo] Remove ignored modes from torch function mode stack guard (#135503 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502	2024-09-13 08:41:32 +00:00
Michael Lazos	7d5e0dd4b1	[Dynamo] Remove ignored modes workaround (#135502 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137, #135443, #135444, #135422	2024-09-13 08:41:32 +00:00
Michael Lazos	2af3b8ffd8	[Dynamo] Trace enter/exit of TorchFunctionModes (#135422 ) This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode) Typically the bytecode for a context manager looks like this during a graph break: 1. graph call 2. enter context 3. unsupported code 4. exit context 5. resume call resume fn structure: 1. enter context 2. jump ... 3. exit context The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack). So for torch function modes the structure of our output code is this: 1. graph call 2. mutate tf mode stack to replay mutations 4. unsupported code 5. on exception restore stack 6. resume function Then our resume fn looks like this: 1. no-op enter torch function mode 2. jump 3. exit tf mode To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context). Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. 
However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422 Approved by: https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443, #135444	2024-09-13 08:41:24 +00:00
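The enter-once / exit-once structure described in #135422 can be sketched with a toy model. This is a plain-Python illustration, not Dynamo's implementation: `mode_stack`, `Mode`, `NoOpEnterMode`, `compiled_prefix` and `resume_function` are all hypothetical names standing in for the global torch function mode stack, the polyfill mode, and the generated code.

```python
# Toy model only: every name below is hypothetical, not a Dynamo API.

mode_stack = []  # stands in for the global torch function mode stack


class Mode:
    """Eager behavior: push on enter, pop on exit."""

    def __enter__(self):
        mode_stack.append(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        mode_stack.pop()
        return False


class NoOpEnterMode:
    """Resume-function behavior: the mode is already on the stack,
    so entering does nothing, but exiting still pops it."""

    def __init__(self, mode):
        self.mode = mode

    def __enter__(self):
        return self.mode  # no push: the side-effect replay already did it

    def __exit__(self, exc_type, exc, tb):
        mode_stack.pop()
        return False


def compiled_prefix(mode):
    # "graph call", then replay the stack mutation as a side effect
    mode_stack.append(mode)


def resume_function(mode, log):
    with NoOpEnterMode(mode):
        log.append(len(mode_stack))  # depth inside the `with` body
    log.append(len(mode_stack))      # depth after the single exit


def run_with_graph_break():
    log = []
    mode = Mode()
    compiled_prefix(mode)
    log.append(len(mode_stack))  # depth while the unsupported code runs
    resume_function(mode, log)
    return log


def run_eager():
    log = []
    with Mode():
        log.append(len(mode_stack))
    log.append(len(mode_stack))
    return log
```

In this model the stack depth is 1 for the whole `with` body and 0 afterwards in both paths, which is the sense in which the split execution matches eager.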
Michael Lazos	0c080cb2c7	[Dynamo] Simplify torch function mode stack guard (#135444 ) The semantics of ignored modes previously had edge cases; this PR eliminates them by, in essence, filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422. The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444 Approved by: https://github.com/anijain2305, https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443	2024-09-13 08:41:17 +00:00
Michael Lazos	30b007bea3	[Dynamo] Support thread local setattr (#135443 ) In preparation for tracing through DeviceContext (`defb515306/torch/utils/_device.py (L66)`) This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137	2024-09-13 08:41:07 +00:00
Michael Lazos	fafdd588f2	[Dynamo] Trace torch function modes entered outside of torch.compile (#133137 ) This PR adds initial tracing for torch function modes. Details: In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call. This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers. Previously landed: https://github.com/pytorch/pytorch/pull/133135 https://github.com/pytorch/pytorch/pull/133136 https://github.com/pytorch/pytorch/pull/133134 https://github.com/pytorch/pytorch/pull/133133 https://github.com/pytorch/pytorch/pull/133132 https://github.com/pytorch/pytorch/pull/133131 https://github.com/pytorch/pytorch/pull/133729 https://github.com/pytorch/pytorch/pull/133130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137 Approved by: https://github.com/jansel, https://github.com/zou3519 ghstack dependencies: #134732	2024-09-13 08:41:00 +00:00
Michael Lazos	e504fb7069	[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732 ) For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732 Approved by: https://github.com/ydwu4	2024-09-13 08:40:50 +00:00
Jez Ng	b346e99376	remove fast_flush arguments (#135387 ) I've removed them from upstream Triton in https://github.com/triton-lang/triton/pull/4485. It looks like most places in the code use the default value of `fast_flush=True` anyway, though there are two PRs from @pearu that use `False`. To my knowledge, there's no reason to use the `False` value. Differential Revision: [D62325778](https://our.internmc.facebook.com/intern/diff/D62325778) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135387 Approved by: https://github.com/nmacchioni, https://github.com/jansel	2024-09-13 08:13:46 +00:00
Animesh Jain	7dc1788396	[inductor] Remove the batch fusion passes from being a default (#135922 ) Ads team do a search internally to figure out which fusion passes to use. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135922 Approved by: https://github.com/eellison, https://github.com/yanboliang ghstack dependencies: #135819	2024-09-13 06:07:33 +00:00
xinan.lin	9fd54d787d	[Inductor UT] Generalize device-bias code in test_triton_kernels.py introduced in #135530 (#135656 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135656 Approved by: https://github.com/EikanWang, https://github.com/zou3519	2024-09-13 05:27:56 +00:00
xingyuan li	b38be727eb	[Inductor UT] Generalize inductor UT for intel GPU (Part 2) (#134556 ) [Inductor UT] Reuse Inductor test case for Intel GPU. Reuse `test/inductor/test_torchinductor_opinfo.py` Reuse `test/inductor/test_minifier_isolate.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134556 Approved by: https://github.com/etaf, https://github.com/eellison	2024-09-13 05:16:28 +00:00
Jokeren	e54b559e88	[inductor] More fixes on the keys of `constants` and `signature` dictionaries (#135406 ) The previous PR (https://github.com/pytorch/pytorch/pull/135170) forgot to change two other places that also create `constants` and `signature`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135406 Approved by: https://github.com/jansel	2024-09-13 04:10:41 +00:00
wz337	eea5e6ff0f	[DCP][DSD] Add a test case to demonstrate the workaround to load full state dict into a 2D model (#135763 ) Fix https://github.com/pytorch/pytorch/issues/134095 This is a workaround for loading full state dict into a FSDP1+TP 2D model. Since named_parameters() in FSDP1 does not return DTensor, we don't have the information to shard the full_state_dict and load it directly into the 2d model. In order to load a full state dict in FSDP1+TP 2D model, we need to do: - load the full state dict into a 1D FSDP model - dcp.save the full/shard state dict into storage - initialize a 2D FSDP1+TP model - get the default sharded state dict for the 2D model (full_state_dict=False) - dcp.load the state dict from storage - load the state dict into the 2D model Pull Request resolved: https://github.com/pytorch/pytorch/pull/135763 Approved by: https://github.com/fegin ghstack dependencies: #135725	2024-09-13 03:51:14 +00:00
Pian Pawakapan	6df91b5917	real tensor prop for composite ops (#135717 ) Fixes #135632 Adds real tensor propagation for decompositions, checking any symbols on their outputs Pull Request resolved: https://github.com/pytorch/pytorch/pull/135717 Approved by: https://github.com/ezyang	2024-09-13 03:35:16 +00:00
wz337	0cdc6a8dcd	[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725 ) Fix https://github.com/pytorch/pytorch/issues/134095 This fixes the distributed state dict full_state_dict option hang during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725 Approved by: https://github.com/fegin	2024-09-13 03:26:36 +00:00
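The "local slice, no collective" idea in #135725 can be illustrated without any distributed runtime. The sketch below is an assumption-laden toy: `local_shard` is a hypothetical helper, tensors are plain lists, and the split is a simple contiguous chunking along the first dimension (strided sharding is not modeled). It does not call the DTensor API.

```python
def local_shard(full, rank, world_size):
    """Return the contiguous chunk of `full` that `rank` would own.

    Hypothetical helper: uses ceil-division chunking, so trailing
    ranks may get a shorter (or empty) chunk.
    """
    chunk = -(-len(full) // world_size)  # ceil(len / world_size)
    start = rank * chunk
    return full[start:start + chunk]


def shards_from_identical_copies(full, world_size):
    # Each rank slices its own copy of the full tensor; nothing is sent
    # between ranks, which is why every copy must hold the same values.
    copies = [list(full) for _ in range(world_size)]
    return [local_shard(copies[r], r, world_size) for r in range(world_size)]
```

If one rank's copy differed from the others, the concatenated shards would no longer equal any single full tensor, which is the responsibility the commit message places on the user.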
Prachi Gupta	6cdc70bccd	[ROCm] skip test_fp8_cast_and_t on non-MI300 machines (#135917 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135917 Approved by: https://github.com/malfet	2024-09-13 02:46:48 +00:00
Yu, Guangye	e6b68359d7	Fix xpu memory stats error (#135818 ) # Motivation fix https://github.com/pytorch/pytorch/issues/135726 After merging two free blocks, I made a stupid mistake of ignoring the correct size to decrease the active memory size, which should be the original block size instead of the merged block size. # Additional Context Add a UT to guard this scenario. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135818 Approved by: https://github.com/EikanWang	2024-09-13 02:41:21 +00:00
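The accounting mistake described in #135818 can be shown with a toy free-list. This is a minimal sketch under stated assumptions: `ToyAllocator` and `Block` are hypothetical and far simpler than the XPU caching allocator; the only point illustrated is that the active-bytes counter must drop by the freed block's original size, captured before any merge.

```python
class Block:
    def __init__(self, size):
        self.size = size
        self.free = False


class ToyAllocator:
    """Hypothetical allocator: blocks are kept in address order."""

    def __init__(self):
        self.blocks = []
        self.active_bytes = 0

    def allocate(self, size):
        block = Block(size)
        self.blocks.append(block)
        self.active_bytes += size
        return block

    def release(self, block):
        original_size = block.size  # capture BEFORE merging changes block.size
        block.free = True
        idx = self.blocks.index(block)

        # Merge with the following free neighbor, if any.
        if idx + 1 < len(self.blocks) and self.blocks[idx + 1].free:
            block.size += self.blocks.pop(idx + 1).size

        # Merge into the preceding free neighbor, if any.
        if idx > 0 and self.blocks[idx - 1].free:
            self.blocks[idx - 1].size += block.size
            self.blocks.pop(idx)

        # Subtracting the merged size here would under-count active memory.
        self.active_bytes -= original_size
```

Subtracting the merged size instead would drive `active_bytes` negative as soon as two adjacent blocks are freed one after the other.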
Nikita Shulga	1c04cbfba6	[BE] Use `C10_UNUSED` (#135914 ) Instead of `(void)foo; // Suppress unused variable` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135914 Approved by: https://github.com/huydhn, https://github.com/eqy	2024-09-13 02:27:07 +00:00
Shivam Raikundalia	062681a0ed	[Profiler] Torch Profiler distributed info is not JSON serializable (#135548 ) Summary: To fix https://github.com/pytorch/pytorch/issues/133308 we must create an encoder for numpy values so we can serialize the distributed metadata to JSON. Test Plan: Added unit test to check that numpy values can be serialized Differential Revision: D62411619 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135548 Approved by: https://github.com/aaronenyeshi, https://github.com/albanD	2024-09-13 02:22:33 +00:00
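A custom encoder of the kind #135548 describes can be sketched as follows. This is not the profiler's actual encoder: `ArrayLikeEncoder` and `dumps_metadata` are hypothetical names, and the sketch relies on the fact that NumPy scalars and arrays expose `.tolist()`, which returns plain Python values, so NumPy itself does not need to be imported.

```python
import json


class ArrayLikeEncoder(json.JSONEncoder):
    """Hypothetical encoder: converts objects that expose .tolist()
    (for example NumPy integer scalars and arrays) to plain Python values."""

    def default(self, obj):
        tolist = getattr(obj, "tolist", None)
        if callable(tolist):
            return tolist()
        # Fall back to the base class, which raises TypeError.
        return super().default(obj)


def dumps_metadata(metadata):
    # sort_keys keeps the output stable across runs
    return json.dumps(metadata, cls=ArrayLikeEncoder, sort_keys=True)
```

Values that are neither JSON-native nor array-like still raise `TypeError`, so unexpected types are not silently dropped.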
Aaron Orenstein	8c356ce3da	Fix lint errors in fbcode (#135614 ) Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps. After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports. Test Plan: ``` fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS ``` Before: ``` ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. 
See ht$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$ ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". 
Please make sure it's listed in the srcs paramete$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib. Some things to try: ``` Differential Revision: D62049222 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614 Approved by: https://github.com/oulgen, https://github.com/laithsakka	2024-09-13 02:04:34 +00:00
Jason Ansel	bf68e16e94	[dynamo] Fix support for classmethod(property(...)) (#134968 ) Fixes #134451 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968 Approved by: https://github.com/yanboliang	2024-09-13 01:14:18 +00:00
eqy	d732df7e56	[Inductor] Disable TF32 in `test_slice_scatter_reinplace` (#135709 ) TF32 linear/matmul numerics seem unrelated to test functionality so disabling it here to abate noisy failures Pull Request resolved: https://github.com/pytorch/pytorch/pull/135709 Approved by: https://github.com/eellison	2024-09-13 00:30:45 +00:00
Sahan Paliskara	c9de2efde6	[Docs] fix inconsistent docs in conv1d, conv2d, and conv3d (#135894 ) Addresses https://github.com/pytorch/pytorch/issues/135880 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135894 Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet	2024-09-13 00:19:42 +00:00
Jason Ansel	1f15c0c7a5	[fx] Replace _snake_case with a regexp (#135822 ) ~2x speedup on this function, though saves <0.5s overall Pull Request resolved: https://github.com/pytorch/pytorch/pull/135822 Approved by: https://github.com/oulgen ghstack dependencies: #135787, #135788, #135820, #135821	2024-09-13 00:18:41 +00:00
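For reference, a regexp-based snake-case conversion in the spirit of #135822 might look like the sketch below. The pattern and the name `snake_case` are assumptions for illustration; the commit message does not show the regexp that was actually used.

```python
import re

# Match an uppercase letter that directly follows a lowercase letter.
_UPPER_AFTER_LOWER = re.compile(r"(?<=[a-z])([A-Z])")


def snake_case(name):
    """Insert '_' at each lowercase-to-uppercase boundary, then lowercase."""
    return _UPPER_AFTER_LOWER.sub(r"_\1", name).lower()
```

Compiling the pattern once at import time and doing the whole substitution in a single call avoids a Python-level loop over characters.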
Jason Ansel	a72124add9	[fx] Minor optimization in create_arg (#135821 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135821 Approved by: https://github.com/oulgen ghstack dependencies: #135787, #135788, #135820	2024-09-13 00:18:41 +00:00
Jason Ansel	10ca4c0564	[inductor] Use TracerBase directly in LoopBody (#135820 ) This skips some unneeded work in the subclass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135820 Approved by: https://github.com/oulgen ghstack dependencies: #135787, #135788	2024-09-13 00:18:41 +00:00
Jason Ansel	d3aab9642b	[inductor] Optimize can_fuse_vertical() (#135788 ) An O(n^2) to O(n) improvement by not comparing all pairs of deps. Before: ![image](https://github.com/user-attachments/assets/797cd1bd-5d53-4374-8e76-ffce4232d7f9) After: ![image](https://github.com/user-attachments/assets/1e61bf29-adba-41a4-839e-f028130fa979) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135788 Approved by: https://github.com/oulgen ghstack dependencies: #135787	2024-09-13 00:18:41 +00:00
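The general shape of an all-pairs to hashed-lookup rewrite like the one in #135788 is sketched below. This is not the scheduler's code: the dependency representation (hashable tuples) and both function names are assumptions chosen to show the complexity change only.

```python
def has_common_dep_pairwise(reads, writes):
    """O(len(reads) * len(writes)): compares every pair."""
    return any(r == w for r in reads for w in writes)


def has_common_dep_hashed(reads, writes):
    """O(len(reads) + len(writes)): one pass to build a set, one to probe it."""
    write_set = set(writes)
    return any(r in write_set for r in reads)
```

Both functions return the same answer for hashable deps; only the amount of work differs.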
Jason Ansel	67a929eea8	[inductor] Remove unused check (#135787 ) I think this is unreachable code because mode is always None on reads. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135787 Approved by: https://github.com/oulgen	2024-09-13 00:18:41 +00:00
Isuru Fernando	f576960bbc	do not expand in replace/simplify if no changes (#135863 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135863 Approved by: https://github.com/ezyang	2024-09-13 00:12:01 +00:00
Nikita Shulga	1aba224cfd	Update nightly PyTorch version to 2.6.0 (#135916 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135916 Approved by: https://github.com/kit1980	2024-09-13 00:08:52 +00:00
Shangdi Yu	d383325392	[aoti] Fix workspace generation for triton (#135552 ) Fixes #131337 - add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`. - do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead. - add workspace allocation generation code to `kernel_autotune_calls`. e.g. ```python workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8) workspace.zero_() ..... triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0) del buf2, arg0_1, arg1_1, workspace ``` - add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code. The generated cpp has lines like below, so we also implement a `zero_()` for ` AtenTensorHandle `. ```cpp static constexpr int64_t int_array_0[] = {1280L, }; static constexpr int64_t int_array_1[] = {1L, }; AtenTensorHandle workspace_handle; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda, 0, &workspace_handle)); RAIIAtenTensorHandle workspace(workspace_handle); workspace.zero_(); ``` - Fix handle grid_fn for grid computation. Pass in "RBLOCK" to `split_scan_grid` - Fix dynamic shapes: Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined. The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code. - We also generate slightly different cpp code depending on if `abi_compatible` is turned on. 
```cpp RAIIAtenTensorHandle workspace(workspace_handle); AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get())); ``` vs ```cpp at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA); workspace.zero_(); ``` Test Plan: ``` TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552 Approved by: https://github.com/desertfire	2024-09-12 23:53:09 +00:00
Ma Jian	00dc7d4356	fix compiled_autograd deadlock throw (#135795 ) Fixes #135298 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135795 Approved by: https://github.com/xmfan	2024-09-12 23:24:57 +00:00
Yanbo Liang	1760bbc259	[FlexAttention] Ensure q/k/v and block_mask are on exactly the same device (#135823 ) Fixes #134739 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135823 Approved by: https://github.com/BoyuanFeng	2024-09-12 23:11:01 +00:00
Jack Taylor	fb9d8e3248	[ROCm] Use ieee precision for fp32 in flex attention (#135702 ) `3bebc09be9` brought in a change to flex_attention to allow TF32 precision. This largely lacks support on the ROCm side, so we should use ieee. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135702 Approved by: https://github.com/jeffdaily, https://github.com/drisspg	2024-09-12 23:00:48 +00:00
eellison	aaabfc8930	[Easy] Check if quant registered in constant folding (#135875 ) Belated fix for https://github.com/pytorch/pytorch/issues/110904 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135875 Approved by: https://github.com/shunting314	2024-09-12 22:16:39 +00:00
William Wen	63d6cd351a	[dynamo] support torch.nn.attention.sdpa_kernel context manager (#135404 ) Fixes https://github.com/pytorch/pytorch/issues/134608 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135404 Approved by: https://github.com/jansel, https://github.com/drisspg	2024-09-12 22:04:48 +00:00
PyTorch MergeBot	3de9e474df	Revert "Check function declarations of Core ML code (#135467 )" This reverts commit bc1b8f094d24de27432f4c29f0729e85a6b5ba63. Reverted https://github.com/pytorch/pytorch/pull/135467 on behalf of https://github.com/malfet due to This breaks ios periodic jobs, see https://github.com/pytorch/pytorch/actions/runs/10797026668/job/29947377532 ([comment](https://github.com/pytorch/pytorch/pull/135467#issuecomment-2347322784))	2024-09-12 22:04:35 +00:00
PyTorch MergeBot	3e1a4ea132	Revert "[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725 )" This reverts commit 83c594ebd6dfa517fdd67ae23929cc60d5fa325d. Reverted https://github.com/pytorch/pytorch/pull/135725 on behalf of https://github.com/ZainRizvi due to This is breaking lint. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10835983999/job/30068709508) [HUD commit link](`83c594ebd6`) ([comment](https://github.com/pytorch/pytorch/pull/135725#issuecomment-2347303272))	2024-09-12 21:47:38 +00:00
Sanskar Modi	e157ce3ebb	Validate input types for `torch.nn.Linear` and `torch.nn.Bilinear` (#135596 ) Adding validation checks to check the input types and display better error messages for the same. Fixes https://github.com/pytorch/pytorch/issues/135463 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135596 Approved by: https://github.com/malfet	2024-09-12 21:28:37 +00:00
Pian Pawakapan	b897ab0540	[export] ignore mark_dynamic() in export (#135536 ) Previously we were accommodating `torch._dynamo.mark_dynamic()` for export's dynamic shapes. Here we clean things up and ignore it, requiring users to specify an export input for `dynamic_shapes`. Note: there are 4 decorators relevant to export, `mark_dynamic, maybe_mark_dynamic, mark_static, mark_unbacked`. User calls that involve export have only been `mark_dynamic()`, and we use `maybe_mark_dynamic` under the hood for `Dim.AUTO`, but we could start using others. One reason I decided to not warn and just silently ignore is these decorators cause the tensors to carry dynamic info, and it'll be hard to tell whether the markers are from export or user calls when re-exporting with the same inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135536 Approved by: https://github.com/avikchaudhuri	2024-09-12 21:22:19 +00:00
Fadi Arafeh	3d24313809	Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058 ) Optimized dynamic quantization for aarch64 was enabled by #126687 and #134897 This PR fixes an issue for aarch64 where on a [cache miss](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp#L592) (e.g. if input dimensions change) [ideep::matmul_forward::compute ](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L160) (wrongly) runs with the [default lowp_kind (u8s8)](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L174) which is not supported by oneDNN+ACL (Arm Compute Library), causing the workload to fall back to a much slower oneDNN gemm:jit kernel Example:

```python
import torch

DIM = 4096
INPUT_SIZE1 = 32
INPUT_SIZE2 = 16


class LinearNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(DIM, DIM, bias=False)

    def forward(self, x):
        x = self.fc1(x)
        return x


input1 = torch.randn(size=(INPUT_SIZE1, DIM))
input2 = torch.randn(size=(INPUT_SIZE2, DIM))

with torch.no_grad():
    model = LinearNet()
    model = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear})
    model(input1)  # this goes to ACL lowp_gemm
    print("=" * 50)
    model(input2)  # this goes to gemm:jit without this PR, and to ACL with this PR
```

In the code snippet above: - The matmul from `model(input1)` goes to oneDNN+ACL (in both cases, with and without the PR) - The matmul from `model(input2)`: without this PR, there's a cache miss (different input shapes) and matmul_forward::compute is run with the default lowp_kind (u8s8), so the matmul falls back to gemm:jit in oneDNN. With this PR, the matmul goes to oneDNN+ACL, which is around 10x faster than oneDNN+jit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135058 Approved by: https://github.com/jondea, https://github.com/malfet	2024-09-12 20:30:20 +00:00
Riley Dulin	cd472bb1e3	[torch][fx] Add new replacement_callback to materialize a replacement just in time (#135553 ) Summary: Sometimes we only want to generate a replacement for a matched pattern once we know some information about the nodes in the pattern. So far, we have found this the most useful to do matches based on specific shapes of tensors flowing into functions. Use a callback function similar to `match_filters`. By default this isn't used. Had to make `replacement` a None-able parameter because Callable was already used to detect a case where a graph needed to be traced. Differential Revision: D62412628 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135553 Approved by: https://github.com/SherlockNoMad	2024-09-12 18:52:14 +00:00
Guilherme Leobas	f032135bbf	Add batching rule for torch.scatter_reduce (#135547 ) Fixes #134797 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135547 Approved by: https://github.com/zou3519	2024-09-12 18:51:21 +00:00
Joel Schlosser	525bec804c	NJT <-> padded dense conversions (#125947 ) This PR: * Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values) * Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics * Note: there is currently no public API for this; design booted to a future PR TODO: * ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~ * ~~Verify that Inductor does computation fusion via test logic~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947 Approved by: https://github.com/soulitzer	2024-09-12 17:54:25 +00:00
wz337	83c594ebd6	[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725 ) Fix https://github.com/pytorch/pytorch/issues/134095 This fixes the distributed state dict full_state_dict option hang during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725 Approved by: https://github.com/fegin	2024-09-12 17:43:57 +00:00
Rachel Guo	c1277945d3	[AOTI][Tooling] Support debug printing for inductor level extern kernel call such as externkernel.addmm, bmm, etc. (#135731 ) Summary: As title. Effect after merging this diff would look something like this: ``` print('inductor: before_launch - triton_poi_fused_0 - buf0', buf0) triton_poi_fused_0.run(buf0, 6, grid=grid(6), stream=stream0) print('inductor: after_launch - triton_poi_fused_0 - buf0', buf0) buf1 = empty_strided_cuda((16, 6), (6, 1), torch.float32) # Topologically Sorted Source Nodes: [linear], Original ATen: [aten.addmm] print('inductor: before_launch - extern_kernels.addmm - buf0', buf0) extern_kernels.addmm(buf0, reinterpret_tensor(arg2_1, (16, 16), (16, 1), 0), reinterpret_tensor(L__self___weight, (16, 6), (1, 16), 0), alpha=1, beta=1, out=buf1) print('inductor: after_launch - extern_kernels.addmm - buf0', buf0) ``` Context: D62272588 only support major triton kernel jit inductor debug printing codegen Test Plan: CI & OSS CI Reviewed By: chenyang78, ColinPeppler Differential Revision: D62397017 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135731 Approved by: https://github.com/ColinPeppler	2024-09-12 17:31:10 +00:00
Isuru Fernando	dab7d646d5	Use a better decomposition for split_with_sizes (#135728 ) This decomposition has less checks and improves the performance of torch.compile. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135728 Approved by: https://github.com/ezyang	2024-09-12 16:38:51 +00:00
whywhy-rtx3090	7647c398ff	Allow optional positional arguments for `torch.func.functional_call` (#134643 ) This PR resolves #134408. Add an additional test and have passed the local test. Do you think we should add a post-check to ensure `args` and `kwargs` are not both `None`? It seems to be possible to have modules without inputs. This PR does not include any such post-check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134643 Approved by: https://github.com/zou3519	2024-09-12 15:22:06 +00:00
Justin Chu	d67cc58181	[ONNX] Fix symbolic values and numpy implementation (#135786 ) 1. Remove `__eq__` to make `SymbolicTensor` hashable and test for that 2. Update the `__array__` method so that it works for tensor on GPU Fixes https://github.com/pytorch/pytorch/issues/135700 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135786 Approved by: https://github.com/titaiwangms	2024-09-12 14:24:43 +00:00
Animesh Jain	dddaadac6c	[dynamo] Dont graph break on inner torch.compile (#135819 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135819 Approved by: https://github.com/jansel	2024-09-12 11:39:09 +00:00
Jason Ansel	02169364e1	[inductor] Split reduction loops when there is no shared reads (#134307 ) Fixes #129102 ![image](https://github.com/user-attachments/assets/0d00f75b-2bb9-4ce6-a0d9-2daceaff539c) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134307 Approved by: https://github.com/shunting314	2024-09-12 09:45:08 +00:00
Yanbo Liang	c30042fbeb	[GPT-fast] Update compilation time target for Llama & Mixtral (#135817 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135817 Approved by: https://github.com/xmfan, https://github.com/huydhn	2024-09-12 07:13:44 +00:00
Sun, Jiayi	6700175531	[Inductor] simplify indexing_exprs in LoopBody._init_with_copy (#135574 ) This PR uses `var_ranges` information to simplify `indexing_exprs` in `LoopBody._init_with_copy` to reduce occurrences of `FloorDiv` and `ModularIndexing` in the `indexing_exprs`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135574 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2024-09-12 06:56:34 +00:00
Xilun Wu	de8a8653c0	[dtensor][BE] replace compute_local_shape with compute_local_shape_and_global_offset (#135554 ) Summary 1. This PR removes the public API `compute_local_shape` and replaces its use with the more general API `compute_local_shape_and_global_offset`. 2. To keep `compute_local_shape_and_global_offset` consistent with `compute_local_shape` on empty shards, it now returns local tensor shape `(0,)` for empty shards, which is more aligned with DTensor's semantics on non-participating ranks. Test `pytest test/distributed/_tensor/test_dtensor.py` `pytest test/distributed/_tensor/test_init.py` `pytest test/distributed/_tensor/test_tensor_ops.py` Differential Revision: [D62415591](https://our.internmc.facebook.com/intern/diff/D62415591) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135554 Approved by: https://github.com/tianyu-l, https://github.com/wz337	2024-09-12 06:30:09 +00:00
Jason Ansel	86335e9135	[reland 3/3][fx] Bypass custom __setattr__ in Node.__init__ (#135735 ) Relands #135079 which was reverted by #135562 I broke this up into three parts to test internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135735 Approved by: https://github.com/oulgen	2024-09-12 05:50:39 +00:00
angelayi	14e3f3c062	[aoti] Remove nlohmann/json.hpp from header (#135765 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135765 Approved by: https://github.com/malfet	2024-09-12 05:38:51 +00:00
Dmitry Rogozhkin	9852c6d236	xpu: fix 3rd party builds on systems with cmake<3.25 (#135767 ) The CMake LINUX variable is only available starting from CMake 3.25. It is better to use CMAKE_SYSTEM_NAME instead to relax the CMake version requirement. See: https://cmake.org/cmake/help/v3.25/variable/LINUX.html Fixes: #135766 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135767 Approved by: https://github.com/malfet, https://github.com/guangyey	2024-09-12 05:31:01 +00:00
Jason Ansel	6354271178	[inductor] Skip unused call to get_estimated_runtime() (#135776 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135776 Approved by: https://github.com/oulgen ghstack dependencies: #135445, #135446	2024-09-12 05:22:23 +00:00
Jason Ansel	12902f6ecf	[inductor] Cache get_operation_names/get_buffer_names (#135446 ) Before: ![image](https://github.com/user-attachments/assets/db5b6fce-d849-4512-a21d-7a09efc72311) After: ![image](https://github.com/user-attachments/assets/097e340c-03b2-491e-ad36-132350b37892) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135446 Approved by: https://github.com/oulgen ghstack dependencies: #135445	2024-09-12 05:22:23 +00:00
Jason Ansel	3decb676aa	[inductor] Optimize cache_on_self (#135445 ) This is a small compile time win, but also makes profiles more readable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135445 Approved by: https://github.com/oulgen	2024-09-12 05:22:23 +00:00
Zhenbin Lin	8d68a02905	OpenReg: Split the daemon into driver/executor (#135646 ) Split the daemon into a proper user-process driver vs device-process executor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135646 Approved by: https://github.com/albanD	2024-09-12 05:03:46 +00:00
Jason Ansel	28330a8a39	[reland 1/3][fx] Bypass custom __setattr__ in Node.__init__ (#135733 ) Relands #135079 which was reverted by #135562 I broke this up into three parts to test internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135733 Approved by: https://github.com/oulgen	2024-09-12 04:29:37 +00:00
Animesh Jain	eaba287adb	[dynamo] Bug fix for _torchdynamo_inline source handling (#135612 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135612 Approved by: https://github.com/drisspg	2024-09-12 04:05:08 +00:00
cyy	f5f1d0a753	Fix build warnings for torch_python (#134981 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134981 Approved by: https://github.com/ezyang	2024-09-12 03:59:34 +00:00
Adam J. Stewart	5bc238c73e	torch.hub: add get_dir/set_dir type hints (#134906 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134906 Approved by: https://github.com/Skylion007	2024-09-12 03:53:29 +00:00
He Kai	79223114db	Avoid inserting extra transpose when the input to group norm is NHWC (#135575 ) When the input format for group norm is NHWC and the device is privateuseone, it introduces an additional transpose operation. To avoid this issue, a check for the privateuseone device needs to be added here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135575 Approved by: https://github.com/ezyang	2024-09-12 03:36:05 +00:00
cyy	7cfd23636c	Fix clang-tidy warnings in Caffe2 code (#134935 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134935 Approved by: https://github.com/ezyang	2024-09-12 03:27:09 +00:00
Feng Yuan	0d1d69fd25	Update torch-xpu-ops pin (ATen XPU implementation) (#135647 ) Release cycle for PyTorch 2.5 1. Fixing runtime error on Windows: Fail to load torch_xpu_ops_unary_binary_kernels.dll as the bin size is large. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135647 Approved by: https://github.com/EikanWang	2024-09-12 03:16:08 +00:00
Aaron Orenstein	21a64d57b1	[BE] typing for decorators - masked/_ops (#135108 ) Differential Revision: D62184735 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135108 Approved by: https://github.com/Skylion007	2024-09-12 01:34:09 +00:00
Shangdi Yu	1a74952925	"Remove BLOCK_LIST" (#135729 ) Summary: Skip test_prepare_qat_conv_bn_fusion_getitem_placeholder when we use training ir, since it's only for bn-getitem pattern, but the pattern doesn't exist in training ir. Remove BLOCK_LIST since it's empty. Now all internal unittests will use training ir. Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' caffe2/test/quantization:test_quantization -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder buck2 run 'fbcode//mode/dev-nosan' caffe2/test:quantization_pt2e_qat -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder ``` Differential Revision: D62387987 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135729 Approved by: https://github.com/tugsbayasgalan	2024-09-12 01:22:06 +00:00
Huy Do	a130ed828a	Fix the upload of x86 micro benchmark results (#135780 ) Upload stats workflow currently skips this https://github.com/pytorch/pytorch/actions/runs/10807251335/job/29977650639, this is a miss from https://github.com/pytorch/pytorch/pull/135042. So, the workflow is running but nothing has been uploaded yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135780 Approved by: https://github.com/atalman	2024-09-12 01:16:38 +00:00
Menglu Yu	eb0fe02933	[PT2][inductor][Optimus] Add pad_aten_mm_pass pattern to resolve long computation kernel in LCE (#135167 ) Summary: We observed another long computation issue for the OBA_AFOC pyper model, thus adding a pattern to avoid the perf regression - Only happens on A100 - Do not want to use force_shape_pad since it will pad all GEMMs, which may not be optimal. The Optimus pass has more flexibility to customize the GEMM shape and do the corresponding padding - To enable, we pass the pass to config, where "k_threshold_to_pad" can be customized inductor_config.patch(post_grad_fusion_options={"pad_aten_mm_pass": {"k_threshold_to_pad" : 8388608}}) Test Plan: # unit test ``` buck2 test mode/opt //caffe2/test/inductor:pad_mm ``` Buck UI: https://www.internalfb.com/buck2/58b0f272-f405-45be-bc8d-aec2dc4d5841 Test UI: https://www.internalfb.com/intern/testinfra/testrun/10133099209954651 Network: Up: 9.0KiB Down: 142B (reSessionID-8eb71a37-a5ca-4aff-a4f1-93ade3e47e4e) Jobs completed: 9. Time elapsed: 3:18.0s. Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3) Tests finished: Pass 17. Fail 0. Fatal 0. Skip 0. Build failure 0 # e2e test see [D62388582](https://www.internalfb.com/diff/D62388582) Differential Revision: D62220158 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135167 Approved by: https://github.com/jackiexu1992	2024-09-12 00:51:34 +00:00
Wei Feng	d270e2d240	[FSDP2] better error msg for cpu offloading (#135156 ) When CPU offloading is enabled, if the user loads a GPU state dict, FSDP2 throws a less obvious error at backward ``` RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device ``` This PR throws the error more explicitly by specifying which parameters should be moved because of CPU offloading ``` FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: ['0.weight'] ``` `pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135156 Approved by: https://github.com/awgu	2024-09-12 00:05:07 +00:00
xinan.lin	16b37b309f	[Inductor] Rename `cpp_wrapper_cuda.py` as `cpp_wrapper_gpu.py` (#135313 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135313 Approved by: https://github.com/jansel, https://github.com/desertfire ghstack dependencies: #135312	2024-09-11 23:59:54 +00:00
xinan.lin	13ee85ca5e	[Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. (#135312 ) [Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135312 Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/eellison	2024-09-11 23:59:54 +00:00
Will Feng	94d2471d1f	[Traceable FSDP2] Use .copy_ instead of .set_ for unsharded_param inplace update; Replace unsharded_param graph input usage with graph intermediate; Support FSDP2+LoRA (#133730 ) Using `fsdp.set_` for unsharded_param inplace update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_` which fixes the error and also strictly follows eager semantics (i.e. if user explicitly stores an alias of the unsharded_param during execution of the user's module code, that alias will get updated correctly when the unsharded_param is copy_ into; whereas if we just swap out unsharded_param storage via set_, that user-saved alias will not get updated, which is not good). This PR also implements the graph pass to remove the resizes and copy if there is a resize_(full) -> copy_ -> resize_(0) pattern. ------ Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_` - `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries` - `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients` - `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching` - `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager` - `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32` - `python test/distributed/test_inductor_collectives.py TestCollectivesInductor.test_backwards` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730 Approved by: https://github.com/bdhirsh	2024-09-11 23:01:05 +00:00
Alexander Jipa	5ca46be15e	Fix/torch cat doc attr (#135698 ) The `torch.cat` attr name for tensors in the docs differs from the method signature, unlike other methods. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135698 Approved by: https://github.com/albanD Co-authored-by: Alexander Jipa <azzhipa@amazon.com>	2024-09-11 22:32:55 +00:00
Mayank Mishra	9a04cfbeff	fix for fp16 (#134106 ) This PR is a replacement for https://github.com/pytorch/pytorch/pull/133085 for pushing a quick fix for RMSNorm. The original author is @kkontny Previous PR summary: Since FP16 has quite a small dynamic range, it is very easy to overflow while computing `at::pow(input, 2)`, and it happens in real world computation. I've tried to use the `nn.RMSNorm` fused implementation instead of `LlamaRMSNorm` inside the `transformers` implementation of Llama (`src/transformers/models/llama/modeling_llama.py`). It started to give wrong answers in FP16 while still giving good ones in FP32. I figured out this happens due to overflow while computing the square of the input tensor. The original `LlamaRMSNorm` implementation upcasts input to fp32 to prevent this and give better numerical stability. ``` class LlamaRMSNorm(nn.Module): def __init__(self, hidden_size, eps=1e-6): """ LlamaRMSNorm is equivalent to T5LayerNorm """ super().__init__() self.weight = nn.Parameter(torch.ones(hidden_size)) self.variance_epsilon = eps def forward(self, hidden_states): input_dtype = hidden_states.dtype hidden_states = hidden_states.to(torch.float32) variance = hidden_states.pow(2).mean(-1, keepdim=True) hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon) return self.weight * hidden_states.to(input_dtype) ``` The proposed commit fixes the issue. FP16 in RMSNorm has to be treated in a special way to be usable in real world implementations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134106 Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy	2024-09-11 22:02:07 +00:00
Shubham Bhokare	66db61f0d1	[ONNX] Update fake mode usage in onnx docs (#135512 ) Update fake mode usage in onnx docs Pull Request resolved: https://github.com/pytorch/pytorch/pull/135512 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2024-09-11 21:29:04 +00:00
PyTorch MergeBot	c025f7becc	Revert "[Partitioner] Reuse partition to check whether nodes exist (#135317 )" This reverts commit e004d539da3335d97a8134c9081245628f18eb67. Reverted https://github.com/pytorch/pytorch/pull/135317 on behalf of https://github.com/izaitsevfb due to BC-breaking, breaks executorch and internal meta builds ([comment](https://github.com/pytorch/pytorch/pull/135317#issuecomment-2344730294))	2024-09-11 21:27:53 +00:00
FFFrog	8c4e1148b8	Refactoring byte_order (#135558 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135558 Approved by: https://github.com/mikaylagawarecki	2024-09-11 21:06:43 +00:00
Nikita Shulga	e20ee39558	Expand bitwise ops to unsigned types (#135525 ) Fixes https://github.com/pytorch/pytorch/issues/135436 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135525 Approved by: https://github.com/ezyang	2024-09-11 20:48:52 +00:00
Xinya Zhang	74fd1bf965	[ROCm] Update to AOTriton 0.7b (#134498 ) Notable changes: 1. Enable CudaGraph related tests 2. Fix UT problems 3. EXPERIMENTAL Navi31 support. Users should enable Navi31 support with the env var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` Known Problem: 1. `test/test_transformers.py` will have massive failures and/or NaN outputs with `--use-pytest` + Update: Confirmed that skipping `class TestSDPAPrivateUse1Only` can fix the problem with `--use-pytest` Note: AOTriton 0.7b adds support for nested tensors+SDPA but needs more work (and consequently a separate PR) to enable it. Fixes #133540 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet	2024-09-11 20:34:01 +00:00
Sidney Tsang	5d964a5eb7	[Export] Fix SDPA decomposition (#135297 ) Summary: Update SDPA decomposition to match the updated stride from D62009189, which aligns strides with `aten._scaled_dot_product_attention_math.default` and makes `t.permute().contiguous().permute()` no longer necessary. Test Plan: CI Differential Revision: D62278378 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135297 Approved by: https://github.com/drisspg	2024-09-11 20:21:59 +00:00
Bin Bao	118d7e1480	[Inductor] add _dynamo.reset to test_cat_slice_cat_cuda (#135694 ) Summary: test_cat_slice_cat_cuda runs inductor multiple times and check counters["inductor"] in between, and thus we need to reset properly. Differential Revision: D62500331 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135694 Approved by: https://github.com/masnesral	2024-09-11 20:07:11 +00:00
Bob Ren	dd47f6f623	Simplify expr before getting implications in _maybe_evaluate_static (#135499 ) Fixes #134268 Previously we weren't simplifying these expressions before calling get_implications, resulting in inconsistent application of FloorDiv/CleanDiv. See #134268 for more details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135499 Approved by: https://github.com/ezyang	2024-09-11 19:48:29 +00:00
Tom Ritchford	e05ea2b179	Add decomposition for transpose_copy (#130943 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130943 Approved by: https://github.com/amjames, https://github.com/eellison	2024-09-11 19:45:22 +00:00
Shangdi Yu	ad75b09d89	Replace capture_pre_autograd_graph with export_for_training in torch tests (#135623 ) Summary: as title Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_conv_dynamic buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r matcher buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r x86 ``` CI Differential Revision: D62448302 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135623 Approved by: https://github.com/tugsbayasgalan	2024-09-11 19:23:08 +00:00
rzou	a2cb9b7331	Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#135581 ) This is to match the default layout constraint for custom operators. By default, Inductor should match the stride order of inputs to a triton kernel. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/135581 Approved by: https://github.com/eellison ghstack dependencies: #135530	2024-09-11 18:43:18 +00:00
Edward Z. Yang	451eaf0ff2	Log full exception trace when error raised in Dynamo (#135697 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135697 Approved by: https://github.com/Skylion007	2024-09-11 18:14:33 +00:00
Zain Rizvi	09519eb195	Support rolling over a percentage of workflows (#134816 ) In order to support adding a rollover percentage, this ended up being a complete rewrite of runner_determinator.py. Details of the new format are in the comments up top. On the plus side, this now includes some unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134816 Approved by: https://github.com/PaliC, https://github.com/zxiiro	2024-09-11 18:01:26 +00:00
Bob Ren	5314ae2660	Don't use exception chaining for BackendCompilerFailed (#135545 ) Commandeered from https://github.com/pytorch/pytorch/pull/135496 as I'm now helping @ezyang ship dynamic float arguments in PT2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135545 Approved by: https://github.com/ezyang	2024-09-11 17:49:18 +00:00
Jack Taylor	da587de9cb	[ROCm] [BUGFIX] Re-enable rocm-specific tuning parameters v2 (#133852 ) Small bug fix - https://github.com/pytorch/pytorch/pull/124592 replaced the torch.version.hip with device_props but made a mistake in porting the original logic. The original code was: `if torch.version.hip is not None:` Which was incorrectly replaced by: `if self.device_props.type != "hip":` Another occurrence of https://github.com/pytorch/pytorch/pull/130617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133852 Approved by: https://github.com/masnesral, https://github.com/malfet	2024-09-11 17:21:40 +00:00
Jithun Nair	82a4df2d5f	[CI] [ROCm] Run rocm workflow on every push to main branch (#135644 ) Dial the frequency back up from https://github.com/pytorch/pytorch/pull/131637 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135644 Approved by: https://github.com/huydhn	2024-09-11 17:21:05 +00:00
Catherine Lee	18a9030952	[CI] Fix update slow tests (#135390 ) * Add pytorchbot to list of approvers for file * Add labels to the auto created PR The auto generated PR is currently not merging due to some failing tests on slow workflow that were supposed to be moved back to normal idk if this has much value, clearly we've been managing without the update Pull Request resolved: https://github.com/pytorch/pytorch/pull/135390 Approved by: https://github.com/ZainRizvi	2024-09-11 17:02:17 +00:00
Isuru Fernando	03f23d07b4	Optimize ShapeEnv.replace (#135652 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135652 Approved by: https://github.com/ezyang ghstack dependencies: #135621, #135622	2024-09-11 16:50:59 +00:00
Isuru Fernando	8c738c9270	Improve performance of sympy_generic_le (#135622 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135622 Approved by: https://github.com/ezyang ghstack dependencies: #135621	2024-09-11 16:20:03 +00:00
Isuru Fernando	7ddacaf40a	Improve performance of canonicalize_bool_expr (#135621 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135621 Approved by: https://github.com/ezyang	2024-09-11 16:20:03 +00:00
PyTorch MergeBot	183c32fd3b	Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137 )" This reverts commit 0d15122092c27fec1143b800bab7c996d126b547. Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](`444b52ff40`), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/133137#issuecomment-2344054339))	2024-09-11 15:57:00 +00:00
PyTorch MergeBot	3ab12e2596	Revert "[Dynamo] Support thread local setattr (#135443 )" This reverts commit 160c228a4bd60ceffa62b045a6b0a6f9413835c5. Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](`444b52ff40`), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/135443#issuecomment-2344042800))	2024-09-11 15:53:55 +00:00
PyTorch MergeBot	596e93b506	Revert "[dynamo] Bug fix for _torchdynamo_inline source handling (#135612 )" This reverts commit 5c3d0a2dedbc0e85f3b256ce56ac674078a5fae1. Reverted https://github.com/pytorch/pytorch/pull/135612 on behalf of https://github.com/clee2000 due to broke inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_linear_input_transpose_bias_True_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10805518363/job/29982386304) [HUD commit link](`5c3d0a2ded`), bad TD ([comment](https://github.com/pytorch/pytorch/pull/135612#issuecomment-2344039370))	2024-09-11 15:51:12 +00:00
PyTorch MergeBot	f96e8041b1	Revert "[Dynamo] Simplify torch function mode stack guard (#135444 )" This reverts commit 444b52ff40cf4afce7bc3fdcf021a88eab3b954c. Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](`444b52ff40`), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/135444#issuecomment-2344036843))	2024-09-11 15:48:27 +00:00
PyTorch MergeBot	7cf9c81918	Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732 )" This reverts commit 6a3edfcc1e474e6ebd0c06624000a6d6bf1a0dee. Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/clee2000 due to broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](`444b52ff40`), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2344016694))	2024-09-11 15:39:21 +00:00
Sam Larsen	49e0b88aab	Fix test_triton_kernel_float64_constant (#135583 ) Summary: Landed https://github.com/pytorch/pytorch/pull/135260 too soon and the test in that PR doesn't do exactly what I tested (actually test different dtypes). Test Plan: `python test/inductor/test_triton_kernels.py -k float64_constant` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135583 Approved by: https://github.com/isuruf, https://github.com/eellison, https://github.com/Skylion007	2024-09-11 15:16:23 +00:00
Pushpak Raj Gautam	ee8c5cc1cc	For S444023: Back out "deprecate `search_autotune_cache` (#133628 )" (#135186 ) Summary: For S444023 Test Plan: Revert prevented the NaN errors - f639391901 Training job ran for 7767 iterations. NaN errors show up within the first 1k. Reviewed By: nmacchioni Differential Revision: D62224747 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135186 Approved by: https://github.com/kit1980	2024-09-11 14:08:40 +00:00
Nikita Lutsenko	ce4d146f56	ATen \| Fix MPSCNNNeuron creation on Mac Catalyst. (#135595 ) Summary: These are still utilized directly when using relu/sigmoid/tanh tensors directly from here: https://fburl.com/code/k6n7ofzd However, on Mac Catalyst we always were returning `nil`, as such in most cases yielding the entire graph completely useless and most often just stray `MPSTemporaryImage` references that were never written into. This fixes the issue completely by making sure that we always return the valid kernels back, so they can be executed. Test Plan: Test with segmentation net that uses a combination of relu and other tensors together - run this via Mac Catalyst build - it works! {F1858576745} Reviewed By: MichaelTay Differential Revision: D62430010 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135595 Approved by: https://github.com/MichaelTay	2024-09-11 11:12:23 +00:00
Amadeusz Skrzypczak	0226fcaacf	Disable cuda specific restrictions in _scaled_mm for other devices (#135579 ) Fixes #135576 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135579 Approved by: https://github.com/drisspg	2024-09-11 11:05:38 +00:00
Yanbo Liang	4cde5096c4	[Inductor][FlexAttention] Supports dynamic shapes with block mask (#135629 ) Fixes #134560 and #135206 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135629 Approved by: https://github.com/drisspg	2024-09-11 08:10:50 +00:00
Ke Wen	443c015393	[Distributed] Improve efficiency of NaN checker (#135414 ) Some customers would like to run the NaN checks on the fly, so we are improving its efficiency. ## Benchmarking Allreduce 2G floats. `TORCH_NCCL_NAN_CHECK=1` Red kernel: ncclAllreduce Blue kernel: NaN check <img width="1093" alt="Screenshot 2024-09-06 at 10 00 05 PM" src="https://github.com/user-attachments/assets/5501bc31-024f-4115-adb2-dd66eb4025d3"> ## Comparison with torch ops: Let's say a user manually checks for NaNs with the following torch ops before all-reduce: ``` torch.any(torch.isnan(x)) ``` <img width="1091" alt="Screenshot 2024-09-06 at 10 14 53 PM" src="https://github.com/user-attachments/assets/1f8b5f63-c955-4612-bb96-241b6c69959b"> So our perf is on par with torch ops. ## Changes - Load from vidmem using "big packs" of 16 bytes - Bump `blockDim.x` from 256 to 512 - Separate loads and checks into two loops, each of 8 iterations - Unroll the loops - Templated functions for checking NaN in a "big pack" based on dtype Special thanks to @jbachan from NCCL! Pull Request resolved: https://github.com/pytorch/pytorch/pull/135414 Approved by: https://github.com/wconstab	2024-09-11 07:53:42 +00:00
Yiming Zhou	4ae6d7c18f	Back out "[pytorch][PR] [export] fix re-export custom metadata" (#135634 ) Summary: Broke some tests. Revert this diff Test Plan: CI Differential Revision: D62474337 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135634 Approved by: https://github.com/tugsbayasgalan	2024-09-11 06:16:26 +00:00
Eddie Yan	3084b7b5c0	[cuDNN][SDPA] Support `attn_bias` in cuDNN (#130482 ) CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/130482 Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-11 05:59:25 +00:00
Animesh Jain	5c3d0a2ded	[dynamo] Bug fix for _torchdynamo_inline source handling (#135612 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135612 Approved by: https://github.com/drisspg ghstack dependencies: #135588	2024-09-11 05:23:42 +00:00
fduwjj	c608b17f60	[PTD][BE][c10d] Add some code documents for TCPStore code and cosmetic changes to libUVStore code (#130496 ) While designing something else that needs TCPStore, I spent some time digging into the codebase of TCPStore and found that the code is a little bit challenging to understand without proper documents. Although people from the OSS community must be smarter than me, I still want to document my findings in the code so that devs and users can use them as a reference down the road. Also for libuv, we need to name private variables with a "_", so it's a pure renaming of private variables such as `tcpServer`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130496 Approved by: https://github.com/wconstab	2024-09-11 04:42:25 +00:00
Michael Lazos	444b52ff40	[Dynamo] Simplify torch function mode stack guard (#135444 ) The semantics of ignored modes previously had edge cases, this eliminates these by in essence filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422. The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444 Approved by: https://github.com/anijain2305, https://github.com/williamwen42 ghstack dependencies: #134732, #133137, #135443	2024-09-11 04:18:22 +00:00
Michael Lazos	160c228a4b	[Dynamo] Support thread local setattr (#135443 ) In preparation for tracing through DeviceContext (`defb515306/torch/utils/_device.py (L66)`) This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443 Approved by: https://github.com/anijain2305 ghstack dependencies: #134732, #133137	2024-09-11 04:18:22 +00:00
Michael Lazos	0d15122092	[Dynamo] Trace torch function modes entered outside of torch.compile (#133137 ) This PR adds initial tracing for torch function modes. Details: In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call. This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers. Previously landed: https://github.com/pytorch/pytorch/pull/133135 https://github.com/pytorch/pytorch/pull/133136 https://github.com/pytorch/pytorch/pull/133134 https://github.com/pytorch/pytorch/pull/133133 https://github.com/pytorch/pytorch/pull/133132 https://github.com/pytorch/pytorch/pull/133131 https://github.com/pytorch/pytorch/pull/133729 https://github.com/pytorch/pytorch/pull/133130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137 Approved by: https://github.com/jansel, https://github.com/zou3519 ghstack dependencies: #134732	2024-09-11 04:18:22 +00:00
Michael Lazos	6a3edfcc1e	[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732 ) For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732 Approved by: https://github.com/ydwu4	2024-09-11 04:18:22 +00:00
penguin-wwy	356f14e7b7	Fix the output of FileCheck when not run and add unit tests (#135345 ) When FileCheck is destructed without execution, it should output all rules. For example: ``` >>> fc = FileCheck().check("test") >>> del fc You have not run this instance of FileCheck! FileCheck checks: CHECK: test ``` Additionally, unit tests for the Python interface of FileCheck will be added. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135345 Approved by: https://github.com/eellison	2024-09-11 04:13:24 +00:00
Sathyanarayanan Saravanamuthu	34dc8f69a1	Adding entry-point based support for out-of-tree rendezvous plugins (#132633 ) Fixes #127519 Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their out-of-tree rendezvous backend implementations as Python packages. #### AUTHORING NEW PLUGIN Any new plugin will be a Python package exposing entry-points. For example, the structure of the redis plugin is as follows: ``` plugin_root \|_ pyproject.toml \|_ src \|_ redis \|_ __init__.py \|_ redis_store.py \|_ redis_backend.py ``` The contents of the `pyproject.toml` should indicate that this exposes a torchrun entry-point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for the redis plugin would be as follows: ``` [project] name = "redis" version = "0.0.1" [project.entry-points.'torchrun.plugins'] redis = 'redis' ``` The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows: ``` def getPluginHandler(): def _create_redis_handler(params: RendezvousParameters): from redis_rendezvous_backend import create_backend backend, store = create_backend(params) return create_handler(store, backend, params) return _create_redis_handler ``` The files `redis_store` and `redis_backend` contain the implementation of [Store](`41189b0da4/torch/_C/_distributed_c10d.pyi (L171)`) and [RendezvousBackend](`e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)`) respectively. #### USER EXPERIENCE Before using the plugin for the first time, the user has to install the plugin packages. For example, published packages can be installed using `pip3 install <plugin-name>` and a plugin in the local file system can be installed using `pip3 install -e <plugin-location>`. 
Once installed, the new backend can be used in torchrun as follows: ``` torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633 Approved by: https://github.com/fduwjj	2024-09-11 03:35:02 +00:00
angelayi	cd9ee49a69	[aoti] Add cpp loader (#135374 ) * Added a cpp loader, AOTIModelPackageLoader, which can load the .pt2, build the .so, and create a runner. The python-facing API is that users can directly call the `run` function, whereas in cpp users can directly access the `runner_` if they are more familiar with that. I couldn't figure out how to bind the `get_runner()` function to python... * Added a new config, `aot_inductor.package_cpp_only` which will not package the so. This means that whenever the package is loaded, we will need to build the so. This is turned off by default so that new environments do not need to rebuild their so. The `package_cpp_only` is a feature which torchchat intends to use to provide flexibility to users. * Added a new config, `aot_inductor.metadata` which stores user-provided metadata, serialized to the pt2 as a json file. It also stores the device used when exporting, "cuda" or "cpu", so that during load time, we can use that data to determine which AOTIModelContainerRunner to use. The metadata can be accessed through `loader.get_metadata()`. TODO is to move this metadata to the toplevel `package_aoti` function so that we can remove the metadata as a config. * Separated out `package_aoti` as a standalone function, instead of it automatically being called in inductor. This is to prepare for the case where users will compile multiple models, and want to bundle it in one package. The specific use case is in torchchat, where we want to package the separately-exported encoder and decoder layers. An example of how to use this is in `test_multiple_methods`. * `load_package` will load a singular model, given the model name. * The loader doesn't support windows for now, I think I need to add some more casing to make the build commands work on windows? 
Differential Revision: [D62329906](https://our.internmc.facebook.com/intern/diff/D62329906) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135374 Approved by: https://github.com/desertfire, https://github.com/malfet	2024-09-11 03:00:01 +00:00
chuanqiw	26e5572dd2	Bump triton xpu pin and release version (#135638 ) Similar with https://github.com/pytorch/pytorch/pull/135627 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135638 Approved by: https://github.com/atalman	2024-09-11 00:56:15 +00:00
Animesh Jain	693897df42	[dynamo] Missing guard source keys for corner case of NNModuleVariabl… (#135041 ) Potentially fixes - https://fb.workplace.com/groups/1286739428954016/permalink/1319662695661689/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/135041 Approved by: https://github.com/ezyang	2024-09-11 00:43:26 +00:00
Nikita Shulga	3bf6be457d	[MPS] Add missing dispatch to rshift.Tensor (#135607 ) Missed it while working on https://github.com/pytorch/pytorch/pull/131813 Test plan: `python -c "import torch;print(torch.randint(100, 500, (64,), device='mps') >> torch.tensor([3,], device='mps'))"` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135607 Approved by: https://github.com/manuelcandales	2024-09-11 00:20:53 +00:00
titaiwangms	492f064f15	[ONNX] Add assertion nodes to ignoring list (#135591 ) Fixes #135419 PS: there are 104 empty output nodes, I suggest we add them one by one when we run into them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135591 Approved by: https://github.com/justinchuby	2024-09-11 00:18:17 +00:00
rzou	29408ea81a	Add option to tweak inductor stride settings for user-defined triton kernels (#135530 ) Previously, Inductor was allowed to modify the stride/storage_offset (layout) for inputs to user-defined triton kernels. This can cause silent incorrectness because most triton kernels are written for a specific striding pattern (usually contiguous). This PR adds a config to allow the user to choose Inductor's behavior on this. The options are: - "flexible_layout" (default): Inductor can modify the layout for inputs to user-defined triton kernels as much as it wants. - "needs_fixed_stride_order": Inductor must preserve the stride order (when compared to tracing) for inputs to user-defined triton kernels. This matches our handling for custom operators. In the future, we'll want a "needs_exact_strides" option (this is the safest option). Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/135530 Approved by: https://github.com/FindHao, https://github.com/oulgen	2024-09-11 00:11:17 +00:00
Haoming Lu	02dcb07765	Add boolean support in pack segments ops for both cpu and cuda impls (#132897 ) (#135620 ) Summary: Same as int types, forward only. bypass-github-export-checks diff has been synced to github Test Plan: buck test mode/dev-nosan //caffe2/torch/fb/sparsenn:test -- test_pack_segments https://www.internalfb.com/intern/testinfra/testconsole/testrun/16888498646804437/ Reviewed By: garroud Differential Revision: D60785563 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135620 Approved by: https://github.com/kit1980 Co-authored-by: Haoming Lu <haominglu@meta.com>	2024-09-11 00:03:17 +00:00
Animesh Jain	5c38aa72c0	[dynamo][dicts][nv-embed] Support update with kwargs (#135588 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135588 Approved by: https://github.com/yanboliang	2024-09-10 23:50:23 +00:00
atalman	5134ba7458	Bump triton pin and release version (#135627 ) Update the pin and release version to sync with https://github.com/triton-lang/triton/tree/release/3.1.x Pull Request resolved: https://github.com/pytorch/pytorch/pull/135627 Approved by: https://github.com/Chillee, https://github.com/drisspg, https://github.com/malfet	2024-09-10 23:46:36 +00:00
titaiwangms	e48ee2cf50	[ONNX] Fix scaled_dot_product_attention with float scale (#135594 ) Fixes #125158 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135594 Approved by: https://github.com/justinchuby	2024-09-10 23:04:02 +00:00
hongxyan	eb38ee21ba	[ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config (#135397 ) Fixes #132964 This change is to optimize torch.sum() performance by increasing max_values_per_thread in setReduceConfig() for ROCm platform. By increasing this parameter, it uses fewer threadblocks and improves the performance. Test: Tested on MI300x and H100, and now the MI300x perf improved to 3205GByte/s from ~1690GByte/s for the test case and is slightly better than H100 (3136GByte/s). Also tested with other different sizes of tensors and also see perf improvement. ```python import torch from triton.testing import do_bench x = torch.randn(2**30, device='cuda') ms = do_bench(lambda: x.sum(dim=-1)) bandwidth_gbyte = x.numel() * x.dtype.itemsize / (10**9) time_s = ms / 1000 bw_per_second = bandwidth_gbyte / time_s print(bw_per_second) ``` Co-author: @carlobertolli Pull Request resolved: https://github.com/pytorch/pytorch/pull/135397 Approved by: https://github.com/eqy, https://github.com/malfet	2024-09-10 21:03:01 +00:00
Shunting Zhang	8057b72763	[ez][inductor] don't benchmark cloning if there are no mutated args (#135533 ) When a kernel does not have mutated args (this is quite common?), benchmarking the cost of cloning actually benchmarks a no-op. This still takes >100ms since triton.testing.do_bench will allocate 100 ms budget to run the kernel. Skipping this benchmarking can save quite some compilation time if the code path is hit multiple times. Let's say, if the code path is hit 100 times when the graph is large, we would save >10s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135533 Approved by: https://github.com/jansel ghstack dependencies: #135531	2024-09-10 20:54:31 +00:00
Shunting Zhang	7b17918dc9	[inductor] fix a device sync issue for benchmarking fusion (#135531 ) Fix https://github.com/pytorch/pytorch/issues/134768 . When we benchmark the latency for a fused node set, we do benchmarking twice: 1. benchmark the latency of the kernel including cloning mutated args 2. benchmark the latency of cloning mutated args without running the kernel We subtract result 2 from result 1 to get the latency of the kernel itself. But when the tensors are not on cuda device 0, we get equal numbers for result 1 and result 2 no matter how much work the kernel does. The root cause is that, in `triton.testing.do_bench`, the `torch.cuda.synchronize` call syncs the current cuda device (which is device 0 if it's not overridden). But since the tensors and kernels are located on another device, the sync actually does nothing (unless there happen to be other kernels on device 0). The fix is to set the correct current device in our benchmarking code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135531 Approved by: https://github.com/jansel	2024-09-10 20:54:31 +00:00
Yiming Zhou	66c45f3ed9	[export] fix re-export custom metadata (#135282 ) Fixes #134778 When a model is exported and debug handles are added to the "custom" field of non-placeholder and non-output nodes in the graph, re-exporting it will change the metadata of placeholder nodes (the "custom" field will be added or copied to these nodes, depending whether `ExportedProgram` or `ExportedProgram.module()` is passed to `generate_numeric_debug_handle()`). This occurs because when we re-export the model, `placeholder` nodes are unlifted to `get_attr` nodes. These nodes remain as `get_attr` after being exported to `gm_torch_level`. Their metadata are modified [here](https://github.com/pytorch/pytorch/blob/main/torch/export/_trace.py#L1347) based on `params_buffers_to_node_meta` which is collected [here](https://github.com/pytorch/pytorch/blob/main/torch/export/_trace.py#L1312). Pull Request resolved: https://github.com/pytorch/pytorch/pull/135282 Approved by: https://github.com/jerryzh168, https://github.com/zhxchen17, https://github.com/tugsbayasgalan	2024-09-10 20:15:02 +00:00
PyTorch MergeBot	0a9d55d2ee	Revert "[AOTI] Fix assert_function call in cpu autotune template (#135086 )" This reverts commit 16c3b8f87cfa9cb5acee8104820baa389e7ee2bd. Reverted https://github.com/pytorch/pytorch/pull/135086 on behalf of https://github.com/izaitsevfb due to breaks internal tests, see D62405818 ([comment](https://github.com/pytorch/pytorch/pull/135086#issuecomment-2341889428))	2024-09-10 19:51:16 +00:00
Catherine Lee	4ca65d3323	[CI] Increase sharding for jobs that are timing out (#135582 ) Increase sharding for * slow grad check * slow cuda tests slow / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test * avx Pull Request resolved: https://github.com/pytorch/pytorch/pull/135582 Approved by: https://github.com/huydhn, https://github.com/malfet	2024-09-10 19:45:13 +00:00
Andrew Gu	c932b39739	[FSDP2] Added `_set_unshard_async_op` (#135523 ) This PR adds a private API `_set_unshard_async_op` that allows for running pre-forward and pre-backward all-gathers using the `async_op=True` path so that all-gather allocations happen in the default stream to avoid inter-stream fragmentation. If using this option, forward requires explicit prefetching e.g. via the `unshard(async_op=True)` API for overlap. fp32 -> bf16 casts and the all-gather copy-in will not overlap with compute. Differential Revision: [D62401551](https://our.internmc.facebook.com/intern/diff/D62401551) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135523 Approved by: https://github.com/weifengpy	2024-09-10 19:28:02 +00:00
Rachel Guo	1f15973657	[AOTI][Tooling][7/n] Add debug printing support for JIT inductor codegen path as well (#135285 ) Summary: 1. Add the debug printer call to a level lower for triton kernel python wrapper codegen path 2. Add `torch.save()` for jit inductor as well 3. This also fixes the issue introduced in D61949020 (at python wrapper code level for triton kernel not printing) Test Plan: ``` AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda ``` Differential Revision: D62272588 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135285 Approved by: https://github.com/chenyang78	2024-09-10 19:24:58 +00:00
Dan Zimmerman	fc88ba260f	[amdsmi][torch] Update amdsmi API usages (#135504 ) Summary: In ROCm 6.2.0 there were API name changes-- we check if the new APIs exist and use them in this diff; see `7b2463abe0` for the changes Test Plan: CI Differential Revision: D62325661 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135504 Approved by: https://github.com/eqy, https://github.com/houseroad	2024-09-10 19:15:39 +00:00
Sam Larsen	bf8d0e3107	[inductor] Enable subprocess parallel compile internally with killswitch (#132467 ) Differential Revision: [D60629630](https://our.internmc.facebook.com/intern/diff/D60629630) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132467 Approved by: https://github.com/eellison	2024-09-10 19:05:46 +00:00
Shivam Raikundalia	3a1239a248	[Profiler] Harden Record Function Kwargs (#135365 ) Summary: In S445839, we had HTA break because of the "stream" parameter that was added to gpu traces. This brought up discussions regarding hardening our post processing of said inputs as to not break JSON schema as well as downstream tools. For this reason, this diff does the following. 1. Only allow int, double, bool and string values to be processed as kwinputs for JSON output. We can handle lists if needed in the future. 2. Make sure that any boolean is lowercase when a string so that the JSON does not break when parsing it 3. Force stream parameter to be an int Test Plan: Added unit tests to ensure that the list of requirements above is true for kwargs only. Differential Revision: D62304843 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135365 Approved by: https://github.com/aaronenyeshi	2024-09-10 18:44:05 +00:00
Sam Larsen	4f9f1775d8	Fix flaky TestCudaWrapper.test_randint_cuda_cuda_wrapper (#135370 ) Summary: This test is flaky when run after `test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper` because the TestCase sets config options globally in its setUp() that stick around for subsequent tests. For test isolation, we use a contextlib.ExitStack pattern in other tests to patch the config options and restore them in tearDown(). Update all TestCases in `test/inductor/test_combo_kernels.py` to use that pattern. Test Plan: ``` python test/inductor/test_combo_kernels.py python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper TestCudaWrapper.test_randint_cuda_cuda_wrapper ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135370 Approved by: https://github.com/jansel	2024-09-10 18:43:14 +00:00
Thanh Ha	5e0788befb	Migrate remaining jobs to use runner determinator (#134867 ) At this point all self-hosted runner jobs should be using the runner determinator to switch between LF and Meta runners. This change updates the remaining jobs that have not yet been migrated over. Issue: https://lf-pytorch.atlassian.net/browse/PC-25 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134867 Approved by: https://github.com/ZainRizvi	2024-09-10 18:14:00 +00:00
Ivan Zaitsev	440f8f57af	Revert "[fx] Bypass custom __setattr__ in Node.__init__ (#135079 )" (#135562 ) This reverts commit 66da3b3b2acacb116a9b23e91b24934830eaf6b8. #135079 breaks internal tests and needs to be reverted. Revert with mergebot doesn't work as this PR is technically part of the stack, but, according to @jansel, it should be possible to revert it individually. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135562 Approved by: https://github.com/jansel, https://github.com/seemethere	2024-09-10 18:07:11 +00:00
Zhou, Lingzhi	e004d539da	[Partitioner] Reuse partition to check whether nodes exist (#135317 ) Checking whether a node is in a NodeList takes O(n) time. Reuse the partition to speed this up, since partition.nodes is a hash table and contains the same elements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135317 Approved by: https://github.com/ezyang	2024-09-10 17:45:29 +00:00
Zixi Qi	c4b84a46a9	Add more logging to TunableOp validators (#135396 ) Summary: Add more logging to TunableOp validators Test Plan: Verified additional logging when loading kernel selections: ``` ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack- HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178 ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39 PT_VERSION validation: expect 2.5.0 to match 2.5.0 ``` ``` [qizixi@devgpu039.atn3 /data/users/qizixi/fbsource/fbcode (f9305317d\|remote/master)]$ PYTORCH_TUNABLEOP_VERBOSE=1 buck2 run mode/{opt,amd-gpu} -c fbcode.enable_gpu_sections=true //scripts/xdwang/example:fc_llama -- --enable-tuning File changed: fbcode//hipblas_tuning_pt_llama0.csv Buck UI: https://www.internalfb.com/buck2/1ed2fac4-743e-49ef-805f-7fb6b9300022 Network: Up: 0B Down: 0B Jobs completed: 4189. Time elapsed: 0.2s. BUILD SUCCEEDED Enabled tuning - Run Linear (matmul) 2 x 1280 x 8192, dtype = torch.bfloat16 INFO:2024-09-06 14:38:07 2834864:2835138 CuptiActivityProfiler.cpp:260] HIP versions. 
Roctracer: 4.1; Runtime: 60032830; Driver: 60032830 INFO:2024-09-06 14:38:07 2834864:2836083 DynoConfigLoader.cpp:61] Setting communication fabric enabled = 0 reading tuning results from hipblas_tuning_pt_llama0.csv Validator PT_VERSION=2.5.0 Validator ROCM_VERSION=6.0.0.0-12969-1544e39 Validator HIPBLASLT_VERSION=800-a15e4178 Validator GCN_ARCH_NAME=gfx942:sramecc+:xnack- Validator ROCBLAS_VERSION=4.0.0-72e57364-dirty ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack- HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178 ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39 PT_VERSION validation: expect 2.5.0 to match 2.5.0 Loading results Avg time: 13.165860176086426 us, Achieved 3.19 TFLOPS, 1598.24 GB/s - Run Linear (matmul) 2 x 8192 x 1024, dtype = torch.bfloat16 Avg time: 13.230760097503662 us, Achieved 2.54 TFLOPS, 1271.14 GB/s - Run Linear (matmul) 2 x 7168 x 8192, dtype = torch.bfloat16 Avg time: 26.804399490356445 us, Achieved 8.76 TFLOPS, 4384.90 GB/s - Run Linear (matmul) 2 x 8192 x 3584, dtype = torch.bfloat16 Avg time: 13.407809734344482 us, Achieved 8.76 TFLOPS, 4384.14 GB/s 2x1280x8192-torch.bfloat16,13.165860176086426,3.18574247630113,1598.237845349412 2x8192x1024-torch.bfloat16,13.230760097503662,2.536092541374924,1271.1420867780075 2x7168x8192-torch.bfloat16,26.804399490356445,8.762778814892096,4384.9040543618985 2x8192x3584-torch.bfloat16,13.407809734344482,8.759112362638383,4384.138585247748 ``` Reviewed By: leitian Differential Revision: D62322830 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135396 Approved by: https://github.com/eqy	2024-09-10 17:20:59 +00:00
cyy	bc1b8f094d	Check function declarations of Core ML code (#135467 ) Relax the restrictions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135467 Approved by: https://github.com/ezyang	2024-09-10 16:05:22 +00:00
rzou	f65a564fa2	[inductor] Flip custom_op_default_layout_constraint (#135239 ) By default, Inductor should respect the stride order of input Tensors to custom operators. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/135239 Approved by: https://github.com/albanD ghstack dependencies: #135391	2024-09-10 14:27:43 +00:00
Edward Z. Yang	386b313028	Handle KeyError for compiler collective in scalars too (#135385 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135385 Approved by: https://github.com/jansel	2024-09-10 12:33:04 +00:00
torotoki	6d7cbc20d2	Add dynamo itertools.pairwise support (#135416 ) Fixes #133766 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135416 Approved by: https://github.com/XuehaiPan, https://github.com/jansel Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>	2024-09-10 11:37:59 +00:00
xinan.lin	ca16956b20	[Inductor] Generalize device guard codegen for cpp_wrapper mode. (#134761 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134761 Approved by: https://github.com/jansel, https://github.com/EikanWang ghstack dependencies: #134693	2024-09-10 10:11:52 +00:00
xinan.lin	67735d1ee8	[Inductor] Generalize `is_cuda` to specific device_type to make cpp_wrapper mode be extensible (#134693 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134693 Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/jansel	2024-09-10 10:11:13 +00:00
Boyuan Feng	6e13f5eb38	[FlexAttention] Add broadcast support for kv batch dimension (#135505 ) This PR adds broadcast support for KV batch dimension. ## Details Consider Q of shape `[Bq, Hq, Q_LEN, D]`, and K, V of shape `[Bkv, Hkv, KV_LEN, D]`. Prior to this diff, we required `Bq == Bkv`. However, for some use cases, we may have Bkv < Bq. For example, in paged attention, we provide K, V of shape `[1, Hkv, MAX_LEN, D]`, while still providing Q of shape `[Bq, Hq, Q_LEN, D]`. Here, MAX_LEN is the maximal number of tokens supported by paged attention. This PR relaxes this requirement to be `Bq == Bkv or (Bq > 1 and Bkv == 1)`. This support covers flex decoding as well as flex attention forward and backward. ## Benchmark GPU: H100 We see negligible (1%~2%) performance change from this PR when `Bq == Bkv`. ``` python benchmarks/transformer/score_mod.py --calculate-bwd ``` ### Perf before this PR FWD \| Type \| Speedup \| score_mod \| mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| \|---------\|-----------\|---------------\|------------\|----------------\|------------------------------\| \| Average \| 0.743 \| \| \| \| \| \| Max \| 0.955 \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| \| Min \| 0.548 \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| BWD \| Type \| Speedup \| score_mod \| mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| \|---------\|-----------\|-------------\|------------\|----------------\|-----------------------------\| \| Average \| 0.834 \| \| \| \| \| \| Max \| 1.261 \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| \| Min \| 0.456 \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| <details> <summary> Full performance sweep </summary> \| score_mod \| mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| fwd_eager_time \| fwd_compiled_time \| bwd_eager_time \| bwd_compiled_time \| fwd_speedup \| bwd_speedup \| 
\|---------------\|------------\|----------------\|-------------------------------\|------------------\|---------------------\|------------------\|---------------------\|---------------\|---------------\| \| None \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 15.264 \| 17.184 \| 107.040 \| 140.800 \| 0.888 \| 0.760 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 15.840 \| 19.744 \| 112.576 \| 140.064 \| 0.802 \| 0.804 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 15.232 \| 17.344 \| 87.744 \| 142.496 \| 0.878 \| 0.616 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 15.264 \| 17.184 \| 108.192 \| 143.328 \| 0.888 \| 0.755 \| \| None \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 19.904 \| 22.400 \| 106.432 \| 136.512 \| 0.889 \| 0.780 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 19.424 \| 26.752 \| 91.712 \| 106.688 \| 0.726 \| 0.860 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 19.808 \| 22.432 \| 89.024 \| 101.920 \| 0.883 \| 0.873 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 19.840 \| 22.272 \| 88.896 \| 102.592 \| 0.891 \| 0.867 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 30.240 \| 32.416 \| 116.768 \| 112.256 \| 0.933 \| 1.040 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 29.536 \| 37.024 \| 113.664 \| 102.688 \| 0.798 \| 1.107 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 30.656 \| 32.800 \| 116.992 \| 127.008 \| 0.935 \| 0.921 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 30.592 \| 32.480 \| 116.928 \| 112.160 \| 0.942 \| 1.043 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 40.448 \| 61.920 \| 198.656 \| 204.512 \| 0.653 \| 0.971 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 37.760 \| 62.528 \| 
189.536 \| 170.624 \| 0.604 \| 1.111 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 40.896 \| 62.368 \| 198.304 \| 205.824 \| 0.656 \| 0.963 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 40.448 \| 61.952 \| 198.432 \| 203.648 \| 0.653 \| 0.974 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 318.528 \| 355.904 \| 947.232 \| 1162.496 \| 0.895 \| 0.815 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 199.776 \| 252.128 \| 677.792 \| 813.184 \| 0.792 \| 0.834 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 316.512 \| 363.328 \| 947.712 \| 1361.984 \| 0.871 \| 0.696 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 317.984 \| 356.864 \| 947.264 \| 1165.024 \| 0.891 \| 0.813 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 446.656 \| 734.656 \| 1664.288 \| 2172.960 \| 0.608 \| 0.766 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 278.688 \| 467.648 \| 1182.624 \| 1339.296 \| 0.596 \| 0.883 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 447.872 \| 744.096 \| 1662.944 \| 2196.544 \| 0.602 \| 0.757 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 448.128 \| 732.928 \| 1663.072 \| 2156.800 \| 0.611 \| 0.771 \| \| None \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 64) \| 15.648 \| 16.640 \| 107.520 \| 143.008 \| 0.940 \| 0.752 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 2, 512, 64) \| 15.776 \| 18.240 \| 129.056 \| 141.920 \| 0.865 \| 0.909 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 64) \| 15.168 \| 16.640 \| 103.616 \| 139.648 \| 0.912 \| 0.742 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 64) \| 15.616 \| 16.640 \| 128.608 \| 164.448 \| 0.938 \| 0.782 \| \| None \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 19.776 \| 21.952 
\| 125.344 \| 170.304 \| 0.901 \| 0.736 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 19.776 \| 23.712 \| 104.288 \| 196.896 \| 0.834 \| 0.530 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 19.072 \| 21.952 \| 102.080 \| 177.056 \| 0.869 \| 0.577 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 19.648 \| 21.920 \| 109.920 \| 170.848 \| 0.896 \| 0.643 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 30.464 \| 31.936 \| 127.808 \| 228.832 \| 0.954 \| 0.559 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 29.472 \| 33.856 \| 113.152 \| 215.072 \| 0.871 \| 0.526 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 30.496 \| 32.160 \| 116.576 \| 231.744 \| 0.948 \| 0.503 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 30.464 \| 31.904 \| 116.320 \| 229.824 \| 0.955 \| 0.506 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 40.480 \| 61.440 \| 176.448 \| 345.312 \| 0.659 \| 0.511 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 38.304 \| 59.424 \| 169.312 \| 371.360 \| 0.645 \| 0.456 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 40.960 \| 61.760 \| 176.512 \| 358.912 \| 0.663 \| 0.492 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 40.352 \| 61.696 \| 176.512 \| 344.928 \| 0.654 \| 0.512 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 316.224 \| 357.728 \| 905.728 \| 1668.448 \| 0.884 \| 0.543 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 199.904 \| 248.416 \| 636.544 \| 1109.088 \| 0.805 \| 0.574 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 314.880 \| 363.616 \| 906.304 \| 1658.176 \| 0.866 \| 0.547 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 316.160 \| 354.368 \| 906.080 \| 
1649.024 \| 0.892 \| 0.549 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 446.912 \| 739.840 \| 1555.808 \| 2521.952 \| 0.604 \| 0.617 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 279.776 \| 463.904 \| 1068.928 \| 1849.888 \| 0.603 \| 0.578 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 446.080 \| 748.960 \| 1553.504 \| 2629.888 \| 0.596 \| 0.591 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 446.208 \| 740.608 \| 1558.880 \| 2524.960 \| 0.602 \| 0.617 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 33.568 \| 41.280 \| 170.016 \| 147.584 \| 0.813 \| 1.152 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 30.688 \| 43.040 \| 159.552 \| 146.720 \| 0.713 \| 1.087 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 34.112 \| 41.504 \| 170.112 \| 152.672 \| 0.822 \| 1.114 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 34.240 \| 41.152 \| 170.272 \| 134.976 \| 0.832 \| 1.261 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 48.672 \| 76.416 \| 295.296 \| 263.648 \| 0.637 \| 1.120 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 45.088 \| 72.576 \| 281.920 \| 237.664 \| 0.621 \| 1.186 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 48.032 \| 76.672 \| 295.520 \| 265.248 \| 0.626 \| 1.114 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 48.096 \| 76.096 \| 295.456 \| 262.112 \| 0.632 \| 1.127 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 93.920 \| 111.232 \| 401.568 \| 382.944 \| 0.844 \| 1.049 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 68.192 \| 95.232 \| 338.752 \| 326.816 \| 0.716 \| 1.037 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 93.984 \| 111.840 \| 401.856 \| 444.224 \| 
0.840 \| 0.905 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 94.176 \| 110.496 \| 401.600 \| 383.136 \| 0.852 \| 1.048 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 131.488 \| 227.040 \| 727.424 \| 739.712 \| 0.579 \| 0.983 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 95.616 \| 169.760 \| 616.864 \| 574.112 \| 0.563 \| 1.074 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 131.680 \| 228.672 \| 727.616 \| 746.048 \| 0.576 \| 0.975 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 131.104 \| 225.696 \| 727.904 \| 735.392 \| 0.581 \| 0.990 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 1227.296 \| 1386.656 \| 3720.192 \| 4539.904 \| 0.885 \| 0.819 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 691.360 \| 831.712 \| 2515.872 \| 3067.808 \| 0.831 \| 0.820 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 1228.192 \| 1403.136 \| 3715.520 \| 5309.280 \| 0.875 \| 0.700 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 1229.024 \| 1384.992 \| 3715.904 \| 4550.368 \| 0.887 \| 0.817 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 128) \| 1784.832 \| 2865.888 \| 6539.840 \| 8460.224 \| 0.623 \| 0.773 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 128) \| 1017.408 \| 1660.480 \| 4369.824 \| 5056.992 \| 0.613 \| 0.864 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 128) \| 1792.448 \| 2904.864 \| 6546.080 \| 8537.024 \| 0.617 \| 0.767 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 128) \| 1795.552 \| 2856.864 \| 6544.672 \| 8400.160 \| 0.629 \| 0.779 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 34.240 \| 38.880 \| 148.832 \| 179.936 \| 0.881 \| 0.827 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 31.168 \| 
38.080 \| 138.528 \| 167.552 \| 0.818 \| 0.827 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 34.240 \| 39.168 \| 148.512 \| 181.248 \| 0.874 \| 0.819 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 34.240 \| 38.784 \| 148.864 \| 180.224 \| 0.883 \| 0.826 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 48.832 \| 76.352 \| 253.632 \| 295.968 \| 0.640 \| 0.857 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 45.760 \| 65.792 \| 239.040 \| 290.752 \| 0.696 \| 0.822 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 48.768 \| 76.576 \| 253.312 \| 304.032 \| 0.637 \| 0.833 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 48.768 \| 76.192 \| 253.600 \| 296.096 \| 0.640 \| 0.856 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 93.728 \| 109.728 \| 357.696 \| 498.912 \| 0.854 \| 0.717 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 68.704 \| 92.288 \| 295.616 \| 386.240 \| 0.744 \| 0.765 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 93.632 \| 111.392 \| 357.408 \| 512.448 \| 0.841 \| 0.697 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 93.280 \| 109.952 \| 357.696 \| 501.440 \| 0.848 \| 0.713 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 131.392 \| 230.496 \| 612.224 \| 807.552 \| 0.570 \| 0.758 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 96.512 \| 165.184 \| 502.624 \| 672.384 \| 0.584 \| 0.748 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 131.360 \| 232.608 \| 612.064 \| 832.320 \| 0.565 \| 0.735 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 131.008 \| 230.528 \| 612.640 \| 804.320 \| 0.568 \| 0.762 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 1227.968 \| 1377.408 \| 3477.920 \| 
5324.384 \| 0.892 \| 0.653 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 695.264 \| 824.544 \| 2268.224 \| 3210.208 \| 0.843 \| 0.707 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 1228.640 \| 1404.576 \| 3476.832 \| 5463.456 \| 0.875 \| 0.636 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 1228.416 \| 1378.752 \| 3478.048 \| 5367.712 \| 0.891 \| 0.648 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1788.736 \| 2867.712 \| 6039.520 \| 8616.256 \| 0.624 \| 0.701 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1021.952 \| 1653.824 \| 3866.208 \| 5306.848 \| 0.618 \| 0.729 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1786.752 \| 2896.352 \| 6044.128 \| 8871.360 \| 0.617 \| 0.681 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1786.080 \| 2868.672 \| 6040.160 \| 8550.144 \| 0.623 \| 0.706 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 64) \| 57.504 \| 71.552 \| 312.768 \| 255.040 \| 0.804 \| 1.226 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 16, 512, 64) \| 49.472 \| 71.104 \| 285.696 \| 243.520 \| 0.696 \| 1.173 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 64) \| 58.112 \| 72.896 \| 312.768 \| 288.256 \| 0.797 \| 1.085 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 64) \| 57.952 \| 71.680 \| 312.768 \| 255.552 \| 0.808 \| 1.224 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 82.336 \| 144.256 \| 580.128 \| 500.160 \| 0.571 \| 1.160 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 76.160 \| 123.712 \| 552.544 \| 447.648 \| 0.616 \| 1.234 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 82.400 \| 145.184 \| 580.032 \| 504.032 \| 0.568 \| 1.151 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 82.368 \| 
143.904 \| 580.192 \| 499.936 \| 0.572 \| 1.161 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 177.216 \| 209.568 \| 787.872 \| 747.712 \| 0.846 \| 1.054 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 121.984 \| 168.256 \| 651.968 \| 628.256 \| 0.725 \| 1.038 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 177.088 \| 211.488 \| 788.320 \| 864.352 \| 0.837 \| 0.912 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 177.440 \| 208.576 \| 787.424 \| 749.120 \| 0.851 \| 1.051 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 249.472 \| 441.376 \| 1405.440 \| 1431.648 \| 0.565 \| 0.982 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 172.960 \| 312.064 \| 1172.064 \| 1096.448 \| 0.554 \| 1.069 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 249.632 \| 446.336 \| 1405.408 \| 1448.480 \| 0.559 \| 0.970 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 250.944 \| 440.128 \| 1406.624 \| 1421.952 \| 0.570 \| 0.989 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 2418.720 \| 2747.936 \| 7330.432 \| 9023.712 \| 0.880 \| 0.812 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 1353.696 \| 1608.480 \| 4941.696 \| 6078.752 \| 0.842 \| 0.813 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 2427.456 \| 2746.816 \| 7329.792 \| 10539.968 \| 0.884 \| 0.695 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 2426.688 \| 2763.168 \| 7336.256 \| 9057.536 \| 0.878 \| 0.810 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 3554.240 \| 5634.400 \| 12919.872 \| 16843.489 \| 0.631 \| 0.767 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 2003.648 \| 3250.784 \| 8610.144 \| 10015.424 \| 0.616 \| 0.860 \| \| relative_bias \| 
None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 3582.080 \| 5710.944 \| 12923.328 \| 17011.871 \| 0.627 \| 0.760 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 3581.920 \| 5618.144 \| 12934.528 \| 16745.888 \| 0.638 \| 0.772 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 57.120 \| 71.232 \| 269.760 \| 295.680 \| 0.802 \| 0.912 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 49.408 \| 65.312 \| 242.304 \| 253.952 \| 0.756 \| 0.954 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 57.504 \| 72.544 \| 269.632 \| 298.976 \| 0.793 \| 0.902 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 57.760 \| 71.040 \| 269.600 \| 296.640 \| 0.813 \| 0.909 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 82.336 \| 147.168 \| 466.080 \| 487.456 \| 0.559 \| 0.956 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 76.704 \| 115.040 \| 435.392 \| 453.248 \| 0.667 \| 0.961 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 81.856 \| 147.424 \| 465.920 \| 499.552 \| 0.555 \| 0.933 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 81.760 \| 146.656 \| 466.176 \| 485.984 \| 0.557 \| 0.959 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 176.608 \| 206.976 \| 678.080 \| 866.976 \| 0.853 \| 0.782 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 121.664 \| 164.768 \| 538.240 \| 636.160 \| 0.738 \| 0.846 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 176.608 \| 209.664 \| 677.696 \| 883.424 \| 0.842 \| 0.767 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 177.440 \| 207.840 \| 677.248 \| 868.288 \| 0.854 \| 0.780 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 250.272 \| 449.536 \| 1163.424 \| 1420.832 \| 0.557 \| 0.819 \| \| None \| 
causal \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 173.472 \| 305.376 \| 929.408 \| 1104.544 \| 0.568 \| 0.841 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 249.376 \| 454.976 \| 1163.648 \| 1455.296 \| 0.548 \| 0.800 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 250.368 \| 450.144 \| 1163.520 \| 1409.984 \| 0.556 \| 0.825 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 2416.576 \| 2726.208 \| 6835.520 \| 10442.784 \| 0.886 \| 0.655 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 1357.440 \| 1590.752 \| 4433.664 \| 5975.296 \| 0.853 \| 0.742 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 2427.360 \| 2747.040 \| 6853.056 \| 10670.784 \| 0.884 \| 0.642 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 2441.120 \| 2718.944 \| 6836.640 \| 10433.792 \| 0.898 \| 0.655 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 128) \| 3555.392 \| 5620.960 \| 11944.000 \| 16504.801 \| 0.633 \| 0.724 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 128) \| 2010.848 \| 3241.152 \| 7636.064 \| 9870.464 \| 0.620 \| 0.774 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 128) \| 3557.440 \| 5688.352 \| 11935.744 \| 17090.496 \| 0.625 \| 0.698 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 128) \| 3562.720 \| 5630.432 \| 11939.168 \| 16392.033 \| 0.633 \| 0.728 \| </details> ### Perf after this PR FWD \| Type \| Speedup \| score_mod \| mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| \|---------\|-----------\|---------------\|------------\|----------------\|----------------------------\| \| Average \| 0.776 \| \| \| \| \| \| Max \| 1.006 \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| \| Min \| 0.566 \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| BWD \| Type \| Speedup \| score_mod \| 
mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| \|---------\|-----------\|-------------\|------------\|----------------\|-----------------------------\| \| Average \| 0.817 \| \| \| \| \| \| Max \| 1.150 \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| \| Min \| 0.454 \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| <details> <summary> Full performance sweep </summary> \| score_mod \| mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| fwd_eager_time \| fwd_compiled_time \| bwd_eager_time \| bwd_compiled_time \| fwd_speedup \| bwd_speedup \| \|---------------\|------------\|----------------\|-------------------------------\|------------------\|---------------------\|------------------\|---------------------\|---------------\|---------------\| \| None \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 15.680 \| 17.056 \| 64.544 \| 73.376 \| 0.919 \| 0.880 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 15.712 \| 19.872 \| 65.408 \| 72.864 \| 0.791 \| 0.898 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 16.160 \| 17.280 \| 64.896 \| 73.888 \| 0.935 \| 0.878 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 64) \| 16.192 \| 17.120 \| 64.896 \| 75.424 \| 0.946 \| 0.860 \| \| None \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 19.648 \| 22.496 \| 89.184 \| 82.592 \| 0.873 \| 1.080 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 20.320 \| 26.816 \| 91.264 \| 82.880 \| 0.758 \| 1.101 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 20.096 \| 22.528 \| 89.184 \| 83.776 \| 0.892 \| 1.065 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 16, 512, 128) \| 19.680 \| 22.432 \| 89.184 \| 120.096 \| 0.877 \| 0.743 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 32.384 \| 32.512 \| 119.232 \| 128.960 \| 0.996 \| 0.925 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 
16, 1024, 64) \| 30.176 \| 37.248 \| 113.664 \| 119.520 \| 0.810 \| 0.951 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 32.512 \| 32.928 \| 119.264 \| 131.456 \| 0.987 \| 0.907 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 32.448 \| 32.704 \| 119.200 \| 128.352 \| 0.992 \| 0.929 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 41.952 \| 62.176 \| 199.040 \| 214.304 \| 0.675 \| 0.929 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 39.744 \| 62.880 \| 189.504 \| 179.968 \| 0.632 \| 1.053 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 41.472 \| 62.784 \| 199.136 \| 217.664 \| 0.661 \| 0.915 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 42.048 \| 61.952 \| 199.168 \| 214.496 \| 0.679 \| 0.929 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 341.184 \| 357.632 \| 980.256 \| 1328.896 \| 0.954 \| 0.738 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 212.576 \| 252.960 \| 673.888 \| 824.864 \| 0.840 \| 0.817 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 340.000 \| 363.296 \| 980.768 \| 1375.808 \| 0.936 \| 0.713 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 340.768 \| 356.832 \| 980.960 \| 1326.272 \| 0.955 \| 0.740 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 459.392 \| 737.120 \| 1678.240 \| 2205.248 \| 0.623 \| 0.761 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 292.672 \| 468.096 \| 1178.016 \| 1371.584 \| 0.625 \| 0.859 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 462.144 \| 745.312 \| 1680.000 \| 2252.512 \| 0.620 \| 0.746 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 462.112 \| 736.576 \| 1679.008 \| 2216.480 \| 0.627 \| 0.758 \| \| None \| None \| torch.bfloat16 
\| (2, 16, 512, 2, 512, 64) \| 16.064 \| 16.704 \| 105.120 \| 120.768 \| 0.962 \| 0.870 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 2, 512, 64) \| 15.552 \| 18.144 \| 107.136 \| 121.696 \| 0.857 \| 0.880 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 64) \| 16.096 \| 16.768 \| 102.688 \| 120.864 \| 0.960 \| 0.850 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 64) \| 16.032 \| 16.576 \| 104.736 \| 124.672 \| 0.967 \| 0.840 \| \| None \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 19.392 \| 21.952 \| 104.736 \| 174.656 \| 0.883 \| 0.600 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 20.128 \| 23.712 \| 105.216 \| 199.008 \| 0.849 \| 0.529 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 19.904 \| 21.888 \| 103.744 \| 179.520 \| 0.909 \| 0.578 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 512, 2, 512, 128) \| 19.968 \| 21.952 \| 104.640 \| 177.312 \| 0.910 \| 0.590 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 32.096 \| 31.904 \| 118.720 \| 231.968 \| 1.006 \| 0.512 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 30.528 \| 33.952 \| 112.480 \| 218.304 \| 0.899 \| 0.515 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 32.160 \| 32.224 \| 118.752 \| 237.312 \| 0.998 \| 0.500 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 64) \| 32.128 \| 32.032 \| 118.240 \| 233.120 \| 1.003 \| 0.507 \| \| None \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 41.312 \| 61.280 \| 177.408 \| 350.688 \| 0.674 \| 0.506 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 39.552 \| 59.360 \| 168.832 \| 371.488 \| 0.666 \| 0.454 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 41.984 \| 61.696 \| 177.376 \| 360.416 \| 0.680 \| 0.492 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 1024, 2, 1024, 128) \| 
41.312 \| 61.760 \| 177.184 \| 355.744 \| 0.669 \| 0.498 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 339.744 \| 357.888 \| 939.712 \| 1665.376 \| 0.949 \| 0.564 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 212.608 \| 248.832 \| 633.280 \| 1122.848 \| 0.854 \| 0.564 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 339.712 \| 363.232 \| 940.448 \| 1689.440 \| 0.935 \| 0.557 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 64) \| 341.056 \| 355.264 \| 940.128 \| 1641.152 \| 0.960 \| 0.573 \| \| None \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 460.736 \| 741.024 \| 1569.824 \| 2559.552 \| 0.622 \| 0.613 \| \| None \| causal \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 293.856 \| 464.192 \| 1066.240 \| 1840.416 \| 0.633 \| 0.579 \| \| relative_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 460.704 \| 753.152 \| 1570.112 \| 2641.088 \| 0.612 \| 0.594 \| \| head_bias \| None \| torch.bfloat16 \| (2, 16, 4096, 2, 4096, 128) \| 460.832 \| 745.536 \| 1570.144 \| 2602.560 \| 0.618 \| 0.603 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 35.680 \| 41.280 \| 171.840 \| 158.176 \| 0.864 \| 1.086 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 31.360 \| 42.976 \| 158.912 \| 139.264 \| 0.730 \| 1.141 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 35.168 \| 41.600 \| 171.648 \| 161.344 \| 0.845 \| 1.064 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 64) \| 35.136 \| 41.152 \| 171.808 \| 158.336 \| 0.854 \| 1.085 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 48.832 \| 76.384 \| 295.680 \| 277.696 \| 0.639 \| 1.065 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 45.632 \| 72.512 \| 281.760 \| 250.752 \| 0.629 \| 1.124 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 49.504 
\| 76.608 \| 295.584 \| 279.712 \| 0.646 \| 1.057 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 16, 512, 128) \| 48.864 \| 75.904 \| 295.456 \| 277.568 \| 0.644 \| 1.064 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 99.392 \| 111.232 \| 408.640 \| 442.656 \| 0.894 \| 0.923 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 71.392 \| 95.168 \| 338.784 \| 341.760 \| 0.750 \| 0.991 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 99.808 \| 112.256 \| 408.608 \| 456.160 \| 0.889 \| 0.896 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 64) \| 100.032 \| 110.816 \| 408.512 \| 444.192 \| 0.903 \| 0.920 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 135.040 \| 226.112 \| 726.880 \| 774.176 \| 0.597 \| 0.939 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 99.904 \| 169.696 \| 616.448 \| 607.104 \| 0.589 \| 1.015 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 135.488 \| 228.384 \| 727.776 \| 782.368 \| 0.593 \| 0.930 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 16, 1024, 128) \| 135.744 \| 225.664 \| 728.000 \| 773.600 \| 0.602 \| 0.941 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 1324.192 \| 1387.808 \| 3866.944 \| 5217.184 \| 0.954 \| 0.741 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 738.464 \| 832.608 \| 2507.392 \| 3146.688 \| 0.887 \| 0.797 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 1326.016 \| 1404.256 \| 3867.872 \| 5382.624 \| 0.944 \| 0.719 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 64) \| 1326.144 \| 1386.688 \| 3867.552 \| 5203.264 \| 0.956 \| 0.743 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 128) \| 1847.488 \| 2866.336 \| 6612.704 \| 8597.696 \| 0.645 \| 0.769 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 16, 
4096, 128) \| 1066.592 \| 1660.640 \| 4357.696 \| 5174.016 \| 0.642 \| 0.842 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 128) \| 1850.464 \| 2905.408 \| 6616.928 \| 8793.280 \| 0.637 \| 0.752 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 16, 4096, 128) \| 1848.896 \| 2834.720 \| 6623.872 \| 8637.920 \| 0.652 \| 0.767 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 36.384 \| 38.656 \| 150.336 \| 182.624 \| 0.941 \| 0.823 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 31.360 \| 38.112 \| 137.664 \| 171.840 \| 0.823 \| 0.801 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 36.608 \| 39.040 \| 150.528 \| 183.872 \| 0.938 \| 0.819 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 64) \| 36.064 \| 38.656 \| 150.560 \| 183.520 \| 0.933 \| 0.820 \| \| None \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 49.344 \| 76.352 \| 253.920 \| 301.440 \| 0.646 \| 0.842 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 46.720 \| 65.824 \| 239.424 \| 296.384 \| 0.710 \| 0.808 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 49.248 \| 76.416 \| 253.728 \| 307.808 \| 0.644 \| 0.824 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 512, 2, 512, 128) \| 49.376 \| 76.288 \| 253.728 \| 304.736 \| 0.647 \| 0.833 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 99.264 \| 110.144 \| 364.960 \| 503.072 \| 0.901 \| 0.725 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 71.136 \| 92.384 \| 294.432 \| 393.056 \| 0.770 \| 0.749 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 99.200 \| 111.360 \| 365.152 \| 512.640 \| 0.891 \| 0.712 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 64) \| 99.264 \| 110.240 \| 365.088 \| 504.224 \| 0.900 \| 0.724 \| \| None \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 
135.680 \| 230.336 \| 613.472 \| 816.896 \| 0.589 \| 0.751 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 100.256 \| 165.088 \| 502.144 \| 676.480 \| 0.607 \| 0.742 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 135.008 \| 232.480 \| 613.184 \| 836.672 \| 0.581 \| 0.733 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 1024, 2, 1024, 128) \| 135.232 \| 230.624 \| 613.536 \| 827.136 \| 0.586 \| 0.742 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 1324.064 \| 1378.688 \| 3631.808 \| 5308.384 \| 0.960 \| 0.684 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 731.776 \| 826.688 \| 2263.168 \| 3241.344 \| 0.885 \| 0.698 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 1316.128 \| 1403.200 \| 3625.088 \| 5550.688 \| 0.938 \| 0.653 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 64) \| 1311.904 \| 1378.880 \| 3616.320 \| 5353.696 \| 0.951 \| 0.675 \| \| None \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1837.856 \| 2887.392 \| 6121.632 \| 8586.656 \| 0.637 \| 0.713 \| \| None \| causal \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1066.976 \| 1654.368 \| 3843.136 \| 5291.040 \| 0.645 \| 0.726 \| \| relative_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1854.208 \| 2896.832 \| 6130.112 \| 8745.984 \| 0.640 \| 0.701 \| \| head_bias \| None \| torch.bfloat16 \| (8, 16, 4096, 2, 4096, 128) \| 1860.512 \| 2889.344 \| 6135.648 \| 8750.592 \| 0.644 \| 0.701 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 64) \| 60.640 \| 71.552 \| 315.968 \| 296.512 \| 0.847 \| 1.066 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 16, 512, 64) \| 50.784 \| 71.040 \| 284.288 \| 258.880 \| 0.715 \| 1.098 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 64) \| 61.312 \| 72.704 \| 315.680 \| 302.016 \| 0.843 \| 1.045 \| \| head_bias \| None \| torch.bfloat16 \| 
(16, 16, 512, 16, 512, 64) \| 60.800 \| 71.776 \| 316.320 \| 297.152 \| 0.847 \| 1.065 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 84.576 \| 144.416 \| 580.576 \| 535.936 \| 0.586 \| 1.083 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 76.064 \| 123.648 \| 553.344 \| 481.376 \| 0.615 \| 1.150 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 84.160 \| 145.248 \| 581.024 \| 540.000 \| 0.579 \| 1.076 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 512, 16, 512, 128) \| 84.512 \| 143.552 \| 581.088 \| 535.776 \| 0.589 \| 1.085 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 189.152 \| 209.408 \| 798.400 \| 868.704 \| 0.903 \| 0.919 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 127.552 \| 168.800 \| 650.816 \| 663.328 \| 0.756 \| 0.981 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 189.376 \| 211.360 \| 798.080 \| 895.552 \| 0.896 \| 0.891 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 64) \| 189.440 \| 208.576 \| 797.888 \| 873.152 \| 0.908 \| 0.914 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 257.536 \| 441.760 \| 1408.960 \| 1514.720 \| 0.583 \| 0.930 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 179.328 \| 312.096 \| 1170.368 \| 1177.472 \| 0.575 \| 0.994 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 259.264 \| 446.944 \| 1408.768 \| 1530.400 \| 0.580 \| 0.921 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 16, 1024, 128) \| 258.080 \| 440.480 \| 1408.864 \| 1514.144 \| 0.586 \| 0.930 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 2595.808 \| 2771.456 \| 7616.704 \| 10405.248 \| 0.937 \| 0.732 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 1435.744 \| 1610.336 \| 4927.520 \| 6220.000 \| 0.892 \| 0.792 \| \| 
relative_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 2595.264 \| 2745.056 \| 7611.232 \| 10631.392 \| 0.945 \| 0.716 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 64) \| 2576.256 \| 2735.456 \| 7626.400 \| 10346.976 \| 0.942 \| 0.737 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 3679.744 \| 5634.816 \| 13077.056 \| 17182.528 \| 0.653 \| 0.761 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 2099.360 \| 3250.176 \| 8589.664 \| 10236.672 \| 0.646 \| 0.839 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 3676.800 \| 5716.288 \| 13073.088 \| 17311.071 \| 0.643 \| 0.755 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 16, 4096, 128) \| 3679.136 \| 5570.496 \| 13070.720 \| 17192.863 \| 0.660 \| 0.760 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 61.600 \| 71.008 \| 272.320 \| 300.000 \| 0.868 \| 0.908 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 50.176 \| 65.344 \| 241.568 \| 258.912 \| 0.768 \| 0.933 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 61.120 \| 72.512 \| 272.672 \| 305.408 \| 0.843 \| 0.893 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 64) \| 61.248 \| 71.136 \| 272.640 \| 301.120 \| 0.861 \| 0.905 \| \| None \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 83.872 \| 146.784 \| 466.912 \| 496.832 \| 0.571 \| 0.940 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 76.704 \| 115.072 \| 435.584 \| 462.112 \| 0.667 \| 0.943 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 83.392 \| 147.392 \| 466.656 \| 504.448 \| 0.566 \| 0.925 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 512, 2, 512, 128) \| 83.360 \| 146.688 \| 466.656 \| 499.040 \| 0.568 \| 0.935 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 189.024 \| 207.584 \| 684.768 \| 
873.568 \| 0.911 \| 0.784 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 126.944 \| 164.288 \| 536.192 \| 645.984 \| 0.773 \| 0.830 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 188.768 \| 209.760 \| 684.096 \| 897.504 \| 0.900 \| 0.762 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 64) \| 189.408 \| 207.776 \| 685.024 \| 876.384 \| 0.912 \| 0.782 \| \| None \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 259.168 \| 449.536 \| 1167.936 \| 1433.280 \| 0.577 \| 0.815 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 180.000 \| 305.312 \| 928.000 \| 1113.920 \| 0.590 \| 0.833 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 258.464 \| 455.136 \| 1167.808 \| 1462.848 \| 0.568 \| 0.798 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 1024, 2, 1024, 128) \| 257.824 \| 450.208 \| 1167.744 \| 1448.000 \| 0.573 \| 0.806 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 2598.368 \| 2729.120 \| 7134.400 \| 10381.632 \| 0.952 \| 0.687 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 1435.456 \| 1591.040 \| 4424.768 \| 6035.808 \| 0.902 \| 0.733 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 2594.752 \| 2725.952 \| 7128.384 \| 10822.496 \| 0.952 \| 0.659 \| \| head_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 64) \| 2597.888 \| 2716.960 \| 7101.568 \| 10385.440 \| 0.956 \| 0.684 \| \| None \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 128) \| 3647.648 \| 5581.632 \| 12089.952 \| 16667.233 \| 0.654 \| 0.725 \| \| None \| causal \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 128) \| 2093.952 \| 3241.440 \| 7579.392 \| 9847.936 \| 0.646 \| 0.770 \| \| relative_bias \| None \| torch.bfloat16 \| (16, 16, 4096, 2, 4096, 128) \| 3650.528 \| 5650.688 \| 12105.568 \| 16963.680 \| 0.646 \| 0.714 \| \| head_bias \| None \| torch.bfloat16 \| 
(16, 16, 4096, 2, 4096, 128) \| 3680.064 \| 5585.312 \| 12117.504 \| 16935.040 \| 0.659 \| 0.716 \| </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135505 Approved by: https://github.com/Chillee	2024-09-10 09:30:02 +00:00
Roy Hvaara	23b1486185	[MPS] Allow nan mean reduction in `nll_loss` (#135434 ) This PR allows results from `nll_loss` to be `nan`, which is the same behavior as with CUDA and CPU https://github.com/pytorch/pytorch/pull/64572#issuecomment-926504162. Fixes #134431 Ref #64572 #119108 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135434 Approved by: https://github.com/malfet	2024-09-10 08:37:59 +00:00
Victor Tao	9902b349cb	[Inductor] Make static_input_idxs a set for faster lookup (#135314 ) `static_input_idxs` is only used for lookups. With large models, this is a large list. This takes over a millisecond in some cases. Profile before change: <img width="824" alt="image" src="https://github.com/user-attachments/assets/002a0775-fd2f-4d27-8cf2-812b502d7d5e"> Profile after change: gaps are smaller, 1ms speedup before launching the cuda graph <img width="794" alt="image" src="https://github.com/user-attachments/assets/12a0a0b9-2cc1-4d53-ac87-9bd5140a46f5"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135314 Approved by: https://github.com/oulgen	2024-09-10 07:27:55 +00:00
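The speedup described in the commit above comes from replacing linear-time list membership tests with average constant-time set membership tests. The following is a minimal sketch of that idea only; the index values and the `count_static` helper are hypothetical and are not taken from Inductor.

```python
import timeit

# Hypothetical data: positions of inputs that are treated as static.
static_input_idxs_list = list(range(0, 20_000, 2))
static_input_idxs_set = set(static_input_idxs_list)


def count_static(num_inputs, static_idxs):
    """Count how many input positions in [0, num_inputs) are marked static."""
    return sum(1 for i in range(num_inputs) if i in static_idxs)


# Both containers give the same answer; only the lookup cost differs.
assert count_static(1_000, static_input_idxs_list) == count_static(
    1_000, static_input_idxs_set
)

# Worst case for the list: the element sits at the very end.
list_time = timeit.timeit(lambda: 19_998 in static_input_idxs_list, number=200)
set_time = timeit.timeit(lambda: 19_998 in static_input_idxs_set, number=200)
print(f"list lookup: {list_time:.6f}s, set lookup: {set_time:.6f}s")
```

The exact timings depend on the machine, so none are quoted here; the point is that the list lookup scans elements one by one while the set lookup hashes directly to the entry.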
Tugsbayasgalan Manlaibaatar	5a9ac83e94	Fix doc (#135551 ) Differential Revision: [D62412667](https://our.internmc.facebook.com/intern/diff/D62412667/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135551 Approved by: https://github.com/yushangdi ghstack dependencies: #135549	2024-09-10 07:18:44 +00:00
Sam Larsen	1adf28a5c0	[inductor] print triton float64 constants correctly (#135260 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135260 Approved by: https://github.com/jansel	2024-09-10 07:05:02 +00:00
Tugsbayasgalan Manlaibaatar	c18052da0e	Add some minor doc improvement and ban using training IR for unflattener (#135549 ) Title Differential Revision: [D62412490](https://our.internmc.facebook.com/intern/diff/D62412490/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135549 Approved by: https://github.com/yushangdi	2024-09-10 06:48:42 +00:00
Yichen Yan	c0d2f991b1	Increase `TRITON_MAX_BLOCK['X']` (#135181 ) Fixes #135028 As title, increase `TRITON_MAX_BLOCK['X']` to 4096 and fix an error, thanks to @Chillee: https://github.com/pytorch/pytorch/pull/133300/files#r1744706189 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135181 Approved by: https://github.com/jansel	2024-09-10 05:54:37 +00:00
Thomas Bohnstingl	e889252493	Implementation of scan (#134102 ) This operation is intended as the counterpart to `associative_scan`, but it can operate with non-associative functions. @ydwu4 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134102 Approved by: https://github.com/ydwu4	2024-09-10 04:51:16 +00:00
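To illustrate why a plain scan can accept non-associative functions, here is a small pure-Python sketch. `sequential_scan` is a hypothetical helper written for this illustration; it is not the PyTorch operator and makes no claim about that operator's signature. It applies the function strictly left to right, so the result never depends on regrouping.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def sequential_scan(combine_fn: Callable[[T, T], T], xs: List[T]) -> List[T]:
    """Inclusive left-to-right scan (hypothetical helper, not the PyTorch op)."""
    out: List[T] = []
    for x in xs:
        # Strict left-to-right order: the previous output is always the
        # left operand, so combine_fn does not need to be associative.
        out.append(x if not out else combine_fn(out[-1], x))
    return out


def sub(a, b):
    """Subtraction is not associative: (a - b) - c != a - (b - c) in general."""
    return a - b


running = sequential_scan(sub, [10, 3, 2, 1])
print(running)  # [10, 7, 5, 4]
```

A parallel associative scan is free to regroup operands, which would give a different answer for `sub`, since `(10 - 3) - 2` is 5 while `10 - (3 - 2)` is 9.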
Avik Chaudhuri	6546c6186d	do not raise when flatten_fn_with_keys not found when suggesting fixes (#135518 ) Test Plan: added test Differential Revision: D62395371 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135518 Approved by: https://github.com/zhxchen17	2024-09-10 03:47:36 +00:00
Chien-Chin Huang	1d9fefff19	[DCP] Fixes the stateless optimizer issue of distributed state_dict (#135535 ) Some optimizers don't have states, which can cause get_state_dict/set_state_dict to behave incorrectly. This PR fixes the issues. fixes: https://github.com/pytorch/pytorch/issues/133415 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135535 Approved by: https://github.com/wz337	2024-09-10 03:10:00 +00:00
zengxian	7ec17b49cf	Fix dynamo benchmark skip logic for cpu device (#135193 ) Fixes #132380, adjust torchbench and huggingface skip models list, then we can remove `--no-skip` when running benchmarks on 3 suites. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135193 Approved by: https://github.com/chuanqi129, https://github.com/jansel	2024-09-10 03:02:19 +00:00
Wu, Chunyuan	146921007a	[inductor] [cpp] fix the input contiguous check in max-autotune (#134982 ) ## Description Fixes the FP32 accuracy failure of `resmlp_12_224` and BF16 accuracy failure of `volo_d1_224` in timm. In this PR, we check whether input is contiguous using the following way: If it has `FixedLayout`, we know the accurate strides. For `FlexibleLayout`, if its data is a `ComputedBuffer`, we could get the fill order of the buffer to decide whether it's contiguous. For the other cases, we won't use GEMM template as we can't infer whether it's contiguous. ## Additional context The current GEMM template only supports this case: `input.get_stride()[-1] == 1`. In `resmlp_12_224`, when we run into this check, the layout of `input` is a `FlexibleLayout`. The reason is that when realizing the input which is a `View` IR, the `convert_to_reinterpret_view` call fails: `d14fe3ffed/torch/_inductor/ir.py (L4712-L4715)` And it finally runs into this `copy_input` and returns a `FlexibleLayout`. `d14fe3ffed/torch/_inductor/ir.py (L4722)` When checking its stride, this `FlexibleLayout` indeed satisfies `input.get_stride()[-1] == 1` but it is later decided as a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`, which is not supported by the GEMM template, thus causing accuracy issue in this model. The `FlexibleLayout` is converted to `FixedLayout` during [CppPackedGemmTemplate.add_choices](`d14fe3ffed/torch/_inductor/mkldnn_lowerings.py (L1051)`) which calls [slice_nd](`d14fe3ffed/torch/_inductor/codegen/cpp_template_kernel.py (L150)`) when rendering the kernel (`slice_nd(X)`). When creating the `SliceView` IR, [as_storage_and_layout](`d14fe3ffed/torch/_inductor/ir.py (L2288)`) invokes [decide_layout](`d14fe3ffed/torch/_inductor/ir.py (L2135)`) and converts it to a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134982 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2024-09-10 02:47:38 +00:00
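The stride condition discussed in the commit above can be sketched in plain Python. The helper names below are hypothetical and only mirror the quoted check `input.get_stride()[-1] == 1`; they are not Inductor code.

```python
from typing import Sequence, Tuple


def contiguous_strides(size: Sequence[int]) -> Tuple[int, ...]:
    """Row-major (contiguous) strides for a tensor of the given size."""
    strides = []
    running = 1
    for dim in reversed(size):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


def last_dim_is_unit_stride(stride: Sequence[int]) -> bool:
    """The condition quoted in the commit message: stride[-1] == 1."""
    return stride[-1] == 1


size = (3072, 196)
print(contiguous_strides(size))             # (196, 1)
print(last_dim_is_unit_stride((196, 1)))    # True
print(last_dim_is_unit_stride((1, 3072)))   # False
```

With `size = (3072, 196)`, the layout that was eventually fixed, `stride = (1, 3072)`, fails the check, which matches the commit's statement that this layout is not supported by the GEMM template.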
Yueming Hao	a71e5509bc	[inductor]Add profiler to operatorbench (#135515 ) Add profiling to operatorbench. The new argument `--profile` is added and the profiling trace is like the following figure. <img width="954" alt="image" src="https://github.com/user-attachments/assets/5b00d6e3-4905-4a77-a5e9-9f62620a5fd5"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135515 Approved by: https://github.com/shunting314	2024-09-10 02:33:30 +00:00
Guilherme Leobas	136e28f616	Enable forward AD in functional.affine_grid (#135494 ) Fixes #121411 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135494 Approved by: https://github.com/zou3519, https://github.com/soulitzer	2024-09-10 00:07:07 +00:00
Jeff Daily	39a61795e3	remove amax_ptr from scaled_gemm (#135421 ) amax was removed from _scaled_mm by #128683. Remove it from the internal at::cuda::blas::scaled_gemm, as well. This allows hipBLASLt to find additional solutions rather than forcing amax to be used and then discarding the result. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135421 Approved by: https://github.com/drisspg, https://github.com/eqy	2024-09-09 23:04:36 +00:00
Scott Wolchok	b4feec9782	[xplat][XNNPACK] don't prefer static linkage in xplat for main target (#135529 ) Building XNNPACK as a static library has some issues because of multiple global params floating around. Let's try to get rid of it in xplat and see how it fares. Differential Revision: [D60776152](https://our.internmc.facebook.com/intern/diff/D60776152/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D60776152/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/135529 Approved by: https://github.com/kimishpatel, https://github.com/mcr229, https://github.com/kirklandsign	2024-09-09 22:47:01 +00:00
Yanbo Liang	d81731615f	[Dynamo] Adding CallFunctionNoArgsSource and (#135425 ) CallFunctionNoArgsGuardAccessor to support torch.cuda.current_device() Pull Request resolved: https://github.com/pytorch/pytorch/pull/135425 Approved by: https://github.com/anijain2305	2024-09-09 22:46:00 +00:00
shubhambhokare1	e2f9a83b85	[ONNX] Drop final None values as inputs for nodes in exporter graph (#135520 ) When the value for an optional input is not provided, it defaults to `None`, which gets translated to "" in the ONNX graph. To avoid this, if we have a list of inputs and the final few are all `None`, strip them in the graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135520 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2024-09-09 22:28:41 +00:00
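The trimming rule described in the commit above (drop only the trailing run of `None` inputs, keep any `None` in the middle) can be sketched as follows. `strip_trailing_none` and the input names are hypothetical and are not the exporter's actual function or data.

```python
from typing import List, Optional


def strip_trailing_none(inputs: List[Optional[str]]) -> List[Optional[str]]:
    """Return inputs without the trailing run of None values.

    A None that is followed by a real input is kept, because removing it
    would shift the positions of the inputs that come after it.
    """
    end = len(inputs)
    while end > 0 and inputs[end - 1] is None:
        end -= 1
    return inputs[:end]


print(strip_trailing_none(["x", None, "w", None, None]))  # ['x', None, 'w']
```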
PyTorch MergeBot	70a65a8bd5	Revert "NJT <-> padded dense conversions (#125947 )" This reverts commit 09a5e88bef04d5485b70d8f65f46a675aaa52942. Reverted https://github.com/pytorch/pytorch/pull/125947 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing dynamo test `09a5e88bef`, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/125947#issuecomment-2339228570))	2024-09-09 22:01:09 +00:00
PyTorch MergeBot	689d278543	Revert "Add `__init__.py` to shape inference folder. (#135461 )" This reverts commit dced0d6d9f05f0962f74a3c6227f774111c15715. Reverted https://github.com/pytorch/pytorch/pull/135461 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it exposes some public function without appropriate doc. I will reopen the issue with hi-prio so that it can be fixed properly ([comment](https://github.com/pytorch/pytorch/pull/135461#issuecomment-2339218382))	2024-09-09 21:55:13 +00:00
atalman	9b764491e3	Use upload-artifact@v4.4.0 for create_release.yml (#135528 ) Fixes failure: https://github.com/pytorch/pytorch/actions/runs/10780281005/job/29895846007 Due to broken sync: `actions/upload-artifact@v2 and actions/download-artifact@v4.1.7` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135528 Approved by: https://github.com/kit1980, https://github.com/malfet	2024-09-09 20:48:52 +00:00
Maclyn Brandwein	cbc6b30a24	Fix broken E2E tests on Linux machines (#135394 ) Summary: I'm not entirely sure why this is failing with an `ImportError` (according to lastnameye a super class of `ModuleNotFoundError`s), but on our E2E tests on Linux machines (but not Macs?), we're seeing the import failure not getting caught -- `ImportError: cannot import name 'parutil' from 'libfb.py' (/data/sandcastle/boxes/eden-trunk-hg-full-fbsource/buck-out/v2/gen/fbsource/d0c916ec8d40ce11/arvr/libraries/ctrl/studies/replay/__ctrl-r__/ctrl-r#link-tree/libfb/py/__init__.py)` from this test run https://www.internalfb.com/sandcastle/workflow/2522015791331601269, an instance of this job: https://www.internalfb.com/intern/test/844425085172858?ref_report_id=0 is the overall job Test Plan: `arc skycastle schedule tools/skycastle/workflows2/ctrl/js_tests.sky:test_js_e2e_replay_tests --sandcastle-spec-overrides '{"type": "fbcode", "unicastle_size": "I1_MEDIUM"}'` -> https://www.internalfb.com/sandcastle/workflow/256705178764255769 Differential Revision: D62321167 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135394 Approved by: https://github.com/laithsakka	2024-09-09 20:18:08 +00:00
PyTorch MergeBot	5b368de7f7	Revert "[ONNX] Update fake mode usage in onnx docs (#135512 )" This reverts commit a13c118994b4f118388d97a35abcb91a396cd437. Reverted https://github.com/pytorch/pytorch/pull/135512 on behalf of https://github.com/davidberard98 due to failing test https://github.com/pytorch/pytorch/actions/runs/10778813316/job/29891679127 ([comment](https://github.com/pytorch/pytorch/pull/135512#issuecomment-2338999090))	2024-09-09 20:15:12 +00:00
Joel Schlosser	09a5e88bef	NJT <-> padded dense conversions (#125947 ) This PR: * Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values) * Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics * Note: there is currently no public API for this; design booted to a future PR TODO: * ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~ * ~~Verify that Inductor does computation fusion via test logic~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947 Approved by: https://github.com/soulitzer	2024-09-09 19:37:32 +00:00
Sahan Paliskara	a4e6a0b240	[split build] move periodic split builds into own concurrency group (#135510 ) To avoid nightly workflows cancelling each other Pull Request resolved: https://github.com/pytorch/pytorch/pull/135510 Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-09 19:35:57 +00:00
imShZh	4ab232d0c4	Fix symbolic number's type and tensor's dtype mismatch bug in Tensor ctor (#135433 ) Fixes #135432 In the current implementation, if we try to store a symbolic number in Tensor's constructor, it assumes that the tensor's dtype and the symbolic number's type match, which is not always the case. In other words, if we try to store a `SymInt`, the current implementation assumes the tensor's dtype is `torch.int32`, `torch.int64` or similar. And if we try to store a `SymFloat`, it assumes the tensor's dtype is `torch.float32` or `torch.float64`. However, the tensor's dtype could also be `torch.float32` or something else when we try to store a `SymInt`, in which case that assumption is wrong. This PR stores symbolic numbers according to the tensor's scalar type by wrapping the guarded number of `SymInt` and `SymFloat` into a PyObject. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135433 Approved by: https://github.com/ezyang	2024-09-09 19:32:18 +00:00
Sergii Dymchenko	2032f107d7	Don't try to tag s390x docker images (#135509 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135509 Approved by: https://github.com/atalman	2024-09-09 19:07:48 +00:00
rzou	5f7d956362	Fix bugs blocking flipping the default layout constraint for custom ops (#135391 ) Fixes two things: - For regular PyTorch ops, the default layout constraint tag is always flexible_layout. This was a bug with #135238 - Mark the new quantized _wrapped_linear_prepack ops as flexible_layout. The metas for these are incorrect, I didn't want to fix them (and changing the default requires the metas actually be correct). Test Plan: - The next PR up in the stack. The PRs are split because the next one is riskier. foo Pull Request resolved: https://github.com/pytorch/pytorch/pull/135391 Approved by: https://github.com/albanD	2024-09-09 18:24:21 +00:00
shubhambhokare1	a13c118994	[ONNX] Update fake mode usage in onnx docs (#135512 ) Update fake mode usage in onnx docs Pull Request resolved: https://github.com/pytorch/pytorch/pull/135512 Approved by: https://github.com/justinchuby	2024-09-09 18:10:37 +00:00
Chien-Chin Huang	21241bfeee	[CP] Extend CP to support load-balancing shards (#132442 ) This PR extends the current ring attention to support load-balancing shards -- the context/sequence is divided into `2 * world_size` shards and each rank gets `rank` and `(world_size * 2 - rank - 1)` shards. The data re-shuffling is done in the `context_parallel` API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132442 Approved by: https://github.com/wconstab	2024-09-09 18:04:38 +00:00
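The shard assignment described in the commit above can be written out directly. The sketch below assumes causal attention and a simple cost model in which shard `i` costs `i + 1` units of work (it attends to itself and all earlier shards); that cost model and the helper names are assumptions made for illustration and are not stated in the commit.

```python
from typing import Tuple


def shards_for_rank(rank: int, world_size: int) -> Tuple[int, int]:
    """Shard indices held by a rank when the sequence is cut into 2 * world_size shards."""
    return (rank, 2 * world_size - rank - 1)


def assumed_causal_cost(shard: int) -> int:
    """Assumed cost model: shard i attends to shards 0..i, so it costs i + 1."""
    return shard + 1


world_size = 4
for rank in range(world_size):
    pair = shards_for_rank(rank, world_size)
    cost = sum(assumed_causal_cost(s) for s in pair)
    print(f"rank {rank}: shards {pair}, assumed cost {cost}")
```

Under this assumed cost model every rank ends up with the same total, `2 * world_size + 1`, because an early (cheap) shard is always paired with a late (expensive) one.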
PyTorch MergeBot	73a6fc6e30	Revert "[Inductor] Make static_input_idxs a set for faster lookup (#135314 )" This reverts commit 011cae9570fb3c44b7f6f0c8004c470579ed21da. Reverted https://github.com/pytorch/pytorch/pull/135314 on behalf of https://github.com/ZainRizvi due to Lint is failing on this file in trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10777258770/job/29885960050) [HUD commit link](`011cae9570`) ([comment](https://github.com/pytorch/pytorch/pull/135314#issuecomment-2338678219))	2024-09-09 17:33:01 +00:00
Roy Hvaara	09287e3af4	[MPS] Add regression test for `fft.fftfreq` (#135440 ) The issue reported in #135223 was already solved in #128393. This PR adds a regression test for it. Fixes #135223 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135440 Approved by: https://github.com/ezyang	2024-09-09 17:12:36 +00:00
Bin Bao	16c3b8f87c	[AOTI] Fix assert_function call in cpu autotune template (#135086 ) Summary: In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135086 Approved by: https://github.com/chenyang78, https://github.com/angelayi ghstack dependencies: #134857	2024-09-09 16:54:12 +00:00
Bin Bao	9c6dff4941	[AOTI] Add C shim for aten.mkldnn_rnn_layer in cpp wrapper (#134857 ) Summary: Support aten.mkldnn_rnn_layer in the ABI-compatible mode. Because aten.mkldnn_rnn_layer is an aten op, it is easier to add a C shim function for it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134857 Approved by: https://github.com/angelayi	2024-09-09 16:54:12 +00:00
atalman	0eb425a563	[Release] Apply Release changes scripts after release 2.4 (#135495 ) Based on additional changes required for https://github.com/pytorch/pytorch/pull/128347 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135495 Approved by: https://github.com/kit1980	2024-09-09 16:49:04 +00:00
Victor Tao	011cae9570	[Inductor] Make static_input_idxs a set for faster lookup (#135314 ) `static_input_idxs` is only used for lookups. With large models, this is a large list. This takes over a millisecond in some cases. Profile before change: <img width="824" alt="image" src="https://github.com/user-attachments/assets/002a0775-fd2f-4d27-8cf2-812b502d7d5e"> Profile after change: gaps are smaller, 1ms speedup before launching the cuda graph <img width="794" alt="image" src="https://github.com/user-attachments/assets/12a0a0b9-2cc1-4d53-ac87-9bd5140a46f5"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135314 Approved by: https://github.com/oulgen	2024-09-09 16:24:58 +00:00
CaoE	dfb2b661f7	Use float data type for Half var_sum in batchnorm stats updating on CPU (#126525 ) Using float data type for Half `var_sum` in batchnorm stats updating on CPU to avoid `var_sum` overflow since the representation range of Half is small. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126525 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-09-09 15:31:38 +00:00
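The overflow that motivates the commit above can be reproduced without PyTorch by rounding an accumulator to IEEE half precision after every addition. This is a simplified sketch: it sums plain squares rather than the actual batchnorm statistics, and `to_half` and `sum_of_squares` are hypothetical helpers built on the standard-library `struct` half-float format.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to IEEE half precision, saturating to inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # Values beyond the half range (max finite value 65504) cannot be packed.
        return float("inf") if x > 0 else float("-inf")


def sum_of_squares(values, accumulate_in_half: bool) -> float:
    """Accumulate v * v either in half precision or in a wider float."""
    total = 0.0
    for v in values:
        total += v * v
        if accumulate_in_half:
            total = to_half(total)
    return total


values = [to_half(100.0)] * 10  # each square is 10000; the true sum is 100000

print(sum_of_squares(values, accumulate_in_half=True))   # inf
print(sum_of_squares(values, accumulate_in_half=False))  # 100000.0
```

Each input is comfortably representable in half precision, yet the running sum exceeds 65504 after seven terms, which is why a wider accumulator avoids the overflow.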
Roy Hvaara	5a69e0ebbe	[MPS] Update decorator comments with issue ref (#135448 ) Updating the comments with references to better places for context now that the bugs have been identified. xref #135442 #135447 #134184 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135448 Approved by: https://github.com/ezyang	2024-09-09 15:18:52 +00:00
Xavier Dupré	5e145861f2	[ONNX] Improves documentation of ONNX exporter (#135372 ) The PR updates the documentation to reflect the changes introduced in pytorch 2.5 and related to onnx exporter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135372 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2024-09-09 15:09:01 +00:00
Yuxin Wu	c35b953531	Fix wrong error msg (#135423 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135423 Approved by: https://github.com/ezyang	2024-09-09 13:28:31 +00:00
PHLens	dced0d6d9f	Add `__init__.py` to shape inference folder. (#135461 ) Fixes #135196 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135461 Approved by: https://github.com/ezyang	2024-09-09 13:27:58 +00:00
Jiong Gong	c0436c5701	[inductor][cpp][gemm] fix perf regression xcit_large_24_p8_224 (#134686 ) (#135438 ) Fix #134686. PR https://github.com/pytorch/pytorch/pull/132729 makes GEMM template faster for one of the GEMMs in xcit_large_24_p8_224: SingleProcess AUTOTUNE benchmarking takes 1.7088 seconds and 1.9207 seconds precompiling AUTOTUNE linear_unary(12544x3072, 768x3072, 768) cpp_packed_gemm_2 2.9371 ms 100.0% _linear_pointwise 3.1584 ms 93.0% But it is slower than ATen in the e2e run due to different cache behavior. The access to the input data (12544x3072) is LLC-latency bound, and bottlenecks are seen due to memory synchronization (data transfers and coherence updates across processors). This PR tries to mitigate the problem by cooperatively loading different chunks of input data from different processors that share the input data. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135438 Approved by: https://github.com/leslie-fang-intel	2024-09-09 05:16:02 +00:00
cyy	60e8dc4374	Check function declarations in Caffe2 code (#134925 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134925 Approved by: https://github.com/ezyang	2024-09-09 05:03:29 +00:00
xingyunjohn1	e6c3f58584	Fix example: Address broadcasting error in the addition of `attn_bias` and `attn_mask`, and correct device assignment for newly created variables in the method. (#135427 ) 1. Adding `attn_bias += attn_mask` results in a broadcasting error. The expected shape of `attn_bias` is (L, S), so the output should also have the shape (L, S). However, when the input shape is (N, num_heads, L, S), broadcasting occurs, leading to an output shape of (N, num_heads, L, S), which is not desired. 2. `attn_bias` is a newly created variable within the method, but it is not assigned to the correct device. This is my retry of PR #130209. The PR has been merged into commit `d4a79d4a7c746068d25fe5cf9333495561f4ce1f`, but the modifications were overwritten by subsequent commits. Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com> @mikaylagawarecki provided a more elegant implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135427 Approved by: https://github.com/ezyang	2024-09-09 03:47:34 +00:00
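The shape problem in point 1 of the commit above can be checked with a few lines of plain Python. `broadcast_shape` is a hypothetical helper that applies the usual NumPy/PyTorch-style broadcasting rules to shape tuples; the concrete sizes are made up for illustration.

```python
from itertools import zip_longest
from typing import Tuple


def broadcast_shape(a: Tuple[int, ...], b: Tuple[int, ...]) -> Tuple[int, ...]:
    """Result shape of broadcasting shape a against shape b."""
    out = []
    # Align shapes from the trailing dimension; missing dims count as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        out.append(max(x, y))
    return tuple(reversed(out))


N, num_heads, L, S = 2, 4, 8, 8
attn_bias_shape = (L, S)
attn_mask_shape = (N, num_heads, L, S)

result_shape = broadcast_shape(attn_bias_shape, attn_mask_shape)
print(result_shape)                     # (2, 4, 8, 8)
print(result_shape == attn_bias_shape)  # False
```

An in-place update has to write its result back into the left-hand operand, so it only works when the broadcast result has the same shape as that operand. Here the result is 4-D while `attn_bias` is 2-D, which is the mismatch the commit describes.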
PhilipMay	90e12cf63d	Fix return type of `nansum` example. (#135435 ) One of the examples in the documentation of `torch.nansum` contains a wrong return type. This fixes it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135435 Approved by: https://github.com/ezyang	2024-09-09 03:34:52 +00:00
Zhou, Lingzhi	44c08f4984	[Partitioner] Query whether nodes exist in graph faster (#135316 ) Checking whether a node exists in graph.nodes (a linked list) takes too long. Use graph._find_nodes_lookup_table (a hash table) instead to speed it up. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135316 Approved by: https://github.com/ezyang	2024-09-09 03:34:02 +00:00
Rafal Litka	b6186353c6	enable lazy_init for hpu (#135203 ) enables lazy_init for hpu device Pull Request resolved: https://github.com/pytorch/pytorch/pull/135203 Approved by: https://github.com/ezyang	2024-09-09 03:32:20 +00:00
Alexander Kurakin	b7eb7256fb	docs: `torch.nn.utils.rnn.pack_padded_sequence`: docs improve (#135417 ) docs: `torch.nn.utils.rnn.pack_padded_sequence`: docs improve /cc @mikaylagawarecki Pull Request resolved: https://github.com/pytorch/pytorch/pull/135417 Approved by: https://github.com/ezyang	2024-09-09 03:16:11 +00:00
Xu Han	c1ae78be92	[inductor] calibration inductor windows uts (18/N) (#135449 ) Skip the test_quantized_* UTs of `test/inductor/test_cpu_select_algorithm.py`. Inductor on Windows does not support quantization so far. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135449 Approved by: https://github.com/ezyang	2024-09-09 03:10:54 +00:00
yuqingj	defb515306	[NJT]Add permute ops support (#135336 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135336 Approved by: https://github.com/davidberard98	2024-09-08 21:00:41 +00:00
Jason Ansel	31c4e0d37d	[inductor] Cleanup analysis done at lowering time (#135412 ) Before this we would take multiple passes over the body of each IRNode as we did lowering. This combines most analysis into `OpCounterCSE` so it can be done in a single pass. Before: ![image](https://github.com/user-attachments/assets/0047db09-4258-4491-a9a6-b078e183092a) After: ![image](https://github.com/user-attachments/assets/1e03adcb-8303-4bb1-8bbb-cc42dacd44d7) This stack: ![image](https://github.com/user-attachments/assets/d6b50b24-c30c-4d23-8b1a-344b3ba65d7a) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135412 Approved by: https://github.com/oulgen ghstack dependencies: #135286, #135306, #135377, #135400	2024-09-08 18:02:36 +00:00
Jason Ansel	53290ca00b	[inductor] Refactor BaseSchedulerNode.__init__ (#135400 ) Might be a small compile time improvement since we remove a call to extract_read_writes(). Pull Request resolved: https://github.com/pytorch/pytorch/pull/135400 Approved by: https://github.com/oulgen ghstack dependencies: #135286, #135306, #135377	2024-09-08 18:02:36 +00:00
Jason Ansel	16f5155992	[inductor] Fast path for extract_read_writes without tracing (#135377 ) Before (bottom of stack): ![image](https://github.com/user-attachments/assets/13060ff9-b31d-42a9-8e8f-c50b2bf3dc2f) After (this PR): ![image](https://github.com/user-attachments/assets/7d190821-b614-46b7-9e9e-9087443df654) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135377 Approved by: https://github.com/oulgen ghstack dependencies: #135286, #135306	2024-09-08 18:02:32 +00:00
Jason Ansel	37144be03d	[inductor] Remove ReadWrites.op_counts (#135306 ) This was (almost) unused. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135306 Approved by: https://github.com/oulgen ghstack dependencies: #135286	2024-09-08 18:02:28 +00:00
Jason Ansel	3bdc54ed18	[inductor] Refactor LoopBody.memory_usage (#135286 ) This is preparing for some other changes where I speed up extract_read_writes tracing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135286 Approved by: https://github.com/oulgen	2024-09-08 18:02:24 +00:00
cyy	2196f32475	[22/N] Fix clang-tidy warnings in jit (#135319 ) Follows #134537 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135319 Approved by: https://github.com/titaiwangms	2024-09-08 17:18:29 +00:00
Wanchao Liang	cfc227ad43	[reland][dtensor] move DTensor to public namespace (#134203 ) reland of https://github.com/pytorch/pytorch/pull/133113 I have to create a new PR because the previous reverted PR could be neither rebased nor imported successfully :( ---- Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes: * many path renames and path import fixes * a dedicated doc page without too much content yet (adding in the next PRs) * To preserve the BC for users still using the torch.distributed._tensor, I added a shim script to redirect old path calls to the new module BC preservation is evidenced by the fact that all DTensor tests are still working without changing the public imports. So it's safe to land the changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203 Approved by: https://github.com/tianyu-l	2024-09-08 17:08:40 +00:00
Animesh Jain	20cab91a12	[dynamo] Remove skip from jit freeze tests (#135281 ) Fixes https://github.com/pytorch/pytorch/issues/119781 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135281 Approved by: https://github.com/zou3519	2024-09-08 15:11:12 +00:00
CaoE	a6fae2e811	Use BRGEMM for Half flash attention forward kernel (#131879 ) Use oneDNN BRGEMM on packed data to get better performance on the 5th generation of Xeon where Intel® Advanced Matrix Extensions (AMX) will have fp16 support, e.g. amx-fp16. Multiple models have achieved acceleration, for instance, FP16 stable diffusion v2.1 has achieved over 50% improvement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131879 Approved by: https://github.com/jgong5, https://github.com/peterbell10 ghstack dependencies: #131878	2024-09-08 12:32:23 +00:00
Justin Chu	042f2f7746	[ONNX] Re-raise the exception if the dynamic shapes cannot be refined (#135418 ) Improve error reporting. Otherwise users will just see not being able to refine shapes most of the time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135418 Approved by: https://github.com/titaiwangms	2024-09-08 05:30:34 +00:00
Huamin Li	fd494dd426	Change wrapped_linear_prepack and wrapped_quantized_linear_prepacked to private by adding _ as prefix (#135401 ) Summary: In https://github.com/pytorch/pytorch/pull/134232, we added two new ops wrapped_linear_prepack and wrapped_quantized_linear_prepacked. From the review comments and offline discussion, we are changing them to private by adding `_` as prefix Differential Revision: D62325142 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135401 Approved by: https://github.com/houseroad	2024-09-08 04:16:24 +00:00
Bob Ren	8334cb2fb9	remove commented out breakpoints (#135363 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135363 Approved by: https://github.com/oulgen	2024-09-08 02:15:45 +00:00
Yanbo Liang	e72ed4717e	[Dynamo] Fix Huggingface PretrainedConfig get non const attr (#135413 ) Fixes #135329 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135413 Approved by: https://github.com/anijain2305	2024-09-07 19:16:29 +00:00
drisspg	3bebc09be9	[FlexAttention] Align the matmul tensorcore usage (#135168 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135168 Approved by: https://github.com/Chillee	2024-09-07 16:33:41 +00:00
Sam Larsen	a2db22e6bb	[inductor] Catch BrokenProcessPool and print a more helpful message. (#135120 ) Summary: BrokenProcessPool means a parallel-compile subprocess exited, which we never expect. It's likely due to a crash, so print a more meaningful error message and instructions that it's probably easier to debug by turning off parallel compile. Output looks like: ``` ... File "/data/users/slarsen/pytorch/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module exec(code, mod.__dict__, mod.__dict__) File "/tmp/torchinductor_slarsen/4q/c4qw7xk5lbb7whg5txnk4hwbc7z6kepak3o666tr3d64gcad5r5b.py", line 815, in <module> async_compile.wait(globals()) File "/data/users/slarsen/pytorch/torch/_inductor/async_compile.py", line 265, in wait raise RuntimeError( RuntimeError: A compilation subprocess exited unexpectedly. This is likely due to a crash. To facilitate debugging, you can re-run with TORCHINDUCTOR_COMPILE_THREADS=1 to cause compilation to occur in the main process. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135120 Approved by: https://github.com/Chillee	2024-09-07 16:33:37 +00:00
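A minimal sketch of the pattern described in the commit above, using only the Python standard library. The helper name `wait_for_compile` and the exact message wording are assumptions for illustration, not Inductor's code; only the `TORCHINDUCTOR_COMPILE_THREADS=1` hint comes from the commit message. A `BrokenProcessPool` surfaced by a worker future is converted into a `RuntimeError` that tells the user how to debug.

```python
from concurrent.futures import Future
from concurrent.futures.process import BrokenProcessPool


def wait_for_compile(futures):
    """Collect results, translating a dead worker pool into a clearer error.

    Hypothetical helper for illustration; not the Inductor implementation.
    """
    results = []
    for fut in futures:
        try:
            results.append(fut.result())
        except BrokenProcessPool as e:
            # Keep the original exception as the cause so the traceback
            # still shows where the pool broke.
            raise RuntimeError(
                "A compilation subprocess exited unexpectedly. "
                "Re-run with TORCHINDUCTOR_COMPILE_THREADS=1 to compile "
                "in the main process."
            ) from e
    return results
```

Raising with `from e` keeps the underlying pool failure attached as `__cause__`, so the friendlier message does not hide the original error.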
Jason Ansel	eac5e12548	[inductor] Move LoopBody to its own file (#135257 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135257 Approved by: https://github.com/oulgen	2024-09-07 16:29:15 +00:00
Wu, Chunyuan	18479c5f70	[Doc] update max-autotune for CPU (#134986 ) The current doc for `max-autotune` is applicable only for GPU. This PR adds the corresponding content for CPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134986 Approved by: https://github.com/jgong5, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-07 13:42:40 +00:00
CaoE	f7c0c06692	Add oneDNN BRGEMM support on CPU (#131878 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131878 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-09-07 13:22:30 +00:00
Yu, Guangye	b53d97c7be	[Intel GPU] Add XPU memory-related APIs (#129919 ) # Motivation According to https://github.com/pytorch/pytorch/issues/116322, we will help unify the device allocator. So we introduce a simple xpu device allocator only with the key functionality first. And expect to add some memory statistics-related functionality after the unification. But now, some memory statistic-related APIs listed in https://github.com/pytorch/pytorch/issues/127929 are requested. We need more time to unify the device allocator. In order to facilitate the user experience, we expect to support these memory statistic-related APIs before the unification. # Additional Context Fixes: #127929 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129919 Approved by: https://github.com/dvrogozh, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD ghstack dependencies: #130923	2024-09-07 11:15:17 +00:00
Yu, Guangye	6c1da66407	[Reland] Refactor caching device allocator utils (#130923 ) # Motivation Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322), this PR aims to refactor caching device allocator utils to improve code reuse. This is the first PR; follow-up PRs will continue to refactor the device caching allocator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy	2024-09-07 11:14:17 +00:00
Jiong Gong	d7c97e7245	[inductor][cpp][gemm] cache blocking config for dynamic shapes (#133538 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133538 Approved by: https://github.com/leslie-fang-intel ghstack dependencies: #135277, #133447 Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>	2024-09-07 11:09:30 +00:00
Jiong Gong	be9f4ffe88	[inductor][cpp][gemm] enable dynamic M for k-slicing (#133447 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133447 Approved by: https://github.com/leslie-fang-intel ghstack dependencies: #135277 Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>	2024-09-07 11:09:30 +00:00
Jiong Gong	692faa9bc6	[inductor][cpp][gemm] reduce memory alloc overhead by allocating local acc once per thread (#135277 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135277 Approved by: https://github.com/leslie-fang-intel Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>	2024-09-07 11:09:25 +00:00
Justin Chu	32f3af72b7	[ONNX] Support FakeTensor in ONNXProgram (#135399 ) Sync with https://github.com/justinchuby/torch-onnx/compare/v0.1.20...v0.1.21 to support FakeTensors in ONNXProgram. Specifically, this PR implements the `apply_weights` method to allow users to supply a dictionary of concrete tensors to replace FakeTensors in the exported model weights. An error is raised when users try to serialize a FakeTensor to avoid segfaults. Also fixed a bug in `.save()` when `keep_initializers_as_inputs` is True and `include_initializers` is False. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135399 Approved by: https://github.com/titaiwangms	2024-09-07 04:48:18 +00:00
Yanbo Liang	ebab5c85c4	[FlexAttention] Skip very small block size unit tests on H100 due to Triton bug (#135393 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135393 Approved by: https://github.com/BoyuanFeng	2024-09-07 04:35:22 +00:00
Justin Chu	3d734d837b	[ONNX] Handle mixed sequence inputs properly (#135378 ) Previously, when an input contains a mixture of `Value` and python constants like `[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]`, we get errors like ```pytb Traceback (most recent call last): File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 367, in _call_op converted_named_inputs = _process_python_constants_and_sequences( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 275, in _process_python_constants_and_sequences raise TypeError( TypeError: Constant input '[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]' of type '<class 'list'>' is not supported ``` This PR updates Sequence handling to support this case, as well as variadic inputs and ONNX Sequence inputs. Synced from https://github.com/justinchuby/torch-onnx/pull/187 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135378 Approved by: https://github.com/titaiwangms	2024-09-07 03:07:39 +00:00
Yiming Zhou	c92227c41a	[quant][pt2e] fix placeholder typo and related quantization tests (#135379 ) A previous typo on "placeholder" and related tests in quantization are fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135379 Approved by: https://github.com/jerryzh168	2024-09-07 02:31:43 +00:00
blaine-rister	e6a0221fc6	[Inductor] Optionally allow padding on non-GPU devices (#135280 ) This is the OSS component of a larger MTIA diff. Currently, Inductor disables padding for non-GPU devices. We need to change this behavior to enable padding on MTIA. This PR adds a config option to enable padding on the CPU, or any other non-GPU device. In the future, we might want to enable padding on all devices by default. However, that might require supporting device-dependent padding defaults, since CPUs will likely use different settings than H100 GPUs. Differential Revision: D61038114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135280 Approved by: https://github.com/jfix71, https://github.com/shunting314	2024-09-07 02:19:14 +00:00
Justin Chu	a6b9d444fb	[ONNX] Refactor exporter errors (#135180 ) Refactor exporter errors to combine old errors and new errors for API consistency. This PR also 1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited. 2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the onnx program object, making it very bulky. Instead, we revert to use the plain OnnxExporterError defined in the `errors` module and use it as the base class for all errors. 3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`. 4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact. 5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct. Fixes https://github.com/pytorch/pytorch/issues/135125 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180 Approved by: https://github.com/titaiwangms	2024-09-07 00:50:15 +00:00
Sergii Dymchenko	d42b0c8f22	Add release matrix for 2.5 (#135383 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135383 Approved by: https://github.com/huydhn	2024-09-07 00:49:53 +00:00
Will Feng	941d094dd1	[Dynamo][DTensor] Fixes SymNodeVariable() is not a constant error in Compiled DDP + TP unit test (#135315 ) Before the fix, the unit test will fail at forward Dynamo tracing: ``` File "/data/users/willfeng/pytorch/test/distributed/_composable/test_replicate_with_compiler.py", line 415, in test_ddp_tp loss = compiled_replicate_model(data).sum() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ... torch._dynamo.exc.InternalTorchDynamoError: SymNodeVariable() is not a constant from user code: File "/data/users/willfeng/pytorch/torch/distributed/tensor/parallel/_data_parallel_utils.py", line 34, in _unflatten_tensor result = DTensor.from_local( ``` After the fix, the compilation fails at a later step (Compiled Autograd tracing), due to needing "pre-dispatch tracing of backward graph" feature (see details at https://github.com/pytorch/pytorch/issues/127797#issuecomment-2291695474). I believe this PR is a net improvement, because it should also fix the 1D Traceable FSDP2 failure case on internal models (https://github.com/pytorch/pytorch/issues/130978#issuecomment-2319476690), which is much harder to build a minimal unit test for. Fixes https://github.com/pytorch/pytorch/issues/130978. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135315 Approved by: https://github.com/bdhirsh	2024-09-07 00:11:25 +00:00
Shangdi Yu	b1a934741e	Change test_constant_prop_preserve_metadata (#135268 ) Summary: In new export_for_training, "stack_trace" does not exist in node meta anymore. Test Plan: ``` buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e -- -r test_constant_prop_preserve_metadata ``` Reviewed By: angelayi Differential Revision: D62219974 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135268 Approved by: https://github.com/angelayi	2024-09-07 00:02:35 +00:00
Sahan Paliskara	0c661f3e1a	[Split Build] Refactor split build binary builds into their own workflows and move split build binary builds to periodic (#134624 ) We need to move split build binary tests from trunk to periodic, so this PR refactors those jobs out into their own workflow. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134624 Approved by: https://github.com/malfet	2024-09-06 23:57:56 +00:00
leslie-fang-intel	2c7e314803	[Inductor][CPP] Fix the issue of view dtype (#135301 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/135160, it's a regression introduced by https://github.com/pytorch/pytorch/pull/134569, where the dtype of `to_dtype_bitcast` was incorrectly handled when using the scalarize implementation. TestPlan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_view_dtype ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135301 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-09-06 23:36:44 +00:00
Sun, Jiayi	ead4407f57	[inductor] Fix loop split optimization (#135303 ) Fix https://github.com/pytorch/pytorch/issues/135274. Improve the check whether the div expr matches: add a check whether `split_var` is in `original_body.iter_vars`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135303 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2024-09-06 23:06:25 +00:00
Henry Tsang	2f5b40c099	[aoti test] Disable FP8 fnuz dtypes in fp8 runtime check test (#135373 ) Fixing https://github.com/pytorch/pytorch/issues/126734 The key point is that the fnuz FP8 types are for AMD only. source: https://github.com/openxla/stablehlo/blob/main/rfcs/20230321-fp8_fnuz.md Pull Request resolved: https://github.com/pytorch/pytorch/pull/135373 Approved by: https://github.com/chenyang78	2024-09-06 23:05:47 +00:00
Yidi Wu	993b5647ab	[export] fix placeholder name collision tests by removing map call (#135366 ) The current test is failing because of the current unstable state of map. torch.compile and non-strict export take two separate routes, unlike cond and while_loop. This PR fixes the test itself. We'll fix map in follow-up PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135366 Approved by: https://github.com/angelayi	2024-09-06 22:02:50 +00:00
Sam Larsen	2ab26806f1	Require tlparse for failing tests in test_structured_trace.py (#135376 ) Summary: These tests are currently failing internally. Per discussion, skip if tlparse is unavailable Test Plan: ``` feature remove tlparse buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --run-disabled --regex test_structured_trace.py feature install tlparse buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --run-disabled --regex test_structured_trace.py ``` Differential Revision: D62310342 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135376 Approved by: https://github.com/ezyang	2024-09-06 21:53:41 +00:00
Jane Xu	b1612569f6	[BE] Clarify defaulting behavior in optimizer (#135384 ) Fixes #135340 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135384 Approved by: https://github.com/drisspg, https://github.com/jainapurva	2024-09-06 21:52:55 +00:00
Will Constable	dc0e818738	[FR] Automatically infer a common filename prefix (#135158 ) Save the annoyance of specifying this on the command line each time Pull Request resolved: https://github.com/pytorch/pytorch/pull/135158 Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o ghstack dependencies: #135157	2024-09-06 21:44:27 +00:00
Will Constable	06e414d7fe	[FR] Make trace_dir a required argument (#135157 ) Ensures users get a clean error if they forget to specify the dir, and improves the help message. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135157 Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj	2024-09-06 21:44:27 +00:00
PyTorch MergeBot	a681260caf	Revert "[ONNX] Refactor exporter errors (#135180 )" This reverts commit 5eebd9315a72422d59b6f8d8ca8e4e573e231d5c. Reverted https://github.com/pytorch/pytorch/pull/135180 on behalf of https://github.com/clee2000 due to I think this broke test_public_bindings.py::TestPublicBindings::test_correct_module_names [GH job link](https://github.com/pytorch/pytorch/actions/runs/10743909338/job/29800779403) [HUD commit link](`5eebd9315a`), possibly a landrace with the PR that landed before it ([comment](https://github.com/pytorch/pytorch/pull/135180#issuecomment-2334844191))	2024-09-06 21:39:18 +00:00
William Wen	95e976a63f	[dynamo] recursively skip frames when Dynamo cache limit is hit (#135144 ) Fixes https://github.com/pytorch/pytorch/pull/135144 and [T197117723](https://www.internalfb.com/intern/tasks/?t=197117723). In general, adds `SkipCodeRecursiveException` to Dynamo - when raised in Dynamo, convert_frame will return a `skip_code_recursive_flag` back to C Dynamo, signaling it to skip the current frame and all recursive calls. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135144 Approved by: https://github.com/jansel, https://github.com/anijain2305	2024-09-06 21:38:53 +00:00
Catherine Lee	306ac44eaa	[ez][TD] Fix request for issue body returns None (#135389 ) I assumed it would be an empty string if the body is empty, but it's just None. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135389 Approved by: https://github.com/malfet	2024-09-06 21:02:01 +00:00
Vadym Khortiuk	a7643baceb	Revert expectFailureIf condition on tests with torch.compile on Windows (#134759 ) Fixes #134716 This PR reverts some changes introduced in `6eae569546` (#133987) torch.compile is not available on Windows, tests should be expected to fail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134759 Approved by: https://github.com/malfet	2024-09-06 20:51:55 +00:00
William Wen	a4030e37be	[dynamo] reland map/zip iterator related changes (#135074 ) Differential Revision: [D62211019](https://our.internmc.facebook.com/intern/diff/D62211019) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135074 Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/mlazos	2024-09-06 20:38:02 +00:00
Henry Tsang	22e1fb6faa	[test][easy] Add debug utils for cpu select algorithm test (#135038 ) Summary: Add debug utils to debug a flaky test in fbcode ci. Some context: https://github.com/pytorch/pytorch/pull/126545 Test Plan: ci Differential Revision: D62005445 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135038 Approved by: https://github.com/jgong5, https://github.com/XuehaiPan	2024-09-06 20:30:49 +00:00
titaiwangms	2a4890e315	[ONNX] Clean up the missed lines from previous PRs (#135368 ) Some missed deleted lines Pull Request resolved: https://github.com/pytorch/pytorch/pull/135368 Approved by: https://github.com/justinchuby	2024-09-06 20:27:52 +00:00
Tristan Rice	3ce433aef2	[TCPStore] use wait counters (#135283 ) This replaces the existing TCPStore counters with the new shared wait counters. There are no users of the TCPStore counters, so they should be completely safe to remove. Test plan: Existing tests + build. There's no OSS backend for wait counters, so we can't write any tests with them currently. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135283 Approved by: https://github.com/c-p-i-o	2024-09-06 19:54:25 +00:00
Jane Xu	7f2d20e687	Run all autograd node post hooks (#134728 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134728 Approved by: https://github.com/albanD, https://github.com/soulitzer	2024-09-06 19:44:28 +00:00
titaiwangms	32fd29c1ea	[ONNX] Properly handle Attributes in traceable functions (#135367 ) Previously the attributes were sent in as Attr objects even when we call the function as a plain Python function. Turning them into python objects. From https://github.com/justinchuby/torch-onnx/pull/186 Related https://github.com/microsoft/onnxscript/issues/1846 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135367 Approved by: https://github.com/justinchuby	2024-09-06 19:35:22 +00:00
Justin Chu	5eebd9315a	[ONNX] Refactor exporter errors (#135180 ) Refactor exporter errors to combine old errors and new errors for API consistency. This PR also 1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited. 2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the onnx program object, making it very bulky. Instead, we revert to use the plain OnnxExporterError defined in the `errors` module and use it as the base class for all errors. 3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`. 4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact. 5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct. Fixes https://github.com/pytorch/pytorch/issues/135125 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180 Approved by: https://github.com/titaiwangms	2024-09-06 19:10:56 +00:00
Nowtryz	a15aabc975	Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262 ) Hi, I noticed the `unfold` operator was missing on MaskedTensor. I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it? Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262 Approved by: https://github.com/cpuhrsch	2024-09-06 19:06:23 +00:00
Jokeren	b143426db3	[Inductor] Use argument names as the key for the `constants` dict and the `signature` dict (#135170 ) Referencing how triton constructs these dictionaries `ca3fb5f6fa/python/triton/runtime/jit.py (L639)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135170 Approved by: https://github.com/htyu	2024-09-06 19:05:00 +00:00
Oguz Ulgen	13ba0a2e5c	Run bypassed graph compile outside the except block to avoid chaining of exceptions (#135175 ) Fixes #135172 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135175 Approved by: https://github.com/masnesral, https://github.com/ezyang	2024-09-06 19:03:57 +00:00
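To illustrate why the commit above moves the fallback compile out of the `except` block, here is a small standard-library sketch. `CacheBypass` and `compile_with_fallback` are hypothetical names, not the actual Inductor classes. Python attaches the exception being handled as `__context__` to anything raised inside the handler, so an unrelated compile error would print with a misleading "During handling of the above exception" trailer. Running the fallback after the handler has exited avoids that.

```python
class CacheBypass(Exception):
    """Hypothetical signal that the cached fast path cannot be used."""


def compile_with_fallback(fast_path, fallback):
    """Try fast_path; on a bypass, run fallback outside the except block.

    Hypothetical helper for illustration only.
    """
    try:
        return fast_path()
    except CacheBypass:
        # Do not call fallback() here: any error it raised would be
        # implicitly chained to the CacheBypass being handled.
        pass
    # The handler has finished, so errors raised from here on carry no
    # implicit context from the bypass.
    return fallback()
```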
wdziurdz	8520ce5f78	Fix incorrect trace of post-accumulate grad hook on tensor with zero dims (#135226 ) Fix incorrect trace of post-accumulate grad hook on tensor with zero dimensions Fixes #135207 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135226 Approved by: https://github.com/xmfan	2024-09-06 18:19:54 +00:00
Tristan Rice	196748d491	[elastic] support local_addr across all rendezvous impls (#135262 ) Summary: There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that made `local_addr` no longer used. This fixes that by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used. This also fixes a number of tests, allowing them to be run in parallel, which hugely sped up the testing cycle as this change touches many different rendezvous implementations. This required a few fixes in unrelated tests. Test Plan: Added `local_addr` tests for the common rendezvous implementations to prevent future regressions. ``` buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3 ``` To vet the parallelism changes I also ran with 3 stress runs each to identify flakiness caused by parallelism. Differential Revision: D62256407 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262 Approved by: https://github.com/fduwjj, https://github.com/wz337	2024-09-06 17:55:43 +00:00
Pian Pawakapan	177e4f4218	remove _check call on item() for torch.istft (#135234 ) Fixes #135014 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135234 Approved by: https://github.com/tugsbayasgalan	2024-09-06 17:31:25 +00:00
Henry Tsang	3988b3468b	[aoti][easy] remove breakpoint() in wrapper.py (#134807 ) Differential Revision: D61687146 Remove an unintended breakpoint in code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134807 Approved by: https://github.com/YUNQIUGUO	2024-09-06 17:25:05 +00:00
Zhengxu Chen	04118d8617	[export] Record the global torch version in serialization. (#135243 ) Summary: In general I think it will be useful to also record the global torch version in the EP, so that we can track them in the logging in addition to the schema version. Test Plan: CI Reviewed By: henryoier Differential Revision: D62252626 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135243 Approved by: https://github.com/yushangdi	2024-09-06 17:02:06 +00:00
Riley Dulin	24482e5c68	[torch][fx] Set maximum warning count during fx.Graph.lint (#135069 ) Summary: resnet152 spent about 15 minutes writing warning messages in _unlift during `to_executorch` because they're all written to unbuffered stderr by the `warnings` module. These warnings are almost always about get_attr nodes referencing a non-existent name: ```lang=py warnings.warn(f'Node {node} target {node.target} {atom} of {seen_qualname} does ' 'not reference an nn.Module, nn.Parameter, or buffer, which is ' 'what \'get_attr\' Nodes typically target' ) ``` I'm not aware of a way to configure the warnings module to write this out at most once, so I'm just going to disable the lint for now. Test Plan: Re-ran resnet152 with Executorch and the XNNPackBackend, it is much faster now Differential Revision: D62156090 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135069 Approved by: https://github.com/yushangdi	2024-09-06 16:41:59 +00:00
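The commit title above mentions capping the number of warnings emitted during linting. A minimal sketch of that idea follows, assuming a hypothetical `lint_names` function and an arbitrary cap of 3; neither is taken from `torch.fx`. Each offending name produces a warning until the cap is reached, after which the loop keeps running silently.

```python
import warnings

MAX_WARNINGS = 3  # assumed cap, chosen only for this illustration


def lint_names(names, known, max_warnings=MAX_WARNINGS):
    """Warn about unknown names, but emit at most max_warnings messages.

    Hypothetical helper; returns the number of warnings actually emitted.
    """
    emitted = 0
    for name in names:
        if name in known:
            continue
        if emitted < max_warnings:
            warnings.warn(f"{name} does not reference a known attribute")
            emitted += 1
    return emitted
```

Counting inside the linter, rather than relying on the `warnings` filters, keeps the cap effective even when every message is distinct.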
yanbing-j	c0ec599f27	Update submodule ideep to include aarch64 change (#134897 ) This PR is per ARM request, which is in https://github.com/intel/ideep/issues/334. Context for the request is: Arm team has upstreamed the dynamic quantization changes, all the PRs were merged (torch, ideep, oneDNN), but without this ideep submodule update, the feature will not work. The change is isolated to only matmul operator and quantization path alone. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134897 Approved by: https://github.com/jgong5, https://github.com/atalman, https://github.com/snadampal	2024-09-06 16:40:26 +00:00
Alfredo Tupone	7074de43c0	Porting to GCC 15 (#135188 ) uint8_t is declared in the cstdint header. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135188 Approved by: https://github.com/Skylion007	2024-09-06 16:16:53 +00:00
Rachel Guo	771dcce11d	[AOTI][Tooling][6/n] Fix long dtype input tensors calling `mean()` in `aoti_torch_print_tensor_handle` (#135072 ) Differential Revision: D61635232 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135072 Approved by: https://github.com/hl475, https://github.com/ColinPeppler	2024-09-06 15:59:32 +00:00
Avik Chaudhuri	de74aafff4	error on exporting ScriptModule (#135302 ) Test Plan: added test Differential Revision: D62279179 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135302 Approved by: https://github.com/yushangdi	2024-09-06 15:12:40 +00:00
rzou	ad29a2c0dc	Add Inductor config for default stride behavior (#135238 ) By default, Inductor is allowed to manipulate the layout (strides+storage offset) of input tensors to custom operators. We want to change it so that the default is that Inductor should respect the stride order of input tensors to custom operators. This PR adds a config to toggle the behavior, in the next PR up we'll change the default. We also make the following changes: - We add a new operator Tag (flexible_layout), which means that inductor is allowed to manipulate the layout. When we flip the default, users can specify they want the old behavior by using this tag. This is a reland of https://github.com/pytorch/pytorch/pull/126986, which was previously reverted due to silent incorrectness. We've since fixed the silent incorrectness (https://github.com/pytorch/pytorch/pull/133639) Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/135238 Approved by: https://github.com/albanD	2024-09-06 14:48:24 +00:00
Yiwen Shi	3a9e33dca8	[torchelastic] Don't do signal handling when off the main thread (#135088 ) Summary: In multiprocessing, signal handling is not possible if the thread is not the main thread. This resulted in the following error: > "ValueError('signal only works in main thread of the main interpreter')" To address this issue, the diff checks whether the thread is the main thread and, if not, skips signal handling. Test Plan: Before this change, MAST job failed: https://fburl.com/mlhub/iq2m10v8 With this change, MAST job succeeded: https://fburl.com/mlhub/q6kb8343 Differential Revision: D62166943 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135088 Approved by: https://github.com/d4l3k	2024-09-06 14:47:03 +00:00
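The check described in the commit above can be sketched with the standard library alone. `install_sigterm_handler` is a hypothetical helper, not the torchelastic implementation: it installs a handler only when called from the main thread, because `signal.signal()` raises `ValueError` anywhere else.

```python
import signal
import threading


def install_sigterm_handler(handler):
    """Install handler for SIGTERM only when called from the main thread.

    Returns True if the handler was installed, False if it was skipped.
    Hypothetical helper for illustration only.
    """
    if threading.current_thread() is not threading.main_thread():
        # signal.signal() is only allowed in the main thread of the main
        # interpreter, so skip instead of raising ValueError.
        return False
    signal.signal(signal.SIGTERM, handler)
    return True
```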
David Berard	a086882d72	[inductor][triton] mark workspace args as mutated (#134648 ) SplitScan makes use of a workspace arg that needs to be zeroed before it is used - then, it is used to communicate between thread blocks during the triton kernel implementation. It is mutated during the execution of the kernel, so it should be marked as such. Before this PR, it is not marked as mutated; AFAIK this is fine during normal execution, but during autotuning it causes problems. The workspace starts off zeroed (as expected), but during autotuning the kernel will be executed multiple times and the workspace does not get re-set between executions, resulting in incorrect data. If the data is used for indexing, then you can fail device-side asserts (and the results after the initial run (with autotuning) could be wrong). The test added in this PR repros the issue when the fix is removed. When we mark the arg as mutated, then the arg gets cloned before autotuning, so that the arg passed to the kernel during autotuning will always be zeroed as expected. `804852c1f9/torch/_inductor/runtime/triton_heuristics.py (L685-L689)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134648 Approved by: https://github.com/peterbell10, https://github.com/jansel	2024-09-06 14:23:37 +00:00
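A toy model of the autotuning problem described in the commit above, written in plain Python rather than Triton. `toy_kernel`, `benchmark_configs`, and `mutated_idx` are hypothetical names for illustration. The kernel assumes a zeroed workspace and mutates it. If the benchmark loop clones the arguments marked as mutated, every run starts from a fresh copy; otherwise later runs see leftover state.

```python
import copy


def toy_kernel(config, workspace):
    """Stand-in for a kernel that assumes a zeroed workspace and mutates it."""
    workspace[0] += 1
    return workspace[0]


def benchmark_configs(kernel, configs, args, mutated_idx):
    """Run kernel once per config, cloning the args listed in mutated_idx.

    Hypothetical helper for illustration; not Inductor's autotuner.
    """
    results = {}
    for config in configs:
        # Clone only the arguments the kernel is known to mutate.
        run_args = [
            copy.deepcopy(arg) if i in mutated_idx else arg
            for i, arg in enumerate(args)
        ]
        results[config] = kernel(config, *run_args)
    return results
```

With `mutated_idx={0}` every config observes a zeroed workspace and the caller's buffer is left untouched. With an empty set, the second and third runs read the values written by earlier runs, which mirrors the stale-workspace failure the commit fixes.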
Will Feng	84ae6b7d6b	AOTDispatcher: limit cases when we detach() graph inputs to non-leaves (#134193 ) This PR is slightly a revival / update to the discussion from https://github.com/pytorch/pytorch/pull/98960: Part of FSDP2's tracing strategy right now is that: (1) it is painful/difficult to handle the case where we have multiple graph input tensors that are aliased to each other and at least one of them is duplicated (2) we already have longstanding logic to remove duplicate input tensors from the graph in dynamo. Morally, FSDP2 gives us duplicate input tensors in the backward graph for every `unsharded_param`, because we have (a) the `unsharded_param` being closed over by the backward hook to resize/allgather, and (b) the same `unsharded_param` being saved for backward by autograd (we now guarantee in the partitioner that we will always save the base tensor for backward and recompute views) (3) However, we were still seeing cases where the `unsharded_param` showed up twice in the backward graph inputs, as distinct tensor objects (with different python ids) instead of being true duplicates that dynamo can de-dup. It turns out that this was because we were `.detach()`ing the `unsharded_param` in AOTDispatcher before plumbing it through the compiled forward (and so autograd would save a detach'd version of the `unsharded_param`). This is precisely because of the logic from https://github.com/pytorch/pytorch/pull/98960. However, re-reading the detailed comments, it seems unnecessary to do a detach() on a graph input that is a (leaf) `nn.Parameter`, even if it happens to get no gradients in the backward. Since it is a leaf, we don't have to worry about the autograd engine "continuing to backprop through the graph beyond the current tensor" (the leaf has no other grad_fn for autograd to backprop through). 
So this PR makes us a bit less aggressive about calling detach() on inputs: we only do it when: (1) our graph input statically will get a `None` gradient (and also has no metadata mutations, the existing state) (2) and our graph input is a non-leaf tensor (so detach()ing is actually required to prevent autograd from incorrectly backpropping past the non-leaf). Pull Request resolved: https://github.com/pytorch/pytorch/pull/134193 Approved by: https://github.com/yf225 Co-authored-by: Will Feng <yf225@cornell.edu>	2024-09-06 14:06:48 +00:00
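The two conditions above reduce to a small predicate. The function and parameter names here are hypothetical and only restate the rule from the commit message; they are not the AOTDispatcher implementation.

```python
def should_detach_input(gets_none_grad: bool,
                        has_metadata_mutation: bool,
                        is_leaf: bool) -> bool:
    """Return True when a graph input would be detach()ed under the new rule."""
    # (1) The input statically gets a None gradient and has no metadata mutations.
    no_grad_needed = gets_none_grad and not has_metadata_mutation
    # (2) The input is a non-leaf, so autograd could otherwise backprop past it.
    return no_grad_needed and not is_leaf
```

A leaf `nn.Parameter` that receives no gradient gives `should_detach_input(True, False, True) == False`, so it is passed through without a detach.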
Julia Guo	60a097a071	[CD] Update binary_linux_test.sh to include calling builder smoke test (#133869 ) Run smoke test Fixes #1969 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133869 Approved by: https://github.com/atalman Co-authored-by: Andrey Talman <atalman@fb.com>	2024-09-06 13:27:24 +00:00
Wu, Chunyuan	13bae39e22	[inductor] [cpp] improve cache blocking for is_dynamic_M (#131306 ) ## Performance Models with >= 3% performance speedup are listed below: ### AMP single-thread dynamic shape (measured on CPU with AMX support) No regressions \| Model Family \| Model Name \| Speedup \| \|--------------\|------------\|---------\| torchbench \| soft_actor_critic\| 3% Pull Request resolved: https://github.com/pytorch/pytorch/pull/131306 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel ghstack dependencies: #135275 Co-authored-by: Jiong Gong <jiong.gong@intel.com>	2024-09-06 13:21:24 +00:00
Jiong Gong	4ef6c05f65	[inductor][cpp][gemm] fix autotune runtime error from linear_binary fusion (#135275 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135275 Approved by: https://github.com/leslie-fang-intel	2024-09-06 13:21:23 +00:00
Edward Z. Yang	d6b9bd3e60	Also handle compiler collective when input variable doesn't exist on all ranks (#135147 ) Internal xref: https://fb.workplace.com/groups/3095840833991792/permalink/3810738595835342/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135147 Approved by: https://github.com/jansel	2024-09-06 13:18:36 +00:00
Edward Z. Yang	d0591f4658	Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053 ) Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/ This now also incorporates a test from https://github.com/pytorch/pytorch/pull/133585 (which it fixes) and the prep PR https://github.com/pytorch/pytorch/pull/134407 Including the PR desc from that: I am trying to fix a problem reported by user in [fb.workplace.com/groups/6829516587176185/permalink/7705964779531357](https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/) The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis). In https://github.com/pytorch/pytorch/pull/133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way. I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. 
Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053 Approved by: https://github.com/ydwu4	2024-09-06 13:13:15 +00:00
Yan Zhiwei	b5dea061c8	check compilation status before query cudnn version in conv (#135332 ) This PR fixes https://github.com/pytorch/pytorch/issues/135322. The cudnn compilation status should be checked first, before querying the version; otherwise, conv may trigger a RuntimeError before any check in other non-CUDA backends. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135332 Approved by: https://github.com/EikanWang, https://github.com/atalman	2024-09-06 12:50:04 +00:00
Michael Lazos	041960a1ce	[Dynamo] Automatically in-graph traceable tensor subclass ctors (#135151 ) Fixes https://github.com/pytorch/pytorch/issues/114389 Previously, dynamo would attempt to trace through the `__init__` of traceable tensor subclasses. Since their constructors are AOT dispatcher traceable by definition, dynamo should automatically put these in the graph like we do for any other tensors. Not doing this is difficult because dynamo would need to apply mutations post tensor subclass creation in the graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135151 Approved by: https://github.com/bdhirsh	2024-09-06 12:23:38 +00:00
Sun, Jiayi	67c7924ea1	[inductor] Fix gen_transposed_tile_load_store (#135307 ) Recent PR: https://github.com/pytorch/pytorch/pull/131745 brings new VLA logic to cpp codegen. It raises a build failure on MSVC with error code `Compiler Error C2131`: https://learn.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2131?view=msvc-170 Reproduce UT: ```cmd pytest test\inductor\test_torchinductor_dynamic_shapes.py -v -k test_large_block_sizes_dynamic_shapes_cpu ``` Original generated code: ```c++ alignas(16) float tmp1[static_cast<int64_t>(((-256LL)*(c10::div_floor_integer(static_cast<int64_t>(ks1), static_cast<int64_t>(16LL)))) + (16LL*ks1))]; ``` Changes: allocate a large-enough fixed-sized buffer. New generated code: ```c++ alignas(16) float tmp1[16*16]; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135307 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-09-06 10:44:08 +00:00
penguin-wwy	217ba7b2ab	[Docs] Update FileCheck doc (#135199 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135199 Approved by: https://github.com/soulitzer	2024-09-06 08:18:38 +00:00
CaoE	758d515d98	[Inductor][CPP] Select tiling factor for lower precision data types (#133830 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133830 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-09-06 08:12:37 +00:00
Feng Yuan	60d98b4cfb	Update torch-xpu-ops pin (ATen XPU implementation) (#135300 ) Release cycle for PyTorch 2.5 1. Bugfixing: correct reduction logic in cdist kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135300 Approved by: https://github.com/EikanWang	2024-09-06 07:30:09 +00:00
Shangdi Yu	590a3e9f8a	[export][training ir migration] quantized_decomposed.quantize_per_tensor decomposition (#134525 ) Summary: In graph of TestXNNPACKQuantizer.test_dynamic_linear_with_con test, some quantized_decomposed.quantize_per_tensor.default ops are becoming quantized_decomposed.dequantize_per_tensor.tensor ops when using the new training ir. This is because we lift params/buffers before calling make_fx. So previously, for the graph that’s passed to make_fx, `graph.L__self___linear1.weight` is a tensor; now, in training ir, `graph.L__self___linear1.weight` is a FakeTensor. This caused the node overload to be different. Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_linear_with_conv ``` Differential Revision: D61364547 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134525 Approved by: https://github.com/tugsbayasgalan, https://github.com/jerryzh168	2024-09-06 07:06:06 +00:00
drisspg	764ee6e3f9	[FlexAttention] Specify padding_value for boundary checked loads (#134573 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134573 Approved by: https://github.com/Chillee	2024-09-06 06:47:26 +00:00
wz337	67f98a99a4	[DeviceMesh][Easy] Make RuntimeError a bit more descriptive by including the actual world_size (#135271 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135271 Approved by: https://github.com/fduwjj	2024-09-06 06:23:20 +00:00
fduwjj	e020a8755a	[Fix][FR][ez] Remove debugging logs (#135308 ) Removing the print added during debugging process. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135308 Approved by: https://github.com/wz337	2024-09-06 06:14:33 +00:00
Jason Ansel	7ffb3b201c	[inductor] Remove LoopBody.reads,writes,other (#135256 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135256 Approved by: https://github.com/oulgen ghstack dependencies: #135070, #135076, #135082, #135084, #135079, #135235	2024-09-06 06:11:55 +00:00
Jason Ansel	f946bf88c4	[inductor] Skip retracing an existing LoopBody (#135235 ) This is roughly a 7% speedup in inductor compile time for hf_Bert_large. The time spent in `LoopBody.__init__` improves from 15% to 8% of `fx_codegen_and_compile`. Before ![image](https://github.com/user-attachments/assets/7de0f28e-35bd-472f-b4be-b52733d2a85c) After ![image](https://github.com/user-attachments/assets/5f0cf11a-43c5-43ae-b13c-f32383a75a7f) Overall ![image](https://github.com/user-attachments/assets/6a369d8c-fb5e-4ad2-9504-0fc745ad6568) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135235 Approved by: https://github.com/oulgen ghstack dependencies: #135070, #135076, #135082, #135084, #135079	2024-09-06 06:11:55 +00:00
Jason Ansel	66da3b3b2a	[fx] Bypass custom __setattr__ in Node.__init__ (#135079 ) Before: ![image](https://github.com/user-attachments/assets/5f0a6ae6-6049-44d0-b5f2-a549a23ad97f) After: ![image](https://github.com/user-attachments/assets/51c9f91b-f8a0-4043-8362-65813feec823) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135079 Approved by: https://github.com/oulgen ghstack dependencies: #135070, #135076, #135082, #135084	2024-09-06 06:11:46 +00:00
Laith Sakka	41e653456e	[RDP] Fix "No module named 'libfb’" (#135244 ) Summary: D62215095 Introduced an import error to arvr pipelines as the is_fbcode() function does not work as intended. This changes is_fbcode() to be a much stricter check. Test Plan: ``` buck2 run arvr/mode/platform010/opt-stripped //arvr/libraries/depthlink/clients/mr_replay:pipeline_runner -c bolt.use_eva3_sim=True -- --config_file arvr/libraries/depthlink/clients/mr_replay/configs/runner_config.yaml --features DEPTH ``` Differential Revision: D62237502 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135244 Approved by: https://github.com/aorenste	2024-09-06 04:52:31 +00:00
chilli	e40a0a9359	Add randomness checking for sdpa vmap (#135176 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135176 Approved by: https://github.com/zou3519	2024-09-06 04:50:49 +00:00
Xuan Zhang	c05a7adb36	[inductor][debug] fix draw_buffers (#135266 ) Before: ![image](https://github.com/user-attachments/assets/aac756f3-1349-4647-9da3-87cf105cf647) After: <img width="791" alt="image" src="https://github.com/user-attachments/assets/d72c663c-e598-42fa-ac40-9e58956f1ec1"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135266 Approved by: https://github.com/yf225	2024-09-06 04:12:41 +00:00
hippocookie	5f57be7571	[Distributed] Change function call in test to non-deprecated to eliminate warning (#134938 ) Migrate function call in test to eliminate warning message in below and reduce the chance of test fail when methods removed - from deprecated `save_state_dict` change to `save` - from deprecated `load_state_dict` change to `load` Warning message: ```bash pytorch/test/distributed/checkpoint/test_fsdp_model_state.py:37: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134938 Approved by: https://github.com/wz337, https://github.com/fegin	2024-09-06 03:25:09 +00:00
Xu Han	29d72c1100	[inductor] check intel compiler minimal version (#135209 ) On Windows: early version icx has `-print-file-name` issue, and can't preload correctly for inductor. Add minimal version check for Intel compiler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135209 Approved by: https://github.com/ezyang	2024-09-06 03:21:07 +00:00
leslie-fang-intel	3b1a334c0f	[Inductor][CPP] Avoid mistake wgt tensor delete (#135100 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/134998: Previously, we only checked if the `get_attr` FX node for the weight had a single user node. However, two `get_attr` nodes may share the same tensor and should not be deleted in such cases. In this PR, we add the count of users for tensor along with the num of users for nodes to decide whether this tensor can be deleted or not. TestPlan ``` python test/inductor/test_cpu_select_algorithm.py -k test_linear_wgt_multi_users ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135100 Approved by: https://github.com/jgong5	2024-09-06 03:13:36 +00:00
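The double check described above (users per node plus users per underlying tensor) can be sketched without FX. The tuple layout and the name `deletable_weights` are assumptions made for illustration, not the Inductor data structures.

```python
from collections import Counter


def deletable_weights(get_attr_nodes):
    """Pick weights that are safe to delete.

    `get_attr_nodes` is a list of (node_name, tensor, num_node_users) tuples,
    a hypothetical stand-in for FX `get_attr` nodes.
    """
    # Count how many get_attr nodes point at each underlying tensor object.
    tensor_refs = Counter(id(tensor) for _, tensor, _ in get_attr_nodes)
    return [
        name
        for name, tensor, num_users in get_attr_nodes
        # Safe only if the node has one user AND no other node shares the tensor.
        if num_users == 1 and tensor_refs[id(tensor)] == 1
    ]
```

Two nodes that share one tensor are both kept even though each has a single user, which is the case the earlier node-only check missed.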
leslie-fang-intel	07689a38bf	[Inductor] Fix AOT weight alignment issue on CPU (#135205 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/135027. On CPU, the `consts_size` used to generate `_binary_constants_bin_start` is not padded to `ALIGN_BYTES`, while `serialized_weights` is, causing a failure in the 16K alignment check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135205 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-09-06 03:06:51 +00:00
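The mismatch comes down to rounding a size up to a multiple of the alignment. The sketch below shows that arithmetic; the value 64 for `ALIGN_BYTES` is an assumption for illustration, and `pad_to_alignment` is a hypothetical helper.

```python
ALIGN_BYTES = 64  # assumed value, for illustration only


def pad_to_alignment(size: int, align: int = ALIGN_BYTES) -> int:
    """Round `size` up to the next multiple of `align`."""
    return (size + align - 1) // align * align
```

If the serialized weights are padded this way but `consts_size` is not, the two sizes disagree whenever the raw size is not already a multiple of the alignment, for example 100 versus `pad_to_alignment(100) == 128`.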
Edward Z. Yang	06a7dc21c1	Remove dead expect_rational (#135105 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135105 Approved by: https://github.com/malfet	2024-09-06 02:57:27 +00:00
Edward Z. Yang	d9a18173fa	Report qualname of exception type rather than <class 'RuntimeError'> (#135146 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135146 Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/yanboliang ghstack dependencies: #135148, #135145	2024-09-06 02:56:50 +00:00
Edward Z. Yang	d8543e3162	Include exception type qualname when rewrapping InternalTorchDynamoError (#135145 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135145 Approved by: https://github.com/drisspg, https://github.com/anijain2305 ghstack dependencies: #135148	2024-09-06 02:56:50 +00:00
Edward Z. Yang	ad01fc194d	Consolidate raise and rewrap raise error branches (#135148 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135148 Approved by: https://github.com/anijain2305, https://github.com/albanD, https://github.com/yanboliang, https://github.com/malfet	2024-09-06 02:56:46 +00:00
Haibo Chen	e162414963	add instrumentation of CCA stats for reserved and allocated memory size (#135231 ) As titled Pull Request resolved: https://github.com/pytorch/pytorch/pull/135231 Approved by: https://github.com/c-p-i-o	2024-09-06 02:48:56 +00:00
Edward Z. Yang	9e5a797771	Improve test_public_bindings import module error reporting (#135258 ) Error was hard to understand without message. Render it now. See https://github.com/pytorch/pytorch/pull/135259 for it in action. Example failure: ``` 2024-09-05T20:04:45.3022000Z FAILED [5.9524s] test_public_bindings.py::TestPublicBindings::test_modules_can_be_imported - AssertionError: String comparison failed: '' != "torch._logging.scribe failed to import w[112 chars].py)" 2024-09-05T20:04:45.3025413Z + torch._logging.scribe failed to import with error ImportError: cannot import name 'TypeAlias' from 'typing' (/opt/conda/envs/py_3.9/lib/python3.9/typing.py) 2024-09-05T20:04:45.3026990Z ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135258 Approved by: https://github.com/albanD	2024-09-06 02:40:03 +00:00
atalman	b46a1b9e2d	Use Python 3.9 on all libtorch jobs (#135245 ) Part of the migration py3.8->3.9 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135245 Approved by: https://github.com/izaitsevfb	2024-09-06 02:27:22 +00:00
Sunita Nadampalli	9688014820	aarch64: extend matmul heuristic checks to all neoverse platforms (#134548 ) for aarch64 neoverse platforms there are two gemm backends available for matmul operator on PyTorch: (1) Arm Compute Library and (2) OpenBLAS. While Arm Compute Library provides better performance over OpenBLAS, it has overhead for the kernel launch time, and hence we use OpenBLAS for smaller tensor compute. The heuristic was originally implemented for neoverse_v1. This commit extends the heuristic to other neoverse platforms Pull Request resolved: https://github.com/pytorch/pytorch/pull/134548 Approved by: https://github.com/malfet	2024-09-06 01:40:50 +00:00
titaiwangms	8f6e73f068	[ONNX] Enable experimental exporter logic to dynamo_export and support refine dynamic_shapes (#134976 ) (1) Enable experimental exporter logic to dynamo_export (2) Refine dynamic shapes and retry export in export strategies (3) Delete `torch_export_graph_extractor` and use the new export logic (4) Disable ExportedProgram test in `test_fx_onnx_with_onnxruntime.py`, as ONNXProgram is different now. Fixes https://github.com/pytorch/pytorch/issues/126479 Fixes #135183 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134976 Approved by: https://github.com/justinchuby	2024-09-06 01:29:56 +00:00
Bin Bao	1e57ef08fa	[AOTI] Support MKLDNN qconv ops in cpp wrapper (#134795 ) Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support qconv in the ABI-compatible mode for cpp-wrapper Inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134795 Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi ghstack dependencies: #134475, #134783	2024-09-06 01:01:53 +00:00
Bin Bao	614b86d602	[AOTI] Support MKLDNN qlinear ops in cpp wrapper (#134783 ) Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support qlinear in the ABI-compatible mode for cpp-wrapper Inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134783 Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi ghstack dependencies: #134475	2024-09-06 01:01:53 +00:00
Bin Bao	0b96dfb736	[AOTI] Support MKLDNN conv ops in cpp wrapper (#134475 ) Summary: Partially fix https://github.com/pytorch/pytorch/issues/123040. In the ABI-compatible mode, MKLDNN fallback ops do not have C shim implementations and thus need to go through the custom ops launch path. Other MKLDNN ops will be fixed in following PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134475 Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi	2024-09-06 01:01:53 +00:00
Shivam Raikundalia	62b221d5cc	Add Percentages to Function Events (#135155 ) Summary: Users have recently asked that the profiler add self/total CPU and device percentages to FunctionEvents so that teams can process the data procedurally. Some of it could be done mathematically via subroutines, but since we already have the information in the _build_table, let's build it there. Test Plan: Check that we have the same table as before but also check that the parameters we check also have the expected values Differential Revision: D62210351 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135155 Approved by: https://github.com/shanw-meta, https://github.com/kit1980	2024-09-06 00:39:11 +00:00
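As a rough illustration of the percentage being attached to each event, the computation is each event's self time divided by the total. The dict keys and the function name below are hypothetical and do not reflect the profiler's actual `FunctionEvent` fields.

```python
def with_self_cpu_percent(events):
    """Return copies of `events` with a 'self_cpu_percent' entry added.

    `events` is a list of dicts with hypothetical keys 'name' and
    'self_cpu_time_us'.
    """
    total = sum(e["self_cpu_time_us"] for e in events)
    return [
        # Guard against an empty or all-zero trace to avoid dividing by zero.
        {**e, "self_cpu_percent": 100.0 * e["self_cpu_time_us"] / total if total else 0.0}
        for e in events
    ]
```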
Laith Sakka	66dd4577b1	Track base of FunctionalTensor in inference mode. (#135141 ) The idea behind the tracking is the following: whenever we see a tensor, if the tensor is a root tensor (does not have any view metas), then we consider it as the base of all the tensors that share its storage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135141 Approved by: https://github.com/zou3519	2024-09-06 00:10:25 +00:00
cyy	cc28634172	[Submodule] Bump pybind11 to v2.13.5 (#135202 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/135202 Approved by: https://github.com/Skylion007	2024-09-06 00:09:00 +00:00
wz337	c83cdf068b	[DTensor] Fix view op replicating on tensor dim when the size of the tensor dim = 1 (#135054 ) We found a corner case that when a tensor dimension is 1, calling `view(1)` would result in an unexpected replication (see case 1 below). When the tensor dimension to shard is not 1, no matter whether the tensor dimension is evenly-shardable across the mesh dimension, it won't cause an implicit replication behind the scenes if view doesn't change the size of the given tensor dimension (see case 2 and 3). When the tensor dimension to shard is of size 1, it is not being added to shardable_dims here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/ops/_view_ops.py#L518
```
import torch
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh

# uneven case where the size of the tensor dimension to shard is 1
p = torch.randn(1, 2)
mesh = init_device_mesh("cuda", (2,))
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(1, 2)  # this would result in replication, meaning t is now replicated across all ranks.

# uneven case where the size of the tensor dimension to shard is not 1
p = torch.randn(3, 2)
mesh = init_device_mesh("cuda", (2,))
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(3, 2)  # this would not result in replication, meaning t stays as sharded.

# even case
p = torch.randn(2, 2)
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(2, 2)  # this would not result in replication, meaning t stays as sharded.
```
Differential Revision: [D62155606](https://our.internmc.facebook.com/intern/diff/D62155606) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135054 Approved by: https://github.com/tianyu-l, https://github.com/wanchaol	2024-09-06 00:03:54 +00:00
titaiwangms	28ccfba248	[ONNX] Delete ONNXProgramSerializer (#135261 ) Fixes #135182 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135261 Approved by: https://github.com/justinchuby	2024-09-05 23:52:51 +00:00
Jason Ansel	b2386bdca1	[debug] Add helper to run cProfile on a function (#135084 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135084 Approved by: https://github.com/oulgen ghstack dependencies: #135070, #135076, #135082	2024-09-05 23:41:30 +00:00
Jason Ansel	bdfc8d9f96	[fx] Don't use generators in map_aggregate (#135082 ) While the generators avoid a copy, they are slow. Before: ![image](https://github.com/user-attachments/assets/70a55a9a-0595-4105-b0ab-22cf77c7409c) After: ![image](https://github.com/user-attachments/assets/cecb9c59-ae36-47de-8b08-cab2c7cb3d57) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135082 Approved by: https://github.com/oulgen ghstack dependencies: #135070, #135076	2024-09-05 23:41:30 +00:00
Jason Ansel	70779dded8	[fx] Compile time optimization in Node.__update_args_kwargs (#135076 ) Before this we took two passes over all of the args. Before: ![image](https://github.com/user-attachments/assets/24ce5628-03f4-4983-9f2d-5ddf0ca5816e) After: ![image](https://github.com/user-attachments/assets/c9681aa2-32f0-4f6b-a598-fc6f90ffafb5) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135076 Approved by: https://github.com/Chillee ghstack dependencies: #135070	2024-09-05 23:41:30 +00:00
Jason Ansel	ea231300d1	[inductor] Improve compile time regression from MemoryDep.normalize (#135070 ) Possible fix for #135056 Before ![image](https://github.com/user-attachments/assets/3962cb85-e808-4fd4-991f-471ff5ef7eae) After ![image](https://github.com/user-attachments/assets/2322d48d-6518-4518-baca-336027b5cda8) Measured based on: ``` python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --training --only hf_Bert_large --stats -n1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135070 Approved by: https://github.com/Chillee	2024-09-05 23:41:30 +00:00
PyTorch MergeBot	8f66995459	Revert "Support rolling over a percentage of workflows (#134816 )" This reverts commit fc890b55b51098437b6149abf1026a8b2aaee389. Reverted https://github.com/pytorch/pytorch/pull/134816 on behalf of https://github.com/malfet due to Causes lint to intermittently fail ([comment](https://github.com/pytorch/pytorch/pull/134816#issuecomment-2332902609))	2024-09-05 23:39:41 +00:00
Kulin Seth	144fde4fd2	[MPS] Add support for autocast in MPS (#99272 ) Fixes https://github.com/pytorch/pytorch/issues/88415 Need to run inductor/test_cpu_select_algorithm Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272 Approved by: https://github.com/malfet Co-authored-by: Siddharth Kotapati <skotapati@apple.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Co-authored-by: Roy Hvaara <roy@lightyear.no>	2024-09-05 23:23:17 +00:00
Avik Chaudhuri	43f4947d44	fix fake tensor tolist implementation (#135131 ) Summary: When exporting for training with `tolist`, we do not hit `FunctionalTensor.tolist` since we do not functionalize. Unfortunately, this means we hit `FakeTensor.tolist`, which creates unbacked symints that are not backed by proxies. Rather than trying to patch up this low-level implementation, we replace it with essentially what `FunctionalTensor.tolist` does, which is higher-level: we essentially desugar to `item()` calls and let it take care of unbacked symints. Test Plan: Some expected failures are gone now. Also found a test for `tolist` that was written when `FunctionalTensor.tolist` was implemented but not really doing much; repurposed it now to exercise more modes. Differential Revision: D62197742 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135131 Approved by: https://github.com/ezyang	2024-09-05 23:20:31 +00:00
Chirag Pandya	65e1c34061	[rfc] scuba for flight recorder (#134794 ) Summary: Record flight recorder status in a scuba table. Test Plan: Testing with timing out a job. Will post results soon. Differential Revision: D61729221 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134794 Approved by: https://github.com/fduwjj	2024-09-05 23:18:10 +00:00
Stonepia	830247c355	[Intel Triton] Update Intel Triton to release/2.5.0 (#134074 ) This PR relands https://github.com/pytorch/pytorch/pull/134053 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134074 Approved by: https://github.com/EikanWang	2024-09-05 22:46:31 +00:00
Yidi Wu	4262755b5a	[cond] fix typo in cond codegen (#134708 ) As titled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134708 Approved by: https://github.com/jansel	2024-09-05 22:38:24 +00:00
Edward Z. Yang	3825607144	Add torch._logging.scribe (#135224 ) See https://github.com/pytorch/pytorch/pull/135138 for a usage example. Meta only, see https://docs.google.com/document/d/1JpbAQvRhTmuxjnKKjT7qq57dsnV84nxSLpWJo1abJuE/edit#heading=h.9wi46k7np6xw for context fbscribelogger is a library that allows us to write to scribe, which is Meta's logging infrastructure, when you have appropriate access token (this token is available for jobs running on main, as well as authorized jobs with the ci-scribe label). The resulting data is accessible via Scuba (a real time in-memory database) and Hive (a more traditional SQL persisted database). Here's the motivating use case. Suppose there is somewhere in PyTorch's codebase where you'd like to log an event, and then you'd like to find all the situations where this log is called. If PyTorch is rolled out to our internal users, we have some FB-oriented APIs (like torch._utils_internal.signpost_event) with which you can do this. But you have to actually land your PR to main, wait for it to be ingested to fbcode, and then wait for us to actually roll out this version, before you get any data. But what if you want the results within the next few hours? Instead, you can use torch._logging.scribe to directly write to our logging infrastructure from inside CI jobs. The most convenient approach is to log unstructured JSON blobs to `open_source_signpost` (added in this PR; you can also add your own dedicated table as described in the GDoc above). After adding logging code to your code, you can push your PR to CI, add 'ci-scribe' label, and in a few hours view the results in Scuba, e.g., (Meta-only) https://fburl.com/scuba/torch_open_source_signpost/z2mq8o4l If you want continuous logging on all commits on master, you can land your PR and it will continuously get logging for all CI runs that happen on main. 
Eventually, if your dataset is important enough, you can consider collaborating with PyTorch Dev Infra to get the data collected in our public AWS cloud so that OSS users can view it without access to Meta's internal users. But this facility is really good for prototyping / one-off experiments. It's entirely self serve: just add your logging, run your PR CI with ci-scribe, get results, do analysis in Scuba. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135224 Approved by: https://github.com/Skylion007	2024-09-05 22:37:13 +00:00
eqy	3c8f71ff93	[cuDNN][64-bit indexing] cuDNN v9.3+ supports non-batch-splittable convolutions with > 2**31 elements (#134890 ) For longstanding issues such as #95024 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134890 Approved by: https://github.com/Skylion007	2024-09-05 22:22:45 +00:00
Zain Rizvi	fc890b55b5	Support rolling over a percentage of workflows (#134816 ) In order to support adding a rollover percentage, this ended up being a complete rewrite of runner_determinator.py. Details of the new format are in the comments up top. On the plus side, this now includes some unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134816 Approved by: https://github.com/PaliC, https://github.com/zxiiro	2024-09-05 22:21:45 +00:00
Animesh Jain	058a69d91a	[fbcode][dynamo] Turn on guard_nn_modules using justknobs_check (#134928 ) As Title Pull Request resolved: https://github.com/pytorch/pytorch/pull/134928 Approved by: https://github.com/ezyang	2024-09-05 22:05:54 +00:00
sanchitintel	6c5920d515	Tune int8 AMX WoQ micro-kernel for CPU (#134832 ) This patch prevents performance regression against the default ATen implementation for LLaMA 3.1 int8 GPTQ WoQ workload. Uses AMX micro-kernel only if `M` >= `block_m` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134832 Approved by: https://github.com/jgong5	2024-09-05 22:01:14 +00:00
Zhengxu Chen	116fd474da	[export] Expand coverage to more copied sym ops for unflattener. (#135119 ) Test Plan: buck2 test 'fbcode//mode/opt' fbcode//torchrec/ir/tests:test_serializer -- --run-disabled ``` File changed: fbcode//caffe2/torch/export/unflatten.py Buck UI: https://www.internalfb.com/buck2/2e0377e7-e2b6-4bd0-8133-a787245165a0 Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549824883887 Network: Up: 0B Down: 0B Jobs completed: 16. Time elapsed: 10.2s. Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Differential Revision: D62190172 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135119 Approved by: https://github.com/yushangdi	2024-09-05 21:58:20 +00:00
Scott Wolchok	a5d70cf545	[PyTorch] Add isfinite to BFloat16-math.h (#135052 ) Missing function from <cmath>. Differential Revision: [D62148884](https://our.internmc.facebook.com/intern/diff/D62148884/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135052 Approved by: https://github.com/PaliC, https://github.com/albanD ghstack dependencies: #135031	2024-09-05 21:50:36 +00:00
Scott Wolchok	7fe819d917	[PyTorch] Fix -Wshadow -Werror build in BFloat16-inl.h (#135031 ) `float_t` is required to exist in C99 math.h, which causes -Wshadow to fire. We don't need the alias, fortunately. Differential Revision: [D62135908](https://our.internmc.facebook.com/intern/diff/D62135908/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135031 Approved by: https://github.com/albanD	2024-09-05 21:48:21 +00:00
PyTorch MergeBot	f63571060c	Revert "Use actions/upload-artifact@v4.4.0 for rest of workflows (#135264 )" This reverts commit 9c0b03020b7204ca5d5dbe18174bab005f79c47b. Reverted https://github.com/pytorch/pytorch/pull/135264 on behalf of https://github.com/atalman due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/135264#issuecomment-2332674607))	2024-09-05 21:43:05 +00:00
Yidi Wu	38fead8f7c	[hop] preserve metadata in re-tracing hop subgraph by running with interpreter (#135159 ) In this way, the interpreter.run can preserve the current metadata of subgraphs correctly when tracing the subgraphs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135159 Approved by: https://github.com/tugsbayasgalan	2024-09-05 21:36:56 +00:00
Huy Do	24a223c49d	Run inductor micro benchmark on x86 metal runner (#135042 ) This enables inductor micro benchmark on CPU (x86): * Running on AWS metal runner for more accurate benchmark * I add a new `arch` column, which will be either x86_64 or arm64 for CPU or GPU name for GPU. We can use this later to differentiate between different setup, i.e. cuda (a100) vs cuda (a10g) or cpu (x86_64) vs cpu (arm64) The next step would be to run this one cpu arm64, and cuda (a10g). ### Testing Here is the CSV results from my test run https://github.com/pytorch/pytorch/actions/runs/10709344180 ``` name,metric,target,actual,dtype,device,arch,is_model mlp_layer_norm_gelu,flops_utilization,0.8,17.36,bfloat16,cpu,x86_64,False gather_gemv,memory_bandwidth(GB/s),990,170.80,int8,cpu,x86_64,False gather_gemv,memory_bandwidth(GB/s),1060,204.78,bfloat16,cpu,x86_64,False Mixtral-8x7B-v0.1,token_per_sec,175,26.68,int8,cpu,x86_64,True Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,171.91,int8,cpu,x86_64,True Mixtral-8x7B-v0.1,compilation_time(s),162,47.36,int8,cpu,x86_64,True gemv,memory_bandwidth(GB/s),870,236.36,int8,cpu,x86_64,False gemv,memory_bandwidth(GB/s),990,305.71,bfloat16,cpu,x86_64,False Llama-2-7b-chat-hf,token_per_sec,94,14.01,bfloat16,cpu,x86_64,True Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,185.18,bfloat16,cpu,x86_64,True Llama-2-7b-chat-hf,compilation_time(s),162,74.99,bfloat16,cpu,x86_64,True Llama-2-7b-chat-hf,token_per_sec,144,25.09,int8,cpu,x86_64,True Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,165.83,int8,cpu,x86_64,True Llama-2-7b-chat-hf,compilation_time(s),172,70.69,int8,cpu,x86_64,True layer_norm,memory_bandwidth(GB/s),950,172.03,bfloat16,cpu,x86_64,False ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135042 Approved by: https://github.com/yanboliang	2024-09-05 21:31:36 +00:00
Will Feng	e4920a1364	[Traceable FSDP2][Dynamo] allow tracing through auto_functionalized HOP (#135169 ) If an `auto_functionalized` HOP is included in backward graph due to activation checkpointing, we will run into a scenario where Compiled Autograd Dynamo tracing will need to trace through the `auto_functionalized` HOP. This PR adds support for it. Test commands: - `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_auto_functionalized` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135169 Approved by: https://github.com/zou3519	2024-09-05 21:22:45 +00:00
Shangdi Yu	bc5ecf83d7	[training ir migration] Fix quantization tests (#135184 ) Summary: Fixed some quantization tests for new training ir: Fix batch norm node pattern matcher. In training ir, we have `aten.batch_norm` node instead of `aten._native_batch_norm_legit` and `aten._native_batch_norm_legit_no_training`. Test Plan: ``` buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e ``` Reviewed By: tugsbayasgalan Differential Revision: D62209819 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135184 Approved by: https://github.com/tugsbayasgalan	2024-09-05 21:19:28 +00:00
PyTorch MergeBot	e55c0f59e5	Revert "[Reland] Refactor caching device allocator utils (#130923 )" This reverts commit 9809080b9ed657a8c0ea0383be7cbdce3a26e05e. Reverted https://github.com/pytorch/pytorch/pull/130923 on behalf of https://github.com/kit1980 due to breaking internal builds - Error: Relocation overflow has occured ([comment](https://github.com/pytorch/pytorch/pull/130923#issuecomment-2332640961))	2024-09-05 21:16:14 +00:00
PyTorch MergeBot	a4cf9653ee	Revert "Remove Caffe2 code from tool scripts (#134941 )" This reverts commit c818ecd1698a28d9fadf4a81453a89914b18374a. Reverted https://github.com/pytorch/pytorch/pull/134941 on behalf of https://github.com/kit1980 due to breaking internal builds - The path `caffe2/operators/hip/gather_op.cuh` does not exist ([comment](https://github.com/pytorch/pytorch/pull/134941#issuecomment-2332636624))	2024-09-05 21:12:54 +00:00
atalman	9c0b03020b	Use actions/upload-artifact@v4.4.0 for rest of workflows (#135264 ) To be consistent with https://github.com/pytorch/pytorch/pull/135263 and rest of workflows. Use v4.4.0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135264 Approved by: https://github.com/kit1980, https://github.com/malfet	2024-09-05 21:05:06 +00:00
Jack Taylor	034717a029	[ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>	2024-09-05 20:36:45 +00:00
Angela Yi	9c38b00999	[export] Add ability to run eagerly on UnflattenedModule (#133996 ) Summary: Added the contextmanager, `_disable_interpreter`, which is meant to be put around a call to `unflatten`. This will generate an UnflattenedModule and sub-InterpreterModules which will not use torch.fx.Interpreter to run eagerly. We want to have this as a state of the module instead of a contextmanager around running the module because it's not clear where we are calling the unflattened module. This seems to improve the performance: https://fb.workplace.com/groups/1075192433118967/posts/1473590629945810/?comment_id=1473621763276030 Test Plan: CI Differential Revision: D60939034 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133996 Approved by: https://github.com/pianpwk	2024-09-05 20:28:42 +00:00
atalman	8efe547046	Use actions/upload-artifact@v4.4.0 for triton builds (#135263 ) Same as: https://github.com/pytorch/pytorch/pull/135139 Fixes upload failure: https://github.com/pytorch/pytorch/actions/runs/10722567217/job/29748125015 fix regression introduced by https://github.com/pytorch/pytorch/pull/135068 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135263 Approved by: https://github.com/kit1980, https://github.com/huydhn	2024-09-05 20:03:39 +00:00
rzou	82d00acfee	Allow cross-device copies for cpu scalars in refs (#135140 ) This copies our eager-mode behavior where someone can do torch.add(a, b, out=c) where a and b are CPU scalar tensors and c is a CUDA tensor. Fixes https://github.com/pytorch/pytorch/issues/121619 by side effect (we get into a situation where we're writing a CPU scalar into a FakeTensor that is actually a meta tensor) Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/135140 Approved by: https://github.com/williamwen42, https://github.com/yanboliang	2024-09-05 19:08:48 +00:00
Zhonglin Han	098431a29d	Update Resize.cpp with new device type (#135117 ) Update Resize.cpp with new device type Pull Request resolved: https://github.com/pytorch/pytorch/pull/135117 Approved by: https://github.com/egienvalue	2024-09-05 18:53:13 +00:00
Xintong Hu	be660ea2d3	[PT2] Directly set meta.val in group_batch_fusion_aten (#135078 ) Summary: instead of using FakeTensorProp after the pass Differential Revision: D62162640 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135078 Approved by: https://github.com/frank-wei	2024-09-05 18:17:06 +00:00
CaoE	52c7c89ea4	[Inductor][CPP] Leverage full bits for BF16/FP16 vectorization (#126502 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126502 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-09-05 17:17:46 +00:00
IvanKobzarev	1efd341d15	[fake_tensor] Move unrecognized_type NotImplemented before ConstProp (#135033 ) We should not try to do ConstProp on the unrecognized types (e.g. Subclasses). In case of those types throwing NotImplemented will jump to the next torch_dispatch. Test: ``` python test/functorch/test_aotdispatch.py -k test_aot_test_subclasses_with_tensor_factories ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135033 Approved by: https://github.com/zou3519, https://github.com/bdhirsh	2024-09-05 17:09:41 +00:00
Mikayla Gawarecki	a096f2899d	Add torch.serialization.skip_data context manager (#134504 ) ## Semantic The semantic is (1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint). ```python import torch import torch.nn as nn sd = nn.Linear(3, 5).state_dict() with torch.serialization.skip_data(): torch.save(sd, 'foo.pt') print(torch.load('foo.pt', weights_only=True)) ``` (2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`If FakeTensor is passed to `torch.save` the pickler will treat these FakeTensors as being "materialized" space will be reserved in the checkpoint for the associated storage bytes, and when loading the type will be Tensor instead of FakeTensor) ```python import torch import torch.nn as nn from torch._subclasses.fake_tensor import FakeTensorMode with FakeTensorMode(): m = nn.Linear(3, 5, dtype=torch.float16, device='cuda') sd = m.state_dict() with torch.serialization.skip_data(materialize_fake_tensors=True): torch.save(sd, 'bla.pt') print(torch.load('bla.pt', weights_only=True)) # OrderedDict([('weight', tensor([[0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))]) ``` ## Follow Ups - [ ] `torch.load` semantic for skip_data context manager - [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass) Differential Revision: [D62238610](https://our.internmc.facebook.com/intern/diff/D62238610) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504 Approved by: https://github.com/albanD	2024-09-05 16:53:39 +00:00
Edward Z. Yang	dbeb8a1691	Render log filepaths that are not anchored in torch's directory in a reasonable way (#135165 ) For example, if I do TORCH_LOGS=fbscribelogger I'll get: ``` I0904 17:59:07.567000 3672513 fbscribelogger/__init__.py:161] stop ``` instead of ``` I0904 12:46:15.332000 2930287 ../../../../../home/ezyang/local/a/pytorch-env/lib/python3.10/site-packages/fbscribelogger/__init__.py:161] stop ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135165 Approved by: https://github.com/Skylion007	2024-09-05 16:48:09 +00:00
mori360	b1f72e2984	Gradient scaler for DTensor (#132816 ) Addresses the request [here](https://github.com/pytorch/pytorch/issues/120003#issuecomment-2248805798). Enables DTensor input in the gradient scaler's APIs, especially `.unscale_()`. A related dispatch strategy is added to accept DTensor input. To let found_inf be reduced across devices, we add an allreduce at dispatch, applied to the args after the dispatch strategy and kernel. Since `aten._amp_foreach_non_finite_check_and_unscale_.default` is an in-place op, grad_scale as arg[0] will be modified in place, so redesigning a strategy or refactoring the kernel would not help. The test files check the following under 1-d (dp) and 2-d (dp, tp) cases: 1. whether the non-inf values are unscaled 2. whether all DTensors on each device can find inf even when it is not on their device 3. if inf is not found, whether new parameters are generated 4. if inf is found, whether the scale is updated Pull Request resolved: https://github.com/pytorch/pytorch/pull/132816 Approved by: https://github.com/XilunWu, https://github.com/weifengpy, https://github.com/wanchaol	2024-09-05 16:44:32 +00:00
Henry Tsang	bb3c2408f4	[inductor][test] in test_unbacked_symints, replace inductor's skipCUDAIf with common device type's skipcudaif (#133936 ) Differential Revision: D61506212 Use `skipCUDAIf` from `torch.testing._internal.common_device_type` if we create the test class with `instantiate_device_type_tests`. `instantiate_device_type_tests` would make sure the class has attr device_type, which works with`skipCUDAIf` from `torch.testing._internal.common_device_type`. Also skipping test_vertical_pointwise_reduction_fusion for cpu test class, since the test expects cuda. FAILED [0.0026s] test/inductor/test_unbacked_symints.py::TestUnbackedSymintsCPU::test_vertical_pointwise_reduction_fusion_cpu - AttributeError: 'TestUnbackedSymintsCPU' object has no attribute 'device' repro: ``` CUDA_VISIBLE_DEVICES="" pytest test/inductor/test_unbacked_symints.py -k cpu -v ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133936 Approved by: https://github.com/ColinPeppler, https://github.com/desertfire	2024-09-05 16:40:14 +00:00
Tom Ritchford	2c99f17a32	Implement VariableTracker.python_type() (#134215 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134215 Approved by: https://github.com/amjames, https://github.com/jansel	2024-09-05 16:35:47 +00:00
Tarun Karuturi	0043dcd79e	Switch torch pt2e xnnpack tests to use export_for_training (#134788 ) Migrate all the callsites inside the pt2e XNNPACK tests to use export_for_training. Differential Revision: D61994553 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134788 Approved by: https://github.com/mergennachin	2024-09-05 16:11:18 +00:00
Edward Z. Yang	2e2fb668fa	Upgrade expecttest to 0.2.1 (#135136 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135136 Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/Skylion007	2024-09-05 16:05:35 +00:00
Stonepia	9d24f945ba	[CI] Use larger instance for building triton whl (#135201 ) When running CI jobs of "Build Triton Wheels", it failed due to the lack of resources. This PR uses a larger runner to avoid these issues. The failure message is like: ``` Process completed with exit code 137. ``` Related running actions: Failed actions: https://github.com/pytorch/pytorch/actions/runs/10714445036 Success actions: https://github.com/pytorch/pytorch/actions/runs/10716710830 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135201 Approved by: https://github.com/chuanqi129, https://github.com/atalman	2024-09-05 14:36:23 +00:00
min-jean-cho	ecbd715363	[Intel GPU][Windows] Fix overriding default CMAKE_CXX_FLAGS (#135093 ) The root cause is that `/EHsc` is part of the default `CMAKE_CXX_FLAGS` in CMake. Fix to not override the default `CMAKE_CXX_FLAGS`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135093 Approved by: https://github.com/EikanWang, https://github.com/atalman	2024-09-05 12:52:43 +00:00
Xinyu	58f2477a26	[Dynamo] Support builtin function frozenset (#134563 ) Support builtin function frozenset in dynamo Pull Request resolved: https://github.com/pytorch/pytorch/pull/134563 Approved by: https://github.com/anijain2305, https://github.com/EikanWang, https://github.com/jansel	2024-09-05 12:15:10 +00:00
sanchitintel	43dcb4bb61	Revise CPU vectorization ISA support API (#135075 ) Revising (mostly renaming) CPU vectorization ISA support API (non-frontend-user-facing). Also added AVX512_BF16 ISA detection API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135075 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/ezyang	2024-09-05 12:14:56 +00:00
Bin Bao	50d1e37079	[AOTI] Fix a unbacked symint retrieve bug (#134670 ) Summary: Fix https://github.com/pytorch/pytorch/issues/134081. When an unbacked symint is computed as the shape of a tensor from a tuple, the generated C++ code needs to use std::get<> to extract the tensor. Differential Revision: [D62142113](https://our.internmc.facebook.com/intern/diff/D62142113) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134670 Approved by: https://github.com/angelayi, https://github.com/22quinn, https://github.com/chenyang78	2024-09-05 11:34:14 +00:00
Feng Yuan	b99ef1a02e	Update torch-xpu-ops pin (ATen XPU implementation) (#135185 ) Release cycle for PyTorch 2.5 1. Update specific AOT targets for Windows. On Windows, AOT target list prefers Intel client GPUs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135185 Approved by: https://github.com/EikanWang	2024-09-05 10:05:23 +00:00
Jack Zhang	8a5c8e5db9	Update unbacked symints in masked_select more precisely (#134899 ) ## Summary At the moment, the fake impl for `masked_select` simply sets the upper range to `sys.maxsize` (9223372036854775807, the max value for a signed int64) while updating its size-like SymInt if there are any SymInts in the original input tensor shape. This PR constrains the range more intelligently by using the upper ranges of each SymInt in the input tensor shape. This solves an issue where a model being lowered to Executorch errors during memory planning because the memory allocated for `masked_select` ended up exceeding the 64-bit address space (`INT_MAX * size(dtype)`). ## Test plan - Passes existing unit tests (tests case where upper bound is inf) - Added unit test to verify upper bound reduction calculation - Tested end-to-end by exporting with TORCH_LOGS="export" and ensuring that the range for `masked_select`'s SymInt size has the correct upper bound Pull Request resolved: https://github.com/pytorch/pytorch/pull/134899 Approved by: https://github.com/ezyang	2024-09-05 09:01:06 +00:00
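The tightened bound from #134899 can be illustrated with a minimal sketch. It assumes the new upper bound is the product of the per-dimension upper bounds, capped at `sys.maxsize`; the helper name `numel_upper_bound` is hypothetical and this is not the actual fake implementation.

```python
import math
import sys


def numel_upper_bound(dim_upper_bounds):
    """Hypothetical helper: cap on how many elements masked_select can return.

    The result can never hold more elements than the input tensor, so the
    product of each dimension's upper bound is a valid cap.
    """
    bound = math.prod(dim_upper_bounds)
    # Assumption: keep the old sys.maxsize ceiling when a dimension is
    # effectively unbounded.
    return min(bound, sys.maxsize)


print(numel_upper_bound([16, 4]))
print(numel_upper_bound([sys.maxsize, 4]) == sys.maxsize)
```

With bounded dimensions the cap is far below `sys.maxsize`, which is what keeps a downstream memory planner from reserving an impossible amount of memory.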
Yutao Xu	c7328dff7f	Enhance the stability of the complex divide code (#134647 ) In C++, when a floating-point literal (e.g., 3.14) is compared with a variable of type float, the literal is by default interpreted as a double. ```c++ float f = 3.14f; if (f == 3.14) { // Do something } ``` If a device does not support double, an error will occur. This PR addresses the issue of complex64 errors on machines that do not support double operations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134647 Approved by: https://github.com/EikanWang, https://github.com/albanD	2024-09-05 08:36:37 +00:00
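To see why the literal's type matters in #134647, here is a small Python sketch (standard library only) that mimics the C++ comparison. It rounds 3.14 to single precision, as `3.14f` would be, and then compares it with the double-precision value. The helper `to_float32` is introduced here purely for illustration.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


f = to_float32(3.14)  # analogous to `float f = 3.14f;`

# Python floats are doubles, so this comparison runs in double precision,
# just like `f == 3.14` in C++ once `f` has been promoted to double.
print(f == 3.14)

# Comparing against a single-precision value (the `3.14f` analogue)
# never needs the double representation of the literal.
print(f == to_float32(3.14))
```

Writing the literal with an `f` suffix is one way to keep such a C++ comparison in single precision on devices without double support.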
Wu, Chunyuan	749dc6ceda	[inductor] [cpp] use_local_acc if template_buffer_has_other_users (#135081 ) Fix the compilation error of `coat_lite_mini` in timm and `YituTechConvBert` in HF: ``` /tmp/tmpuu94adg_/nf/cnf3zm677wbfjzzll522zvjp57g44udzfnj66ac2t5b2odvfqpts.cpp:239:33: error: invalid conversion from ‘const float’ to ‘float’ [-fpermissive] 239 \| &(in_ptr2[static_cast<int64_t>(n_start + (192Lm_start) + (Nrnci) + ((-1L)Nrnc))]), \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \| \| \| const float* ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135081 Approved by: https://github.com/jgong5 ghstack dependencies: #134984	2024-09-05 08:31:31 +00:00
fduwjj	eaeae0ac95	[c10d] Change collective to take in a list of tensors so it work fully for all collectives (#135049 ) We found that currently, we only pass one input and one output tensor to the function `collective`, which makes NaNCheck, work numel stats and FR input/output sizes inaccurate for all-to-all, scatter and reduce. So we want to let the collective take in a list of tensors to ensure it works for all collectives inside PGNCCL. This partially reverts what we did in https://github.com/pytorch/pytorch/pull/119421, and down the road we will have another round of cleanup on the collective to make it cleaner. For now, at least for the sake of correctness, we changed it back. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135049 Approved by: https://github.com/kwen2501	2024-09-05 07:56:56 +00:00
Pian Pawakapan	5a0e7a408f	restore CSE'd node metadata in runtime asserts pass (#134516 ) Adds val, and optionally stack_trace & nn_module_stack metadata back to SymInt compute nodes that we CSE, with a hook on `graph.create_node()`. Not sure if there's other metadata we want to populate here? Pull Request resolved: https://github.com/pytorch/pytorch/pull/134516 Approved by: https://github.com/ezyang	2024-09-05 07:50:04 +00:00
Yan Zhiwei	81a8624296	[Intel GPU] Customized XPU behaviour in indexing, group norm (#134453 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134453 Approved by: https://github.com/EikanWang, https://github.com/albanD ghstack dependencies: #133980	2024-09-05 07:41:57 +00:00
Wu, Chunyuan	731fd3172a	[inductor] [cpp] generate reindexer for each epilogue_node (#134984 ) Fixes the FP32 accuracy failure of `levit_128` in timm. Previously, we used `Y` which is the output of the final epilogue node to calculate the reindexer. We actually need to use each epilogue node to calculate the reindexer from the GEMM output to the epilogue node. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134984 Approved by: https://github.com/jgong5	2024-09-05 07:08:31 +00:00
Tugsbayasgalan Manlaibaatar	9d705605dd	Fix decomp behaviour in export training IR (#134801 ) Subset of changes in https://github.com/pytorch/pytorch/pull/132901, can't land the previous one because it is too complicated. Rest of the change will be implemented as follow up after export design meeting. This part just makes the training IR -> inference IR decomp to have the same path as normal export. Differential Revision: [D62000525](https://our.internmc.facebook.com/intern/diff/D62000525) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134801 Approved by: https://github.com/avikchaudhuri, https://github.com/angelayi	2024-09-05 06:37:44 +00:00
Sun, Jiayi	05feb6e4ed	[Inductor] support masked vectorization for the tail_loop for dynamic shapes (#131745 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131745 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2024-09-05 06:17:48 +00:00
Pian Pawakapan	7b280c31ba	[export] dynamic_shapes serialization, load/dump (#134718 ) Adds utility functions `_dump_dynamic_shapes` and `_load_dynamic_shapes`. - `_dump_dynamic_shapes`: dynamic shapes spec -> serialized format: - takes in the `dynamic_shapes` pytree object you'd feed into `export()`, and dumps into serialized format - `_load_dynamic_shapes`: serialized format -> dynamic shapes spec - takes the serialized format, and produces a `dynamic_shapes` object you feed into `export()` For example with dumping: ``` dx = Dim("dx", min=4, max=16) dy = dx + 1 inputs = ( [ torch.randn(4, 4), torch.randn(5, 4), ], torch.randn(4), torch.randn(4, 4), "hello", ) dynamic_shapes = { "a": [ (dx, 4), (dy, 4), ], "b": (Dim.AUTO,), "c": None, "d": None, } out = _dump_dynamic_shapes(dynamic_shapes, inputs) ``` would generate the following output: ``` DynamicShapesSpec( dynamic_shapes=( [ ['dx', 4], ['dx + 1', 4], ], ['_DimHint.STATIC'], ['_DimHint.STATIC', '_DimHint.STATIC'], None, ), dims={ 'dx': RootDim( min=4, max=16, derived=['dx + 1'], ), }, ) ``` The serialized format contains 2 keys, `dynamic_shapes` and `dims.` - `dynamic_shapes` is the pytree structure matching the input to `export()`, with strings in place of Dim names and enums, and ints/Nones otherwise. Each tensor is represented with a list of shapes, non-tensors with Nones. - `dims` contain min/max range and derived dims info for each root dim. The test cases show some roundtrippability guarantees for these functions. Definitely taking naming suggestions for them :) Follow up: utility function to extract serializable format from ExportedProgram. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134718 Approved by: https://github.com/avikchaudhuri	2024-09-05 05:39:44 +00:00
PyTorch UpdateBot	f2a7228aed	[executorch hash update] update the pinned executorch hash (#135162 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135162 Approved by: https://github.com/pytorchbot	2024-09-05 04:21:51 +00:00
Will Feng	8fb1281db9	[Traceable FSDP2] Skip _backward_prefetch under compile, and rely on compiler pass to have prefetching (#135163 ) Before this PR, when traceable FSDP2 + AC is run, an error would be thrown: ``` File "/data/users/willfeng/pytorch/torch/_dynamo/variables/builtin.py", line 1449, in call_getitem return args[0].call_method(tx, "__getitem__", args[1:], kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 435, in call_method return super().call_method(tx, name, args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 392, in call_method return super().call_method(tx, name, args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 131, in call_method return self.getitem_const(tx, value) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 106, in getitem_const return self.items[index] Error: Index out of bound from user code: File "<eval_with_key>.5", line 105, in forward aot0_trace_wrapped = torch__dynamo__trace_wrapped_higher_order_op_self_invoke(aot0_tangents_1, bw_state = aot0_primals_34); aot0_tangents_1 = None File "/data/users/willfeng/pytorch/torch/_dynamo/_trace_wrapped_higher_order_op.py", line 74, in self_invoke return _trace_wrapped_op(args, dyn_kwargs, kwargs) File "/data/users/willfeng/pytorch/torch/_dynamo/external_utils.py", line 132, in call_hook_from_backward_state return getattr(bw_state, hook_name)(args, **kwargs) File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_state.py", line 271, in _pre_backward self._fsdp_param_group.pre_backward(default_prefetch) File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 332, in pre_backward self._backward_prefetch() File 
"/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 417, in _backward_prefetch target_fsdp_param_group = self.comm_ctx.post_forward_order[target_index] ``` Since it's okay to rely on the compiler to recover the "prefetching" pattern, we will skip this `_backward_prefetch()` code path during tracing to avoid the error, and have a compiler pass (in future PR) to achieve the equivalent prefetching overlap. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135163 Approved by: https://github.com/awgu	2024-09-05 03:32:04 +00:00
ZhiweiYan-96	a7a53b796b	[Intel GPU]device guard codegen for XPU (#133980 ) This PR is a supplement to #130082. The previous PR #130082 fulfilled the basic functionality of codegen, but we found that it fails to handle the device sameness check in many unit tests. The current PR aims to facilitate XPU device guard code generation. With the current PR, the code snippet in `RegisterXPU.cpp` is as follows, where we can see the device guard is successfully generated. ```c++ namespace { at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) { std::optional<Device> common_device = std::nullopt; (void)common_device; // Suppress unused variable warning c10::impl::check_and_update_common_device(common_device, out, "wrapper_XPU_Tensor_float_out_normal_out", "out"); c10::impl::check_and_update_common_device(common_device, mean, "wrapper_XPU_Tensor_float_out_normal_out", "mean"); const OptionalDeviceGuard device_guard(device_of(out)); return at::native::normal_out(mean, std, generator, out); } } // anonymous namespace ``` Without the current change, the generated code is ```c++ namespace { at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) { // No device check // DeviceGuard omitted return at::native::normal_out(mean, std, generator, out); } } // anonymous namespace ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133980 Approved by: https://github.com/EikanWang, https://github.com/malfet	2024-09-05 01:53:31 +00:00
Bob Ren	30b98940b8	Fix typo in comment (#135111 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135111 Approved by: https://github.com/aorenste, https://github.com/oulgen	2024-09-05 01:39:04 +00:00
Wei Feng	724faac260	[FSDP] casting input args with dataclass(frozen=True) (#135067 ) resolve: https://github.com/pytorch/pytorch/pull/135029 When mixed precision is enabled, FSDP casts input args to the desired dtype by calling `_apply_to_tensors`. When the input args contain a `dataclass(frozen=True)`, we hit the following runtime error because `_apply_to_tensors` uses `setattr`: `dataclasses.FrozenInstanceError: cannot assign to field 'some_key'`. The fix is to use the dataclasses API `dataclasses.replace`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135067 Approved by: https://github.com/awgu	2024-09-05 01:19:53 +00:00
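The failure mode and the fix in #135067 can be reproduced with the standard library alone. The sketch below is an illustration, not FSDP's code: the `Batch` class and the `apply_to_fields` helper are hypothetical stand-ins for the user's input container and for `_apply_to_tensors`.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class Batch:  # hypothetical frozen input container
    some_key: float
    other: int


def apply_to_fields(fn, obj):
    """Return a copy of `obj` with `fn` applied to every field.

    dataclasses.replace builds a new instance, so it works for frozen
    dataclasses where setattr would raise.
    """
    changes = {f.name: fn(getattr(obj, f.name)) for f in dataclasses.fields(obj)}
    return dataclasses.replace(obj, **changes)


batch = Batch(some_key=1.5, other=2)

try:
    setattr(batch, "some_key", 3.0)  # the pattern that used to fail
except dataclasses.FrozenInstanceError as err:
    print("setattr failed:", err)

doubled = apply_to_fields(lambda v: v * 2, batch)
print(doubled)
```

Because `dataclasses.replace` returns a new object, the original frozen instance is left untouched and the caller must use the returned value.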
Aleksei Nikiforov	04e11c7eed	Update current scripts used for setting up s390x runners (#129866 ) Update current scripts used for setting up s390x runners Just a documentation update. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129866 Approved by: https://github.com/malfet, https://github.com/huydhn	2024-09-05 01:17:54 +00:00
drisspg	a3e0d4bf07	[FlexAttention] Fix mismatched backward strides for eager impl (#135152 ) # Fixes: The first repro from: https://github.com/pytorch/pytorch/issues/134888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135152 Approved by: https://github.com/Chillee	2024-09-05 01:14:53 +00:00
FFFrog	27d86f93fe	Remove redundant code (#134955 ) Remove GetPrivateUse1HooksInterface Pull Request resolved: https://github.com/pytorch/pytorch/pull/134955 Approved by: https://github.com/Skylion007	2024-09-05 01:11:32 +00:00
Animesh Jain	32f45f01a9	[dynamo] Retire CompileProfiler (#135133 ) Fixes confusion in https://github.com/pytorch/pytorch/issues/113443 We have TORCH_LOGS that supersedes CompileProfiler Pull Request resolved: https://github.com/pytorch/pytorch/pull/135133 Approved by: https://github.com/ezyang ghstack dependencies: #135039, #135121, #135129, #135130	2024-09-05 01:08:40 +00:00
fduwjj	4a661e089a	[FR] Add version based logic to FR script and make traces print can be filtered (#135154 ) This PR passes the version around so that we can have different behaviors for different versions of the FR dump. It also adds logic to filter the displayed traces by PG (desc) and rank, plus some minor refactors to make names more accurate and to get a util function working. <img width="1180" alt="image" src="https://github.com/user-attachments/assets/4ef8a2d6-1296-4a45-b9a7-6d3b48fbe233"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135154 Approved by: https://github.com/wconstab	2024-09-05 00:59:32 +00:00
Nikita Shulga	105ac2418c	Fix binary builds artifact download (#135139 ) By upgrading the upload-artifacts action to v4.4.0, as the artifact store layout differs between the v3 and v4 actions and artifacts uploaded by v3 cannot be downloaded by v4. Should fix `Unable to download artifact(s): Artifact not found for name: libtorch-cpu-shared-with-deps-release`, which can be seen, for example, [here](https://github.com/pytorch/pytorch/actions/runs/10707740040/job/29690137218#step:7:29). I.e., fixes the regression introduced by https://github.com/pytorch/pytorch/pull/135068 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135139 Approved by: https://github.com/atalman, https://github.com/huydhn	2024-09-05 00:43:34 +00:00
Laith Sakka	560f449d8f	Fix: use clone_preserve_strides in auto_functionalized_v2 (#135142 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135142 Approved by: https://github.com/zou3519 ghstack dependencies: #134409	2024-09-05 00:39:48 +00:00
Aidyn-A	956da79bda	[CUDA][AMP] Fix autocast_dtype (#133938 ) Fixes #132715 The failure in #132715 is due to `autocast_dtype` being a thread-local variable. It causes inconsistencies between `get_autocast_dtype()` results on different threads. To be exact, what happens is the following: the amp dtype is set to `bfloat16` on the main thread. The `backward` call runs on a side thread, so `at::autocast::prioritize` fails because `lower_precision_fp` defaults to `float16`: `6f738d6434/aten/src/ATen/autocast_mode.h (L221-L225)` This PR makes `autocast_dtype` thread-global so it is consistent among all threads of the forward and backward passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133938 Approved by: https://github.com/soulitzer	2024-09-05 00:07:32 +00:00
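The thread-local versus thread-global behaviour described in this commit can be illustrated with a pure-Python analogy. This is only a sketch using `threading.local`; it does not call any PyTorch API, and the names `_tls`, `_shared`, and the helper functions are hypothetical stand-ins for the C++ state.

```python
import threading

_tls = threading.local()          # analogue of a thread-local autocast dtype
_shared = {"dtype": "float16"}    # analogue of a thread-global autocast dtype


def get_thread_local_dtype():
    # A thread that never set the value sees the default.
    return getattr(_tls, "dtype", "float16")


def get_global_dtype():
    return _shared["dtype"]


def read_in_side_thread(getter):
    """Run `getter` on a fresh thread, mimicking backward on a side thread."""
    result = []
    worker = threading.Thread(target=lambda: result.append(getter()))
    worker.start()
    worker.join()
    return result[0]


# The main thread selects bfloat16 in both storage schemes.
_tls.dtype = "bfloat16"
_shared["dtype"] = "bfloat16"

print(read_in_side_thread(get_thread_local_dtype))  # side thread sees the default
print(read_in_side_thread(get_global_dtype))        # side thread sees the main thread's choice
```

With thread-local storage the side thread falls back to the default, which mirrors the mismatch reported in the issue; with shared storage both threads agree.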
chuanqiw	977a909250	[CI] Build pytorch wheel with Torch XPU Operators on Windows (#133151 ) # Description This pipeline enables the CI build on Windows for PRs labeled with ciflow/xpu. This will build the torch binary with Torch XPU Operators on Windows using Visual Studio BuildTools 2022. # Changes 1. Install xpu batch file (install_xpu.bat) - Check whether the build machine has oneAPI in its environment and whether that version is the latest. If not, install the latest publicly released oneAPI on the machine. 2. GHA callable pipeline (_win-build.yml) - Set vc_year and use_xpu as parameters to set the wheel build environment. 3. GHA workflow (xpu.yml) - Add a new Windows build job and pass parameters to it. 4. Build wheels script (.ci/pytorch/win-test-helpers/build_pytorch.bat) - Prepare the environment for building, e.g. install the oneAPI bundle. # Note 1. For building wheels on Intel GPU, you need Visual Studio BuildTools version >= 2022 2. This pipeline requires Visual Studio BuildTools 2022 to build wheels. For now, we specify "windows.4xlarge.nonephemeral" as the build machine label in the yaml file. We will request self-hosted runners with Intel GPU and Visual Studio BuildTools 2022 installed soon. Work for #114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133151 Approved by: https://github.com/chuanqi129, https://github.com/atalman Co-authored-by: chuanqiw <chuanqi.wang@intel.com>	2024-09-05 00:02:46 +00:00
Howard Huang	b3ef0c99f5	[PP] Fix zero bubble composability with DP (#134052 ) Moved all the backward functions (`stage_backward_input`, `stage_backward_weight`, `stage_backward`) under the same `backward_maybe_with_nosync` function which controls the logic of the data parallel wrappers. FSDP was not working with zero bubble PP because there will be twice as many "backward" calls and we update the weight gradients after `autograd.grad` is called. As a result, we need to manually call the FSDP `post_backward_hook()` after the weights have the correct gradients. Fixes the tests: `python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_FSDP_ScheduleClass0_use_new_runtime_False` `python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134052 Approved by: https://github.com/kwen2501	2024-09-04 23:46:29 +00:00
Benjamin Glass	43c9b4e0e6	Fix unintentional deduplication of returned tensors (#134726 ) When CSE was used, returned tensors that had gone through identical processing steps but were distinct from a data perspective were pruned out of the graph. This commit protects tensors which are directly output from being pruned, and adds a test for this behavior. Closes #88813 and #114344 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134726 Approved by: https://github.com/amjames, https://github.com/zou3519, https://github.com/bdhirsh	2024-09-04 23:42:56 +00:00
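To make the pruning problem concrete, here is a toy common-subexpression-elimination pass over a list of nodes. It is a sketch under stated assumptions, not the pass changed by this commit: the `(name, op, args)` node format, the `cse` function, and the `protect_outputs` flag are all hypothetical.

```python
def cse(nodes, outputs, protect_outputs=True):
    """Toy CSE over nodes given as (name, op, args) in topological order.

    Returns the surviving nodes and the (possibly renamed) output names.
    """
    seen = {}      # (op, args) -> name of the first node computing it
    rename = {}    # pruned node name -> surviving node name
    kept = []
    for name, op, args in nodes:
        args = tuple(rename.get(a, a) for a in args)
        key = (op, args)
        is_protected = protect_outputs and name in outputs
        if key in seen and not is_protected:
            rename[name] = seen[key]   # prune the duplicate
            continue
        seen.setdefault(key, name)
        kept.append((name, op, args))
    return kept, [rename.get(o, o) for o in outputs]


# Two clones of x are identical computations but must be distinct tensors.
graph = [("a", "clone", ("x",)), ("b", "clone", ("x",))]

print(cse(graph, ["a", "b"], protect_outputs=False))
print(cse(graph, ["a", "b"], protect_outputs=True))
```

Without protection both outputs collapse onto node `a`, so the caller would receive the same tensor twice; with protection the second clone survives because it is returned directly.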
titaiwangms	00a8666708	[ONNX] Support output_names in dynamic_axes when dynamo=True (#135134 ) Prior to this PR, if output_names appeared in dynamic_axes, the export errored when converting dynamic_axes to the dynamic_shapes of torch.export, because only input_names were recognized. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135134 Approved by: https://github.com/justinchuby	2024-09-04 23:42:13 +00:00
eqy	4f70b3cfae	[CUDA][complex][TF32] Update `test_noncontiguous_samples` tolerances for `complex64` (#134526 ) Recent cuDNN heuristics change surfaces same TF32 issue as `float32` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134526 Approved by: https://github.com/ezyang	2024-09-04 23:37:16 +00:00
Shangdi Yu	359077fa43	[export] Fix indentation (#135128 ) Summary: as title Test Plan: CI Differential Revision: D62195680 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135128 Approved by: https://github.com/tugsbayasgalan	2024-09-04 23:26:36 +00:00
Ke Wen	9810ce9ca7	[PP] Go back to export instead of _export (#134299 ) Reverts https://github.com/pytorch/pytorch/pull/130998 because FakeTensor + real device suffice to work around the autocast issue in HF. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134299 Approved by: https://github.com/lessw2020	2024-09-04 23:25:17 +00:00
Animesh Jain	804852c1f9	[dynamo] Search for _torchdynamo_inline only for functions (#135130 ) Issue seen in https://github.com/pytorch/pytorch/issues/93633 Fixes https://github.com/pytorch/pytorch/issues/93633 Unable to create a testcase Pull Request resolved: https://github.com/pytorch/pytorch/pull/135130 Approved by: https://github.com/williamwen42, https://github.com/yanboliang ghstack dependencies: #135039, #135121, #135129	2024-09-04 23:02:59 +00:00
Sun, Jiayi	13a4a0c60d	[Inductor] Apply loop split optimization in codegen_node (#132389 ) This PR applies loop split optimization in codegen_node to avoid non-contiguous load. When the vector is loaded in a non-contiguous manner due to a division in the index, we eliminate the division by splitting the loop to avoid non-contiguous load. Example: ``` import torch import torch.nn as nn class GNReLU(torch.nn.Module): def __init__(self, num_groups, num_channels): super(GNReLU, self).__init__() self.gn = nn.GroupNorm(num_groups, num_channels) def forward(self, x): return torch.nn.functional.relu(self.gn(x)) input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last) m = GNReLU(32, 960).eval() compiled_m = torch.compile(m) with torch.no_grad(): compiled_m(input) ``` Generated code: - Before: ``` cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], ''' #include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2) { #pragma omp parallel num_threads(56) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L)); for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L)) { for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L)) { auto tmp0 = 
at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0); } for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14); masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0); } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2); } } } } { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L)) { for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16); auto tmp1 = [&] { __at_align__ std::array<float, 16> tmpbuf; #pragma GCC unroll 16 for (long x2_inner = 0; x2_inner < 16; x2_inner++) { tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))]; } return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16); } () ; auto tmp3 = [&] { __at_align__ std::array<float, 16> tmpbuf; #pragma GCC unroll 16 for (long x2_inner = 0; x2_inner < 16; x2_inner++) { tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))]; } return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16); } () ; auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16); auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 
16); auto tmp2 = tmp0 - tmp1; auto tmp4 = static_cast<float>(276480.0); auto tmp5 = at::vec::Vectorized<float>(tmp4); auto tmp6 = tmp3 / tmp5; auto tmp7 = static_cast<float>(1e-05); auto tmp8 = at::vec::Vectorized<float>(tmp7); auto tmp9 = tmp6 + tmp8; auto tmp10 = tmp9.rsqrt(); auto tmp11 = tmp2 * tmp10; auto tmp13 = tmp11 * tmp12; auto tmp15 = tmp13 + tmp14; auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0)); tmp16.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0))); } } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg2_1, = args args.clear() assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960)) buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32) cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3) del arg2_1 return (buf3, ) ``` - After: ``` cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], ''' #include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2) { #pragma omp parallel num_threads(56) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); static WeightRecp<at::vec::Vectorized<float>> 
wrecps0(static_cast<long>(17280L)); for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L)) { for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0); } for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14); masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0); } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2); } } } } { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L)) { #pragma GCC ivdep for(long x2=static_cast<long>(0L); x2<static_cast<long>(32L); x2+=static_cast<long>(1L)) { for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 16); auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))]; auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))]; auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 16); auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 16); auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 - tmp2; auto tmp5 = static_cast<float>(276480.0); auto tmp6 = tmp4 / tmp5; 
auto tmp7 = static_cast<float>(1e-05); auto tmp8 = decltype(tmp6)(tmp6 + tmp7); auto tmp9 = 1 / std::sqrt(tmp8); auto tmp10 = at::vec::Vectorized<float>(tmp9); auto tmp11 = tmp3 * tmp10; auto tmp13 = tmp11 * tmp12; auto tmp15 = tmp13 + tmp14; auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0)); tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0))); } for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14); auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))]; auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))]; auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 14); auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 14); auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 - tmp2; auto tmp5 = static_cast<float>(276480.0); auto tmp6 = tmp4 / tmp5; auto tmp7 = static_cast<float>(1e-05); auto tmp8 = decltype(tmp6)(tmp6 + tmp7); auto tmp9 = 1 / std::sqrt(tmp8); auto tmp10 = at::vec::Vectorized<float>(tmp9); auto tmp11 = tmp3 * tmp10; auto tmp13 = tmp11 * tmp12; auto tmp15 = tmp13 + tmp14; auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0)); tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14); } } } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg2_1, = args args.clear() assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960)) buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32) cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3) del arg2_1 return (buf3, ) ``` Pull Request resolved: 
https://github.com/pytorch/pytorch/pull/132389 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel Co-authored-by: Jiong Gong <jiong.gong@intel.com>	2024-09-04 22:42:46 +00:00
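The index transformation behind this loop split can be checked in plain Python. The sketch below is not Inductor code; the two function names are hypothetical, and the sizes (960 channels, 32 groups of 30 channels) are taken from the GroupNorm example in the commit message.

```python
def group_ids_with_division(num_channels, group_size):
    """Single loop over channels: the group index needs a floor division."""
    return [(c, c // group_size) for c in range(num_channels)]


def group_ids_with_split_loop(num_groups, group_size):
    """Split loop: the outer loop variable is the group index, no division."""
    pairs = []
    for g in range(num_groups):
        for inner in range(group_size):
            channel = inner + group_size * g  # same form as x3 + 30L*x2 in the kernel
            pairs.append((channel, g))
    return pairs


before = group_ids_with_division(960, 30)
after = group_ids_with_split_loop(32, 30)
print(before == after)
```

Because the group index is constant across the inner loop, the per-group statistics can be read once as scalars and broadcast, rather than gathered element by element through a temporary buffer as in the "Before" kernel.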
Animesh Jain	87842cc658	[dynamo][super] Corner case where the class is not present in the __mro__ (#135129 ) I could not come up with a testcase. This was seen in https://github.com/pytorch/pytorch/issues/93633 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135129 Approved by: https://github.com/yanboliang ghstack dependencies: #135039, #135121	2024-09-04 22:30:09 +00:00
Michael Lazos	d9ae92cd6e	[Dynamo] Support for proxying frozen dataclasses (#134846 ) Fixes https://github.com/pytorch/pytorch/issues/133858 Details: Previously Dynamo would treat dataclasses as UserDefinedVariables. This was non-desirable if we would like to proxy the value into the graph, which is needed for TensorSubclassMetadata. To rectify this, frozen dataclasses are now able to be proxied similarly to NamedTuples. We require the object to be frozen, because if arbitrary mutation were allowed, we would need to replay those mutations in the graph after construction of the object. For tracing construction of the variable, the generated `__init__` for the dataclass uses `object.__setattr__` because frozen dataclasses throw errors on the usual `__setattr__` invocation. With this treatment, no special handling is needed in dynamo for frozen dataclass construction. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134846 Approved by: https://github.com/bdhirsh, https://github.com/anijain2305	2024-09-04 22:17:00 +00:00
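The `object.__setattr__` detail mentioned in this commit can be observed in plain Python. The following sketch uses only the standard library; the `TensorMeta` class and the helper function are hypothetical and are not the Dynamo types involved.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class TensorMeta:  # hypothetical metadata container
    shape: tuple
    requires_grad: bool = False


def normal_assignment_fails(obj, name, value):
    """Return True if the usual attribute assignment is rejected."""
    try:
        setattr(obj, name, value)
    except dataclasses.FrozenInstanceError:
        return True
    return False


meta = TensorMeta(shape=(2, 3))          # the generated __init__ succeeds
print(normal_assignment_fails(meta, "shape", (4,)))

# Bypassing the frozen __setattr__ the same way the generated __init__ does.
# Shown for illustration only; user code should not normally do this.
object.__setattr__(meta, "shape", (4,))
print(meta.shape)
```

Construction works even though the class rejects ordinary assignment, because initialization goes through `object.__setattr__` rather than the overridden `__setattr__`.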
Xilun Wu	ed06772e35	[TorchElastic] add warning when users try to pass a "use_libuv" argument to create_c10d_store (#135062 ) Summary: Extend the warning message to be more self-explanatory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135062 Approved by: https://github.com/shuqiangzhang	2024-09-04 22:05:51 +00:00
Nikita Shulga	fb1c580892	[BE][optim] Make pyright recognize exported symbols (#135043 ) Follows pattern introduced by https://github.com/pytorch/pytorch/pull/80955 which [pyright](https://github.com/microsoft/pyright) prefers over `__all__` symbol, see https://github.com/microsoft/pylance-release/issues/2953#issuecomment-1168956296 Fixes https://github.com/pytorch/pytorch/issues/134985 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135043 Approved by: https://github.com/janeyx99	2024-09-04 21:53:46 +00:00
rzou	2276940f8c	Make Dynamo inline through torch._library.custom_ops.autograd (#135066 ) Fixes https://github.com/pytorch/pytorch/issues/135057 The bug was: in the situation that Dynamo graph breaks in the forward and Compiled Autograd uses Dynamo to introspect the backward, we end up running into a "Unsupported: inlining through SKIPFILES" error. The solution is to mark the entirety of this module as inlineable. Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/135066 Approved by: https://github.com/bdhirsh, https://github.com/williamwen42, https://github.com/yanboliang	2024-09-04 21:48:28 +00:00
Manuel Candales	4e6df83d19	[PT] Add out variant for avg_pool1d and adaptive_avg_pool1d (#135051 ) Test Plan: CI Differential Revision: D62148410 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135051 Approved by: https://github.com/SS-JIA	2024-09-04 21:20:01 +00:00
Animesh Jain	a8611da86f	[dynamo][backend match] Optimize backend match for common case (#135121 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135121 Approved by: https://github.com/williamwen42 ghstack dependencies: #135039	2024-09-04 21:02:29 +00:00
Boyuan Feng	09a339fc06	[Flex Attention] update __getitem__ without tree_map_only to support compile (#134627 ) Adds a helper function for getting the block mask for a specific row index during decoding. We need this change to avoid the pytree + torch.compile issue #134731. Tested in gpt-fast [pr](https://github.com/pytorch-labs/gpt-fast/pull/196). Pull Request resolved: https://github.com/pytorch/pytorch/pull/134627 Approved by: https://github.com/Chillee	2024-09-04 20:09:41 +00:00
PyTorch MergeBot	741d52c69f	Revert "Add support for 32KB multi_tensor_apply kernel arguments (#134373 )" This reverts commit 08184aa85cf183198ebdf2fd7a49fe7bc4842c13. Reverted https://github.com/pytorch/pytorch/pull/134373 on behalf of https://github.com/drisspg due to See https://github.com/pytorch/pytorch/issues/135126 for more details ([comment](https://github.com/pytorch/pytorch/pull/134373#issuecomment-2329839011))	2024-09-04 19:44:29 +00:00
Saurabh Mishra	dd7cd182ab	[AIInfra][DCP] All gather keys checkpoint utils bug fix (#135045 ) Summary: All gather keys checkpoint utils bug fix. Dist. get_world_size should have the process group passed in to avoid inconsistent world size in case the process group has changed. This is common in the tests. Test Plan: UTs Reviewed By: Saiteja64 Differential Revision: D61578832 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135045 Approved by: https://github.com/MeetVadakkanchery, https://github.com/LucasLLC	2024-09-04 18:49:34 +00:00
Shivam Raikundalia	eb0fd17bc4	[Profiler] Fix Raw Metadata Iterator (#135096 ) Summary: D62008788 added an extra parameter to the RawTensorMetadata struct. For some reason this causes some corrupted accesses in other tests as described in T200685032. Once this is removed the tests pass. Going forward we need to document how to add parameters to this portion of the code as the AppendOnlyLists seem to be very rigid. Test Plan: Ran all the tests locally and they all passed. Differential Revision: D62171089 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135096 Approved by: https://github.com/aaronenyeshi	2024-09-04 18:41:50 +00:00
PyTorch MergeBot	c88c19c6de	Revert "restore CSE'd node metadata in runtime asserts pass (#134516 )" This reverts commit 1dfb1052395d908ed6e67288c9357e16022da272. Reverted https://github.com/pytorch/pytorch/pull/134516 on behalf of https://github.com/pianpwk due to breaking NestedTensor test ([comment](https://github.com/pytorch/pytorch/pull/134516#issuecomment-2329738450))	2024-09-04 18:41:21 +00:00
Shunting Zhang	873abfc18e	[inductor] fix compile time regression due to the (disabled) loop ordering after fusion (#135071 ) It's a bit surprising that the code added in Scheduler.fusable_read_and_write would increase compilation time. Here are some numbers I got from an H100 on BertForMaskedLM: - without the fix, cold start compilation time is around 82s - with the fix, cold start compilation time is around 76s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135071 Approved by: https://github.com/jansel	2024-09-04 18:36:59 +00:00
rzou	d7b57c4d63	Fix tensor.data access under inference_mode and compile (#134878 ) Fixes https://github.com/pytorch/pytorch/issues/134798 In the regular Tensor case, when you call Tensor.data, there's a check for if inference mode is active. If it is active, then we don't set the version counter. We replicate this check for Tensor Subclasses (the bug was we were trying to set the version counter on a FakeTensor in inference_mode). Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/134878 Approved by: https://github.com/bdhirsh	2024-09-04 17:55:41 +00:00
Svetlana Karslioglu	0d193a0adf	Add ExecuTorch warning to mobile_optimizer (#134697 ) Preview: https://docs-preview.pytorch.org/pytorch/pytorch/134697/mobile_optimizer.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/134697 Approved by: https://github.com/ali-khosh, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-04 17:47:14 +00:00
Jason Ansel	193c547461	[inductor] Refactor simplify erase_nodes() (#134822 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134822 Approved by: https://github.com/shunting314 ghstack dependencies: #134748, #134749	2024-09-04 17:32:07 +00:00
Jason Ansel	2ddf3ed707	[inductor] Allow cudagraphs with unused CPU inputs (#134749 ) This pattern was preventing cudagraphs from kicking in on torch_multimodal_clip, resulting in `1.6529 → 3.3471` speedup. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134749 Approved by: https://github.com/shunting314 ghstack dependencies: #134748	2024-09-04 17:32:07 +00:00
Jason Ansel	cff1158200	[inductor] Pass to fix device on index(..., [iota]) (#134748 ) This pattern was preventing cudagraphs from kicking in on torch_multimodal_clip, resulting in `1.6529 → 3.3471` speedup. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134748 Approved by: https://github.com/shunting314	2024-09-04 17:31:58 +00:00
PyTorch MergeBot	7858045491	Revert "Fix set_unbacked_bindings when list of Tensors is returned (#133585 )" This reverts commit 2a49296d7563150d67bb00bd4c97bc5aafaa77df. Reverted https://github.com/pytorch/pytorch/pull/133585 on behalf of https://github.com/ezyang due to fails torchrec tests ([comment](https://github.com/pytorch/pytorch/pull/133585#issuecomment-2329602983))	2024-09-04 17:21:32 +00:00
PyTorch MergeBot	8759ed2ac5	Revert "Compute and do renamings even when ignoring fresh unbacked symbols (#134407 )" This reverts commit 46cb2af7d822681298370bab9d49b3cba5546dd5. Reverted https://github.com/pytorch/pytorch/pull/134407 on behalf of https://github.com/ezyang due to need to back out https://github.com/pytorch/pytorch/pull/133585 ([comment](https://github.com/pytorch/pytorch/pull/134407#issuecomment-2329597388))	2024-09-04 17:18:21 +00:00
PyTorch MergeBot	fc07e6bf56	Revert "Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053 )" This reverts commit a178a053ad2c8e42d1b684ed38385b9646ec3b74. Reverted https://github.com/pytorch/pytorch/pull/135053 on behalf of https://github.com/ezyang due to need to back out https://github.com/pytorch/pytorch/pull/133585 ([comment](https://github.com/pytorch/pytorch/pull/134407#issuecomment-2329597388))	2024-09-04 17:18:21 +00:00
Laith Sakka	c8ab9b06a2	Redesign custom op functionalization for better re-inplace (#134409 ) - The new implementation (auto_functionalized_v2) is enabled by default but can be disabled using an inductor flag. - In export mode the old implementation is used. Motivation: The previous functionalization fails to re-inplace arguments when they are views over other tensors; see issue https://github.com/pytorch/pytorch/issues/131192. The new functionalization makes re-inplacing easier for views. A) Functionalization pass. Consider a program: ``` func(t) x = t[0] y = t[1] foo(x, y) # custom operator with x, y mutable return (x, y, t) ``` - To functionalize `foo` we generate a function that operates on the base tensors of the inputs (x.base() and y.base()) and record how to regenerate the views from the base; for argument x we record ```ViewInfo=(x.base(), x.size(), x.stride(), x.storage_offset())``` - Due to some limitations on the torch.export arguments format, we have to generate a lot of arguments, but this is something we can simplify in the future. For the example above we get the following function. ``` auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default, _x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0 , _y_base_index = 0,_y_size = (), _y_stride = (), _y_storage_offset = 1 , _all_bases = [arg0_1]) ``` - In the code above: - _all_bases[t]: refers to a unique set of bases for all foo arguments. - for each argument x we have _x_base_index, _x_size, _x_stride, _x_storage_offset that can be used to (1) regenerate x from _all_bases[_x_base_index] or a copy of the base. - the output of auto_functionalized is the foo output, followed by one tensor for each base in _all_bases, each a copy of the base tensor after observing the mutations of all the arguments that are views of that base. 
- for each use of a base in _all_bases, or a view of it, that comes after the call to foo, replace it with a view of the new output. For the function above, after functionalization we get: ``` def forward(self, arg0_1: "f32[2][1]cpu"): auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default, _x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0, _y_base_index = 0, _y_size = (), _y_stride = (), _y_storage_offset = 1, _all_bases = [arg0_1]) getitem_1: "f32[2][1]cpu" = auto_functionalized[1]; auto_functionalized = None copy_: "f32[2][1]cpu" = torch.ops.aten.copy_.default(arg0_1, getitem_1); arg0_1 = copy_ = None # No stacktrace found for following nodes select_2: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 0) select_3: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 1); getitem_1 = None return (select_2, select_3) ``` B) Semantics of auto_functionalize. The new semantics of auto_functionalize are as follows: 1. For each base in all_bases, copy the base and create all_bases copies (if a base is inplaced we do not need to copy it). 2. For each arg, regenerate the arg from the copy of its base using the view information above. 3. Return the original foo output followed by the new bases. C) Re-inplace pass. Since auto_functionalize copies the bases (step 1 above), what we actually inplace is the bases (the pass runs just like before but on the bases instead of the args). 1. For each base b in _all_bases, check if there is any use of the base (or its aliases/views) after auto_functionalize (before it is overwritten with a copy); if there is none, then inplace it (avoid copying it in step 1 above). Pull Request resolved: https://github.com/pytorch/pytorch/pull/134409 Approved by: https://github.com/zou3519	2024-09-04 17:08:58 +00:00
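The three-step semantics in section B can be modelled in pure Python for the scalar-view example (`x = t[0]`, `y = t[1]`). This is a sketch under simplifying assumptions, not the real higher-order op: bases are plain lists, a view is reduced to `(base_index, storage_offset)` because size and stride are `()` in the example, and `ScalarView`, `foo`, and `auto_functionalized_sketch` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ScalarView:
    """A 0-d view into a flat storage list at a fixed offset."""
    storage: list
    offset: int

    def get(self):
        return self.storage[self.offset]

    def set(self, value):
        self.storage[self.offset] = value


def foo(x, y):
    """Hypothetical custom op that mutates both of its arguments."""
    x.set(x.get() + 1.0)
    y.set(y.get() * 2.0)


def auto_functionalized_sketch(op, view_infos, all_bases):
    # Step 1: copy every base so the caller's tensors are not mutated.
    new_bases = [list(base) for base in all_bases]
    # Step 2: regenerate each argument as a view of its copied base.
    args = [ScalarView(new_bases[idx], off) for idx, off in view_infos]
    out = op(*args)
    # Step 3: return the op output followed by the updated bases.
    return (out, *new_bases)


t = [1.0, 3.0]
result = auto_functionalized_sketch(foo, [(0, 0), (0, 1)], [t])
print(result)  # op output followed by the new base
print(t)       # original base is untouched
```

In this model the re-inplace pass from section C would correspond to skipping the copy in step 1 for a base that has no later uses.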
Shivam Raikundalia	195ac85fb6	[Profiler] Allow kwinputs to be non-string values (#134893 ) Summary: When we process keyword arguments in profiler today we assume that all values will be strings. This breaks HTA because it assumes that "stream" and other values similar to it will be ints. To fix this we will only put quotes around strings for ivalues. Test Plan: Add chrome trace export in unit tests and check that stream does not have quotes around it Differential Revision: D62056059 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134893 Approved by: https://github.com/sanrise, https://github.com/izaitsevfb	2024-09-04 16:34:10 +00:00
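The quoting rule described in the profiler commit above (quote only string values) can be illustrated with a small sketch. The real change lives in the profiler's C++ code; `render_kwinputs` and its `quote_everything` switch are hypothetical names used here only to contrast the two behaviors.

```python
import json


def render_kwinputs(kwinputs, quote_everything=False):
    """Render keyword inputs as a JSON object string.

    quote_everything=True mimics the earlier behavior, where every value
    is emitted as a string; the default quotes only actual strings.
    """
    fields = []
    for name, value in kwinputs.items():
        if quote_everything or isinstance(value, str):
            rendered = json.dumps(str(value))   # quoted and escaped
        else:
            rendered = json.dumps(value)        # emitted bare, e.g. 7
        fields.append(f"{json.dumps(name)}: {rendered}")
    return "{" + ", ".join(fields) + "}"


old = json.loads(render_kwinputs({"stream": 7, "name": "copy"}, quote_everything=True))
new = json.loads(render_kwinputs({"stream": 7, "name": "copy"}))
```

A consumer that expects `stream` to be an integer sees the string `"7"` in `old` but the integer `7` in `new`.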
atalman	60dfe1b35e	Fix lint after Bump actions/download-artifact update (#135109 ) Fixes lint after auto-generated PR: `367a78495f` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135109 Approved by: https://github.com/ezyang, https://github.com/huydhn	2024-09-04 15:26:17 +00:00
Avik Chaudhuri	8bfd4916d6	fast path for sympy gcd in floordiv (#134880 ) Summary: Re-implementation of https://github.com/pytorch/pytorch/pull/134150, which was reverted because of some internal tests hanging (case B). The original motivation was to get some other internal test unstuck (case A). The root cause is that sympy.gcd is both very clever as well as can blow up in some cases. This PR introduces a fast path with an appropriate fallback to sympy.gcd that ensures that both cases A and B go through. Test Plan: See the included test for specific examples. Also https://fb.workplace.com/groups/1075192433118967/posts/1491493248155548/?comment_id=1491938994777640&reply_comment_id=1492622821375924 Differential Revision: D62043315 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134880 Approved by: https://github.com/ezyang	2024-09-04 14:56:49 +00:00
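The general shape of "a fast path with a fallback" from the sympy gcd commit above can be sketched as follows. This does not reproduce the PR's actual fast-path condition, which the message does not spell out; the plain-integer check and the `fallback` parameter (standing in for `sympy.gcd`) are assumptions for illustration.

```python
import math


def gcd_with_fast_path(a, b, fallback):
    """Return gcd(a, b), calling the expensive fallback only when needed."""
    # Assumed fast path: both operands are plain Python integers.
    if isinstance(a, int) and isinstance(b, int):
        return math.gcd(a, b)
    # Anything else is handed to the general (possibly slow) routine.
    return fallback(a, b)


fallback_calls = []


def recording_fallback(a, b):
    """Stand-in for the general routine; records that it was invoked."""
    fallback_calls.append((a, b))
    return 1


fast_result = gcd_with_fast_path(12, 18, recording_fallback)
slow_result = gcd_with_fast_path("x", "y", recording_fallback)
```

The point of the pattern is that the cheap check must be conservative: whenever it does not apply, the fallback still produces the answer, so no previously working case regresses.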
chuanqiw	67208f08bd	[CD] Enable XPU nightly build on Windows (#134312 ) Depends on https://github.com/pytorch/builder/pull/1975 land. Works for https://github.com/pytorch/pytorch/issues/114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134312 Approved by: https://github.com/atalman	2024-09-04 14:46:36 +00:00
Edward Z. Yang	6c5669903f	Fix Invalid NaN comparison due to infinity-zero multiply on latest sympy (#135044 ) Fixes https://github.com/pytorch/pytorch/issues/133735 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135044 Approved by: https://github.com/zou3519	2024-09-04 14:13:09 +00:00
Edward Z. Yang	a178a053ad	Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053 ) Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/ I'm not sure this is the right approach though... Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053 Approved by: https://github.com/ydwu4 ghstack dependencies: #134407	2024-09-04 13:25:08 +00:00
Edward Z. Yang	46cb2af7d8	Compute and do renamings even when ignoring fresh unbacked symbols (#134407 ) This is a bit twisty and I don't entirely understand the situation, but here's my best explanation. In https://github.com/pytorch/pytorch/pull/133588 I am trying to fix a problem reported by user in https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/ The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis). In #133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way. I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set. But I don't entirely understand all the interactions. I just know that this seems to not cause tests to fail, and it should fix the internal issue (which I need to add a UT for.) 
Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134407 Approved by: https://github.com/ydwu4	2024-09-04 13:25:07 +00:00
FFFrog	5690f003a6	C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED and C10_DIAGNOST should be used in pairs (#135004 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135004 Approved by: https://github.com/aaronenyeshi	2024-09-04 13:14:23 +00:00
Thanh Ha	dcf05fcb14	Fix stale job using nonexistent ARC runner (#134863 ) The ARC CI system has been shut down, so this job is currently using a runner that doesn't exist. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134863 Approved by: https://github.com/ZainRizvi	2024-09-04 12:57:10 +00:00
FFFrog	a8467c17c3	Remove specific lazy initialization of PrivateUse1 (#135002 ) As the title states, the lazy initialization specific to PrivateUse1 can be removed because maybe_initialize_device already supports PrivateUse1. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135002 Approved by: https://github.com/albanD	2024-09-04 11:45:45 +00:00
FFFrog	80a6d60829	Move _run_autocast_outofplace to a base class named TestAutocast to reduce redundancy (#134460 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134460 Approved by: https://github.com/EikanWang, https://github.com/ezyang	2024-09-04 10:48:58 +00:00
Luca Wehrstedt	c2ff9fe042	[fp8 rowwise] Retune the tile heuristics to increase perf (#134781 ) I propose a new heuristic function to select tile size, cluster size, and transposition given M, N and K. It improves the performance across the board (on average) while remaining simple and relying only on a handful of kernels (to limit build time and binary size). Across the shapes I benchmarked, the new heuristic gives a (geometric) mean speedup of +16.5%. Some shapes worsen, but 98.6% of the shapes retain their old performance (up to 5% to allow for noise) or improve it. ![image](https://github.com/user-attachments/assets/bca30583-ac32-4af6-a4f9-37164bdb2430) I benchmarked on over 5.4k different shapes: - For M and N I swept across all values which are the sums of two powers of 2 (limited to multiples of 64, capped at 16,384) - For K I only used powers of 2 between 1,024 and 8,192 (based on the intuition that the optimal config doesn't depend on K, which turned out to be the case) Here's the detailed speedup for each shape ![image](https://github.com/user-attachments/assets/acac4318-9ee0-455d-861b-c764b8c13d22) <details> <summary> This is the code I used to benchmark </summary> ``` import torch import torch.utils.benchmark s = set() for i in range(6, 15): s.add(2**i) for j in range(6, i): s.add(2**i + 2**j) ms = [i for i in sorted(s) if i <= 2**14] ns = [i for i in sorted(s) if i <= 2**14] ks = [2**i for i in range(10, 14)] def make_graph(n_iters, f): g = torch.cuda.CUDAGraph() with torch.cuda.graph(g): for _ in range(n_iters): f() return g def rowwise_scale(t, dtype_t): min_v, max_v = torch.finfo(dtype_t).min, torch.finfo(dtype_t).max scale_t = torch.clamp(t.abs().amax(dim=-1, keepdim=True).float(), min=1e-12) / max_v t_fp8 = (t / scale_t).clamp(min=min_v, max=max_v).to(dtype_t) return t_fp8, scale_t for m in ms: for n in ns: for k in ks: a = torch.randn((m, k), device="cuda", dtype=torch.float) b_t = torch.randn((n, k), device="cuda", dtype=torch.float) a_fp8, scale_a = 
rowwise_scale(a, torch.float8_e4m3fn) b_t_fp8, scale_b_t = rowwise_scale(b_t, torch.float8_e4m3fn) func = lambda: torch._scaled_mm( a_fp8, b_t_fp8.t(), scale_a=scale_a, scale_b=scale_b_t.t(), bias=None, use_fast_accum=True, out_dtype=torch.bfloat16 ) print(f"{m=},{n=},{k=}") print(torch.utils.benchmark.Timer("g.replay()", globals={"g": make_graph(1000, func)}).blocked_autorange(min_run_time=1).mean / 1000) ``` </details> <details> <summary> This is the code I used for the plots </summary> ``` from itertools import islice import numpy as np import pandas as pd import matplotlib.pyplot as plt from matplotlib.cm import ScalarMappable from matplotlib.colors import FuncNorm from mpl_toolkits.axes_grid1 import ImageGrid def batched(iterable, n): iterator = iter(iterable) while batch := tuple(islice(iterator, n)): yield batch def try_to_convert(v): if v == "False": return False if v == "True": return True return int(v) def get_from_paste(filename): text = open(filename, "rt").read() headers = [] data = [] for config, value in batched(text.splitlines(), 2): config_elems = config.split(",") if not headers: headers = [e.partition("=")[0] for e in config_elems] data.append((*(try_to_convert(e.partition("=")[-1]) for e in config_elems), float(value))) return pd.DataFrame(data, columns=headers + ["latency"]) old_latencies = get_from_paste(...) new_latencies = get_from_paste(...) 
ratios = pd.merge(new_latencies, old_latencies, how="left", left_on=["m", "n", "k"], right_on=["m", "n", "k"], suffixes=("_new", "_old")) ratios = ratios.assign(ratio=ratios.latency_old / ratios.latency_new) fig = plt.figure(figsize=(40.0, 10.0)) grid = ImageGrid( fig, 111, nrows_ncols=(1, 4), axes_pad=0.5, share_all=True, cbar_location="right", cbar_mode="single", cbar_size="7%", cbar_pad=0.15, ) log_amax = np.max(np.abs(np.log(ratios.ratio.to_numpy()))) for K, ax in zip([1024, 2048, 4096, 8192], grid): pivoted = ratios[(ratios.k == K)].pivot_table(index="m", columns="n", values="ratio") im = ax.imshow(np.log(pivoted.to_numpy()), origin="lower", vmin=-log_amax, vmax=log_amax, cmap="PiYG") m_vals, n_vals = pivoted.axes ax.set_xticks(np.arange(len(n_vals)), labels=[f"N={i}" for i in n_vals.values], fontsize=12) ax.set_yticks(np.arange(len(m_vals)), labels=[f"M={i}" for i in m_vals.values], fontsize=12) plt.setp(ax.get_xticklabels(), rotation=90, ha="right", rotation_mode="anchor") ax.grid(False) ax.set_title(f"K={K}", fontsize=20) norm = FuncNorm((lambda x: np.log(x), lambda x: np.exp(x)), np.exp(-log_amax), np.exp(log_amax)) ax.cax.colorbar(ScalarMappable(norm=norm, cmap="PiYG")) plt.show() counts, bins = np.histogram(np.log(ratios.ratio.to_numpy()), bins=500) plt.stairs(counts, np.exp(bins), fill=True) plt.xscale("function", functions=(lambda x: np.log(x), lambda x: np.exp(x))) ``` </details> I only benchmarked fast_accum=True and out_dtype=torch.bfloat16 supposing that these are the most commonly-used flags (e.g., with fast_accum=False row-wise scaling is much slower than tensor-wise scaling hence unpractical). Pull Request resolved: https://github.com/pytorch/pytorch/pull/134781 Approved by: https://github.com/drisspg, https://github.com/eqy ghstack dependencies: #134773	2024-09-04 09:17:28 +00:00
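As a sanity check on the "over 5.4k different shapes" figure in the fp8 rowwise commit above, the shape sweep it describes can be reproduced without a GPU. This sketch only enumerates the shapes; the variable names `sizes` and `num_shapes` are chosen here for illustration.

```python
# M and N: sums of at most two powers of 2, multiples of 64, capped at 16,384.
sizes = set()
for i in range(6, 15):              # 2**6 = 64 up to 2**14 = 16,384
    sizes.add(2**i)
    for j in range(6, i):
        sizes.add(2**i + 2**j)
sizes = sorted(s for s in sizes if s <= 2**14)

# K: powers of 2 between 1,024 and 8,192.
ks = [2**i for i in range(10, 14)]

# Every (M, N, K) combination is benchmarked once.
num_shapes = len(sizes) * len(sizes) * len(ks)
print(len(sizes), num_shapes)
```

This yields 37 candidate values for each of M and N and 4 values of K, so 37 × 37 × 4 = 5,476 shapes, consistent with the stated count.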
Luca Wehrstedt	eec8fa038e	[fp8 rowwise] Support transposing operands in order to change output layout (#134773 ) On some occasions, a column-major output layout is more efficient (it's unclear if it's because of better store coalescing for some tile shapes, or whether it's just that it's CUTLASS's default and thus it's better optimized). At this stage I only add a flag that allows transposing, but the hardest part will be deciding on a new heuristic to turn it on selectively. This will be in a follow-up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134773 Approved by: https://github.com/drisspg	2024-09-04 09:17:28 +00:00
Gregory Comer	679b8fe426	Update generate-xnnpack-wrappers.py parsing to handle build identifier (#134724 ) Fixes an issue after updating XNNPACK where parsing the XNNPACK CMakeLists breaks. I'm just ignoring the generated build identifier for now, since it's not used and we would need to update the buck build to generate it at build time. Remove the unused ukernels_xop XNNPACK target, as it has no sources (after the recent update) and causes buck1 to complain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134724 Approved by: https://github.com/mcr229	2024-09-04 08:45:46 +00:00
Pian Pawakapan	1dfb105239	restore CSE'd node metadata in runtime asserts pass (#134516 ) Adds val, and optionally stack_trace & nn_module_stack metadata back to SymInt compute nodes that we CSE, with a hook on `graph.create_node()`. Not sure if there's other metadata we want to populate here? Pull Request resolved: https://github.com/pytorch/pytorch/pull/134516 Approved by: https://github.com/ezyang	2024-09-04 05:56:28 +00:00
Avik Chaudhuri	9f00317997	rationalize STATIC vs. None (#134877 ) Summary: A bit of refactoring to prepare to remove `None` as a way to specify static dimensions in dynamic shapes, given we already have `Dim.STATIC` for the same purpose. We will now warn whenever this happens. However no tests were modified because problematic uses of `None` still need to behave as they do today, until we are ready to remove support. It should be easy to port tests by replacing the warning function to raise instead. Note that other uses of `None`, such as for entire values (tensor or non-tensor) remain as is. Moving forward this should be the only purpose of `None` (at least externally). Finally, there's a bit of confusion in our representation now because `AUTO` also internally transforms to `None`. Renamed dynamic_shapes to transformed_dynamic_shapes where this happens. Overall the two forms (pre and post transformation) have different properties so should probably not be represented in the same format in the future. Test Plan: existing Differential Revision: D62040729 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134877 Approved by: https://github.com/pianpwk	2024-09-04 05:34:26 +00:00
Yu, Guangye	9809080b9e	[Reland] Refactor caching device allocator utils (#130923 ) # Motivation Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322), this PR aims to refactor the caching device allocator utils to improve code reuse. This is the first PR; follow-up PRs can continue refactoring the device caching allocator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy	2024-09-04 05:31:08 +00:00
Xu Han	6448d351db	[inductor] clean up cpp_builder code. (#134909 ) Clean up cpp_builder duplication code. Hi @henrylhtsang , could you please help on land internally? Pull Request resolved: https://github.com/pytorch/pytorch/pull/134909 Approved by: https://github.com/henrylhtsang	2024-09-04 05:29:08 +00:00
PyTorch UpdateBot	2c9b4d2052	[executorch hash update] update the pinned executorch hash (#135077 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135077 Approved by: https://github.com/pytorchbot	2024-09-04 05:17:29 +00:00
CaoE	6b05aafc57	Add specializations for VecMaskLoad and VecMaskCast (#126501 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126501 Approved by: https://github.com/jgong5, https://github.com/jansel ghstack dependencies: #126500	2024-09-04 05:12:52 +00:00
CK Luk	ffd1e214df	Back out "[FSDP2] Set `ctx.set_materialize_grads(False)` for post-backward (#133498 )" (#135059 ) Summary: Original commit changeset: 96513cbc425f Original Phabricator Diff: D61291210 There is some evidence that FB-FM-v4 has better NE with Set ctx.set_materialize_grads(False), especially when pairing up with prefetching. See https://www.internalfb.com/intern/anp/view/?id=5732259 Test Plan: export NUM_WORKERS=128 export BATCH_SIZE=1024 export CONFIG_FILE="mast_joint_arch_exploration_cmf_updated_fbfm_v3_fsdp2.yaml" export ENTITLEMENT=ads_global_tc_2k_training_large_short buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -c fbcode.platform010_cuda_version=12 -c hpc_comms.use_nccl=2.17.1 -- mode=${CONFIG_FILE} launcher.tags='[ads_ranking_taxonomy_monetization_genai]' launcher.data_project=pytorch_at_scale launcher.max_retries=10 launcher.fbl_entitlement=${ENTITLEMENT} launcher.oncall=pytorch_training_enablement launcher.hardware=GRANDTETON launcher.num_workers=${NUM_WORKERS} data_loader.dataset.batch_size=${BATCH_SIZE} training.planner.proposer=dynamic_col_dim training.planner.proposer.optim_target=hbm 2>&1 | tee ~/tmp/log.mast Differential Revision: D62009163 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135059 Approved by: https://github.com/awgu	2024-09-04 04:50:32 +00:00
cyy	c818ecd169	Remove Caffe2 code from tool scripts (#134941 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134941 Approved by: https://github.com/ezyang	2024-09-04 03:47:58 +00:00
Animesh Jain	9e6f4f3f77	[dynamo] Use __eq__ for backend match (#135039 ) Fixes https://github.com/pytorch/pytorch/issues/131150 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135039 Approved by: https://github.com/jansel	2024-09-04 03:35:18 +00:00
dependabot[bot]	367a78495f	Bump actions/download-artifact from 2 to 4.1.7 in /.github/workflows (#135068 ) Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 2 to 4.1.7. - [Release notes](https://github.com/actions/download-artifact/releases) - [Commits](https://github.com/actions/download-artifact/compare/v2...v4.1.7) --- updated-dependencies: - dependency-name: actions/download-artifact dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-09-03 20:33:57 -07:00
Sam Larsen	362ecd9817	[inductor] Skip the sub-process pool until it's ready (#133508 ) Summary: Torch-compiling a quick script can be a bit slower than it needs to be: even though we initialize the subprocess pool early, it still might not be ready by the time we try to compile the first Triton kernel. Instead, let's use the single-threaded path until the pool has successfully completed a no-op job. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133508 Approved by: https://github.com/Chillee	2024-09-04 03:26:55 +00:00
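The "use the single-threaded path until the pool has completed a no-op job" idea from the inductor commit above can be sketched with the standard library. This is not the inductor code: `WarmPool` is a hypothetical class, and a thread pool is used instead of a subprocess pool to keep the sketch self-contained.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class WarmPool:
    """Runs work inline until a warm-up job has finished successfully."""

    def __init__(self, warmup=lambda: None, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        # The warm-up is a no-op by default; its completion marks readiness.
        self._ready = self._pool.submit(warmup)

    def is_ready(self):
        # done() never blocks; exception() is only consulted once done.
        return self._ready.done() and self._ready.exception() is None

    def wait_ready(self, timeout=None):
        self._ready.result(timeout=timeout)

    def run(self, fn, *args):
        """Return (path, result), where path is 'pool' or 'inline'."""
        if self.is_ready():
            return "pool", self._pool.submit(fn, *args).result()
        return "inline", fn(*args)

    def shutdown(self):
        self._pool.shutdown(wait=True)


gate = threading.Event()
pool = WarmPool(warmup=gate.wait)   # warm-up blocks until the gate opens
before = pool.run(pow, 2, 5)        # pool not ready yet: runs inline
gate.set()
pool.wait_ready(timeout=5)
after = pool.run(pow, 2, 5)         # warm-up finished: goes through the pool
pool.shutdown()
```

The blocking warm-up is only there to make both paths observable; with the default no-op warm-up, the first few calls may run inline and later ones go through the pool.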
Justin Chu	7600e9b36f	[ONNX] Use the stable APIs in onnxscript and sync the latest logic (#134782 ) Use the stable apis from onnxscript: https://github.com/microsoft/onnxscript/issues/1827 Sync with torch-onnx at https://github.com/justinchuby/torch-onnx/compare/v0.1.12...v0.1.15. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134782 Approved by: https://github.com/titaiwangms	2024-09-04 03:10:20 +00:00
Jason Ansel	982e27e532	[halide-backend] Update CI pin (#130258 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130258 Approved by: https://github.com/eellison	2024-09-04 03:08:49 +00:00
Rachel Guo	ae3aa8ff73	[AOTI][Tooling][5/n] Refactor the debug printer call to a level lower (#134789 ) Summary: 1. Move the debug printer call a level lower, to here: https://www.internalfb.com/code/fbsource/[931d7bbb9e7cf2dcb926f42718f56fc940903eec]/fbcode/caffe2/torch/_inductor/codegen/cpp_wrapper_cuda.py?lines=335 2. Add a UT for validating the debug printer for user-defined Triton kernel codegen. The benefits of having the debug printer call happen at a more centralized place are: 1) it reduces the duplicated debug-printer logic scattered across the codebase; 2) it can handle more Triton kernel codegen paths, as long as they invoke `generate_kernel_call()`. For example, it automatically supports debug printing for user_defined_kernel, which is a pretty common use case we encounter in debugging. Test Plan: ```AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_user_defined_triton_kernel_abi_compatible_cuda``` Also verified that the templateKernel codegen path still works. Differential Revision: D61949020 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134789 Approved by: https://github.com/ColinPeppler	2024-09-04 02:41:30 +00:00
Bob Ren	ea89f01281	Remove unused comment (#135034 ) As part of my rampup I've been reading through some of @ezyang's diffs. I noticed in https://github.com/pytorch/pytorch/pull/133439 there was a comment that he forgot to remove. This diff removes that comment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135034 Approved by: https://github.com/albanD	2024-09-04 02:32:26 +00:00
Edward Z. Yang	175485097a	[EASY] Typofix (#135022 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135022 Approved by: https://github.com/albanD	2024-09-04 01:59:40 +00:00
Edward Z. Yang	15c25c4580	Fix dim mismatch logic automatic dynamic not working with compiler collectives (#135025 ) Fixes https://fb.workplace.com/groups/3095840833991792/permalink/3810738595835342/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135025 Approved by: https://github.com/albanD	2024-09-04 01:50:21 +00:00
CaoE	4ebf6b04a8	Turn on expanded index path for Half on CPU (#133553 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133553 Approved by: https://github.com/yanbing-j, https://github.com/jgong5, https://github.com/peterbell10	2024-09-04 00:56:56 +00:00
Moritz Marseu	e000cf0ad9	Fix license metadata in setup.py (#129219 ) Package metadata in setup.py lists license as BSD-3 which is not a valid SPDX id. The correct id would be BSD-3-Clause. Specifying an SPDX id is beneficial to license compliance scanning. Taking up #129123 from my personal account. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129219 Approved by: https://github.com/malfet, https://github.com/kit1980	2024-09-04 00:21:22 +00:00
Menglu Yu	45743019cf	[PT2][Optimus] Skip meta update on symbolic shape (#134975 ) Summary: We noticed that a runtime error occurs during the dim broadcast when the meta example value has a symbolic shape, so we skip the meta update in that case. Test Plan: ``` buck2 run mode/opt //caffe2/benchmarks/dynamo/fb:torchbench_run_ads_dhen_5x_training -- -m ads_dhen_5x -t training ``` P1559019921 Differential Revision: D62115015 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134975 Approved by: https://github.com/xuzhao9	2024-09-04 00:05:51 +00:00
Shivam Raikundalia	9ffcca7060	[Profiler] Handle Tensor Sizes/Strides Parsing Error (#134862 ) Summary: Currently some jobs are encountering the following trace, P1539415198. This suggests that when we are parsing through tensors the path is prone to encountering an invalid address. This is possibly occurring because, for some reason, the sizes() and strides() of a Tensor do not have the same number of dimensions. We assume they do when iterating through the shapes to get the Ivalue generator. When browsing some of the tensor implementations, I found that some of the size and stride paths are different, which could be the cause of this issue. Regardless, the profiler should be flexible enough to handle such issues without bringing down the whole main thread. If the crashes still persist, this will still give us a data point as to where they are occurring, and we can rule out the strides/sizes as the culprit. Test Plan: This change doesn't break anything in the happy path; it just makes sure the bad path is not exited abruptly. We should use this to debug which events have mismatching dimensions between sizes and strides. Differential Revision: D62008788 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134862 Approved by: https://github.com/aaronenyeshi	2024-09-03 23:46:38 +00:00
Zain Rizvi	f05b716d6d	Add validator to ensure runner determinator script is kept in sync (#134800 ) We keep two copies of the runner-determinator script: 1. In runner_determinator.py, for ease of testing. This however is not actually executed during CI. 2. Embedded in _runner-determinator.yml. This is what CI uses. Why the duplication? Short version: Because of how GitHub CI works, during a given CI run the workflow yml files could actually come from the main branch, while the remaining files get read from the local commit. This can lead to a newer version of _runner-determinator.yml trying to invoke an older version of runner_determinator.py than it was actually designed for. Chaos ensues. We mitigate this by embedding the script into the yml file. But we still keep the script around because it's much easier to run tests against. This workflow's job is to ensure that if one edits the script in one of those two locations, they remember to update it in the other location as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134800 Approved by: https://github.com/zxiiro, https://github.com/PaliC ghstack dependencies: #134796	2024-09-03 23:29:04 +00:00
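A sync check of the kind the validator commit above describes could look roughly like this. The marker comments, the function names, and the sample workflow text are all hypothetical; the actual validator and the layout of _runner-determinator.yml are not shown in the commit message.

```python
import textwrap

BEGIN = "# BEGIN EMBEDDED SCRIPT"   # hypothetical marker lines
END = "# END EMBEDDED SCRIPT"


def extract_embedded_script(workflow_text):
    """Return the dedented text found between the two marker lines."""
    body, inside = [], False
    for line in workflow_text.splitlines():
        if line.strip() == BEGIN:
            inside = True
        elif line.strip() == END:
            inside = False
        elif inside:
            body.append(line)
    return textwrap.dedent("\n".join(body)).strip()


def in_sync(script_text, workflow_text):
    """True when the standalone script matches the embedded copy."""
    return script_text.strip() == extract_embedded_script(workflow_text)


script = "def pick_runner():\n    return 'linux'\n"
workflow = """\
steps:
  - run: |
      # BEGIN EMBEDDED SCRIPT
      def pick_runner():
          return 'linux'
      # END EMBEDDED SCRIPT
"""
matches = in_sync(script, workflow)
drifted = in_sync(script.replace("linux", "macos"), workflow)
```

A CI job built on such a check would fail whenever `in_sync` returns False, prompting the author to update the other copy.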
Zain Rizvi	469429b959	Refactor runner determinator (#134796 ) Some minor refactorings to make the code easier to parse and easier to add unit tests for. Keeping this as a separate PR for ease of review, since it should have zero functional behavior changes Pull Request resolved: https://github.com/pytorch/pytorch/pull/134796 Approved by: https://github.com/zxiiro, https://github.com/PaliC	2024-09-03 23:29:04 +00:00
PyTorch MergeBot	c044deb9ce	Revert "c10d/logging: add C10D_LOCK_GUARD (#134131 )" This reverts commit f33bcbe5fd67e6b18be259ad2f0dc11c74157075. Reverted https://github.com/pytorch/pytorch/pull/134131 on behalf of https://github.com/kit1980 due to See D61985186 ([comment](https://github.com/pytorch/pytorch/pull/134131#issuecomment-2327556381))	2024-09-03 22:35:14 +00:00
PyTorch MergeBot	2fd36086bc	Revert "Add torch.serialization.skip_data context manager (#134504 )" This reverts commit 94db935749b8de99d8c3ab23fb880c67c8f3e67a. Reverted https://github.com/pytorch/pytorch/pull/134504 on behalf of https://github.com/kit1980 due to See D62082697 ([comment](https://github.com/pytorch/pytorch/pull/134504#issuecomment-2327542276))	2024-09-03 22:21:27 +00:00
drisspg	85fa019697	[Docs] Fix call to deprecated function (#135037 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135037 Approved by: https://github.com/janeyx99, https://github.com/jbschlosser	2024-09-03 20:57:11 +00:00
rzou	14c8ef5198	autolabel aotinductor->export (#135040 ) "module: aotinductor" will automatically add "oncall: export". Test Plan: - none Pull Request resolved: https://github.com/pytorch/pytorch/pull/135040 Approved by: https://github.com/ydwu4	2024-09-03 20:17:51 +00:00
Xu Han	c40e622966	[inductor] add openmp config for intel compiler on Linux. (#134973 ) Config `openmp` for Intel Compiler on Linux. Based on this PR, we can confirm that the Intel optimized libraries build and work well. <img width="1039" alt="image" src="https://github.com/user-attachments/assets/838d5114-c778-4961-9cfe-39a814647089"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134973 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-09-03 20:10:21 +00:00
Driss Guessous	272f3b9fe1	[FlexAttention] Update tolerance for failing test (#135035 ) Summary: Address: T198937061 Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:flex_attention -- --exact 'caffe2/test/inductor:flex_attention - test_no_q_info_compile_False (caffe2.test.inductor.test_flex_attention.TestBlockMask)' --run-disabled Differential Revision: D62137797 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135035 Approved by: https://github.com/Chillee	2024-09-03 20:09:21 +00:00
Xilun Wu	e7731b3f8a	[TorchElastic] make torch elastic not have to realize TCPStore backend type and rely on c10d to decide which backend to use (#134882 ) D53335860 and D56435815 added an option to torch elastic allowing users to choose a TCPStore backend type to use via 1) explicit argument passing in user code when instantiating `MastRendezvousHandler` 2) pass `--use_libuv` command line argument to `torchrun`. The motivation was to offer a quick way to roll back to non-libuv TCPStore backend since we were making libuv the default in `c10d` code. Now we think that it's better to have torch elastic to not realize the TCPStore backend type but rely on `c10d`'s mechanism to decide which backend to use for torch elastic as well. In this sense, the TCPStore backend type used by torch elastic will be identical to that in pytorch. PyTorch TCPStore uses the environment variable `USE_LIBUV` to determine the backend type: when `USE_LIBUV="0"`, the non-libuv backend will be used. when `USE_LIBUV="1"`, the libuv backend will be used. And this is the default option. Differential Revision: [D58259590](https://our.internmc.facebook.com/intern/diff/D58259590/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134882 Approved by: https://github.com/shuqiangzhang	2024-09-03 19:43:21 +00:00
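The `USE_LIBUV` rule stated in the TorchElastic commit above ("0" selects the non-libuv backend, "1" selects libuv, and libuv is the default) can be written down as a small helper. `use_libuv_backend` is a hypothetical function, not a PyTorch API, and rejecting other values is an assumption made here because the message only describes "0" and "1".

```python
import os


def use_libuv_backend(environ=None):
    """Decide the TCPStore backend from USE_LIBUV, as the commit describes."""
    environ = os.environ if environ is None else environ
    value = environ.get("USE_LIBUV", "1")   # unset means libuv (the default)
    if value not in ("0", "1"):
        # Assumption: the commit does not say how other values are treated.
        raise ValueError(f"unexpected USE_LIBUV value: {value!r}")
    return value == "1"


default_choice = use_libuv_backend({})
opted_out = use_libuv_backend({"USE_LIBUV": "0"})
opted_in = use_libuv_backend({"USE_LIBUV": "1"})
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test without touching the process environment.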
Nikita Shulga	71383dd3da	[MPS] Fix batchnorm_2d for channels last (#134618 ) By skipping gather of the input tensor if memory_layout is channels_last, which is a first step towards fixing https://github.com/pytorch/pytorch/issues/134580 Though the underlying problem is much more interesting, i.e. MPS does not have generic support for channels last, but `c10::is_contiguoius()` is true for channels last layout. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134618 Approved by: https://github.com/albanD	2024-09-03 19:20:11 +00:00
Tobias Ringwald	758d787901	Added complex support for `torch.logsumexp` (#133187 ) Added complex support for `torch.logsumexp`. Implemented complex backward pass for `torch.logsumexp`. Fixes #133047 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133187 Approved by: https://github.com/amjames, https://github.com/lezcano	2024-09-03 17:28:36 +00:00
Laith Sakka	6c3767452d	Move auto functionalize tests in their own test file (#134834 ) title + use `with torch.library._scoped_library as lib` when needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134834 Approved by: https://github.com/zou3519 ghstack dependencies: #134831	2024-09-03 17:09:03 +00:00
Haibo Chen	2e0b114c06	add a new Gauge API with an empty backend to PyTorch core (#134883 ) Summary: The current use case is to continuously measure the total allocated and reserved CUDA memory size from CUDACachingAllocator, and export their distribution (min, max, p90 etc) over time as timeseries. The current callback-based API does not work because the backend decides when the measurement is taken, so data points between two measurements may not be recorded. The distribution (e.g. max) as such will not be accurate. This new API otherwise closely follows the design of the existing WaitCounter API. This is not quite a synchronous version of DynamicCounter, as summing multiple data points does not make sense for my use case. Test Plan: CI Differential Revision: D61837528 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134883 Approved by: https://github.com/c-p-i-o	2024-09-03 17:08:47 +00:00
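The difference between a push-style gauge and a callback sampled by the backend, as argued in the Gauge commit above, can be illustrated with a toy model. The `Gauge` and `InMemoryBackend` classes below are hypothetical Python stand-ins, not the C++ API added by the PR, and the recorded values are made-up numbers.

```python
class InMemoryBackend:
    """Toy backend that keeps every data point it is given."""

    def __init__(self):
        self.values = []

    def record(self, value):
        self.values.append(value)


class Gauge:
    """Push-style gauge: the caller decides when a data point is recorded."""

    def __init__(self, name, backends=()):
        self.name = name
        self._backends = list(backends)   # empty list means record() is a no-op

    def record(self, value):
        for backend in self._backends:
            backend.record(value)


backend = InMemoryBackend()
gauge = Gauge("allocated_bytes", [backend])
for allocated in (100, 900, 200):         # made-up allocation sizes
    gauge.record(allocated)

# A backend that only polled at the end would have seen 200 and missed the peak.
observed_max = max(backend.values)

# With no backend registered, recording does nothing, like the empty backend.
Gauge("reserved_bytes").record(123)
```

Because every data point is pushed at the moment it occurs, the backend can compute an accurate min, max, or percentile over the full series.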
Nikita Shulga	7804c089c6	[BE] Update numpy version to 2.0.2 (#134875 ) It's long past time to abandon the pre-release version Partially addresses https://github.com/pytorch/pytorch/issues/134868 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134875 Approved by: https://github.com/justinchuby, https://github.com/clee2000, https://github.com/kit1980, https://github.com/atalman, https://github.com/Skylion007	2024-09-03 17:02:35 +00:00
Justin Chu	1b9f51bd88	[ONNX] Bump onnxscript version in CI; temporarily remove op test (#133748 ) Bump onnxscript version in CI to 0.1.0.dev20240831, and temporarily remove the fx consistency test. We will add a better version back later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133748 Approved by: https://github.com/titaiwangms	2024-09-03 16:30:07 +00:00
PyTorch MergeBot	27677ead7c	Revert "[ONNX] Bump onnxscript version in CI; temporarily remove op test (#133748 )" This reverts commit 6eed63c8b9c4f54a573bb51960d252cd42bfab0c. Reverted https://github.com/pytorch/pytorch/pull/133748 on behalf of https://github.com/ZainRizvi due to The version bump appears to be pulling in an unavailable numpy version? [GH job link](https://github.com/pytorch/pytorch/actions/runs/10686076754/job/29620426371) [HUD commit link](`6eed63c8b9`) ([comment](https://github.com/pytorch/pytorch/pull/133748#issuecomment-2326932868))	2024-09-03 16:19:47 +00:00
Edward Z. Yang	a258844a32	Properly handle empty CPUINFO variable (#134916 ) Fixes https://github.com/pytorch/pytorch/issues/134915 But I did not root cause why CPUINFO is totally empty to begin with... Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134916 Approved by: https://github.com/Skylion007	2024-09-03 15:59:59 +00:00
PyTorch MergeBot	f927bcb934	Revert "[Inductor] Apply loop split optimization in codegen_node (#132389 )" This reverts commit 3cb5d251224b3fb59b5a10c6fefbb4c84eb565a6. Reverted https://github.com/pytorch/pytorch/pull/132389 on behalf of https://github.com/ZainRizvi due to Hi, this seems to be breaking in trunk. See test_dataloader.py::TestDataLoader::test_segfault [GH job link](https://github.com/pytorch/pytorch/actions/runs/10660461216/job/29556282081) [HUD commit link](`de3a641476`) ([comment](https://github.com/pytorch/pytorch/pull/132389#issuecomment-2326843129))	2024-09-03 15:40:45 +00:00
Justin Chu	6eed63c8b9	[ONNX] Bump onnxscript version in CI; temporarily remove op test (#133748 ) Bump onnxscript version in CI to 0.1.0.dev20240831, and temporarily remove the fx consistency test. We will add a better version back later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133748 Approved by: https://github.com/titaiwangms	2024-09-03 15:33:09 +00:00
IvanKobzarev	33ba952e31	[subclasses] Do not fakeTensor const prop subclass args (#134855 ) The issue: Const propagation checks only that arguments do not contain FakeTensor. If an argument is a Subclass, it passes this condition. As a result, const propagation executes without FakeTensorMode, and tensor factories inside Subclass.__torch_dispatch__ produce a Tensor that is not fakified. Solution: If we have subclass arguments, do not consider const propagation doable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134855 Approved by: https://github.com/zou3519	2024-09-03 13:31:49 +00:00
Edward Z. Yang	2a49296d75	Fix set_unbacked_bindings when list of Tensors is returned (#133585 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/133585 Approved by: https://github.com/albanD	2024-09-03 12:23:31 +00:00
Feng Yuan	2443507acc	Update torch-xpu-ops pin (ATen XPU implementation) (#134983 ) Release cycle for PyTorch 2.5 1. Enable Windows build in latest torch-xpu-ops. Resolved large bin issue. 2. Refine test infrastructure for compatibility on different HW platforms. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134983 Approved by: https://github.com/EikanWang	2024-09-03 12:14:37 +00:00
Nikita Shulga	39935e0fde	Update cpuinfo submodule (#134891 ) Last time it was done in June by https://github.com/pytorch/pytorch/pull/127505 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134891 Approved by: https://github.com/Skylion007	2024-09-03 09:29:59 +00:00
chilli	23a2161ad1	Changed addmv to be a decomposition and not a fallback (#134823 ) Overall seems to be faster ![image](https://github.com/user-attachments/assets/0cbea76e-fb78-4634-9265-047de0291549) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134823 Approved by: https://github.com/jansel ghstack dependencies: #134813, #134818, #134819	2024-09-03 06:33:31 +00:00
chilli	9856bc50a2	Switch nanmedian to not cuda synchronize (#134819 ) Generally, this seems to be faster. ![image](https://github.com/user-attachments/assets/43a86c6f-236d-4ba1-aae0-14e3d88ae401) And as an added benefit, it works great with cudagraphs and such :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134819 Approved by: https://github.com/Skylion007, https://github.com/eqy ghstack dependencies: #134813, #134818	2024-09-03 06:33:31 +00:00
chilli	6fce1faa10	change multinomial to use async asserts instead of a synchronization (#134818 ) Fixes https://github.com/pytorch/pytorch/issues/134442 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134818 Approved by: https://github.com/ezyang ghstack dependencies: #134813	2024-09-03 06:33:24 +00:00
chilli	db193d1e29	add msg to _assert_async (#134813 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134813 Approved by: https://github.com/ezyang, https://github.com/eqy, https://github.com/albanD	2024-09-03 06:33:18 +00:00
leslie-fang-intel	d14fe3ffed	[Inductor][CPP] Turns on inline_inbuilt_nn_modules for CPP GEMM template testing (#132487 ) Summary The CPP GEMM template testing was skipped when `inline_inbuilt_nn_modules` was turned on, as described in https://github.com/pytorch/pytorch/issues/131929. Since https://github.com/pytorch/pytorch/pull/132334 has landed to fix the issues, turn this flag back on, since it is the default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132487 Approved by: https://github.com/anijain2305, https://github.com/jgong5	2024-09-03 05:05:50 +00:00
CaoE	a00fad0177	Add specializations for vectorized conversion between float and BF16/FP16 (#126500 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126500 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-09-03 02:09:12 +00:00
titaiwangms	45f11094b6	[ONNX] Delete `op_level_debug` from `torch.onnx.ExportOptions` (#134961 ) op_level_debug helped to identify missing operators and wrongly implemented operators at the time when the dynamo exporter relied on nearest matching and torchlib had just been created. However, now that the dispatcher logic has improved and torchlib has matured, we no longer need it. PS: the op-level-debug diagnostics rule is not deleted in this PR, as it auto-generates lint error code and needs more time to fix. We can delete it when we retire sarif. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134961 Approved by: https://github.com/justinchuby	2024-09-02 23:38:39 +00:00
Xuehai Pan	4c1dd13ba3	[BE] better type annotation for `torch.types` (#129559 ) Closes #129525 - #129525 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129559 Approved by: https://github.com/ezyang	2024-09-02 15:35:32 +00:00
Jonathan Wenger	76710d4f95	Corrected docstring of ``solve_triangular`` (#129766 ) Description The arguments docstring of [torch.linalg.solve_triangular](https://pytorch.org/docs/stable/generated/torch.linalg.solve_triangular.html#torch.linalg.solve_triangular) incorrectly describes the shape of the ``A`` argument if the option ``left=True``. The argument ``A`` should have shape $k \times k$ if ``left=False`` in line with the rest of the docstring and the implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129766 Approved by: https://github.com/lezcano	2024-09-02 13:30:30 +00:00
Edward Z. Yang	ee03530fd9	Add a test to avoid decorator based regression for cprofile traces (#133086 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/133086 Approved by: https://github.com/aorenste	2024-09-02 12:53:34 +00:00
FEI	16de25b1dc	fix tensor_repr(at::Tensor) (#134762 ) (#134764 ) Fixes #134762 @ezyang @antocuni Pull Request resolved: https://github.com/pytorch/pytorch/pull/134764 Approved by: https://github.com/ezyang Co-authored-by: Edward Z. Yang <ezyang@meta.com>	2024-09-02 06:05:10 +00:00
Blaine Burton Rister	3daca187aa	[Inductor] Allow customizing the padding format (#133939 ) Based on https://github.com/pytorch/pytorch/pull/130956. Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs: - When we pad, it is always aligned to the next multiple of 128 bytes. - Strides smaller than 1024 are not padded. - Only intermediate values are padded, not outputs. The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory alignment requirements, or for applications where performance is more important than conformity with eager mode. This PR surfaces padding parameters up to Inductor's config module, so the user can control them. - `config.pad_outputs`: choose whether to pad outputs (default: `False`) - `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`) - `config.padding_stride_threshold`: choose the smallest stride that we will pad. For example, setting this to 0 will pad all unaligned strides. (default: `1024`) Test plan Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations. These changes should not affect perf, because the defaults are identical to Inductor's current behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939 Approved by: https://github.com/shunting314 Co-authored-by: Yueming Hao <yhao@meta.com>	2024-09-02 05:56:33 +00:00
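The three padding options described in the entry above can be illustrated with a small stride-rounding helper. This is a minimal sketch under stated assumptions, not Inductor's actual implementation: `pad_stride` is a hypothetical name, strides are measured in elements, and the threshold comparison (strides below the threshold are left alone) follows the wording of the commit message rather than the source code.

```python
def pad_stride(stride, itemsize, alignment_bytes=128, stride_threshold=1024):
    """Round a stride (in elements) up to the next aligned multiple.

    Hypothetical helper for illustration only; the defaults mirror the
    defaults quoted in the commit message (128 bytes, threshold 1024).
    """
    # Convert the byte alignment into a number of elements of this dtype.
    align_elems = max(1, alignment_bytes // itemsize)
    # Assumption: strides below the threshold are never padded.
    if stride < stride_threshold:
        return stride
    # Ceiling division, then scale back up to a multiple of the alignment.
    return -(-stride // align_elems) * align_elems


# float32 (4 bytes per element) means 128 bytes is 32 elements.
print(pad_stride(1000, itemsize=4))                      # below the threshold
print(pad_stride(1030, itemsize=4))                      # rounded up to a multiple of 32
print(pad_stride(1000, itemsize=4, stride_threshold=0))  # threshold 0 pads every unaligned stride
```

Setting the threshold to 0 in the last call mirrors the commit's remark that doing so pads all unaligned strides.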
PyTorch UpdateBot	de3a641476	[executorch hash update] update the pinned executorch hash (#134914 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134914 Approved by: https://github.com/pytorchbot	2024-09-02 03:52:40 +00:00
Sun, Jiayi	3cb5d25122	[Inductor] Apply loop split optimization in codegen_node (#132389 ) This PR applies loop split optimization in codegen_node to avoid non-contiguous load. When the vector is loaded in a non-contiguous manner due to a division in the index, we eliminate the division by splitting the loop to avoid non-contiguous load. Example: ``` import torch import torch.nn as nn class GNReLU(torch.nn.Module): def __init__(self, num_groups, num_channels): super(GNReLU, self).__init__() self.gn = nn.GroupNorm(num_groups, num_channels) def forward(self, x): return torch.nn.functional.relu(self.gn(x)) input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last) m = GNReLU(32, 960).eval() compiled_m = torch.compile(m) with torch.no_grad(): compiled_m(input) ``` Generated code: - Before: ``` cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float', 'const float', 'const float', 'float', 'float', 'float'], ''' #include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2) { #pragma omp parallel num_threads(56) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L)); for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L)) { for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L)) { auto tmp0 = 
at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30Lx1) + (960Lx2) + (8847360Lx0)), 16); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0); } for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30Lx1) + (960Lx2) + (8847360Lx0)), 14); masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0); } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<long>(x1 + (32Lx0))] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<long>(x1 + (32Lx0))] = static_cast<float>(tmp_acc0.m2); } } } } { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L)) { for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960Lx1) + (8847360Lx0)), 16); auto tmp1 = [&] { __at_align__ std::array<float, 16> tmpbuf; #pragma GCC unroll 16 for (long x2_inner = 0; x2_inner < 16; x2_inner++) { tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32Lx0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))]; } return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16); } () ; auto tmp3 = [&] { __at_align__ std::array<float, 16> tmpbuf; #pragma GCC unroll 16 for (long x2_inner = 0; x2_inner < 16; x2_inner++) { tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32Lx0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))]; } return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16); } () ; auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16); auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 
16); auto tmp2 = tmp0 - tmp1; auto tmp4 = static_cast<float>(276480.0); auto tmp5 = at::vec::Vectorized<float>(tmp4); auto tmp6 = tmp3 / tmp5; auto tmp7 = static_cast<float>(1e-05); auto tmp8 = at::vec::Vectorized<float>(tmp7); auto tmp9 = tmp6 + tmp8; auto tmp10 = tmp9.rsqrt(); auto tmp11 = tmp2 * tmp10; auto tmp13 = tmp11 * tmp12; auto tmp15 = tmp13 + tmp14; auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0)); tmp16.store(out_ptr2 + static_cast<long>(x2 + (960Lx1) + (8847360Lx0))); } } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg2_1, = args args.clear() assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960)) buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32) cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3) del arg2_1 return (buf3, ) ``` - After: ``` cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float', 'const float', 'const float', 'float', 'float', 'float'], ''' #include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2) { #pragma omp parallel num_threads(56) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); static WeightRecp<at::vec::Vectorized<float>> 
wrecps0(static_cast<long>(17280L)); for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L)) { for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30Lx1) + (960Lx2) + (8847360Lx0)), 16); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0); } for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30Lx1) + (960Lx2) + (8847360Lx0)), 14); masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0); } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<long>(x1 + (32Lx0))] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<long>(x1 + (32Lx0))] = static_cast<float>(tmp_acc0.m2); } } } } { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L)) { #pragma GCC ivdep for(long x2=static_cast<long>(0L); x2<static_cast<long>(32L); x2+=static_cast<long>(1L)) { for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30Lx2) + (960Lx1) + (8847360Lx0)), 16); auto tmp1 = out_ptr0[static_cast<long>(x2 + (32Lx0))]; auto tmp4 = out_ptr1[static_cast<long>(x2 + (32Lx0))]; auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30Lx2)), 16); auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30Lx2)), 16); auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 - tmp2; auto tmp5 = static_cast<float>(276480.0); auto tmp6 = tmp4 / tmp5; 
auto tmp7 = static_cast<float>(1e-05); auto tmp8 = decltype(tmp6)(tmp6 + tmp7); auto tmp9 = 1 / std::sqrt(tmp8); auto tmp10 = at::vec::Vectorized<float>(tmp9); auto tmp11 = tmp3 tmp10; auto tmp13 = tmp11 * tmp12; auto tmp15 = tmp13 + tmp14; auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0)); tmp16.store(out_ptr2 + static_cast<long>(x3 + (30Lx2) + (960Lx1) + (8847360Lx0))); } for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30Lx2) + (960Lx1) + (8847360Lx0)), 14); auto tmp1 = out_ptr0[static_cast<long>(x2 + (32Lx0))]; auto tmp4 = out_ptr1[static_cast<long>(x2 + (32Lx0))]; auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30Lx2)), 14); auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30Lx2)), 14); auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 - tmp2; auto tmp5 = static_cast<float>(276480.0); auto tmp6 = tmp4 / tmp5; auto tmp7 = static_cast<float>(1e-05); auto tmp8 = decltype(tmp6)(tmp6 + tmp7); auto tmp9 = 1 / std::sqrt(tmp8); auto tmp10 = at::vec::Vectorized<float>(tmp9); auto tmp11 = tmp3 * tmp10; auto tmp13 = tmp11 * tmp12; auto tmp15 = tmp13 + tmp14; auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0)); tmp16.store(out_ptr2 + static_cast<long>(x3 + (30Lx2) + (960Lx1) + (8847360L*x0)), 14); } } } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg2_1, = args args.clear() assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960)) buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32) cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3) del arg2_1 return (buf3, ) ``` Pull Request resolved: 
https://github.com/pytorch/pytorch/pull/132389 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel Co-authored-by: Jiong Gong <jiong.gong@intel.com>	2024-09-02 00:28:34 +00:00
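The index transformation behind the loop split in the entry above can be checked in a few lines. This is a minimal sketch, assuming the GroupNorm example from the commit message (960 channels, 32 groups, so 30 channels per group); the function names are hypothetical and the code only models the indexing, not the vectorized kernel.

```python
NUM_CHANNELS = 960
NUM_GROUPS = 32
CHANNELS_PER_GROUP = NUM_CHANNELS // NUM_GROUPS  # 30


def group_index_with_division():
    """Single loop over channels: the group index needs a floor division."""
    return [(c, c // CHANNELS_PER_GROUP) for c in range(NUM_CHANNELS)]


def group_index_with_split_loop():
    """Split loop: the outer variable is the group index, so no division."""
    pairs = []
    for group in range(NUM_GROUPS):
        for inner in range(CHANNELS_PER_GROUP):
            # The channel offset is rebuilt as inner + 30 * group.
            pairs.append((inner + CHANNELS_PER_GROUP * group, group))
    return pairs


# Both loop shapes visit the same (channel, group) pairs in the same order.
assert group_index_with_division() == group_index_with_split_loop()
```

Because the group index is constant across the inner loop, the per-group statistics can be read once as scalars and broadcast, which is what replaces the per-lane gather in the generated code.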
Aaron Orenstein	c140fa1426	Reorg cache code to make it simpler (#134911 ) Summary: Pull the big nested function out of the middle of cached_autotune() into its own class. Also refactor creating the autotune cache itself out - which gets shared in the next diff. Test Plan: unit tests Differential Revision: D60677501 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134911 Approved by: https://github.com/oulgen	2024-09-02 00:27:40 +00:00
Edward Z. Yang	0cbcef12bd	Stop adding useless prefix to error message here, you're pushing the important info off the screen. (#133108 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/133108 Approved by: https://github.com/Skylion007	2024-09-01 23:11:17 +00:00
Edward Z. Yang	208442ea18	Don't setup try-except handler when Dynamo compiling (#133239 ) The reraise is not supported and so this just gunks up our actual exception handling. You can trigger this by hitting an exception inside of an NN module that has hooks on it. You end up graph breaking on the reraise here, and losing the inner stack trace from the actual exception that was raised. This might be kind of controversial. An alternate strategy is to support reraises in Dynamo or something but IDK this doesn't feel like the right place to apply force. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/133239 Approved by: https://github.com/anijain2305	2024-09-01 22:26:46 +00:00
Edward Z. Yang	ea01aec8b1	Move FunctionSchema implementations to cpp file (#133856 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/133856 Approved by: https://github.com/bdhirsh, https://github.com/albanD	2024-09-01 19:50:35 +00:00
Oguz Ulgen	2dadc2c8fc	Log fx graph cache bypass reasons (#134792 ) Summary: Let's track when we bypass and why Test Plan: unit tests Differential Revision: D61994739 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134792 Approved by: https://github.com/jamesjwu	2024-09-01 19:02:09 +00:00
cyy	1595e755af	[Reland] [Torchgen] Pass mutable to cpp.valuetype_type (#134549 ) Reland of #121415 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134549 Approved by: https://github.com/ezyang	2024-09-01 15:15:38 +00:00
eqy	b1a00b7b6d	Abate `-Wsign-compare` warning spam in `Indexing.cu` (#134805 ) Fix for warning spam like ``` warning: comparison of integer expressions of different signedness: ‘long int’ and ‘uint64_t’ {aka ‘long unsigned int’} [-Wsign-compare] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134805 Approved by: https://github.com/janeyx99	2024-09-01 10:48:07 +00:00
cyy	d03f767cae	Check function declarations of Vulkan code (#134550 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134550 Approved by: https://github.com/ezyang	2024-09-01 09:38:35 +00:00
Natalia Gimelshein	c25b64a057	expose host_emptyCache to python, fix a bug in freeing cudaHostRegistered memory (#134919 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134919 Approved by: https://github.com/eqy	2024-09-01 09:07:25 +00:00
Manuel Candales	caa04e0cae	[ET] codegen: bool array as array ref (#134886 ) Test Plan: CI Differential Revision: D62046959 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134886 Approved by: https://github.com/larryliu0820	2024-09-01 01:33:43 +00:00
Natalia Gimelshein	29b7852dc1	drop gil in couple places (leads to deadlocks) (#134910 ) Per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/134910 Approved by: https://github.com/eqy	2024-09-01 00:05:53 +00:00
Aaron Orenstein	7239b8a4f1	Clean up RemoteCache classes (#134032 ) Summary: The existing RemoteCacheBackend classes were a bit haphazard - some of them accepted bytes only, some accepted objects, some returned different types of objects than were passed in. Update them to be more consistent: 1. RemoteCacheBackend is an implementation of a backend: Redis, Memcache, Manifold, LocalFile 2. RemoteCacheSerde is an implementation of a serde protocol - to turn structured objects (dict, list, etc) into bytes: RemoteCacheJsonSerde (json encoding), RemoteCachePassthroughSerde (strictly bytes only) 3. RemoteCache is the cache implementation itself, mixing a RemoteCacheBackend along with a RemoteCacheSerde to provide structured caching. Other than simply reorganizing the existing cache code, this also fixes the Redis autotune caching for OSS. Test Plan: unit tests Reviewed By: oulgen Differential Revision: D61178859 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134032 Approved by: https://github.com/oulgen, https://github.com/bhack	2024-08-31 20:18:59 +00:00
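The backend / serde / cache split described in the entry above can be sketched as three small roles. This is an illustrative sketch only: `InMemoryBackend`, `JsonSerde`, `PassthroughSerde`, and `StructuredCache` are hypothetical names chosen for this example and are not the PyTorch classes, whose signatures are not shown in the commit message.

```python
import json


class InMemoryBackend:
    """Backend role: stores and returns raw bytes, nothing else."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def put(self, key, data):
        if not isinstance(data, bytes):
            raise TypeError("backend accepts bytes only")
        self._store[key] = data


class JsonSerde:
    """Serde role: structured objects <-> bytes via JSON."""

    def encode(self, value):
        return json.dumps(value).encode("utf-8")

    def decode(self, data):
        return json.loads(data.decode("utf-8"))


class PassthroughSerde:
    """Serde role for callers that already hold bytes."""

    def encode(self, value):
        if not isinstance(value, bytes):
            raise TypeError("passthrough serde accepts bytes only")
        return value

    def decode(self, data):
        return data


class StructuredCache:
    """Cache role: combines one backend with one serde."""

    def __init__(self, backend, serde):
        self.backend = backend
        self.serde = serde

    def get(self, key):
        data = self.backend.get(key)
        return None if data is None else self.serde.decode(data)

    def put(self, key, value):
        self.backend.put(key, self.serde.encode(value))


cache = StructuredCache(InMemoryBackend(), JsonSerde())
cache.put("autotune-key", {"best_config": [64, 64, 32]})
print(cache.get("autotune-key"))
```

Keeping the backend bytes-only means any backend can be paired with any serde, which is the consistency the commit message describes.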
Xu Han	590d96be64	[inductor] move test_fuse_large_params to slow test. (#134900 ) Move `test_fuse_large_params` to slow test. This case takes about 1.5 minutes. <img width="855" alt="image" src="https://github.com/user-attachments/assets/adf16dcf-d398-4d66-8dda-0c9cafc4e351"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134900 Approved by: https://github.com/jansel	2024-08-31 18:08:11 +00:00
haozhe.zhu	f4641ca481	[Inductor] Remove VecChecker and fallback non-supported Vec op to Scalar impl with a for loop (#134569 ) Fall back non-vectorized ops to the scalar impl plus a for loop. Example code: ``` cpp_fused_igammac_0 = async_compile.cpp_pybinding(['const double*', 'const double*', 'double*'], ''' #include "/tmp/torchinductor_root/z4/cz4j2mmotlx3z2b7u4fbjtdt4x6plhd67ljwzg5bk7ekv4xz6y7q.h" extern "C" void kernel(const double* in_ptr0, const double* in_ptr1, double* out_ptr0) { { for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(48L); x0+=static_cast<int64_t>(8L)) { auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), 8); auto tmp1 = in_ptr1[static_cast<int64_t>(0L)]; auto tmp2 = at::vec::VectorizedN<double,2>(tmp1); auto tmp3 = [&]() { __at_align__ std::array<double, 8> tmpbuf0; tmp0.store(tmpbuf0.data(), 8); __at_align__ std::array<double, 8> tmpbuf1; tmp2.store(tmpbuf1.data(), 8); __at_align__ std::array<double, 8> tmpbuf_out; for (int i = 0; i < 8; i++) { tmpbuf_out[i] = calc_igammac(tmpbuf0[i], tmpbuf1[i]); } return at::vec::VectorizedN<double, 2>::loadu(tmpbuf_out.data(), 8); } () ; tmp3.store(out_ptr0 + static_cast<int64_t>(x0), 8); } #pragma omp simd simdlen(4) for(int64_t x0=static_cast<int64_t>(48L); x0<static_cast<int64_t>(50L); x0+=static_cast<int64_t>(1L)) { auto tmp0 = in_ptr0[static_cast<int64_t>(x0)]; auto tmp1 = in_ptr1[static_cast<int64_t>(0L)]; auto tmp2 = calc_igammac(tmp0, tmp1); out_ptr0[static_cast<int64_t>(x0)] = tmp2; } } } ''') ``` `frexp` is difficult to handle with the common `fallback` since it returns two `cse_var`s `2ba60a1618/torch/_inductor/codegen/cpp.py (L752-L766)` So we added a special function to do that. 
``` cpp_fused_frexp_0 = async_compile.cpp_pybinding(['const double*', 'double*', 'int32_t*'], ''' #include "/tmp/torchinductor_root/z4/cz4j2mmotlx3z2b7u4fbjtdt4x6plhd67ljwzg5bk7ekv4xz6y7q.h" extern "C" void kernel(const double* in_ptr0, double* out_ptr0, int32_t* out_ptr1) { { for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(16L); x0+=static_cast<int64_t>(8L)) { auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), 8); at::vec::Vectorized<int32_t> tmp1; at::vec::VectorizedN<double, 2> tmp2; [&]() { __at_align__ std::array<double, 8> tmpbuf; tmp0.store(tmpbuf.data(), 8); __at_align__ std::array<int32_t, 8> tmpbuf_exponent; __at_align__ std::array<double, 8> tmpbuf_mantissa; for (int i = 0; i < 8; i++) { tmpbuf_mantissa[i] = std::frexp(tmpbuf[i], &tmpbuf_exponent[i]); } tmp1 = at::vec::Vectorized<int32_t>::loadu(tmpbuf_exponent.data(), 8); tmp2 = at::vec::VectorizedN<double, 2>::loadu(tmpbuf_mantissa.data(), 8); } (); tmp2.store(out_ptr0 + static_cast<int64_t>(x0), 8); tmp1.store(out_ptr1 + static_cast<int64_t>(x0), 8); } #pragma omp simd simdlen(4) for(int64_t x0=static_cast<int64_t>(16L); x0<static_cast<int64_t>(20L); x0+=static_cast<int64_t>(1L)) { auto tmp0 = in_ptr0[static_cast<int64_t>(x0)]; int32_t tmp1; auto tmp2 = std::frexp(tmp0, &tmp1); out_ptr0[static_cast<int64_t>(x0)] = tmp2; out_ptr1[static_cast<int64_t>(x0)] = tmp1; } } } ''') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134569 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-08-31 11:19:57 +00:00
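The control flow of the scalar fallback in the entry above (a main loop over fixed-width blocks, a per-lane scalar call inside each block, then a scalar tail loop) can be modeled without any SIMD types. This is a minimal sketch in plain Python, assuming a block width of 8 as in the generated code; `frexp_blocked` is a hypothetical name and the lists stand in for the aligned temporary buffers.

```python
import math

VEC_WIDTH = 8  # lanes per block, matching the generated example


def frexp_blocked(values):
    """Apply frexp block by block, then finish the tail one element at a time."""
    n = len(values)
    mantissas = [0.0] * n
    exponents = [0] * n
    main_end = n - n % VEC_WIDTH

    # Main loop: "store" one block to a buffer and call the scalar op per lane.
    for start in range(0, main_end, VEC_WIDTH):
        lane_buf = values[start:start + VEC_WIDTH]
        for lane, value in enumerate(lane_buf):
            mantissas[start + lane], exponents[start + lane] = math.frexp(value)

    # Tail loop: the leftover elements that do not fill a whole block.
    for idx in range(main_end, n):
        mantissas[idx], exponents[idx] = math.frexp(values[idx])

    return mantissas, exponents


values = [float(i) for i in range(1, 21)]  # 20 elements: two blocks plus a tail of 4
mantissas, exponents = frexp_blocked(values)
print(mantissas[7], exponents[7])
```

Returning two lists per block is the Python analogue of the two output vectors (mantissa and exponent) that make `frexp` awkward for a single-result fallback path.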
Michael Lazos	16f119e62a	Update compiled optimizer tests for tensor betas (#134169 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134169 Approved by: https://github.com/anijain2305, https://github.com/eellison ghstack dependencies: #134166, #134167, #134168	2024-08-31 10:24:39 +00:00
Michael Lazos	4e71418566	[dynamo] rewrite addcmul_ to remove graph break (#134168 ) Context: Adding support for the beta parameters to be tensors Details: Similarly to the previous two PRs, addcmul_ is used with the tensor betas as the value argument. When this occurs, an item() call is invoked in the aten op. To avoid the resulting graph break, addcmul_ is decomposed into its constituent ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134168 Approved by: https://github.com/anijain2305 ghstack dependencies: #134166, #134167	2024-08-31 10:24:39 +00:00
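The arithmetic behind an addcmul decomposition can be written out directly. This sketch only illustrates the identity `out = x + value * (t1 * t2)` on plain Python lists; `addcmul_decomposed` is a hypothetical name, and the exact sequence of ops Dynamo emits is not shown in the commit message.

```python
def addcmul_decomposed(x, t1, t2, value):
    """Compute x + value * (t1 * t2) elementwise from separate mul and add steps."""
    product = [a * b for a, b in zip(t1, t2)]    # elementwise multiply
    scaled = [value * p for p in product]        # scale by the value argument
    return [xi + s for xi, s in zip(x, scaled)]  # elementwise add


print(addcmul_decomposed([1.0, 2.0], [2.0, 3.0], [4.0, 5.0], value=0.5))
```

Because `value` only ever appears as an operand of an elementwise multiply, it could stay a scalar tensor in a tensor library instead of being extracted to a Python number.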
Michael Lazos	3fb4c6bc38	[dynamo] Rewrite foreach pow to broadcast scalar argument (#134167 ) Context: Adding support for the beta parameters to be tensors Details: In this PR, similarly to the previous one, foreach_pow calls item() on the first argument when it is a scalar tensor. In this case, we broadcast that scalar tensor into a list of aliases of that tensor to avoid the item() call, and this results in a device copy of the scalar tensor. Once again, I don't think we can change the foreach_pow API due to BC concerns, so this op rewrite allows us to avoid a graph break, generate semantically the same code, and not affect eager. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134167 Approved by: https://github.com/anijain2305 ghstack dependencies: #134166	2024-08-31 10:24:35 +00:00
Michael Lazos	471c33f007	[dynamo] Rewrite foreach_lerp to avoid aten item call (#134166 ) Context: Adding support for the beta parameters to be tensors Details: In order to add support for the beta params to be tensors without graph breaks in the Adam family of optimizers it is necessary to support foreach_lerp(x, y, s) where s is a scalar tensor. Today, this isn't possible because when `s` is a scalar, internally the aten op calls item() on it to extract the value and distribute it to each of the ops on the individual list indices. To support this in dynamo without graph breaks, I decompose the lerp into its constituent ops which support a scalar tensor in the list argument positions which do not result in an item() call. To be clear the item() call is more performant for eager I think and for BC I don't think we can modify that API, so this allows us to have performance in eager and no graph breaks in compile. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134166 Approved by: https://github.com/anijain2305	2024-08-31 10:24:31 +00:00
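The lerp decomposition mentioned in the entry above rests on the identity `lerp(x, y, s) = x + s * (y - x)`, applied to every tensor in the list. The sketch below shows that identity on lists of Python lists; `foreach_lerp_decomposed` is a hypothetical name, and the real rewrite operates on tensors inside Dynamo rather than on Python floats.

```python
def foreach_lerp_decomposed(xs, ys, weight):
    """Apply x + weight * (y - x) elementwise to each (x, y) pair of lists."""
    results = []
    for x, y in zip(xs, ys):
        diff = [b - a for a, b in zip(x, y)]               # y - x
        scaled = [weight * d for d in diff]                # weight * (y - x)
        results.append([a + s for a, s in zip(x, scaled)]) # x + weight * (y - x)
    return results


xs = [[0.0, 10.0], [4.0]]
ys = [[10.0, 20.0], [8.0]]
print(foreach_lerp_decomposed(xs, ys, weight=0.25))
```

In an Adam-style update the weight would be `1 - beta1`, so a tensor-valued beta flows through subtract, multiply, and add without its Python value ever being read.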
Xuehai Pan	eed0d76682	[dynamo][itertools] refactor `itertools.islice` to use polyfill (#133876 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133876 Approved by: https://github.com/jansel ghstack dependencies: #133864, #133894	2024-08-31 10:08:07 +00:00
Xuehai Pan	ec660c383e	[dynamo] reduce overhead for `PolyfilledFunctionVariable.call_function` (#134842 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134842 Approved by: https://github.com/jansel	2024-08-31 09:12:46 +00:00
cyyever	d9cc693719	[jit] Change argument names (#134828 ) It seems like a bug. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134828 Approved by: https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2024-08-31 08:42:30 +00:00
Xu Han	136badae64	[inductor] preload icx built in math libs (#134870 ) The Intel compiler implements more math libraries than clang, for performance purposes. We need to preload them, as we do for the OpenMP library. reproduce UT: ```cmd pytest test/inductor/test_cpu_cpp_wrapper.py -v -k test_silu_cpu_dynamic_shapes_cpp_wrapper ``` Module dependencies: <img width="804" alt="Image" src="https://github.com/user-attachments/assets/9a672e03-ebf5-4ebb-b182-09180e6f7841"> Local test pass: <img width="857" alt="image" src="https://github.com/user-attachments/assets/afbb8c1c-8fcc-4d64-a3ad-c8521b137d2d"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134870 Approved by: https://github.com/jansel	2024-08-31 04:50:31 +00:00
Yanbo Liang	090d9cf410	[Dynamo][autograd.Function][vmap] support torch._C._are_functorch_transforms_active (#134889 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134889 Approved by: https://github.com/jansel	2024-08-31 04:39:09 +00:00
PyTorch UpdateBot	34b85d301f	[executorch hash update] update the pinned executorch hash (#134894 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134894 Approved by: https://github.com/pytorchbot	2024-08-31 04:16:41 +00:00
Alex Baden	64fad53b50	[Inductor] Support passing module map parameter to Triton make_ir API (#134774 ) In https://github.com/triton-lang/triton/pull/4539 the `make_ir` API was modified to accept a new `module_map` parameter. Update the Inductor callsite accordingly, preserving backwards compatibility following the existing code. Fixes #134674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134774 Approved by: https://github.com/EikanWang, https://github.com/zou3519, https://github.com/jansel	2024-08-31 03:38:08 +00:00
Eddie Yan	aef5da50f4	Cleanup unused `pytorch.version` (#134381 ) This file doesn't seem to be used anywhere? checking CI... Pull Request resolved: https://github.com/pytorch/pytorch/pull/134381 Approved by: https://github.com/zou3519	2024-08-31 02:50:05 +00:00
PyTorch MergeBot	86e03a64e1	Revert "[Inductor] Allow customizing the padding format (#133939 )" This reverts commit 8b258b3b14408986a1d4142cff5a153c798ceecc. Reverted https://github.com/pytorch/pytorch/pull/133939 on behalf of https://github.com/ZainRizvi due to sorry but this PR is causing issues with diff train imports reverting it for now but it can be merged back in as-is ([comment](https://github.com/pytorch/pytorch/pull/133939#issuecomment-2322635388))	2024-08-31 00:38:30 +00:00
Nikita Shulga	f95085fd91	[BE][MPS] Prefer xfail to skip (#134858 ) This essentially undoes large skips on everything but MacOS Sequoia to nn.modules made by https://github.com/pytorch/pytorch/pull/128393 Instead it uses existing `xfail`, but guards it on `_macos15_or_newer` boolean Before the change if run on MacOS 14: ``` % python3 ../test/test_modules.py -v -k Hardswish 2>&1\|tail -n3 Ran 57 tests in 0.053s OK (skipped=32) ``` After ``` % python3 ../test/test_modules.py -v -k Hardswish 2>&1\|tail -n3 Ran 57 tests in 0.229s OK (skipped=10, expected failures=2) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134858 Approved by: https://github.com/janeyx99	2024-08-31 00:29:48 +00:00
Yiming Zhou	050ad925f3	[benchmark] Add to torchbench relative path search (#134871 ) Add to the relative path search in benchmark. This enables users to run `torchbench.py` inside the `pytorch/benchmark/dynamo` folder when the `torchbench` repo is cloned at the same level as `pytorch`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134871 Approved by: https://github.com/FindHao	2024-08-31 00:28:22 +00:00
Xuehai Pan	a854c3a25e	[dynamo] refactor `builtins.enumerate` to use polyfill (#133894 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133894 Approved by: https://github.com/jansel ghstack dependencies: #133864	2024-08-31 00:17:27 +00:00
Xuehai Pan	ebbdeeede1	[dynamo][itertools] refactor `itertools.chain` and `itertools.chain.from_iterable` to use polyfills (#133864 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133864 Approved by: https://github.com/jansel	2024-08-31 00:11:54 +00:00
Yichen Yan	5dad6a5a84	[ONNX][DORT] Lazy-import `onnxruntime` (#134662 ) Currently, if installed, `onnxruntime` will be imported when importing `torch._inductor` (which will be imported by some other library, e.g. transformer-engine): ``` /mnt/c.py(53)<module>() -> from torch._inductor.utils import maybe_profile /usr/local/lib/python3.10/site-packages/torch/_inductor/utils.py(49)<module>() -> import torch._export /usr/local/lib/python3.10/site-packages/torch/_export/__init__.py(25)<module>() -> import torch._dynamo /usr/local/lib/python3.10/site-packages/torch/_dynamo/__init__.py(2)<module>() -> from . import convert_frame, eval_frame, resume_execution /usr/local/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py(48)<module>() -> from . import config, exc, trace_rules /usr/local/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py(52)<module>() -> from .variables import ( /usr/local/lib/python3.10/site-packages/torch/_dynamo/variables/__init__.py(38)<module>() -> from .higher_order_ops import ( /usr/local/lib/python3.10/site-packages/torch/_dynamo/variables/higher_order_ops.py(14)<module>() -> import torch.onnx.operators /usr/local/lib/python3.10/site-packages/torch/onnx/__init__.py(62)<module>() -> from ._internal.onnxruntime import ( /usr/local/lib/python3.10/site-packages/torch/onnx/_internal/onnxruntime.py(37)<module>() -> import onnxruntime # type: ignore[import] ``` This issue breaks generated triton kernel because it imported torch, and unexpected runtime libraries as well. I've also added a test for this specific case under `test/onnx`, perhaps we should add more somewhere else? Related issue: https://github.com/huggingface/accelerate/pull/3056 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134662 Approved by: https://github.com/justinchuby	2024-08-31 00:06:28 +00:00
Ratnam Parikh	2384f77d76	[XPU] Fix Windows XPU build (#134276 ) Linker flag check doesn't work correctly with MSVC and linking torch_xpu with torch_cpu_library for windows MSVC works without any errors Pull Request resolved: https://github.com/pytorch/pytorch/pull/134276 Approved by: https://github.com/EikanWang, https://github.com/atalman	2024-08-30 23:51:40 +00:00
Yanbo Liang	e688b78791	[Dynamo][autograd.Function] Trace fwd graph under no_grad mode (#134872 ) Fixes #134820 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134872 Approved by: https://github.com/zou3519	2024-08-30 22:24:18 +00:00
Blaine Burton Rister	8b258b3b14	[Inductor] Allow customizing the padding format (#133939 ) Based on https://github.com/pytorch/pytorch/pull/130956. Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs: - When we pad, it is always aligned to the next multiple of 128 bytes. - Strides smaller than 1024 are not padded. - Only intermediate values are padded, not outputs. The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory alignment requirements, or for applications where performance is more important than conformity with eager mode. This PR surfaces padding parameters up to Inductor's config module, so the user can control them. - `config.pad_outputs`: choose whether to pad outputs (default: `False`) - `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`) - `config.padding_stride_threshold`: choose the smallest stride that we will pad. For example, setting this to 0 will pad all unaligned strides. (default: `1024`) Test plan Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations. These changes should not affect perf, because the defaults are identical to Inductor's current behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939 Approved by: https://github.com/shunting314 Co-authored-by: Yueming Hao <yhao@meta.com>	2024-08-30 20:34:11 +00:00
PyTorch MergeBot	a1ba8e61d1	Revert "[ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438 )" This reverts commit 5e8bf29148a590318f678620f84be8f4d5ffff5c. Reverted https://github.com/pytorch/pytorch/pull/133438 on behalf of https://github.com/ZainRizvi due to This still breaks linux binary builds. Added the appropriate labels to ensure tests can pass. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10626427003/job/29460479554) [HUD commit link](`5e8bf29148`) ([comment](https://github.com/pytorch/pytorch/pull/133438#issuecomment-2322246198))	2024-08-30 20:00:41 +00:00
qchip	f6398eb0fa	dynamic shapes for combo_kernel/foreach_kernel (#134477 ) This PR adds dynamic shapes support to foreach and combo kernels for horizontal fusion. A flag `combo_kernel_foreach_dynamic_shapes` (default False to avoid disturbing production workflows) is added to _inductor/config.py. Setting it to True enables automatic dynamic shapes for foreach kernels. It is always enabled for combo kernel cases. Added unit test cases. This PR also fixes a flaky test case for [T198833257](https://www.internalfb.com/intern/tasks/?t=198833257) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134477 Approved by: https://github.com/mlazos	2024-08-30 19:58:20 +00:00
Wouter Devriendt	db17a9898d	regenerate ci workflows for binary builds with new g4dn runners (#133404 ) Fixes #103104 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133404 Approved by: https://github.com/ZainRizvi	2024-08-30 19:53:22 +00:00
Gabriel Ferns	98b813d0d4	Enable cudagraphs in cpp wrapper (#133885 ) Fixes https://github.com/pytorch/pytorch/issues/130878 Summary: Enables cudagraphs in cpp wrapper by clearing inputs. Generated, non-cpp wrapper code: ```python def call(args): arg0_1, = args args.clear() assert_size_stride(arg0_1, (10, ), (1, )) with torch.cuda._DeviceGuard(0): torch.cuda.set_device(0) buf0 = empty_strided_cuda((10, ), (1, ), torch.float32) # Topologically Sorted Source Nodes: [sin], Original ATen: [aten.sin] stream0 = get_raw_stream(0) triton_poi_fused_sin_0.run(arg0_1, buf0, 10, grid=grid(10), stream=stream0) del arg0_1 return (buf0, ) ``` vs generated cpp wrapper code: ```python def _wrap_func(f): def g(args): input_tensors = [arg if isinstance(arg, torch.Tensor) else torch.tensor(arg) for arg in args] input_handles = torch._C._aoti.unsafe_alloc_void_ptrs_from_tensors(input_tensors) # new: args.clear() # end new output_handles = f(input_handles) output_tensors = torch._C._aoti.alloc_tensors_by_stealing_from_void_ptrs(output_handles) return output_tensors return g ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133885 Approved by: https://github.com/eellison, https://github.com/desertfire	2024-08-30 18:48:37 +00:00
fduwjj	bdfa94b787	[RFC] Make fr trace script a console scripts (#134729 ) We want to make fr analyzer script a command after users `pip install torch`, that's why we want to mimic what torchrun is doing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134729 Approved by: https://github.com/c-p-i-o, https://github.com/malfet ghstack dependencies: #134528, #134780	2024-08-30 18:17:06 +00:00
Andrew Gu	a0d0c6b7e6	Used `torch.equal` in `test_foreach_copy_with_multi_dtypes` (#134861 ) `self.assertEqual` allows some tolerance, but here, we want to show that `_foreach_copy_` gives bitwise equivalent results. Let us use `torch.equal` then. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134861 Approved by: https://github.com/Skylion007, https://github.com/janeyx99, https://github.com/crcrpar	2024-08-30 18:04:41 +00:00
fduwjj	1993a2aa9e	[FR] Make pg_name unique, show P2P collective status and fix bugs when running the script as command (#134780 ) Fixes a bunch of bugs in the script when running with the generated command and 3D parallel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134780 Approved by: https://github.com/c-p-i-o ghstack dependencies: #134528	2024-08-30 18:03:17 +00:00
Xu Han	15f5a4858b	[inductor] enable Intel Compiler(icx-cl) for inductor windows (#134772 ) This PR enables the Intel Compiler (`icx-cl`) for Windows inductor, like the previous PR https://github.com/pytorch/pytorch/pull/134444, which enabled clang. Changes: 1. Fix an icx-cl crash caused by wrong decode args; the correct encoding is "utf-8". 2. Add an Intel compiler check and an Intel compiler Windows driver check (icx-cl). 3. Add Intel compiler OpenMP args config. 4. Add Intel compiler OpenMP binary preload. For the Intel compiler OpenMP binary path: <img width="788" alt="image" src="https://github.com/user-attachments/assets/54c76356-018d-4bef-a9b7-0ea150fd7aba"> For performance, the Intel compiler (`icx-cl`) performs much better than MSVC (`cl`): <img width="875" alt="image" src="https://github.com/user-attachments/assets/67865faf-b1de-4535-917a-486b72527204"> Append `clang-cl` performance data: <img width="821" alt="image" src="https://github.com/user-attachments/assets/476f4568-bf58-457f-b73d-4e57f49be384"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134772 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-08-30 17:51:46 +00:00
David Berard	9e0ddc0e14	[inductor] don't allow triton config pre_hook (#134633 ) The caching autotuner caches triton configs, and it doesn't try to hash or save the pre_hook from the config if it exists. If we had a config that had a pre_hook, then we might autotune -> save the config (without the pre_hook) -> later load the saved config and try to run it, but this time without the pre_hook. So this PR adds an assert and deletes the pre_hook handling. We can be confident that we didn't have functional pre_hooks, because the pre_hook handling tries to use `self.arg_name`, which doesn't exist. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134633 Approved by: https://github.com/shunting314, https://github.com/jansel	2024-08-30 17:39:37 +00:00
Masaki Kozuki	e21d7b77ce	Update `ForeachfuncInfo.sample_inputs_func` to yield scalars & scalarlists that are more friendly to test_meta (#134552 ) for `test_meta.py` to see more "PASSED" instead of "XFAIL". `pytest test_meta.py -k "_foreach_"` ran 6400 test cases and: - This PR: 4702 passed, 260 skipped, 73732 deselected, 1698 xfailed - main (92c4771853892193d73d87bd60eca4dc7efc51d8): 3906 passed, 260 skipped, 73732 deselected, 2494 xfailed Pull Request resolved: https://github.com/pytorch/pytorch/pull/134552 Approved by: https://github.com/janeyx99	2024-08-30 17:30:50 +00:00
Animesh Jain	577a93514f	[dynamo][dynamic][heuristic] Mark tuple getitem integers as static (#134734 ) Fixes issue seen in https://github.com/pytorch/pytorch/issues/132872#issuecomment-2314574656 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134734 Approved by: https://github.com/jansel ghstack dependencies: #134653, #134713	2024-08-30 17:06:57 +00:00
Yifu Wang	08184aa85c	Add support for 32KB multi_tensor_apply kernel arguments (#134373 ) ## Benchmark On H100 SXM (HBM2e, 500W TDP), CUDA Toolkit=12.2, Driver Version=535.154.05, with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa) (`torch._foreach_copy_`): Baseline ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmp0g_x4sys device ms: 0.891, cpu ms: 7.200 memory bandwidth: 1457.727 GB/s ``` Single iteration trace: <img width="1432" alt="image" src="https://github.com/user-attachments/assets/8ef54365-0265-4281-a0f0-d4c2f448300e"> This PR ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmp3jqiugli device ms: 0.683, cpu ms: 6.745 memory bandwidth: 1902.010 GB/s ``` Single iteration trace: <img width="1074" alt="image" src="https://github.com/user-attachments/assets/e52acad1-d09b-492c-9611-6d69e339f3ac"> ## Binary Size and Kernel Specialization The binary size for `libtorch_cuda.so` increased by 6MB (243MB -> 249MB). ``` // NOTE: [32KB kernel argument size support] // 32KB kernel argument size support has three requirements: // - CUDART_VERSION >= 12010 // - Driver version >= 530 // - GPU arch >= VOLTA // // Due to minor version compatibility, it is possible for binaries built with // CUDART_VERSION >= 12010 to run with driver version < 530. Since driver // version can only be checked at runtime, if CUDART_VERSION >= 12010, we have // to build both 4KB and 32KB kernels and determine the appropriate kernel to // dispatch at runtime. // // - If CUDART_VERSION < 12010, only 4KB kernels will be instantiated. // // - If CUDART_VERSION >= 12010: // - Host code: // - We always instantiate the launching stub for both 4KB and 32KB kernels. 
// - Device code: // - If __CUDA_ARCH__ >= 700, we always instantiate both 4KB and 32KB // kernels. // - If __CUDA_ARCH__ < 700, it's not possible to even compile an empty // 32KB kernel (formal parameter space overflowed). Thus, we only // instantiate a declaration for 32KB kernels. This is valid as long as the // declaration-only kernel is not launched. // // - At runtime, we dispatch to the 32KB kernel if driver version >= 530 and // GPU arch >= VOLTA. // // - TODO(yifu): once there's a CUDART version that is not compatible with any // driver version below 530, we can determine at compile time to not compile // the kernels for 4KB kernel argument size. // // https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134373 Approved by: https://github.com/eqy, https://github.com/crcrpar, https://github.com/janeyx99	2024-08-30 16:52:28 +00:00
Zhengxu Chen	a19a7524f6	[export] Make sure getitem replacement are synced with module call graph. (#134830 ) Summary: When we are placing nodes in the graph, we should also replace the references in module_call_graph. Test Plan: buck2 run 'fbcode//mode/opt' torchrec/fb/ir/tests:test_serializer -- --filter-regex test_serialize_deserialize_vlea buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_serialize_empty_value_vlea' --run-disabled buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_deserialized_device_vle' --run-disabled Differential Revision: D62014035 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134830 Approved by: https://github.com/angelayi	2024-08-30 16:47:05 +00:00
Laith Sakka	f5b0caee71	Rewrite `unsafe_remove_auto_functionalized_pass` using `decompose_auto_functionalized` (#134831 ) `unsafe_remove_auto_functionalized_pass` can be written using `decompose_auto_functionalized`. This way we do not have to update it each time we change `auto_functionalize` (e.g. https://github.com/pytorch/pytorch/pull/134409), and we avoid duplicate logic implemented in two different ways. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134831 Approved by: https://github.com/zou3519	2024-08-30 16:27:53 +00:00
PyTorch MergeBot	351ba3e67c	Revert "[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931 )" This reverts commit 65864d01341d006955579b145f78547314ceb14b. Reverted https://github.com/pytorch/pytorch/pull/132931 on behalf of https://github.com/ZainRizvi due to This PR is breaking builds internally due to the removal of ProcessGroup::Options ([comment](https://github.com/pytorch/pytorch/pull/132931#issuecomment-2321862402))	2024-08-30 16:27:40 +00:00
Thomas Bohnstingl	994438040c	Improvements for associative_scan - combine_mode (#133012 ) This is part of a series of PRs to improve the `associative_scan` functionality. This specific PR introduces a `combine_mode`, which can be either `pointwise` (default) or `generic`. With `generic`, `associative_scan` is more flexible and also allows non-pointwise functions. This PR has been derived from https://github.com/pytorch/pytorch/pull/129307. @ydwu4 @Chillee @zou3519 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133012 Approved by: https://github.com/ydwu4	2024-08-30 16:09:53 +00:00
PyTorch MergeBot	c6ecf57dd2	Revert "[dynamo] simplify implementation for `functools.reduce` (#133778 )" This reverts commit b5f1ffa7ab0988184497788f2738e1769888ab7d. Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))	2024-08-30 16:06:10 +00:00
PyTorch MergeBot	7a85c488a8	Revert "[dynamo] simplify implementation for `builtins.sum` (#133779 )" This reverts commit eaa449fbf0fe528a0827ee9b5bcfcd307a7c658d. Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))	2024-08-30 16:06:10 +00:00
PyTorch MergeBot	1ad08c7a5b	Revert "[dynamo][itertools] refactor `itertools.chain` and `itertools.chain.from_iterable` to use polyfills (#133864 )" This reverts commit 1b703669576223024eb84a76c53b7ec5ed8bb270. Reverted https://github.com/pytorch/pytorch/pull/133864 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))	2024-08-30 16:06:10 +00:00
PyTorch MergeBot	8aa44e14cf	Revert "[dynamo] refactor `builtins.enumerate` to use polyfill (#133894 )" This reverts commit a2566adfb6064235db6d950568435fb6ef58a598. Reverted https://github.com/pytorch/pytorch/pull/133894 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))	2024-08-30 16:06:09 +00:00
PyTorch MergeBot	10c31e96df	Revert "[dynamo][itertools] refactor `itertools.islice` to use polyfill (#133876 )" This reverts commit 7d12e6dceb94a221288f21c0e79ce8ca667d657a. Reverted https://github.com/pytorch/pytorch/pull/133876 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))	2024-08-30 16:06:09 +00:00
Yidi Wu	d261a1751a	[HOP] fix export x inline_inbuilt_nn_modules (#133731 ) TL;DR: this PR supports exporting cond x inline_inbuilt_nn_modules flag by inlining into the tracing code in proxy_tensor.py and _symbolic_trace.py (internally, the pattern is make_fx(record_module_stack)(torch.compile(f))). We apply special treatment in the following two cases: 1. _ModuleStackTracer wraps all the nn modules in _AttrProxy. _AttrProxy has several subtleties that make it hard to inline in dynamo, such as overriding _modules with a property method and overriding `__getattr__`, which mutates captured state when called. The solution is to unwrap the _AttrProxy and get its corresponding nn_module (a 1-1 correspondence), so that dynamo symbolically traces the original nn module instead of tracing _AttrProxy. 2. The tracer applies a bunch of patches to the `__getattr__` and `__call__` of nn.Module for tracking reasons. This doesn't work well with dynamo. The immediate error we see is `torch._dynamo.exc.Unsupported: 'inline in skipfiles: WeakKeyDictionary.__contains__ \| __contains__ /home/yidi/.conda/envs/pytorch/lib/python3.10/weakref.py` caused by a weakdict in PythonKeyTracer. The solution is to temporarily remove the patches during dynamo symbolic convert, so that dynamo has a clean environment. make_fx will trace the transformed bytecode of dynamo and patch nn modules there instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133731 Approved by: https://github.com/anijain2305 ghstack dependencies: #134775	2024-08-30 15:58:20 +00:00
Yidi Wu	932c4ca5a0	make make_fx collective test single threaded (#134775 ) make_fx is not thread-safe due to mutating and patching global states. It's difficult and low roi to make it thread-safe so just turn the tracing test into a single-thread test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134775 Approved by: https://github.com/yifuwang	2024-08-30 15:58:20 +00:00
eqy	c07e566baf	[CUDA][P2P] Check device capability in `requires_cuda_p2p_access` (#134523 ) Tests seem to fail on e.g., Volta without this given the compile time macros used e.g., in `79b7fff188/torch/csrc/distributed/c10d/intra_node_comm.cu (L487)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134523 Approved by: https://github.com/yifuwang, https://github.com/Skylion007	2024-08-30 14:08:55 +00:00
Joona Havukainen	92f282ca52	Enable batch matmul for result sizes > 2**32 the tensor can be split along batch axis (#133430 ) Fixes #131865. Addresses the issue seen when running llama v3.1 8B parameter model on MPS backend where the batch matmul output size can go over the 32-bit indexing limit of MPS tensors, causing an assert. Test case to reproduce the issue with the dimensions encountered in llama v3.1 and verify this fix works around it: ``` import torch device='mps' a = torch.randn([32, 20064, 128], dtype=torch.float32,device=device) b = torch.randn([32, 128, 20064], dtype=torch.float32, device=device) res = torch.bmm(a, b) ``` Notably the current change only works as long as the individual output matrix in the bmm does not exceed the number of elements 2**32. This lets us split up the computation along the batch axis to avoid going over the limit. Added a TORCH_CHECK to raise an error if the individual matrix dimensions are too large to handle for this op until a more general workaround tiling the matmuls is available. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133430 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-08-30 14:08:43 +00:00
wz337	50efbb9f1e	[DeviceMesh][Test] Add a unit test for get_local_rank for flattened mesh (#134603 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134603 Approved by: https://github.com/fduwjj ghstack dependencies: #133838, #133839, #134048	2024-08-30 08:13:37 +00:00
Animesh Jain	0f8bec4399	[dynamo] mark_static_nn_module (#134713 ) Fixes issue seen in https://github.com/pytorch/pytorch/issues/132872#issuecomment-2314574656 With this API, we can mark the offending module as static in detectron2. Today's world - Consider user defined nn module int attributes automatic dynamic. Use the API in this PR to make them static if you want. Alternative work - Consider all int attributes of any user defined nn module class static. And then introduce an API - `torch._dynamo.mark_nn_module_attribute_dynamic`. The default being static is worrying if users have `counter` in their model which is updated in each forward invocation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134713 Approved by: https://github.com/jansel ghstack dependencies: #134653	2024-08-30 07:01:06 +00:00
Jason Ansel	a5630239ad	[dynamo] Improve minifier error message when fp64 not supported (#134737 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134737 Approved by: https://github.com/anijain2305	2024-08-30 06:42:32 +00:00
Ankur Neog	1011e0ae98	Generalize devices specific UTs for dynamo (#130714 ) ## Motivation This is a follow-up to PR https://github.com/pytorch/pytorch/pull/126970, adding the facility to run content for Intel Gaudi devices. We intend to extend similar generalization to the rest of the content in test/dynamo, which is currently written to work specifically for cuda devices. Other devices can add onto it if support is available. ## Changes - Carve out BERT-related content into another class. - Use the instantiate_device_type utility to instantiate this class for devices which support the functionality. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130714 Approved by: https://github.com/anijain2305	2024-08-30 05:02:47 +00:00
Animesh Jain	7a694f6683	[justknobs] Override __bool__ method (#134799 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134799 Approved by: https://github.com/ezyang	2024-08-30 04:54:02 +00:00
PyTorch UpdateBot	75b86b1554	[executorch hash update] update the pinned executorch hash (#134736 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134736 Approved by: https://github.com/pytorchbot	2024-08-30 04:11:51 +00:00
Jack Taylor	5e8bf29148	[ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>	2024-08-30 03:38:35 +00:00
Xu Han	1f1e2eeb9d	[inductor] Install `tlparse` for test\dynamo\test_structured_trace.py UTs. (#134806 ) Install tlparse for test\dynamo\test_structured_trace.py UTs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134806 Approved by: https://github.com/ezyang	2024-08-30 03:16:03 +00:00
Laith Sakka	0d5f978795	add basic nn modules diff time benchmarks (#134658 ) benchmarks several shapes of basic nn modules. in both eager and inductor ``` collecting compile time instruction count for basic_modules_ListOfLinears_inductor compile time instruction count for iteration 0 is 48602516013 compile time instruction count for iteration 1 is 20424350269 compile time instruction count for iteration 2 is 20440350455 compile time instruction count for iteration 3 is 20419269999 compile time instruction count for iteration 4 is 20430782200 compile time instruction count for iteration 5 is 20455049622 compile time instruction count for iteration 6 is 20157290712 compile time instruction count for iteration 7 is 20455324001 compile time instruction count for iteration 8 is 20450158317 compile time instruction count for iteration 9 is 20492987748 collecting compile time instruction count for basic_modules_ListOfLinears_eager compile time instruction count for iteration 0 is 961328334 compile time instruction count for iteration 1 is 958887896 compile time instruction count for iteration 2 is 958792214 compile time instruction count for iteration 3 is 958375977 compile time instruction count for iteration 4 is 958568525 compile time instruction count for iteration 5 is 958152305 compile time instruction count for iteration 6 is 959322800 compile time instruction count for iteration 7 is 958332703 compile time instruction count for iteration 8 is 958092100 compile time instruction count for iteration 9 is 958095277 collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_inductor compile time instruction count for iteration 0 is 3572145793 compile time instruction count for iteration 1 is 3503323973 compile time instruction count for iteration 2 is 3501962432 compile time instruction count for iteration 3 is 3501746084 compile time instruction count for iteration 4 is 3500687361 compile time instruction count for iteration 5 is 3822254676 
compile time instruction count for iteration 6 is 3498356846 compile time instruction count for iteration 7 is 3499019157 compile time instruction count for iteration 8 is 3500780314 compile time instruction count for iteration 9 is 3500257458 collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_eager compile time instruction count for iteration 0 is 1844838754 compile time instruction count for iteration 1 is 1843476862 compile time instruction count for iteration 2 is 1844761450 compile time instruction count for iteration 3 is 1845371742 compile time instruction count for iteration 4 is 1845159665 compile time instruction count for iteration 5 is 1845035802 compile time instruction count for iteration 6 is 1844895007 compile time instruction count for iteration 7 is 1844697922 compile time instruction count for iteration 8 is 1844780885 compile time instruction count for iteration 9 is 1844493990 collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_inductor compile time instruction count for iteration 0 is 1597839479 compile time instruction count for iteration 1 is 1348225351 compile time instruction count for iteration 2 is 1347340818 compile time instruction count for iteration 3 is 1348170800 compile time instruction count for iteration 4 is 1348637747 compile time instruction count for iteration 5 is 1678366444 compile time instruction count for iteration 6 is 1348412420 compile time instruction count for iteration 7 is 1348461578 compile time instruction count for iteration 8 is 1347420149 compile time instruction count for iteration 9 is 1349748195 collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_eager compile time instruction count for iteration 0 is 137721777 compile time instruction count for iteration 1 is 139065517 compile time instruction count for iteration 2 is 137130552 compile time instruction count for iteration 3 is 137506030 
compile time instruction count for iteration 4 is 137089838 compile time instruction count for iteration 5 is 137477395 compile time instruction count for iteration 6 is 138550452 compile time instruction count for iteration 7 is 137568409 compile time instruction count for iteration 8 is 136968468 compile time instruction count for iteration 9 is 137481664 collecting compile time instruction count for basic_modules_ModuleComparison_inductor compile time instruction count for iteration 0 is 917209684 compile time instruction count for iteration 1 is 899154426 compile time instruction count for iteration 2 is 898145079 compile time instruction count for iteration 3 is 899817018 compile time instruction count for iteration 4 is 899184687 compile time instruction count for iteration 5 is 898172885 compile time instruction count for iteration 6 is 899958951 compile time instruction count for iteration 7 is 899348186 compile time instruction count for iteration 8 is 897745404 compile time instruction count for iteration 9 is 899581123 collecting compile time instruction count for basic_modules_ModuleComparison_eager compile time instruction count for iteration 0 is 113165302 compile time instruction count for iteration 1 is 112724376 compile time instruction count for iteration 2 is 112774611 compile time instruction count for iteration 3 is 114465211 compile time instruction count for iteration 4 is 112689572 compile time instruction count for iteration 5 is 112726465 compile time instruction count for iteration 6 is 112853691 compile time instruction count for iteration 7 is 112295238 compile time instruction count for iteration 8 is 114022136 compile time instruction count for iteration 9 is 112664932 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134658 Approved by: https://github.com/anijain2305 ghstack dependencies: #133834, #134635, #134649, #134652	2024-08-30 02:13:52 +00:00
Xilun Wu	a645a18d2e	[reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509 ) Summary reland of https://github.com/pytorch/pytorch/pull/134294 Fixes #131446 Fixes #126852 Fixes #126868 Fixes #126493 The PR was reverted due to CI red signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294 . Therefore this PR also removes the `xfail` mark on this specific test to make CI signal green. See the error message below: ``` 2024-08-24T13:42:01.3228990Z ==================================== RERUNS ==================================== 2024-08-24T13:42:01.3229530Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _ 2024-08-24T13:42:01.3229710Z Unexpected success 2024-08-24T13:42:01.3230235Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _ 2024-08-24T13:42:01.3230407Z Unexpected success 2024-08-24T13:42:01.3230594Z =================================== FAILURES =================================== 2024-08-24T13:42:01.3231128Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _ 2024-08-24T13:42:01.3231296Z Unexpected success ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509 Approved by: https://github.com/tianyu-l, https://github.com/wz337	2024-08-30 02:13:45 +00:00
Chen Haifeng	27ffa67984	Support __class__ attr for tuple and list variables (#134099 ) Fixes #134086 This supports the __class__ attribute for TupleVariable and ListVariable, and allows constructing a tuple or list through the __class__ attribute. This patch also fixes a bug in NamedTupleVariable, which was missing a return when calling super var_getattr. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134099 Approved by: https://github.com/anijain2305, https://github.com/jansel	2024-08-30 01:57:49 +00:00
Colin L. Rice	cf11fc0dcb	dynamo: Only log if we've disabled eval_frame once. (#134529 ) This spams logs pretty badly otherwise Pull Request resolved: https://github.com/pytorch/pytorch/pull/134529 Approved by: https://github.com/chuanhaozhuge, https://github.com/oulgen	2024-08-30 00:35:25 +00:00
Ivan Zaitsev	8b68912dfc	Correctly detect "Rate limit exceeded" error (#134785 ) Currently all 403 errors are treated as "Rate limit exceeded": https://github.com/pytorch/pytorch/actions/runs/10622019167/job/29445336924 [Github docs](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#exceeding-the-rate-limit) claim: > If you exceed your primary rate limit, you will receive a 403 or 429 response, and the x-ratelimit-remaining header will be 0. You should not retry your request until after the time specified by the x-ratelimit-reset header. After this change: https://github.com/pytorch/pytorch/actions/runs/10622365327/job/29446456395 Note, the 403 error in the jobs above is a separate issue, this PR addresses only the logging. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134785 Approved by: https://github.com/clee2000	2024-08-29 23:58:15 +00:00
Yu, Guangye	3402a5d865	fix windows xpu build issue (#133845 ) # Motivation Building XPU with oneAPI 2024.2 fails on Windows because `sycl-preview.lib` exists there, and linking this unexpected lib results in `error LNK2019: unresolved external symbol`. # Solution Explicitly use `sycl-preview` in the Linux build only. # Additional Context For `find_library`, please note that the variable will not be updated if it has been stored. ``` If the library is found the result is stored in the variable and the search will not be repeated unless the variable is cleared. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133845 Approved by: https://github.com/min-jean-cho, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet	2024-08-29 23:53:32 +00:00
leslie-fang-intel	3775fc982d	[Inductor][CPP] Fix Index name error (#134645 ) Summary Fix the comment: https://github.com/pytorch/pytorch/pull/122961#issuecomment-2313930242. For all of the cases we see in the 3 test suites (TorchBench, TIMM, Huggingface) we expect: * `_node` is a FX Node with target in ["index_expr", "load", "store"] * `_node.args[1 if _node.target == "index_expr" else 2]` is another FX node with target `get_index` * `_node.args[1 if _node.target == "index_expr" else 2].args[0]` is a str for the name of this index expression This seems not to hold in some FB internal test cases, according to the failure log posted in the above link. So, add a condition check to work around it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134645 Approved by: https://github.com/jgong5, https://github.com/masnesral	2024-08-29 23:33:15 +00:00
Shuqiang Zhang	d13ce2e2b5	[c10d] release gil lock during eager init (#134779 ) Summary: We found that if we init the PG in a background thread, it blocks the main thread until init is complete. This is because the pybinding never releases the GIL. Test Plan: existing CI on eager init Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/134779 Approved by: https://github.com/c-p-i-o	2024-08-29 23:25:33 +00:00
Lucian Grijincu	71ff168dbb	pytorch: llvm_codegen: prefix JIT generated functions with 8B of data so jitted code can be called from ASAN+UBSAN on LLVM17 (llvm/llvm-project#65253) (#134572 ) Summary: Similar workaround was already applied elsewhere in pytorch https://github.com/pytorch/pytorch/pull/133623 {D61348865} LLVM17 UBSAN change discussion https://github.com/llvm/llvm-project/issues/104505 Here we also have to associate the data with the function with `setPrefixData(dummyPrefixData)` to prevent this workaround being disabled by the `optimize(*module_);` call which could change layout/remove the unused variable/etc. Differential Revision: D61845799 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134572 Approved by: https://github.com/atalman	2024-08-29 23:15:13 +00:00
Laith Sakka	496e57283d	add add_loop benchmarks (#134652 ) This benchmark measures the cost of compiling the following function in eager and inductor, so it is basically two benchmarks. ``` @torch.compile(backend=self.backend, fullgraph=True) def f(a, b): result = a.clone() for i in range(1000): if i % 3 == 0: result = result + b elif i % 3 == 1: result = result + 8 * b else: result = result.sin() return result ``` PYTHONPATH=$(pwd) python benchmarks/add_loop.py out ``` collecting compile time instruction count for add_loop_eager compile time instruction count for iteration 0 is 8286649663 compile time instruction count for iteration 1 is 2838971338 compile time instruction count for iteration 2 is 2834263023 compile time instruction count for iteration 3 is 2829447493 compile time instruction count for iteration 4 is 2830904231 compile time instruction count for iteration 5 is 2830281077 compile time instruction count for iteration 6 is 2831466595 compile time instruction count for iteration 7 is 2830732164 compile time instruction count for iteration 8 is 2831088056 compile time instruction count for iteration 9 is 2831204407 collecting compile time instruction count for add_loop_inductor compile time instruction count for iteration 0 is 32585687849 compile time instruction count for iteration 1 is 11747553436 compile time instruction count for iteration 2 is 11746959875 compile time instruction count for iteration 3 is 11749479461 compile time instruction count for iteration 4 is 11750053711 compile time instruction count for iteration 5 is 11750793958 compile time instruction count for iteration 6 is 11751673576 compile time instruction count for iteration 7 is 11754552912 compile time instruction count for iteration 8 is 11753723127 compile time instruction count for iteration 9 is 11759059942 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134652 Approved by: https://github.com/anijain2305 ghstack dependencies: #133834, #134635, #134649	2024-08-29 23:04:01 +00:00
fduwjj	65864d0134	[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931 ) We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG. Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options" We need to make changes to the test to make it aligned with the change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132931 Approved by: https://github.com/H-Huang	2024-08-29 22:40:12 +00:00
Zhuoran Zhao	8b4c487581	Fix AOTInductor compilation on ROCM (#134522 ) Summary: The original PR (https://github.com/pytorch/pytorch/pull/124123) was broken by the cpp_builder refactoring, so this resubmits it as a fix. Test Plan: Test with command here: https://www.internalfb.com/phabricator/paste/view/P1549765548 Differential Revision: D61827208 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134522 Approved by: https://github.com/frank-wei	2024-08-29 21:59:04 +00:00
Shunting Zhang	1e92d7b688	[inductor] move loop ordering after fusion (#126254 ) Restart the work from PR https://github.com/pytorch/pytorch/pull/100331 in this new PR since it's hard to rebase. Some code is expected to be copy/pasted from the previous PR, and the main idea is the same. Previously we saw a relatively large compilation time increase due to too many loop orders being considered. This PR tries to continue the work by doing pruning and only considering loop orders that we know for sure are relevant (i.e. do it on demand). Some manually created cases where loop ordering matters are added as unit tests. The PR can make sure inductor does not miss fusion opportunities for them. This PR should solve the unable-to-fuse problem in https://github.com/pytorch/pytorch/issues/130015 Right now there is still a significant increase in compilation time, so I'll disable the feature by default. Later on, after the compilation time issue is resolved, I'll enable it by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126254 Approved by: https://github.com/jansel	2024-08-29 21:50:07 +00:00
min-jean-cho	416a7894fe	[Windows][XPU] Disable Kineto PTI on Windows only (#134620 ) Disable Kineto + XPU PTI on Windows only. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134620 Approved by: https://github.com/guangyey, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-08-29 20:58:55 +00:00
Xuehai Pan	7d12e6dceb	[dynamo][itertools] refactor `itertools.islice` to use polyfill (#133876 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133876 Approved by: https://github.com/jansel ghstack dependencies: #133769, #133778, #133779, #133864, #133894	2024-08-29 20:56:16 +00:00
Xuehai Pan	a2566adfb6	[dynamo] refactor `builtins.enumerate` to use polyfill (#133894 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133894 Approved by: https://github.com/jansel ghstack dependencies: #133769, #133778, #133779, #133864	2024-08-29 20:56:16 +00:00
Xuehai Pan	1b70366957	[dynamo][itertools] refactor `itertools.chain` and `itertools.chain.from_iterable` to use polyfills (#133864 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133864 Approved by: https://github.com/jansel ghstack dependencies: #133769, #133778, #133779	2024-08-29 20:56:16 +00:00
Xuehai Pan	eaa449fbf0	[dynamo] simplify implementation for `builtins.sum` (#133779 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779 Approved by: https://github.com/jansel, https://github.com/anijain2305 ghstack dependencies: #133769, #133778	2024-08-29 20:56:16 +00:00
Xuehai Pan	b5f1ffa7ab	[dynamo] simplify implementation for `functools.reduce` (#133778 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778 Approved by: https://github.com/jansel, https://github.com/anijain2305 ghstack dependencies: #133769	2024-08-29 20:56:16 +00:00
Xuehai Pan	e09324e7da	[dynamo] simplify polyfill registration for `builtins.all` and `builtins.any` (#133769 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769 Approved by: https://github.com/jansel	2024-08-29 20:56:16 +00:00
drisspg	b977abd5de	[Inductor] Fix error checking for scaled_mm lowering (#134765 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134765 Approved by: https://github.com/Skylion007	2024-08-29 20:18:42 +00:00
atalman	6180574771	Move py 3.8->3.9 pull, trunk, inductor, periodic CI tests (#133624 ) Part of the deprecation of Python 3.8 and the move to 3.9. Related to: https://github.com/pytorch/pytorch/issues/120718 Except XPU and ROCM jobs Pull Request resolved: https://github.com/pytorch/pytorch/pull/133624 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/ZainRizvi	2024-08-29 19:15:59 +00:00
Jason Ansel	202e5cc87d	[inductor] Fix error in debug_str_extra (#134747 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134747 Approved by: https://github.com/Skylion007, https://github.com/shunting314	2024-08-29 19:09:50 +00:00
Brian Vaughan	43e1df64f8	register all entry_point backends on first attempt (#132546 ) fixes: https://github.com/pytorch/pytorch/issues/131360 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132546 Approved by: https://github.com/jansel	2024-08-29 18:59:29 +00:00
Ke Wen	5470fcd5b9	[5/N] Reconcile barrier and NaN checker (#134707 ) By using a zeros() tensor instead of empty() tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134707 Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab ghstack dependencies: #134345, #134357, #134701	2024-08-29 18:51:12 +00:00
zdevito	d91b49dbaa	expandable_segments <-> other allocator options (#134338 ) Previously setting garbage_collection_threshold or max_split_size_mb along with expandable_segments:True could cause the allocator to hit assert failures when running nearly out of memory. This PR ensures garbage_collection and max_split freeing do not accidentally try to release expandable segments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134338 Approved by: https://github.com/ezyang	2024-08-29 18:43:59 +00:00
Rachel Guo	3fc6e47d42	[AOTI] Fix cosmetic indentation issue in cuda cpp wrapper codegen for DeferredCudaKernelLine/GridLine (#134705 ) Summary: Follow up fix for D61018114, D61800622 Increase indentation for `loadKernel` `launchKernel` and `Grid` lines. Test Plan: ``` TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_zero_grid_with_unbacked_symbols_abi_compatible_cuda ``` ``` TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_zero_grid_with_backed_symbols_abi_compatible_cuda ``` Differential Revision: D61927248 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134705 Approved by: https://github.com/ColinPeppler	2024-08-29 18:38:45 +00:00
Aaron Gokaslan	5573c17877	[BE][Ez]: Update ruff to 0.6.3 (#134769 ) Mostly bugfix release, updating because it fixes an edgecase with a rule we are using Pull Request resolved: https://github.com/pytorch/pytorch/pull/134769 Approved by: https://github.com/albanD	2024-08-29 18:35:47 +00:00
Xintong Hu	ce96146623	[PT2] Fix node metadata setting in group_batch_fusion_aten (#134543 ) Summary: Current impl results in `meta` missing fields like`val`, use `FakeTensorProp` to update the information Differential Revision: D61832932 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134543 Approved by: https://github.com/frank-wei	2024-08-29 18:32:04 +00:00
chilli	348d02a983	Changed masked out rows logsumexp to be -inf and not zero (#134650 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134650 Approved by: https://github.com/yanboliang, https://github.com/BoyuanFeng, https://github.com/drisspg	2024-08-29 17:22:52 +00:00
Pian Pawakapan	36a6516290	[export] use single FQN for param_buffer_mapping (#134500 ) Fixes #133252 In strict mode, we have this routine for mapping traced parameters to their FQNs using tensor ids. Currently we assume there's at least 1 unique FQN for each traced parameter, but this seems to break with parameter reuse when call_module nodes are present. Adding a test case where this breaks. Fixes this by assigning the same FQN to all traced parameters with the same tensor id. This is fine because we return the original state_dict for the EP, and the unflattener has its own routine of handling aliasing: https://github.com/pytorch/pytorch/pull/125758 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134500 Approved by: https://github.com/angelayi	2024-08-29 17:06:31 +00:00
Ke Wen	d9d95dc55e	[4/N] Test NaN checker against broadcast (#134701 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134701 Approved by: https://github.com/wconstab ghstack dependencies: #134345, #134357	2024-08-29 17:00:07 +00:00
PyTorch MergeBot	ab646cd805	Revert "[reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509 )" This reverts commit ba5aec88c678fe4b9ad101602c29726724f56e21. Reverted https://github.com/pytorch/pytorch/pull/134509 on behalf of https://github.com/ZainRizvi due to Sorry but this fails internally. For details see D61953754 ([comment](https://github.com/pytorch/pytorch/pull/134509#issuecomment-2318323161))	2024-08-29 16:39:19 +00:00
Ke Wen	26aea277f7	[3/N] Set correct device to CUDA guards (#134357 ) In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA facing the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062. With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA. Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357 Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang ghstack dependencies: #134345	2024-08-29 16:25:27 +00:00
Xu Han	d503217ea4	[inductor] calibration inductor windows uts (15/N) (#134586 ) Fix `test_logs_out` UT on Windows. make `test/dynamo/test_logging.py` all UTs pass on Windows. Changes: 1. Close `NamedTemporaryFile` to release file handle to avoid PermissionError issue. 2. `PermissionError` setup as `delete=False`, let file not be auto deleted. 3. Open log file as "utf-8" to align with Linux. 4. Process wrap difference for Windows. 5. Delete tmp file manually. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134586 Approved by: https://github.com/jansel	2024-08-29 16:18:40 +00:00
Ke Wen	9953f55f4c	[2/N] Add flag to control which rank should perform NaN check (#134345 ) Fixes https://github.com/pytorch/pytorch/issues/134062. For example, in case of broadcast / scatter, only the root rank should perform the NaN check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134345 Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab	2024-08-29 16:13:15 +00:00
Bin Bao	387d3fc296	[AOTI] Switch benchmarking to use export non-strict mode (#130977 ) Summary: Switch the export part used by AOTInductor benchmarking from strict to non-strict, and switch it from producing torch IR to aten IR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130977 Approved by: https://github.com/angelayi ghstack dependencies: #134639	2024-08-29 16:08:52 +00:00
Valentine233	0dbc72887b	[CPU][flash attention] make the stride of output align with input (#134656 ) Fixes #133671 Currently, the output of CPU flash attention has a fixed layout, no matter what the input is. This PR makes the stride of output align with input q/k/v, which is the same behavior as math backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134656 Approved by: https://github.com/jgong5, https://github.com/drisspg	2024-08-29 16:04:25 +00:00
Stonepia	4fcd15a667	Fix test_sgd_weight_decay_xpu accuracy error (#134744 ) Fixes #134743 This PR adds `test_sgd_weight_decay_xpu` in `KERNEL_COUNT_OVERRIDES` to override. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134744 Approved by: https://github.com/EikanWang, https://github.com/desertfire	2024-08-29 15:12:40 +00:00
Animesh Jain	594162f7ab	[dynamo] Support reading attributes from pybind objects (#134630 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134630 Approved by: https://github.com/jansel	2024-08-29 15:06:52 +00:00
Avik Chaudhuri	92e38a476f	preserve aten::to device in export training (#134622 ) Summary: With training IR, we cannot rely on trapping `to()` in `FunctionalTensor` because the regular decomposition kicks in first, and that can cause it to be optimized away. So instead we preserve it until we functionalize, and then replace it explicitly with `_to_copy()`. Test Plan: expected test failures go away Differential Revision: D61883878 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134622 Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan	2024-08-29 14:53:30 +00:00
rzou	092349dcdd	Never CSE aten.empty in the partitioner (#134703 ) aten.empty is almost always fusible into its consumer, so we never CSE it. This fixes a bug that looks like the following: ```py @torch.library.custom_op("_reinplacing::sin_cos", mutates_args={"out_sin", "out_cos"}) def sin_cos(x: torch.Tensor, out_sin: torch.Tensor, out_cos: torch.Tensor) -> None: out_sin.copy_(x.sin()) out_cos.copy_(x.cos()) @torch.compile def f(x): out0 = torch.empty_like(x) out1 = torch.empty_like(x) sin_cos(x, out0, out1) return x.clone(), out0, out1 x = torch.randn(3, requires_grad=True) f(x) ``` - cse would de-duplicate the empty nodes - reinplacing would add an additional clone (because it can't write to both tensors at the same time) - the clone lowers into a new buffer + a copy_ kernel - the copy_ kernel is unnecessary because "empty" is special - all reinplacing needed was an additional buffer, it doesn't matter what the values are. We could attempt to fix this on the reinplacing side but this seemed better as a partitioner heuristic and the reinplacing fix is a bit more tricky (we'd need to identify that the op never reads from the empty node). Test Plan: - new test (the old number was 27, the new number is 21, so this PR helped). Pull Request resolved: https://github.com/pytorch/pytorch/pull/134703 Approved by: https://github.com/yf225 ghstack dependencies: #134466, #134490, #134491	2024-08-29 13:51:19 +00:00
Xuehai Pan	70853b792a	[dynamo][itertools] support `itertools.tee` (#133771 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771 Approved by: https://github.com/jansel ghstack dependencies: #133801	2024-08-29 13:36:52 +00:00
Xuehai Pan	9e806c1a60	[dynamo] simplify implementation for `os.fspath` (#133801 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133801 Approved by: https://github.com/anijain2305	2024-08-29 13:36:52 +00:00
Animesh Jain	d01a7a9faa	[dynamo] Graph break on FSDP flat_param inconsistent tensor and grad dtype (#134614 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134614 Approved by: https://github.com/awgu, https://github.com/yf225 ghstack dependencies: #134610, #134590, #134621	2024-08-29 09:14:42 +00:00
Animesh Jain	fb35d1e01f	[reland][dynamo][exceptions] Support raise from None (#134621 ) The PR was reverted because this PR traced more code and surfaced a latent bug. Resubmitting w/o any changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134621 Approved by: https://github.com/jansel ghstack dependencies: #134610, #134590	2024-08-29 09:14:42 +00:00
Animesh Jain	2bf622685d	[dynamo][dicts] Support hasattr on dicts (#134590 ) Fixes - https://github.com/pytorch/pytorch/issues/134577 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134590 Approved by: https://github.com/Skylion007 ghstack dependencies: #134610	2024-08-29 09:14:42 +00:00
Animesh Jain	2446dead35	[dynamo][exceptions] Use exception subclass whenever possible (#134610 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134610 Approved by: https://github.com/drisspg, https://github.com/jansel	2024-08-29 09:14:42 +00:00
wz337	cfb642bb6b	[DTensor] Extend implicit replication to replicate DTensor for foreach ops so model doesn't have to be fully tp-ed when using 2D (#134551 ) Fixes [134212](https://github.com/pytorch/pytorch/issues/134212) Currently, when we use 2D FSDP with TP, `optimizer.step()` would fail if the model were not fully tensor parallelized. If we don't have the entire model tensor parallelized when doing 2D, we would have both 1D and 2D DTensor parameters. As foreach is turned on by default, `optimizer.step()` would fail as cross mesh op is not allowed. Error as follows: ``` NotImplementedError: aten._foreach_mul_.Scalar: DTensor does not support cross-mesh operation yet!Got meshes: DeviceMesh('cuda', [[0, 1], [2, 3]], mesh_dim_names=('dp', 'tp')) DeviceMesh('cuda', [1, 3], mesh_dim_names=('dp',)) ``` In this PR, we extend implicit_replication to replicate DTensor in missing dimensions for foreach ops. If users don't want to fully tensor parallelize the model when using 2D, they have the option of using the `implicit_replication()` context manager for `optimizer.step()`. In this case, we would swap out the 1D DTensorSpec and replace it with 2D DTensorSpec. However, we don't want to turn this on by default yet, as we want the users to be aware that the tp dimension is replicated if a layer is not tp-ed. With implicit replication turned on, replicating the DTensorSpec in the missing dimensions works for most foreach cases, except when the first DTensor in the list is one that also needs to be replicated. This is currently a limitation for which I don't have a good solution yet. Currently, with this change, we can handle most of the cases except the case that the first DTensor's ndim is not the largest. ``` [2D_DTensor, 1D_DTensor...] ---> Implicit_replication() can handle this. [1D_DTensor, 2D_DTensor...] ---> Implicit_replication() can't handle this. ``` This change doesn't affect the existing default behavior, as `implicit_replication()` is not turned on by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134551 Approved by: https://github.com/tianyu-l	2024-08-29 09:01:31 +00:00
Ke Wen	3645634f3c	[1/N] Move NaN check onto NCCL stream (#134300 ) So that the tensor's lifetime management is the same as the management built for the NCCL, pre and post kernels. Also so that on visualizers, they show up in the NCCL stream line. Otherwise, if they show up in the compute line, users may get confused ("my code does not have these kernels"). The check is thus moved after the point where the NCCL stream is made to depend on the last compute kernel. Also moved declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu. Differential Revision: [D61957573](https://our.internmc.facebook.com/intern/diff/D61957573) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300 Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab	2024-08-29 08:28:49 +00:00
Will Feng	578b8d75e5	[2nd try][Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#134539 ) The previous PR https://github.com/pytorch/pytorch/pull/133532 caused stuck compilation issue on internal models. In this 2nd attempt PR, we gate the trace_rules.py changes with `if not torch._dynamo.config.skip_fsdp_hooks:`, so that they don't take effect for current graph-break FSDP2 (which relies on the default config value `skip_fsdp_hooks=True`), and will only take effect when we are using Traceable FSDP2 (in which case the user needs to proactively set `skip_fsdp_hooks=False`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/134539 Approved by: https://github.com/ckluk2, https://github.com/yanboliang	2024-08-29 06:28:16 +00:00
Xia, Weiwen	834d8b0965	[Inductor][mkldnn] Bug fix: incorrect codegen arg order for qconv (#134579 ) Fixes #133448 The arg order for mkldnn qconv IR became incorrect after PR #132367 . This PR fixes the bug. Test plan `python test/inductor/test_mkldnn_pattern_matcher.py -k qconv` `python test/inductor/test_cpu_cpp_wrapper.py -k qconv` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134579 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5	2024-08-29 06:20:52 +00:00
wz337	b0a6d9ad27	[DTensor] Add pointwise ops strategy for aten.isinf, aten.isneginf, aten.isposinf (#134699 ) Fixes #ISSUE_NUMBER Need it for https://github.com/facebookresearch/optimizers/blob/main/distributed_shampoo/utils/shampoo_preconditioner_list.py#L671 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134699 Approved by: https://github.com/tianyu-l	2024-08-29 06:01:12 +00:00
Wang, Eikan	da9e61ef70	Get accumulate dtype for Intel GPU (#134465 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): There are two function variants to get accumulated dtype for a given dtype: - Func1: `c10::ScalarType toAccumulateType(c10::ScalarType type, c10::DeviceType device)` - Func2: `c10::ScalarType toAccumulateType(c10::ScalarType type, bool is_cuda)` The Func1 is general enough to support different devices, while the Func2 only supports CUDA and CPU. This PR intends to add the Intel GPU path in the Func1. And we expect users to invoke the Func1 to ensure compatibility for different devices. * __->__ #134465 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134465 Approved by: https://github.com/Skylion007, https://github.com/atalman	2024-08-29 05:27:57 +00:00
Mikayla Gawarecki	94db935749	Add torch.serialization.skip_data context manager (#134504 ) ## Semantics The semantics are: (1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint). ```python import torch import torch.nn as nn sd = nn.Linear(3, 5).state_dict() with torch.serialization.skip_data(): torch.save(sd, 'foo.pt') print(torch.load('foo.pt', weights_only=True)) ``` (2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`, if a FakeTensor is passed to `torch.save`, the pickler will treat it as being "materialized" (space will be reserved in the checkpoint for the associated storage bytes, and when loading the type will be Tensor instead of FakeTensor). ```python import torch import torch.nn as nn from torch._subclasses.fake_tensor import FakeTensorMode with FakeTensorMode(): m = nn.Linear(3, 5, dtype=torch.float16, device='cuda') sd = m.state_dict() with torch.serialization.skip_data(materialize_fake_tensors=True): torch.save(sd, 'bla.pt') print(torch.load('bla.pt', weights_only=True)) # OrderedDict([('weight', tensor([[0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))]) ``` ## Follow Ups - [ ] `torch.load` semantic for skip_data context manager - [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504 Approved by: https://github.com/albanD	2024-08-29 04:52:52 +00:00
Banit Agrawal	297b42012d	[PyTorch] Use pinned memory for zero_cuda_out (#134712 ) Summary: This diff creates a pinned tensor for copying from device for the zero_out op. Differential Revision: D61759262 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134712 Approved by: https://github.com/zyan0	2024-08-29 04:46:08 +00:00
Jennifer (Jiyue) Wang	a32255481b	[caffe2][hipify] remove un-used flag from `pybind_utils.h` (#134404 ) Summary: Encountered issues related to AMD build when working on https://www.internalfb.com/diff/D60739324?dst_version_fbid=2203158110057105 (see stack trace P1545717562) Looking at the file history, it seems that the flag is no longer used, so I propose removing it. Alternatively, I could change the `#ifdef` to check both `USE_C10D_NCCL` and `USE_ROCM` and include the corresponding AMD header files. Let me know which approach is preferred. Test Plan: Sandcastle Differential Revision: D61762129 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134404 Approved by: https://github.com/malfet	2024-08-29 04:09:44 +00:00
Syed Tousif Ahmed	4655eb3ee2	Uses MemPoolContext to route allocations from CUDACachingAllocator (#134685 ) Re-open of https://github.com/pytorch/pytorch/pull/133599 that was mistakenly closed by issuing `ghstack land` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134685 Approved by: https://github.com/ezyang	2024-08-29 03:56:31 +00:00
David Berard	4b4ba7ab06	[NJT] Support NJT SDPA + meta-device flop counting (#134289 ) A user wants to use the flop counter with meta devices. This previously caused problems for SDPA+NJT: 1. autocast check: `torch.is_autocast_enabled("meta")` fails because `meta` is not valid for autocasting. If we skip this, we run into the next error 2. math backend: conversion to NST requires getting concrete offsets in a list of python integers, which doesn't work on a meta tensor `b2eb0e8c6a/torch/nested/_internal/sdpa.py (L809-L815)` 3. (fixed in the previous PR, #134288) - if we force using flash attention backend for flop counting, `_flash_attention_forward` previously didn't support meta tensors. In this PR, we check specifically for FlopCounterMode, and, if it's enabled and combined with meta tensors, (a) skip autocasting and (b) force it down the flash attention path. This isn't generally safe for tracing (e.g. if you actually care which kernels you are running), but in the absence of actual device information, we have to make some assumptions. By specifically checking for FlopCounterMode, this should reduce the chance of unintended side effects for other meta tensor users. Note: fake tensor would solve a bunch of these issues, but it's not a viable solution right now for the user. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134289 Approved by: https://github.com/soulitzer ghstack dependencies: #134288	2024-08-29 03:43:42 +00:00
CaoE	17e9c2d1e7	Add oneDNN support for Half LSTM on CPU (#132607 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132607 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-08-29 03:40:10 +00:00
Ivan Zaitsev	41e36e2b46	Reflect check_labels status as a signal (#134711 ) Fixes the workflow when meta-exported diff (co-dev) doesn't have the required labels, but the signal is suppressed due to job failure (e.g. [see this run](https://github.com/pytorch/pytorch/actions/runs/10590994706/job/29347663526?pr=134484)). With this change the workflow status correctly reflects the status of the check. # Testing * [illegal pr_num](https://github.com/pytorch/pytorch/actions/runs/10603163898/job/29386843591) * [successful run](https://github.com/pytorch/pytorch/actions/runs/10603279052/job/29387230110) (topic label present) * no labels: [check fails](https://github.com/pytorch/pytorch/actions/runs/10603310368/job/29387333864) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134711 Approved by: https://github.com/clee2000	2024-08-29 03:11:16 +00:00
Yueming Hao	4f9c68454a	[inductor]Let output or input_as_strided match exact strides (#130956 ) Fixes #130394 TorchInductor doesn't respect the original strides of outputs. This opens up optimization opportunities such as changing the memory layout. But in some cases, such as the one in https://github.com/pytorch/pytorch/issues/130394, we do need the output to match the exact strides required. Correctness is the first priority, so this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue. This PR makes the strides of dense and non-dense outputs follow the strides required by semantics. The comparison between the generated code before and after this fix for the test is below. ```python @triton.jit def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 128 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x0 = xindex % 8 x1 = (xindex // 8) - x2 = xindex tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask) tmp1 = tmp0 + tmp0 - tl.store(out_ptr0 + (x2), tmp1, xmask) + tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask) def call(args): arg0_1, = args args.clear() assert_size_stride(arg0_1, (16, 8), (16, 1)) with torch.cuda._DeviceGuard(0): torch.cuda.set_device(0) - buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32) + buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32) stream0 = get_raw_stream(0) triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0) del arg0_1 return (buf1, ) ``` buf1 is created with the exact strides required by users, and its values are written with the same strides as the input. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956 Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/desertfire	2024-08-29 03:06:58 +00:00
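The index arithmetic in the kernel diff above can be checked without Triton or a GPU. The following is a minimal pure-Python sketch, assuming the 16x8 output with 128 elements from the test; `flat_offset` and `offsets` are hypothetical helpers for illustration and are not part of Inductor.

```python
def flat_offset(index, strides):
    """Memory offset of a multi-dimensional index under the given strides."""
    return sum(i * s for i, s in zip(index, strides))


def offsets(shape, strides):
    """Offsets of every element of a 2D tensor, visited in row-major order."""
    rows, cols = shape
    return [flat_offset((r, c), strides) for r in range(rows) for c in range(cols)]


SHAPE = (16, 8)
dense = offsets(SHAPE, (8, 1))   # layout of buf1 before the fix
exact = offsets(SHAPE, (16, 1))  # layout required by the input's strides

# Same arithmetic as the fixed kernel: x0 = xindex % 8, x1 = xindex // 8,
# store offset = x0 + 16 * x1.
kernel = [(x % 8) + 16 * (x // 8) for x in range(128)]
```

The dense layout packs the 128 elements back to back, while the exact layout leaves a gap of 8 slots after each row, which is why the store index has to change along with the buffer strides.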
PyTorch MergeBot	4811dc3de9	Revert "[dynamo] simplify polyfill registration for `builtins.all` and `builtins.any` (#133769 )" This reverts commit cc3a76edbac4a48381db6ccc44a83927f80c545b. Reverted https://github.com/pytorch/pytorch/pull/133769 on behalf of https://github.com/ZainRizvi due to Sorry but this has been discovered to be causing a performance regression internally ([comment](https://github.com/pytorch/pytorch/pull/133769#issuecomment-2316620213))	2024-08-29 03:00:47 +00:00
PyTorch MergeBot	f65df5edae	Revert "[dynamo][itertools] support `itertools.tee` (#133771 )" This reverts commit 1dbd3476de07d7f07489e243cb7a43073e8c25c1. Reverted https://github.com/pytorch/pytorch/pull/133771 on behalf of https://github.com/ZainRizvi due to Sorry, have to revert this in order to be able to revert https://github.com/pytorch/pytorch/pull/133769 ([comment](https://github.com/pytorch/pytorch/pull/133771#issuecomment-2316611158))	2024-08-29 02:49:30 +00:00
PyTorch MergeBot	eaec9e80b8	Revert "[dynamo] simplify implementation for `os.fspath` (#133801 )" This reverts commit 74341e1150f10b8aaddd33a165e686724424071f. Reverted https://github.com/pytorch/pytorch/pull/133801 on behalf of https://github.com/ZainRizvi due to Sorry, have to revert this in order to be able to revert https://github.com/pytorch/pytorch/pull/133769 ([comment](https://github.com/pytorch/pytorch/pull/133771#issuecomment-2316611158))	2024-08-29 02:49:30 +00:00
Jason Ansel	76f975948e	[inductor] Cleanup generate_node_schedule (#134306 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134306 Approved by: https://github.com/shunting314	2024-08-29 02:45:14 +00:00
Sun, Jiayi	cccb121d4e	[Inductor] add inductor config: masked_vec (#134566 ) This PR adds the inductor config `masked_vec`, which enables or disables masked vectorization for the tail_loop. It is enabled by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134566 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-08-29 02:29:06 +00:00
Laith Sakka	c5f114747e	fix flakiness in update_hint_benchmark.py (#134649 ) ``` compile time instruction count for iteration 1 is 10732129038 compile time instruction count for iteration 2 is 10719776783 compile time instruction count for iteration 3 is 10729546868 compile time instruction count for iteration 4 is 10737655132 compile time instruction count for iteration 5 is 10732564252 compile time instruction count for iteration 6 is 10728721234 compile time instruction count for iteration 7 is 10733354271 compile time instruction count for iteration 8 is 10719588972 compile time instruction count for iteration 9 is 10706311856 ``` 1. add torch.manual_seed(0), since the inputs were not the same across iterations 2. disable gc. 3. remove loop (not needed since compilation happens only once) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134649 Approved by: https://github.com/aorenste ghstack dependencies: #133834, #134635	2024-08-29 02:22:05 +00:00
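The first two steps listed above (fixed seed, garbage collection disabled) can be sketched in isolation. This is a minimal illustration that uses the standard library `random` module as a stand-in for `torch.manual_seed`; `make_inputs` and `measure_once` are hypothetical names and do not come from update_hint_benchmark.py.

```python
import gc
import random


def make_inputs(n=4):
    """Random inputs; identical across calls only if the seed was reset."""
    return [random.random() for _ in range(n)]


def measure_once(fn, seed=0):
    """Run fn on freshly seeded inputs with the garbage collector paused."""
    random.seed(seed)          # same inputs on every measurement
    inputs = make_inputs()
    was_enabled = gc.isenabled()
    gc.disable()               # avoid collector pauses inside the measured region
    try:
        return fn(inputs)
    finally:
        if was_enabled:
            gc.enable()        # restore the previous collector state
```

Calling `measure_once` twice with the same seed feeds `fn` identical inputs, so any remaining variation in a measured count comes from the function itself rather than from the data or the collector.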
PyTorch MergeBot	f0fceed432	Revert "[dynamo][exceptions] Use exception subclass whenever possible (#134610 )" This reverts commit 880e3d18a406777dbea6aeaf14443b0e3a8b441c. Reverted https://github.com/pytorch/pytorch/pull/134610 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))	2024-08-29 02:02:12 +00:00
PyTorch MergeBot	67d7040fce	Revert "[dynamo][dicts] Support hasattr on dicts (#134590 )" This reverts commit c566f2465f41b8081caed205fcf5fe973fd970b3. Reverted https://github.com/pytorch/pytorch/pull/134590 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))	2024-08-29 02:02:12 +00:00
PyTorch MergeBot	40cebde3bc	Revert "[raland][dynamo][exceptions] Support raise from None (#134621 )" This reverts commit e96dc3665a1d48434c02e17f7faed41f779cee2c. Reverted https://github.com/pytorch/pytorch/pull/134621 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))	2024-08-29 02:02:12 +00:00
PyTorch MergeBot	c35d1f7b3a	Revert "[dynamo] Graph break on FSDP flat_param inconsistent tensor and grad dtype (#134614 )" This reverts commit e4a5958ab58e2f9b5b9c336a1d2a6449784d88d3. Reverted https://github.com/pytorch/pytorch/pull/134614 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))	2024-08-29 02:02:12 +00:00
PyTorch MergeBot	25531eb735	Revert "[2nd try][Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#134539 )" This reverts commit 26e392132d3039345de6aaf8643e7330f7fc3cbc. Reverted https://github.com/pytorch/pytorch/pull/134539 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134539#issuecomment-2316568257))	2024-08-29 01:59:02 +00:00
PyTorch MergeBot	cbf5ba1e97	Revert "[1/N] Move NaN check onto NCCL stream (#134300 )" This reverts commit 94caba4899096f160eca9628acddba6032755b3b. Reverted https://github.com/pytorch/pytorch/pull/134300 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](https://github.com/pytorch/pytorch/pull/134300#issuecomment-2316559704))	2024-08-29 01:50:22 +00:00
PyTorch MergeBot	33d0c11b26	Revert "[2/N] Add flag to control which rank should perform NaN check (#134345 )" This reverts commit 2fe7e332c7a61f025ccbcdbbb4875c6bf0b9afdf. Reverted https://github.com/pytorch/pytorch/pull/134345 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](https://github.com/pytorch/pytorch/pull/134300#issuecomment-2316559704))	2024-08-29 01:50:22 +00:00
PyTorch MergeBot	43dc17fd00	Revert "[3/N] Set correct device to CUDA guards (#134357 )" This reverts commit afc76c6f2d46d7726012507ec5c67b4c04e21723. Reverted https://github.com/pytorch/pytorch/pull/134357 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](https://github.com/pytorch/pytorch/pull/134300#issuecomment-2316559704))	2024-08-29 01:50:22 +00:00
PyTorch MergeBot	503c0dd923	Revert "Add MaskedTensor support to *_like API (#128637 )" This reverts commit b6e51711a0ea6174806e75ab6e208d2d910b45f5. Reverted https://github.com/pytorch/pytorch/pull/128637 on behalf of https://github.com/ZainRizvi due to Actually, seems like it was this commit that introduced the failure: test_maskedtensor.py::TestOperatorsCUDA::test_like_empty_like_layout1_cuda_bool [GH job link](https://github.com/pytorch/pytorch/actions/runs/10604690725/job/29392898277) [HUD commit link](`b6e51711a0`) ([comment](https://github.com/pytorch/pytorch/pull/128637#issuecomment-2316554188))	2024-08-29 01:42:52 +00:00
PyTorch MergeBot	1285443994	Revert "Add torch.serialization.skip_data context manager (#134504 )" This reverts commit 202600bc2384cb19a29b8fca503bafc289158c32. Reverted https://github.com/pytorch/pytorch/pull/134504 on behalf of https://github.com/mikaylagawarecki due to This is breaking Windows docs tests due to NamedTemporaryFile on Windows not working well ([comment](https://github.com/pytorch/pytorch/pull/134504#issuecomment-2316543901))	2024-08-29 01:30:49 +00:00
Li-Huai (Allan) Lin	e7711d6c7d	[MPS] Fix SDP training (#134719 ) Check whether the input tensors require grad. If required, then we don't get into the fast path and fall back to composite implicit. Fixes #134678 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134719 Approved by: https://github.com/malfet	2024-08-29 01:28:53 +00:00
Avik Chaudhuri	ca03a14cf7	hang dim hint constants off Dim (#134702 ) Summary: Retry landing https://github.com/pytorch/pytorch/pull/134484 Test Plan: (see original) Differential Revision: D61925860 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134702 Approved by: https://github.com/pianpwk	2024-08-29 01:02:01 +00:00
Rachel Guo	7a554e96b4	[AOTI][Tooling] Follow up to print location of saved file path for `torch.pickle_save()` (#134651 ) Summary: - Follow up to add torch.pickle_save() log for saved file path - Minor debug printer code refine Test Plan: CI Differential Revision: D61883239 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134651 Approved by: https://github.com/muchulee8	2024-08-28 23:58:37 +00:00
Mikayla Gawarecki	202600bc23	Add torch.serialization.skip_data context manager (#134504 ) ## Semantics The semantics are: (1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint). ```python import torch import torch.nn as nn sd = nn.Linear(3, 5).state_dict() with torch.serialization.skip_data(): torch.save(sd, 'foo.pt') print(torch.load('foo.pt', weights_only=True)) ``` (2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`, if a FakeTensor is passed to `torch.save`, the pickler will treat it as being "materialized" (space will be reserved in the checkpoint for the associated storage bytes, and when loading the type will be Tensor instead of FakeTensor). ```python import torch import torch.nn as nn from torch._subclasses.fake_tensor import FakeTensorMode with FakeTensorMode(): m = nn.Linear(3, 5, dtype=torch.float16, device='cuda') sd = m.state_dict() with torch.serialization.skip_data(materialize_fake_tensors=True): torch.save(sd, 'bla.pt') print(torch.load('bla.pt', weights_only=True)) # OrderedDict([('weight', tensor([[0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))]) ``` ## Follow Ups - [ ] `torch.load` semantic for skip_data context manager - [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504 Approved by: https://github.com/albanD	2024-08-28 23:53:17 +00:00
PyTorch MergeBot	f997b2b8e6	Revert "Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262 )" This reverts commit f685018ea9d08f98cbd7106028db134f967f74d3. Reverted https://github.com/pytorch/pytorch/pull/125262 on behalf of https://github.com/ZainRizvi due to Hi, this PR appears to be causing maskedtensor tests to fail on main. Please rebase your changes onto the latest trunk build to repro the failure. test_maskedtensor.py::TestOperatorsCUDA::test_like_empty_like_layout1_cuda_bool [GH job link](https://github.com/pytorch/pytorch/actions/runs/10604716811/job/29393256312) [HUD commit link](`f685018ea9`) ([comment](https://github.com/pytorch/pytorch/pull/125262#issuecomment-2316387447))	2024-08-28 23:10:07 +00:00
Tugsbayasgalan Manlaibaatar	6dd3f81aaf	Add export_for_training as public API (#134677 ) Differential Revision: [D61912084](https://our.internmc.facebook.com/intern/diff/D61912084) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134677 Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17	2024-08-28 22:32:10 +00:00
rzou	a7933acd5a	Improve custom ops aliasing error message (#134688 ) Fixes https://github.com/pytorch/pytorch/issues/134278 Test Plan: - tested locally Pull Request resolved: https://github.com/pytorch/pytorch/pull/134688 Approved by: https://github.com/yushangdi ghstack dependencies: #134466, #134490, #134491, #134690, #134692	2024-08-28 22:22:04 +00:00
rzou	dd443f418a	Improve opcheck docs. (#134692 ) Fixes https://github.com/pytorch/pytorch/issues/134119 From user feedback, it's difficult to understand what the tests do. We clarify the docs more. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134692 Approved by: https://github.com/albanD ghstack dependencies: #134466, #134490, #134491, #134690	2024-08-28 22:22:04 +00:00
Ke Wen	afc76c6f2d	[3/N] Set correct device to CUDA guards (#134357 ) In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA facing the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062. With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA. Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357 Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang ghstack dependencies: #134300, #134345	2024-08-28 22:17:11 +00:00
rzou	5ff97e79ee	Skip test_mutable_custom_op_fixed_layout2 on ROCM (#134690 ) ROCM doesn't trigger the layout optimization that makes the test case valid so we're going to skip the checks. Should fix the following (I'll close them later) - https://github.com/pytorch/pytorch/issues/134481 - https://github.com/pytorch/pytorch/issues/134519 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134690 Approved by: https://github.com/FindHao ghstack dependencies: #134466, #134490, #134491	2024-08-28 22:12:24 +00:00
Ke Wen	2fe7e332c7	[2/N] Add flag to control which rank should perform NaN check (#134345 ) Fixes https://github.com/pytorch/pytorch/issues/134062. For example, in case of broadcast / scatter, only the root rank should perform the NaN check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134345 Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab ghstack dependencies: #134300	2024-08-28 21:53:39 +00:00
Janet Yang	26ec06e45d	[amd][lowering] hipify shim v2 headers (#134689 ) Summary: The default c_shim version was switched to 2 for HIP in D60674018. This results in some linking errors where shim function symbols are missing from the compiled .so file (eg. P1551186492) when building lowering benchmark scripts since the required files aren't included. Hipify the shim v2 generated header files as well since they're needed during codegen when the buck binaries are executed. Reviewed By: frank-wei, zoranzhao, henryoier Differential Revision: D61865202 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134689 Approved by: https://github.com/zoranzhao	2024-08-28 21:53:24 +00:00
PyTorch MergeBot	7b3da5f297	Revert "[dynamo] Cache _dynamo.disable results (#134272 )" This reverts commit dbef2b05b4d81e891f7497f92f730a22bebe445d. Reverted https://github.com/pytorch/pytorch/pull/134272 on behalf of https://github.com/anijain2305 due to Peak mem increase detected internally ([comment](https://github.com/pytorch/pytorch/pull/134272#issuecomment-2316308170))	2024-08-28 21:51:43 +00:00
Jia Li	20b62fed21	Create processes in parallel in mp.start_processes for forkserver (#134629 ) Summary: This fixes the PyTorch issue https://github.com/pytorch/pytorch/issues/133010. One way to fix the problem is to enable starting processes in parallel in mp.start_processes. What else is in the diff: refactored the test case api_test, which was repeating many tests because of inheritance, and added a unit test for forkserver with parallel start turned on. Test Plan: Added unit tests Differential Revision: D61878552 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134629 Approved by: https://github.com/d4l3k	2024-08-28 21:34:32 +00:00
Nowtryz	f685018ea9	Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262 ) Hi, I noticed the `unfold` operator was missing on MaskedTensor. I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it? Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262 Approved by: https://github.com/cpuhrsch	2024-08-28 21:30:39 +00:00
Nowtryz	b6e51711a0	Add MaskedTensor support to *_like API (#128637 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128637 Approved by: https://github.com/cpuhrsch	2024-08-28 21:28:23 +00:00
fduwjj	4c16797e71	[c10d FR analyzer] Output a meaningful debug report for users (#134528 ) - This PR generates a more useful output log for users: P1552399180. - It also fixes the logic when we check the all-gather size mismatch. - Add dtype check for collective input/output - We store more context information for error match_state so that we can report them in the file. - Disable the size match for alltoall because we don't log the size for all inputs/outputs. - Correct some types for func args specification. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134528 Approved by: https://github.com/c-p-i-o	2024-08-28 21:22:47 +00:00
Sanket Purandare	de35d3062f	Runtime Estimator for estimating GPU compute time (#134243 ) This PR adds a basic Runtime Estimator for single-device models. It estimates the GPU runtime in milliseconds using various estimation methods under the ``FakeTensorMode``. It provides a ``TorchDispatchMode`` based context manager that can estimate the eager runtime of PyTorch functions. It supports two estimation modes, benchmarking (`operator-level-benchmark`) and roofline cost modeling (`operator-level-cost-model`). For modules executed under this context manager, it aggregates the forward and backward operation runtimes and records their execution orders. ``` import torch from torch import nn, optim from torch._subclasses.fake_tensor import FakeTensorMode from torch.distributed._tools.runtime_estimator import RuntimeEstimator from torch.testing._internal.distributed._tensor.common_dtensor import ( ModelArgs, Transformer, ) if __name__ == "__main__": def _train_step( model: nn.Module, optimizer: optim.Optimizer, inp: torch.Tensor, ): out = model(inp) loss = out.sum() loss.backward() optimizer.step() optimizer.zero_grad() dev = torch.cuda.current_device() vocab_size = 8192 bsz, seq_len = 32, 1024 model_args = ModelArgs( n_layers=4, n_heads=12, vocab_size=vocab_size, max_seq_len=seq_len, dim=768, dropout_p=0.1, ) runtime_estimator = RuntimeEstimator() with FakeTensorMode(): with torch.device(dev): model = Transformer(model_args) optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True) inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev) with runtime_estimator("operator-level-benchmark"): _train_step(model, optimizer, inp) with runtime_estimator("operator-level-cost-model"): _train_step(model, optimizer, inp) # Actual model runtime with torch.device(dev): model = Transformer(model_args) optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True) inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), 
device=dev) warmup_iters, actual_iters = 2, 5 start_event = torch.cuda.Event(enable_timing=True) end_event = torch.cuda.Event(enable_timing=True) for _ in range(warmup_iters): _train_step(model, optimizer, inp) start_event.record() for _ in range(actual_iters): _train_step(model, optimizer, inp) end_event.record() torch.cuda.synchronize() measured_time = start_event.elapsed_time(end_event) / actual_iters print(f"Actual total_time: {measured_time:.3f} ms") ``` <img width="506" alt="Screenshot 2024-08-26 at 11 27 15 PM" src="https://github.com/user-attachments/assets/04d243c9-21a6-4389-8c20-80958980788c"> @weifengpy @xuanzhang816 @gnadathur Pull Request resolved: https://github.com/pytorch/pytorch/pull/134243 Approved by: https://github.com/weifengpy	2024-08-28 20:06:54 +00:00
Manuel Candales	cae817c862	[ET][CodeGen] Remove TORCH_API from NativeFunctions.h declarations (#134245 ) Summary: Remove TORCH_API from the generated executorch/kernels/portable/NativeFunctions.h declarations These generated declarations are using ET tensors. They don't need to have the TORCH_API macro prefixed to them, since in this case TORCH_API is just empty. See [codegen/macros.h](https://www.internalfb.com/code/fbsource/[d12d7d3accfb12932368e0216124f2d735c51d73]/fbcode/executorch/codegen/macros.h) Test Plan: CI Differential Revision: D61490943 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134245 Approved by: https://github.com/larryliu0820	2024-08-28 19:58:37 +00:00
Yidi Wu	b07d0a22f5	[hop] require hops to override __call__. (#134352 ) Fixes https://github.com/pytorch/pytorch/issues/133719 by making `__call__` of hops an abstractmethod. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134352 Approved by: https://github.com/zou3519	2024-08-28 19:56:40 +00:00
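The effect of making `__call__` an abstract method can be shown with the standard `abc` module alone. This is a minimal sketch of the technique under assumed names: `HigherOrderOpBase`, `Cond`, and `Incomplete` are hypothetical classes for illustration and are not the classes in torch.

```python
from abc import ABC, abstractmethod


class HigherOrderOpBase(ABC):
    """Hypothetical base class: subclasses must define __call__."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def __call__(self, *args, **kwargs):
        ...


class Cond(HigherOrderOpBase):
    """A subclass that overrides __call__ and can therefore be instantiated."""

    def __call__(self, pred, true_fn, false_fn, *operands):
        return true_fn(*operands) if pred else false_fn(*operands)


class Incomplete(HigherOrderOpBase):
    """Forgets to override __call__, so instantiation raises TypeError."""
```

With this pattern a missing override fails at construction time with a `TypeError`, instead of silently falling back to whatever the base class does when the operator is first called.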
PyTorch MergeBot	66c33d5989	Revert "[2/N] Add flag to control which rank should perform NaN check (#134345 )" This reverts commit be7752ead3824e79f5ede6a2f59715b415a2f776. Reverted https://github.com/pytorch/pytorch/pull/134345 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134345#issuecomment-2316133024))	2024-08-28 19:51:59 +00:00
PyTorch MergeBot	23e26b84af	Revert "[3/N] Set correct device to CUDA guards (#134357 )" This reverts commit 13114da4ef9d14978ea1dfc0fefb236cb4000435. Reverted https://github.com/pytorch/pytorch/pull/134357 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134357#issuecomment-2316121423))	2024-08-28 19:44:55 +00:00
Gregory Comer	3b40b07efb	Update PyTorch for XNNPACK 87ee0b4 (#134518 ) Summary: Update XNNPACK library version. Test Plan: Combined diff CI is clean: D61586079 (all changes, has to be split out for export). Differential Revision: D61822610 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134518 Approved by: https://github.com/mcr229	2024-08-28 19:24:04 +00:00
Animesh Jain	042b733ddd	[dynamo][freezing] Set is_static_type to false after marking an input static (#134653 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134653 Approved by: https://github.com/mlazos	2024-08-28 19:22:37 +00:00
Andrew Gu	aa31e7019a	[FSDP] Made `clip_grad_norm_` norm compute order deterministic (#134673 ) Fixes https://github.com/pytorch/pytorch/issues/134393 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134673 Approved by: https://github.com/weifengpy ghstack dependencies: #134152	2024-08-28 18:44:11 +00:00
Simon Fan	47ba47a81f	[compiled autograd] error instead of deadlock on reentrant autograd (#134530 ) reentrant calls autograd multiple times using the same thread, so it passes all the thread checks and hangs waiting for the lock it holds in another scope Pull Request resolved: https://github.com/pytorch/pytorch/pull/134530 Approved by: https://github.com/jansel ghstack dependencies: #134514	2024-08-28 17:54:31 +00:00
Simon Fan	c352b6aaaf	[compiled autograd][cpp node] point c++ custom autograd functions tracing error to google doc (#134514 ) `RuntimeError: Attempting to trace a potentially unsafe C++ autograd function: torch::autograd::CppNode<CustomOpAutogradFunction>. It may be possible to trace it safely, please refer to the instructions in: https://docs.google.com/document/d/11VucFBEewzqgkABIjebZIzMvrXr3BtcY1aGKpX61pJY/.` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134514 Approved by: https://github.com/yf225	2024-08-28 17:54:31 +00:00
Xilun Wu	ba5aec88c6	[reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509 ) Summary reland of https://github.com/pytorch/pytorch/pull/134294 Fixes #131446 Fixes #126852 Fixes #126868 Fixes #126493 The PR was reverted due to CI red signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294 . Therefore this PR also removes the `xfail` mark on this specific test to make CI signal green. See the error message below: ``` 2024-08-24T13:42:01.3228990Z ==================================== RERUNS ==================================== 2024-08-24T13:42:01.3229530Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _ 2024-08-24T13:42:01.3229710Z Unexpected success 2024-08-24T13:42:01.3230235Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _ 2024-08-24T13:42:01.3230407Z Unexpected success 2024-08-24T13:42:01.3230594Z =================================== FAILURES =================================== 2024-08-24T13:42:01.3231128Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _ 2024-08-24T13:42:01.3231296Z Unexpected success ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509 Approved by: https://github.com/tianyu-l, https://github.com/wz337	2024-08-28 17:51:44 +00:00
Bin Bao	310eb6d8c6	[AOTI] Fix test_aoti_inference CPU build issue (#134675 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/130311. We need to guard CUDA-only code in test_aoti_inference with macros so that it won't fail for CPU-only platform. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134675 Approved by: https://github.com/atalman, https://github.com/chunyuan-w	2024-08-28 17:42:19 +00:00
Laith Sakka	633a9a3b13	add back sum_floordiv benchmark. (#134635 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134635 Approved by: https://github.com/avikchaudhuri, https://github.com/oulgen ghstack dependencies: #133834	2024-08-28 17:38:24 +00:00
Banit Agrawal	b8859dc4b8	[PyTorch Pin Memory Allocator] Optimize the free list implementation and add lock sharding (#134154 ) Summary: This diff addresses the lock contention issue in free list implementation of CachingHost/Pinned allocator. We add a different data structure for free list and also add lock sharding based on allocation size. Differential Revision: D61623367 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134154 Approved by: https://github.com/guangyey, https://github.com/jgong5, https://github.com/zyan0, https://github.com/EikanWang, https://github.com/jiayisuse	2024-08-28 17:12:01 +00:00
Chien-Lin Chen	40de63be09	parameterized test_graph_optims and test_graph_scaling_fused_optimizers (#133749 ) Fixes #123451 This is a rework of a reverted pull request, https://github.com/pytorch/pytorch/pull/125127. The test failure is fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133749 Approved by: https://github.com/janeyx99	2024-08-28 16:34:06 +00:00
Chien-Chin Huang	c7338f457c	[DCP] Fixes the BC issue where the traversal doesn't support versions before 2.4 (#134158 ) The original DCP doesn't flatten all the containers, which can cause issues. https://github.com/pytorch/pytorch/pull/125335 intends to solve the issue by flattening all the dictionaries. Unfortunately, it breaks the checkpoints that were saved before 2.4. This also shows some issues of the DCP: 1. DCP should record version in the metadata. 2. DCP should have a nice way to load old state_dict. 3. DCP should unflatten all containers (map, list) not just map. This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future. @pradeepfn Please let me know if this summary matches our discussion. Fixes https://github.com/pytorch/pytorch/issues/133923 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134158 Approved by: https://github.com/wz337, https://github.com/pradeepfn	2024-08-28 16:31:44 +00:00
PyTorch MergeBot	13d40f6fc5	Revert "hang dim hint constants off Dim (#134484 )" This reverts commit c142af7209a423a05504fdec50680333f5a37629. Reverted https://github.com/pytorch/pytorch/pull/134484 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134484#issuecomment-2315749549))	2024-08-28 16:05:42 +00:00
PyTorch MergeBot	2c88a923a7	Revert "Refactor caching device allocator utils (#130923 )" This reverts commit c45ca8092dddf718563a1a754de798ad25eae1ee. Reverted https://github.com/pytorch/pytorch/pull/130923 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be causing internal tests to fail with errors like `error: no type named 'DeviceStats' in namespace 'xxx::xxx:xxxAllocator'; did you mean 'DeviceStatus'?` ([comment](https://github.com/pytorch/pytorch/pull/130923#issuecomment-2315730155))	2024-08-28 15:56:08 +00:00
PyTorch MergeBot	d52aff3e73	Revert "Adding entry-point based support for out-of-tree rendezvous plugins (#132633 )" This reverts commit 136b19b062f62c81ea3ed8fb306debe9d7720e93. Reverted https://github.com/pytorch/pytorch/pull/132633 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing internal tests to fail with the error `ImportError: cannot import name '_register_out_of_tree_handlers' from 'torch.distributed.elastic.rendezvous.registry'` ([comment](https://github.com/pytorch/pytorch/pull/132633#issuecomment-2315716201))	2024-08-28 15:49:18 +00:00
chuanqiw	85d9946001	[CI] change conda to miniforge for XPU images (#134455 ) The `.ci/docker` change with `ciflow/xpu` label will trigger docker images rebuild on xpu runner, but xpu runner can't use miniconda, change to miniforge. Works for https://github.com/pytorch/pytorch/issues/114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134455 Approved by: https://github.com/atalman	2024-08-28 15:16:14 +00:00
Mao, Yunfei	208b922327	[Intel GPU] Remove special dispatch logic for xpu in adaptive_avg_pooling (#132217 ) We now align the dispatch logic for XPU with CUDA in the adaptive average pooling operation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132217 Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/albanD, https://github.com/malfet	2024-08-28 15:06:35 +00:00
Bin Bao	e6bf1710ff	[Inductor][Refactor] Rename CPU benchmark test configs (#134639 ) Summary: benchmarks/dynamo/ci_expected_accuracy/update_expected.py expects a benchmark run config is named as {config}_{benchmark}, and CPU tests should follow the same naming convention. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134639 Approved by: https://github.com/huydhn	2024-08-28 14:49:55 +00:00
Avik Chaudhuri	c142af7209	hang dim hint constants off Dim (#134484 ) Summary: Recently https://github.com/pytorch/pytorch/pull/133620 added support for automatic dynamic shapes, where a new enum, `DIM`, was introduced to provide hints like `AUTO` and `STATIC`. This PR is a nominal change where we expose the hints via the existing public `Dim` API, and remove `DIM` from the public API. The main motivation is to avoid having users need to import too many things. Test Plan: existing Differential Revision: D61807361 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134484 Approved by: https://github.com/angelayi	2024-08-28 14:35:40 +00:00
Spencer Gibson	3e42f21eee	Bucketize fix to include number and tensor inputs (#133652 ) Fixes #132222 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133652 Approved by: https://github.com/ezyang	2024-08-28 13:35:41 +00:00
IvanKobzarev	bb22132c8d	[aotd] Make effects op registry WeakKeyDictionary (#134470 ) The op is used as a dictionary key, but an op can be deregistered. As a result, the key would keep the op from being deallocated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134470 Approved by: https://github.com/zou3519	2024-08-28 12:12:00 +00:00
Yanbo Liang	97c8a0739e	[Dynamo] Support inspect.signature.Parameter getattr (#134636 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134636 Approved by: https://github.com/Chillee, https://github.com/anijain2305	2024-08-28 09:59:41 +00:00
Will Feng	26e392132d	[2nd try][Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#134539 ) The previous PR https://github.com/pytorch/pytorch/pull/133532 caused stuck compilation issue on internal models. In this 2nd attempt PR, we gate the trace_rules.py changes with `if not torch._dynamo.config.skip_fsdp_hooks:`, so that they don't take effect for current graph-break FSDP2 (which relies on the default config value `skip_fsdp_hooks=True`), and will only take effect when we are using Traceable FSDP2 (in which case the user needs to proactively set `skip_fsdp_hooks=False`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/134539 Approved by: https://github.com/ckluk2, https://github.com/yanboliang	2024-08-28 08:57:56 +00:00
Yanbo Liang	8693322ef0	[Dynamo][autograd.Function] Support mark_non_differentiable (#134087 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134087 Approved by: https://github.com/zou3519	2024-08-28 08:12:37 +00:00
Ke Wen	d01415409b	[PGNCCL] Improve logic to infer device for barrier (#134617 ) Fixes #134391, #124714 The above issues reported that `dist.barrier()` could hang in some cases. The culprit is that ProcessGroupNCCL inferred a wrong device to perform the dummy all-reduce. After the PR, the following will be the order of device selection: - 1st choice: `opts.device_ids`, if provided by user via `barrier(opts)`. - 2nd choice: bound device id, if provided to `init_process_group` via `device_id` arg. - 3rd choice: `usedDeviceIdxs_` recorded in current PG. Will have a value from previous collectives. - 4th choice: `globalRank() % localDeviceCount_`. This can only happen when `dist.barrier()` is the first call of the PG. What's new: - Added the 2nd choice. - In the 4th choice, we use `globalRank()` instead of group-local rank, because the group-local rank can be offset wrt the device id if intra-node GPUs are sharded into multiple dimensions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134617 Approved by: https://github.com/yifuwang, https://github.com/shuqiangzhang	2024-08-28 08:12:09 +00:00
Animesh Jain	e4a5958ab5	[dynamo] Graph break on FSDP flat_param inconsistent tensor and grad dtype (#134614 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134614 Approved by: https://github.com/awgu, https://github.com/yf225 ghstack dependencies: #134610, #134590, #134621	2024-08-28 07:35:24 +00:00
Animesh Jain	e96dc3665a	[reland][dynamo][exceptions] Support raise from None (#134621 ) The PR was reverted because this PR traced more code and surfaced a latent bug. Resubmitting w/o any changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134621 Approved by: https://github.com/jansel ghstack dependencies: #134610, #134590	2024-08-28 07:35:23 +00:00
Animesh Jain	c566f2465f	[dynamo][dicts] Support hasattr on dicts (#134590 ) Fixes - https://github.com/pytorch/pytorch/issues/134577 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134590 Approved by: https://github.com/Skylion007 ghstack dependencies: #134610	2024-08-28 07:35:18 +00:00
Animesh Jain	880e3d18a4	[dynamo][exceptions] Use exception subclass whenever possible (#134610 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134610 Approved by: https://github.com/drisspg, https://github.com/jansel	2024-08-28 07:35:12 +00:00
xingyuan li	bf7db4e4f9	[Inductor UT] Generalize inductor UT for intel GPU (#133309 ) [Inductor UT] Generalize Inductor test case for Intel GPU. - Reuse `test/inductor/test_decompose_mem_bound_mm.py` - Reuse `test/inductor/test_inplacing_pass.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133309 Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/etaf	2024-08-28 06:17:43 +00:00
haozhe.zhu	2ba60a1618	fix torch.prod vectorized path for bool (#128009 ) Fix https://github.com/pytorch/pytorch/issues/127866. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128009 Approved by: https://github.com/jgong5, https://github.com/albanD	2024-08-28 05:27:50 +00:00
Rachel Guo	89929d9abc	[AOTI][Tooling][4/n] Add `torch.save()` for individual intermediate tensor (#133871 ) Differential Revision: D61415304 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133871 Approved by: https://github.com/ColinPeppler	2024-08-28 04:48:00 +00:00
PyTorch UpdateBot	ca77f0a986	[executorch hash update] update the pinned executorch hash (#133386 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133386 Approved by: https://github.com/pytorchbot	2024-08-28 04:16:42 +00:00
PyTorch UpdateBot	e3308d835d	[audio hash update] update the pinned audio hash (#134632 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134632 Approved by: https://github.com/pytorchbot	2024-08-28 04:16:25 +00:00
cyy	bb4dfe90b8	[Reland] [1/N] Fix clang-tidy warnings in inductor (#134544 ) Reland #131979 and exclude aoti_torch_index_put_out changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134544 Approved by: https://github.com/ColinPeppler	2024-08-28 04:05:06 +00:00
Yiming Zhou	71d0eff6e7	Back out "[pytorch][PR] [export] Schematize nn_module_stack serialization" (#134628 ) Summary: Breaking backward compatibilities for serialization and deserialization Differential Revision: D61888223 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134628 Approved by: https://github.com/angelayi	2024-08-28 03:45:46 +00:00
cyy	ec3f52dd27	[21/N] Fix clang-tidy warnings in jit (#134537 ) Follows #133399 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134537 Approved by: https://github.com/Skylion007	2024-08-28 03:22:01 +00:00
Ke Wen	5beb859e74	[BE] no need to print stream in comm abort (#134362 ) Strictly speaking, NCCL communicator has nothing to do with CUDA streams. Thus, we don't need to print stream in comm abort's message. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134362 Approved by: https://github.com/fduwjj, https://github.com/wconstab	2024-08-28 02:14:18 +00:00
Tristan Rice	f33bcbe5fd	c10d/logging: add C10D_LOCK_GUARD (#134131 ) This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s. This is motivated by some deadlocks we're seeing, and it's unclear whether they are in NCCL or on the PyTorch side of things. This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate. Test plan: existing CI for regressions; will add unit tests on `C10D_LOCK_GUARD` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131 Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj	2024-08-28 01:40:42 +00:00
Yu, Guangye	c45ca8092d	Refactor caching device allocator utils (#130923 ) # Motivation Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322), this PR aims to refactor caching device allocator utils to improve code reuse usage. This is the first PR, we could prepare some follow-up PRs continuing to refactor the device caching allocator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy	2024-08-28 01:35:23 +00:00
atalman	d96254631e	[CD] Fix docker builds by installing setuptools after python build (#134631 ) Follow up after https://github.com/pytorch/pytorch/pull/134595 Same error happens silently before the error addressed in the above PR (and build continues and builds invalid Docker): ``` #47 457.5 Traceback (most recent call last): #47 457.5 File "<string>", line 1, in <module> #47 457.5 File "/opt/_internal/cpython-3.12.0/lib/python3.12/site-packages/wheel/pep425tags.py", line 3, in <module> #47 457.5 import distutils.util #47 457.5 ModuleNotFoundError: No module named 'distutils' #47 457.5 + local abi_tag= #47 457.5 + ln -s /opt/_internal/cpython-3.12.0 /opt/python/ #47 457.5 + rm -f Python-3.12.0.tgz ``` The fix in https://github.com/pytorch/pytorch/pull/134595 is no longer needed since we will install setuptools right after python installation. Link: https://github.com/pytorch/pytorch/actions/runs/10584642913/job/29329366729#step:6:6041 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134631 Approved by: https://github.com/kit1980	2024-08-28 01:17:41 +00:00
Sun, Jiayi	2b95da7ef4	allow conv_bn mixed dtype folding in post-grad (#133968 ) This PR relaxes the condition to allow conv_bn mixed dtype folding in post-grad. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133968 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2024-08-28 01:02:09 +00:00
FFFrog	f7467c3b95	using new device-agnostic api instead of old api like torch.cpu or torch.cuda (#134448 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134448 Approved by: https://github.com/guangyey, https://github.com/shink, https://github.com/albanD	2024-08-28 01:01:49 +00:00
Pian Pawakapan	0c7856973b	[export] enumerate unsupported sympy.Functions (#134271 ) (#134598 ) Summary: There's 2 concepts of unsupported sympy.Functions in symbolic_shapes: 1) unsupported by the export solver, meaning the solver doesn't know how to provide useful fixes for those functions 2) unsupported by the sympy interpreter - meaning we can't reify them into FX nodes because the functions aren't present in PythonReferenceAnalysis This splits the current call into a call for each version, with the Export solver the only user of 1). For 1), we enumerate the functions in _sympy/functions.py, and subtract the functions we know we can support. For 2) there's only 3 functions we've seen pop up in test cases. cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 Differential Revision: D61863394 Pulled By: pianpwk Pull Request resolved: https://github.com/pytorch/pytorch/pull/134598 Approved by: https://github.com/angelayi	2024-08-28 00:34:38 +00:00
albanD	3b33f26513	Add device daemon (#131814 ) Base implementation aiming towards https://github.com/pytorch/rfcs/pull/64 Details of the implementation and next steps in https://github.com/pytorch/pytorch/blob/gh/albanD/3/head/test/cpp_extensions/open_registration_extension/README.md Pull Request resolved: https://github.com/pytorch/pytorch/pull/131814 Approved by: https://github.com/ezyang	2024-08-27 23:32:07 +00:00
Laith Sakka	d6091c8726	Add compile time instruction count metric (#133834 ) PYTHONPATH=$(pwd) python benchmarks/update_hint_benchmark.py out As of this diff, compile_time_instruction_count counts the number of instructions from within convert_frame.compile_inner ``` update_hint_regression,compile_time_instruction_count,10522459165 ``` Will add results from CI once populated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133834 Approved by: https://github.com/aorenste	2024-08-27 23:29:02 +00:00
Max Podkorytov	ef0f5919c7	[ROCm][Inductor][CK] Fix codegen after ck signature change (#134483 ) MakeArgument signature was changed in https://github.com/ROCm/composable_kernel/pull/1453 adding splitK argument to universal gemm templates which are used to codegen addmm and matmul (part of the series started at #125453 ) # Testing `pytest test/inductor/test_ck_backend.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134483 Approved by: https://github.com/ColinPeppler	2024-08-27 23:25:42 +00:00
Pian Pawakapan	5ead965026	[export] don't duck size for DIM.AUTO (#134486 ) Summary: apparently DIM.AUTO leads to duck sizing, I didn't catch this. Doing the least intrusive fix possible by using `torch._dynamo.maybe_mark_dynamic()` under the hood. Test Plan: added test Differential Revision: D61809344 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134486 Approved by: https://github.com/avikchaudhuri	2024-08-27 23:00:26 +00:00
PyTorch MergeBot	30094bedbc	Revert "[dynamo][dicts] Support hasattr on dicts (#134590 )" This reverts commit d23c0150f3ba5fd1162358e9e7b0e72e7308c87e. Reverted https://github.com/pytorch/pytorch/pull/134590 on behalf of https://github.com/anijain2305 due to causing trunk CI failures ([comment](https://github.com/pytorch/pytorch/pull/134590#issuecomment-2313705582))	2024-08-27 22:52:52 +00:00
drisspg	d966d91e37	[FlexAttention] Fix Sparse block multiple to ceildiv instead for floor div (#134538 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134538 Approved by: https://github.com/yanboliang ghstack dependencies: #134507, #134511	2024-08-27 22:04:57 +00:00
drisspg	f5c67917d3	[FlexAttention] Remove unused code (#134511 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134511 Approved by: https://github.com/yanboliang ghstack dependencies: #134507	2024-08-27 22:04:57 +00:00
drisspg	856a8410f2	[FlexAttention] Create new variables for the subgraphs (#134507 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134507 Approved by: https://github.com/yanboliang, https://github.com/BoyuanFeng	2024-08-27 22:04:57 +00:00
Nikita Shulga	41e512a4cd	[EZ] Restore `test_unicode_comments` (#134589 ) This reverts changes introduced to test_jit.py by `43737bd78a` and adds lint suppression for it. As the test name suggests, it should have a unicode comment to make sure our parser can handle it. Part of the fix for https://github.com/pytorch/pytorch/issues/134422 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134589 Approved by: https://github.com/aorenste, https://github.com/Skylion007	2024-08-27 21:51:06 +00:00
Bob Ren	1ba39ec1d0	Add test case test_arange_length_with_float32_dtype (#134415 ) Adding a test as a followup from https://github.com/pytorch/pytorch/pull/134296 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134415 Approved by: https://github.com/ezyang	2024-08-27 21:36:23 +00:00
PaliC	b58a0c3c4d	[split build] fix distributed problems (#134502 ) Should fix the issue where USE_C10D_NCCL was not getting propagated to libtorch_python.so Pull Request resolved: https://github.com/pytorch/pytorch/pull/134502 Approved by: https://github.com/yifuwang	2024-08-27 21:12:58 +00:00
David Berard	289486d007	Move attention kernels back from fake_impls to meta_registrations (#134288 ) See #121528 for additional context. In #120682, we moved the attention kernels from meta_registrations to fake_impls with the intent of fixing the device handling for seed/offset: these are typically on CPU. We needed to put the registrations in fake_impls to do this because meta_registrations doesn't have a way to specify device, whereas fake_impls does. But when we tried to actually fix the device types (#120839), we had to revert the PR because it broke cudagraph handling (during which seed/offset _are_ on CUDA). Now, we want to put the registrations back in meta_registrations so that we can call these kernels with meta tensors. The use case is later in this stack - we want to be able to use the flop counter with these kernels. Also - I specifically skip the `compare_tensor_meta()` check in test_fake / test_fake_autocast tests for the `_efficient_attention_forward` and `_flash_attention_forward` kernels, which fails because of the device mismatch from the seed/offset tensors. Then we can un-skip these opinfos. I verified that the efficient_attention_forward bug (#120842) is now caught by these opinfos if I revert the fix from this PR. Differential Revision: [D61687369](https://our.internmc.facebook.com/intern/diff/D61687369) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134288 Approved by: https://github.com/drisspg	2024-08-27 21:10:36 +00:00
rzou	39ca96398b	Update label_to_label with oncall: pt2 hierarchy. (#134582 ) Test Plan: - None Pull Request resolved: https://github.com/pytorch/pytorch/pull/134582 Approved by: https://github.com/clee2000	2024-08-27 21:05:40 +00:00
cyy	b567ca0f51	Remove unused imported names in python files (#134438 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134438 Approved by: https://github.com/zou3519	2024-08-27 20:44:04 +00:00
Animesh Jain	d23c0150f3	[dynamo][dicts] Support hasattr on dicts (#134590 ) Fixes - https://github.com/pytorch/pytorch/issues/134577 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134590 Approved by: https://github.com/Skylion007 ghstack dependencies: #134039	2024-08-27 20:43:40 +00:00
Bo Li	16b8146c9e	Exclude test_transformers and unit tests which require recent GPU arch (#132895 ) This PR is to exclude test_transformers on ROCm temporarily and skip some unit tests which require recent GPU arch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132895 Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet	2024-08-27 20:40:53 +00:00
Yuanhao Ji	44dadf2506	[Fix] Check name when registering privateuse1 backend (#134071 ) Do some checks when registering a privateuse1 backend to avoid using in-tree device names. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134071 Approved by: https://github.com/albanD	2024-08-27 20:28:30 +00:00
Colin Peppler	f754c0ae1b	[easy] rm duplicate definition for inductor in TORCH_LOGS documentation (#134480 ) already defined in `2eb9339b71/torch/_logging/_internal.py (L286-L287)` Test Plan: Sandcastle run Differential Revision: D61806088 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134480 Approved by: https://github.com/eellison, https://github.com/mlazos	2024-08-27 20:15:10 +00:00
Moritz Hennen	fe6d0e3a04	Do not compute unnecessary `tensor!=0` for bool tensors in `count_nonzero` (#134254 ) Updated aten/src/ATen/native/TensorAdvancedIndexing.cpp to only reduce non-bool tensors before computing a sum Since I have no expertise for MPS, I did leave the MPS backend untouched. Also, in `count_nonzero_impl` for CPU, I assumed the comparison can be optimized by the compiler for boolean values? `90c821814e/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L2262-L2264)` Fixes #133983 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134254 Approved by: https://github.com/albanD	2024-08-27 20:09:29 +00:00
xpfjmj	b744ed6816	Add a cpu_dispatch_key parameter to the cpu_fallback function (#134321 ) Fixes #134322 Add a cpu_dispatch_key parameter to the cpu_fallback function to support fallback, for example, to SparseCPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134321 Approved by: https://github.com/albanD	2024-08-27 19:57:57 +00:00
Ivan Duka	adf401f822	Links to contributors' GitHub accounts (#133787 ) Maintainers have the links to their GitHub profiles, but the major contributors do not have them. I added the links to the contributors' GitHub accounts in case anyone wants to follow them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133787 Approved by: https://github.com/albanD	2024-08-27 19:56:08 +00:00
Nikita Shulga	534f43ddce	[Doc] Fix rendering of the unicode characters (#134597 ) https://github.com/pytorch/pytorch/pull/124771 introduced unicode escape sequences inside raw strings, which were not rendered correctly. Also fix typo in `\uue0 ` escape sequence (should have been `\u00e0`) Fix it by relying on [string literal concatenation](https://docs.python.org/3/reference/lexical_analysis.html#string-literal-concatenation) to join raw and regular strings together during lexical analysis stage Fixes https://github.com/pytorch/pytorch/issues/134422 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134597 Approved by: https://github.com/aorenste, https://github.com/Skylion007	2024-08-27 19:52:46 +00:00
Jerry Zhang	3ef4c27ab3	Update pt2e numeric debugger to use node.meta["custom"] field (#134040 ) Summary: With https://github.com/pytorch/pytorch/pull/131912 we now have a "custom" field in node.meta that can be preserved in * copy/deepcopy * run_decompositions() * serialization * re-exporting So we refactored numeric debugger to use this. Test Plan: python test/test_quantization.py TestNumericDebugger Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/134040 Approved by: https://github.com/tarun292	2024-08-27 19:51:03 +00:00
Xu Han	ed494603c7	[inductor] calibration inductor windows uts (16/N) (#134587 ) skip UT for `test/inductor/test_compiled_autograd.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134587 Approved by: https://github.com/jansel	2024-08-27 19:45:02 +00:00
Xu Han	b094972051	[inductor] calibration inductor windows uts (17/N) (#134588 ) skip UTs for `test/inductor/test_minifier_isolate.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134588 Approved by: https://github.com/jansel	2024-08-27 19:41:17 +00:00
Xu Han	9d0e0e6f1d	[inductor] calibration inductor windows uts (14/N) (#134585 ) skip UT for `test/dynamo/test_exc.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134585 Approved by: https://github.com/jansel	2024-08-27 19:40:56 +00:00
Roy Hvaara	05ac7cd760	[MPS] Remove superfluous label/link (#134090 ) This was probably intended to be a comment. I removed it since the issue is already linked in the warning below. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134090 Approved by: https://github.com/albanD	2024-08-27 19:37:33 +00:00
atalman	d5aefadb17	[CD] Fix docker builds by installing setuptools (#134595 ) Seeing failures like this: ``` #49 844.6 //build_scripts/manylinux1-check.py:6: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives ..... [python 3/3] RUN bash build_scripts/build.sh && rm -r build_scripts: 846.9 ...it did, yay. 846.9 + for PYTHON in '/opt/python/*/bin/python' 846.9 + /opt/python/cpython-3.12.0/bin/python build_scripts/manylinux1-check.py 847.0 Traceback (most recent call last): 847.0 File "//build_scripts/manylinux1-check.py", line 55, in <module> 847.0 if is_manylinux1_compatible(): 847.0 ^^^^^^^^^^^^^^^^^^^^^^^^^^ 847.0 File "//build_scripts/manylinux1-check.py", line 6, in is_manylinux1_compatible 847.0 from distutils.util import get_platform 847.0 ModuleNotFoundError: No module named 'distutils' ------ ``` PR: https://github.com/pytorch/pytorch/pull/134455 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134595 Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet	2024-08-27 19:31:44 +00:00
Bin Bao	a4b44dd2ef	[AOTI] Introduce DeferredCudaGridLine for cuda cpp wrapper (#129268 ) Summary: Similar to https://github.com/pytorch/pytorch/pull/129135, use DeferredCudaGridLine to create a deferred grid computation line when generating cpp wrapper. Differential Revision: [D61800622](https://our.internmc.facebook.com/intern/diff/D61800622) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129268 Approved by: https://github.com/angelayi	2024-08-27 19:23:25 +00:00
Xinya Zhang	5fd670e0ef	[ROCM] Properly disable Flash Attention/Efficient Attention with environment variables (#133866 ) Now `USE_FLASH_ATTENTION=0 USE_MEM_EFF_ATTENTION=0 python setup.py` can compile correctly Fixes #125230 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133866 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily, https://github.com/malfet	2024-08-27 18:24:29 +00:00
PyTorch MergeBot	5b392d22c6	Revert "fix stuck floordiv (#134150 )" This reverts commit 92c4771853892193d73d87bd60eca4dc7efc51d8. Reverted https://github.com/pytorch/pytorch/pull/134150 on behalf of https://github.com/anijain2305 due to compile time regression internal ([comment](https://github.com/pytorch/pytorch/pull/134150#issuecomment-2313230404))	2024-08-27 18:23:44 +00:00
Xilun Wu	0159ebb654	[dtensor] add test for local_map decorator (#127752 ) Summary This PR is a follow-up of #126924 to address reviewer's comments: 1) add a test case to show the use of `local_map` as a function decorator. 2) simplify the logic of handling different data types of `out_placements`. 3) correct variable naming in test cases to match math formulas. Test see #126924 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127752 Approved by: https://github.com/wanchaol	2024-08-27 18:22:23 +00:00
Nikita Shulga	8de0d7690c	Use newer `toAccumulateType` signature in `Normalization.cpp` (#134540 ) This fixes BatchNorm behavior when called with empty tensors on the MPS backend. Removed `expectedFailureMPS` in test_nn.py, deleted the expected failure in `test_mps.py`, and changed `skipIfMPS` to `expectedFailureMPS` in the BatchNorm2d OpInfo decorator, restricting it to the memory format tests only. Test Plan: CI + `python3 -c "import torch; print(torch.nn.BatchNorm2d(3, device='mps')(torch.rand(0, 3, 2, 2, device='mps')))"` Fixes https://github.com/pytorch/pytorch/issues/134423 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134540 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-08-27 18:09:20 +00:00
Jessica Vandebon	68b1a09422	Integrate device agnostic APIs in FSDP library [1/n] (#134337 ) Summary: For MTIA FSDP support, we need to ensure the FSDP library code handles accelerator devices not limited to CUDA. Test Plan: CI Reviewed By: hanzlfs Differential Revision: D60587415 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134337 Approved by: https://github.com/LucasLLC, https://github.com/awgu	2024-08-27 17:31:11 +00:00
Colin Peppler	13049cd6e5	[aotinductor][UserDefinedTritonKernel] fix case with non-constexpr params declared after autotuned params (#134520 ) ## Context In some user Triton kernels, we have this set-up for whatever reason. ``` @triton.jit def mykernel( param0, param1, param2, param3: tl.constexpr, # autotuned param4, # non-constexpr ): ... ``` This is an edge case because it's a general practice to declare all constexprs params at the end. And this will be an issue for AOTI because it fails to codegen all 4 params. That will surface as a device-side error: CUDA IMA, invalid argument... ``` > void* kernel_args_var_0[] = {&var_0, &var_1, &var_2}; --- < CUdeviceptr var_3; < AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_data_ptr(buf0, reinterpret_cast<void*>(&var_3))); < void kernel_args_var_0[] = {&var_0, &var_1, &var_2, &var_3}; ``` ## Root-cause * `kernel.constexpr` from the Kernel side-table contains the indices for all `constexpr` params that includes autotuned params. * `raw_args`, that gets passed to wrapper codegen, excludes autotuned args. * In the wrapper codegen, we try to find non-constexpr args using `kernel.constexpr` & `raw_args`. This is okay unless there's a `raw_arg` after an autotuned param in the function signature. `79b7fff188/torch/_inductor/codegen/cpp_wrapper_cuda.py (L118-L126)` ## Fix We try to fix this, by calculating the right constexprs wrt `raw_args`. An illustration ``` raw_args: [arg0, arg1, arg2, arg4] kernel.arg_names: [param0, param1, param2, param3, param4] kernel.constexprs: [3] # param3 is autotuned; this is correct wrt kernel.arg_names constexpr_indices: [] # this is correct wrt raw_args ``` Differential Revision: [D61831625](https://our.internmc.facebook.com/intern/diff/D61831625) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134520 Approved by: https://github.com/oulgen	2024-08-27 17:20:27 +00:00
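The index remapping described in the fix above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `constexpr_indices_wrt_raw_args` and the `autotuned_names` parameter are hypothetical and are not taken from the PR; the sketch assumes that `raw_args` is simply the kernel signature with the autotuned parameters removed.

```python
def constexpr_indices_wrt_raw_args(arg_names, constexprs, autotuned_names):
    """Hypothetical helper: re-express constexpr indices relative to raw_args.

    arg_names:       full kernel signature, e.g. ["param0", ..., "param4"]
    constexprs:      indices of constexpr params w.r.t. arg_names
    autotuned_names: params supplied by the autotuner (absent from raw_args)
    """
    autotuned = set(autotuned_names)
    constexpr_set = set(constexprs)
    indices = []
    raw_idx = 0  # position the current param would have inside raw_args
    for i, name in enumerate(arg_names):
        if name in autotuned:
            continue  # autotuned params never appear in raw_args
        if i in constexpr_set:
            indices.append(raw_idx)
        raw_idx += 1
    return indices


# The illustration from the commit message: param3 is autotuned and constexpr,
# so no entry of raw_args is a constexpr.
arg_names = ["param0", "param1", "param2", "param3", "param4"]
print(constexpr_indices_wrt_raw_args(arg_names, [3], ["param3"]))
```

Under these assumptions, `param4` keeps its place in the generated argument array because the constexpr index `3` is no longer applied to the shorter `raw_args` list.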
Ke Wen	13114da4ef	[3/N] Set correct device to CUDA guards (#134357 ) In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA facing the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062. With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA. Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357 Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang ghstack dependencies: #134300, #134345	2024-08-27 16:38:15 +00:00
Ke Wen	be7752ead3	[2/N] Add flag to control which rank should perform NaN check (#134345 ) Fixes https://github.com/pytorch/pytorch/issues/134062. For example, in case of broadcast / scatter, only the root rank should perform the NaN check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134345 Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab ghstack dependencies: #134300	2024-08-27 16:33:59 +00:00
Colin L. Rice	9dc4bd7466	Create a JustknobConfig for use in config (#134161 ) This is designed to be a more ergonomic interface on top of justknob_feature (see https://github.com/pytorch/pytorch/pull/134151 for just the PR with the base commits). The idea is that people stop having to think about this as much, and can just do JustkobsConfig("//the:thing", "FORCE_THING") and it'll do the right thing. Primarily sending this to see how people feel about the API, and using it for new config changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134161 Approved by: https://github.com/ezyang	2024-08-27 16:07:33 +00:00
Ke Wen	94caba4899	[1/N] Move NaN check onto NCCL stream (#134300 ) This makes the tensor's lifetime management the same as the management built for the NCCL, pre, and post kernels. It also makes the NaN-check kernels show up on the NCCL stream line in visualizers; if they showed up on the compute line, users might be confused ("my code does not have these kernels"). The check is therefore moved to after the point where the NCCL stream is made to depend on the last compute kernel. Also moved the declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300 Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab	2024-08-27 16:02:27 +00:00
rzou	c582602245	Update partitioner's is_fusible heuristic to respect triton kernels (#134491 ) mutated arguments to triton kernels are fusible into the triton kernel. Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/134491 Approved by: https://github.com/Chillee ghstack dependencies: #134364, #134466, #134490	2024-08-27 15:57:32 +00:00
wz337	761cf91e3c	[DeviceMesh] Add get_all_submeshes in _MeshEnv (#134275 ) Adding a private helper method for Shampoo HSDP use cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134275 Approved by: https://github.com/XilunWu	2024-08-27 14:51:19 +00:00
Mikayla Gawarecki	d028b810fe	Fix flaky GroupNorm ModuleInfo test (#133899 ) Fixes https://github.com/pytorch/pytorch/issues/98677 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133899 Approved by: https://github.com/albanD	2024-08-27 14:45:51 +00:00
Mikayla Gawarecki	2033934ff8	Clarify error messages for NEWOBJ and BUILD in weights_only unpickler (#134346 ) Clarify that `add_safe_globals` will allow types for these instructions Some types do not appear as `GLOBAL` and are only caught in `BUILD`, example from hf slack is `numpy.dtypes.UInt32DType` ```python import torch import numpy as np from tempfile import TemporaryDirectory from pathlib import Path from codecs import encode torch.serialization.add_safe_globals([encode, np.dtype, np.core.multiarray._reconstruct, np.ndarray]) with TemporaryDirectory() as tempdir: p = Path(tempdir) r2 = np.random.get_state() torch.save(r2, p / "r2.pkl") torch.load(p / "r2.pkl", weights_only=True) ``` Yields (error comes from BUILD) ``` UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source. Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Can only build Tensor, parameter or OrderedDict objects, but got <class 'numpy.dtypes.UInt32DType'> ``` The reasoning is that `numpy.dtypes.UInt32DType` is constructed via `REDUCE` with `func =<class 'numpy.dtype'>` and `args= ('u4', False, True)`, clarify the error message that doing `add_safe_globals` on these will also allow them After this PR error message becomes ``` _pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source. 
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Can only build Tensor, Parameter, OrderedDict or types allowlisted via `add_safe_globals`, but got <class 'numpy.dtypes.UInt32DType'> ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134346 Approved by: https://github.com/albanD	2024-08-27 14:45:39 +00:00
Mikayla Gawarecki	2ac710e667	Make torch.serialization.set_default_mmap_options usable as a context manager (#134371 ) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/134371 Approved by: https://github.com/albanD	2024-08-27 14:45:29 +00:00
Nikita Shulga	0fa0ac80e4	Do not use `<filesystem>` on Linux (#134494 ) Because right now it leads to symbol conflict from binary builds. Use of `std::filesystem::file_exists` was introduced by https://github.com/pytorch/pytorch/pull/126601 and in this PR it is replaced with a very straightforward implementation that calls `stat` on the given path, which is a classic C-way of checking for the file existence. This PR should be reverted once one figures out how to keep `std::filesystem` methods linked into the binary private Fixes symptoms of https://github.com/pytorch/pytorch/issues/133437 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134494 Approved by: https://github.com/atalman, https://github.com/d4l3k	2024-08-27 14:44:10 +00:00
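The `stat`-based existence check mentioned above is a general technique; a minimal Python analogue is sketched here for illustration. It is not the C++ code from the PR, and the function name `file_exists` is only borrowed from the commit message.

```python
import os


def file_exists(path):
    """Return True if `path` can be stat'ed, mirroring the classic C idiom."""
    try:
        os.stat(path)  # raises OSError (e.g. FileNotFoundError) on failure
    except OSError:
        return False
    return True


print(file_exists(os.getcwd()))
```

As with the C idiom, a path that exists but cannot be stat'ed (for example, due to permissions on a parent directory) is reported as missing.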
PyTorch MergeBot	3418708abf	Revert "[FlexAttention] Create new variables for the subgraphs (#134507 )" This reverts commit 4d0a44d34a46af6dcc764d55269b30ac537822a0. Reverted https://github.com/pytorch/pytorch/pull/134507 on behalf of https://github.com/albanD due to Broke lint due to too long line ([comment](https://github.com/pytorch/pytorch/pull/134507#issuecomment-2312505955))	2024-08-27 13:05:27 +00:00
PyTorch MergeBot	87a3f664e1	Revert "[FlexAttention] Remove unused code (#134511 )" This reverts commit 767c47d3c0ee3fc7804918a08de3f94874143a03. Reverted https://github.com/pytorch/pytorch/pull/134511 on behalf of https://github.com/albanD due to Broke lint due to too long line ([comment](https://github.com/pytorch/pytorch/pull/134507#issuecomment-2312505955))	2024-08-27 13:05:27 +00:00
PyTorch MergeBot	3e10a1eb5a	Revert "[FlexAttention] Fix Sparse block multiple to ceildiv instead for floor div (#134538 )" This reverts commit a34320a6f225061a3b5fe130a5a8fe35ed7a40f9. Reverted https://github.com/pytorch/pytorch/pull/134538 on behalf of https://github.com/albanD due to Broke lint due to too long line ([comment](https://github.com/pytorch/pytorch/pull/134507#issuecomment-2312505955))	2024-08-27 13:05:27 +00:00
rzou	c7cbcdad76	Update partitioner's is_fusible heuristic to respect auto_functionalized (#134490 ) We say Node a is fusible into node b if node b is an auto_functionalized node that may reinplace node a later on. This PR also changes aten.empty to be recomputable w.r.t the Partitioner (it is, like aten.zeros, cheap to recompute and fusible into other ops). Fixes https://github.com/pytorch/pytorch/issues/134468 Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/134490 Approved by: https://github.com/Chillee ghstack dependencies: #134364, #134466	2024-08-27 13:05:01 +00:00
xinyu-intel	dde5974b13	Implementation for rng ops on hpu and xpu (#133068 ) implementation for high_order_op::run_and_save_rng_state and high_order_op::run_with_rng_state on hpu Pull Request resolved: https://github.com/pytorch/pytorch/pull/133068 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel, https://github.com/anijain2305	2024-08-27 11:34:37 +00:00
FEI	ef8236f12b	Provide default value None for the attn_bias parameter(#133981 ) (#133986 ) Fixes #133981 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133986 Approved by: https://github.com/ezyang	2024-08-27 11:10:43 +00:00
drisspg	a34320a6f2	[FlexAttention] Fix Sparse block multiple to ceildiv instead for floor div (#134538 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134538 Approved by: https://github.com/yanboliang ghstack dependencies: #134495, #134507, #134511	2024-08-27 09:53:19 +00:00
drisspg	767c47d3c0	[FlexAttention] Remove unused code (#134511 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134511 Approved by: https://github.com/yanboliang ghstack dependencies: #134495, #134507	2024-08-27 09:53:19 +00:00
drisspg	4d0a44d34a	[FlexAttention] Create new variables for the subgraphs (#134507 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134507 Approved by: https://github.com/yanboliang, https://github.com/BoyuanFeng ghstack dependencies: #134495	2024-08-27 09:53:13 +00:00
Zain Rizvi	f480385277	Remove explicit Amz2023 reference from jobs (#134355 ) Changes jobs to go back to using the default AMI. Note: This is only a cleanup PR. It does NOT introduce any behavior changes in CI Now that the default variant uses the Amazon 2023 AMI and has been shown to be stable for a week, it's time to remove the explicit amz2023 references and go back to using the default variant. After a week or two, when this is rolled out to most people, we can remove the variants from scale config as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134355 Approved by: https://github.com/jeanschmidt	2024-08-27 08:51:42 +00:00
Prashant Rawat	0916d72e99	Fix the warning for cat operators with same qparams (#133999 ) Summary: Currently the warning is printed when the cat inputs have same qparam, leading to a flood of warnings. This diff emits the warning only when cat inputs don't have the same qparam. Test Plan: CI Reviewed By: aprotopopov Differential Revision: D60638609 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133999 Approved by: https://github.com/tarun292	2024-08-27 08:21:39 +00:00
wizzniu	3515090006	Fix TypeError when iterating NoneType in instantiate_device_type_tests() (#134457 ) Fixes #134454 Fix the TypeError introduced by https://github.com/pytorch/pytorch/pull/133082, which iterates over the `None` default values of the ``except_for`` and ``only_for`` arguments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134457 Approved by: https://github.com/shink, https://github.com/albanD	2024-08-27 07:13:36 +00:00
Sathyanarayanan Saravanamuthu	136b19b062	Adding entry-point based support for out-of-tree rendezvous plugins (#132633 ) Fixes #127519 Currently in torchrun rendezvous, only two rendezvous backends are supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their out-of-tree rendezvous backend implementations as Python packages. #### AUTHORING NEW PLUGIN Any new plugin will be a Python package exposing entry-points. For example, the structure of the redis plugin is as follows: ``` plugin_root \|_ pyproject.toml \|_ src \|_ redis \|_ __init__.py \|_ redis_store.py \|_ redis_backend.py ``` The contents of `pyproject.toml` should indicate that the package exposes a torchrun entry-point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for the redis plugin would be as follows: ``` [project] name = "redis" version = "0.0.1" [project.entry-points.'torchrun.plugins'] redis = 'redis' ``` The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows: ``` def getPluginHandler(): def _create_redis_handler(params: RendezvousParameters): from redis_rendezvous_backend import create_backend backend, store = create_backend(params) return create_handler(store, backend, params) return _create_redis_handler ``` The files `redis_store` and `redis_backend` contain the implementation of [Store](`41189b0da4/torch/_C/_distributed_c10d.pyi (L171)`) and [RendezvousBackend](`e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)`) respectively. #### USER EXPERIENCE Before using the plugin for the first time, the user has to install the plugin packages. For example, published packages can be installed using `pip3 install <plugin-name>`, and a plugin in the local file system can be installed using `pip3 install -e <plugin-location>`. 
Once installed, the new backend can be used in torchrun as follows: ``` torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633 Approved by: https://github.com/wconstab	2024-08-27 07:09:41 +00:00
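Entry-point groups such as `torchrun.plugins` can be enumerated with the standard library. The sketch below is an assumption-laden illustration of that mechanism, not torchrun's actual discovery code; `discover_plugins` is a hypothetical name.

```python
from importlib.metadata import entry_points


def discover_plugins(group="torchrun.plugins"):
    """Hypothetical helper: map entry-point names in `group` to EntryPoint objects."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry-point collection
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


# With the redis plugin from the commit message installed, this would be
# expected to contain a "redis" key; ep.load() would import the package.
print(sorted(discover_plugins()))
```

Loading is deferred to `ep.load()`, so merely listing the installed plugins does not import any of them.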
Xu Han	4a18fcf7af	[inductor] calibration inductor windows uts (12/N) (#134428 ) Enable Windows inductor UTs for `test/inductor/test_torchinductor_codegen_dynamic_shapes.py`. The current failure depends on https://github.com/pytorch/pytorch/pull/134429; this PR needs a rebase after https://github.com/pytorch/pytorch/pull/134429 is merged. ```cmd 2024-08-25T23:57:23.2747794Z Windows CI does not have necessary dependencies for test_torchinductor_dynamic_shapes yet 2024-08-25T23:57:23.2748541Z Traceback (most recent call last): 2024-08-25T23:57:23.2749593Z File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_torchinductor_codegen_dynamic_shapes.py", line 30, in <module> 2024-08-25T23:57:23.2750688Z from inductor.test_torchinductor_dynamic_shapes import ( 2024-08-25T23:57:23.2751877Z File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_torchinductor_dynamic_shapes.py", line 46, in <module> 2024-08-25T23:57:23.2752876Z raise unittest.SkipTest("requires sympy/functorch/filelock") 2024-08-25T23:57:23.2753545Z unittest.case.SkipTest: requires sympy/functorch/filelock 2024-08-25T23:57:23.2754077Z Got exit code 1 2024-08-25T23:57:23.2754874Z No stepcurrent file found. Either pytest didn't get to run (e.g. import error) or file got deleted (contact dev infra) ``` Local tests pass: <img width="1892" alt="image" src="https://github.com/user-attachments/assets/241ab082-6026-4f33-b3ac-7e9ef7da744d"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134428 Approved by: https://github.com/jansel	2024-08-27 05:43:07 +00:00
Shivam Raikundalia	0b81f700aa	[PT2/Profiler] Add Context Info to Torch-Compiled Regions (#132765 ) Summary: We want to add compile IDs and frames to each Torch-Compiled Region to help users cross-reference the section they are checking against data obtained from tools such as tlparse. This diff operates on the assumption that each graph section will enter and exit a CompileContext before it is run, either to compile the graph or to look it up in the cache. Based on this assumption, we can save the value of the graph section from the exited CompileContext in eval_frame.c using a Python C API. After this, we can create a new interface in the cpp shim that wraps record_function in order to pass in the new keyword argument for "context". Test Plan: Enhance test_profiler_dynamo_compiled_region to look for kwinputs as well as a name to see that the context is now labeled. Also changed the test to run the graph with more contexts so that we test a wider range of profiling. Differential Revision: D60803317 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132765 Approved by: https://github.com/anijain2305	2024-08-27 04:55:04 +00:00
Shuai Yang	de57a6e806	Back out "[dynamo][exception] Support raise exception from None (#134028 )" (#134513 ) Summary: The original diff is causing the error "attempting to assign a gradient with dtype 'c10::BFloat16' to a tensor with dtype 'float'". The context is in: https://fb.workplace.com/groups/1075192433118967/permalink/1491357138169159/ Test Plan: After reverting, the above issue is gone, details are in https://fb.workplace.com/groups/1075192433118967/permalink/1491357138169159/ Differential Revision: D61820520 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134513 Approved by: https://github.com/anijain2305	2024-08-27 02:57:14 +00:00
Xu Han	02b0b524b5	[inductor] Turn on UT: test_randint_int64_mod (#134510 ) The test was fixed by https://github.com/pytorch/pytorch/pull/134229, so turn it on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134510 Approved by: https://github.com/ezyang	2024-08-27 02:33:07 +00:00
Xuehai Pan	d0147290d8	[BE][Easy][dynamo] ensure `trace_rules.MOD_INLINELIST` in alphabetical order (#134246 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #134246 * #133987 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134246 Approved by: https://github.com/yanboliang	2024-08-27 02:29:43 +00:00
cyy	2ee201a7d0	[CMake] Remove BUILDING_WITH_TORCH_LIBS (#134434 ) Since BUILDING_WITH_TORCH_LIBS is not used now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134434 Approved by: https://github.com/ezyang	2024-08-27 01:48:21 +00:00
Edward Z. Yang	bdfc1d3987	Remove unnecessary expect_true in split_with_sizes (#133439 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/133439 Approved by: https://github.com/albanD	2024-08-27 01:34:00 +00:00
Edward Z. Yang	c7ca89a11a	Improve print stack/locals printing in comptime (#133651 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/133651 Approved by: https://github.com/anijain2305	2024-08-27 01:29:50 +00:00
rzou	58771315d3	Unify lowerings for auto_functionalized and triton_kernel_wrapper_functional (#134466 ) Fixes https://github.com/pytorch/pytorch/issues/134372 The triton_kernel_wrapper_functional lowering was causing problems (it was generating small kernels with NaNs in them, probably from realizing aten.empty nodes). Instead of giving it its own manual lowering, we change triton_kernel_wrapper_functional to go the same route as auto_functionalized, where we decompose the node into clone + mutation nodes. Test Plan: - new test - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/134466 Approved by: https://github.com/oulgen, https://github.com/eellison ghstack dependencies: #134364	2024-08-27 00:53:05 +00:00
PyTorch MergeBot	141a9c7204	Revert "[export] enumerate unsupported sympy.Functions (#134271 )" This reverts commit ddd71e34797f3bb56a048058e007a2df87c5755f. Reverted https://github.com/pytorch/pytorch/pull/134271 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134271#issuecomment-2311353460))	2024-08-27 00:45:00 +00:00
drisspg	4df10a6340	[FlexAttention] Fix bug when checking whether to return LSE (#134495 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134495 Approved by: https://github.com/yanboliang, https://github.com/Chillee, https://github.com/BoyuanFeng	2024-08-27 00:31:46 +00:00
Xu Han	b98d33c155	[inductor] calibration inductor windows uts (13/N) (#134429 ) enable Windows inductor UTs for `test/inductor/test_torchinductor_dynamic_shapes.py` Local test pass: <img width="1885" alt="image" src="https://github.com/user-attachments/assets/4b96b6d9-715f-4c94-8059-9ee0afaaa574"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134429 Approved by: https://github.com/jansel	2024-08-27 00:16:16 +00:00
Xuehai Pan	74341e1150	[dynamo] simplify implementation for `os.fspath` (#133801 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133801 Approved by: https://github.com/anijain2305 ghstack dependencies: #133771	2024-08-27 00:08:04 +00:00
Xuehai Pan	1dbd3476de	[dynamo][itertools] support `itertools.tee` (#133771 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771 Approved by: https://github.com/jansel	2024-08-27 00:08:04 +00:00
CK Luk	43bbd781f2	Back out "[Traceable FSDPS] Allow tracing through FSDP2 impl in trace_rules.py (#133532 )" (#134478 ) Summary: Original commit changeset: 0215a41433e9 Original Phabricator Diff: D61432583 D61432583 causes FSDP2 stuck in PT2 compilation when applied to FB-FM-v4. With D61432583: https://www.internalfb.com/mast/job/aps-ckluk-745e763d6a After backing out D61432583: https://www.internalfb.com/mast/job/aps-ckluk-f9604ea1f9 Test Plan: hg graft D61774888 scripts/ckluk/aps/mast_joint_arch_exploration_cmf_updated_fbfm_v3_fsdp2_qps.sh Differential Revision: D61802689 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134478 Approved by: https://github.com/yf225	2024-08-27 00:07:28 +00:00
Xinya Zhang	46ecc673ae	[ROCm] Prevent accidental enablement of efficient attention. (#133331 ) Currently Efficient attention and Flash attention share the same set of GPU kernels on ROCM and have common limitations on head sizes. Fixes https://github.com/pytorch/pytorch/issues/132004 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133331 Approved by: https://github.com/malfet, https://github.com/jithunnair-amd	2024-08-27 00:03:45 +00:00
xinan.lin	0be6584203	[Inductor UT] Refine test case `test_codegen_upcast_to_fp32_upcast` to pass on XPU. (#134474 ) [Inductor UT] Refine test case test_codegen_upcast_to_fp32_upcast to pass on XPU. Fix issue: #134476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134474 Approved by: https://github.com/jansel	2024-08-26 23:59:26 +00:00
Roy Hvaara	1565940114	[MPS] Add `test/test_nn.py` to test suite (#134184 ) This PR increases test coverage by including the tests in `test/test_nn.py` in the test suite of MPS. Some of the tests are decorated with `@expectedFailureMPS`, either because the op is not implemented or because the outputs do not align. Tests with differing results should be investigated further to rule out any live bugs. ```bash $ python test/run_test.py --mps --verbose -k TestNN Running test batch 'tests to run' cost 84.76 seconds ``` Ref #133520 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134184 Approved by: https://github.com/albanD, https://github.com/malfet	2024-08-26 23:48:23 +00:00
Nikita Shulga	79b7fff188	Fix docstring for torch.signal.windows.nuttall (#134512 ) This partially fixes regression introduced by https://github.com/pytorch/pytorch/pull/124771 but also just improves `z_n` rendering, by using MathML In 2.3 it was [rendered](https://pytorch.org/docs/2.3/generated/torch.signal.windows.nuttall.html#torch.signal.windows.nuttall) as <img width="177" alt="image" src="https://github.com/user-attachments/assets/2c15d1f9-13ad-483f-bb66-41fa3fa4ba9c"> With this change it'll be [rendered](https://docs-preview.pytorch.org/pytorch/pytorch/134512/generated/torch.signal.windows.nuttall.html#torch.signal.windows.nuttall) as ```math z_n = \frac{2 \pi n}{M} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134512 Approved by: https://github.com/kit1980, https://github.com/aorenste, https://github.com/atalman	2024-08-26 22:51:43 +00:00
Pian Pawakapan	ddd71e3479	[export] enumerate unsupported sympy.Functions (#134271 ) There's 2 concepts of unsupported sympy.Functions in symbolic_shapes: 1) unsupported by the export solver, meaning the solver doesn't know how to provide useful fixes for those functions 2) unsupported by the sympy interpreter - meaning we can't reify them into FX nodes because the functions aren't present in PythonReferenceAnalysis This splits the current call into a call for each version, with the Export solver the only user of 1). For 1), we enumerate the functions in _sympy/functions.py, and subtract the functions we know we can support. For 2) there's only 3 functions we've seen pop up in test cases. Differential Revision: D61677956 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134271 Approved by: https://github.com/avikchaudhuri	2024-08-26 22:44:12 +00:00
Benjamin Glass	55236d0cb7	TestForeach::test_parity: Remove check for error message text (#134251 ) Previously, error messages were expected to be string equivalent to error messages thrown by the ref function. This check fails for dozens of torch functions, and doesn't appear to add much value for the end user. This commit removes this check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134251 Approved by: https://github.com/amjames, https://github.com/janeyx99 ghstack dependencies: #134253, #134344	2024-08-26 22:40:54 +00:00
Benjamin Glass	ef8c474fcf	Add the fast path for bfloat16 lgamma (#134344 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134344 Approved by: https://github.com/amjames, https://github.com/janeyx99 ghstack dependencies: #134253	2024-08-26 22:40:54 +00:00
Benjamin Glass	3c5883e550	Fix test_parity xfail for sigmoid (#134253 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134253 Approved by: https://github.com/amjames, https://github.com/janeyx99	2024-08-26 22:40:54 +00:00
soulitzer	a23dae22d5	Update AC pass use_reentrant message (#134472 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134472 Approved by: https://github.com/albanD	2024-08-26 21:57:38 +00:00
Animesh Jain	dbef2b05b4	[dynamo] Cache _dynamo.disable results (#134272 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134272 Approved by: https://github.com/yf225, https://github.com/jansel	2024-08-26 21:04:15 +00:00
Aidyn-A	28a4db84f2	[ARM] Fix infinite recursion in unwind (#134387 ) Fixes #119905 The `TORCH_SHOW_CPP_STACKTRACES=1` setting on ARM causes infinite recursive unwind because on failure a `StackTraceFetcher` attempts to unwind the <ins>failed instruction</ins>: `5ad759ca33/torch/csrc/profiler/combined_traceback.cpp (L25)` then the unwind itself fails: `5ad759ca33/torch/csrc/profiler/unwind/unwind.cpp (L10-L12)` and it causes another attempt to unwind the failure in `unwind()`... In summary, the executed instruction is equivalent to: ```C++ std::vector<void*> unwind() { // some instructions ... return unwind(); } ``` This PR replaces `TORCH_CHECK` by `TORCH_WARN_ONCE` as it will not cause an uncontrolled recursion. The only side effect would be an empty back-trace. Huge thanks to @nWEIdia who found the root cause! Pull Request resolved: https://github.com/pytorch/pytorch/pull/134387 Approved by: https://github.com/eqy, https://github.com/nWEIdia, https://github.com/malfet	2024-08-26 21:02:31 +00:00
Xu Han	900c5083ed	[inductor] calibration inductor windows uts (9/N) (#134425 ) enable Windows inductor UTs of `test/inductor/test_binary_folding.py` Failed UT depends on https://github.com/pytorch/pytorch/pull/134427 Need to rebase after https://github.com/pytorch/pytorch/pull/134427 merged. ```cmd 2024-08-25T23:32:23.0905727Z Traceback (most recent call last): 2024-08-25T23:32:23.0906516Z File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_binary_folding.py", line 18, in <module> 2024-08-25T23:32:23.0908200Z from inductor.test_inductor_freezing import TestCase 2024-08-25T23:32:23.0909883Z File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_inductor_freezing.py", line 39, in <module> 2024-08-25T23:32:23.0911128Z raise unittest.SkipTest("requires sympy/functorch/filelock") 2024-08-25T23:32:23.0911801Z unittest.case.SkipTest: requires sympy/functorch/filelock 2024-08-25T23:32:23.0912370Z Got exit code 1 2024-08-25T23:32:23.0913155Z No stepcurrent file found. Either pytest didn't get to run (e.g. import error) or file got deleted (contact dev infra) ``` Local test pass: <img width="1898" alt="image" src="https://github.com/user-attachments/assets/4a6e3f66-4bbc-4aab-8f0d-2e2318046e53"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134425 Approved by: https://github.com/ezyang, https://github.com/jansel	2024-08-26 20:57:41 +00:00
Animesh Jain	68624cf089	[dynamo][guards] De-dupe DUPLICATE_INPUT guard (#134354 ) Hard to write a test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134354 Approved by: https://github.com/jansel	2024-08-26 20:48:57 +00:00
Nikita Shulga	af82dc816a	Fix lint failures (#134488 ) Introduced by https://github.com/pytorch/pytorch/pull/131000 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134488 Approved by: https://github.com/Skylion007, https://github.com/msaroufim, https://github.com/albanD, https://github.com/atalman	2024-08-26 20:13:21 +00:00
albanD	2588b5e51a	Move module_tracker to logging for confused hierarchy (#134467 ) Fixes https://github.com/pytorch/pytorch/issues/134242 Make sure to never raise an error when confused. Logs for confusion can be enabled with `TORCH_LOGS="torch.utils.module_tracker"` or the usual python systems. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134467 Approved by: https://github.com/malfet	2024-08-26 19:39:08 +00:00
Mengwei Liu	a0e062c6f1	Add mean.dtype_out (#133506 ) Give it a try and see if CI is happy Pull Request resolved: https://github.com/pytorch/pytorch/pull/133506 Approved by: https://github.com/bdhirsh	2024-08-26 19:26:11 +00:00
eqy	3541e450af	Support larger page sizes with `use_mmap_weights` (#131000 ) Fixes e.g., `test_large_mmaped_weights_non_abi_compatible_cuda` on machines with 64K page size CC @malfet @tinglvv @nWEIdia Pull Request resolved: https://github.com/pytorch/pytorch/pull/131000 Approved by: https://github.com/malfet	2024-08-26 18:35:55 +00:00
Henry Tsang	3322ee236d	[aoti] remove c_shim_version v1 logic (#134283 ) Summary: Previously, https://github.com/pytorch/pytorch/pull/132750 and https://github.com/pytorch/pytorch/pull/133105 set c_shim_version to 2 for all cases. So removing c_shim_version logic. Test Plan: ci Differential Revision: D61574695 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134283 Approved by: https://github.com/desertfire	2024-08-26 18:29:40 +00:00
Wuxun Zhang	1d231ff8ba	[HOO] add hints_wrapper to support passing context hints (#132860 ) Fixes #126393 The implementation code is based on feedback here (https://github.com/pytorch/pytorch/pull/121639#issuecomment-2223948842). Hints are passed as kwargs of hints_wrapper op. It also supports nested hints. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132860 Approved by: https://github.com/ydwu4, https://github.com/zou3519	2024-08-26 18:21:22 +00:00
Animesh Jain	1ccc8f0200	[dynamo][super] Improve handling of getattr on super (#134039 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134039 Approved by: https://github.com/yanboliang, https://github.com/jansel	2024-08-26 18:20:39 +00:00
Xu Han	1dd4b9221b	[inductor] enable clang for Windows inductor (#134444 ) Changes: 1. Add Windows clang-cl compiler check. 2. Add openmp config for clang-cl. 3. Preload libomp.dll when using clang. 4. Add compiler flags syntax check for `clang` and `clang++`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134444 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/malfet	2024-08-26 18:19:59 +00:00
Xu Han	0a3c064c12	[inductor] fix _maybe_subprocess_run not support Windows path (#134365 ) Windows file path use `\` as delimiter, it is also a escape character. We need translate all path `\` to `/`. which like Linux. Reproduce UTs: ```cmd pytest test\dynamo\test_minifier.py -v -k test_after_dynamo_cpu_accuracy_error ``` Error message: ```cmd ____________________________________________________________________________________________________________ MinifierTests.test_after_dynamo_cpu_accuracy_error _____________________________________________________________________________________________________________ Traceback (most recent call last): File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_minifier.py", line 40, in test_after_dynamo_cpu_accuracy_error self._test_after_dynamo( File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_minifier.py", line 27, in _test_after_dynamo self._run_full_test(run_code, "dynamo", expected_error, isolate=False) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\test_minifier_common.py", line 235, in _run_full_test self.assertIn(expected_error, test_proc.stderr.decode("utf-8")) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 1112, in assertIn self.fail(self._formatMessage(msg, standardMsg)) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 675, in fail raise self.failureException(msg) AssertionError: 'AccuracyError' not found in 'Traceback (most recent call last):\n File "C:\\Users\\Xuhan\\.conda\\envs\\win_mkl_static\\lib\\site-packages\\torch\\_dynamo\\test_minifier_common.py", line 114, in _maybe_subprocess_run\n exec(code, {"__name__": "__main__", "__compile_source__": code})\n File "<string>", line 9\n torch._dynamo.config.debug_dir_root = "C:\\Users\\Xuhan\\AppData\\Local\\Temp\\tmpufu9t3pc"\n ^\nSyntaxError: (unicode error) \'unicodeescape\' codec can\'t decode bytes in position 2-3: truncated \\UXXXXXXXX escape\n' To execute this test, run the 
following from the base repo dir: python test\dynamo\test_minifier.py MinifierTests.test_after_dynamo_cpu_accuracy_error This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 --------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------- test stdout: test stderr: Traceback (most recent call last): File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\test_minifier_common.py", line 114, in _maybe_subprocess_run exec(code, {"__name__": "__main__", "__compile_source__": code}) File "<string>", line 9 torch._dynamo.config.debug_dir_root = "C:\Users\Xuhan\AppData\Local\Temp\tmpufu9t3pc" ^ SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape --------------------------------------------------------------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------------------------------------------------------------- running test ``` Local test passed: <img width="849" alt="image" src="https://github.com/user-attachments/assets/4a4eecc2-7c08-4de6-9395-546b69803b16"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134365 Approved by: https://github.com/jansel, https://github.com/jgong5	2024-08-26 17:48:11 +00:00
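The `unicodeescape` failure quoted in #134365 can be reproduced without PyTorch. The following is a minimal sketch, assuming a hypothetical helper `make_config_line` that stands in for whatever code writes the temp directory into generated Python source; it is not the actual `_maybe_subprocess_run` implementation.

```python
def make_config_line(path: str, normalize: bool) -> str:
    """Hypothetical helper: embed a filesystem path in generated Python source."""
    if normalize:
        # Forward slashes are accepted by Windows and are not escape characters.
        path = path.replace("\\", "/")
    return f'debug_dir_root = "{path}"'


def compiles(source: str) -> bool:
    """Return True if the generated source parses as Python."""
    try:
        compile(source, "<string>", "exec")
        return True
    except SyntaxError:
        return False


windows_path = r"C:\Users\Xuhan\AppData\Local\Temp\tmpufu9t3pc"

# "\U" starts an 8-digit unicode escape, so the raw path breaks the literal.
raw_ok = compiles(make_config_line(windows_path, normalize=False))

# After translating "\" to "/", the same path is a valid string literal.
normalized_ok = compiles(make_config_line(windows_path, normalize=True))
```

Normalizing the separator is only one option; emitting the path with `repr()` would also produce a valid literal, but the PR description states that it translates `\` to `/`.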
atalman	78128cbdd8	[CD] Use ephemeral arm64 runners for nightly and docker builds (#134473 ) Follow up after adding linux arm64 ephemeral instances: https://github.com/pytorch/pytorch/pull/134469 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134473 Approved by: https://github.com/malfet	2024-08-26 17:47:20 +00:00
Xu Han	0f5b052dba	[inductor] calibration inductor windows uts (11/N) (#134427 ) enable Windows inductor UTs of `test/inductor/test_inductor_freezing.py` Local test pass: <img width="1891" alt="image" src="https://github.com/user-attachments/assets/f3a873b4-abb5-4047-92f8-8e6da7c67315"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134427 Approved by: https://github.com/jansel	2024-08-26 17:43:58 +00:00
cyy	73604eed0c	[20/N] Fix clang-tidy warnings in jit (#133399 ) Follows #133067 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133399 Approved by: https://github.com/Skylion007	2024-08-26 17:43:52 +00:00
Xu Han	019b80855f	[inductor] calibration inductor windows uts (10/N) (#134426 ) enable Windows inductor UT of `test/inductor/test_efficient_conv_bn_eval.py` Local test pass: <img width="1892" alt="image" src="https://github.com/user-attachments/assets/8a94c5e4-68bf-4a6f-8a1b-60d6ede14882"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134426 Approved by: https://github.com/jansel	2024-08-26 17:43:36 +00:00
Xu Han	7ff576072f	[inductor] calibration inductor windows uts (8/N) (#134424 ) enable Windows inductor UTs of `test/inductor/test_benchmark_fusion.py` Local test pass: <img width="1912" alt="image" src="https://github.com/user-attachments/assets/5be34b0c-9411-4430-927e-3313245f7c13"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134424 Approved by: https://github.com/ezyang	2024-08-26 17:38:53 +00:00
PyTorch MergeBot	adcce538b7	Revert "Allow mp.start_processes to create processes in parallel (#133707 )" This reverts commit 3546628a2a167ace6060737eeccf8ee8fd87ddc0. Reverted https://github.com/pytorch/pytorch/pull/133707 on behalf of https://github.com/ZainRizvi due to sorry but trunk has been consistently broken since this PR was merged. See: [GH job link](https://github.com/pytorch/pytorch/actions/runs/10529617600/job/29191757055) [HUD commit link](`3546628a2a`) ([comment](https://github.com/pytorch/pytorch/pull/133707#issuecomment-2310709523))	2024-08-26 17:31:10 +00:00
mori360	d0ac5d55ba	Memory optimization for DSD for TorchTune LoRA (#134025 ) Optimize the memory cost of [PR#129635](https://github.com/pytorch/pytorch/pull/129635) There are two main parts to the optimization: 1. Optimize the tensor distribution step by postponing the full_tensor generation, which avoids the memory overlap and saves around 50% of peak memory in the 2-param test case. 2. Apply `assign=True` for `load_state_dict`, which saves memory during state dict loading by assigning the input param, around 50% of peak memory in the loading part. Future work: memory optimization for the opt will be done in the next PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/134025 Approved by: https://github.com/fegin Co-authored-by: Rachel Guo <guorachel@meta.com>	2024-08-26 17:24:25 +00:00
Catherine Lee	fc61aae70f	Remove color in CI (#133517 ) Remove color by default to make CI logs easier to read Example of color <img width="569" alt="image" src="https://github.com/user-attachments/assets/0da13544-98b1-47be-8383-64a5b3fd8951"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/133517 Approved by: https://github.com/ZainRizvi	2024-08-26 16:58:06 +00:00
PyTorch MergeBot	42955e04f1	Revert "[dynamo] Cache _dynamo.disable results (#134272 )" This reverts commit a699bd11551e9755bb9238c6b82c369880789397. Reverted https://github.com/pytorch/pytorch/pull/134272 on behalf of https://github.com/ZainRizvi due to Fails internal tests ([comment](https://github.com/pytorch/pytorch/pull/134272#issuecomment-2310649115))	2024-08-26 16:57:53 +00:00
PyTorch MergeBot	e94bdc7876	Revert "[dynamo][guards] De-dupe DUPLICATE_INPUT guard (#134354 )" This reverts commit cdb9df5efe78142b7a612ae9c938ddf8a8850d10. Reverted https://github.com/pytorch/pytorch/pull/134354 on behalf of https://github.com/ZainRizvi due to Fails internal tests ([comment](https://github.com/pytorch/pytorch/pull/134272#issuecomment-2310649115))	2024-08-26 16:57:53 +00:00
atalman	a6fac0e969	Use ephemeral runners for windows nightly builds (#134463 ) This is the definition of windows.4xlarge: ``` windows.4xlarge: disk_size: 256 instance_type: c5d.4xlarge is_ephemeral: true max_available: 420 os: windows ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134463 Approved by: https://github.com/jeanschmidt	2024-08-26 16:33:19 +00:00
Wang, Chuanqi	b417e32da2	[CD] fix xpu nightly wheel test env (#134395 ) (#134464 ) Since https://github.com/pytorch/builder/pull/1972 landed, the xpu env is sourced twice in the nightly wheel test. Works for https://github.com/pytorch/pytorch/issues/114850 Reland of #134395, to be landed with pytorchmergebot Pull Request resolved: https://github.com/pytorch/pytorch/pull/134464 Approved by: https://github.com/jeanschmidt Co-authored-by: Wang, Chuanqi <chuanqi.wang@intel.com>	2024-08-26 15:35:48 +00:00
atalman	c507f402f1	Add linux arm64 ephemeral runners (#134469 ) Should be landed with: https://github.com/pytorch/test-infra/pull/5593 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134469 Approved by: https://github.com/jeanschmidt, https://github.com/clee2000	2024-08-26 15:32:45 +00:00
PyTorch MergeBot	17e8a51ff2	Revert "[inductor]Let output or input_as_strided match exact strides (#130956 )" This reverts commit a63efee5cd422db0aabe5d02d2fe35fef9be7978. Reverted https://github.com/pytorch/pytorch/pull/130956 on behalf of https://github.com/ZainRizvi due to sorry but this seems to cause internal tests to fail. Please see D61771533 for details ([comment](https://github.com/pytorch/pytorch/pull/130956#issuecomment-2310490049))	2024-08-26 15:31:23 +00:00
PyTorch MergeBot	1c4780e69a	Revert "c10d/logging: add C10D_LOCK_GUARD (#134131 )" This reverts commit 4c28a0eb0ba437c1b7db559f63f8bec17bd48f69. Reverted https://github.com/pytorch/pytorch/pull/134131 on behalf of https://github.com/ZainRizvi due to Sorry but this causes formatting errors internally which make it fail to build. See D61759282 ([comment](https://github.com/pytorch/pytorch/pull/134131#issuecomment-2310455878))	2024-08-26 15:19:27 +00:00
PyTorch MergeBot	50e90d7203	Revert "[dynamo] simplify implementation for `functools.reduce` (#133778 )" This reverts commit 6c0b15e3828b8e2a0bd726a3e5d4e98c8ced5efe. Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))	2024-08-26 15:16:17 +00:00
PyTorch MergeBot	472c7cf962	Revert "[dynamo] simplify implementation for `builtins.sum` (#133779 )" This reverts commit 8d90392fb02ce5e6854e6b4dbcdc4a7bbd55f8e2. Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))	2024-08-26 15:16:17 +00:00
PyTorch MergeBot	3d7f3f6a55	Revert "[dynamo][itertools] support `itertools.tee` (#133771 )" This reverts commit 0e49b2f18e78386c8ed9ce540a8017411c7ab0cd. Reverted https://github.com/pytorch/pytorch/pull/133771 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))	2024-08-26 15:16:17 +00:00
PyTorch MergeBot	e1fc4362fb	Revert "[dynamo] simplify implementation for `os.fspath` (#133801 )" This reverts commit c5f6b72041144c00e240bcfdc783a5597c3d8928. Reverted https://github.com/pytorch/pytorch/pull/133801 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))	2024-08-26 15:16:17 +00:00
Thanh Ha	bb67ff2ba7	Migrate Windows bin jobs to runner determinator (#134231 ) Update Windows binary workflows to use the runner determinator script. Closes: pytorch/ci-infra#262 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134231 Approved by: https://github.com/ZainRizvi	2024-08-26 14:56:00 +00:00
Benjamin Glass	27d97b9649	Remove unnecessary test skip (#134250 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134250 Approved by: https://github.com/amjames, https://github.com/janeyx99	2024-08-26 14:34:53 +00:00
Andrey Talman	be96ccf77c	Revert "[CD] fix xpu nightly wheel test env (#134395 )" (#134461 ) This reverts commit 96738c9d756fbd64e6f2eba67f711d3e18f1630c. Merged without pytorchmergebot command by mistake Pull Request resolved: https://github.com/pytorch/pytorch/pull/134461 Approved by: https://github.com/jeanschmidt	2024-08-26 13:40:17 +00:00
Wang, Chuanqi	96738c9d75	[CD] fix xpu nightly wheel test env (#134395 )	2024-08-26 08:53:15 -04:00
haozhe.zhu	1ff226d88c	[inductor] support vec for atomic add (#131314 ) Depends on https://github.com/pytorch/pytorch/pull/130827 to have correct `index_expr` dtype Support vec for atomic add by scalar implementation. TestPlan: ``` python test/inductor/test_cpu_repro.py -k test_scatter_using_atomic_add_vec ``` Generated code for `test_scatter_using_atomic_add_vec` ``` cpp_fused_scatter_0 = async_compile.cpp_pybinding(['const float', 'const int64_t', 'const float', 'float'], ''' #include "/tmp/torchinductor_root/nn/cnnpkaxivwaa5rzng6qsyc4ao42vschogi3yk33ukwv3emlvxeqq.h" extern "C" void kernel(const float* in_ptr0, const int64_t* in_ptr1, const float* in_ptr2, float* out_ptr0) { { for(long x0=static_cast<long>(0L); x0<static_cast<long>(16L); x0+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0), 16); tmp0.store(out_ptr0 + static_cast<long>(x0)); } #pragma omp simd simdlen(8) for(long x0=static_cast<long>(16L); x0<static_cast<long>(25L); x0+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(x0)]; out_ptr0[static_cast<long>(x0)] = tmp0; } } { for(long x0=static_cast<long>(0L); x0<static_cast<long>(16L); x0+=static_cast<long>(16L)) { auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr1 + static_cast<long>(x0), 16); auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x0), 16); auto tmp1 = 25L; auto tmp2 = c10::convert<int64_t>(tmp1); auto tmp3 = at::vec::VectorizedN<int64_t,2>(tmp2); auto tmp4 = tmp0 + tmp3; auto tmp5 = static_cast<int64_t>(0); auto tmp6 = at::vec::VectorizedN<int64_t,2>(tmp5); auto tmp7 = at::vec::VecMask<int64_t,2>(tmp0 < tmp6); auto tmp8 = decltype(tmp4)::blendv(tmp0, tmp4, tmp7.template cast<int64_t,2>()); auto tmp9 = [&] { __at_align__ std::array<int64_t, 16> tmpbuf; tmp8.store(tmpbuf.data()); return tmpbuf; } () ; auto tmp10 = [&] { __at_align__ std::array<int64_t, 16> tmpbuf; #pragma GCC unroll 16 for (long x0_inner = 0; x0_inner < 16; x0_inner++) { 
tmpbuf[x0_inner] = static_cast<long>(tmp9[x0_inner]); } return at::vec::VectorizedN<int64_t,2>::loadu(tmpbuf.data(), 16); } () ; TORCH_CHECK((at::vec::VecMask<int64_t,2>((at::vec::VectorizedN<int64_t,2>(0) <= tmp10) & (tmp10 < at::vec::VectorizedN<int64_t,2>(25L)))).all_masked(), "index out of bounds: 0 <= tmp10 < 25L"); atomic_add_vec(out_ptr0, tmp8, tmp12); } #pragma omp simd simdlen(8) for(long x0=static_cast<long>(16L); x0<static_cast<long>(20L); x0+=static_cast<long>(1L)) { auto tmp0 = in_ptr1[static_cast<long>(x0)]; auto tmp9 = in_ptr2[static_cast<long>(x0)]; auto tmp1 = 25L; auto tmp2 = c10::convert<int64_t>(tmp1); auto tmp3 = decltype(tmp0)(tmp0 + tmp2); auto tmp4 = tmp0 < 0; auto tmp5 = tmp4 ? tmp3 : tmp0; auto tmp6 = tmp5; auto tmp7 = c10::convert<int64_t>(tmp6); TORCH_CHECK((0 <= tmp7) & (tmp7 < 25L), "index out of bounds: 0 <= tmp7 < 25L"); atomic_add(&out_ptr0[static_cast<long>(tmp5)], static_cast<float>(tmp9)); } } } ''') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131314 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel	2024-08-26 10:36:51 +00:00
fduwjj	bf5c7bf06d	[FR] Fix the bug in FR script (e.g., checking all ranks dump check) (#134383 ) We somehow convert the rank to a string, which makes the ranks check fail. This fix now converts them all to int. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134383 Approved by: https://github.com/c-p-i-o	2024-08-26 08:21:14 +00:00
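Why string ranks break an "all ranks dumped" check can be shown in a few lines. This is a sketch only: the commit does not say where the strings come from, so the example assumes, purely for illustration, that rank ids arrive as JSON object keys; `dumps`, `world_size`, and the check itself are hypothetical and not taken from the FR script.

```python
import json

# Assumption: rank ids arrive as strings (JSON object keys are always strings).
dumps = json.loads('{"0": "trace-a", "1": "trace-b", "2": "trace-c"}')
world_size = 3
expected_ranks = set(range(world_size))

# Comparing string ranks against int ranks never matches.
string_check = set(dumps.keys()) == expected_ranks

# Converting every rank to int first makes the comparison meaningful.
int_check = {int(rank) for rank in dumps} == expected_ranks
```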
Avik Chaudhuri	92c4771853	fix stuck floordiv (#134150 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/134133 Test Plan: Tested on the small repro in the linked issue with different lengths N (replacing 100), recording N vs. time taken in nanoseconds: 10 127268319 20 220839662 30 325463125 40 429259441 50 553136055 60 670799769 70 999170514 80 899014103 90 997168902 100 1168202035 110 1388556619 120 1457488235 130 1609816470 140 2177889877 150 1917560313 160 2121096113 170 2428502334 180 4117450755 190 4003068224 So N ~ 200 takes ~5s. Previously even smaller N would go for >1 min. Didn't add a perf test because ezyang is planning to build a benchmark. Also tested on https://www.internalfb.com/diff/D61560171, which now gets past the stuck point. Differential Revision: D61619660 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134150 Approved by: https://github.com/ezyang	2024-08-26 07:27:59 +00:00
Xuehai Pan	c5f6b72041	[dynamo] simplify implementation for `os.fspath` (#133801 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133801 Approved by: https://github.com/anijain2305 ghstack dependencies: #133769, #133778, #133779, #133771	2024-08-26 07:12:15 +00:00
Amadeusz Skrzypczak	38f97ec8e3	[pt2] Add meta for poisson (#134103 ) Because aten.poisson doesn't have a meta function registered, there is one additional eager execution of this op during the compilation phase of torch.compile. There are more ops without meta registration. Is there any reason for it? Pull Request resolved: https://github.com/pytorch/pytorch/pull/134103 Approved by: https://github.com/ezyang	2024-08-26 06:14:38 +00:00
Aaron Orenstein	ed86ac2f25	[BE] typing for decorators - fx/_compatibility (#134054 ) Summary: See #131429 Test Plan: unit tests pass Differential Revision: D61493706 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134054 Approved by: https://github.com/oulgen	2024-08-26 04:00:27 +00:00
Laith Sakka	7b6b10417d	Remove ansi escape chars in assertExpectedInline and add options to skip comments and to skip empty lines (#134248 ) I had a nightmare rewriting tests in test_misc.py, specifically: 1. Graphs can have comments that refer to my files "/lsakka/.."; we really don't care about comments, so add an option to ignore comments. 2. Empty lines added when EXPECTTEST_ACCEPT=1 are changed by the linter, causing the tests or the linter to fail! Add a flag to ignore empty lines. 3. EXPECTTEST_ACCEPT fails when the text has some unreadable characters. Those should not affect comparing strings, and they also cause weird diffs when tests fail. I removed ansi_escape chars https://github.com/pytorch/pytorch/pull/133045 this is used in Pull Request resolved: https://github.com/pytorch/pytorch/pull/134248 Approved by: https://github.com/aorenste ghstack dependencies: #133639, #134364	2024-08-26 02:03:44 +00:00
Xu Han	2ec149cd3e	[inductor] fix test_functional_call_sequential_params_and_buffers expectation on Windows (#134394 ) This UT actual code only one empty line wrap difference(`linear` and `add`) between Windows/Linux, and the context is right. Reproduce UTs: ```cmd pytest test\dynamo\test_higher_order_ops.py -v -k test_functional_call_sequential_params_and_buffers ``` We can add `empty_line_normalizer` to fix it. ```cmd ______________________________________________________________________________________________ FuncTorchHigherOrderOpTests.test_functional_call_sequential_params_and_buffers _______________________________________________________________________________________________ Traceback (most recent call last): File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 3676, in test_functional_call_sequential_params_and_buffers self.assertExpectedInline( File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 2871, in assertExpectedInline return super().assertExpectedInline(actual if isinstance(actual, str) else str(actual), expect, skip + 1) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\expecttest\__init__.py", line 271, in assertExpectedInline self.assertMultiLineEqualMaybeCppStack(expect, actual, msg=help_text) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\expecttest\__init__.py", line 292, in assertMultiLineEqualMaybeCppStack self.assertMultiLineEqual(expect, actual, args, *kwargs) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 1226, in assertMultiLineEqual self.fail(self._formatMessage(msg, standardMsg)) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 675, in fail raise self.failureException(msg) AssertionError: 'clas[509 chars]one\n add: "f32[1, 1]" = linear + l_buf[69 chars],)\n' != 'clas[509 chars]one\n\n add: "f32[1, 1]" = linear + l_b[71 chars],)\n' class 
GraphModule(torch.nn.Module): def forward(self, L_params_l1_weight_: "f32[1, 1]", L_params_l1_bias_: "f32[1]", L_buffers_buffer_: "f32[1]", L_inputs_: "f32[1, 1]"): l_params_l1_weight_ = L_params_l1_weight_ l_params_l1_bias_ = L_params_l1_bias_ l_buffers_buffer_ = L_buffers_buffer_ l_inputs_ = L_inputs_ linear: "f32[1, 1]" = torch._C._nn.linear(l_inputs_, l_params_l1_weight_, l_params_l1_bias_); l_inputs_ = l_params_l1_weight_ = l_params_l1_bias_ = None + <<<< (difference is here ) add: "f32[1, 1]" = linear + l_buffers_buffer_; linear = l_buffers_buffer_ = None return (add,) : To accept the new output, re-run test with envvar EXPECTTEST_ACCEPT=1 (we recommend staging/committing your changes before doing this) To execute this test, run the following from the base repo dir: python test\dynamo\test_higher_order_ops.py FuncTorchHigherOrderOpTests.test_functional_call_sequential_params_and_buffers This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ========================================================================================================================== short test summary info ========================================================================================================================== FAILED [0.4275s] test/dynamo/test_higher_order_ops.py::FuncTorchHigherOrderOpTests::test_functional_call_sequential_params_and_buffers - AssertionError: 'clas[509 chars]one\n add: "f32[1, 1]" = linear + l_buf[69 chars],)\n' != 'clas[509 chars]one\n\n add: "f32[1, 1]" = linear + l_b[71 chars],)\n' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134394 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@jansel.net>	2024-08-26 01:41:20 +00:00
Tianyi Tao	7af38eb98b	Fix unexpected inference_mode interaction with torch.autograd.functional.jacobian (#130307 ) Fixes #128264 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130307 Approved by: https://github.com/soulitzer	2024-08-25 22:14:02 +00:00
Xu Han	dc1959e6a7	[inductor] calibration inductor windows uts (7/N) (#134420 ) Disable UTs on Windows: `test/dynamo/test_misc.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134420 Approved by: https://github.com/jansel	2024-08-25 20:39:54 +00:00
Xu Han	97fd087cdb	[inductor] calibration inductor windows uts (6/N) (#134419 ) Disable UTs for Windows: `test/dynamo/test_aot_autograd_cache.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134419 Approved by: https://github.com/jansel	2024-08-25 20:39:34 +00:00
Richard Barnes	b5dd60fa75	Fix namespace issues with qnnpack (#134336 ) After this I think all `using namespace` will have been eliminated from PyTorch header files. Internally, `-Wheader-hygiene` will prevent more from being added. Test Plan: Sandcastle Differential Revision: D61679037 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134336 Approved by: https://github.com/Skylion007	2024-08-25 19:50:01 +00:00
Igor Sugak	7940f2428f	[torch/package_importer] add compatibility name mapping (#134376 ) Summary: This enables patching extern modules to provide compatibility with serialized code depending on different versions of those extern modules. The main motivation is to enable Numpy upgrade. In the recent release many alias to builtin types were deprecated and removed [1]. This breaks loading pickled modules that reference the removed aliases. While the proper solution is to re-generate pickled modules, it's not always feasible. This proposes a way to define mapping with a new type, for a module member. It is only set if it's not present in the loaded module, thus removes the need to check for exact versions. https://numpy.org/doc/stable/release/1.20.0-notes.html#using-the-aliases-of-builtin-types-like-np-int-is-deprecated Differential Revision: D61556888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134376 Approved by: https://github.com/SherlockNoMad	2024-08-25 19:34:46 +00:00
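The "only set if it's not present" rule described in #134376 can be sketched as follows. `apply_compat_mapping` and `fake_numpy` are hypothetical names used for illustration; this is not the `torch.package` importer API, only the fallback behavior the summary describes.

```python
import types


def apply_compat_mapping(module, mapping):
    """Set each mapped name on `module` only if the module lacks it."""
    for name, value in mapping.items():
        if not hasattr(module, name):
            setattr(module, name, value)
    return module


# Stand-in for a newer extern module that dropped the `int` alias
# but still defines `float64` itself.
fake_numpy = types.ModuleType("fake_numpy")
fake_numpy.float64 = "native float64"

# `int` is missing, so it is filled in; `float64` exists, so it is untouched.
apply_compat_mapping(fake_numpy, {"int": int, "float64": float})
```

Because existing members always win, the same mapping can be applied regardless of which version of the extern module is installed, which is why no exact version check is needed.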
Shivam Raikundalia	816061843a	[Distributed/Profiler] Fix input/output dimension overflow (#134360 ) Summary: When using ParamCommsDebugInfo, the input elements and output elements are stored in `int` instead of `int64_t` Test Plan: Run HTA with new outputted values and make sure overflow does not occur Reviewed By: fengxizhou Differential Revision: D61728747 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134360 Approved by: https://github.com/fengxizhou, https://github.com/jeanschmidt	2024-08-25 16:25:56 +00:00
eqy	e93ca12c88	[CUDNN][SDPA] Fix unsupported trivial stride-1 transpose case (#134031 ) Fixes #134001 Incorrect assumption that two same-shape tensors being contiguous meant that they would have the same stride Pull Request resolved: https://github.com/pytorch/pytorch/pull/134031 Approved by: https://github.com/drisspg, https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-08-25 14:31:30 +00:00
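The incorrect assumption named in #134031 is easy to check by hand. Below is a pure-Python sketch of a row-major contiguity test that, like PyTorch's, ignores the strides of size-1 dimensions; `is_contiguous` here is an illustrative reimplementation, not the cuDNN SDPA code path.

```python
def is_contiguous(shape, strides):
    """Row-major contiguity check that skips size-1 dimensions."""
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size == 1:
            continue  # the stride of a size-1 dim is never used to step
        if stride != expected:
            return False
        expected *= size
    return True


shape = (2, 1, 4)
strides_a = (4, 4, 1)  # freshly allocated tensor of this shape
strides_b = (4, 1, 1)  # shape (2, 4, 1) with its last two dims transposed

both_contiguous = is_contiguous(shape, strides_a) and is_contiguous(shape, strides_b)
same_strides = strides_a == strides_b
```

Both layouts pass the contiguity test while their stride tuples differ, so code that needs matching strides has to compare the strides directly.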
Chirag Pandya	08d111250a	[ez][c10d] change ERROR to WARNING (#134349 ) Summary: Change error to warning because TCPStore can be torn down during a normal shutdown. It's OK if we're unable to access TCPStore. Should not be an error. Test Plan: Ran locally Pull Request resolved: https://github.com/pytorch/pytorch/pull/134349 Approved by: https://github.com/fduwjj, https://github.com/wconstab	2024-08-25 14:22:55 +00:00
PyTorch MergeBot	4648848696	Revert "[ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438 )" This reverts commit f71c3d265ab52589f983dd252d61461db4e7dbbd. Reverted https://github.com/pytorch/pytorch/pull/133438 on behalf of https://github.com/jeanschmidt due to seems to have introduced breakages in linux binary builds ([comment](https://github.com/pytorch/pytorch/pull/133438#issuecomment-2308787310))	2024-08-25 11:20:30 +00:00
PyTorch MergeBot	e5563f7ad7	Revert "[dtensor][MTPG] make sharding prop lru cache not shared among threads (#134294 )" This reverts commit eb15b1a016c6facaf8605dde2c20b5de1586542d. Reverted https://github.com/pytorch/pytorch/pull/134294 on behalf of https://github.com/jeanschmidt due to seems to have introduced https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658 ([comment](https://github.com/pytorch/pytorch/pull/134294#issuecomment-2308785949))	2024-08-25 11:16:04 +00:00
wz337	268092db83	[DeviceMesh] Allow _flatten() to take in an optional mesh_dim_name (#134048 ) If a mesh_dim_name is given, we will use the given mesh_dim_name to name the new flattened dim. Otherwise, the default is a string concatenating the mesh_dim_names of the given submesh with each mesh_dim_name separated by "_". For example, if we have a 3D mesh DeviceMesh([[[0, 1], [2, 3]], [[4, 5], [6, 7]]], mesh_dim_names=("dp", "cp", "tp")), calling mesh_3d["dp", "cp"]._flatten() will create a 1D submesh DeviceMesh([0, 1, 2, 3], mesh_dim_names=("dp_cp",)) on rank 0, 1, 2, 3 and a 1D submesh DeviceMesh([4, 5, 6, 7], mesh_dim_names=("dp_cp",)) on rank 4, 5, 6, 7. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134048 Approved by: https://github.com/fegin ghstack dependencies: #133838, #133839	2024-08-25 10:36:01 +00:00
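The naming rule in the commit above can be sketched in a few lines. This only illustrates the described default (join with "_" unless a name is given); `flattened_dim_name` is a hypothetical helper, not the DeviceMesh implementation.

```python
def flattened_dim_name(submesh_dim_names, mesh_dim_name=None):
    """Return the name of the flattened dim: the explicit name if given, else a "_" join."""
    if mesh_dim_name is not None:
        return mesh_dim_name
    return "_".join(submesh_dim_names)


default_name = flattened_dim_name(("dp", "cp"))
custom_name = flattened_dim_name(("dp", "cp"), mesh_dim_name="dp_cp_flat")
print(default_name, custom_name)
```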
Edward Z. Yang	326db8af4c	Replace sympy Min/Max with reimplementations (#133319 ) Sympy's implementation of Min/Max displays asymptotically bad behavior on `TORCH_COMPILE_CPROFILE=1 python torchrec/distributed/tests/test_pt2_multiprocess.py TestPt2Train.test_compile_multiprocess`. Evidence profile: ![image](https://github.com/user-attachments/assets/142301e9-3a18-4370-b9db-19b32ece7ee8) On this test case, we spend 42% of all time compiling the network on ShapeEnv.replace, which in turn spends all of its time in xreplace. The problem appears to be the find_localzeros call. By vendoring the implementations of Min/Max, we can potentially reduce the cost of this operation. The implementation is copy-pasted from sympy/functions/elementary/miscellaneous.py but with some adjustments: * I deleted logic related to differentiation, evalf and heaviside, as it's not relevant to PyTorch reasoning * There's some massaging to appease PyTorch's linters, including a lot of noqa and type: ignore (which I could potentially refactor away with substantive changes, but that's better as its own change) * I deleted the second loop iteration for is_connected, as an attempt at initial optimization (this also simplifies the port, since I can omit some code). I'll comment at that point what the exact difference is. Before this change, the test in question takes 100s with 40 features; after this change, it takes only 69s. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/133319 Approved by: https://github.com/Skylion007	2024-08-25 05:05:59 +00:00
Avik Chaudhuri	8db8ac700d	line by line logging (#134298 ) Summary: Today there is no good mechanism to detect progress of non-strict export line-by-line in user code. This caused some pain recently in trying to find the exact line of user code that was triggering a bug where the process appeared stuck because deep down something was calling some symbolic shapes code that was suffering some exponential blowup. This PR adds an environment variable for extended debugging that will log the line of user code corresponding to every torch function call. It only works in non-strict export for now. Set this environment variable together with `TORCH_LOGS` enabled for `export` logs at `DEBUG` level (i.e., with a `+` prefix), e.g.: ``` TORCHEXPORT_EXTENDED_DEBUG_CURRENT_LOC=1 TORCH_LOGS="+export" ... ``` This will show logs with something like: ``` ... prim::device called at .../example.py:4284 in foo TensorBase.item called at .../example.py:4277 in bar ... ``` We already have an existing place to intercept torch functions where we process data-dependent errors in non-strict, so we are parking the logging there. An alternative place we could be doing this is where we add `stack_trace` metadata when generating code, but unfortunately at least the example that motivated this gets stuck before generating code, so that would be too late. Test Plan: ran it on some sample commands Differential Revision: D61692156 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134298 Approved by: https://github.com/angelayi	2024-08-25 02:57:11 +00:00
Xu Han	907c32faac	[inductor] calibration inductor windows uts (4/N) (#134401 ) skip failed UTs of `test/dynamo/test_unspec.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134401 Approved by: https://github.com/ezyang	2024-08-25 00:32:29 +00:00
Xu Han	74ef74be36	[inductor] calibration inductor windows uts (3/N) (#134400 ) skip Windows UT of `test/dynamo/test_trace_rules.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134400 Approved by: https://github.com/ezyang	2024-08-24 23:48:50 +00:00
Shivam Raikundalia	d33d68e326	[Profiler] Add test to make sure FunctionEvents are processed lazily (#134359 ) Summary: Create a simple test that checks that the FunctionEvent tree is built lazily by checking that its metrics change before and after the call. Test Plan: Make sure test passes in CI Reviewed By: briancoutinho Differential Revision: D61685429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134359 Approved by: https://github.com/briancoutinho	2024-08-24 23:03:19 +00:00
Xu Han	af4c87953e	[inductor] calibration inductor windows uts (5/N) (#134402 ) skip UTs of `test/dynamo/test_repros.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134402 Approved by: https://github.com/ezyang	2024-08-24 23:00:11 +00:00
Bob Ren	94f92fbd88	Use integer division in arange length calculation when start/end/step are integral (#134296 ) Fixes #133338 Test Plan: ``` TORCH_LOGS=dynamic python import torch torch._dynamo.config.capture_scalar_outputs = True @torch.compile() def f(x): y = x.item() torch._check_is_size(y) r = torch.arange(y, dtype=torch.float32) torch._check(r.size(0) == y) return r f(torch.tensor([300])) ``` Before and after diff. Verify the following line ``` I0813 11:05:44.890000 652898 torch/fx/experimental/symbolic_shapes.py:5198] [0/0] runtime_assert Eq(CeilToInt(IntTrueDiv(u0, 1)), u0) [guard added] at aa.py:10 in f (_dynamo/utils.py:2092 in run_node), for more info run with TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(CeilToInt(IntTrueDiv(u0, 1)), u0)" ``` no longer shows in the logs. Also verify CI passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134296 Approved by: https://github.com/aorenste	2024-08-24 21:09:28 +00:00
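The idea behind the commit above, computing the length with integer arithmetic instead of a true division followed by a ceiling, can be sketched in plain Python. This is an illustration of the arithmetic only; both function names are hypothetical and neither is the PyTorch implementation.

```python
import math


def arange_length_float(start: int, end: int, step: int) -> int:
    """Length via true division then ceil; goes through a float."""
    return max(math.ceil((end - start) / step), 0)


def arange_length_int(start: int, end: int, step: int) -> int:
    """Length via integer ceiling division; ceil(a / b) == -((-a) // b)."""
    return max(-((start - end) // step), 0)


small = arange_length_int(0, 10, 3)        # elements 0, 3, 6, 9
descending = arange_length_int(10, 0, -3)  # elements 10, 7, 4, 1

# Above 2**53 a float can no longer represent every integer exactly.
big_end = 2**53 + 1
float_len = arange_length_float(0, big_end, 1)
int_len = arange_length_int(0, big_end, 1)
print(small, descending, float_len, int_len)
```

With integral inputs the integer form stays exact and, in a symbolic setting, avoids introducing a true-division expression such as the `CeilToInt(IntTrueDiv(u0, 1))` guard quoted in the log line.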
Aart Bik	1a0d00f1f4	[traced-graph][sparse] enable to_dense() for compressed (#133371 ) Fixes https://github.com/pytorch/pytorch/issues/133174 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133371 Approved by: https://github.com/ezyang	2024-08-24 20:33:23 +00:00
Aart Bik	050aa67e41	[traced-graph][sparse] fix restrictive assert for sparse add (#134037 ) Exporting sparse addition can be CPU/Meta. This fixes the overly restrictive assert and adds an export test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134037 Approved by: https://github.com/ezyang	2024-08-24 20:26:47 +00:00
Xu Han	90fb83749e	[inductor] fix test torch package working with trace on windows (#134397 ) The temporary directory path is currently hard-coded. Fix this by getting the temporary directory path through an API. Reproduce UTs: ```cmd python test/dynamo/test_dynamic_shapes.py -v -k test_torch_package_working_with_trace_dynamic_shapes ``` Error message: ```cmd ________________________________________________________________________________________________ DynamicShapesMiscTests.test_torch_package_working_with_trace_dynamic_shapes ________________________________________________________________________________________________ Traceback (most recent call last): File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_misc.py", line 7199, in test_torch_package_working_with_trace with package.PackageExporter(path) as exp: File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\package\package_exporter.py", line 237, in __init__ self.zip_file = torch._C.PyTorchFileWriter(f) RuntimeError: Parent directory /tmp does not exist. To execute this test, run the following from the base repo dir: python test\dynamo\test_dynamic_shapes.py DynamicShapesMiscTests.test_torch_package_working_with_trace_dynamic_shapes This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ========================================================================================================================== short test summary info ========================================================================================================================== FAILED [0.0080s] test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_package_working_with_trace_dynamic_shapes - RuntimeError: Parent directory /tmp does not exist.
==================================================================================================================== 1 failed, 1665 deselected in 4.00s ===================================================================================================================== ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134397 Approved by: https://github.com/ezyang	2024-08-24 20:25:44 +00:00
Jonathan Deakin	9cd53b3212	Add Arm copyright line to LICENSE (#133982 ) Some historical commits from arm: - 2021 664126bab5f3f2a275e82b7bde127132cff7f34e - 2023 2630144786e906b40abbe017294d404bcfe3c6ae - 2024 ce6130014156fa9555ce3d16c5f9a84cbdadf8f4 See https://github.com/pytorch/pytorch/pull/126687 for initial discussion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133982 Approved by: https://github.com/malfet	2024-08-24 18:41:06 +00:00
Jonathan Deakin	50d5aa8c10	Enable optimized dynamic quantization on aarch64 (#126687 ) oneDNN+ACL has optimized kernels for s8s8 matmul, so input is signed. This change leaves behaviour on all other platforms the same. This change requires https://github.com/intel/ideep/pull/313 to go in, and oneDNN 3.5 for the optimized kernels. This change speeds up dynamic quantized linear by ~10x. Also, do you have a policy on copyright headers? Arm's usual policy when contributing to open source projects is to include a copyright header on any file which is modified. Would this be acceptable? If not, is there somewhere else suitable to note copyright? Pull Request resolved: https://github.com/pytorch/pytorch/pull/126687 Approved by: https://github.com/jgong5, https://github.com/malfet, https://github.com/snadampal Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-08-24 18:40:12 +00:00
Jack Taylor	f71c3d265a	[ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet	2024-08-24 18:26:49 +00:00
chuanqiw	6245d5b87b	[CI] Update XPU ci test python version to 3.9 (#134214 ) Works for https://github.com/pytorch/pytorch/issues/114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134214 Approved by: https://github.com/EikanWang, https://github.com/malfet	2024-08-24 18:11:36 +00:00
Yueming Hao	a63efee5cd	[inductor]Let output or input_as_strided match exact strides (#130956 ) Fixes #130394 TorchInductor doesn't respect the original strides of outputs, which opens up optimization opportunities such as changing the memory layout. But in some cases, such as the case in https://github.com/pytorch/pytorch/issues/130394, we do need the output to match the exact strides required. Correctness is the first priority. So, this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue. This PR makes the strides of non-dense outputs follow the strides required by semantics. The generated code for the test before and after this fix compares as follows. ```python @triton.jit def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 128 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x0 = xindex % 8 x1 = (xindex // 8) - x2 = xindex tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask) tmp1 = tmp0 + tmp0 - tl.store(out_ptr0 + (x2), tmp1, xmask) + tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask) def call(args): arg0_1, = args args.clear() assert_size_stride(arg0_1, (16, 8), (16, 1)) with torch.cuda._DeviceGuard(0): torch.cuda.set_device(0) - buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32) + buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32) stream0 = get_raw_stream(0) triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0) del arg0_1 return (buf1, ) ``` buf1 is now created with the exact strides required by the user, and its values are written with the same strides as the input. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956 Approved by: https://github.com/eellison, https://github.com/blaine-rister	2024-08-24 17:04:05 +00:00
Animesh Jain	cdb9df5efe	[dynamo][guards] De-dupe DUPLICATE_INPUT guard (#134354 ) Hard to write a test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134354 Approved by: https://github.com/jansel ghstack dependencies: #134272	2024-08-24 15:17:56 +00:00
David Berard	d433a603af	[BE] use torch.amp.autocast instead of torch.cuda.amp.autocast (#134291 ) torch.cuda.amp.autocast / torch.cpu.amp.autocast are deprecated and spew a ton of warnings when these tests run. This PR: Update to just use torch.amp.autocast(device). Note: this uncovers a bug in the test: when `device` is CUDA, it actually shows up as "cuda:0" - so previously, this test was _always_ using `torch.cpu.amp.autocast` even for `cuda` device. This PR fixes this, and uncovers additional bugs in `pinverse` and `linalg.pinv`; `linalg.pinv` was already failing before on CPU, but now the test also catches failures on CUDA, (and this PR adds to the skipped-test list). Pull Request resolved: https://github.com/pytorch/pytorch/pull/134291 Approved by: https://github.com/YuqingJ	2024-08-24 15:07:49 +00:00
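The test bug mentioned in the commit above (a device string such as "cuda:0" never equals "cuda") can be sketched without PyTorch. This assumes the old test compared the full device string; the helper names are hypothetical and the actual test code may differ.

```python
def device_type(device: str) -> str:
    """Strip an optional ":index" suffix, e.g. "cuda:0" -> "cuda"."""
    return device.split(":", 1)[0]


def uses_cuda_buggy(device: str) -> bool:
    # Comparing the full string misses indexed devices such as "cuda:0".
    return device == "cuda"


def uses_cuda_fixed(device: str) -> bool:
    # Comparing only the device type handles both "cuda" and "cuda:0".
    return device_type(device) == "cuda"


print(uses_cuda_buggy("cuda:0"), uses_cuda_fixed("cuda:0"))
```

Passing the device type to a single `torch.amp.autocast(...)` call, as the commit describes, removes the need for this branching altogether.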
Huanyu He	a1061009c9	[PT2] use statically_known_true in slice_noop (#134270 ) Summary: # context * when fixing the graph break in _maybe_compute_kjt_to_jt_dict, we encountered this issue P1539489731: ``` [rank0]: ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False. [rank0]: Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance. [rank0]: [rank0]: Potential framework code culprit (scroll up for full backtrace): [rank0]: File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_inductor/fx_passes/post_grad.py", line 671, in slice_noop [rank0]: if start == 0 and end >= 2**63 - 1 and step == 1: ``` Change the condition logic to be compatible with SymInt. Test Plan: # commands * run test ``` TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 2>&1 | tee -a `date +"%Y.%m.%d.%H.%M"`.`sl whereami`.log ``` * tlparse ``` ls -thl /var/tmp/tt | head -9 && tlparse `ls -t /var/tmp/tt/* | head -1` ``` Differential Revision: D61677207 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134270 Approved by: https://github.com/ezyang	2024-08-24 13:58:51 +00:00
atalman	ff77c67d16	Use ephemeral runners for linux nightly builds (#134367 ) Should be landed with https://github.com/pytorch/test-infra/pull/5590 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134367 Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/seemethere	2024-08-24 12:49:07 +00:00
Simon Fan	ff7d94c67e	[compiled autograd] fix saved tensor hook firing count (#134361 ) The SavedVariable constructor calls the pack hooks. We don't want to call them for the proxy tensor, since it is proxying a tensor that already called the pack hook during forward. Using the same fix as https://github.com/pytorch/pytorch/pull/123196 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134361 Approved by: https://github.com/jansel ghstack dependencies: #134186, #134200, #134205, #134286, #134290, #134162, #134163	2024-08-24 12:06:36 +00:00
Simon Fan	929de1d0d4	Re-enable skipped compiled autograd eager tests (#134163 ) Originally disabled in: https://github.com/pytorch/pytorch/pull/131700#discussion_r1727153445, but the failure is no longer in CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/134163 Approved by: https://github.com/soulitzer ghstack dependencies: #134186, #134200, #134205, #134286, #134290, #134162	2024-08-24 12:06:36 +00:00
Simon Fan	ad8bdfae1e	add compiled_autograd to programmatic set_logs API (#134162 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134162 Approved by: https://github.com/yf225, https://github.com/jansel ghstack dependencies: #134186, #134200, #134205, #134286, #134290	2024-08-24 12:06:36 +00:00
Simon Fan	1431663693	[compiled autograd] finish classifying tests (#134290 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134290 Approved by: https://github.com/yf225 ghstack dependencies: #134186, #134200, #134205, #134286	2024-08-24 12:06:36 +00:00
Simon Fan	0b228a2af8	[compiled autograd] match eager behavior for ctx.saved_variables (#134286 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134286 Approved by: https://github.com/jansel ghstack dependencies: #134186, #134200, #134205	2024-08-24 12:06:36 +00:00
Simon Fan	6cc57c64b2	[compiled autograd] match eager behavior for post acc grad hooks (#134205 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134205 Approved by: https://github.com/jansel ghstack dependencies: #134186, #134200	2024-08-24 12:06:36 +00:00
Simon Fan	d7a25e1d8c	[compiled autograd] add config patching for certain eager tests (#134200 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134200 Approved by: https://github.com/jansel ghstack dependencies: #134186	2024-08-24 12:06:36 +00:00
Simon Fan	0d9208a398	[compiled autograd] match eager behavior for inplace detached activations (#134186 ) Fixes `TestAutograd.test_saved_variable_saved_original_inplace_detach` when run under compiled autograd Pull Request resolved: https://github.com/pytorch/pytorch/pull/134186 Approved by: https://github.com/jansel	2024-08-24 12:06:36 +00:00
Huamin Li	ccafc93be5	[AOTI][CPU] Make int8 qlinear work (#134368 ) Summary: This diff decomposes torch.ops._quantized.wrapped_quantized_linear into torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked for AOTI, and adds the corresponding impl to the shim. It works similarly to what we did previously for fbgemm fp16 dynamic qlinear: we do constant folding for the packed weight during runtime (warm up) to achieve the speedup. Reviewed By: desertfire Differential Revision: D61396144 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134368 Approved by: https://github.com/houseroad	2024-08-24 08:25:25 +00:00
Xilun Wu	eb15b1a016	[dtensor][MTPG] make sharding prop lru cache not shared among threads (#134294 ) Summary Before this PR, the `sharding propagator` is shared among threads. As a result, the cached result of rank 0 is accessible to other ranks, e.g. rank 1, and this could lead to wrong DTensor resharding. This PR fixes it by making the cache a local variable at thread level, and it fixes `dstack` test (#126493), `inner` (https://github.com/pytorch/pytorch/issues/126852), and `vstack` (https://github.com/pytorch/pytorch/issues/126868). It also fixes `poisson_nll` (https://github.com/pytorch/pytorch/issues/131446) as a by-product. Test `pytest test/distributed/_tensor/test_dtensor_ops.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134294 Approved by: https://github.com/wz337, https://github.com/awgu	2024-08-24 05:56:45 +00:00
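The thread-level cache described in the commit above can be sketched with `threading.local`. This is a generic illustration under the assumption that each simulated rank runs in its own thread; `ThreadLocalCache` and the cached strings are hypothetical and unrelated to the real sharding propagator.

```python
import threading


class ThreadLocalCache:
    """A cache whose entries are visible only to the thread that stored them."""

    def __init__(self):
        self._local = threading.local()

    def get_or_compute(self, key, compute):
        entries = getattr(self._local, "entries", None)
        if entries is None:
            entries = self._local.entries = {}
        if key not in entries:
            entries[key] = compute()
        return entries[key]


cache = ThreadLocalCache()
results = {}


def run_rank(rank):
    # Every thread uses the same key, as every rank would for the same op.
    results[rank] = cache.get_or_compute("op", lambda: f"plan-for-rank-{rank}")


threads = [threading.Thread(target=run_rank, args=(r,)) for r in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

With a single shared dictionary, whichever thread ran first would populate the entry and the other thread would read a result computed for the wrong rank.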
Xu Han	1034f456ef	[inductor] fix munge_exc not support windows path (#134348 ) Windows file paths use `\` as the delimiter, which is also an escape character. We need to translate every `\` in a path to `/`, as on Linux. Reproduce UT: ```cmd pytest test\dynamo\test_higher_order_ops.py -v -k test_vmap_grad_vmap_guard_fail ``` Error msg: ```cmd ________________________________________________________________________________________________________ HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail _________________________________________________________________________________________________________ Traceback (most recent call last): File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\logging_utils.py", line 89, in test_fn fn(self, records) File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 2714, in test_vmap_grad_vmap_guard_fail munge_exc(record.getMessage()), File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 5252, in munge_exc s = re.sub(file, os.path.basename(file), s) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 209, in sub return _compile(pattern, flags).sub(repl, string, count) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 303, in _compile p = sre_compile.compile(pattern, flags) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_compile.py", line 788, in compile p = sre_parse.parse(p, flags) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 955, in parse p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 444, in _parse_sub itemsappend(_parse(source, state, verbose, nested + 1, File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 526, in _parse code = _escape(source, this, state) File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 370, in _escape raise
source.error("incomplete escape %s" % escape, len(escape)) re.error: incomplete escape \x at position 2 To execute this test, run the following from the base repo dir: python test\dynamo\test_higher_order_ops.py HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 --------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------- frames [('total', 2), ('ok', 2)] inductor [] inline_call [] stats [('calls_captured', 38), ('unique_graphs', 2)] --------------------------------------------------------------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------------------------------------------------------------- V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] Recompiling function fn in D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py:2699 V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] triggered by the following guard failure(s): V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] - 0/0: torch._functorch.pyfunctorch.compare_functorch_state([('Vmap', 1, 'error')]) # _dynamo\output_graph.py:479 in init_ambient_guards ========================================================================================================================== short test summary info ========================================================================================================================== FAILED [0.7452s] test/dynamo/test_higher_order_ops.py::HigherOrderOpVmapGuardTests::test_vmap_grad_vmap_guard_fail - re.error: incomplete escape \x at position 2 ``` Local test passed: <img width="860" alt="image" 
src="https://github.com/user-attachments/assets/90f0d780-0639-4c03-8d7c-6f227c93a3fc"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134348 Approved by: https://github.com/jansel	2024-08-24 05:51:35 +00:00
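The failure in the commit above comes from using a Windows path as a regular expression pattern. The sketch below reproduces it with the standard `re` module and shows one possible fix in the spirit of the commit (translating `\` to `/`); the function names are hypothetical, and the extra `re.escape` call is an assumption of this sketch, not something the commit states.

```python
import posixpath
import re

win_path = r"D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py"
message = "Recompiling function fn in " + win_path + ":2699"


def shorten_path_unsafe(s, file):
    """Use the raw path as the pattern; `\\x` is parsed as an incomplete escape."""
    return re.sub(file, posixpath.basename(file), s)


def shorten_path(s, file):
    """Normalize backslashes to forward slashes, then substitute the basename."""
    s = s.replace("\\", "/")
    file = file.replace("\\", "/")
    # re.escape guards against other regex metacharacters in the path.
    return re.sub(re.escape(file), lambda m: posixpath.basename(file), s)


try:
    shorten_path_unsafe(message, win_path)
    unsafe_error = None
except re.error as exc:
    unsafe_error = str(exc)

shortened = shorten_path(message, win_path)
print(unsafe_error)
print(shortened)
```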
Shangdi Yu	0694918aeb	[export] Temporarily bypass torch_fn in partitioner (#134292 ) Summary: "torch_fn" is not correct for the decomposed add node from batch norm. This is a temporary workaround to bypass torch fn. For example, for the graph below (test_qat_conv2d_unary graph): ``` graph(): %conv_weight : [num_users=1] = get_attr[target=conv.weight] %bn_weight : [num_users=1] = get_attr[target=bn.weight] %bn_bias : [num_users=1] = get_attr[target=bn.bias] %bn_running_mean : [num_users=1] = get_attr[target=bn.running_mean] %bn_running_var : [num_users=1] = get_attr[target=bn.running_var] %bn_num_batches_tracked : [num_users=1] = get_attr[target=bn.num_batches_tracked] %x : [num_users=1] = placeholder[target=x] %conv2d : [num_users=1] = call_function[target=torch.ops.aten.conv2d.default](args = (%x, %conv_weight, None, [1, 1], [1, 1]), kwargs = {}) %add_ : [num_users=0] = call_function[target=torch.ops.aten.add_.Tensor](args = (%bn_num_batches_tracked, 1), kwargs = {}) %batch_norm : [num_users=1] = call_function[target=torch.ops.aten.batch_norm.default](args = (%conv2d, %bn_weight, %bn_bias, %bn_running_mean, %bn_running_var, True, 0.1, 1e-05, True), kwargs = {}) %relu : [num_users=1] = call_function[target=torch.ops.aten.relu.default](args = (%batch_norm,), kwargs = {}) %max_pool2d : [num_users=1] = call_function[target=torch.ops.aten.max_pool2d.default](args = (%relu, [3, 3], [3, 3]), kwargs = {}) return (max_pool2d,) ``` the add_ node has `'torch_fn': ('add__1', 'method_descriptor.add_'),` in its meta. If we run the line below in `_annotate_qat_conv2d_bn_binary_unary`, we'll have a partition without output nodes. 
``` find_sequential_partitions( gm, [torch.nn.Conv2d, torch.nn.BatchNorm2d, operator.add, torch.nn.ReLU] ) ``` ``` partition_list [ SourcePartition(nodes=[conv_weight, conv2d], source=<class 'torch.nn.modules.conv.Conv2d'>, input_nodes=[x], output_nodes=[conv2d], params=[conv_weight]), SourcePartition(nodes=[bn_weight, bn_bias, bn_running_mean, bn_running_var, bn_num_batches_tracked, add_, batch_norm], source=<class 'torch.nn.modules.batchnorm.BatchNorm2d'>, input_nodes=[conv2d], output_nodes=[batch_norm], params=[bn_num_batches_tracked, bn_running_var, bn_bias, bn_weight, bn_running_mean]), SourcePartition(nodes=[add_], source='add_', input_nodes=[bn_num_batches_tracked], output_nodes=[], params=[]) ] ``` We should not have the last partition. Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv2d ``` Differential Revision: D61569049 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134292 Approved by: https://github.com/angelayi	2024-08-24 05:50:18 +00:00
Daniel Dale	f260cc2edf	Enable DTensor sharding propagation of `native_layer_norm_backward` to more fully accommodate optional args (#133502 ) Fixes #133499 ### The issue Testing a variety of TP `requires_grad` patterns (validating maximally flexible finetuning) revealed `DTensor` sharding propagation of `aten.native_layer_norm_backward` (default) fails with an `IndexError` for certain `requires_grad` patterns (pattern 1) (e.g. `output_mask` `[True, False, False]`) and an `AssertionError` for others (pattern 2) (e.g. output mask `[False, True, ]`). Please see issue #133499 for a full description of the observed failure patterns along with reproduction. ### Use Cases and Remediation Failure pattern 1 is potentially problematic for a variety of finetuning scenarios. Though failure pattern 2 is really an xfail right now since it's not fully supported, IMHO there are use cases (e.g. especially wrt to mechanistic interpretability research, but certain finetuning scenarios too potentially) that justify supporting this output mask (especially since supporting it is fairly straightforward I think). In this PR I propose some modest changes that: * Address the aforementioned failure modes. * Add a couple of tests that I'm hopeful will help ensure `DTensor` op dispatch (which is so well implemented and such a pleasure working with btw! 🚀 🎉) accommodates a wide variety of (potentially unanticipated) `requires_grad` patterns as it evolves. To address both failure modes, I'm proposing the following changes: 1. To [`torch.distributed._tensor.ops._math_ops.layer_norm_bwd_strategy`](`7b269cc484/torch/distributed/_tensor/ops/_math_ops.py (L873)`): - Refactor conditional `output_mask` handling such that the input and output specs in the `PlacementStrategy`s of the returned `output_strategy.strategies` list remain aligned with the `op_schema.args_spec` (whose definition does not change at runtime based upon unused optional args). 2.
To [`torch.distributed._tensor._sharding_prop.propagate_op_sharding_non_cached`](`7b269cc484/torch/distributed/_tensor/_sharding_prop.py (L256-L262)`): - When iterating through the active `op_schema.args_spec` to build the relevant `expected_input_specs` list, filter any `None` `desired_specs`. 3. To [`torch/distributed/_tensor/_op_schema.OpSchema._inplace_rewrap_schema_suggestion`](`7b269cc484/torch/distributed/_tensor/_op_schema.py (L418)`) - When inputs need a redistribute, for runtime-unrequired (`None`) arguments in the aligned `suggestion_args_schema`, ignore the associated `suggestion_args_spec`. ### Implementation considerations: - Regarding `1`, to avoid changing the op strategy return args ([`op_strategy`](`cf81180007/torch/distributed/_tensor/_sharding_prop.py (L234)`)), the change in `1` allows `None` elements to exist temporarily in `PlacementStrategy.input_specs` (treating it as `Sequence[DTensorSpec | None] | None` when it's `Sequence[DTensorSpec] | None`). This could be addressed in any number of ways but I thought it best to leave that for a subsequent PR since it could have broader ramifications (e.g. allowing op_strategies to return an `output_strategy.input_specs` mask explicitly, explicitly allowing `None`s in `PlacementStrategy.input_specs`, creating a `Null` DTensorSpec etc.). That's why I'm using an ignore arg-type directive there for now. - Regarding `2` and `3` above, I don't introspect `op_schema.op._schema.arguments` to verify any `None` arguments are `torch.OptionalType`, leaving adherence to the schema contract the responsibility of the given op. Regarding `2`, I assume any `desired_spec` will be either a `DTensorSpec` or `None`, so only `None` can be Falsy in this context.
- I considered altering the active `args_schema`, which could be inspected and aligned with the active `output_strategy.input_specs` in some cases and avoid the changes in `3`, but I think that would rely on one of (among other possibilities): - all supported op signatures having optional Tensors (`DTensorSpec`) args after required tensors (which isn't a planned requirement as far as I know), - (somewhat brittle) heuristic-driven arg alignment - only supporting kwargs etc. ### Added Tests To facilitate detection of future `requires_grad` pattern op failure modes as `DTensor` evolves, I added the following two tests: 1. `test/distributed/_tensor/test_math_ops.py DistMathOpsTest.test_layer_norm_bwd_req_grad` - Tests `native_layer_norm_backward` specifically with 20 subtests that sweep valid `output_mask` patterns across different LayerNorm dimensionality and `elementwise_affine` configurations. 2. `test/distributed/tensor/parallel/test_tp_examples.py DistTensorParallelExampleTest.test_transformer_req_grad` - Samples a subset of `requires_grad` patterns in a more realistic (relative to the `LayerNorm`-specific test) Transformer usage context with different `dtype` and `is_seq_parallel` configurations. Note since there was substantial overlap with the existing `test_transformer_training` test, I took the opportunity to refactor that test to allow relevant code-sharing. I also added an `ExpCommCounts` `NamedTuple` to facilitate the addition of further `requires_grad` patterns that we may want to test in the future which may result in different comm counts. I created the separate `requires_grad` test to allow decoupling the multi-iteration `test_transformer_training` test and allow addition of new `requires_grad` scenarios as desired while being mindful of resources. Thanks again to the PyTorch distributed team for your immensely valuable contributions to the open-source ML community!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133502 Approved by: https://github.com/XilunWu	2024-08-24 05:49:54 +00:00
Yanbo Liang	8d3c6494ae	[Inductor][FlexAttention] Rename IS_LAST_BLOCK to CHECK_BLOCK_BOUNDARY (#134378 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134378 Approved by: https://github.com/drisspg	2024-08-24 04:40:01 +00:00
Xu Han	5ad759ca33	[inductor] calibration inductor windows uts (2/N) (#134358 ) skip unsupported UTs of `test\inductor\test_compile_worker.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134358 Approved by: https://github.com/jansel	2024-08-24 04:08:59 +00:00
wz337	5ae9c01794	[DTensor] Add naive replicate strategy for aten._linalg_eigh.default (#134284 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134284 Approved by: https://github.com/awgu	2024-08-24 03:50:05 +00:00
wz337	962e1f6ca7	[DTensor] Add aten.any.default,dim,out to linear_reduction_strategy (#134206 ) For `aten.any`, we can use `reduce_op="sum"` as the linear reduction op. When we do `all_reduce` with `reduce_op="sum"` on bool tensor, if one rank returns `torch.Tensor([True]) `, then the reduction result is `torch.Tensor([True]) `. Only when all ranks return `torch.Tensor([False]) ` would the reduction result be `torch.Tensor([False]) `. This matches with `any`'s behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134206 Approved by: https://github.com/tianyu-l, https://github.com/chuanhaozhuge	2024-08-24 03:49:46 +00:00
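The argument in #134206 above (a sum reduction over per-rank boolean results behaves like `any`) can be checked with a small simulation. This is a minimal sketch in plain Python, not the DTensor code path: `all_reduce_sum_as_any` is a hypothetical helper that stands in for an `all_reduce` with `reduce_op="sum"` on a one-element bool tensor.

```python
from itertools import product


def all_reduce_sum_as_any(local_flags):
    """Simulate a sum all_reduce over one boolean result per rank."""
    # Summing the flags counts how many ranks reported True. Casting the
    # count back to bool yields True as soon as one rank contributed.
    return bool(sum(int(flag) for flag in local_flags))


# Exhaustively compare against the built-in any() for 4 simulated ranks.
for flags in product([False, True], repeat=4):
    assert all_reduce_sum_as_any(flags) == any(flags)
```

Only the all-`False` input reduces to `False`, which is the behavior the commit message describes.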
wz337	5d39b14b68	[DeviceMesh] Add DeviceMesh slicing support for flatten mesh dim (#133839 ) Add DeviceMesh slicing support such that we could do the following: ``` mesh_3d = init_device_mesh( self.device_type, (2, 2, 2), mesh_dim_names=("replicate", "shard", "cp") ) shard_cp_mesh = mesh_3d["shard", "cp"]._flatten() hsdp_mesh = mesh_3d["replicate", "shard_cp"] # we can get the corresponding group of the flatten mesh through group = shard_cp_mesh.get_group() # or group = mesh_3d["shard_cp"].get_group() # or mesh_3d.get_group(mesh_dim="shard_cp") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133839 Approved by: https://github.com/fegin ghstack dependencies: #133838	2024-08-24 03:49:29 +00:00
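To make the flattening in #133839 above concrete, the following sketch computes which ranks would share a group once the `"shard"` and `"cp"` dims of a `(2, 2, 2)` mesh are merged. It is plain Python rather than the `DeviceMesh` API, `flattened_groups` is a hypothetical helper, and it assumes ranks are laid out row-major over the mesh (as with `arange(8).reshape(2, 2, 2)`).

```python
from itertools import product


def flattened_groups(mesh_shape, dim_names, flatten_dims):
    """Return the rank groups obtained by merging `flatten_dims` into one dim."""
    # Row-major strides: the last mesh dim varies fastest (an assumption).
    strides = []
    acc = 1
    for size in reversed(mesh_shape):
        strides.insert(0, acc)
        acc *= size

    groups = {}
    for coord in product(*(range(size) for size in mesh_shape)):
        rank = sum(c * s for c, s in zip(coord, strides))
        # Ranks that agree on every dim that is NOT flattened share a group.
        key = tuple(c for c, name in zip(coord, dim_names) if name not in flatten_dims)
        groups.setdefault(key, []).append(rank)
    return list(groups.values())


groups = flattened_groups((2, 2, 2), ("replicate", "shard", "cp"), {"shard", "cp"})
print(groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Under that layout assumption, each `"replicate"` index owns one flattened `"shard_cp"` group of four ranks.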
Akash Kaothalkar	195abdb85c	ppc64le: VSX Support for Inductor (#132746 ) ### Description This PR extends the `VecISA` class to include support for VSX on the `ppc64le` architecture within the Inductor backend. This enhancement enables vectorization support, resulting in performance improvements when using `torch.compile()` on `ppc64le`. ### Fixes - Resolved the `test_acosh_with_negative_large_input` test case in `test_cpu_repro.py` by implementing `acosh` for VSX. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132746 Approved by: https://github.com/jansel	2024-08-24 03:36:09 +00:00
Sheng Fu	519342962d	Pass process group info into NcclWork (#134269 ) Summary: Pass process group info into NcclWork Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test Differential Revision: D61677160 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134269 Approved by: https://github.com/wconstab	2024-08-24 01:04:43 +00:00
Justin Chu	e2a87fb1e9	[ONNX] Update exporter logic (#134304 ) Sync the exporter logic with torch-onnx at https://github.com/justinchuby/torch-onnx/compare/v0.1.12...v0.1.15. https://github.com/pytorch/pytorch/issues/129277 - Create a `testing` module to facilitate testing model accuracy. The model is internal - Improve decomp table - Improve model verification logic - Add tests The next PRs will enable OpInfo tests and clean up existing code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134304 Approved by: https://github.com/titaiwangms	2024-08-24 00:49:54 +00:00
rzou	a1d0b4d568	Add option to skip functional passes in the pattern matcher's replacement graph (#134364 ) The pattern matcher runs DCE and remove_noop_ops on the replacement graph by default. Previously we had a switch for the DCE. This PR changes that switch to also control if we run remove_noop_ops. The context was that there is silent incorrectness with auto_functionalized. We use the Pattern matcher to decompose auto_functionalized into a mutable op + clones; remove_noop_ops were deleting the clones. Future: can try #134363 Test Plan: - new test. I wasn't able to produce a silently incorrect example so I settled for asserting that clones still exist in the post-grad graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134364 Approved by: https://github.com/eellison ghstack dependencies: #133639	2024-08-24 00:38:55 +00:00
Jason Ansel	2c8fc3f4ce	[inductor] Move imports to top of file in generated code (#134195 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134195 Approved by: https://github.com/eellison ghstack dependencies: #134194	2024-08-24 00:35:57 +00:00
Jason Ansel	1aa0e35a04	[inductor] Remove dead code in multi_kernel.py (#134194 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134194 Approved by: https://github.com/eellison	2024-08-24 00:35:57 +00:00
Yidi Wu	4ff1a4dd0f	[export] support set_grad_enabled hop in dynamo to enable re-tracing (#134281 ) As titled. We added dynamo support for the wrap_with_set_grad_enabled hop to support re-tracing an exported program. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134281 Approved by: https://github.com/tugsbayasgalan	2024-08-24 00:35:53 +00:00
drisspg	9dc47f5e62	[FlexAttention]Fix how we realize input buffers (#134351 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134351 Approved by: https://github.com/Chillee	2024-08-24 00:31:00 +00:00
Tristan Rice	4c28a0eb0b	c10d/logging: add C10D_LOCK_GUARD (#134131 ) This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s. This is motivated by some deadlocks we're seeing where it's unclear whether the cause is in NCCL or on the PyTorch side of things. This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate. Test plan: existing CI for regressions; will add unit tests on `C10D_LOCK_GUARD` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131 Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj	2024-08-24 00:27:39 +00:00
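The idea behind the lock guard in #134131 above (log when a lock cannot be acquired within a timeout instead of blocking silently) can be sketched in Python. This is an analog, not the C++ macro: `LoggingLockGuard` and its parameters are hypothetical names, and retrying after each logged timeout is an assumption of this sketch.

```python
import logging
import threading

logger = logging.getLogger("lock_guard_sketch")


class LoggingLockGuard:
    """Context manager that warns each time acquiring `lock` times out."""

    def __init__(self, lock, timeout_s=30.0, name="lock"):
        self._lock = lock
        self._timeout_s = timeout_s
        self._name = name
        self.timeouts = 0  # number of timed-out attempts, handy for tests

    def __enter__(self):
        # Lock.acquire(timeout=...) returns False when the timeout elapses.
        while not self._lock.acquire(timeout=self._timeout_s):
            self.timeouts += 1
            logger.warning(
                "could not acquire %s within %.2fs (attempt %d), still waiting",
                self._name, self._timeout_s, self.timeouts,
            )
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False  # never swallow exceptions


# Usage: the lock is held for 0.2s, so a 0.05s timeout is logged at least once.
demo_lock = threading.Lock()
demo_lock.acquire()
threading.Timer(0.2, demo_lock.release).start()
guard = LoggingLockGuard(demo_lock, timeout_s=0.05, name="demo_lock")
with guard:
    pass
```

A timed acquire is what makes the logging possible, which mirrors why the commit swaps `std::mutex` for `std::timed_mutex`.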
atalman	e52e93e8fd	Update scale-config files with linux.24xlarge.ephemeral (#134380 ) Add linux.24xlarge.ephemeral to scale config Pull Request resolved: https://github.com/pytorch/pytorch/pull/134380 Approved by: https://github.com/kit1980, https://github.com/ZainRizvi	2024-08-24 00:01:39 +00:00
Pian Pawakapan	54ff320519	[export] refactor ExportGraphSignature construction (#134059 ) Refactors construction of ExportGraphSignature object for export & training IR, explicitly creating AOTAutograd signature for training IR. This will be helpful for upcoming refactors for placeholder naming & runtime asserts prettifying. Changes: - dedups `make_argument_spec` call, moved to export/graph_signature.py - `_sig_to_specs` wrapped into new function `_convert_to_export_graph_signature`, directly converts GraphSignature -> ExportGraphSignature - `_make_fx_helper` explicitly creates AOTAutograd GraphSignature object Pull Request resolved: https://github.com/pytorch/pytorch/pull/134059 Approved by: https://github.com/angelayi, https://github.com/ydwu4	2024-08-23 23:29:28 +00:00
leslie-fang-intel	aa9f4cc733	[Inductor][CPP] Support vectorization of remainder (#129849 ) Summary When checking the vectorization status across 3 test suites, we found that some operators had vectorization disabled with the message `Disabled vectorization: op: remainder`. This PR adds vectorization support for this op. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec ``` Differential Revision: [D61147014](https://our.internmc.facebook.com/intern/diff/D61147014) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849 Approved by: https://github.com/jgong5, https://github.com/lezcano	2024-08-23 23:26:51 +00:00
fduwjj	286f2dba9f	[2/N refactor NCCLPG error logs][c10d] Make msg in monitoring thread in NCCLPG more accurate and simpler (#134036 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134036 Approved by: https://github.com/wconstab	2024-08-23 23:21:28 +00:00
Yiming Zhou	2cfc2da527	[export] Make move_to_device_pass function public (#134263 ) Summary: This is a follow-up of https://github.com/pytorch/pytorch/pull/133660 Here we make the `move_to_device_pass()` function public so users can import it via `from torch.export.passes import move_to_device_pass` Test Plan: CI Differential Revision: D61671310 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134263 Approved by: https://github.com/angelayi	2024-08-23 23:18:30 +00:00
cyyever	c638a40a93	[Caffe2] Remove unused AVX512 code (#133160 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/133160 Approved by: https://github.com/albanD	2024-08-23 23:16:16 +00:00
Xinran / Allan Rui	1f19ccb5b3	[Inductor/Triton] Customize triton codegen to optionally preserve input dtype on tl.load (#132406 ) Differential Revision: D60536337 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132406 Approved by: https://github.com/jfix71, https://github.com/blaine-rister	2024-08-23 22:58:43 +00:00
Pian Pawakapan	8ff3a5be1b	[export] basic auto dynamic shapes (#133620 ) Starter version of automatic dynamic shapes for export. Creates enums `DIM.AUTO`, `DIM.STATIC`, allowing user to specify `AUTO` for dims in dynamic_shapes specs, meaning that corresponding dims are treated as dynamic, and relevant guards will do what's necessary (e.g. refine ValueRanges, set replacements based on equality, or even set static) without raising ConstraintViolationErrors. Basically allows the user to say, "a bunch of these dims can be dynamic, let export do model analysis and return the program with maximum possible dynamism, without complaining". The usage for specifying `dynamic_shapes` is now: ``` AUTO -> dynamic by default, return whatever produce_guards() says, even if it's static None/int/STATIC -> static Dim/DerivedDim -> same as before - will complain if the min/max range is invalid, or if dims related to this are unspecified. ``` Caveat 1: specifying `AUTO` for a dim won't guarantee it'll be dynamic: - specifying `AUTO` for a dim will return the maximum possible dynamism given your program and other specified constraints, but this can still mean you'll get a static program. For example, with the program below, x is specified dynamic, but it's equal to y, which is specified static, and with how we currently do things we won't promote y to dynamic, but will demote(?) x to static. So this can be surprising if you don't fully know your model, and/or missed one of your other inputs when specifying auto-dynamic shapes. ``` class Foo(torch.nn.Module): def forward(self, x, y): return x + y inputs = (torch.randn(6), torch.randn(6)) export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": None}) ``` Caveat 2: specifying `AUTO` and Dims in the same spec is still problematic: - The way Dims/DerivedDims are currently handled is very strict. 
A Dim represents a symbol, and we require a user to specify the symbol for all dims governed by the symbol - that's why we've seen errors in the past like `The values of x must always be related to y by ...`, asking the user to specify the exact relation as in the program. We also require the specified min/max range to be a subset of the valid range from model analysis. All this doesn't compose well with specifying `AUTO` just yet - for example in the program below, ideal behavior could be to return a dynamic program, where `dx = x.size(0) = y.size(0)` has range (3,6). Unfortunately this crashes, and correct behavior is to specify `dx` for both inputs. So currently we raise a UserError and crash if both Dims + `AUTO` are present in the spec. ``` class Foo(torch.nn.Module): def forward(self, x, y): return x + y inputs = (torch.randn(6), torch.randn(6)) export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": {0: Dim("dx", min=3, max=6)}}) # this doesn't work, because x & y are related ``` Implementation details: This is done by setting `assume_static_by_default=False`, and doing a transform on the `dynamic_shapes` spec to preserve semantics. `assume_static_by_default=False` will treat unspecified dims or Nones as dynamic. This is the opposite of what `export.export()` currently does - unspecified Dims/Nones are treated as static. Historically this static-by-default behavior, where the user deals with fewer guards, has been desirable, and we would like to respect that in this implementation. So an internal spec transformation, `_transform_shapes_for_default_dynamic()`, is added to do the spec conversion necessary to be compatible with dynamic by default. Specifically, AUTOs are converted into Nones, and Nones/unspecified dims are filled in with explicitly static constraints. 
For example, this would look like, for a 3-d tensor: `{0: DIM.AUTO, 1: None, 2: Dim("dx")} -> {0: None, 1: 32, 2: Dim("dx")}` This does seem overly complicated, but it's done to preserve dynamic shapes semantics for `torch._dynamo.export()`, which already uses `assume_static_by_default=False`, and follows the same process for generating shape constraints , via `_process_dynamic_shapes`. There the semantics are: ``` None/unspecified: dynamic by default Dim/DerivedDim: also a strict assertion ``` If we don't care about BC for `_dynamo.export(dynamic_shapes)`, then we can just modify semantics for `_process_dynamic_shapes()` and change all the relevant tests in `test/dynamo/test_export.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133620 Approved by: https://github.com/avikchaudhuri	2024-08-23 22:56:39 +00:00
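The spec conversion described in #133620 above can be illustrated for a single tensor. This is a simplified sketch, not the actual `_transform_shapes_for_default_dynamic()` implementation: `DIM` and `Dim` here are local stand-ins for the export types, and `transform_for_default_dynamic` is a hypothetical name.

```python
from enum import Enum, auto


class DIM(Enum):
    """Local stand-in for the AUTO/STATIC markers named in the commit."""
    AUTO = auto()
    STATIC = auto()


class Dim:
    """Local stand-in for a named symbolic dimension."""

    def __init__(self, name):
        self.name = name


def transform_for_default_dynamic(spec, example_shape):
    """Rewrite a per-dim spec for a dynamic-by-default backend."""
    out = {}
    for index, size in enumerate(example_shape):
        entry = spec.get(index)  # unspecified dims behave like None
        if entry is DIM.AUTO:
            out[index] = None  # None means "dynamic" once defaults flip
        elif entry is None or entry is DIM.STATIC or isinstance(entry, int):
            out[index] = size  # pin to the example input's concrete size
        else:
            out[index] = entry  # Dim objects pass through unchanged
    return out


dx = Dim("dx")
result = transform_for_default_dynamic(
    {0: DIM.AUTO, 1: None, 2: dx}, example_shape=(8, 32, 16)
)
print(result[0], result[1], result[2].name)  # None 32 dx
```

With an example input whose second dim is 32, this reproduces the `{0: None, 1: 32, 2: Dim("dx")}` result quoted in the commit message.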
Angela Yi	f5a2a22dc4	[export] Fix unflattener to respect nn.Parameter requires_grad (#134353 ) Summary: Fixes P1539870235 Test Plan: CI Differential Revision: D61726403 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134353 Approved by: https://github.com/pianpwk	2024-08-23 22:49:34 +00:00
Juan Torrente	eaa2c0e009	Improves error message when passing wrong tensor type to torch.nn.functional.one_hot (#134209 ) The function expects a Tensor of type LongTensor. It currently throws the following error: "one_hot is only applicable to index tensor." which, imo, does not provide the user with enough information on what the problem is. PR simply adds extra information to the error message on this specific scenario. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134209 Approved by: https://github.com/mikaylagawarecki	2024-08-23 22:40:05 +00:00
Nikita Shulga	09a82f3d24	[EZ][BE] Delete references to non-existing `AWS_SCCACHE` secrets (#134370 ) First of all, none of the binary builds should be using sccache for security and reliability reasons (as a distributed cache can become corrupted/compromised), but even if they do, all authentication to AWS services should be done via OIDC Pull Request resolved: https://github.com/pytorch/pytorch/pull/134370 Approved by: https://github.com/seemethere, https://github.com/atalman	2024-08-23 22:23:48 +00:00
Nikita Shulga	adf0f50cc7	[Compile] Add NEON implementation for bf16->fp32 cast (#134297 ) This changes assembly generated for the following routine ```cpp void bfloat16tofloat(c10::BFloat16* in, float* out) { auto tmp0 = at::vec::Vectorized<c10::BFloat16>::loadu(in, 8); auto tmp1 = at::vec::convert<float>(tmp0); tmp1.store(out); } ``` from ```asm bfloat16tofloat(c10::BFloat16, float): 0000000000000034 stp x29, x30, [sp, #-0x10]! 0000000000000038 mov x29, sp 000000000000003c sub x9, sp, #0x90 0000000000000040 and sp, x9, #0xffffffffffffffe0 0000000000000044 mov x8, #0x0 0000000000000048 adrp x9, 0 ; 0x0 000000000000004c ldr x9, [x9] 0000000000000050 ldr x9, [x9] 0000000000000054 str x9, [sp, #0x88] 0000000000000058 stp xzr, xzr, [sp, #0x10] 000000000000005c ldr q0, [x0] 0000000000000060 str q0, [sp] 0000000000000064 ldr q1, [sp, #0x10] 0000000000000068 stp q0, q1, [sp, #0x20] 000000000000006c add x9, sp, #0x40 0000000000000070 add x10, sp, #0x20 0000000000000074 add x11, x10, x8 0000000000000078 ldp d0, d1, [x11] 000000000000007c shll.4s v0, v0, #16 0000000000000080 shll.4s v1, v1, #16 0000000000000084 stp q0, q1, [x9], #0x20 0000000000000088 add x8, x8, #0x10 000000000000008c cmp x8, #0x20 0000000000000090 b.ne 0x74 0000000000000094 add x8, sp, #0x40 0000000000000098 ld1.4s { v0, v1 }, [x8] 000000000000009c st1.4s { v0, v1 }, [x1] 00000000000000a0 ldr x8, [sp, #0x88] 00000000000000a4 adrp x9, 0 ; 0x0 00000000000000a8 ldr x9, [x9] 00000000000000ac ldr x9, [x9] 00000000000000b0 cmp x9, x8 00000000000000b4 b.ne 0xc4 00000000000000b8 mov sp, x29 00000000000000bc ldp x29, x30, [sp], #0x10 00000000000000c0 ret 00000000000000c4 bl 0xc4 ``` to ```asm bfloat16tofloat(c10::BFloat16, float): 0000000000000034 ldr q0, [x0] 0000000000000038 shll.4s v1, v0, #16 000000000000003c shll2.4s v2, v0, #16 0000000000000040 st1.4s { v1, v2 }, [x1] 0000000000000044 ret ``` And as result speeds up `python3 torchchat.py generate stories110M --num-samples 3 --compile --device cpu --dtype 
bfloat16` from 33 to 90 tokens/sec Pull Request resolved: https://github.com/pytorch/pytorch/pull/134297 Approved by: https://github.com/kimishpatel	2024-08-23 22:22:59 +00:00
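The `shll`/`shll2` instructions in the optimized assembly for #134297 above widen each 16-bit lane and shift it left by 16, which works because a bfloat16 value is the upper half of the corresponding float32 bit pattern. A scalar sketch of that bit manipulation in plain Python follows; the helper names are hypothetical, and truncation is used for the reverse direction purely to keep the example short.

```python
import struct


def bf16_bits_to_float(bits16):
    """Widen a bfloat16 bit pattern to float32 by shifting left 16 bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value


def float_to_bf16_bits(value):
    """Keep the top 16 bits of the float32 pattern (truncation, no rounding)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    return bits32 >> 16


print(bf16_bits_to_float(0x3F80))     # 1.0
print(hex(float_to_bf16_bits(-2.0)))  # 0xc000
```

The NEON version applies the same shift to all lanes of the loaded vector at once.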
Yiming Zhou	69813dbbfd	[export] Schematize nn_module_stack serialization (#134049 ) `nn_module_stack` was previously serialized to a string by adding commas between the module_path and module_type. This is error prone when the `nn_module_stack` itself contains commas. This PR fixes this by creating a dictionary to store the `nn_module_stack` and serializing it to a string via `json.dumps()` Fixes #131941 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134049 Approved by: https://github.com/angelayi	2024-08-23 21:50:01 +00:00
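The failure mode fixed by #134049 above is easy to reproduce in isolation. The sketch below contrasts comma-joined serialization with a `json.dumps()` round trip; the entry layout and the module names are made up for illustration and are not the exported-program schema.

```python
import json

# Hypothetical stack entry whose module type happens to contain a comma.
entry = {"path": "layers.0", "type": "my_pkg.Block[int, float]"}

# Comma-joined form: splitting can no longer recover the two fields.
joined = ",".join([entry["path"], entry["type"]])
parts = joined.split(",")
print(len(parts))  # 3, although only 2 fields were written

# JSON form: delimiters inside values are quoted, so the round trip is exact.
encoded = json.dumps(entry)
decoded = json.loads(encoded)
print(decoded == entry)  # True
```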
Yifu Wang	78d69bfe11	[SymmetricMemory] introduce multicast support, multimem_all_reduce_ and multimem_one_shot_all_reduce (#133424 ) ### Summary - Added multicast support to SymmetricMemory. If the cuda runtime and cuda driver have multicast support, SymmetricMemory associates all peer buffers with a multicast object and exposes the multicast virtual address. - Implemented `multimem_all_reduce_` and `multimem_one_shot_all_reduce` based on the multicast support. The two variants show different performance characteristics for different message sizes. We plan to use Inductor for collective algo selection (and required symmetric memory buffer allocation). ### Benchmark 8xH100 (non-standard version with HBM2e at 650W). NVSwitch V3 with NVLS support. ![image](https://github.com/user-attachments/assets/4998a16b-c2c0-4797-9dd0-1da2303df947) ![image](https://github.com/user-attachments/assets/278ad361-52cb-4864-82c6-bb67e8d0a3fe) Differential Revision: [D61682507](https://our.internmc.facebook.com/intern/diff/D61682507) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133424 Approved by: https://github.com/yf225, https://github.com/weifengpy	2024-08-23 20:09:20 +00:00
Qiaochu Yuan	2ca7f0fc5c	[Minimizer] for sequential mode, respect find_all setting (#134339 ) Summary: Currently, for sequential mode, minimizer search terminates after a node is excluded via the user defined exclusion_fn. However, on some occasions we would like the search to continue past that for the remaining nodes. In this diff I am changing the termination criteria to respect the find_all setting, where we continue sequential search if it is set. Test Plan: CI Differential Revision: D61720262 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134339 Approved by: https://github.com/jfix71	2024-08-23 19:59:43 +00:00
Daniel Dale	58e2cf364b	Make DTensor sharding propagation for `scaled_dot_product_efficient_attention` and `scaled_dot_product_flash_attention` more conservatively cached (#134146 ) Fixes #134050 ### The issue The current `DTensor` sharding propagation caching policy for `aten.scaled_dot_product_efficient_attention` (default) can result in silently incorrect gradients or trigger an IMA after cuda kernel launch in mixed `require_grad` configurations. Please see issue #134050 for a full description of the observed failure patterns along with reproduction. Note `aten.scaled_dot_product_flash_attention` presents a similar concern so this PR addresses both [as discussed here.](https://github.com/pytorch/pytorch/issues/134050#issuecomment-2299887602) ### Remediation While there are a number of ways this could be addressed, the most straightforward remediation is to modify the sharding propagation caching policy of [`aten._scaled_dot_product_efficient_attention.default`](`b03381cac2/torch/distributed/_tensor/ops/_matrix_ops.py (L337-L340)`), registering it with `schema_info=RuntimeSchemaInfo(4)` to prevent cache sharing between differing `compute_log_sumexp` values i.e. ```python @register_op_strategy(aten._scaled_dot_product_efficient_attention.default, schema_info=RuntimeSchemaInfo(4)) def scaled_dot_product_efficient_attention_strategy( ... ``` [As discussed here](https://github.com/pytorch/pytorch/issues/134050#issuecomment-2299887602), since `aten::_scaled_dot_product_flash_attention` could be affected by a similar issue wrt `return_debug_mask`, this PR adjusts the sharding propagation caching policy for that op as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134146 Approved by: https://github.com/tianyu-l	2024-08-23 19:43:30 +00:00
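Why registering the op with `RuntimeSchemaInfo(4)` in #134146 above prevents cache sharing can be shown with a toy cache key. This is a sketch under stated assumptions, not the DTensor implementation: `Spec` and `cache_key` are hypothetical, and the model assumed here is that tensor specs always enter the key while non-tensor arguments enter it only from index `static_argnum` onward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Hypothetical stand-in for a tensor's sharding spec."""
    placement: str


def cache_key(op_name, args, static_argnum):
    """Build a hashable key for a sharding-propagation cache (toy model)."""
    key = [op_name]
    for index, arg in enumerate(args):
        # Tensor specs always count; other args count only once the
        # static-argument region starts.
        if isinstance(arg, Spec) or index >= static_argnum:
            key.append(arg)
    return tuple(key)


q = k = v = Spec("Shard(1)")
# Positions: query, key, value, attn_bias, compute_log_sumexp
with_lse = (q, k, v, None, True)
without_lse = (q, k, v, None, False)

# A large static_argnum never reaches index 4, so both calls share one entry.
shared = cache_key("sdpa", with_lse, 100) == cache_key("sdpa", without_lse, 100)
# Starting the static region at index 4 separates the two entries.
separate = cache_key("sdpa", with_lse, 4) != cache_key("sdpa", without_lse, 4)
print(shared, separate)  # True True
```

In this toy model, calls that differ only in `compute_log_sumexp` stop reusing each other's cached result once index 4 is part of the key.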
Jesse Cai	157de30f53	[sparse] Update cuSPARSELt to v0.6.2 (#134022 ) Summary: This PR updates cuSPARSELt to v0.6.2. I think we should land https://github.com/pytorch/pytorch/pull/128534 first though. Most of this PR is just enabling tests to run when cuSPARSELt v0.6.2 is available. Unfortunately, I was running into a bug with fp32 support on Hopper, so I removed fp32 support from the cuSPARSELt backend. I think this should be fine since almost everybody uses the bfloat/float16/int8 kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134022 Approved by: https://github.com/jerryzh168, https://github.com/malfet ghstack dependencies: #128534	2024-08-23 19:34:53 +00:00
Angela Yi	74a9001ada	[aoti] Add additional custom op input type support (#132454 ) Summary: Added support for more custom op input types, now only missing dtype, layout, memory format as input type, since we need to add some more testing for mapping the types to their integer values ([previous comment](https://github.com/pytorch/pytorch/pull/126215#discussion_r1617428066)). This PR also replaces the `DynamicArg` struct's `serialized_arg_val` with `list_item_types`, which stores an optional list of strings, where each string represents the type of the value within this list. This is only used for parsing lists of optional tensors, where we need to know if a specific value in the list should be a tensor, or a None. Replacing with a list of strings is also better than storing the actual json format because then we don't need to parse the json string during the runtime, and can just loop over a preprocessed list of strings. Test Plan: `buck2 run @//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r "test_custom_"` Reviewed By: desertfire Differential Revision: D60295995 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132454 Approved by: https://github.com/desertfire	2024-08-23 19:11:36 +00:00
James Wu	f8fbfe5846	Always emit end events even on failure, use thread local storage for stack (#134279 ) Summary: We should always emit an end event in a finally block so that if a unit test or job fails, the stack is still correct. Also, we use thread local storage for the stack, so that in multithreaded scenarios the stack will still be correctly added. Test Plan: Run benchmark and see that everything still works Run ``` TORCH_LOGS=dynamo buck run test/functorch:test_aotdispatch -- -r test_backward_mutation_on_grad_out ``` With some extra logging to see that start events with the correct stack are emitted, and the end events are also emitted even though the test fails at runtime. Differential Revision: D61682556 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134279 Approved by: https://github.com/aorenste	2024-08-23 18:13:13 +00:00
Yidi Wu	a23d86c178	[hop] ban creating hop by directly instantiating HigherOrderOperator. (#133645 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133645 Approved by: https://github.com/zou3519	2024-08-23 17:28:02 +00:00
Jia Li	3546628a2a	Allow mp.start_processes to create processes in parallel (#133707 ) Summary: Background discussion in https://fb.workplace.com/groups/319878845696681/posts/1226087421742481 and the PyTorch issue filed at https://github.com/pytorch/pytorch/issues/133010. One way to fix this problem is to add an option to start processes in parallel on the PyTorch side. Test Plan: Tested the problematic aps run; processes now start in parallel (next diff) Differential Revision: D61301603 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133707 Approved by: https://github.com/d4l3k, https://github.com/ezyang	2024-08-23 17:11:20 +00:00
rzou	afd081c9d4	[inductor] Fix needs_fixed_stride_order silent incorrectness (#133639 ) Fixes #128084 The approach is option 2 of what Elias suggested in the comment thread: - We require tensors to have the correct stride at usage. This may involve a clone; if there was a clone and then a mutation into it then we copy_ back the result of the mutation. I took this approach because it was the easiest and Inductor already works really hard to remove additional clones/copy_. There are some cases that this doesn't generate efficient code for; for example, if the tensor is a view, we don't change the base of the view to have the right stride order; instead we do a clone. The view case isn't very common so I'm ignoring it for now but we could improve this in the future. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/133639 Approved by: https://github.com/eellison	2024-08-23 17:07:58 +00:00
Tristan Rice	2553278bae	.github/merge_rules.yaml: added multiprocessing to Distributed (#134262 ) This allows the Distributed team to approve changes to torch.multiprocessing which is used by torchelastic/run. Example PR: https://github.com/pytorch/pytorch/pull/133707 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134262 Approved by: https://github.com/wconstab, https://github.com/PaliC	2024-08-23 17:07:20 +00:00
Xuehai Pan	6eae569546	[dynamo][fix] always use POSIX-style path in `trace_rule.py` (#133987 ) We hardcode some paths as POSIX-style strings. This leads to different results on Windows. This PR forces all paths to be POSIX-style. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133987 Approved by: https://github.com/jansel	2024-08-23 16:28:57 +00:00
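The mismatch addressed by #133987 above arises when a hardcoded POSIX-style string is compared with a path produced using the platform's native separator. One way to normalize both sides, sketched here with the standard library only, is `PurePath.as_posix()`; `to_posix` is a hypothetical helper and may differ from what `trace_rule.py` actually does.

```python
from pathlib import PureWindowsPath


def to_posix(path_str):
    """Return `path_str` with forward slashes, whatever separator it used."""
    # PureWindowsPath accepts both "\\" and "/" as separators and does not
    # touch the filesystem, so this behaves the same on every OS.
    return PureWindowsPath(path_str).as_posix()


hardcoded = "torch/_dynamo/utils.py"
native_windows = r"torch\_dynamo\utils.py"

print(to_posix(native_windows))               # torch/_dynamo/utils.py
print(to_posix(native_windows) == hardcoded)  # True
```

One caveat of this sketch: a backslash is a legal filename character on POSIX systems, so this normalization is only appropriate for paths known not to contain literal backslashes.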
Yanbo Liang	2eef749b31	[Inductor][FlexAttention] Fix IS_DIVISIBLE bug and add unit tests (#134055 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134055 Approved by: https://github.com/Chillee	2024-08-23 16:11:09 +00:00
IvanKobzarev	8ae4f82243	[aotd] Support HOP effects in backward (#132638 ) Support of effectful operations in backward: 1/ AOTD collects metadata from forward fn only, so we can have usage of effectful ops in backward, that were not used in forward => Allowing tokens discovery during joint function . FunctionalTensorMode holds _tokens, in Joint function after tracing forward we memoize _tokens as `_tokens_forward_output`. 2/ Tokens are added as primals inputs (forward) in EffectTokensWrapper. Tokens that will be used in backward are in partitioner saved values. We do not have control on which positions they are saved in forward outputs. 2/ If new tokens are discovered in backward after tracing joint_fn, they are manually added to the result graph at the end of primals. _aot_autograd/utils.py 3/ All effectful ops during backward are marked with the 'must_be_in_backward' partitioner_tag, to prevent the partitioner from placing them in forward. For that, functional_tensor_mode got a new optional state `self._effects_partitioner_tag` for effectful ops, to set after tracing forward. There are additional changes in the partitioner to improve the functionality of 'must_be_in_backward' 4/ Unlift tokens now should run for both forward and backward. - As saved for backward tokens are placed on non static places - we identify input and output tokens to erase, by input and output of `with_effects` operation - In forward we can have input tokens, discovered in backward, that are not used in with_effects ops in forward, but saved for backward. We identify them by position in forward inputs. 5/ Adding aot debug logging for graphs before unlifting and before adding additional primal for backward tokens. Tests: ``` python test/higher_order_ops/test_with_effects.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132638 Approved by: https://github.com/bdhirsh	2024-08-23 15:30:58 +00:00
PyTorch MergeBot	7fd3b69886	Revert "[dynamo][super] Improve handling of getattr on super (#134039 )" This reverts commit 1da3a049dac3c78554506d5ef9ede55b7c2b774d. Reverted https://github.com/pytorch/pytorch/pull/134039 on behalf of https://github.com/jeanschmidt due to broke internal torchrec signals, see [D61670727](https://www.internalfb.com/diff/D61670727) ([comment](https://github.com/pytorch/pytorch/pull/134039#issuecomment-2307151643))	2024-08-23 13:57:04 +00:00
PyTorch MergeBot	09127b096c	Revert "[inductor] Fix needs_fixed_stride_order silent incorrectness (#133639 )" This reverts commit 8604c0a150b12e0ba3f9a6faaf52498370f21368. Reverted https://github.com/pytorch/pytorch/pull/133639 on behalf of https://github.com/jeanschmidt due to Broke internal fbgemm signals, see [D61670495](https://www.internalfb.com/diff/D61670495) ([comment](https://github.com/pytorch/pytorch/pull/133639#issuecomment-2307133060))	2024-08-23 13:48:04 +00:00
PyTorch MergeBot	75c22dd8bf	Revert "[dynamo][fix] always use POSIX-style path in `trace_rule.py` (#133987 )" This reverts commit b23779ef0af8d4f06e667da460c43d264359f1f0. Reverted https://github.com/pytorch/pytorch/pull/133987 on behalf of https://github.com/albanD due to This breaks windows trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/133987#issuecomment-2306956764))	2024-08-23 12:08:56 +00:00
Xuehai Pan	0e49b2f18e	[dynamo][itertools] support `itertools.tee` (#133771 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771 Approved by: https://github.com/jansel ghstack dependencies: #133769, #133778, #133779	2024-08-23 10:13:12 +00:00
Xuehai Pan	8d90392fb0	[dynamo] simplify implementation for `builtins.sum` (#133779 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779 Approved by: https://github.com/jansel ghstack dependencies: #133769, #133778	2024-08-23 10:10:19 +00:00
Xuehai Pan	6c0b15e382	[dynamo] simplify implementation for `functools.reduce` (#133778 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778 Approved by: https://github.com/jansel ghstack dependencies: #133769	2024-08-23 09:10:44 +00:00
Xuehai Pan	cc3a76edba	[dynamo] simplify polyfill registration for `builtins.all` and `builtins.any` (#133769 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769 Approved by: https://github.com/jansel	2024-08-23 09:05:24 +00:00
Su, Tong	ca3f48dd5b	[XPU] Set `make triton` install pre-built whl by default (#130313 ) Now the user could install the pre-built `triton` for xpu by calling the following: ```Bash export USE_XPU=1 make triton ``` [Dev Only]: If the user wishes to build it from the source, one could set an additional flag: ```Bash export TRITON_XPU_BUILD_FROM_SOURCE=1 export USE_XPU=1 make triton ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130313 Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/atalman	2024-08-23 07:36:34 +00:00
Luca Wehrstedt	55cdcef0f7	[fp8 rowwise] Work around CUDA Invalid Memory Access bug (#134227 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134227 Approved by: https://github.com/drisspg, https://github.com/eqy ghstack dependencies: #134223, #134224, #134225, #134226	2024-08-23 07:27:55 +00:00
Luca Wehrstedt	9d81767d43	[fp8 rowwise] Rework dispatch logic (#134226 ) It's likely a matter of opinion, but I find this new version to have less duplication, even if it might have more boilerplate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134226 Approved by: https://github.com/drisspg ghstack dependencies: #134223, #134224, #134225	2024-08-23 07:27:55 +00:00
Luca Wehrstedt	0afb4872aa	[fp8 rowwise] Support non-contiguous inputs and clarify checks (#134225 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134225 Approved by: https://github.com/drisspg ghstack dependencies: #134223, #134224	2024-08-23 07:27:52 +00:00
Luca Wehrstedt	9f8d3f511f	[fp8 rowwise] Some clean-up (#134224 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134224 Approved by: https://github.com/drisspg ghstack dependencies: #134223	2024-08-23 07:27:48 +00:00
Luca Wehrstedt	2f198605ac	[fp8 rowwise] Simplify epilogue visitor tree via common blocks (#134223 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134223 Approved by: https://github.com/drisspg	2024-08-23 07:27:41 +00:00
Xuehai Pan	25b2e46573	[dynamo] add max iterator limit while inlining generators (#134233 ) Related: - #133879 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134233 Approved by: https://github.com/jansel	2024-08-23 07:03:31 +00:00
xingyuan li	673b9bd561	[WIP] [Inductor UT] Reuse inductor UT for intel GPU `test/inductor/test_multi_kernel.py` (#133943 ) [Inductor UT] Reuse Inductor test case for Intel GPU. Reuse `test/inductor/test_multi_kernel.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133943 Approved by: https://github.com/EikanWang, https://github.com/jansel Co-authored-by: Justin Chu <justinchu@microsoft.com> Co-authored-by: Jesse Cai <jcjessecai@gmail.com> Co-authored-by: Sahdev Zala <spzala@us.ibm.com> Co-authored-by: rzou <zou3519@gmail.com> Co-authored-by: FFFrog <ljw1101.vip@gmail.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Co-authored-by: yanbing-j <yanbing.jiang@intel.com> Co-authored-by: Will Feng <yf225@cornell.edu> Co-authored-by: Bin Bao <binbao@meta.com> Co-authored-by: Yiming Zhou <yimingzhou@meta.com> Co-authored-by: Yanbo Liang <ybliang8@gmail.com>	2024-08-23 05:52:29 +00:00
Xu Han	80846caa8c	[inductor] fix dynamic size array(vla) build error on msvc v4 (#134221 ) MSVC doesn't support variable-length arrays (VLAs). Ref: https://stackoverflow.com/questions/56555406/creating-dynamic-sized-array-using-msvc-c-compiler We tried three solutions: 1. Use `std::vector` instead, in the previous PR https://github.com/pytorch/pytorch/pull/134140, but it changed the variable's type and failed UTs. 2. Use `std::unique_ptr` instead, in PR https://github.com/pytorch/pytorch/pull/134156. @jansel reviewed it and commented: https://github.com/pytorch/pytorch/pull/134156#pullrequestreview-2253091693. The concern makes sense: allocating memory may make the code run slower. 3. Use a fixed-size array instead, in PR https://github.com/pytorch/pytorch/pull/134210, but a fixed size cannot easily handle the situation where the reserved size is smaller than the number of CPUs. > a. Limiting the size with min() failed local tests: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304447729 > b. Dynamically selecting between a fixed-size and a dynamic array: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304128666 . It makes the code too complex to maintain. After discussing with the original PR (https://github.com/pytorch/pytorch/pull/115620) author @zhuhaozhe, we think: 1. MSVC is the only compiler that does not support VLAs. 2. MSVC has worse performance than other compilers, so use `std::unique_ptr` for MSVC to make it work. 3. For other compilers, keep using the current `VLA` code. 4. Windows users can use `clang-cl` or `icx` to get better performance than MSVC. 5. As discussed with @jansel, we need to move the compiler check to the Python side to make the output code cleaner. 
Reproduce UT: ```cmd pytest test/inductor/test_cpu_repro.py -v -k test_reduction_with_dynamic_threads ``` Error msg: ```cmd C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): error C2131: expression did not evaluate to a constant C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: failure was caused by a read of a variable outside its lifetime C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: see usage of 'max_threads' C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(16): error C3863: array type 'float [max_threads]' is not assignable ``` Generated code: ```c++ #include "C:/Users/Xuhan/AppData/Local/Temp/tmpt6mxcjzi/j2/cj22tgrdgh42wbunl7gdptg2lintcziox2kmr7rdbcc6n2njrhgx.h" extern "C" __declspec(dllexport) void kernel(const float* in_ptr0, const float* in_ptr1, float* out_ptr0, float* out_ptr1) { { { float tmp_acc0 = 0; at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0); int max_threads = omp_get_max_threads(); float tmp_acc0_arr[max_threads]; for (int tid = 0; tid < max_threads; tid++) { tmp_acc0_arr[tid] = 0; } at::vec::Vectorized<float> tmp_acc0_vec_arr[max_threads]; for (int tid = 0; tid < max_threads; tid++) { tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0); } #pragma omp parallel ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134221 Approved by: https://github.com/zhuhaozhe, https://github.com/jansel	2024-08-23 05:40:08 +00:00
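The `std::unique_ptr` approach described in the commit above can be sketched roughly as follows. This is only an illustration, not the code Inductor emits: the helper name `sum_with_per_thread_slots` and its parameters are hypothetical, and the per-thread slots are filled sequentially instead of under `#pragma omp parallel` to keep the sketch self-contained.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical helper (not Inductor output): accumulates `n` inputs into one
// slot per "thread", mirroring the role of tmp_acc0_arr in the kernel above.
float sum_with_per_thread_slots(const float* in, std::size_t n, int max_threads) {
    // `float acc[max_threads];` would be a VLA, which MSVC rejects (C2131).
    // A heap-backed array sized at run time compiles on every compiler;
    // make_unique<float[]> value-initializes the elements to 0.
    const std::size_t slots = static_cast<std::size_t>(max_threads);
    auto acc = std::make_unique<float[]>(slots);
    for (std::size_t i = 0; i < n; ++i) {
        acc[i % slots] += in[i];  // stand-in for per-thread accumulation
    }
    float total = 0.0f;
    for (std::size_t tid = 0; tid < slots; ++tid) {
        total += acc[tid];  // final reduction across the slots
    }
    return total;
}
```

The trade-off noted in the commit message still applies: the heap allocation may cost more than a stack array, which is why the VLA path is kept for compilers that accept it.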
Xu Han	49b9f2d8b0	[inductor] fix signbit build fail on Windows. (#134229 ) Reproduce UT: ```cmd pytest test/inductor/test_torchinductor.py -v -k test_randint_int64_mod_cpu ``` Error message: ```cmd cl : Command line warning D9025 : overriding '/openmp' with '/openmp:experimental' c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(23): error C2668: 'signbit': ambiguous call to overloaded function C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(309): note: could be 'bool signbit(float) noexcept' C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(314): note: or 'bool signbit(double) noexcept' C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(319): note: or 'bool signbit(long double) noexcept' C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(23): note: while trying to match the argument list '(__int64)' C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(24): error C2668: 'signbit': ambiguous call to overloaded function C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(309): note: could be 'bool signbit(float) noexcept' C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(314): note: or 'bool signbit(double) noexcept' C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(319): note: or 'bool signbit(long double) noexcept' C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(24): note: while trying to match the argument list '(int64_t)' ``` Generated code: ```c++ #include "C:/Users/Xuhan/AppData/Local/Temp/tmpcjnxnvkl/4f/c4ff4q4pxgo3yprbo2nkfopkt3qgi6rmptfpgpl2iylgtunvizwn.h" extern "C" __declspec(dllexport) void kernel(const 
int64_t* in_ptr0, int64_t* out_ptr0) { #pragma omp parallel num_threads(8) { int tid = omp_get_thread_num(); { #pragma omp for for(int64_t x0=static_cast<int64_t>(0LL); x0<static_cast<int64_t>(20LL); x0+=static_cast<int64_t>(1LL)) { auto tmp0 = in_ptr0[static_cast<int64_t>(0LL)]; auto tmp1 = x0; auto tmp2 = c10::convert<int32_t>(tmp1); auto tmp3 = static_cast<int64_t>(-5); auto tmp4 = static_cast<int64_t>(5); auto tmp5 = randint64_cpu(tmp0, tmp2, tmp3, tmp4); auto tmp6 = static_cast<int64_t>(10); auto tmp7 = mod(tmp5, tmp6); auto tmp8 = static_cast<int32_t>(0); auto tmp9 = tmp7 != tmp8; auto tmp10 = std::signbit(tmp7); auto tmp11 = std::signbit(tmp6); auto tmp12 = tmp10 != tmp11; auto tmp13 = tmp9 & tmp12; auto tmp14 = decltype(tmp7)(tmp7 + tmp6); auto tmp15 = tmp13 ? tmp14 : tmp7; out_ptr0[static_cast<int64_t>(x0)] = tmp15; } } } } ``` Fixed by casting the `std::signbit` argument to `long double`: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/signbit?view=msvc-170 Local test passed: <img width="848" alt="image" src="https://github.com/user-attachments/assets/e4467256-a068-40ef-a6ff-19b442e9116d"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/134229 Approved by: https://github.com/jansel	2024-08-23 05:40:05 +00:00
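A minimal sketch of the cast described in the commit above, assuming the fix amounts to converting the integer operand to `long double` before calling `std::signbit`. The helper names `signbit_i64` and `sign_corrected_mod` are hypothetical; the second one only reproduces the sign-correcting remainder pattern visible in the generated code (`tmp9` through `tmp15`).

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: an explicit cast selects the long double overload,
// so the compiler never has to choose among float/double/long double
// for an integer argument (the C2668 ambiguity reported above).
inline bool signbit_i64(std::int64_t v) {
    return std::signbit(static_cast<long double>(v));
}

// Hypothetical helper: remainder whose sign follows the divisor, matching
// the adjustment the generated kernel applies after mod().
inline std::int64_t sign_corrected_mod(std::int64_t a, std::int64_t b) {
    const std::int64_t r = a % b;  // C++ remainder takes the sign of `a`
    const bool needs_fix = (r != 0) && (signbit_i64(r) != signbit_i64(b));
    return needs_fix ? r + b : r;
}
```

For example, `sign_corrected_mod(-7, 10)` yields `3`, whereas the plain C++ expression `-7 % 10` yields `-7`.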
Huamin Li	311af3b988	Add new ops wrapped_linear_prepack and wrapped_quantized_linear_prepacked (#134232 ) Summary: This diff adds two new operators, torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked. Together they are a decomposition of the op torch.ops._quantized.wrapped_quantized_linear added in the previous diff. We decomposed it this way because the packed weight can be computed early, so we don't need to compute it in every forward pass in AOTI. Reviewed By: jerryzh168 Differential Revision: D61395887 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134232 Approved by: https://github.com/houseroad	2024-08-23 04:54:26 +00:00
Xuehai Pan	b23779ef0a	[dynamo][fix] always use POSIX-style path in `trace_rule.py` (#133987 ) We hardcode some paths as POSIX-style strings, which leads to different results on Windows. This PR forces all paths to be POSIX-style. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133987 Approved by: https://github.com/jansel	2024-08-23 04:33:05 +00:00
Animesh Jain	a699bd1155	[dynamo] Cache _dynamo.disable results (#134272 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134272 Approved by: https://github.com/yf225, https://github.com/jansel	2024-08-23 04:20:50 +00:00
Avik Chaudhuri	b454c51060	remove dynamic_dim (#134211 ) Summary: As promised in https://github.com/pytorch/pytorch/pull/134045. Test Plan: existing Differential Revision: D61646937 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134211 Approved by: https://github.com/angelayi	2024-08-23 04:13:03 +00:00
Rachel Guo	058302494c	[AOTI][Tooling] Add a test case where `config.debug_intermediate_value_printer=True` to check codegen (#133326 ) Summary: As title. Add a test case in test_aot_inductor to check for codegen (i.e. `aoti_torch_print_tensor_handle` is inserted as expected for debugging printer) for both cpu and cuda based on a simple `addmm` test model. Test Plan: ``` AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_codegen_abi_compatible_{cuda/cpu} ``` Differential Revision: D61169068 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133326 Approved by: https://github.com/ColinPeppler	2024-08-23 02:12:21 +00:00
Yanbo Liang	d2c60749ac	[Inductor][FlexAttention] Respect user's input kernel_options (#134065 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134065 Approved by: https://github.com/Chillee	2024-08-23 01:21:05 +00:00
fduwjj	8301add833	[4/N] Further refactor FR script to make it more modulized (#134196 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134196 Approved by: https://github.com/c-p-i-o	2024-08-23 01:15:29 +00:00
Shivam Raikundalia	bcfc560aea	[Profiler/CPU] Add Test for Dynamic Activity Toggling [4/n] (#134149 ) Summary: Add tests that check function events for dynamic activity toggling for both GPU and CPU events. Also added comments from previous GH comments Test Plan: Make sure all tests pass Differential Revision: D61617514 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134149 Approved by: https://github.com/aaronenyeshi	2024-08-23 01:13:42 +00:00
drisspg	bf5addb613	[FlexAttention] Enable different qk and v head-dims (#134043 ) # Summary Adds the option for the head dims to be different between QK and V tensors. Fixes issue: https://github.com/pytorch/pytorch/issues/133674 V_DIM > QK_DIM is blocked on landing https://github.com/triton-lang/triton/pull/4138 / https://github.com/triton-lang/triton/pull/4540 into PyTorch's Triton branch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134043 Approved by: https://github.com/Chillee	2024-08-23 01:06:57 +00:00
Bin Bao	7c93c4f8cf	[CI][dashboard] Change aarch64 perf run (#134265 ) Summary: Reduce the aarch64 dashboard run to only test the default config, until we solve the timeout issue. Also increase the frequency from nightly to 6 times a day, to see if we can reproduce the perf instability Nikita has observed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134265 Approved by: https://github.com/malfet	2024-08-23 00:40:28 +00:00
Animesh Jain	b3821f1da1	[dynamo][guards][logs] Generate code_parts for debugging (#134181 ) Fixes https://github.com/pytorch/pytorch/issues/132692 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134181 Approved by: https://github.com/youkaichao, https://github.com/jansel ghstack dependencies: #133742, #134016, #134039	2024-08-22 23:40:37 +00:00
Dan Johnson	edbadc904b	Do not broadcast uniqueId during a split (#133962 ) When using split, we do not need to exchange the NCCL uniqueID at all. This would avoid connecting to the TCPStore on each split operation. @exported-using-ghexport Differential Revision: [D60966980](https://our.internmc.facebook.com/intern/diff/D60966980/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133962 Approved by: https://github.com/shuqiangzhang ghstack dependencies: #133960, #133961	2024-08-22 23:23:32 +00:00
Eli Uriegas	b2eb0e8c6a	docker: Use miniforge, install from pip (#134274 ) Switch the pytorch package to be installed from our download.pytorch.org sources, which are better maintained. Also switch the miniconda installation over to a miniforge installation to ensure backwards compatibility for users who expect the conda package manager to be installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134274 Approved by: https://github.com/malfet, https://github.com/atalman Co-authored-by: atalman <atalman@fb.com>	2024-08-22 23:20:22 +00:00
Stonepia	30d7e7a1cd	[XPU] Fix patch for old llvm package error for triton xpu (#134204 ) Fixes #134199 PR #133694 adds a workaround that replaces the string `"https://tritonlang.blob.core.windows.net/llvm-builds/"` with `"https://oaitriton.blob.core.windows.net/public/llvm-builds/"` in `triton/python/setup.py`. However, in a [newer version of Triton](`06e6799f4e`), it has already been changed to `"https://oaitriton.blob.core....` and doesn't need to be replaced. Previously, this case threw a runtime error. This PR makes the `check_and_replace` logic not fail in such a scenario, so both the old link and the newer link work. Also note that `.ci/docker/common/install_triton.sh` does not need the fix, because its `sed` command has no effect if there is no such pattern. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134204 Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/atalman	2024-08-22 23:18:44 +00:00
drisspg	629bd6f718	Update FlexAttention with masking semantic (#133373 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133373 Approved by: https://github.com/yanboliang	2024-08-22 22:50:33 +00:00
fduwjj	e7929809f3	[c10d][ez] Add comments to CudaEventCache class (#134172 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134172 Approved by: https://github.com/d4l3k, https://github.com/kwen2501	2024-08-22 22:44:12 +00:00
Justin Chu	b319fa3fd9	[ONNX] Opt into ruff fmt (#134120 ) Add ONNX directory to use ruff format. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134120 Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007	2024-08-22 22:44:03 +00:00
Dan Johnson	25499de814	Remove ncclIdToCommMap_. (#133961 ) There is no purpose for this map structure, and it is incorrect in some cases. For example, when the uniqueID is not broadcasted to the other processes. @exported-using-ghexport Differential Revision: [D60966882](https://our.internmc.facebook.com/intern/diff/D60966882/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133961 Approved by: https://github.com/shuqiangzhang ghstack dependencies: #133960	2024-08-22 22:06:25 +00:00
Shangdi Yu	b0cf287b46	[export][training ir migration] Fix getitem not exist (#134259 ) Summary: Make quantization tests compatible with the new training IR. With the new batch norm node `torch.ops.aten.batch_norm.default`, we don't need an additional getitem node after the bn node, so tests need to be fixed to not check for the getitem node. We added a capture_pre_autograd_graph_using_training_ir() function, which returns True when we are using the training ir, and False otherwise. This way, the code supports both training ir and the old ir. For now, we are just rolling out the training ir for fbcode internal tests. Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_preserve_source_fn_stack buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_update_shared_qspec buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_relu_fusion buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_fusion buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_fusion_literal_args ``` Reviewed By: andrewor14, tugsbayasgalan Differential Revision: D61292102 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134259 Approved by: https://github.com/tugsbayasgalan	2024-08-22 22:00:14 +00:00
Bin Bao	f0ba309d78	[CI][dashboard] Add jemalloc back for aarch64 (#134189 ) Forward fix based on https://github.com/pytorch/pytorch/pull/133997#discussion_r1726004220 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134189 Approved by: https://github.com/malfet, https://github.com/huydhn	2024-08-22 21:08:39 +00:00
Dan Johnson	1b6bbaa016	Remove PMI dependencies in PyTorch (#133960 ) This patch makes two changes: 1. Whenever ncclCommSplit accepts groupRanks in its config, we should populate it. This is independent of using PMI or not. For example, non-PMI NCCL can also use this information, if it chooses to. 2. Provide a user flag to decide when to do a uniqueId broadcast and when to skip it. This is a performance optimization, and not a correctness requirement. If the user forgets to set this, we will do the uniqueId broadcast, which is wasteful (because it will be ignored by NCCL), but not incorrect. @exported-using-ghexport Differential Revision: [D60966774](https://our.internmc.facebook.com/intern/diff/D60966774/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133960 Approved by: https://github.com/shuqiangzhang	2024-08-22 20:34:43 +00:00
Yanbo Liang	ff61f55387	[Dynamo][autograd.Function] Supports ctx.set_materialize_grads (#133978 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133978 Approved by: https://github.com/zou3519	2024-08-22 20:06:17 +00:00
Zain Rizvi	5633773188	Convert various jobs to be Linux Foundation fleet compatible (#134128 ) Migrates a batch of workflows over to LF Pull Request resolved: https://github.com/pytorch/pytorch/pull/134128 Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt	2024-08-22 19:23:07 +00:00
Jeff Daily	0eb9c870fd	[reland][ROCm] TunableOp for gemm_and_bias (#128919 ) Reland of #128143 but added `alpha` and `bias` initialization to `launchTunableGemmAndBias` Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm. gemm_and_bias was notably missing. This PR closes that gap. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128919 Approved by: https://github.com/malfet	2024-08-22 18:27:50 +00:00
Shangdi Yu	978c5a80a0	[export][training ir migration] fix batch norm pattern match in quantization (#134157 ) Summary: In the new training ir, we produce `torch.ops.aten.batch_norm.default` instead of `torch.ops.aten._native_batch_norm_legit.default` or `torch.ops.aten._native_batch_norm_legit_no_training.default`. So we need to change the pattern match to accommodate the new op. - Add `torch.ops.aten.batch_norm.default` to pattern matcher list so it's identified as a batch norm node - `torch.ops.aten.batch_norm.default` doesn't have a getitem user anymore, so when removing the bn norm, we need to do `bn_node.replace_all_uses_with(conv_node)` instead of `getitem_node.replace_all_uses_with(conv_node)` The behavior of capture_pre_autograd_graph is consistent for each run. If the run is an fbcode test, then capture_pre_autograd_graph uses training IR. This means both _get_aten_graph_module_for_pattern and replace_pattern_with_filters see the same training IR. If the run is not an fbcode test, then both would see the old IR. 
Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d_binary2 buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d_unary buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_linear_unary buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_quant_linear buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_dynamic_quant_linear buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_flatten_recipe buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_linear_unary ``` Reviewed By: andrewor14, tugsbayasgalan Differential Revision: D61291077 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134157 Approved by: https://github.com/tugsbayasgalan	2024-08-22 18:25:45 +00:00
Animesh Jain	fee677eeb6	[fbode-testing][dynamo][reland][inline-inbuilt-nn-modules] Mark attri… (#134136 ) Shuai wants to test this internally before https://github.com/pytorch/pytorch/pull/133713 can go in. Creating a separate PR for ghmport. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134136 Approved by: https://github.com/yanboliang	2024-08-22 17:54:58 +00:00
Thanh Ha	8f7d66f0c3	Enable dynamic rollout for Linux binary workflows (#131472 ) Enables dynamic migration of jobs to the LF AWS account for binary workflows. The new runners are only given to people specified in this issue: pytorch/test-infra#5132 This closes pytorch/ci-infra#251. Depends-On: pytorch/pytorch#132870 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131472 Approved by: https://github.com/ZainRizvi	2024-08-22 17:12:50 +00:00
Aaron Orenstein	d95aedf5fd	[BE] typing for decorators - fx/_compatibility (part 1) (#134202 ) Part of #134054. This corresponds to the pytorch mypy changes from D61493706. Updating takes so long and touches so many files that it's impossible to land as a whole without conflicting with some other intermediate change. So landing these 'type: ignore' for pytorch in advance of them actually being needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134202 Approved by: https://github.com/Skylion007	2024-08-22 17:07:33 +00:00
yuqingj	44fa9f991c	[NJT] add aten.to.dtype support (#134164 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134164 Approved by: https://github.com/davidberard98	2024-08-22 16:59:38 +00:00
Xuehai Pan	b6abac68ec	[BE][dynamo] reorganize polyfill module hierarchy (#133977 ) Changes: 1. Move `polyfill.py` -> `polyfills/__init__.py`. It can be used as `polyfill.xxx` -> `polyfills.xxx`. 2. Move submodule loading from `polyfills/__init__.py` to `polyfills/loader.py`. Merge `polyfill.py` and `polyfills/` packages. Each polyfill module has its own namespace for better code organization. The ultimate goal is to make `polyfills/__init__.py` empty and move every polyfill function to its own namespace. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133977 Approved by: https://github.com/jansel	2024-08-22 16:42:29 +00:00
Xuehai Pan	c95ddd4bf2	[dynamo] ensure polyfill function has the same signature as the original function in `substitute_in_graph` (#133813 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133813 Approved by: https://github.com/jansel	2024-08-22 16:38:06 +00:00
Shangdi Yu	240467adfe	[fx] Implement deepcopy for Proxy (#133706 ) Summary: When deep-copying a proxy, we first try the default deepcopy behavior. Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r proxy_deepcopy Differential Revision: D61398418 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133706 Approved by: https://github.com/angelayi	2024-08-22 16:37:30 +00:00
PyTorch MergeBot	b0171c3920	Revert "[ONNX] Opt into ruff fmt (#134120 )" This reverts commit 0870398fa8c3e097640f31cb8a8e2e2d3e522d33. Reverted https://github.com/pytorch/pytorch/pull/134120 on behalf of https://github.com/albanD due to Breaks main branch lint ([comment](https://github.com/pytorch/pytorch/pull/134120#issuecomment-2305089756))	2024-08-22 15:48:14 +00:00
Simon Mahns	828ab84e19	Improve error msg on _lazy_init() error (#134159 ) Reviewed By: hanzlfs Differential Revision: D61627609 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134159 Approved by: https://github.com/hanzlfs	2024-08-22 15:10:50 +00:00
James Wu	3c5485fb7f	[Retry] Log chromium events to scuba (#134118 ) Summary: This diff implements a bunch of views for internal scuba viewing. TODOS that I might punt to another diff: - Saving cache stats via counter is definitely sus here, but there's not really a good way to track "fx graph cache hit for this compile phase" right now. Will think about this more. - We should definitely log frame id, compile id, etc - We should definitely be logging configs. That way, we can A/B test based on whether a config is turned on. - idk what I'm doing with compile_uuid yet, but it's useful when you want to look at samples for a single run. I think if we had mast job info this field is not needed, but it's nice to be able to drill down to a single run and get its chrome trace view or icicle view, so idk Test Plan: All of the above views are run with nanogpt benchmark: ``` buck run mode/opt caffe2/benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --performance ``` Differential Revision: D61603243 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134118 Approved by: https://github.com/oulgen	2024-08-22 14:59:45 +00:00
Isuru Fernando	1b10a5c652	Allow SymInts and SymFloats as other in div_softmax_pattern (#133989 ) Fixes https://github.com/pytorch/pytorch/issues/133759 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133989 Approved by: https://github.com/ezyang	2024-08-22 14:36:01 +00:00
Vladimir Monakhov	afc2615d33	Add proper casting to fuse_linear_bn_weights (#134105 ) As per title, this PR adds proper casting to fuse_linear_bn_weights in the same style as the conv case above. This previously caused numerical issues on my end, so that is why I am fixing it. Also cleans up the docstring. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134105 Approved by: https://github.com/mikaylagawarecki	2024-08-22 14:26:12 +00:00
yuqingj	b459ca78eb	[NJT]Add unit tests that cover the internal use cases using new NJT API (#133513 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133513 Approved by: https://github.com/davidberard98, https://github.com/soulitzer	2024-08-22 13:54:40 +00:00
PyTorch MergeBot	1a7e8e5780	Revert "Update FlexAttention with masking semantic (#133373 )" This reverts commit 5a7b544e5c3e37bea62c6a231f6230c004a33d38. Reverted https://github.com/pytorch/pytorch/pull/133373 on behalf of https://github.com/jeanschmidt due to Broke internal test/inductor signals, see D61611729 ([comment](https://github.com/pytorch/pytorch/pull/133373#issuecomment-2304714503))	2024-08-22 13:47:26 +00:00
PyTorch MergeBot	88c973005d	Revert "[FlexAttention] Enable different qk and v head-dims (#134043 )" This reverts commit e847b6bb9ba281b0db83fcdd79c328252403e9e8. Reverted https://github.com/pytorch/pytorch/pull/134043 on behalf of https://github.com/jeanschmidt due to Need to revert, in order to be able to revert https://github.com/pytorch/pytorch/pull/133373, feel free to reland this after solving conflicts ([comment](https://github.com/pytorch/pytorch/pull/134043#issuecomment-2304708996))	2024-08-22 13:44:17 +00:00
Aaron Gokaslan	83b5d449a3	Add full float16/bfloat16 support to MaxUnPool (#133774 ) It already supported half so might as well add bfloat16 support for parity Pull Request resolved: https://github.com/pytorch/pytorch/pull/133774 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-08-22 13:34:43 +00:00
Aaron Gokaslan	c9c84ae3ee	[BE][Ez]: Update CUDNN_frontend submodule to 1.6.1 (#134007 ) Update cudnn_frontend submodule to 1.6.1 to patch some minor bugfixes and compiler fixes. # Bug fix * Fixed an issue where custom dropout mask was not correctly applied. * Added -fvisibility=hidden for the pip wheels generated to avoid symbol conflicts with other modules that use cudnn frontend. * Fixed an issue in sdpa operation which when deserialized will lead to numerical mismatches. * Fixed an issue in sdpa fp8 fprop operation (in inference mode). # Samples * Added a new sample to showcase how a custom dropout mask can be applied to a sdpa operation. * Added a sample to showcase convolutions on large (c * d * h * w > 2 **31) tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134007 Approved by: https://github.com/eqy	2024-08-22 13:34:17 +00:00
Howard Huang	108a75b454	[PP] Add ZeroBubble schedule (#133467 ) Zero bubble can be expressed through `ScheduleFlexibleInterleaved1F1B` by setting `enable_zero_bubble=True`. But instead of having to include this flag in schedule initialization, we should create a separate ZeroBubbleSchedule and also transition `Interleaved1F1B` to derive from `ScheduleFlexibleInterleaved1F1B`. Then we don't need to expose `ScheduleFlexibleInterleaved1F1B`, since the naming is not obvious. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133467 Approved by: https://github.com/wconstab ghstack dependencies: #132691	2024-08-22 13:32:15 +00:00
PyTorch MergeBot	cedfac20c7	Revert "[SymmetricMemory] introduce multicast support, multimem_all_reduce_ and multimem_one_shot_all_reduce (#133424 )" This reverts commit 66d3eb783c3b3d7087988dd29bfb619b7f4306b7. Reverted https://github.com/pytorch/pytorch/pull/133424 on behalf of https://github.com/jeanschmidt due to Broke internal ADS builds, see D61611517 ([comment](https://github.com/pytorch/pytorch/pull/133424#issuecomment-2304676328))	2024-08-22 13:29:27 +00:00
Andrew Gu	592a172910	[FSDP2] Resolved strided sharding todo in clipping tests (#134152 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134152 Approved by: https://github.com/XilunWu, https://github.com/weifengpy, https://github.com/wz337	2024-08-22 12:45:13 +00:00
Jez Ng	4c645c04d8	Fix type of get_raw_stream (#134187 ) Just something I noticed while implementing a new DeviceInterface I had to add `# type: ignore[assignment]` because mypy thinks DeviceInterface.get_raw_stream is a `Callable` and therefore incompatible with a `staticmethod`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134187 Approved by: https://github.com/jansel	2024-08-22 12:00:08 +00:00
Xu Han	5fb8754434	[inductor] write cpp code with encoding utf-8 (#134027 ) Windows differs from Linux: each Windows version with a different language pack has a different code page. Inductor on Windows writes the generated cpp code using its code page, which can fail on characters that cannot be decoded. For this situation, Microsoft suggests using Unicode instead of a specific code page. Ref: https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers Changes: 1. Use `utf-8` as the encoding for cpp code. 2. Only the encoding for cpp code is changed, not for the binary type. The binary type is for AoT binary context. It works on https://github.com/pytorch/pytorch/issues/122094#issuecomment-2299592942. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134027 Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/jansel	2024-08-22 11:54:32 +00:00
Luca Wehrstedt	aea1148d56	[fp8 rowwise] Clarify dtypes (#134114 ) Disambiguate some of the dtypes (e.g., for the scales), move the "constant" ones out of the function, and use safe casting functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134114 Approved by: https://github.com/drisspg ghstack dependencies: #134110, #134111, #134112, #134113	2024-08-22 11:07:39 +00:00
Luca Wehrstedt	72586ccd14	[fp8 rowwise] Don't build separate kernel for no bias (#134113 ) CUTLASS automatically skips a stage in the epilogue if we provide a nullptr. Thus, instead of building a special kernel for bias=None, we can reuse one of the other ones. This also considerably simplifies the code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134113 Approved by: https://github.com/drisspg ghstack dependencies: #134110, #134111, #134112	2024-08-22 11:07:39 +00:00
Luca Wehrstedt	d64fa11095	[fp8 rowwise] Fix bias calculation being done in low precision (#134112 ) The compute dtype for the bias addition was set to ElementBias. Thus, for a bf16 bias, we would cast the fp32 accum to bf16 and _then_ add the bias. It is however (slightly?) more accurate to first add the bias in fp32 and only cast at the end. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134112 Approved by: https://github.com/drisspg ghstack dependencies: #134110, #134111	2024-08-22 11:07:34 +00:00
Luca Wehrstedt	15faed60ca	[fp8 rowwise] Make schedule selection more readable (#134111 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134111 Approved by: https://github.com/drisspg ghstack dependencies: #134110	2024-08-22 11:07:30 +00:00
Luca Wehrstedt	b8ea5b01c9	[fp8 rowwise] Allocate workspace as a PyTorch Tensor (#134110 ) This makes us pass through the CUDA caching allocator which is safer e.g. in case of CUDA graphs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134110 Approved by: https://github.com/drisspg	2024-08-22 11:07:26 +00:00
cyy	4c8193b8f0	[14/N] Fix clang-tidy warnings in aten/src/ATen (#132733 ) Follows #133807 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132733 Approved by: https://github.com/ezyang	2024-08-22 10:09:15 +00:00
Zitong Zhan	90c821814e	SparseCsrCUDA: cuDSS backend for linalg.solve (#129856 ) This PR switches to cuDSS library and has the same purpose of #127692, which is to add Sparse CSR tensor support to linalg.solve. Fixes #69538 Minimum example of usage: ``` import torch if __name__ == '__main__': spd = torch.rand(4, 3) A = spd.T @ spd b = torch.rand(3).to(torch.float64).cuda() A = A.to_sparse_csr().to(torch.float64).cuda() x = torch.linalg.solve(A, b) print((A @ x - b).norm()) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129856 Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/huydhn Co-authored-by: Zihang Fang <zhfang1108@gmail.com> Co-authored-by: Huy Do <huydhn@gmail.com>	2024-08-22 07:57:30 +00:00
Pearu Peterson	64cfcbd8a3	Tune _int_bsr_dense_addmm for int8 inputs on A100 (#134035 ) As in the title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134035 Approved by: https://github.com/cpuhrsch ghstack dependencies: #133855	2024-08-22 06:43:11 +00:00
Feng Yuan	b7baa062fc	Update torch-xpu-ops pin (ATen XPU implementation) (#133850 ) Bug fixes for PyTorch 2.5: 1. Use the SYCL group algorithm API instead of the old style for sub group shift utilities. 2. Add preprocessing in the reduction kernel for cases requiring a data type cast. 3. Make group norm memory format compatible. 4. ZeroTensor: a. Remove unnecessary aten operator registration, otherwise ZeroTensor processing is bypassed. b. Align preprocessing with the in-tree implementation in aten::copy_. 5. Rebase checkIndexTensorTypes usage. 6. Align with the latest semantics of PyTorch foreach operators. Return multiple tensors with offset=0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133850 Approved by: https://github.com/EikanWang	2024-08-22 06:27:03 +00:00
Yuanhao Ji	cdb9c7d228	Add support for using privateuse1 backend name in `instantiate_device_type_tests()` (#133082 ) As you can see, 'privateuse1' appears many times in out-of-tree extension codebases. I think that everything about the device type should be the same as for other in-tree backends after registering the privateuse1 backend. For example, after registering a privateuse1 backend named "foo", you should be able to pass "foo" as a valid device type. ```diff - instantiate_device_type_tests(TestIndexing, globals(), only_for='privateuse1') - instantiate_device_type_tests(NumpyTests, globals(), only_for='privateuse1') + instantiate_device_type_tests(TestIndexing, globals(), only_for='foo') + instantiate_device_type_tests(NumpyTests, globals(), only_for='foo') ``` > https://github.com/Ascend/pytorch/blob/master/test/test_indexing.py#L1654-L1655 The change is to map the privateuse1 backend name to 'privateuse1' when calling `filter_desired_device_types()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133082 Approved by: https://github.com/albanD	2024-08-22 06:17:21 +00:00
Chong Gu	24c2dd2002	Migrate fuse_chunk_reshape_concat_pass to PT2 (#134026 ) Summary: This is part of the work of dper pass migration https://fburl.com/gdoc/wxwykxns This pass has ~2.4% perf impact for adfinder_reels_ctr_model Test Plan: Still in test Differential Revision: D60789747 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134026 Approved by: https://github.com/huxintong	2024-08-22 06:13:52 +00:00
chilli	938f37b745	Added batching rule for sdpa_math, sdpa_efficient_attention forward, cudnn, and flash attention (#133964 ) Fixes https://github.com/pytorch/pytorch/issues/117016, https://github.com/pytorch/pytorch/issues/102457, https://github.com/pytorch/pytorch/issues/110525, https://github.com/pytorch/pytorch/issues/108065, Pull Request resolved: https://github.com/pytorch/pytorch/pull/133964 Approved by: https://github.com/Skylion007	2024-08-22 05:29:49 +00:00
Xu Han	e2ff094008	[inductor] calibration inductor windows uts (1/N) (#134033 ) Changes: 1. Re-open fixed UTs. 2. Mark skip reasons for failed UTs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134033 Approved by: https://github.com/jansel	2024-08-22 05:21:28 +00:00
Avik Chaudhuri	0d7ac1966a	kill sharing of constraints (#134045 ) Summary: Previously, reuse of the same `Dim` was encoded by "sharing" internal constraints among constraint targets. This kind of sharing, implemented using `shared` fields between `_Constraint`s, was originally motivated by `dynamic_dim`, specifically to support `==` between `dynamic_dim`s, but we no longer need to maintain this overcomplicated structure: we can simply use names of `Dims` to directly encode sharing information. Thus this PR vastly simplifies the structure of `_Constraint` by removing `shared` fields. As a result, both `_Constraint` and its moral subclass, `_DerivedConstraint`, are 1-1 with `Dim` and its moral subclass, `DerivedDim`. Note that this will break `==` over `dynamic_dim`, so an immediate follow-up will be to remove `dynamic_dim` entirely from our public API. (It's been more than 6 months since the deprecation warning anyway.) I just didn't want to deal with that process in the same PR. Test Plan: existing Differential Revision: D61559413 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134045 Approved by: https://github.com/pianpwk	2024-08-22 04:40:47 +00:00
Wil Kong	de06345e9b	Avoid Host & Device Sync In LR Scheduler (#133663 ) Fixes #133662. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133663 Approved by: https://github.com/janeyx99, https://github.com/eqy Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2024-08-22 03:52:43 +00:00
drisspg	e847b6bb9b	[FlexAttention] Enable different qk and v head-dims (#134043 ) # Summary Adds the option for the head dims to be different between QK and V tensors. Fixes issue: https://github.com/pytorch/pytorch/issues/133674 V_DIM > QK_DIM is blocked by landing: https://github.com/triton-lang/triton/pull/4138 / https://github.com/triton-lang/triton/pull/4540 Into PyTorch's triton branch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134043 Approved by: https://github.com/Chillee	2024-08-22 03:42:17 +00:00
Yanbo Liang	7868b65c4d	[Dynamo] Support dict.setdefault (#134083 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134083 Approved by: https://github.com/williamwen42	2024-08-22 01:57:33 +00:00
Yiming Zhou	7b20514f8e	[export] Device remapping in export (#133660 ) Implemented `move_to_device_pass()` function in `torch._export.passes`. The user has to explicitly call this method to move the exported program from one torch.device to another one. Fixes https://github.com/pytorch/pytorch/issues/121761 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133660 Approved by: https://github.com/angelayi	2024-08-22 01:03:35 +00:00
Bin Bao	df467f8746	[CI] Do not set Intel OMP for aarch64 (#133997 ) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/133997 Approved by: https://github.com/angelayi	2024-08-22 00:55:46 +00:00
Will Feng	6bddfb9546	[FSDP2] Add cache for FSDP wrapper class (#134135 ) Currently, `fully_shard` will create a new `FSDPMyModuleClass` class for each `MyModuleClass` module object, which causes Dynamo to guard-fail on every module object's type checking. This PR fixes the issue by caching and reusing previously created FSDP wrapper class. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134135 Approved by: https://github.com/awgu	2024-08-22 00:41:30 +00:00
yanbing-j	2a73ba298c	Upgrade submodule oneDNN to v3.5.3 (#131620 ) This PR upgrades submodule oneDNN to v3.5.3. ## Improvements - [experimental] Introduced [microkernel API](https://oneapi-src.github.io/oneDNN/ukernels.html) for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementation to expert users. - Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support. - Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs with hardware acceleration for fp64 math only. ## Validation results on CPU No regression was found. 1. NLP models accuracy/inference/training Model Name \| Mode Name \| Precision \| OneDNN \| Baseline \| OneDNN/Baseline -- \| -- \| -- \| -- \| -- \| -- bert-large \| realtime \| bf16 \| 192.498 \| 189.664 \| 1.014942214 bert-large \| throughput \| bf16 \| 202.424 \| 202.156 \| 1.001325709 bert-large \| train_phase2 \| bf16 \| 15.955 \| 16.029 \| 0.995383368 LCM \| throughput \| bf16 \| 1.01983 \| 1.06632 \| 0.956401455 stable-diffusion \| throughput \| bf16 \| 0.10313 \| 0.10184 \| 1.012666929 ViT \| realtime \| bf16 \| 1086.48 \| 928.43 \| 1.17023362 ViT \| throughput \| bf16 \| 1419.07 \| 1393.81 \| 1.018122987 yolov7 \| realtime \| bf16 \| 413.468682 \| 415.16503 \| 0.995914039 yolov7 \| throughput \| bf16 \| 369.697 \| 366.789 \| 1.007928264 bert-large \| realtime \| fp32 \| 46.685 \| 46.652 \| 1.000707365 bert-large \| throughput \| fp32 \| 47.766 \| 48.007 \| 0.994979899 bert-large \| train_phase2 \| fp32 \| 7.101 \| 7.104 \| 0.999577703 LCM \| throughput \| fp32 \| 0.5501 \| 0.55023 \| 0.999763735 stable-diffusion \| throughput \| fp32 \| 0.04012 \| 0.04002 \| 1.002498751 ViT \| realtime \| fp32 \| 337.27 \| 335.19 \| 1.006205436 ViT \| throughput \| fp32 \| 346.52 \| 350.08 \| 0.989830896 yolov7 \| realtime \| fp32 \| 107.138054 \| 107.242747 \| 0.999023775 yolov7 \| throughput \| 
fp32 \| 103.383 \| 104.301 \| 0.99119855 bert-large \| realtime \| int8 \| 283.541 \| 289.569 \| 0.979182855 LCM \| throughput \| int8 \| 1.09864 \| 1.08998 \| 1.0079451 stable-diffusion \| throughput \| int8 \| 0.10617 \| 0.10604 \| 1.001225952 ViT \| realtime \| int8 \| 1562.11 \| 1554.68 \| 1.004779119 ViT \| throughput \| int8 \| 1904.38 \| 1903.39 \| 1.000520125 yolov7 \| realtime \| int8 \| 540.489493 \| 539.902488 \| 1.001087243 yolov7 \| throughput \| int8 \| 499.999 \| 500.757 \| 0.998486292 Device \| Dtype \| Geomean Higher is better -- \| -- \| -- All \| all \| 101.17% All \| fp32 \| 99.83% All \| bf16 \| 102.24% All \| int8 \| 99.91% All \| fp16 \| 103.61% SPR \| all \| 100.54% SPR \| fp32 \| 99.82% SPR \|bf16 \| 101.78% SPR \|int8 \| 99.90% GNR \| all \| 101.58% GNR \| fp32 \| 99.85% GNR \| bf16 \| 102.66% GNR \| int8 \| 99.93% GNR \| fp16 \| 103.61% 2. Torchbench cpu userbenchmark inference & training Perf_Geomean \| Ratio (oneDNN/baseline) -- \| -- eager_throughtput_bf16_infer \| 1.00x eager_throughtput_fp32_infer \| 1.00x jit_llga_throughtput_amp_bf16 \| 1.00x jit_llga_throughtput_fp32 \| 1.00x eager_throughtput_fx_int8 \| 0.99x eager_throughtput_bf16_train \| 1.01x eager_throughtput_fp32_train \| 1.00x 3. Inductor quantization Static quant: Perf_Geomean \| Ratio (oneDNN/baseline) -- \| -- PTQ \| 1.00x PTQ_CPP_WRAPPER \| 1.00x QAT \| 1.00x ACC_Geomean \| Ratio (oneDNN/baseline) -- \| -- PTQ \| 1.00x PTQ_CPP_WRAPPER \| 1.00x QAT \| 1.00x Dynamic quant: \| Ratio (oneDNN/baseline) -- \| -- Performance \| 1.04x Accuracy \| 1.00x 4. 
Dynamo benchmarks GEOMEAN summary ![image](https://github.com/user-attachments/assets/82fc4b76-50f6-4f06-9ba9-034b932f1158) FP32 Static shape, default wrapper ![image](https://github.com/user-attachments/assets/9335268e-3e99-426b-91f8-f9df90a2007c) FP32 Dynamic shape, default wrapper ![image](https://github.com/user-attachments/assets/e7cf3f4f-2a62-4b58-9461-5e5ba254d822) AMP Static shape, default wrapper ![image](https://github.com/user-attachments/assets/12392c88-e44f-4c95-904a-4fa5fc9f34a2) AMP Dynamic shape, default wrapper ![image](https://github.com/user-attachments/assets/13930b0d-9bb2-46de-9ecb-5d2585d5c2f6) ## Validation results on XPU Category \| Eager \| Inductor -- \| -- \| -- huggingface_amp_fp16_training \| 1.002456 \| 0.999998 huggingface_bfloat16_inference \| 1.005386 \| 1.003511 huggingface_float32_training \| 1.002533 \| 1.003098 torchbench_amp_fp16_training \| 1.009065 \| 1.01323 torchbench_bfloat16_inference \| 1.003371 \| 1.001534 torchbench_float32_training \| 1.012102 \| 1.011596 timm_models_amp_fp16_training \| 1.005511 \| 1.010329 timm_models_bfloat16_inference \| 1.000935 \| 1.000538 timm_models_float32_training \| 0.991873 \| 0.99721 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131620 Approved by: https://github.com/jgong5, https://github.com/malfet	2024-08-21 23:40:02 +00:00
Nikita Shulga	5f0bd98767	Increase max total number of dynamo partitions to 15 (#134153 ) Needed to be able to split some of the aarch64 workflows to 15 shards Pull Request resolved: https://github.com/pytorch/pytorch/pull/134153 Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/ZainRizvi	2024-08-21 23:10:12 +00:00
FFFrog	a5ef04a3b8	add relevant function (#133946 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133946 Approved by: https://github.com/ezyang	2024-08-21 23:04:59 +00:00
rzou	8604c0a150	[inductor] Fix needs_fixed_stride_order silent incorrectness (#133639 ) Fixes #128084 The approach is option 2 of what Elias suggested in the comment thread: - We require tensors to have the correct stride at usage. This may involve a clone; if there was a clone and then a mutation into it then we copy_ back the result of the mutation. The reason why I went this approach was because it was the easiest and Inductor already works really hard to remove additional clones/copy_. There are some cases that this doesn't generate efficient code for; for example, if the tensor is a view, we don't change the base of the view to have the right stride order, instead we do a clone. The view case isn't very common so I'm ignoring it for now but we could improve this in the future. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/133639 Approved by: https://github.com/eellison	2024-08-21 22:54:16 +00:00
Sahdev Zala	d2204d4f0f	Remove skip ci recommendation (#134134 ) Using `skip ci` is no longer a recommended practice. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134134 Approved by: https://github.com/soulitzer	2024-08-21 22:42:25 +00:00
Jesse Cai	255cd75a97	[sparse] Add cuSPARSELt as a backend (#128534 ) Summary: This PR adds in cuSPARSELt as a backend to PyTorch. It is now possible to see if cuSPARSELt is available and the version if it is with ``` torch.backends.cusparselt.is_available() torch.backends.cusparselt.version() ``` Test Plan: ``` python test/test_sparse_semi_structured.py -k test_cusparselt_backend ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/128534 Approved by: https://github.com/cpuhrsch, https://github.com/eqy, https://github.com/syed-ahmed	2024-08-21 22:06:07 +00:00
Justin Chu	0870398fa8	[ONNX] Opt into ruff fmt (#134120 ) Add ONNX directory to use ruff format. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134120 Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007	2024-08-21 21:43:55 +00:00
Gufan Yin	96dfe95ed0	Fix DDPLoadBalancingPlanner docstring (#134044 ) Summary: 1. Indentation in chunk function was wrong. 1. The previous logic missed a level of zip. This diff uses the idiom in python zip doc to do chunking https://docs.python.org/3/library/functions.html#zip Test Plan: Run the docstring locally Differential Revision: D61548758 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134044 Approved by: https://github.com/fegin	2024-08-21 21:28:22 +00:00
Bin Bao	5d5a45dc85	[CI][dashboard] Collect Export pass rate separately (#134076 ) Summary: Collect Export pass rate separately when running AOTInduction, so that we can have a better isolated signal. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134076 Approved by: https://github.com/angelayi	2024-08-21 21:18:55 +00:00
Nikita Shulga	b3eef3deaf	Triple number of shards for aarch64 cpu inductor tests (#134123 ) Let's see if this will work. Alas, other than linting I can only test it after it lands Pull Request resolved: https://github.com/pytorch/pytorch/pull/134123 Approved by: https://github.com/clee2000	2024-08-21 20:52:23 +00:00
Pearu Peterson	345578afb4	Add int8 support to bsr_dense_addmm and bsr_dense_mm Triton kernels (#133855 ) As in the title. In addition, the PR introduces `_int_bsr_dense_addmm` that is equivalent to `bsr_dense_addmm` except for int8 inputs the operation result is int32 tensor (similar to existing `_int_mm`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/133855 Approved by: https://github.com/cpuhrsch	2024-08-21 20:44:40 +00:00
Pavel Belevich	a3e1416c05	Fix out_tensor device in diag_test.py (#134020 ) This benchmark fails if device='cuda' but out_tensor is on cpu Pull Request resolved: https://github.com/pytorch/pytorch/pull/134020 Approved by: https://github.com/soulitzer	2024-08-21 20:43:39 +00:00
Animesh Jain	6c1e2d2462	[easy] Force inline_inbuilt_nn_modules to remove divergence (#134122 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134122 Approved by: https://github.com/williamwen42, https://github.com/mlazos	2024-08-21 20:42:15 +00:00
Valentin Andrei	865facda44	[pytorch] Remove thread naming when torch is imported (#134066 ) Fixes #133690 The naming was added in #121170 to allow performance debugging of latency critical threads. However the `pt_main_thread` name gets inherited every time a new process or thread is created from the parent one, which defeats the purpose. We need a better way to name the thread that launches kernels on accelerators but for the time being we can let users name the threads in the application code, using: `torch.multiprocessing._set_thread_name("insert_name")` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134066 Approved by: https://github.com/soulitzer, https://github.com/d4l3k	2024-08-21 20:34:35 +00:00
PyTorch MergeBot	1491a61769	Revert "[hop] ban creating hop by directly instantiating HigherOrderOperator. (#133645 )" This reverts commit 696107efcb83f9359aa669ab343c2cfa2a111372. Reverted https://github.com/pytorch/pytorch/pull/133645 on behalf of https://github.com/ydwu4 due to breaking ci. probably due to land race ([comment](https://github.com/pytorch/pytorch/pull/133645#issuecomment-2302866106))	2024-08-21 19:33:14 +00:00
Shangdi Yu	5fcfccefc6	[export] Migrate `capture_pre_autograd_graph` to `_export_for_training` (#132815 ) Summary: as title Test Plan: CI Differential Revision: D60860909 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132815 Approved by: https://github.com/tugsbayasgalan	2024-08-21 19:00:41 +00:00
Nikita Shulga	18aaceb7be	Update conda-env-iOS.txt (#134068 ) Followup after https://github.com/pytorch/pytorch/pull/133814 To fix periodic build failures update `typing-extensions` to 4.11.0, as 4.10 is missing in conda Pull Request resolved: https://github.com/pytorch/pytorch/pull/134068 Approved by: https://github.com/wdvr, https://github.com/atalman, https://github.com/Skylion007	2024-08-21 18:47:14 +00:00
David Berard	84b3f1900a	C++ network flow implementation in c10 (#132188 ) The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm). And I'm told that there's interest to remove the python dependency. So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness. Differential Revision: [D61550977](https://our.internmc.facebook.com/intern/diff/D61550977) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188 Approved by: https://github.com/Chillee	2024-08-21 18:40:54 +00:00
Sahdev Zala	05304f59f0	[Doc] Fix typo in `torch/fx/passes/README.md` (#134078 ) Fix typo, `utis` to `utils`, in the utility name. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134078 Approved by: https://github.com/soulitzer, https://github.com/malfet	2024-08-21 18:35:50 +00:00
Edward Z. Yang	32e057636c	Enable scribe environment for compile-time benchmarks if requested. (#133891 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/133891 Approved by: https://github.com/malfet	2024-08-21 18:02:54 +00:00
atalman	750d68ff70	Use amazon linux2 for Docker builds, fix build-docker-conda condition (#134116 ) 1. Switches failing jobs to Amazon Linux 2: - CUDA, CPU, ROCM jobs are failing 2. Fix trigger condition for build-docker-conda to be the same as manywheel and libtorch Pull Request resolved: https://github.com/pytorch/pytorch/pull/134116 Approved by: https://github.com/ZainRizvi, https://github.com/nWEIdia	2024-08-21 18:01:16 +00:00
Yidi Wu	696107efcb	[hop] ban creating hop by directly instantiating HigherOrderOperator. (#133645 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133645 Approved by: https://github.com/zou3519 ghstack dependencies: #133521	2024-08-21 17:34:21 +00:00
Yidi Wu	6835f20d20	[HOP] support generating schema for hop (#133521 ) Add a way of generating a FunctionSchema from example values because hop's schema varies even for the same hop. We didn't use torch._C.FunctionSchema because we cannot construct the classes directly (e.g. "__init__" cannot be used for torch._C.FunctionSchema). Also extending the Basic types in c++ seems not that easy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133521 Approved by: https://github.com/zou3519	2024-08-21 17:34:21 +00:00
Xintong Hu	dd5a7c8397	[PT2] Add a pass to convert stack to unsqueeze cat (#133966 ) Summary: so that we can optimize with `fuse_chunk_reshape_unsqueeze_concat_pass` Test Plan: new UT Reviewed By: frank-wei Differential Revision: D61220221 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133966 Approved by: https://github.com/frank-wei	2024-08-21 17:31:26 +00:00
Animesh Jain	1da3a049da	[dynamo][super] Improve handling of getattr on super (#134039 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134039 Approved by: https://github.com/yanboliang ghstack dependencies: #133742, #134016	2024-08-21 16:50:35 +00:00
Zhengxu Chen	3ef1cc8583	[export] Implement common_getitem_elimination pass. (#133618 ) Summary: In export, we generate many redundant getitem nodes branching from the same source, inserted by runtime assertions or other passes. This causes issues for any downstream system that relies on every value being uniquely defined by a single node. I don't think it hurts to remove a bunch of getitem nodes only, so I just added it to the ctor. Test Plan: rebase on D61256937 ``` buck2 run scripts/bearzx:pt2_export_playground ``` Differential Revision: D61351578 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133618 Approved by: https://github.com/tugsbayasgalan	2024-08-21 16:48:24 +00:00
PyTorch MergeBot	2db28a9611	Revert "[BE]: Update Typeguard to TypeIs for better type inference (#133814 )" This reverts commit bce0caba7804b0787684dbf1f4e1c4d9e3acded5. Reverted https://github.com/pytorch/pytorch/pull/133814 on behalf of https://github.com/ezyang due to root cause of internal failures not addressed ([comment](https://github.com/pytorch/pytorch/pull/133814#issuecomment-2302466444))	2024-08-21 16:13:34 +00:00
IvanKobzarev	57625bacea	[partitioner] Fix must_be_in_backward corner cases (#134002 ) Preparation PR for https://github.com/pytorch/pytorch/pull/132638 "must_be_in_backward" fails the partitioner if the partitioner picks this node as one of the saved_values. The fix is to prevent the partitioner from picking those nodes during node classification. It's hard to write a test without making effectful ops in backward "must_be_in_backward", which will be tested by https://github.com/pytorch/pytorch/pull/132638. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134002 Approved by: https://github.com/bdhirsh ghstack dependencies: #134003	2024-08-21 15:58:49 +00:00
PyTorch MergeBot	68425e68fe	Revert "[dynamo][reland][inline-inbuilt-nn-modules] Mark attributes of nn mod… (#133714 )" This reverts commit e8d3c4be3629582294b5944754009fae60f42f6d. Reverted https://github.com/pytorch/pytorch/pull/133714 on behalf of https://github.com/anijain2305 due to fails internally ([comment](https://github.com/pytorch/pytorch/pull/133714#issuecomment-2302171472))	2024-08-21 14:21:06 +00:00
ooooo	32e052e468	[docs] improve `torch.stack` example code to be reproducible (#133857 ) Improve the sample code so that it produces the expected results when copied and executed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133857 Approved by: https://github.com/soulitzer	2024-08-21 14:07:02 +00:00
blazej-smorawski	585c049fa3	Fix `Extension` attribute name in `CppExtension` example (#134046 ) Hi! It seems there's a typo in `CppExtension` example. I think it should say `extra_link_args` instead of `extra_link_flags`. Not that I spent a few hours debugging missing kernels inside a library's fatbin or anything :D. Please see `Extension` definition inside setuptools: `ebddeb36f7/setuptools/_distutils/extension.py (L62)` Thanks! Błażej Pull Request resolved: https://github.com/pytorch/pytorch/pull/134046 Approved by: https://github.com/soulitzer	2024-08-21 13:58:16 +00:00
Aaron Gokaslan	afaa5fcecb	[BE][Ez]: FURB142,FURB92 misc preview fixes (#133880 ) Fixes some miscellaneous code quality issues with some refurb rules that have not been enabled yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133880 Approved by: https://github.com/soulitzer, https://github.com/malfet	2024-08-21 13:54:51 +00:00
rzou	683609c631	Skip cpp_extension test internally (#134011 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134011 Approved by: https://github.com/masnesral	2024-08-21 13:51:05 +00:00
Howard Huang	4b1fb3b0ed	[PP] pt-native input/weight grad split (#132691 ) Add `stage_backward_input` and `stage_backward_weight` functions to perform the weight updates for inputs and weights independently. We still support `self.dw_builder` argument for a custom backward, but it has become optional. It takes a separate code path and cannot be used in conjunction with the native zero backward. Added tests: `python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble` `python test/distributed/pipelining/test_backward.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132691 Approved by: https://github.com/wconstab	2024-08-21 13:37:54 +00:00
leslie-fang-intel	2bffbe06bd	[Inductor][CPP] Support vectorization of load_seed and randn (#130317 ) Summary Enable the vectorization of `load_seed` and `randn`. For now, `randn` is using the reference implementation. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_randn ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130317 Approved by: https://github.com/jgong5 ghstack dependencies: #122961	2024-08-21 13:20:43 +00:00
leslie-fang-intel	313bc11963	[inductor][cpp] complete vectorization for int32/int64 (#122961 ) Summary Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: https://github.com/pytorch/pytorch/pull/122961#issuecomment-2041336265 by disabling vectorization of specific (Fused) scheduler Node: - Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization. - Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122961 Approved by: https://github.com/jansel Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>	2024-08-21 13:12:38 +00:00
Xuehai Pan	539be0a769	[dynamo] support `ClassMethodDescriptorType` (#133862 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133862 Approved by: https://github.com/jansel	2024-08-21 12:56:19 +00:00
Animesh Jain	0d79f67a25	[dynamo][exception] Support raise exception from None (#134028 ) Fixes https://github.com/pytorch/pytorch/issues/132362 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134028 Approved by: https://github.com/yanboliang	2024-08-21 12:48:35 +00:00
Animesh Jain	bd0db490bf	[dynamo][set] Fix EQUALS_MATCH guard for constant sets and lists (#134016 ) Fixes https://github.com/pytorch/pytorch/issues/133509 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134016 Approved by: https://github.com/laithsakka, https://github.com/jansel ghstack dependencies: #133742	2024-08-21 12:41:52 +00:00
Xuehai Pan	c929e1e11f	[dynamo] fix polyfill for user defined constructor `__new__` (#133822 ) In `cls->tp_call`, if `cls->tp_new` does not return an instance of class `cls`, then `cls->tp_init` is not called on the new instance. Related PR: - #132977 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133822 Approved by: https://github.com/jansel	2024-08-21 12:41:19 +00:00
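The rule described in the entry above is observable from plain Python, without Dynamo. A minimal sketch (the class names are made up for illustration): when `__new__` returns something that is not an instance of the class, `__init__` is skipped.

```python
class Meters:
    """__new__ returns a plain float, so __init__ is never run."""

    init_calls = 0

    def __new__(cls, value):
        return float(value)  # not an instance of Meters

    def __init__(self, value):
        Meters.init_calls += 1


class Point:
    """__new__ returns an instance of cls, so __init__ runs as usual."""

    init_calls = 0

    def __new__(cls, x):
        return super().__new__(cls)

    def __init__(self, x):
        Point.init_calls += 1
        self.x = x


m = Meters(3)  # a float; Meters.__init__ was skipped
p = Point(3)   # a Point; Point.__init__ ran once
```

A polyfill for user-defined constructors has to reproduce this check to match what the interpreter does.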
Michael Lazos	695291be2f	Fix test flakiness due to not resetting state (#134058 ) Fixes https://github.com/pytorch/pytorch/issues/133994 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134058 Approved by: https://github.com/yanboliang	2024-08-21 11:54:08 +00:00
IvanKobzarev	30dc6338c1	[effects] Prevent inductor dtype promotions for HOP effects tokens (#134003 ) Preparation for https://github.com/pytorch/pytorch/pull/132638 and https://github.com/pytorch/pytorch/pull/132755 Inductor promotes argument dtypes to the highest dtype; as a result, the additional token tensor argument with float32 dtype incurred dtype promotions for lower types, e.g. int32. The solution is to use the lowest dtype for tokens: torch.bool. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134003 Approved by: https://github.com/zou3519, https://github.com/bdhirsh	2024-08-21 11:42:10 +00:00
xinan.lin	19eb14493a	[Inductor] Moves intermediary tensors which are constructed on the cpu to XPU when safe, align with CUDA. (#132843 ) [Inductor] Moves intermediary tensors which are constructed on the cpu to XPU when safe, align with CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132843 Approved by: https://github.com/EikanWang, https://github.com/eellison ghstack dependencies: #132740, #132748	2024-08-21 11:28:09 +00:00
xinan.lin	6535f11259	[Inductor] Support _check_triton_bf16_support on XPU. (#132748 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132748 Approved by: https://github.com/EikanWang, https://github.com/eellison ghstack dependencies: #132740	2024-08-21 11:28:09 +00:00
xinan.lin	c2e2602ecd	[Inductor] Move `GPU_TYPE`(The runtime available gpu type, cuda or xpu) from (#132740 ) Move GPU_TYPE (the runtime available gpu type, cuda or xpu) from `testing/_internal/inductor_utils.py` to `_inductor/utils.py`, so that we can use it in Inductor and not only in test cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132740 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-08-21 11:18:00 +00:00
Huamin Li	3d8db41337	Add new op wrapped_quantized_linear (#134024 ) Summary: This diff adds a new operator wrapped_quantized_linear (torch.ops._quantized.wrapped_quantized_linear) that takes the following input arguments: input (in fp32), input_scale, input_zero_point, weight (in fp32), weight_scale, weight_zero_point, bias (in fp32), output_scale, output_zero_point, and out_channel. It does the following: 1. Use quantize_per_tensor(input, input_scale, input_zero_point) to quantize the input tensor to int8 2. Use quantized::linear_prepack(weight, weight_scale, weight_zero_point, bias) to pack the weight and bias 3. Use quantized::linear to perform int8 quantized linear 4. dequantize This new op is essentially a wrapper of multiple ops. We do this because torch.export cannot handle models that use the old quantize APIs. Reviewed By: jerryzh168 Differential Revision: D61377266 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134024 Approved by: https://github.com/houseroad	2024-08-21 09:26:58 +00:00
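The four steps listed in the entry above can be followed with plain arithmetic. The sketch below is a pure-Python reference on lists, not the operator itself: the helper names are hypothetical, the signed int8 range for both input and weight is an assumption, and weight packing is reduced to quantizing each weight row.

```python
def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Affine per-tensor quantization: q = clamp(round(x / scale) + zero_point)."""
    return [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map quantized integers back to real values."""
    return [(q - zero_point) * scale for q in q_values]


def wrapped_quantized_linear_sketch(
    x, x_scale, x_zp, weight, w_scale, w_zp, bias, out_scale, out_zp
):
    # Step 1: quantize the fp32 input.
    qx = quantize(x, x_scale, x_zp)
    out = []
    for row, b in zip(weight, bias):
        # Step 2 (simplified): quantize one output channel of the fp32 weight.
        qw = quantize(row, w_scale, w_zp)
        # Step 3: accumulate in integers, rescale to a real value, add the bias.
        acc = sum((a - x_zp) * (w - w_zp) for a, w in zip(qx, qw))
        out.append(acc * x_scale * w_scale + b)
    # Requantize to the output parameters, then step 4: dequantize.
    return dequantize(quantize(out, out_scale, out_zp), out_scale, out_zp)


result = wrapped_quantized_linear_sketch(
    [1.0, 2.0], 0.5, 0, [[1.0, -1.0]], 0.5, 0, [0.5], 0.25, 0
)
```

Here the inputs were chosen so that every value is exactly representable at the given scales, so the result equals the fp32 answer `1*1 + 2*(-1) + 0.5 = -0.5`; with other scales the output would carry quantization error.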
Xuehai Pan	022cd7c9aa	[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712 ) Add decorator `torch.compiler.substitute_in_graph` to register polyfill for unsupported C++ function to avoid graph break. This API provides an official way to add support for dynamo for third-party C extensions. Also, it can be used to simplify our implementation for `torch._dynamo.polyfill`. `5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)` Example: ```python >>> import operator >>> operator.indexOf([1, 2, 3, 4, 5], 3) 2 >>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3) Unsupported: ... >>> @torch.compiler.substitute_in_graph(operator.indexOf) ... def indexOf(sequence, x): ... for i, item in enumerate(sequence): ... if item is x or item == x: ... return i ... raise ValueError("sequence.index(x): x not in sequence") >>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3) 2 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712 Approved by: https://github.com/jansel	2024-08-21 06:36:41 +00:00
Deep Shah	843fdf81c2	Fix a getenv segfault due to a race (#133744 ) Summary: * TLDR: `getenv` is not thread safe w.r.t `setenv`. Environment variables are kept as a per-process "dictionary" by libc. `setenv` can essentially realloc the whole thing and move this list to a completely different location. If there is a concurrent `getenv` happening at the same time, it might end up reading stale memory and segfault. `getenv` is thread safe w.r.t other `getenv`. * Details: Inside PTD init: ``` ProcessGroupNCCL ctor ... ncclCommWatchdogThread_ = std::thread(&ProcessGroupNCCL::ncclCommWatchdog, this); (https://fburl.com/code/terf9ai7) ``` Inside ncclCommWatchdog thread: ``` ... ncclHeartbeatMonitorThread_ = std::thread(&ProcessGroupNCCL::heartbeatMonitor, this); (https://fburl.com/code/fv9camg2) ... ``` Inside heartbeatMonitor thread: ``` ... std::optional<DumpPipe> dumpPipe = std::nullopt; (https://fburl.com/code/qdvahzbu) dumpPipe.emplace(rank_); ... ``` Inside DumpPipe ctor (https://fburl.com/code/wvixlqcz) ``` getCvarString getenv <=== SIGSEGV ``` On the main thread: We go on to initialize NCCL: Inside getNCCLComm, we call: `getNcclVersion` -> `initEnv` (https://fburl.com/code/j312pccu) `initEnv` inside NCCL does this: `initEnv` -> `setEnvFile` This reads the /etc/nccl.conf file and sets the values of env variables with "setenv" (https://fburl.com/code/cq4r0y0h) This "setenv" can race with "getenv" in the heartbeatMonitor thread. Ideally, all `setenv` calls should be done by a single thread before launching other threads. This diff moves `getNcclVersion` before launching the watchdog thread to make sure all setenvs are done beforehand. I think we are just getting lucky that we are not hitting it in production. IIRC we in fact saw a getenv segfault once in one of the large scale runs, but I don't remember the details now. 
Test Plan: A lot of testing done as part of D61411062 & CI Differential Revision: D61421292 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133744 Approved by: https://github.com/wconstab, https://github.com/fduwjj	2024-08-21 06:27:31 +00:00
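The ordering that the fix above relies on (finish every environment write on one thread, then start the threads that read) can be illustrated with a Python analogue. This is not the C++ change from the PR; the function names and the `DEMO_DUMP_PIPE` variable are invented for the example.

```python
import os
import threading


def load_env_file(lines):
    """Apply KEY=VALUE settings to the process environment (mutates os.environ)."""
    for line in lines:
        key, _, value = line.partition("=")
        os.environ[key.strip()] = value.strip()


def monitor(results):
    """Background thread that only reads the environment."""
    results.append(os.environ.get("DEMO_DUMP_PIPE", ""))


# 1. Do every environment write on the main thread first ...
load_env_file(["DEMO_DUMP_PIPE=/tmp/demo_pipe"])

# 2. ... and only then start threads that read it.
results = []
worker = threading.Thread(target=monitor, args=(results,))
worker.start()
worker.join()
```

Because no write can happen after the reader thread exists, there is no window in which a read overlaps a reallocation of the environment block.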
Nicolas Macchioni	af664882dd	Safely infer device type + docstrings + tests (#133668 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133668 Approved by: https://github.com/eellison	2024-08-21 05:27:31 +00:00
fduwjj	b39ec7fbe9	[1/N] Make NCCL PG error messages more accurate and simpler (#134017 ) We did a thorough review of all the error messages we log inside PGNCCL, and we want to make the log messages simpler and more accurate. This is the first PR in this effort. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134017 Approved by: https://github.com/wconstab	2024-08-21 05:21:24 +00:00
Yifu Wang	66d3eb783c	[SymmetricMemory] introduce multicast support, multimem_all_reduce_ and multimem_one_shot_all_reduce (#133424 ) ### Summary - Added multicast support to SymmetricMemory. If the cuda runtime and cuda driver have multicast support, SymmetricMemory associates all peer buffers with a multicast object and exposes the multicast virtual address. - Implemented `multimem_all_reduce_` and `multimem_one_shot_all_reduce` based on the multicast support. The two variants show different performance characteristics for different message sizes. We plan to use Inductor for collective algo selection (and required symmetric memory buffer allocation). ### Benchmark 8xH100 (non-standard version with HBM2e at 650W). NVSwitch V3 with NVLS support. ![image](https://github.com/user-attachments/assets/4998a16b-c2c0-4797-9dd0-1da2303df947) ![image](https://github.com/user-attachments/assets/278ad361-52cb-4864-82c6-bb67e8d0a3fe) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133424 Approved by: https://github.com/yf225, https://github.com/weifengpy	2024-08-21 05:11:21 +00:00
Shangdi Yu	8337b4d96e	[training ir migration] Fix ReorderConvertTest (#134010 ) Summary: Change ReorderConvertTest to work with the new `capture_pre_autograd_graph` implementation using D61175223. Note that now `ReorderConvertTest` doesn't work with the old `capture_pre_autograd_graph` anymore. Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/passes/tests:optimize_test -- -r ReorderConvertTest ``` Differential Revision: D61507772 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134010 Approved by: https://github.com/tugsbayasgalan	2024-08-21 04:48:43 +00:00
Justin Chu	e8fc1e0118	[ONNX] New export logic leveraging ExportedProgram and ONNX IR (#132530 ) 1/n PR to - Move code from torch-onnx from commit `395495e566` into torch.onnx and fixes imports. - Integrate the new export logic with the torch.onnx.export API and include basic set of tests. - Refactor the API for the change. - Improve documentation. Next PRs will be more tests and docs. Fix https://github.com/pytorch/pytorch/issues/129277 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132530 Approved by: https://github.com/titaiwangms, https://github.com/malfet	2024-08-21 01:08:42 +00:00
Sahdev Zala	06cc2e83f0	Make optim.swa.util content accessible from the torch.optim doc (#133393 ) Link various classes and functions of the `optim.swa.util` to make doc content accessible from the `torch.optim` doc. Currently, if you click the link, https://pytorch.org/docs/stable/optim.html#module-torch.optim.swa_utils it goes to a blank, bottom of the page section of `torch.optim`. Also, `torch.optim.swa_utils.AveragedModel` and `torch.optim.swa_utils.SWALR` classes as well as `torch.optim.swa_utils.update_bn()` and `optim.swa_utils.get_ema_multi_avg_fn` are not linked to doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133393 Approved by: https://github.com/janeyx99	2024-08-21 00:43:46 +00:00
Nikita Shulga	d1abd6241a	[CI][BE] Update retry action to v3.0.0 (#119403 ) To reduce the number of ``` Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20 ``` Finally can land this one as all nodes have been migrated to AmazonLinux2023 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119403 Approved by: https://github.com/clee2000, https://github.com/Skylion007	2024-08-20 23:56:37 +00:00
leslie-fang-intel	c42ac54d9e	[inductor] prune unused constants in graph scheduling (#132208 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132208 Approved by: https://github.com/leslie-fang-intel Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>	2024-08-20 23:40:11 +00:00
quanta42	5f3d22a609	Avoid GPU syncs by reusing Pre-allocated Zero Tensor (#128069 ) This commit improves the FullyShardedDataParallel (FSDP) class in PyTorch by reducing unnecessary GPU synchronizations by reusing a pre-allocated zero tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128069 Approved by: https://github.com/awgu	2024-08-20 22:51:33 +00:00
drisspg	5a7b544e5c	Update FlexAttention with masking semantic (#133373 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133373 Approved by: https://github.com/yanboliang	2024-08-20 22:38:10 +00:00
Yanbo Liang	bc785c2d9a	[Inductor][FlexAttention] Don't trigger dynamic shape on building empty block mask (#133836 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133836 Approved by: https://github.com/Chillee	2024-08-20 22:36:53 +00:00
Nikita Shulga	f7c1f32803	Fix partially initialized module error (#134019 ) https://github.com/pytorch/pytorch/pull/132990 introduced a dependency on `torch.version`, which might not be imported yet, and can result in `AttributeError: partially initialized module 'torch' has no attribute 'version' (most likely due to a circular import)` if the user's code starts with `import torch.cuda`. Fix it by importing `torch.version` explicitly. Test Plan: CI Differential Revision: D61549284 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134019 Approved by: https://github.com/seemethere	2024-08-20 22:20:02 +00:00
Sherlock Huang	41fab40be7	[report_exportability] Avoid re-exporting duplicated modules (#133930 ) Summary: Skip re-exporting modules with the duplicated types to speed up the exportability tests. In real models, there are many duplicated modules, and mostly have the same export issues. Test Plan: Existing CI Differential Revision: D61504630 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133930 Approved by: https://github.com/angelayi	2024-08-20 22:11:57 +00:00
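The deduplication idea in the entry above can be sketched as a filter that keeps only the first module of each type. This is a hypothetical helper on stand-in classes, not the code from the PR.

```python
def unique_module_types(named_modules):
    """Yield (name, module) for the first module of each type only."""
    seen_types = set()
    for name, module in named_modules:
        if type(module) in seen_types:
            continue  # same type already checked; assumed to share its export issues
        seen_types.add(type(module))
        yield name, module


class Linear: ...  # stand-in for a real layer type


class ReLU: ...  # stand-in for a real layer type


modules = [("fc1", Linear()), ("act1", ReLU()), ("fc2", Linear()), ("act2", ReLU())]
to_export = [name for name, _ in unique_module_types(modules)]
```

Only `fc1` and `act1` would be exported in this sketch; the trade-off is that two modules of the same type but different configuration are treated as equivalent.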
Animesh Jain	1ae5d5bb62	[dynamo][user-defined] Improve getattr_static for user_defined objects (#133742 ) Fixes https://github.com/pytorch/pytorch/issues/133607 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133742 Approved by: https://github.com/Skylion007, https://github.com/jansel	2024-08-20 21:51:03 +00:00
atalman	a36739f36a	Cherry-Picking don't resolve conflicts (#134047 ) During cherry-picking we want to use the default settings and fail if there is a merge conflict. Here is an example of an invalid conflict resolution: https://github.com/pytorch/pytorch/pull/131194 and cherry-pick https://github.com/pytorch/pytorch/pull/133590 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134047 Approved by: https://github.com/kit1980	2024-08-20 21:48:05 +00:00
krzysztofjordan	2e1830c7c8	Implement 2D version of masked_select for nestedtensors (#133889 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133889 Approved by: https://github.com/soulitzer	2024-08-20 21:46:32 +00:00
PyTorch MergeBot	15b5a0b67f	Revert "[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712 )" This reverts commit 71dd52f51a05d110c06e83f74cef165f64627842. Reverted https://github.com/pytorch/pytorch/pull/133712 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))	2024-08-20 21:14:45 +00:00
PyTorch MergeBot	88ead0afc6	Revert "[dynamo] simplify polyfill registration for `builtins.all` and `builtins.any` (#133769 )" This reverts commit 178e8563b8a44243a6f69f3d257d9a3aab71b2c5. Reverted https://github.com/pytorch/pytorch/pull/133769 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))	2024-08-20 21:14:45 +00:00
PyTorch MergeBot	3fa874abbe	Revert "[dynamo] simplify implementation for `functools.reduce` (#133778 )" This reverts commit 37b4bc60a4ec65858044983a36577912fb9b4651. Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))	2024-08-20 21:14:45 +00:00
PyTorch MergeBot	98e6a1d8ff	Revert "[dynamo] simplify implementation for `builtins.sum` (#133779 )" This reverts commit 3f58a8051a92470dbd254859322a7eb085a8f243. Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))	2024-08-20 21:14:44 +00:00
PyTorch MergeBot	2540ee372a	Revert "[dynamo][itertools] support `itertools.tee` (#133771 )" This reverts commit 28ce3c0227830c78c0b5d4ec592f5c3879bc61a3. Reverted https://github.com/pytorch/pytorch/pull/133771 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))	2024-08-20 21:14:44 +00:00
Justin Chu	ccc0aa69ce	[ONNX] Remove torch.onnx._export (#133824 ) - Remove the deprecated torch.onnx._export function - Remove test/onnx/test_export_modes.py because export modes are no longer supported Pull Request resolved: https://github.com/pytorch/pytorch/pull/133824 Approved by: https://github.com/titaiwangms	2024-08-20 20:54:48 +00:00
Xuehai Pan	b03381cac2	[dynamo] support `cls.__flags__` (#133970 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133970 Approved by: https://github.com/jansel ghstack dependencies: #133969	2024-08-20 20:03:31 +00:00
Xuehai Pan	5229b52bf2	[dynamo] support `cls.__base__` (#133969 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133969 Approved by: https://github.com/jansel	2024-08-20 20:03:31 +00:00
David Berard	bb0bf09aff	[easy] skip test_sdpa_autocast on windows (#134009 ) test is failing because torch.compile doesn't work on windows Pull Request resolved: https://github.com/pytorch/pytorch/pull/134009 Approved by: https://github.com/YuqingJ, https://github.com/Skylion007, https://github.com/ZainRizvi	2024-08-20 19:51:55 +00:00
Xuehai Pan	28ce3c0227	[dynamo][itertools] support `itertools.tee` (#133771 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771 Approved by: https://github.com/jansel ghstack dependencies: #133712, #133769, #133778, #133779	2024-08-20 19:48:57 +00:00
Xuehai Pan	3f58a8051a	[dynamo] simplify implementation for `builtins.sum` (#133779 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779 Approved by: https://github.com/jansel ghstack dependencies: #133712, #133769, #133778	2024-08-20 19:48:57 +00:00
Xuehai Pan	37b4bc60a4	[dynamo] simplify implementation for `functools.reduce` (#133778 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778 Approved by: https://github.com/jansel ghstack dependencies: #133712, #133769	2024-08-20 19:48:57 +00:00
Xuehai Pan	178e8563b8	[dynamo] simplify polyfill registration for `builtins.all` and `builtins.any` (#133769 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769 Approved by: https://github.com/jansel ghstack dependencies: #133712	2024-08-20 19:48:57 +00:00
Xuehai Pan	71dd52f51a	[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712 ) Add decorator `torch.compiler.substitute_in_graph` to register polyfill for unsupported C++ function to avoid graph break. This API provides an official way to add support for dynamo for third-party C extensions. Also, it can be used to simplify our implementation for `torch._dynamo.polyfill`. `5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)` Example: ```python >>> import operator >>> operator.indexOf([1, 2, 3, 4, 5], 3) 2 >>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3) Unsupported: ... >>> @torch.compiler.substitute_in_graph(operator.indexOf) ... def indexOf(sequence, x): ... for i, item in enumerate(sequence): ... if item is x or item == x: ... return i ... raise ValueError("sequence.index(x): x not in sequence") >>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3) 2 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712 Approved by: https://github.com/jansel	2024-08-20 19:48:57 +00:00
wz337	49430bfd5c	[DeviceMesh] Add a _MeshEnv attr to record the mapping of flatten mesh_dim_name to its mesh dim index in root mesh (#133838 ) ``` # suppose we have a 3d mesh mesh_3d = init_device_mesh("cuda", (2,2,2), mesh_dim_names=("dp", "cp", "tp")) dp_cp_mesh = mesh_3d["dp", "cp"]._flatten() """ then we would have flatten_name_to_root_dims[mesh_3d]: { "dp_cp": (0, 1) } """ ``` We need this information to validate the order of a mesh slice that includes a flattened mesh dim. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133838 Approved by: https://github.com/fegin	2024-08-20 19:43:45 +00:00
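The mapping shown in the entry above can be reproduced without a device mesh. The helper below is a hypothetical stand-in that works on dim-name tuples only, and joining the names with `_` is an assumption inferred from the `"dp_cp"` key in the example.

```python
def flatten_name_to_root_dims(root_dim_names, flattened_dim_names):
    """Map the flattened mesh's name to the root-mesh dim indices it covers."""
    flat_name = "_".join(flattened_dim_names)
    root_dims = tuple(root_dim_names.index(name) for name in flattened_dim_names)
    return {flat_name: root_dims}


mapping = flatten_name_to_root_dims(("dp", "cp", "tp"), ("dp", "cp"))
```

With the indices recorded, a later slice such as `("dp_cp", "tp")` can be checked for ordering by comparing `(0, 1)` against the index of `"tp"` in the root mesh.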
Zain Rizvi	c188d419db	[BE] [EZ] Allow linux-build workflows to run on the default runner type (#133640 ) Replace usage of `runner` with the new `runner_prefix` input, which allows the workflows to use the default runner type (linux.2xlarge) specified by the reusable workflow. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133640 Approved by: https://github.com/clee2000, https://github.com/jeanschmidt, https://github.com/malfet	2024-08-20 19:37:14 +00:00
Colin Peppler	81a822ddc9	Back out "[1/N] Fix clang-tidy warnings in inductor (#131979 )" (#133922 ) Summary: Original commit changeset: cc9392e5fce2 Original Phabricator Diff: D60464909 Differential Revision: D61501052 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133922 Approved by: https://github.com/22quinn	2024-08-20 19:16:29 +00:00
PyTorch MergeBot	49f6ea6dd9	Revert "[report_exportability] Avoid re-exporting duplicated modules (#133930 )" This reverts commit 278bc985d71f1ee09a499fba2ea5032b7baf2567. Reverted https://github.com/pytorch/pytorch/pull/133930 on behalf of https://github.com/izaitsevfb due to breaks lint ([comment](https://github.com/pytorch/pytorch/pull/133930#issuecomment-2299513046))	2024-08-20 18:44:09 +00:00
Roy Hvaara	43f78bf37a	[MPS] Gather sliced inputs to batch norm (#133610 ) This PR removes the `executeGatherOp` flag from batch norm in favor of relying on the logic in `4aa66f68a8/aten/src/ATen/native/mps/OperationUtils.mm (L372)` to decide if gathering is necessary. It's not the most efficient way to solve this issue, but it assures correctness for sliced inputs. ### Performance impact #### With fix ``` python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)" 100 loops, best of 5: 282 usec per loop python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])" 100 loops, best of 5: 448 usec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)" 1000 loops, best of 5: 705 usec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])" 1000 loops, best of 5: 1.11 msec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)" 1000 loops, best of 5: 7.16 msec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])" 1000 loops, best of 5: 11.7 msec per loop ``` #### Without fix ``` python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)" 100 loops, best of 5: 284 usec per loop python -m timeit -n 100 -s "import torch; import torch.nn as 
nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])" 100 loops, best of 5: 265 usec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)" 1000 loops, best of 5: 715 usec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])" 1000 loops, best of 5: 675 usec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)" 1000 loops, best of 5: 7.19 msec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])" 1000 loops, best of 5: 7.13 msec per loop ``` Please feel free to push back or request changes. Fixes #133520 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133610 Approved by: https://github.com/malfet	2024-08-20 18:24:48 +00:00
Sherlock Huang	278bc985d7	[report_exportability] Avoid re-exporting duplicated modules (#133930 ) Summary: Skip re-exporting modules with the duplicated types to speed up the exportability tests. In real models, there are many duplicated modules, and mostly have the same export issues. Test Plan: Existing CI Differential Revision: D61504630 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133930 Approved by: https://github.com/angelayi Co-authored-by: bearzx <bearzx@fb.com>	2024-08-20 18:20:49 +00:00
Wei Wang	333890b701	Enable CUDA 12.4.1 (#132202 ) Trying to keep a record of the steps before I lose track of it. - 1st Commit: Similar to https://github.com/pytorch/builder/pull/1720 - 2nd Commit: Update CUDA 12.4 CI CUDA versions from 12.4.0 to 12.4.1 mapping to changes in https://github.com/pytorch/pytorch/pull/125944/files - 3rd Commit: update for aarch64 install_cuda_aarch64.sh docker step - 4th Commit: `aaa456e3e6` Related https://github.com/pytorch/pytorch/pull/121684 - Synchronization point: Meta helps uploading pypi cuda dependencies specified in .github/scripts/generate_binary_build_matrix.py - The above pypi upload is done (thanks Andrey!), restarted jobs like https://github.com/pytorch/pytorch/actions/runs/10188203670/job/28369471321 - `77532344e3`, use temporary docker containers (generated from a previous successful container build). If merged, these containers would be rebuilt, therefore testing them now. (5th commit) - 6th commit `5f93c625b5`: revert the 5th commit. Update, done but have to debug seemingly irrelevant failures (rocm/xpu/mps) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132202 Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/atalman	2024-08-20 17:52:50 +00:00
fduwjj	e41b520ee3	[3/N] Refactor FR script - Add a processor module (#133933 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133933 Approved by: https://github.com/c-p-i-o ghstack dependencies: #133927, #133929	2024-08-20 17:36:49 +00:00
Aaron Gokaslan	bce0caba78	[BE]: Update Typeguard to TypeIs for better type inference (#133814 ) Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/133814 Approved by: https://github.com/ezyang	2024-08-20 17:19:57 +00:00
Xu Han	fbf3fc2a30	[inductor] Use int64_t as index type for all platforms 4 (#133892 ) This PR is parallel to https://github.com/pytorch/pytorch/pull/133819 and adds changes addressing @jansel 's comments. 1. For `torch/_inductor/codegen/cpp_wrapper_cpu.py`, revert to the original code that appends LL on MacOS and Windows: `bdc14ad89a` 2. For `torch/_inductor/codegen/cpp_utils.py`, append LL on MacOS and Windows for large constants, and fix its UTs: `3a56b76ce0` ------------------------------ Another solution for https://github.com/pytorch/pytorch/pull/133615: use `int64_t` as the index type for all platforms. ### Development notes: The mentioned PR ( https://github.com/pytorch/pytorch/pull/133615) fixes the index type not matching the `parse_arg` argument types. As reviewed with @jansel , Jason thinks we need to unify `INDEX_TYPE` across all platforms. The current code is cumbersome: ```python INDEX_TYPE = "int64_t" if _IS_WINDOWS else "long" ``` So I made some attempts to unify `INDEX_TYPE` as `long` or `int64_t`. Using `long` as the index type: https://github.com/pytorch/pytorch/pull/133768 Using `int64_t` as the index type: https://github.com/pytorch/pytorch/pull/133782 After that, we discussed which type to select as the final solution. ![image](https://github.com/user-attachments/assets/b23fa577-2d40-4bd6-b934-fb7994fe0bb0) The `long` type has a different definition and size on different OSs and compilers. So @jansel decided that we need to select `int64_t` for all platforms, and I continued my work based on https://github.com/pytorch/pytorch/pull/133782. https://github.com/pytorch/pytorch/pull/133782 still has two issues: 1. std::min/std::max could not match function instances by argument types. This was fixed and validated in PR: https://github.com/pytorch/pytorch/pull/133812 2. Cuda TestMemoryPlanning::test_cpp_wrapper fails because of a wrong index type. It is fixed in this PR. So this PR contains the final solution. ### Changes: 1. 
Use `int64_t` as the index type for all OSs: `Windows`, `Linux` and `MacOS`. 2. Use static_cast<int64_t>(`constant`) to convert constants passed to `div_floor_integer` to its argument type (`int64_t`). 3. Update the `parse_arg` function signature to `int64_t`, which follows the index type. 4. Append double L (`LL`) to constants on Windows and MacOS, because their int64_t is long long. 5. Fix the `std::min/std::max` type mismatch by static_cast to `INDEX_TYPE`. 6. Fix UTs, including cuda `TestMemoryPlanning::test_cpp_wrapper` and `test_indexing.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133892 Approved by: https://github.com/jansel	2024-08-20 16:54:12 +00:00
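Change 4 in the entry above amounts to choosing a literal suffix per platform when emitting C++ constants. A minimal sketch of that choice follows; the function name is hypothetical, the platform strings mirror `sys.platform` values by assumption, and the `L` suffix for other platforms assumes `int64_t` is `long` there.

```python
def cpp_int64_literal(value: int, platform: str) -> str:
    """Format an integer constant so the C++ compiler types it as int64_t."""
    # Per the PR text, int64_t is `long long` on Windows and macOS, so constants
    # get an `LL` suffix there; elsewhere `L` (long) is assumed to match int64_t.
    suffix = "LL" if platform in ("win32", "darwin") else "L"
    return f"{value}{suffix}"
```

Matching the suffix to the platform's `int64_t` keeps calls such as `std::min(index, constant)` from seeing two different integer types, which is the mismatch that change 5 otherwise fixes with `static_cast`.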
Xu Han	3caf3baabb	[inductor] enable inductor backend for dynamo on Windows. (#133921 ) Changes: Enable inductor backend for dynamo on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133921 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-08-20 16:46:19 +00:00
cyy	c3d02fa390	[Reland2] Update NVTX to NVTX3 (#109843 ) Another attempt to update NVTX to NVTX3. We now avoid changing NVTX header inclusion of existing code. The advantage of NVTX3 over NVTX is that it is a header-only library, so linking with NVTX3 can greatly simplify our CMake and other build scripts for finding libraries in user environments. In addition, NVTX is indeed still present in the latest CUDA versions, but it is no longer a compiled library: it is now a header-only library. That's why there isn't a .lib file anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843 Approved by: https://github.com/peterbell10, https://github.com/eqy Co-authored-by: Ivan Zaitsev <108101595+izaitsevfb@users.noreply.github.com>	2024-08-20 16:33:26 +00:00
Animesh Jain	33f1ee036e	[dynamo][user-defined] Simplify call_hasattr (#133935 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133935 Approved by: https://github.com/williamwen42, https://github.com/jansel ghstack dependencies: #133745, #133747, #133746, #133799, #133800	2024-08-20 16:27:44 +00:00
cyy	8d93fe510e	Remove NestedTensorFactories.h (#133809 ) Since it has no code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133809 Approved by: https://github.com/ezyang	2024-08-20 16:16:30 +00:00
Aaron Orenstein	187d55018a	[BE] Fix MYPY issues (#133872 ) Fix some mypy issues that have crept in to the trunk. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133872 Approved by: https://github.com/oulgen, https://github.com/Skylion007	2024-08-20 16:12:04 +00:00
Sam Larsen	52dfe99dbf	Skip test_custom_op_add_abi_compatible_cpu_with_stack_allocation internally (#133704 ) Summary: This test is segfaulting internally. Skip for now so we can get the internal tests green. Differential Revision: D61399618 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133704 Approved by: https://github.com/desertfire	2024-08-20 16:01:39 +00:00
PyTorch MergeBot	3a2f7192c3	Revert "return state dict without optimized module (#132626 )" This reverts commit e37eef8a7bd5915fa2961d688fd8b02df5cc5fd7. Reverted https://github.com/pytorch/pytorch/pull/132626 on behalf of https://github.com/ZainRizvi due to Sorry but it seems like this PR broke trunk. distributed/checkpoint/test_state_dict.py::TestStateDict::test_fsdp2 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10458281674/job/28969008325) [HUD commit link](`da69a28c6f`) ([comment](https://github.com/pytorch/pytorch/pull/132626#issuecomment-2299190664))	2024-08-20 15:54:54 +00:00
Nikita Shulga	f2b57d8831	Fix `torch._C` submodules population (#133919 ) This fixes regression introduced by https://github.com/pytorch/pytorch/pull/132216 that on some Python runtimes failed with ``` > from torch._C._dynamo.guards import GlobalStateGuard E ModuleNotFoundError: No module named 'torch._C._dynamo.guards'; 'torch._C._dynamo' is not a package c:\users\malfet\git\pytorch\torch\_dynamo\convert_frame.py:28: ModuleNotFoundError ``` Simplify it by always registering submodules by its primary name and do not try to add submodules which are not part of the same namespace as parent. Otherwise module can be registered by alias, rather than by primary name. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133919 Approved by: https://github.com/atalman, https://github.com/izaitsevfb, https://github.com/XuehaiPan, https://github.com/albanD, https://github.com/Skylion007	2024-08-20 15:38:32 +00:00
Shangdi Yu	b02695d65f	[export] training ir migration, fix export_rle_model (#133937 ) Summary: - exir.capture + to_edge is deprecated. We need to use the export + to_edge. - Fix quantization pass to be compatible with the new export IR. In the quantization pass, some nodes might have side-effects, so they don't have users, but still are not removed by the DCE pass. We need to consider it. - now export_rle_model works with the default `capture_pre_autograd_graph`, it should also work with the new training IR Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/export:export_rle_model -- -r export_rle_model ``` Differential Revision: D61485834 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133937 Approved by: https://github.com/tugsbayasgalan	2024-08-20 15:35:25 +00:00
chuanqiw	6590f4fb0e	[CD] Enable python 3.13 for xpu nightly build (#133670 ) Enable python 3.13 for XPU nightly build, it depends on https://github.com/pytorch/pytorch/pull/133454 land. Also update the xpu nightly wheel test env. Works for https://github.com/pytorch/pytorch/issues/114850 Fixes #130543 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133670 Approved by: https://github.com/atalman, https://github.com/malfet	2024-08-20 15:05:20 +00:00
fduwjj	36376efd06	[2/N] Refactor FR script - add a loader module (#133929 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133929 Approved by: https://github.com/c-p-i-o ghstack dependencies: #133927	2024-08-20 14:27:40 +00:00
PyTorch MergeBot	2bd02e0c82	Revert "[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712 )" This reverts commit 641724ed1daad1e6fc2525cc6858d199e576d5cd. Reverted https://github.com/pytorch/pytorch/pull/133712 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests - reverting them all, so we can identify the culprit with more calmness ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2298528797))	2024-08-20 10:34:41 +00:00
PyTorch MergeBot	91fd270535	Revert "[dynamo] simplify polyfill registration for `builtins.all` and `builtins.any` (#133769 )" This reverts commit 59ca56e56ca3e2f6dd80db57079725cf61f06810. Reverted https://github.com/pytorch/pytorch/pull/133769 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests - reverting them all, so we can identify the culprit with more calmness ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2298528797))	2024-08-20 10:34:41 +00:00
PyTorch MergeBot	5109c5ef23	Revert "[dynamo] simplify implementation for `functools.reduce` (#133778 )" This reverts commit ff9be0eda99c59cdbcc269853168657de93043c7. Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests - reverting them all, so we can identify the culprit with more calmness ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2298528797))	2024-08-20 10:34:41 +00:00
Aaron Orenstein	241df7e7f8	Add multi-cache autotune test (#133868 ) Summary: The existing tests didn't cover a case where we had multiple autotunes in a single graph. Add a test to demonstrate that case. Also added a test dependency on redis and removed the "fake redis" from the previous PR (#133579) Test Plan: unit tests Reviewed By: oulgen Differential Revision: D61178861 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133868 Approved by: https://github.com/oulgen	2024-08-20 10:26:45 +00:00
Yifu Wang	11af423eca	[SymmetricMemory] make buffer_ptrs_dev, signal_pad_ptrs_dev, buffer_size, and signal_pad_size accessible in python (#133680 ) This allows us to experiment with creative applications with triton. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133680 Approved by: https://github.com/Chillee	2024-08-20 10:15:35 +00:00
PyTorch MergeBot	08b5e07e6c	Revert "[dynamo] simplify implementation for `builtins.sum` (#133779 )" This reverts commit 1fdeb4e32918017ee3a712e0bba86e8482fa293b. Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests ([comment](https://github.com/pytorch/pytorch/pull/133779#issuecomment-2298285206))	2024-08-20 08:33:29 +00:00
PyTorch MergeBot	68570fca69	Revert "Add MaskedTensor support to *_like API (#128637 )" This reverts commit 8de56e29581fa2706d44f8c4b0827830c9351470. Reverted https://github.com/pytorch/pytorch/pull/128637 on behalf of https://github.com/jeanschmidt due to Introduced API linting errors ([comment](https://github.com/pytorch/pytorch/pull/128637#issuecomment-2298270307))	2024-08-20 08:26:28 +00:00
PyTorch MergeBot	42097f0ec1	Revert "[BE]: Update Typeguard to TypeIs for better type inference (#133814 )" This reverts commit cf60fe53a83bafec0857d5b49c2054de6ba4cddc. Reverted https://github.com/pytorch/pytorch/pull/133814 on behalf of https://github.com/jeanschmidt due to Broke 12k internal signals/jobs, @ezyang please help get those changes merged. More details check D61488368 ([comment](https://github.com/pytorch/pytorch/pull/133814#issuecomment-2298210309))	2024-08-20 08:02:49 +00:00
Michael Lazos	25d5a815f7	[Dynamo] Guard on torch function mode global state (#133135 ) Adds guards checking whether torch function mode is in the all disabled state. There are three torch function enablement states: * All torch function disabled (modes + subclasses) * Torch function subclass disabled * All enabled We now have guards checking if the state is All enabled and if state is All disabled. All of the above ternary states are assigned to a unique pair of these two flags. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133135 Approved by: https://github.com/anijain2305 ghstack dependencies: #133130, #133729, #133131, #133132, #133133, #133134, #133136	2024-08-20 07:15:04 +00:00
Michael Lazos	48ee0984ac	Add C API to return all torch function disablement status (#133136 ) This PR adds a C function to check if all torch function is disabled. Recall that there are three torch function enablement states: * All disabled * Torch Function Subclass disabled * All enabled The API before this change provides two functions: * `_is_torch_function_enabled` - returns True iff the current TF state is All enabled * `_is_torch_function_mode_enabled` - returns True iff the state is not All disabled and the torch function mode stack is non-empty. The crux of why a new API is needed is the following: If dynamo enters a frame with the torch function mode stack empty, `_is_torch_function_enabled` == False, it is impossible to determine if after a new mode is pushed whether we should enter the mode or not. This is because we don't know if the enablement state is All disabled or only subclass disabled. Adding this API to check if All disabled is True allows us to disambiguate this case. In the next PR, Dynamo InstructionTranslator will have clearer flags than the underlying C API: * A flag to indicate if subclasses are disabled (ie All disabled or Subclass Disabled is the current state) * A flag to indicate if modes are disabled (ie if All disabled is the current state) * A symbolic stack which can be checked if any modes are present Pull Request resolved: https://github.com/pytorch/pytorch/pull/133136 Approved by: https://github.com/bdhirsh ghstack dependencies: #133130, #133729, #133131, #133132, #133133, #133134	2024-08-20 07:15:04 +00:00
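The three enablement states and the two boolean checks described in the two commits above can be modelled in a few lines. This is only an illustrative sketch: `TFState`, `flags`, and `state_from_flags` are hypothetical names invented here, not the PyTorch C API, and the model assumes one check reports "All enabled" while the other reports "All disabled".

```python
from enum import Enum


class TFState(Enum):
    """Hypothetical model of the three torch function enablement states."""
    ALL_DISABLED = "all_disabled"            # modes and subclasses disabled
    SUBCLASS_DISABLED = "subclass_disabled"  # only subclasses disabled
    ALL_ENABLED = "all_enabled"


def flags(state):
    """Return the pair (is_all_enabled, is_all_disabled) for a state."""
    return (state is TFState.ALL_ENABLED, state is TFState.ALL_DISABLED)


def state_from_flags(is_all_enabled, is_all_disabled):
    """Recover the state from the two flags; (True, True) cannot occur."""
    if is_all_enabled and is_all_disabled:
        raise ValueError("state cannot be both all-enabled and all-disabled")
    if is_all_enabled:
        return TFState.ALL_ENABLED
    if is_all_disabled:
        return TFState.ALL_DISABLED
    # Neither flag set: only the subclass-disabled state remains.
    return TFState.SUBCLASS_DISABLED
```

With only the "All enabled" flag, `ALL_DISABLED` and `SUBCLASS_DISABLED` both map to `False` and cannot be told apart; adding the second flag gives each state its own pair.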
Michael Lazos	d97ca968cd	[Dynamo] Test intermediate tf mode construction (#133134 ) Ensures that constructing a torch function mode in the middle of a function is supported. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133134 Approved by: https://github.com/williamwen42 ghstack dependencies: #133130, #133729, #133131, #133132, #133133	2024-08-20 07:14:56 +00:00
Michael Lazos	626acaeb16	[Dynamo] Support torch function stack len (#133133 ) Adds support for `torch._C._len_torch_function_stack()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133133 Approved by: https://github.com/williamwen42 ghstack dependencies: #133130, #133729, #133131, #133132	2024-08-20 07:14:52 +00:00
Michael Lazos	d1fdf984c3	[Dynamo] Support push torch function mode stack (#133132 ) This PR adds support `torch._C._push_on_torch_function_stack()` by updating `torch.py` to push onto the symbolic torch function mode stack when a push is encountered. The same side effects infra used in the previous PR is used to track the mutation of the torch function mode stack and add bytecode to update it if it is mutated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133132 Approved by: https://github.com/williamwen42 ghstack dependencies: #133130, #133729, #133131	2024-08-20 07:14:47 +00:00
Michael Lazos	c0b4aaa8c5	[Dynamo] Support pop torch function mode stack (#133131 ) This PR adds support for tracing `torch._C._pop_torch_function_stack()` without graph breaking and in order to verify the state change also adds replay of mutations to the torch function mode stack via side_effects appending supplemental bytecode as we do for other python mutable objects. Details: To represent the torch function mode stack symbolically a deque field is added to the instruction translator. When the InstructionTranslator is initialized, all modes are read from the current torch function mode stack, and stashed in a global weak ref for later access (using existing sources) without needing to push/pop the python/cpp torch function mode stack. During tracing, when `_pop_torch_function_stack` is encountered a value is popped from this deque and the variable tracker representing the mode is returned. To ensure the true torch function mode stack matches this state, `TorchFunctionModeStackVariable`, a singleton, is marked as mutated, this adds it to side effects, where during final codegen, side effects will codegen a call to a python helper which will update the python torch function mode stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133131 Approved by: https://github.com/jansel ghstack dependencies: #133130, #133729	2024-08-20 07:14:42 +00:00
Michael Lazos	f147349568	Fix DeviceContext bug (#133729 ) Fixes https://github.com/pytorch/pytorch/issues/133666 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133729 Approved by: https://github.com/bdhirsh ghstack dependencies: #133130	2024-08-20 07:14:37 +00:00
Michael Lazos	09e366cb57	[Dynamo] Add torch function mode stack guard to dynamo (#133130 ) This PR adds a guard on the torch function mode stack state at the beginning of tracing. The way this is implemented is via a new leaf guard which is passed the initial stack state at construction and compares it to the stack state at the time the guard is run. Details: The stack state is extracted via popping all modes, appending them to a list, and pushing all modes back. This list is stored on the output graph and read during guard construction to pass to the stack mode guard. There the length and types of the modes are recorded. Next time the guard is run it compares this recorded state to the current mode stack state. To implement this in python a helper function was added to utils.py and this is used if cpp guards are not enabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133130 Approved by: https://github.com/anijain2305	2024-08-20 07:14:33 +00:00
Aaron Orenstein	7492da804f	Mark disabled tests as fixed (#133940 ) Fixes #132552, #133900, #133901, #133902, #133903, #133904, #133905, #133906, #133908, #133910, #133911, #133912, #133913, #133914, #133915, #133916, #133917 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133940 Approved by: https://github.com/oulgen	2024-08-20 06:58:11 +00:00
Animesh Jain	e8d3c4be36	[dynamo][reland][inline-inbuilt-nn-modules] Mark attributes of nn mod… (#133714 ) Relands https://github.com/pytorch/pytorch/pull/132539 Relands https://github.com/pytorch/pytorch/pull/132736 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133714 Approved by: https://github.com/jansel	2024-08-20 05:57:52 +00:00
Bob Ren	f08d484702	Add itertools.islice support in dynamo (#133893 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133893 Approved by: https://github.com/oulgen	2024-08-20 05:55:53 +00:00
fduwjj	b6891f4002	[1/N] Refactor fr trace script to make it modulized - config (#133927 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133927 Approved by: https://github.com/c-p-i-o	2024-08-20 05:47:17 +00:00
Stonepia	15addb00e6	Update test_control_flow.py to device-agnostic. (#133843 ) Fixes #133841 This PR makes the `test_pointwise_associative_scan_CUDA_flip` also work on XPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133843 Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/malfet, https://github.com/jansel, https://github.com/atalman	2024-08-20 05:05:43 +00:00
Chirag Pandya	994fcb9acd	Killswitch based rollout for flight recorder (#133237 ) Summary: Defaulting TORCH_NCCL_DUMP_ON_TIMEOUT to "true" and adding a killswitch in case we need to kill this feature in production. Test Plan: Tests pass manually but need further testing before this is rolled out fully everywhere. Differential Revision: D61136320 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133237 Approved by: https://github.com/c00w	2024-08-20 04:27:55 +00:00
Huamin Li	32f57ac627	[BE] Fix lint issues in qlinear_prepack.cpp (#133797 ) Summary: This diff fixed many lint issues in qlinear_prepack.cpp. I am fixing them because I want to add more ops/funcs into this file later. Test Plan: Sandcastle Differential Revision: D61425436 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133797 Approved by: https://github.com/Skylion007	2024-08-20 04:23:25 +00:00
Avik Chaudhuri	b0bafd2be5	remove tensor weak ref from constraint target (#133890 ) Summary: `_ConstraintTarget` is an internal data structure that has some redundancy: tensors are identified by their id but also carry a weak reference. The weak reference was probably useful a year back but everything is done with ids right now, and the lifetime of these tensors ensures that using their ids is OK. Test Plan: existing tests Differential Revision: D61488816 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133890 Approved by: https://github.com/tugsbayasgalan	2024-08-20 03:03:05 +00:00
atalman	188cb5e67b	Bump scikit-image to 0.22.0 (#133932 ) Fixes: https://github.com/pytorch/pytorch/issues/133926 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133932 Approved by: https://github.com/malfet	2024-08-20 02:37:16 +00:00
Bin Bao	6c82a1c68c	[AOTI] Introduce DeferredCudaKernelLine for cuda cpp wrapper (#129135 ) Summary: When generating CUDA kernel load and launch, certain Triton kernel meta data are needed, but those meta data only exist after kernel auto-tune is done. DeferredCudaKernelLine is a deferred line which can backfill a string template after kernel auto-tune. This is to prepare for one-pass AOTI codegen implementation. Differential Revision: [D61018114](https://our.internmc.facebook.com/intern/diff/D61018114) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129135 Approved by: https://github.com/angelayi	2024-08-20 02:15:44 +00:00
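The idea of a deferred line, as summarized in the commit above, is a placeholder in the generated code whose text is filled in once the missing kernel metadata exists. The sketch below is a toy illustration under that assumption; `DeferredLine`, `render`, and the metadata keys `mangled_name` and `shared_mem` are hypothetical and are not taken from the inductor source.

```python
class DeferredLine:
    """A generated-code line whose text depends on data not yet available."""

    def __init__(self, kernel_name, template, keys):
        self.kernel_name = kernel_name  # which kernel's metadata to look up
        self.template = template        # str.format template to backfill
        self.keys = keys                # metadata fields the template needs

    def resolve(self, kernel_meta):
        """Fill the template once metadata for the kernel exists."""
        params = kernel_meta[self.kernel_name]
        return self.template.format(**{k: params[k] for k in self.keys})


def render(lines, kernel_meta):
    """Turn a mix of plain strings and deferred lines into final text."""
    return [
        line.resolve(kernel_meta) if isinstance(line, DeferredLine) else line
        for line in lines
    ]


# Lines are emitted before tuning; the deferred one is only a placeholder.
wrapper_lines = [
    "// load kernel",
    DeferredLine(
        "k0",
        'launch("{mangled_name}", shared_mem={shared_mem});',
        ("mangled_name", "shared_mem"),
    ),
]

# Hypothetical metadata that becomes known only after auto-tuning.
autotune_results = {"k0": {"mangled_name": "k0_abc", "shared_mem": 1024}}

final_lines = render(wrapper_lines, autotune_results)
```

Keeping the placeholder in the line list means the surrounding code can be generated in a single pass, with only the final `render` step waiting on the tuning results.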
cyy	c51fc7e98e	Enable clang-tidy in aten/src/ATen/native/nested/ (#133829 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/133829 Approved by: https://github.com/Skylion007	2024-08-20 01:52:15 +00:00
chuanqiw	c6ea7b3f21	Update xpu CD used driver to rolling version (#133454 ) The main purpose of this PR is to change the XPU CD to use the rolling driver, in order to support more client GPU AOT builds and enable Kineto. We also plan to enable python 3.13 for xpu CD. Works for https://github.com/pytorch/pytorch/issues/114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133454 Approved by: https://github.com/atalman	2024-08-20 01:45:45 +00:00
Jane Xu	c7af2728d3	Remove aten dispatch to empty in foreach_norm cuda kernel (#133897 ) Saves significant time on aten dispatch. For 2k tensors, goes from 38ms to 58us. Should shave some overhead mentioned in https://github.com/pytorch/pytorch/issues/133586 Before PR: ![image](https://github.com/user-attachments/assets/7813f059-0f7f-4d44-a9f0-1aaf94ae849f) After: ![image](https://github.com/user-attachments/assets/ad0855b1-2743-432a-ad31-b574c620e2fd) script: ``` import torch # warm up caching allocator a = torch.rand(200, 10, device="cuda") b = torch.rand(200, 10, device="cuda") c = a + b del a, b, c ts = [torch.rand(2, 3, device="cuda") for _ in range(2000)] with torch.profiler.profile( activities=[ torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA, ] ) as p: torch._foreach_norm(ts) print(p.key_averages().table(sort_by="cpu_time_total")) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133897 Approved by: https://github.com/albanD, https://github.com/drisspg	2024-08-20 01:27:09 +00:00
fduwjj	874ae854eb	[c10d] Land CudaEventCache with roll out flags (#133727 ) @zdevito added a cache for CudaEvent in https://github.com/pytorch/pytorch/pull/122732. And we want to productionize it with a flag in this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133727 Approved by: https://github.com/shuqiangzhang, https://github.com/eqy	2024-08-20 01:08:00 +00:00
Menglu Yu	cfcb9e388d	[PT2][Optimus] Add move reshape out of split stack pass (#133710 ) Summary: We observed a new pattern in CMF where reshape nodes are in the middle of the split stack pattern, introducing massive triton_fused_stack_xxx kernels and leading to increased compilation time. We thus move them outside of the pattern and eliminate such split stack nodes. Test Plan: # unit test ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes ``` Buck UI: https://www.internalfb.com/buck2/2fb51ae7-832e-436b-b6b7-a81599390182 Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173811074971 Network: Up: 10MiB Down: 5.4GiB (reSessionID-96a20105-fdc6-4b4f-b465-813a84a71eba) Jobs completed: 304618. Time elapsed: 25:24.7s. Cache hits: 99%. Commands: 120772 (cached: 120410, remote: 357, local: 5) Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0 # benchmark ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf_shrink" --flow_id 587303213 ``` P1529578588 graph diffing: https://www.internalfb.com/intern/diffing/?paste_number=1529577762 Counter({'pattern_matcher_nodes': 2123, 'pattern_matcher_count': 1715, 'normalization_pass': 404, 'remove_split_with_size_one_pass': 269, 'extern_calls': 193, 'merge_splits_pass': 74, 'normalization_aten_pass': 47, 'fxgraph_cache_miss': 9, 'batch_aten_mul': 6, 'scmerge_split_sections_removed': 5, 'scmerge_split_removed': 4, 'scmerge_cat_removed': 4, 'unbind_stack_pass': 4, 'batch_sigmoid': 2, 'batch_linear': 2, 'move_reshape_out_of_split_stack_pass': 2, 'batch_aten_sub': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'split_stack_to_cats_pass': 1, 'split_cat_to_slices_pass': 1, 'batch_aten_add': 1, 'batch_relu': 1}) Trace link: 
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Ftest%2Fcmf_shrink.Aug_15_10_55_41_trace.json.gz&bucket=pyper_traces The triton_fused_stack_xxx has been reduced significantly, we can see from the trace that the green part becomes smaller {F1806406290} # e2e ads_dper3:68464f2dc5e849ba2670482079cecaaa training_platform:8643db0c3453f2658aa7be7d73974ea0 baseline: f588719502 proposal: f592116164 Differential Revision: D61249205 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133710 Approved by: https://github.com/jackiexu1992	2024-08-20 00:50:07 +00:00
Lucy Qiu	6f738d6434	Remove early exit in constant_pad_nd for export (#132679 ) Summary: Remove the early exit for padding when padding = [0, 0, 0, 0]. This prevents export from specializing when all padding=0, allowing export when all padding >= 0. Specialization will still happen for negative padding. This change will be used to export image preprocess for multimodal models, where images of dynamic shape are padded. As images are of dynamic shape, we can't be sure if padding will be required or not. Padding is guaranteed to be non-negative. Preprocess code: https://github.com/pytorch/torchtune/pull/1242 Note: the alternative is to wrap padding in a custom op, which isn't ideal given the custom op will contain the same impl as constant_pad_nd. Test Plan: ci Differential Revision: D60687727 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132679 Approved by: https://github.com/ezyang	2024-08-20 00:07:41 +00:00
Ahmad Sarvmeily	9a998d98f1	Fix edge case in inductor triton clean script (#130837 ) The regex in the script is too restrictive, as it excludes examples with parentheses in args, like the following: ``` triton_poi_fused_add_0.run(arg0_1.item(), arg1_1.item(), buf0, 1, grid=grid(1), stream=streamNone) ^ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130837 Approved by: https://github.com/Chillee	2024-08-19 23:46:11 +00:00
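The failure mode in the commit above, a pattern that rejects call lines whose arguments contain parentheses, is easy to reproduce. The two patterns below are hypothetical stand-ins written for this illustration; they are not the regexes from the clean script, and the greedy variant is just one possible relaxation.

```python
import re

# The example call line quoted in the commit message above.
line = (
    "triton_poi_fused_add_0.run(arg0_1.item(), arg1_1.item(), "
    "buf0, 1, grid=grid(1), stream=streamNone)"
)

# Hypothetical restrictive pattern: arguments may not contain parentheses.
restrictive = re.compile(r"(\w+)\.run\(([^()]*)\)$")

# Hypothetical relaxed pattern: accept anything up to the final ")".
permissive = re.compile(r"(\w+)\.run\((.*)\)$")

strict_match = restrictive.match(line)  # None: "arg0_1.item()" has parens
loose_match = permissive.match(line)    # matches the whole call

kernel_name = loose_match.group(1)
call_args = loose_match.group(2)
```

The greedy `.*` relies on the call ending the line; text with several calls per line would need a real parenthesis-balancing parser rather than a regex.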
Oguz Ulgen	65b3e42074	Warn on fx graph cache bypass and log it to tlparse (#133826 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133826 Approved by: https://github.com/aorenste	2024-08-19 23:39:55 +00:00
Yidi Wu	2ec95ffe57	[cond] support unbacked symbool inputs (#133589 ) Fixes https://github.com/pytorch/pytorch/issues/133577. In dynamo, when an unbacked symbool input is received, we create an unbacked symint to replace it. The alternative approach of `not realizing the pred LazyVariable in cond` doesn't work because we need to get the proxy of the symbool input. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133589 Approved by: https://github.com/ezyang	2024-08-19 23:36:48 +00:00
Jithun Nair	3f525c9d5d	Upgrade nightly wheels to rocm6.2 - 2 of 2 (binaries) (#133238 ) Depends on https://github.com/pytorch/pytorch/pull/132875 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133238 Approved by: https://github.com/atalman	2024-08-19 22:35:33 +00:00
William Wen	2b95007d12	[dynamo] support random.Random (#133725 ) Fixes the observed graph breaks in https://github.com/pytorch/pytorch/issues/121349 and https://github.com/pytorch/pytorch/issues/121350. But there are still graph breaks since a random output is being used as a seed, e.g. ```python import random import torch def fn(x): seed = random.randint(0, 100) rand = random.Random(seed) return x + rand.randrange(10) opt_fn = torch.compile(fn, backend="eager", fullgraph=True) opt_fn(torch.ones(1)) ``` fails with ``` torch._dynamo.exc.InternalTorchDynamoError: UnspecializedPythonVariable() is not a constant ``` when tracing the line ``` rand = random.Random(seed) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133725 Approved by: https://github.com/jansel	2024-08-19 22:34:44 +00:00
James Perng	06faa15194	[pytorch][counters] add pytorch.wait_counter.fx_codgen_and_compile (#133107 ) as titled Differential Revision: [D60876629](https://our.internmc.facebook.com/intern/diff/D60876629/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133107 Approved by: https://github.com/asiab4	2024-08-19 22:29:16 +00:00
Justin Chu	afb3e5ed6a	Add onnx and onnxscript to CI requirements (#133647 ) Add onnx and onnxscript to requirements-ci.txt to allow for `test_public_bindings` and mypy to function when checking `torch.onnx._internal` code as @malfet suggested. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133647 Approved by: https://github.com/titaiwangms, https://github.com/kit1980	2024-08-19 22:15:07 +00:00
Xuehai Pan	1fdeb4e329	[dynamo] simplify implementation for `builtins.sum` (#133779 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779 Approved by: https://github.com/jansel ghstack dependencies: #133712, #133769, #133778	2024-08-19 22:14:34 +00:00
Xuehai Pan	ff9be0eda9	[dynamo] simplify implementation for `functools.reduce` (#133778 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778 Approved by: https://github.com/jansel ghstack dependencies: #133712, #133769	2024-08-19 22:14:33 +00:00
Xuehai Pan	59ca56e56c	[dynamo] simplify polyfill registration for `builtins.all` and `builtins.any` (#133769 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769 Approved by: https://github.com/jansel ghstack dependencies: #133712	2024-08-19 22:14:33 +00:00
Xuehai Pan	641724ed1d	[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712 ) Add decorator `torch.compiler.substitute_in_graph` to register polyfill for unsupported C++ function to avoid graph break. This API provides an official way to add support for dynamo for third-party C extensions. Also, it can be used to simplify our implementation for `torch._dynamo.polyfill`. `5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)` Example: ```python >>> import operator >>> operator.indexOf([1, 2, 3, 4, 5], 3) 2 >>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3) Unsupported: ... >>> @torch.compiler.substitute_in_graph(operator.indexOf) ... def indexOf(sequence, x): ... for i, item in enumerate(sequence): ... if item is x or item == x: ... return i ... raise ValueError("sequence.index(x): x not in sequence") >>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3) 2 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712 Approved by: https://github.com/jansel	2024-08-19 22:14:33 +00:00
nowtryz	8de56e2958	Add MaskedTensor support to *_like API (#128637 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128637 Approved by: https://github.com/cpuhrsch	2024-08-19 22:13:59 +00:00
nowtryz	14ddd932fd	Add MaskedTensor support to _is_any_true (#128574 ) Fixes #128557 If there is a better way to detect autograd anomalies consistently, feel free to share your ideas. This is a dirty check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128574 Approved by: https://github.com/cpuhrsch	2024-08-19 21:34:31 +00:00
Edward Z. Yang	432638f521	Remove useless environment in reusable workflow (#133659 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/133659 Approved by: https://github.com/Skylion007	2024-08-19 20:44:17 +00:00
atalman	d131048056	Change install_triton to do git checkout, apply patch, pip install (#133878 ) Fixes Docker builds: https://github.com/pytorch/pytorch/actions/runs/10458684809/job/28961048777 Follow up after https://github.com/pytorch/pytorch/pull/133694 to apply same patch to Docker build. Change: rather than doing: ``` pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python" ``` we now use 4 steps: git clone, git checkout, apply patch, pip install Pull Request resolved: https://github.com/pytorch/pytorch/pull/133878 Approved by: https://github.com/malfet, https://github.com/ZainRizvi	2024-08-19 20:42:50 +00:00
Edward Z. Yang	66d6d8b1b9	Support TORCH_COMPILER_COLLECTIVES envvar (#133696 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/133696 Approved by: https://github.com/Skylion007, https://github.com/c-p-i-o	2024-08-19 20:13:04 +00:00
Colin Peppler	0d4eacb9d2	[fake tensor] unbacked symint support for binary op fast path (#133584 ) Addresses https://github.com/pytorch/pytorch/issues/133525 We have an unbacked symint in `final_shape` and it's a tuple... So, add `guard_size_oblivious` to do size oblivious checks + `sym_eq` for list equality. ``` op.shape > torch.Size([1]) final_shape > (u0 + 1,) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133584 Approved by: https://github.com/ezyang	2024-08-19 20:03:05 +00:00
Yichen Yan	565e2ea019	Scale XBLOCK in triton for `pointwise` (#133300 ) Adjust https://github.com/pytorch/pytorch/pull/128826 for also `triton_heuristics.pointwise`. An example we encountered during training qwen-7b with rocm 6.1: Note: this kernel also hit the limit of `TRITON_MAX_BLOCK['X']`, shall we increase it from 2048 to 4096? ``` import torch aten = torch.ops.aten inductor_ops = torch.ops.inductor assert_size_stride = torch._C._dynamo.guards.assert_size_stride empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor alloc_from_pool = torch.ops.inductor._alloc_from_pool import triton import triton.language as tl from triton.compiler.compiler import AttrsDescriptor from torch._inductor.runtime import triton_heuristics from torch._inductor.runtime.hints import DeviceProperties @triton_heuristics.pointwise( size_hints=[8589934592], filename=__file__, triton_meta={'signature': {0: 'bf16'}, 'device': DeviceProperties(type='hip', index=2, cc='gfx942', major=None, regs_per_multiprocessor=None, max_threads_per_multi_processor=None, multi_processor_count=None), 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1), equal_to_1=())]}, inductor_meta={'autotune_hints': set(), 'kernel_name': 'triton_poi_fused_nll_loss_backward_0', 'mutated_arg_names': [], 'no_x_dim': False, 'num_load': 0, 'num_reduction': 0, 'backend_hash': None, 'are_deterministic_algorithms_enabled': False, 'assert_indirect_indexing': True, 'autotune_local_cache': True, 'autotune_pointwise': True, 'autotune_remote_cache': False, 'force_disable_caches': False, 'dynamic_scale_rblock': True, 'max_autotune': False, 'max_autotune_pointwise': False, 'min_split_scan_rblock': 256, 'spill_threshold': 16, 'store_cubin': False, 'is_hip': True}, min_elem_per_thread=0 ) @triton.jit def triton_(out_ptr0, XBLOCK : tl.constexpr): xoffset = tl.program_id(0).to(tl.int64) 
* XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:].to(tl.int64) x0 = xindex tmp0 = 0.0 tl.store(out_ptr0 + (x0), tmp0, None) import triton import triton.language as tl from torch._inductor.runtime.triton_heuristics import grid from torch._C import _cuda_getCurrentRawStream as get_raw_stream if __name__ == "__main__": with torch.cuda._DeviceGuard(2): torch.cuda.set_device(2) buf0 = empty_strided_cuda((32752, 151936), (151936, 1), torch.bfloat16) stream2 = get_raw_stream(2) triton_.run(buf0, grid=grid(4976207872), stream=stream2) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133300 Approved by: https://github.com/jansel	2024-08-19 19:41:55 +00:00
drisspg	fb26b84390	Update fused kernels and call _safe_softmax from SDPA (#133882 ) # UPDATE: This is take 3 of https://github.com/pytorch/pytorch/pull/131863 which was landed via co dev but did not apply correctly # Summary Changes the stance of SDPA on what to do for fully masked out rows ## Current Behavior Several PyTorch users have expressed frustration over this issue: - https://github.com/pytorch/pytorch/issues/41508 - https://github.com/pytorch/pytorch/issues/103749 - https://github.com/pytorch/pytorch/issues/103963 These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here: https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617 Can be paraphrased as follows: When passing in fully masked out rows, attention becomes ambiguous. We have two main options: 1. Uniformly attend to all values: ```python scores[masked_out_rows] = 1 / len(row) out[masked_out_rows] = 1 / len(row) * value ``` 2. Decide that attention between no queries (masked) and no keys (masked) is meaningless: ```python output[fully_masked_rows] = NaN ``` We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs: ``` Python >fill_value = -float("inf") >row0 = torch.randn(4) >row1 = torch.tensor([fill_value for _ in range(4)]) >matrix = torch.stack([row0, row1]).requires_grad_(True) >out = torch.softmax(matrix, 1) >out = out[0] >print(out) tensor([0.5377, 0.2729, 0.0692, 0.1201]) ``` Cool, problem solved. But what happens when you call backward? ```Python >out.backward(torch.ones_like(out)) >print(matrix.grad) tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08], [ nan, nan, nan, nan]]) ``` Those pesky NaNs are back! ## Why do we see NaNs today? 
The core of the problem revolves around using softmax function in sdpa: ```python > row = torch.tensor([(-float("inf")) for _ in range(4)]) > torch.softmax(row, 0) tensor([nan, nan, nan, nan]) ``` ## Quick Aside: Masking in Attention Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs. We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values. ## Alternative Approaches If we use a very large negative number instead of -inf: ```python > row = torch.tensor([(-1e6) for _ in range(4)]) > torch.softmax(row, 0) tensor([0.2500, 0.2500, 0.2500, 0.2500]) ``` However if users always remembered to "slice" out their outputs i.e.: ```Python >fill_value = -1e6 >... >out.backward(torch.ones_like(out)) >print(matrix.grad) tensor([[-0.0563, -0.0564, 0.1613, -0.0486], [ 0.0000, 0.0000, 0.0000, 0.0000]]) ``` This would bring us back into a better state. ## A Third Option We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation. This PR implements the new semantic for masking w/ attention in fully masked-out rows: ```python out[masked_out_rows] = 0 ``` Important Note: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption. ## Details This PR stack does 3 things: 1. 
Adds a PRIVATE _safe_softmax op 2. Updates semantic for flash_cpu fused kernel 3. Updates semantic for efficient_cuda fused kernel _safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Because of this, instead of decomposing softmax and checking for -inf rows, we "cheat" and use nan_to_num. Why do I think this is okay? (please find a counter point if avail) There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them? The only case in which this can happen is if the input itself had a NaN or an Inf. For example: ```Python a = torch.ones([4], requires_grad=False, dtype=torch.float16) a[1] = torch.finfo(torch.float16).max print(a.softmax(-1)) ``` Will return `tensor([0., 1., 0., 0.], dtype=torch.float16)` Whereas ```Python a = torch.ones([4], requires_grad=False, dtype=torch.float16) a[1] = float("inf") a.softmax(-1) ``` returns: `tensor([nan, nan, nan, nan], dtype=torch.float16)` If we don't want to even allow for the possibility of "inf" or "NaN" attention scores to be converted to 0 then we can implement it something like this: ```Python max = torch.max(a, dim=-1, keepdim=True) exp = torch.exp(a - max.values) denom = torch.sum(exp, dim=-1, keepdim=True) softmax = exp / denom softmax = torch.where(max.values == float('-inf'), 0.0, softmax) ``` however we would be paying for this in math performance. ## Why Now I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic. Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882 Approved by: https://github.com/soulitzer	2024-08-19 18:53:11 +00:00
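The fully-masked-row semantic described in the `_safe_softmax` commit above (`out[masked_out_rows] = 0`) can be illustrated without PyTorch. The following is a minimal sketch in plain Python, not the implementation from the PR: the names `naive_softmax` and `safe_softmax` are hypothetical, and it follows the "check the row max for -inf" variant from the message rather than the `nan_to_num` shortcut.

```python
import math
from typing import List


def naive_softmax(row: List[float]) -> List[float]:
    """Max-subtracted softmax. A row of all -inf yields NaNs,
    because -inf - (-inf) is NaN before the exponential is taken."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    denom = sum(exps)
    return [e / denom for e in exps]


def safe_softmax(row: List[float]) -> List[float]:
    """Hypothetical helper: same as naive_softmax, except that a fully
    masked row (every entry is -inf) maps to all zeros instead of NaN.
    Assumes a non-empty row with no NaN inputs."""
    m = max(row)
    if m == -math.inf:
        # Every entry is masked out: nothing to attend to.
        return [0.0] * len(row)
    exps = [math.exp(x - m) for x in row]
    denom = sum(exps)
    return [e / denom for e in exps]


if __name__ == "__main__":
    fully_masked = [-math.inf] * 4
    partly_masked = [0.0, -math.inf, 0.0, -math.inf]
    print(naive_softmax(fully_masked))
    print(safe_softmax(fully_masked))
    print(safe_softmax(partly_masked))
```

Partially masked rows behave exactly as with the ordinary softmax; only the degenerate all-masked row changes, which is the case the commit message singles out.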
Shangdi Yu	f1dc3b108a	Back out "[export] fix test for training ir migration" (#133697 ) Summary: Original commit changeset: 0a1cb57e0338 Original Phabricator Diff: D61223356 Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/export:export_rle_model -- -r test_export_rle_model Reviewed By: tugsbayasgalan Differential Revision: D61395818 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133697 Approved by: https://github.com/tugsbayasgalan	2024-08-19 18:30:42 +00:00
Edward Z. Yang	a8619c9a1d	Add nitpicker, which allows adding comments to PRs when they match a file pattern (#133861 ) This message would have helped avoid https://www.internalfb.com/sevmanager/view/440895 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/133861 Approved by: https://github.com/albanD, https://github.com/izaitsevfb	2024-08-19 18:29:59 +00:00
Jack Zhang	64d9afd8a7	Register nll_loss2d decompositions for core aten (#133534 ) When exporting a training model for Executorch (which requires all ops to be core aten) with cross entropy loss (`torch.nn.CrossEntropyLoss`), we ran into the following error from the fx verifier in `to_edge`: ``` torch._export.verifier.SpecViolationError: Operator torch._ops.aten.nll_loss2d_forward.default is not Aten Canonical. ``` The aten [implementation](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/LossNLL.cpp#L624) of `torch.nn.CrossEntropyLoss` uses `nll_loss2d_forward` for inference and `nll_loss2d_backward` for training, so we need to add the decompositions for both (which already exist) to the list of core aten decompositions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133534 Approved by: https://github.com/JacobSzwejbka	2024-08-19 18:26:48 +00:00
Bin Bao	ad7dda7b32	[CI] Bump up TIMM pin (#133528 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133528 Approved by: https://github.com/angelayi	2024-08-19 18:13:57 +00:00
Jack Zhang	773a782249	Decompose _unsafe_index_put into index_put (#133365 ) ## Description Create decomposition of _unsafe_index_put (non-core aten) that turns it into index_put (core aten) ## Testing Phi3 mini + LoRA model successfully passed `to_edge` after failing due to a non-core aten `unsafe_index_put` getting introduced in a decomposition during joint graph calculations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133365 Approved by: https://github.com/pianpwk	2024-08-19 18:07:23 +00:00
Zhengxu Chen	517aee5369	[torchscript] Add a sampled logging integration point. (#133484 ) Test Plan: test script: ``` def test_zhxchen17(self): from libfb.py.pyinit import initFacebook initFacebook() class M(torch.nn.Module): def forward(self, x): return torch.add(x, x) def tmptmp(x, y): return torch.mul(x, y) m = M() n = torch.jit.script(m) print(n(torch.tensor(1))) print(torch.jit.script(tmptmp)(torch.tensor(1), torch.tensor(2))) ``` ``` I0802 12:01:23.932929 4079081 init.cc:407] Logging to scuba: run __torch__.caffe2.test.export.test_export.M.forward sample rate: 1000000 ``` Differential Revision: D60920867 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133484 Approved by: https://github.com/davidberard98	2024-08-19 18:04:45 +00:00
Xintong Hu	6564e746ed	[PT2] Port remove_noop to PT2 pre_grad passes (#132183 ) Summary: migrate to aten IR, `reshape` -> `view.default`, not covering `flatten` as that optimization is already done in PT2, see the example here P1506057533 Differential Revision: D60476525 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132183 Approved by: https://github.com/frank-wei	2024-08-19 17:46:51 +00:00
Will Constable	da69a28c6f	[pipelining] Add schedule runtime for lowered schedule (#130488 ) Creates a new runtime that shifts complexity from runtime to ahead-of-time. The existing runtime (PipelineScheduleMulti) accepts a compute-only schedule (only forward, backward, and weight actions are specified) and infers the communication operations at runtime. Compared to that runtime, PipelineScheduleRuntime has less logic that happens at runtime and relies on lowering passes to transform the compute-only schedule to add communications. Advantages include - easier to verify the correctness by dumping a compute+comm schedule - possible to manually edit the compute+comm schedule if the lowering heuristics are insufficient Functionality included inside the PipelineScheduleRuntime is limited to - accepting a compute-only schedule and lowering it to add comms - executing the compute or comm operations specified by the given schedule - handling work.wait() automatically by calling it just before the matching compute operation (for RECV ops) or at the end of step (for SEND ops) Follow ups for later PRs - Some refactoring should be done to replace PipelineScheduleMulti with this runtime - Optimizer execution is not considered (e.g. for zero-bubble cases) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130488 Approved by: https://github.com/H-Huang	2024-08-19 17:44:24 +00:00
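The idea of lowering a compute-only schedule into a compute+comm schedule, as described in the pipelining commit above, can be pictured with a toy sketch. Everything here is an assumption made for illustration: the action tuples, the `RECV_F`/`SEND_F` names, and the function `lower_forward_schedule` are hypothetical and are not the PipelineScheduleRuntime API. Only forward actions are handled.

```python
from typing import Dict, List, Tuple

# Hypothetical action encoding: (op, stage index, microbatch index).
Action = Tuple[str, int, int]


def lower_forward_schedule(
    compute_only: List[Action],
    stage_to_rank: Dict[int, int],
    rank: int,
) -> List[Action]:
    """Toy lowering pass for one rank: wrap each forward action with the
    receive/send it needs when the neighbouring stage lives on another rank.
    Assumes stages are numbered 0..n-1 and all appear in stage_to_rank."""
    num_stages = len(stage_to_rank)
    lowered: List[Action] = []
    for op, stage, mb in compute_only:
        if op != "F":
            # Backward/weight actions are passed through untouched in this sketch.
            lowered.append((op, stage, mb))
            continue
        if stage > 0 and stage_to_rank[stage - 1] != rank:
            lowered.append(("RECV_F", stage, mb))  # input activation arrives
        lowered.append(("F", stage, mb))
        if stage + 1 < num_stages and stage_to_rank[stage + 1] != rank:
            lowered.append(("SEND_F", stage, mb))  # output activation leaves
    return lowered


if __name__ == "__main__":
    placement = {0: 0, 1: 1}  # two stages, one per rank
    for r in (0, 1):
        schedule = [("F", r, 0), ("F", r, 1)]
        print(r, lower_forward_schedule(schedule, placement, r))
```

Because the lowered list is plain data, it can be dumped and inspected or edited by hand before execution, which is the advantage the commit message points to.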
PyTorch MergeBot	f31404ba6f	Revert "Update xpu CD used driver to rolling version (#133454 )" This reverts commit 32ed4a3beb746c94c702c80c79c812e45ab3b2f4. Reverted https://github.com/pytorch/pytorch/pull/133454 on behalf of https://github.com/ZainRizvi due to Sorry, there's [an outage](https://github.com/triton-lang/triton/issues/4527) that's preventing triton from being installed correctly, which has the side effect of breaking our docker builds. Reverting this PR since it requires a docker rebuild (which now fails) to give us more time to properly fix the docker builds. ([comment](https://github.com/pytorch/pytorch/pull/133454#issuecomment-2297073937))	2024-08-19 17:28:50 +00:00
Animesh Jain	6ca68357b3	[dynamo] Save class vt in UserDefinedObjectVariable (#133800 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133800 Approved by: https://github.com/jansel ghstack dependencies: #133745, #133747, #133746, #133799	2024-08-19 17:21:48 +00:00
Animesh Jain	08f14d5492	[refactor][dynamo][side-effects] Helper function for __new__ for user defined class (#133799 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133799 Approved by: https://github.com/jansel ghstack dependencies: #133745, #133747, #133746	2024-08-19 17:21:48 +00:00
drisspg	d6f30b91e5	Add a smaller default config option for decode (#133646 ) ## Before A100 \| Type \| Speedup \| score_mod \| mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| \|---------\|-----------\|-------------\|------------\|----------------\|---------------------------\| \| Average \| 0.461 \| \| \| \| \| \| Max \| 0.996 \| None \| causal \| torch.bfloat16 \| (16, 16, 1, 16, 1024, 64) \| \| Min \| 0.188 \| None \| causal \| torch.bfloat16 \| (2, 16, 1, 16, 512, 128) \| H100 \| Type \| Speedup \| score_mod \| mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| \|---------\|-----------\|-------------\|------------\|----------------\|---------------------------\| \| Average \| 4.528 \| \| \| \| \| \| Max \| 16.710 \| None \| offset \| torch.bfloat16 \| (2, 16, 1, 2, 4096, 64) \| \| Min \| 1.612 \| None \| offset \| torch.bfloat16 \| (16, 16, 1, 16, 512, 128) \| ## After A100: \| Type \| Speedup \| score_mod \| mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| \|---------\|-----------\|-------------\|------------\|----------------\|---------------------------\| \| Average \| 0.472 \| \| \| \| \| \| Max \| 1.110 \| None \| causal \| torch.bfloat16 \| (16, 16, 1, 16, 1024, 64) \| \| Min \| 0.182 \| None \| causal \| torch.bfloat16 \| (2, 16, 1, 16, 4096, 128) \| H100: \| Type \| Speedup \| score_mod \| mask_mod \| dtype \| shape(B,Hq,M,Hkv,N,D) \| \|---------\|-----------\|-------------\|------------\|----------------\|---------------------------\| \| Average \| 4.535 \| \| \| \| \| \| Max \| 16.691 \| None \| offset \| torch.bfloat16 \| (2, 16, 1, 2, 4096, 64) \| \| Min \| 1.607 \| None \| offset \| torch.bfloat16 \| (16, 16, 1, 16, 512, 128) \| ### Failing example code ``` Python import torch import torch.nn as nn import functools from torch.nn.attention.flex_attention import flex_attention, create_block_mask class AttentionModel(nn.Module): def __init__(self, initial_kv_len): super().__init__() self.kv_len = initial_kv_len self.q_len = 1 def causal_mask_decode(self, b, h, 
q_idx, kv_idx): offset = self.kv_len - self.q_len return offset + q_idx >= kv_idx def forward(self, queries, keys, values, mask): self.kv_len = keys.shape[-2] bs, nh, seq_len, _ = queries.shape attention = functools.partial(flex_attention, block_mask=mask, enable_gqa=True) attention = torch.compile(attention) attn_output = attention(queries, keys, values) return attn_output # Driver code def main(): # Set up parameters d_model = 256 q_heads = 32 kv_heads = 8 kv_len = 128 q_len = 1 batch_size = 1 # Initialize the model model = AttentionModel(kv_len) mask = create_block_mask( lambda a, b, c, d: model.causal_mask_decode(a, b, c, d), 1, 1, q_len, kv_len ) # Create sample input tensors queries = torch.randn(batch_size, q_heads, q_len, d_model, device="cuda") keys = torch.randn(batch_size, kv_heads, kv_len, d_model, device="cuda") values = torch.randn(batch_size, kv_heads, kv_len, d_model, device="cuda") # Forward pass output = model(queries, keys, values, mask) print(f"Input shapes:") print(f" Queries: {queries.shape}") print(f" Keys: {keys.shape}") print(f" Values: {values.shape}") print(f"Output shape: {output.shape}") if __name__ == "__main__": main() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133646 Approved by: https://github.com/Chillee, https://github.com/joydddd	2024-08-19 17:13:26 +00:00
Mayank Mishra	e37eef8a7b	return state dict without optimized module (#132626 ) Fixes #123625 We should consider changing the current behaviour and make it similar to `1fb498d6e3/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py (L69-L101)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132626 Approved by: https://github.com/williamwen42	2024-08-19 16:58:41 +00:00
PyTorch MergeBot	8d404581fc	Revert "[ONNX] New export logic leveraging ExportedProgram and ONNX IR (#132530 )" This reverts commit 5fab35d77c7d1db7dbb9d5c516254a510b4f4f64. Reverted https://github.com/pytorch/pytorch/pull/132530 on behalf of https://github.com/ZainRizvi due to Sorry but it seems like Dr. CI incorrectly flagged the [pull / linux-docs / build-docs-python-false](https://hud.pytorch.org/pr/pytorch/pytorch/132530#28918577682) failure as being flaky. The job started failing consistently on CI once your PR was merged. [GH job link](https://github.com/pytorch/pytorch/actions/runs/10454830880/job/28949386844) [HUD commit link](`5fab35d77c`) ([comment](https://github.com/pytorch/pytorch/pull/132530#issuecomment-2297001423))	2024-08-19 16:47:15 +00:00
Aaron Orenstein	68fcd54226	Lower cache mocking to test more pytorch code (#133579 ) Summary: Previously we were mocking out FbRemoteFxGraphCacheBackend which meant that we were missing testing a whole bunch of the cache code. Cache at a lower level (CacheClient, LocalAutotuneCacheBackend, ManifoldClient, Redis) so we cover a larger amount of the caching code. Test Plan: unit tests Reviewed By: oulgen Differential Revision: D60937966 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133579 Approved by: https://github.com/oulgen	2024-08-19 16:32:36 +00:00
chuanqiw	32ed4a3beb	Update xpu CD used driver to rolling version (#133454 ) The main purpose of this PR is to change the XPU CD to use the rolling driver, in order to support AOT builds for more client GPUs and to enable Kineto. We also plan to enable Python 3.13 for XPU CD. Works for https://github.com/pytorch/pytorch/issues/114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133454 Approved by: https://github.com/atalman	2024-08-19 16:01:47 +00:00
fduwjj	df6831562c	[Flight Recorder] Add more basic analysis to the script (#133412 ) This is the first step to make sure we have a basic function of analyzer for FR in production. - We want to use this script to find out abnormalities in collectives and report it to users. - We also fixed some type errors. - [Ongoing] Also we will add more unit tests to this script and make it modularized so that we can better maintain it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133412 Approved by: https://github.com/c-p-i-o, https://github.com/atalman	2024-08-19 15:55:00 +00:00
PyTorch MergeBot	76b0284744	Revert "[inductor][cpp] complete vectorization for int32/int64 (#122961 )" This reverts commit 99b3b58f39507bb8ad5b4bb1b9bedf7f47b64fa3. Reverted https://github.com/pytorch/pytorch/pull/122961 on behalf of https://github.com/atalman due to Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d [GH job link](https://github.com/pytorch/pytorch/actions/runs/10432403692/job/28893704833) [HUD commit link](`a0ef8888e6`) ([comment](https://github.com/pytorch/pytorch/pull/122961#issuecomment-2296852418))	2024-08-19 15:29:15 +00:00
PyTorch MergeBot	318d3b39c4	Revert "[Inductor][CPP] Support vectorization of load_seed and randn (#130317 )" This reverts commit a0ef8888e60d934ae7e4ddaec1c1274b12d0d39d. Reverted https://github.com/pytorch/pytorch/pull/130317 on behalf of https://github.com/atalman due to Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d [GH job link](https://github.com/pytorch/pytorch/actions/runs/10432403692/job/28893704833) [HUD commit link](`a0ef8888e6`) ([comment](https://github.com/pytorch/pytorch/pull/130317#issuecomment-2296819045))	2024-08-19 15:13:39 +00:00
Weizhuo Zhang	5153550e4b	[CI] Add FP32 dynamic, AMP static, AMP dynamic for AOT inductor accuracy CPU CI test (#132836 ) This PR adds 3 more accuracy tests for the AOT inductor CPU side. 1. FP32 dynamic shape accuracy test, torchbench suite 2. AMP static shape accuracy test, torchbench suite 3. AMP dynamic shape accuracy test, torchbench suite Test Time cost: \| Precision \| Shape Type \| Suite \| Time cost \| \|----------- \|------------ \|------------ \|----------- \| \| FP32 \| dynamic \| Torchbench \| 1h40m \| \| AMP \| Static \| Torchbench \| 1h38m \| \| AMP \| dynamic \| Torchbench \| 1h48m \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/132836 Approved by: https://github.com/desertfire	2024-08-19 14:26:48 +00:00
Justin Chu	5fab35d77c	[ONNX] New export logic leveraging ExportedProgram and ONNX IR (#132530 ) 1/n PR to - Move code from torch-onnx from commit `395495e566` into torch.onnx and fixes imports. - Integrate the new export logic with the torch.onnx.export API and include basic set of tests. - Refactor the API for the change. - Improve documentation. Next PRs will be more tests and docs. Fix https://github.com/pytorch/pytorch/issues/129277 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132530 Approved by: https://github.com/titaiwangms, https://github.com/malfet	2024-08-19 14:01:07 +00:00
Jack Taylor	92151c814b	[ROCm] Set _HAS_PYNVML to false if amdsmi not installed (#132990 ) This is a bugfix that was recently encountered in ROCm/Deepspeed. Currently if a library installs pynvml and runs on ROCm pytorch will break as _HAS_PYNVML is set to true and it will attempt to use amdsmi library for the device_count call which will not be installed. This fix will set _HAS_PYNVML to false on ROCm if amdsmi is not installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132990 Approved by: https://github.com/pruthvistony, https://github.com/eqy, https://github.com/malfet	2024-08-19 09:45:58 +00:00
Robert Hardwick	0a976b8899	Enable bf16 float32 mkldnn matmul when float32 precision is 'medium' (#130919 ) This fixes an issue on AArch64 cpus supporting BF16, caused when torch.set_float32_matmul_precision("highest") does not disable the bf16 downconversion in mkldnn_matmul. This was discovered from a unit test failure where the decorator `torch.testing._internal.common_mkldnn.bf32_on_and_off`, which internally switches the float32_matmul_precision between "medium" and "highest" was not having the desired effect. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130919 Approved by: https://github.com/jgong5	2024-08-19 09:18:12 +00:00
Laith Sakka	8b6b1721c8	remove StrobelightCompileTimeProfiler.profile_compile_time from stacktrace when strobelight profiling not enabled (#133831 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133831 Approved by: https://github.com/oulgen	2024-08-19 09:14:52 +00:00
wz337	4bae7ae3d9	[DeviceMesh][Easy] Fix typo (#133790 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/133790 Approved by: https://github.com/Skylion007	2024-08-19 05:20:22 +00:00
PyTorch MergeBot	35f36363ec	Revert "[dtensor] move DTensor to public namespace (#133113 )" This reverts commit 2ee6b97464d17fcf4c1fc67c29868fa30d0c16e1. Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))	2024-08-19 05:00:19 +00:00
CaoE	42e61c783c	[Inductor][CPP] Align Half load with BFloat16 load (#132011 ) Remove `static_cast<float>` for Half load to align with BFloat16. Before: ``` extern "C" void kernel(const half* in_ptr0, half* out_ptr0) { { #pragma GCC ivdep for(long x0=static_cast<long>(0L); x0<static_cast<long>(20L); x0+=static_cast<long>(1L)) { auto tmp0 = static_cast<float>(in_ptr0[static_cast<long>(x0)]); out_ptr0[static_cast<long>(x0)] = tmp0; } } } ``` After: ``` extern "C" void kernel(const half* in_ptr0, half* out_ptr0) { { #pragma GCC ivdep for(long x0=static_cast<long>(0L); x0<static_cast<long>(20L); x0+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(x0)]; out_ptr0[static_cast<long>(x0)] = tmp0; } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132011 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-08-19 04:52:39 +00:00
Zain Rizvi	ae00063570	Change default runner's AMI to Amazon 2023 AMI - Part 1 (#133641 ) Upgrades the LF scale configs to change the default AMI in accordance with the Amazon 2023 rollout plan. This PR will be merged on Monday Aug 19 in the morning, and over the next 2-3 days as new linux runners are spun up (and old ones spun down) they'll start using this new AMI This PR will be paired with https://github.com/pytorch/test-infra/pull/5558, which will be merged after this one Pull Request resolved: https://github.com/pytorch/pytorch/pull/133641 Approved by: https://github.com/jeanschmidt	2024-08-19 01:32:25 +00:00
Christopher Yeh	e72e924eb5	Add correct typing annotations to rsample() for all distributions (#133516 ) Fixes #133514 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133516 Approved by: https://github.com/Skylion007	2024-08-18 20:31:54 +00:00
eqy	c0c82a5f6a	[CUDA][SDPA] Bump tolerances for `test_mem_efficient_attention_attn_mask_vs` (#133738 ) Same thing as #133051 but for efficient attention CC @drisspg @nWEIdia Pull Request resolved: https://github.com/pytorch/pytorch/pull/133738 Approved by: https://github.com/drisspg, https://github.com/nWEIdia, https://github.com/Skylion007	2024-08-18 19:14:29 +00:00
Aaron Gokaslan	cf60fe53a8	[BE]: Update Typeguard to TypeIs for better type inference (#133814 ) Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/133814 Approved by: https://github.com/ezyang	2024-08-18 19:10:16 +00:00
cyy	0d4cedaa47	[13/N] Fix clang-tidy warnings in aten/src/ATen (#133807 ) Follows #133425 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133807 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2024-08-18 17:54:12 +00:00
cyy	47ed5f57b0	[12/N] Fix clang-tidy warnings in aten/src/ATen (#133425 ) Follows #133758 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133425 Approved by: https://github.com/ezyang	2024-08-18 11:03:55 +00:00
Yu, Guangye	fbd020fce6	Add new prop to _XpuDevicePropertie for triton gemm optimization (#131738 ) # Motivation This PR aims to add new properties to `_XpuDevicePropertie` for triton gemm optimization. # Additional Context `ext_oneapi_supports_cl_extension` is not an ABI-neutral API. It depends on compiler 2025.0. For more details, see https://github.com/intel/llvm/pull/13212 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131738 Approved by: https://github.com/gujinghui	2024-08-18 08:32:30 +00:00
Animesh Jain	fed6096e73	[dynamo] Support object.__new__ call (#133746 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133746 Approved by: https://github.com/Skylion007, https://github.com/jansel ghstack dependencies: #133745, #133747	2024-08-18 07:18:52 +00:00
Animesh Jain	d56a395971	[dynamo] Support os.fspath (#133747 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133747 Approved by: https://github.com/yanboliang, https://github.com/Skylion007, https://github.com/jansel ghstack dependencies: #133745	2024-08-18 07:18:52 +00:00
JackCaoG	27dfd63ee8	remove unnecessary slicing in EffectTokensWrapper (#133737 ) When `outs` is a tensor, `[0:]` causes an additional, unnecessary slicing op that made some of XLA's unit tests fail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133737 Approved by: https://github.com/IvanKobzarev	2024-08-18 05:52:48 +00:00
Simon Fan	d717df2071	[compiled autograd] fix flaky tests due to torch.cuda.memory_allocated() != 0 (#133733 ) FIXES https://github.com/pytorch/pytorch/issues/123949 https://github.com/pytorch/pytorch/issues/124376 torch.cuda.memory_allocated returns the amount of memory allocated in the current process, so if it isn't 0 it means another test didn't properly clean up after itself. I'm keeping the memory check and isolating these tests in subprocess as we don't have a good way to test for activation refcount e.g. https://github.com/pytorch/pytorch/runs/28838386083 ``` _______________ TestCompiledAutograd.test_free_activation_memory _______________ Traceback (most recent call last): File "/var/lib/jenkins/workspace/test/inductor/test_compiled_autograd.py", line 1892, in test_free_activation_memory self.assertTrue(torch.cuda.memory_allocated() == 0) File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue raise self.failureException(msg) AssertionError: False is not true ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133733 Approved by: https://github.com/jansel	2024-08-18 05:43:35 +00:00
cyy	fb9d2dc641	Remove Wno-invalid-partial-specialization from CMake (#133398 ) The code base is clean enough that Winvalid-partial-specialization can be enabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133398 Approved by: https://github.com/ezyang	2024-08-18 04:06:21 +00:00
cyy	f8cf1829b5	[Reland] [11/N] Fix clang-tidy warnings in aten/src/ATen (#133758 ) Reland of #133298. Remove possible changes that may increase the build time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133758 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2024-08-17 23:09:44 +00:00
James Wu	0bde3c4f2f	Run cudagraphs on AOTAutograd cache hit (#132294 ) This threads through all of the necessary parts into aot autograd from the FXGraphCache changes so that we can run cudagraphs properly on a AOTAutograd cache hit. Specifics: - AOTAutograd needs access to the `cudagraphs` boxedbool in order to properly set the backward to not use cudagraphs on a cache hit from the forward. - We have lots of tests that test this already from the previous PR, so I just added an extra test and made the previous test work with both AOTAutogradCache and FXGraphCache at the same time. ``` TORCH_LOGS=torch._functorch._aot_autograd.autograd_cache,cudagraphs ENABLE_AOT_AUTOGRAD_CACHE=1 TORCHINDUCTOR_FX_GRAPH_CACHE=1 tlp python benchmarks/gpt_fast/benchmark.py --output ~/gpt_fast_benchmark.csv ``` Twice, once on cache miss and once and cache hit. Here is the perfetto trace for each(FB only link): Cache Miss: Logs: ``` Loading model Llama-2-7b-chat-hf Time to load model: 0.66 seconds I0813 10:53:34.416000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [0/0] AOTAutograd cache miss for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey I0813 10:53:51.395000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [0/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey/entry I0813 10:54:17.579000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [1/0] AOTAutograd cache miss for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt I0813 10:54:38.636000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [1/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt/entry I0813 10:54:39.228000 911030 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints V0813 10:54:39.939000 911030 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] 
Running warmup of function 0 V0813 10:55:10.615000 911030 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0 Compilation time: 101.24 seconds Average tokens/sec: 147.96 tokens/sec Average bandwidth achieved: 1955.22 GB/s Memory used: 14.51 GB ``` Chromium Event(fb only): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key ![image](https://github.com/user-attachments/assets/47fdd77e-3cc1-437e-8e68-7901646269bb) Cache Hit: Logs: ``` Loading model Llama-2-7b-chat-hf Time to load model: 0.67 seconds I0813 10:55:51.821000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [0/0] AOTAutograd cache hit for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey I0813 10:55:55.465000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [1/0] AOTAutograd cache hit for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt I0813 10:55:56.030000 944420 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints V0813 10:55:56.192000 944420 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0 V0813 10:55:56.426000 944420 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0 Compilation time: 9.40 seconds Average tokens/sec: 147.94 tokens/sec Average bandwidth achieved: 1954.98 GB/s Memory used: 14.51 GB ``` Chromium Event(fb only): 
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json&local_cache_key ![image](https://github.com/user-attachments/assets/9bdd14ec-d12a-4c89-8705-135c999ac746) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132294 Approved by: https://github.com/eellison	2024-08-17 21:24:54 +00:00
Christophe Bornet	d6368985af	[BE]: Fix setuptools not installed with Python 3.12 (#133561 ) setuptools is not installed correctly for Python 3.12. See https://github.com/python-poetry/poetry/issues/9630#issuecomment-2291114885 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133561 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-08-17 17:42:04 +00:00
Felix Janda	b4a1673a67	profiler/unwind: include <dlfcn.h> for dladdr (#133582 ) This fixes a compilation error on Linux systems using the musl C library. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/133582 Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi	2024-08-17 16:15:18 +00:00
Jiang, Yanbing	215b14530a	Add Half for sparse.mm reduce (#133672 ) This PR is to add Half support for sparse.mm reduce in CPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133672 Approved by: https://github.com/Skylion007	2024-08-17 15:20:39 +00:00
Xuehai Pan	1c6fbae579	[Easy][dynamo] fix builtin function names for `itertools` (#133711 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133711 Approved by: https://github.com/Skylion007	2024-08-17 15:12:01 +00:00
leslie-fang-intel	a0ef8888e6	[Inductor][CPP] Support vectorization of load_seed and randn (#130317 ) Summary Enable the vectorization of `load_seed` and `randn`. For now, `randn` is using the reference implementation. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_randn ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130317 Approved by: https://github.com/jgong5 ghstack dependencies: #122961	2024-08-17 07:15:57 +00:00
leslie-fang-intel	99b3b58f39	[inductor][cpp] complete vectorization for int32/int64 (#122961 ) Summary Implement the complete vectorization of `index_expr` functionally. We also add performance heuristics to resolve the regressions reported in https://github.com/pytorch/pytorch/pull/122961#issuecomment-2041336265 by disabling vectorization of specific (fused) scheduler nodes: - Heuristic 1: when the number of non-contiguous `index_expr/load/store` operations exceeds the threshold, we disable the vectorization. - Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122961 Approved by: https://github.com/jansel Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>	2024-08-17 07:07:49 +00:00
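The two heuristics in the commit above can be sketched as a small predicate. This is only an illustration of the rule as the message states it, not Inductor's code: the function name, its parameters, and the threshold value are all hypothetical, since the commit message does not give the actual threshold.

```python
def should_vectorize(
    num_non_contiguous: int,
    vec_dim_elements: int,
    tiling_factor: int,
    non_contiguous_threshold: int,
) -> bool:
    """Toy version of the two heuristics; every name here is hypothetical."""
    # Heuristic 1: too many non-contiguous index_expr/load/store operations.
    if num_non_contiguous > non_contiguous_threshold:
        return False
    # Heuristic 2: too few elements along the vectorized dimension.
    if vec_dim_elements < tiling_factor / 2:
        return False
    return True


# Example with an assumed threshold of 4 and a tiling factor of 16.
print(should_vectorize(0, 64, 16, 4))
print(should_vectorize(5, 64, 16, 4))
print(should_vectorize(0, 7, 16, 4))
```

Keeping the decision in one pure function makes it easy to unit test each heuristic independently of the code generator.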
Huanyu He	d5f6d68d68	[PT2] Resolve PT2 compatility issue in slice and diff (#133740 ) Summary: # context * when running IG FM training with PT2, we found a few graph breaks due to the torch.diff call in [jagged_tensor.py](https://fburl.com/code/cwssxabc) ``` _length: List[int] = ( _length_per_key_from_stride_per_key(torch.diff(offsets), stride_per_key) if variable_stride_per_key else torch.sum(torch.diff(offsets).view(-1, stride), dim=1).tolist() ) ``` * looking into the failure, we found that the TORCH_CHECK in diff should be TORCH_SYM_CHECK * slice_forward error: df3d7729e, [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpxXZ2em/index.html) ``` RestartAnalysis Tried to use data-dependent value in the subsequent computation. This can happen when we encounter unbounded dynamic value that is unknown during tracing time. You will need to explicitly give hint to the compiler. Please take a look at torch._check OR torch._check_is_size APIs. Could not guard on data-dependent expression ((5u37 + u38)//(u37 + u38)) < 0 (unhinted: ((5u37 + u38)//(u37 + u38)) < 0). (Size-like symbols: u38, u37) ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False. Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance. 
Potential framework code culprit (scroll up for full backtrace): File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/e99934938a0abe90/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 771, in slice_forward if end_val < 0: ``` * after this diff: [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpAhv2Sh/failures_and_restarts.html) Test Plan: # command * run model ``` TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 ``` * generate tlparse ``` tlparse `ls -t /var/tmp/tt/* \| head -1` ``` Reviewed By: ezyang Differential Revision: D56339251 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133740 Approved by: https://github.com/ezyang	2024-08-17 06:07:21 +00:00
Jiong Gong	cd89bf77c8	[inductor][cpp][gemm] easy: adjust indentation of template, var renaming etc. (#133312 ) Indent the template instructions separately from the generated code, for readability. Also rename M0, N0, K0 to Mr, Nr, Kr ("r" meaning "register") for consistent naming. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133312 Approved by: https://github.com/Skylion007, https://github.com/leslie-fang-intel ghstack dependencies: #132729, #132730	2024-08-17 05:49:14 +00:00
Animesh Jain	4dc9795ebf	[refactor][easy] Directly call var_getattr method for PythonModuleVariable (#133745 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133745 Approved by: https://github.com/yanboliang	2024-08-17 05:30:01 +00:00
Wanchao Liang	2ee6b97464	[dtensor] move DTensor to public namespace (#133113 ) Move DTensor into the public namespace, to formally add the documentation page that includes all the public APIs. This includes: * many path renames and import path fixes * a dedicated doc page without much content yet (to be added in the next PRs) * To preserve BC for users still using `torch.distributed._tensor`, I added a shim script that redirects old path calls to the new module. BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it is safe to land the changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113 Approved by: https://github.com/XilunWu ghstack dependencies: #133305, #133306	2024-08-17 05:09:52 +00:00
Wanchao Liang	1a4709cef5	[dtensor] add more documentations (#133306 ) This PR adds more documentation to the DTensor APIs, to prepare for the module becoming public. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133306 Approved by: https://github.com/XilunWu, https://github.com/tianyu-l, https://github.com/wz337 ghstack dependencies: #133305	2024-08-17 05:09:52 +00:00
Wanchao Liang	addee9f4d1	[dtensor] add missing __all__ to public modules (#133305 ) As titled: some submodules are missing `__all__` for API exposure. This PR adds the necessary `__all__` to those modules and also explicitly makes some non-public APIs private. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133305 Approved by: https://github.com/XilunWu, https://github.com/tianyu-l, https://github.com/wz337	2024-08-17 05:09:48 +00:00
Masaki Kozuki	702c810780	move param's device check to `_init_group` for fused (#131153 ) In some cases the params are on the meta device when the optimizer's `__init__` is called and are only materialized in the first computation. This change allows such situations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131153 Approved by: https://github.com/mlazos, https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2024-08-17 04:49:47 +00:00
Oguz Ulgen	12b8e29203	Add a fudge factor to ephemeral NCCL timeout increase (#133722 ) Differential Revision: [D61422431](https://our.internmc.facebook.com/intern/diff/D61422431) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133722 Approved by: https://github.com/c00w, https://github.com/aorenste ghstack dependencies: #133504	2024-08-17 03:08:40 +00:00
Avik Chaudhuri	695d7db2d6	remove dead code for suggesting legacy dynamic shapes fixes (#133700 ) Summary: `dynamic_dim` based dynamic shapes are long gone, so pretty-printing suggested fixes for them is dead code. Test Plan: existing tests Differential Revision: D61398303 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133700 Approved by: https://github.com/zhxchen17	2024-08-17 01:59:34 +00:00
Oguz Ulgen	455f6bda56	Add cache timings info to tlparse (#133504 ) https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpLR1T85/rank_1/0_0_0/fx_graph_cache_hash_11.json Differential Revision: [D61422432](https://our.internmc.facebook.com/intern/diff/D61422432) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133504 Approved by: https://github.com/jamesjwu	2024-08-17 01:37:53 +00:00
Li, Xingyuan	dcfa415e6e	[Inductor UT] Reuse inductor UT for intel GPU `test/inductor/test_compiled_optimizers.py` (#133083 ) [Inductor UT] Reuse Inductor test case for Intel GPU. Reuse `test/inductor/test_compiled_optimizers.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133083 Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/mlazos	2024-08-17 01:15:26 +00:00
Simon Fan	983bea399d	[compiled autograd] move non-hot path logs into default logger (#133541 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133541 Approved by: https://github.com/yf225, https://github.com/bdhirsh ghstack dependencies: #133115, #133148	2024-08-17 00:46:52 +00:00
Simon Fan	0a6cc15079	[compiled autograd] use same graph node names as AOTDispatcher (#133148 ) FIXES https://github.com/pytorch/pytorch/issues/132939 Compiled autograd's trace of the AOT backward may result in some additional ops e.g. clone to make contiguous, trace_wrapped HOPs, so the graphs may be slightly offset from each other hf_Whisper example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpNv89Pu/index.html fsdp2 example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpPdKssS/rank_0/index.html Unit test example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpvoQsnl/index.html ```python ===== Compiled autograd graph ===== <eval_with_key>.14 class CompiledAutograd(torch.nn.Module): def forward(self, inputs, sizes, scalars, hooks): # No stacktrace found for following nodes getitem: "f32[]cpu" = inputs[0] aot1_primals_1: "f32[4]cpu" = inputs[1] aot1_primals_2: "f32[4]cpu" = inputs[2] aot0_sin: "f32[4]cpu" = inputs[3] aot0_cos: "f32[4]cpu" = inputs[4] getitem_5: "f32[4]cpu" = inputs[5]; inputs = None # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: SumBackward0 (NodeCall 1) expand: "f32[4]cpu" = torch.ops.aten.expand.default(getitem, [4]); getitem = None # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: CompiledFunctionBackward1 (NodeCall 2) aot1_tangents_1: "f32[4]cpu" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format); expand = None aot1_sin_1: "f32[4]cpu" = torch.ops.aten.sin.default(aot1_primals_2); aot1_primals_2 = None aot1_neg: "f32[4]cpu" = torch.ops.aten.neg.default(aot1_sin_1); aot1_sin_1 = None aot0_tangents_2: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot1_tangents_1, aot1_neg); aot1_neg = None aot1_cos_1: "f32[4]cpu" = torch.ops.aten.cos.default(aot1_primals_1); aot1_primals_1 = None aot0_tangents_1: "f32[4]cpu" = 
torch.ops.aten.mul.Tensor(aot1_tangents_1, aot1_cos_1); aot1_tangents_1 = aot1_cos_1 = None # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 3) aot0_neg: "f32[4]cpu" = torch.ops.aten.neg.default(aot0_sin); aot0_sin = None aot0_mul: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot0_tangents_2, aot0_neg); aot0_tangents_2 = aot0_neg = None aot0_mul_1: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot0_tangents_1, aot0_cos); aot0_tangents_1 = aot0_cos = None aot0_add: "f32[4]cpu" = torch.ops.aten.add.Tensor(aot0_mul, aot0_mul_1); aot0_mul = aot0_mul_1 = None # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: torch::autograd::AccumulateGrad (NodeCall 4) accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_5, aot0_add); getitem_5 = aot0_add = accumulate_grad_ = None _exec_final_callbacks_stub = torch__dynamo_external_utils__exec_final_callbacks_stub(); _exec_final_callbacks_stub = None return [] ``` where aot1 is ```python class GraphModule(torch.nn.Module): def forward(self, primals_1: "f32[4][1]cpu", primals_2: "f32[4][1]cpu", tangents_1: "f32[4][1]cpu"): # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2233 in torch_dynamo_resume_in_f_at_2232, code: return tmp1.sin() + tmp2.cos() sin_1: "f32[4][1]cpu" = torch.ops.aten.sin.default(primals_2); primals_2 = None neg: "f32[4][1]cpu" = torch.ops.aten.neg.default(sin_1); sin_1 = None mul: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, neg); neg = None cos_1: "f32[4][1]cpu" = torch.ops.aten.cos.default(primals_1); primals_1 = None mul_1: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, cos_1); tangents_1 = cos_1 = None return (mul_1, mul) ``` and aot0 is ```python class GraphModule(torch.nn.Module): def forward(self, sin: "f32[4][1]cpu", cos: "f32[4][1]cpu", tangents_1: "f32[4][1]cpu", tangents_2: "f32[4][1]cpu"): # File: 
/data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2231 in f, code: tmp2 = x.cos() neg: "f32[4][1]cpu" = torch.ops.aten.neg.default(sin); sin = None mul: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_2, neg); tangents_2 = neg = None # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2230 in f, code: tmp1 = x.sin() mul_1: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, cos); tangents_1 = cos = None # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2230 in f, code: tmp1 = x.sin() add: "f32[4][1]cpu" = torch.ops.aten.add.Tensor(mul, mul_1); mul = mul_1 = None return (add,) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133148 Approved by: https://github.com/jansel ghstack dependencies: #133115	2024-08-17 00:46:52 +00:00
Simon Fan	4b3ed8bc52	[compiled autograd] log aot id for CompiledFunctionBackward (#133115 ) Partially addresses https://github.com/pytorch/pytorch/issues/132939. Adds the AOT ID after the CompiledFunctionBackward annotation in verbose compiled autograd logging default (no change): https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp8WCSLf/dedicated_log_torch_trace_xw3ktsi_.log/index.html TORCH_LOGS="compiled_autograd_verbose": https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp8WCSLf/dedicated_log_torch_trace_gsc9q_43.log/index.html ```python # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:361 in set_node_origin, code: CompiledFunctionBackward1 (NodeCall 2) clone: "f32[4]" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format); expand = None cos: "f32[4]" = torch.ops.aten.cos.default(getitem_1); getitem_1 = None mul: "f32[4]" = torch.ops.aten.mul.Tensor(clone, cos); clone = cos = None # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:361 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 3) cos_1: "f32[4]" = torch.ops.aten.cos.default(getitem_2) mul_1: "f32[4]" = torch.ops.aten.mul.Tensor(mul, cos_1); mul = cos_1 = None ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133115 Approved by: https://github.com/jansel	2024-08-17 00:46:52 +00:00
Andrew Gu	b0803129e8	Added meta registration for `_fused_adamw_` (#133728 ) See https://github.com/pytorch/pytorch/issues/123461#issuecomment-2294335273 <img width="1463" alt="Screenshot 2024-08-16 at 5 38 25 PM" src="https://github.com/user-attachments/assets/fe940c0e-775f-4047-bf69-34a3677d539b"> same signature so should be ok to just add the op to the decorator Pull Request resolved: https://github.com/pytorch/pytorch/pull/133728 Approved by: https://github.com/janeyx99, https://github.com/fegin	2024-08-17 00:28:31 +00:00
Sam Larsen	ec28121017	[inductor] Fix test_cudagraph_trees_expandable_segments.py for internal (#133698 ) Summary: These tests aren't running internally because the outer test harness is crashing without listing the tests. To fix we need: * Add a target for the tools/stats/ folder since this test imports it * Add a dependence to that target so it's included in the par * Fix up the relative import syntax, which is somehow different internally vs. fbcode (not sure why this works, but many other tests are doing it) Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees_expandable_segments -- --run-disabled` Differential Revision: D61396711 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133698 Approved by: https://github.com/xuzhao9	2024-08-17 00:09:32 +00:00
leslie-fang-intel	648fc6c9c1	[Inductor][CPP] Refactor the tiling select into a standalone module to enhance its extensibility (#130892 ) Summary After enabling more vectorization, we found that vectorization does not always bring performance benefits. For example, a kernel with several non-contiguous index computations or non-contiguous buffer load/store operations can experience performance regression. A typical case is what we observed in the next PR: after fully enabling vectorization of `index_expr`, we saw a performance regression of `hf_BigBird`. In this PR, we refactor the tiling select into a standalone module so that it can be extended with more advanced tiling-selection heuristics. A standalone class `TilingSelect` with a `select_tiling` method has been added. `select_tiling` accepts `fn_list` and `var_sizes_list` as inputs and returns `tiling_factors` and `tiling_indices`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130892 Approved by: https://github.com/jgong5	2024-08-16 23:55:38 +00:00
Thomas Bohnstingl	d04cd7f3ba	Improvements for associative_scan - Reverse feature (#133011 ) This is part of a series of PRs to improve the functionality of `associative_scan`. This specific PR introduces a `reverse` flag to `associative_scan` to establish an interface similar to that of `jax.lax.associative_scan`. This PR has been derived from https://github.com/pytorch/pytorch/pull/129307. @ydwu4 @Chillee @zou3519 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133011 Approved by: https://github.com/ydwu4	2024-08-16 23:06:31 +00:00
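To illustrate what a `reverse` flag on a scan means, here is a minimal pure-Python sketch. It is not the PyTorch implementation: `inclusive_scan` is a hypothetical helper, and the flip, scan, flip-back strategy is an assumption modelled on how `jax.lax.associative_scan` documents its `reverse` option.

```python
import operator
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def inclusive_scan(
    combine: Callable[[T, T], T], xs: Sequence[T], reverse: bool = False
) -> List[T]:
    """Sequential inclusive scan; hypothetical helper for illustration only."""
    # With reverse=True, scan from the last element toward the first
    # by flipping the input, scanning, and flipping the result back.
    items = list(reversed(xs)) if reverse else list(xs)
    out: List[T] = []
    for x in items:
        out.append(x if not out else combine(out[-1], x))
    return list(reversed(out)) if reverse else out


print(inclusive_scan(operator.add, [1, 2, 3, 4]))                # [1, 3, 6, 10]
print(inclusive_scan(operator.add, [1, 2, 3, 4], reverse=True))  # [10, 9, 7, 4]
```

A real associative scan evaluates the same result with a parallel tree of `combine` calls, which is why the operator must be associative; the sequential loop here only shows the expected output.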
PyTorch MergeBot	19ff9059eb	Revert "[Inductor][CPP] Support vectorization of remainder (#129849 )" This reverts commit 8624a571b4eecd11547867591d70992843265e97. Reverted https://github.com/pytorch/pytorch/pull/129849 on behalf of https://github.com/izaitsevfb due to ptedge_executorch_benchmark build failed again with LLVM crash ([comment](https://github.com/pytorch/pytorch/pull/129849#issuecomment-2294408526))	2024-08-16 22:41:05 +00:00
Xu Han	98d6a6eb7d	[inductor] clean up TODO comments. (#133718 ) clean up TODO comments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133718 Approved by: https://github.com/henrylhtsang	2024-08-16 22:12:01 +00:00
Justin Chu	271ee90851	[easy] Fix type annotation for `ExportedProgram.run_decompositions` (#133720 ) Fix the tuple type annotation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133720 Approved by: https://github.com/Skylion007	2024-08-16 22:11:42 +00:00
Charles David Hernandez	99e789b52b	[Fix 1/n] GPU Test skips - fbcode/ caffe2/test/quantization (#133158 ) Summary: This diff aims to fix the GPU Test skips in the quantization tests under the `caffe2/test/quantization` directory. The changes made in the `TARGETS` files include adding the `should_use_remote_gpu` flag to enable remote GPU testing. This should help to resolve the skipped tests and improve the overall test coverage. [This diff] Fixed skip count: 4 [Running total] Fixed skip count: 4 Note: Creating separate diffs for each test-group. Test Plan: 281475054644766: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_compare_per_channel_device_numerics (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)' https://www.internalfb.com/intern/testinfra/testrun/5629499773981783 281475054644780: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_compare_per_tensor_device_numerics (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)' https://www.internalfb.com/intern/testinfra/testrun/11540474087422107 281475054644853: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_quant_pin_memory (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)' https://www.internalfb.com/intern/testinfra/testrun/11540474087422477 844425008078016: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_cuda_quantization_does_not_pin_memory (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)' https://www.internalfb.com/intern/testinfra/testrun/1407375259845199 Differential Revision: D60055277 Pull Request resolved: 
https://github.com/pytorch/pytorch/pull/133158 Approved by: https://github.com/jovianjaison	2024-08-16 22:00:57 +00:00
Menglu Yu	fd33499b0c	[PT2][Optimus] Fix mixed precison training problem in decompose mem bound (#133626 ) Summary: Recently we observed in AI CMF, enabling decompose_mm pass will lead to mixed dtype for aten.mm and aten.addmm errors. By investigation, we figure out that the error comes from torch.sum, which has an implicit type casting to avoid the possible overflow (a similar discussion in github: https://github.com/pytorch/pytorch/issues/115832). Thus we do the output cast to avoid the error. Test Plan: # unit test ``` buck2 test mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm -- test_decompose_mm_mixed_precision ``` Buck UI: https://www.internalfb.com/buck2/00dc168e-4d65-40f8-b169-f4a58206f641 Test UI: https://www.internalfb.com/intern/testinfra/testrun/17169973624867151 Network: Up: 25KiB Down: 44KiB (reSessionID-b7e2ecc7-16ca-476d-95b2-09ea74645eb0) Jobs completed: 19. Time elapsed: 1:07.6s. Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2) Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. 
Build failure 0 # e2e ads_dper3:68464f2dc5e849ba2670482079cecaaa training_platform:2c41d916ad5dd82f196372a8c7bd37a0 ### build training_platform ``` buck2 run fbcode//fblearner/flow/projects/training_platform:training_platform ``` ### register training_platform ``` buck2 run mode/opt fblearner/flow/projects/training_platform:workflow -- register-workflows --project-name training_platform --flow_version training_platform:2c41d916ad5dd82f196372a8c7bd37a0 ``` ### build ads_dper 3 ``` fbpkg build -E ads_dper3 --yes --expire 14d ``` ### register ads_dper 3 ``` buck2 run //pyper/core/eval_app_utils:flow_utils_script -- register --pkg-version ads_dper3:68464f2dc5e849ba2670482079cecaaa ``` ### extend package (optional) ``` fbpkg expire --extend-only training_platform:2c41d916ad5dd82f196372a8c7bd37a0 30d ``` ### before fix f591360990 ### after fix baseline f591395056 proposal Differential Revision: D61351815 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133626 Approved by: https://github.com/jackiexu1992	2024-08-16 21:53:12 +00:00
Mwiza Kunda	be207af6e1	Disable unwrapping scalar tensors when used as outputs (#132859 ) If the scalar tensor is an output tensor, it shouldn't be unwrapped (i.e. `.item()` called) since `tl.store` requires a pointer type for outputs. This issue only occurs for mutated buffers: the input tensor is also used as an output tensor. Fixes #ISSUE_NUMBER @yanboliang @jansel @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/132859 Approved by: https://github.com/jansel	2024-08-16 21:40:45 +00:00
Denis Vieriu	861bdf96f4	[MPS] Add native strided API for MPSNDArray starting with macOS 15 (#128393 ) Add support for native strides in MPS starting with macOS Sequoia. This will get rid of the additional gather and scatter operations needed to handle the strides or storage offsets of the tensors. Summary of changes (starting with macOS 15): - Add support for MPS strided API (strides/storage offsets etc): - [initWithBuffer:offset:descriptor:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4391636-initwithbuffer?language=objc) - [arrayViewWithCommandBuffer:descriptor:aliasing:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/3114040-arrayviewwithcommandbuffer?language=objc) - [arrayViewWithShape:strides:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4408694-arrayviewwithshape?language=objc) - [reshapeWithCommandBuffer:sourceArray:shape:destinationArray:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarrayidentity/4438557-reshapewithcommandbuffer?language=objc) - Add native support for NHWC convolutions (without incurring any extra copy from NCHW -> NHWC -> NCHW). - Add support for strided output buffers (previously we would create a contiguous buffer). OSes older than macOS 15 will run the old gather/scatter code path to handle strides/storage offsets. --- A couple of performance stats collected from torchbench comparing macOS 15 vs macOS 14: ``` - test_train[functorch_maml_omniglot-mps]: 27% faster - test_train[timm_vision_transformer-mps]: 12% faster - test_train[hf_T5-mps]: 9.46% faster ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128393 Approved by: https://github.com/albanD Co-authored-by: Siddharth Kotapati <skotapati@apple.com>	2024-08-16 21:07:50 +00:00
Jack Taylor	447f428d6d	[ROCm] Fix text_export cudnn_attention UT (#133234 ) On ROCm we should decompose to flash_attention for sdpa instead of cudnn_attention. Need additional conditionalisation in this code. Issue observed: https://hud.pytorch.org/failure?name=rocm%20%2F%20linux-focal-rocm6.1-py3.8%20%2F%20test%20(default%2C%203%2C%206%2C%20linux.rocm.gpu.2)&jobName=undefined&failureCaptures=%5B%22export%2Ftest_export.py%3A%3ATestOneOffModelExportResult%3A%3Atest_scaled_dot_product_attention_cuda%22%5D Pull Request resolved: https://github.com/pytorch/pytorch/pull/133234 Approved by: https://github.com/malfet	2024-08-16 20:49:13 +00:00
Will Feng	f57b00704e	[Traceable FSDP2][Dynamo] Support reconstructing CUDA event object within Dynamo graph (#133635 ) `torch.cuda.Event` objects are different from `torch.cuda.Stream` in that events are not pooled, meaning we can't look up a previously created CUDA event object by ID. This prevents a CUDA event object created outside of the Dynamo graph from being used within the graph (since Dynamo needs a way to emit a `call_function` line in the graph that retrieves the event object for downstream op use). This PR adds a simple object pool to the Dynamo utilities, to support looking up a CUDA event object by ID from within the Dynamo graph. After this PR, if a user creates a CUDA event object outside of the graph and uses that event within the graph, the behavior will exactly match eager. Test commands: - `pytest -rA test/dynamo/test_ctx_manager.py::CtxManagerTests::test_cuda_event_created_outside_of_graph` - `pytest -rA test/dynamo/test_ctx_manager.py::CtxManagerTests::test_cuda_event_across_graph_break` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133635 Approved by: https://github.com/yifuwang ghstack dependencies: #133532, #133531, #133636	2024-08-16 20:40:46 +00:00
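The lookup-by-ID idea in the commit above can be sketched with a plain dictionary-backed pool. This is a generic illustration only: `ObjectPool`, `store`, and `lookup` are hypothetical names, and Dynamo's actual utility may differ (for example in how IDs are chosen or how object lifetimes are managed).

```python
import itertools
from typing import Any, Dict


class ObjectPool:
    """Hypothetical ID-keyed pool; not Dynamo's actual utility."""

    def __init__(self) -> None:
        self._objects: Dict[int, Any] = {}
        self._next_id = itertools.count()

    def store(self, obj: Any) -> int:
        # Hand out a stable integer key that a traced graph could embed as a constant.
        key = next(self._next_id)
        self._objects[key] = obj
        return key

    def lookup(self, key: int) -> Any:
        # A call emitted inside the graph would retrieve the very same object.
        return self._objects[key]


pool = ObjectPool()
event = object()  # stands in for an event created outside the graph
event_id = pool.store(event)
print(pool.lookup(event_id) is event)
```

Because the graph only needs to carry the integer key, the object itself never has to be serialized or reconstructed from its fields.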
Yifu Wang	bc9e20b927	Move the layout constraint registration of aten._scaled_mm.default to module scope (#133669 ) During Inductor lowering, layout constraints for an op are applied before the op's lowering is called. Currently `add_layout_constraint(aten._scaled_mm.default, constrain_to_fx_strides)` is called inside `aten._scaled_mm.default`'s lowering. This means that if the first `_scaled_mm` to be lowered relies on the layout constraint, it won't be applied and the generated code would fail. The issue won't manifest if the first `_scaled_mm` doesn't rely on the layout constraint. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133669 Approved by: https://github.com/drisspg, https://github.com/yangsiyu007	2024-08-16 20:30:13 +00:00
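The ordering problem described in the commit above can be reproduced with a toy registry. This sketch does not use Inductor's APIs; `run_pipeline`, the `constraints` dictionary, and the event strings are hypothetical stand-ins for the real registration and lowering machinery.

```python
from typing import Dict, List


def run_pipeline(register_at_module_scope: bool) -> List[str]:
    """Lower the same toy op twice and record what happened, in order."""
    constraints: Dict[str, str] = {}
    events: List[str] = []

    def lowering() -> None:
        if not register_at_module_scope:
            # Registering here is too late for the call that is already running.
            constraints["scaled_mm"] = "fx_strides"
        events.append("lowered")

    if register_at_module_scope:
        # Registered up front, before any op is lowered.
        constraints["scaled_mm"] = "fx_strides"

    def lower_op(name: str) -> None:
        # Constraints are checked before the op's lowering is invoked.
        if name in constraints:
            events.append("constraint applied")
        lowering()

    lower_op("scaled_mm")
    lower_op("scaled_mm")
    return events


print(run_pipeline(register_at_module_scope=False))
print(run_pipeline(register_at_module_scope=True))
```

With registration inside the lowering, the first call is lowered without its constraint; with registration at module scope, both calls see it.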
Ivan Zaitsev	88ba50279c	Consolidate the format for `--max-acc-splits` flag (#133724 ) fixes the partial export of [lowering] Add max_acc_splits (#133041) ([D60133589](https://www.internalfb.com/diff/D60133589)) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133724 Approved by: https://github.com/kit1980	2024-08-16 20:28:55 +00:00
Aaron Gokaslan	3ac527ac5f	[BE][Ez]: Update cudnn_frontend submodule to 1.6.0 (#133687 ) Updates CUDNN_frontend header only library to make the most of the newest CUDNN features and decrease the overhead of the library. Copied from commit: New API - Graph Slice Operation: Introduced the graph.slice operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added. - SM Carveout Feature: Added the set_sm_count(int32_t type) graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED. Bug Fixes - Convolution Mode Attribute: Added the missing set_convolution_mode attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API. - SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node. Enhancements - Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size. - Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph). - Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later. - Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input. - Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if called without the preceding build_operation_graph. - JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks. 
- Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls. - CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details. Samples - Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133687 Approved by: https://github.com/eqy, https://github.com/malfet	2024-08-16 20:27:23 +00:00
Ivan Zaitsev	41e6619509	[codemod] Del un at::native::metal @ MPSCNNFullyConnectedOp.h:6 (export D59157302) (#133515 ) Manual export of D59157302 Original description: Removes a using namespace from the global namespace in pursuit of enabling -Wheader-hygiene. Qualifies instances that relied on the using namespace. @diff-train-skip-merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/133515 Approved by: https://github.com/kit1980, https://github.com/malfet	2024-08-16 19:59:07 +00:00
PyTorch MergeBot	a0cb54ab46	Revert "C++ network flow implementation in c10 (#132188 )" This reverts commit e6272acaec63c960486b3ac558d0199cd65d7b97. Reverted https://github.com/pytorch/pytorch/pull/132188 on behalf of https://github.com/izaitsevfb due to breaks aps models and builds internally ([comment](https://github.com/pytorch/pytorch/pull/132188#issuecomment-2294120234))	2024-08-16 19:48:54 +00:00
atalman	fb59440791	Use dedicated docker-build environment for manywheel, libtorch and conda Docker builds - 2 (#133709 ) Follow up after https://github.com/pytorch/pytorch/pull/133699. 2 more places where we need to pass these env vars. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133709 Approved by: https://github.com/Skylion007, https://github.com/seemethere	2024-08-16 19:41:11 +00:00
Yanbo Liang	678a8f9e66	[Inductor][FlexAttention] Small cleanup for FlexAttention kernel template (#133664 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133664 Approved by: https://github.com/drisspg	2024-08-16 19:33:36 +00:00
Siddharth Kotapati	611c104370	[MPS] Add workaround for nonzero with large/complex inputs (#126188 ) Fixes Issue #122916 Resolves correctness issue seen with large inputs to the mps nonzero op by using a different scatter mode. Native nonzero op is still used with smaller inputs for better performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126188 Approved by: https://github.com/kulinseth, https://github.com/malfet	2024-08-16 19:04:04 +00:00
Oguz Ulgen	0063e56949	Make FX Graph Cache work with distributed training (#133374 ) During distributed training, if all ranks except one hit the cache, the rank that did not hit the cache will cause an NCCL timeout, since the rest of the ranks will enter the collective and start the timer. This PR uses the new PTD API to increase the timeout for the ranks that hit the cache by the amount of time the cache would save. Differential Revision: [D61363722](https://our.internmc.facebook.com/intern/diff/D61363722) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133374 Approved by: https://github.com/ezyang	2024-08-16 18:51:14 +00:00
Matthias Braun	5ee070266f	Workaround ASAN failure (#133623 ) Summary: ASAN in llvm 17.x and newer reads 8 bytes in front of every function called. This means the JIT must not place a function immediately at the beginning of a freshly `mmap`ed page. This adds an 8 byte sized dummy variable as the first thing to work around the problem. See also: - https://reviews.llvm.org/D148665 - https://github.com/llvm/llvm-project/issues/65253 Test Plan: - `servicelab create cogwheel_adfinder_ubsan_multi_trial_test --local-commit`: https://www.internalfb.com/servicelab/experiment/3701354882 - sandcastle Differential Revision: D61348865 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133623 Approved by: https://github.com/Skylion007	2024-08-16 18:48:10 +00:00
cyy	90c3669cd9	Make sure T::is_traceable is bool (#133673 ) Add static_assert to C++ templates in custom_function Pull Request resolved: https://github.com/pytorch/pytorch/pull/133673 Approved by: https://github.com/Skylion007	2024-08-16 18:28:02 +00:00
wz337	eb3d517605	[Test] Add SkipIfRocm to test_grad_acc_cpu_offload (#132975 ) Fixes #123726 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132975 Approved by: https://github.com/malfet	2024-08-16 18:26:20 +00:00
rzou	e5baf43b61	[Inductor] short-term fix for needs_fixed_stride_order silent incorrectness (#133452 ) This is a low-risk short-term fix for https://github.com/pytorch/pytorch/issues/128084, for the purposes of 2.4.1. The actual fix for that issue is more risky and we'll target 2.5. needs_fixed_stride_order is silently incorrect with args that are mutable because it creates clones of those args, writes into them, and doesn't update the original args. This PR makes it so that needs_fixed_stride_order doesn't apply to inputs that are being mutated. This PR doesn't completely fix the problem, but it makes it less incorrect: most of the time the input already has the correct strides but inductor fails to recognize it, and in those cases writing directly to the input is fine. Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/133452 Approved by: https://github.com/eellison	2024-08-16 18:14:57 +00:00
atalman	caaa339e0f	Use dedicated docker-build environment for manywheel, libtorch and conda Docker builds (#133699 ) BE change. Apply logic similar to: https://github.com/pytorch/pytorch/blob/main/.github/workflows/docker-builds.yml Pull Request resolved: https://github.com/pytorch/pytorch/pull/133699 Approved by: https://github.com/seemethere	2024-08-16 18:10:43 +00:00
PyTorch MergeBot	b833990a8f	Revert "[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493 )" This reverts commit 4aa66f68a803927ddd127ceaaa1521b8d6e90e5f. Reverted https://github.com/pytorch/pytorch/pull/131493 on behalf of https://github.com/izaitsevfb due to breaks internal builds with identifier "std::numeric_limits< ::cutlass::half_t> ::infinity" is undefined in device code ([comment](https://github.com/pytorch/pytorch/pull/131493#issuecomment-2293939390))	2024-08-16 18:09:33 +00:00
Bill Yoshimi	4ee65c7e4e	Add message text to BypassFxGraphCache exceptions. (#133505 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133505 Approved by: https://github.com/oulgen	2024-08-16 18:02:59 +00:00
Will Feng	1df1d00ffc	[Traceable FSDP2] Remove usage of tuple() generator and simplify code (#133636 ) Dynamo doesn't support `tuple()` generator, and this change also simplifies code a bit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133636 Approved by: https://github.com/awgu ghstack dependencies: #133532, #133531	2024-08-16 17:47:28 +00:00
Shunting Zhang	374c61cc82	[inductor] make conv template work with symbolic stride/padding (#132938 ) Fix https://github.com/pytorch/pytorch/issues/132716 The triton template for convolution does not work when the stride or padding contains dynamic shape. Use the hint and add guards to handle that. An alternative is to fall back to eager, but since I've seen the lowering rule for convolution use the hint in other cases, I'll just follow the convention. I don't really know how to add a unit test here since I need to create symbolic strides (not strides of a tensor but the stride parameter for convolution) and paddings. I can try harder if reviewers want me to add unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132938 Approved by: https://github.com/jansel, https://github.com/eellison ghstack dependencies: #132952	2024-08-16 17:45:12 +00:00
atalman	2cffe82dea	Fix triton build failure due to tritonlang.blob.core.windows.net not available (#133694 ) This should mitigate https://github.com/triton-lang/triton/issues/4527 We should also remove this once our triton pin moves past: https://github.com/triton-lang/triton/pull/4216 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133694 Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/malfet	2024-08-16 17:34:30 +00:00
Menglu Yu	f735038c8f	[PT2][Optimus] Add unbind_stack_to_slices pass (#133420 ) Summary: We find another pattern to be optimized in AI CMF, thus we add the new pattern Test Plan: # unit test ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes ``` Buck UI: https://www.internalfb.com/buck2/b0b9bdf6-1bd1-45db-ba2c-a6892d9d557e Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900285323964 Network: Up: 595KiB Down: 1.7MiB (reSessionID-e527c3b3-03ac-45f8-bd08-3eb9a28b7dc0) Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0 # benchmark ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ai_cmf" --flow_id 558295195 -n ``` P1520513078 Counter({'pattern_matcher_nodes': 1756, 'pattern_matcher_count': 936, 'normalization_pass': 280, 'merge_splits_pass': 250, 'scmerge_cat_removed': 14, 'scmerge_cat_added': 12, 'scmerge_split_removed': 7, 'unbind_stack_pass': 7, 'split_stack_to_cats_pass': 4, 'scmerge_split_sections_removed': 3, 'split_cat_pass': 2, 'scmerge_split_added': 2, 'split_cat_to_slices_pass': 2, 'unbind_stack_to_slices_pass': 1} # e2e (OBA AFOC) baseline f590253290 proposal f591051921 ### QPS and NE {F1804187079} ### trace analysis baseline trace link: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff590283096-TrainingApplication%2F4%2Frank-1.Aug_12_08_52_03.3628.pt.trace.json.gz&bucket=pyper_traces proposal trace link: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff591081210-TrainingApplication%2F0%2Frank-1.Aug_12_22_23_35.3401.pt.trace.json.gz&bucket=pyper_traces {F1804227687}{F1804227675} Based on the traces, the green part has been shrinked due to optimus transformation. 
Differential Revision: D61039466 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133420 Approved by: https://github.com/jackiexu1992	2024-08-16 17:30:35 +00:00
Will Feng	6790eb52f9	[Traceable FSDP2] Set torch._dynamo.config.skip_fsdp_hooks to True by default (#133531 ) Setting `torch._dynamo.config.skip_fsdp_hooks = True` is required for graph-break compiled FSDP2, thus setting it to default will make this adoption easier. If users want to use Traceable FSDP2, they can set this to False manually (which will allow FSDP2 hooks to be traced through). Pull Request resolved: https://github.com/pytorch/pytorch/pull/133531 Approved by: https://github.com/awgu ghstack dependencies: #133532	2024-08-16 17:18:42 +00:00
Will Feng	6d85077168	[Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#133532 ) Test commands: - `python test/distributed/_composable/fsdp/test_fully_shard_training.py TestFullyShard1DTrainingCompose.test_train_parity_with_activation_checkpointing` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133532 Approved by: https://github.com/yanboliang	2024-08-16 17:13:47 +00:00
Aleksei Nikiforov	18705e371d	S390x nightly binaries for python 3.13 (#132984 ) Enable building python 3.13 nightly binaries for s390x Pull Request resolved: https://github.com/pytorch/pytorch/pull/132984 Approved by: https://github.com/malfet	2024-08-16 17:07:27 +00:00
Yanbo Liang	770086fe39	[Dynamo] Support torch.cuda.device ctx manager (#133385 ) Fixes #128059 I'm not sure if this is the right way, since Inductor doesn't always respect the device id set by users, so probably we should just wrap it as null context manager and print a warning. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @jansel @anijain2305 @mlazos @williamwen42 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133385 Approved by: https://github.com/jansel	2024-08-16 17:05:55 +00:00
Alnis Murtovi	38e5ee1a34	mixed_mm: add more extensive dtype testing (#133292 ) This PR adds a test that tests more combinations of dtypes. The bfloat16 and uint8 combination causes a crash somewhere in triton during the generation of LLVM code. Tests like these would have also prevented segfaults like this one https://github.com/pytorch/pytorch/pull/133173. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133292 Approved by: https://github.com/shunting314	2024-08-16 16:49:27 +00:00
Shivam Raikundalia	9c2d119194	[Profiler/CPU] Add API for Dynamic Activity Toggling [3/n] (#133353 ) Summary: In this diff, we add the CPU activity implementation of being able to dynamically toggle profiling in between steps. To do this we remove the callbacks for Torch Ops and add them back in when an enable call is made. This diff also adds some support code for doing the same in python; however, the python stack comes with its own set of complications when enabling this feature. For one, we get into a scenario where the python stack during the toggle never gets an exit as the tracing gets turned off, which makes for some tricky post processing. For this reason, we can leave the python dynamic toggling off for now and revisit if there is enough demand. Test Plan: Got the following tracing by disabling torch and cuda ops: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Aug_13_13_03_02.606577.pt.trace.json.gz&bucket=gpu_traces Differential Revision: D61221497 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133353 Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi	2024-08-16 16:36:57 +00:00
Shuqiang Zhang	46af996ce7	[c10d] Do not call ncclCommAbort if comm is not initialized (#133630 ) Summary: We saw ncclCommAbort get called and hang during NCCLComm:create. If the NCCL comm is not properly initialized, the behavior of ncclCommAbort is 'undefined'; avoiding the call allows the process to properly throw an exception. Test Plan: Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/133630 Approved by: https://github.com/wconstab	2024-08-16 16:25:07 +00:00
Alnis Murtovi	8b8b4e5ae9	AutoHeuristic: documentation for mm (#133611 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133611 Approved by: https://github.com/eellison ghstack dependencies: #131705, #131710, #131714, #133608	2024-08-16 16:20:38 +00:00
Alnis Murtovi	0e0077f3b6	AutoHeuristic: mm ranking heuristic h100 (#133608 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133608 Approved by: https://github.com/eellison ghstack dependencies: #131705, #131710, #131714	2024-08-16 16:20:38 +00:00
Alnis Murtovi	e51c8ad369	AutoHeuristic: Heuristic that ranks choices for mm (#131714 ) This PR adds a heuristic for tuned_mm that predicts the top 10 best choices. To be safe, aten.mm is always included. Perf run: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2008%20Aug%202024%2020%3A20%3A28%20GMT&stopTime=Thu%2C%2015%20Aug%202024%2020%3A20%3A28%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/AlnisM/22/head&lCommit=905826f4ab5344efb0bcaa87e3b27a25299927ab&rBranch=main&rCommit=79ca596dc6ea16b6cdd0f2517451e19840717d37 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131714 Approved by: https://github.com/eellison ghstack dependencies: #131705, #131710	2024-08-16 16:20:38 +00:00
Aaron Gokaslan	51e13745be	[BE]: Update ruff to 0.6.0 (#133609 ) Updates ruff and fixes a couple false negatives it discovered. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133609 Approved by: https://github.com/malfet	2024-08-16 14:11:01 +00:00
Jiong Gong	eca8b4220f	[inductor][cpp][gemm] fix k-slicing bug and add thread blocking config (#132730 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132730 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel ghstack dependencies: #132729	2024-08-16 13:50:19 +00:00
atalman	a6aa451bde	Move python 3.8 to 3.9 for linux-binary-manywheel workflow (#133621 ) Part of Deprecation of python 3.8 and moving to 3.9. Related to: https://github.com/pytorch/pytorch/issues/120718 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133621 Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/malfet	2024-08-16 13:49:26 +00:00
PyTorch MergeBot	e1b9b89d94	Revert "[Flight Recorder] Add more basic analysis to the script (#133412 )" This reverts commit fcc2fc1a70c35628939611b496b209fa0a1d19bf. Reverted https://github.com/pytorch/pytorch/pull/133412 on behalf of https://github.com/atalman due to New test: distributed/flight_recorder/test_fr_analysis is constantly failing ([comment](https://github.com/pytorch/pytorch/pull/133412#issuecomment-2293506539))	2024-08-16 13:26:25 +00:00
Isuru Fernando	b444343087	Fix printing symfloat pow in triton (#133614 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133614 Approved by: https://github.com/Skylion007	2024-08-16 13:08:29 +00:00
Wu, Chunyuan	762b1b4c17	[inductor] [cpp] fix accuracy when template_buffer has users other than the epilogue nodes (#133073 ) This PR fixes the accuracy issues when template_buffer has users other than the epilogue nodes. This will fix the accuracy failure of the below models using max-autotune: - MobileBertForMaskedLM - MobileBertForQuestionAnswering - convnext_base - swin_base_patch4_window7_224 ## Issue 1: Previously we always added `template_buffer` as an alias of `Y`. In case the `template_buffer` has users other than the epilogue nodes, we shouldn't set it as an alias of `Y`. This PR adds the check for such a case. Wrong code before the fix, where `tmp4` and `tmp9` are both stored to `Y` while we need 2 different buffers for them since `tmp4` will be used by nodes other than the epilogue node: ```cpp Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp4; // tmp4 is the output of the template Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp9; // tmp9 is the output of the epilogue node ``` Correct code after the fix: ```cpp out_ptr2[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp4; Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp9; ``` ## Issue 2: When fixing the above issue, we found that there's a correctness issue when `bias` is `False`. The root cause is that in the case where `bias` is `False`, the `template_buffer` has users other than the epilogue nodes and the GEMM output buffer is localized, we need to add an extra copy epilogue to ensure that the GEMM output (a local buffer) is stored to the `template_buffer` that will be used later by other nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133073 Approved by: https://github.com/jgong5 ghstack dependencies: #133070	2024-08-16 12:13:10 +00:00
Nicolas Macchioni	dd69013c7a	deprecate `search_autotune_cache` (#133628 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133628 Approved by: https://github.com/oulgen	2024-08-16 09:29:39 +00:00
Nicolas Macchioni	15183f5ebf	overestimate `time_taken_ns` for autotuning (#133633 ) tldr; in `autotune_to_one_config` we now include the precompile time, and in coordesc tuning we include the time from `autotune_to_one_config`, since this is a precursor Pull Request resolved: https://github.com/pytorch/pytorch/pull/133633 Approved by: https://github.com/oulgen, https://github.com/eellison	2024-08-16 09:28:49 +00:00
Oguz Ulgen	30fbf5b19c	Remove AMD restrictions on triton hashing (#133616 ) Summary: When we added these functions, AMD's triton checkout was very old, it appears to have caught up. Remove restrictions. Test Plan: unit tests Differential Revision: D61351473 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133616 Approved by: https://github.com/mxz297, https://github.com/nmacchioni, https://github.com/eellison	2024-08-16 08:02:48 +00:00
Avik Chaudhuri	5ed3b70d09	remove redundant upper bound check at runtime (#133627 ) Summary: Some symbols (unbacked symints?) can have upper bound that is `sys.maxsize - 1` but our code for runtime assertions assumes that such upper bounds would come in as `sympy.oo` (like backed symints?) in order to drop them. So we weren't dropping them, which this PR fixes. Test Plan: added test Differential Revision: D61352056 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133627 Approved by: https://github.com/SherlockNoMad	2024-08-16 06:57:12 +00:00
angelayi	f64146aff0	Update source matcher to use torch_fn (#133642 ) Updating the source matcher to also accept pattern matching on the torch_fn metadata, which exists in both strict and non-strict export. We want to replace the use of source_fn_stack with torch_fn, as it's not possible for us to get source_fn_stack in non-strict export. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133642 Approved by: https://github.com/ydwu4	2024-08-16 06:42:52 +00:00
Aleksandar Samardžić	d12bbcd785	Add auto-tuning for sparse semi-structured MM operator (#123742 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123742 Approved by: https://github.com/kadeng	2024-08-16 06:40:24 +00:00
Max Podkorytov	3d45717219	[ROCm][CK][Inductor] enable dynamic shapes for CK backend to gemm max autotune (#133285 ) This PR enables dynamic shapes for the CK backend for gemm max autotune (see #125453). This is achieved via unhardcoding the problem sizes from the template body and passing them as parameters instead. We handle passing the problem sizes for the kernel call as well as for the benchmark call. # Testing `pytest test/inductor/test_ck_backend.py [-k dynamic]` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133285 Approved by: https://github.com/ColinPeppler	2024-08-16 06:05:23 +00:00
Menglu Yu	8ea5b572a6	[PT2][Optimus] Add missing example value for the nodes introduced in group batch fusion (#133414 ) Summary: Recently we observed more missing example values in nodes introduced in Optimus, which causes problems for further optimizations that need this node info. Thus we add the meta for these nodes in the diff. Test Plan: # unit test ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes ``` Buck UI: https://www.internalfb.com/buck2/c0ad506f-ce9d-4b80-947a-cb79074b72f0 Test UI: https://www.internalfb.com/intern/testinfra/testrun/2251800058834808 Network: Up: 1.4GiB Down: 2.0GiB (reSessionID-fb781425-f29b-44b5-8a5b-daffe7274f86) Jobs completed: 300289. Time elapsed: 13:19.5s. Cache hits: 99%. Commands: 119360 (cached: 118494, remote: 824, local: 42) Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0 # benchmark ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf_shrink" --flow_id 587303213 ``` P1520691492 Differential Revision: D61039772 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133414 Approved by: https://github.com/jackiexu1992	2024-08-16 04:52:16 +00:00
Animesh Jain	8a2b064236	[dynamo][user_defined][stable-diffusion] Raise ObservedAttributeError on UserDefinedObject var_getattr (#132806 ) Fixes https://github.com/pytorch/pytorch/issues/132551 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132806 Approved by: https://github.com/williamwen42	2024-08-16 04:30:06 +00:00
fduwjj	fcc2fc1a70	[Flight Recorder] Add more basic analysis to the script (#133412 ) This is the first step to make sure we have a basic function of analyzer for FR in production. - We want to use this script to find out abnormalities in collectives and report it to users. - We also fixed some type errors. - [Ongoing] Also we will add more unit tests to this script and make it modularized so that we can better maintain it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133412 Approved by: https://github.com/c-p-i-o	2024-08-16 03:53:12 +00:00
Shangdi Yu	d9f17cf4e4	[fx] Do not add Proxy on Tensor (#133470 ) Summary: Switch to set_proxy_slot instead of setting the proxy directly on the Tensor. We do not want to add Proxy to tensor objects, because Proxy cannot be deepcopied or pickled and can cause problems when users want to deepcopy or pickle models. Test Plan: CI Differential Revision: D61277650 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133470 Approved by: https://github.com/zou3519	2024-08-16 03:39:50 +00:00
Animesh Jain	8a5708ba3d	[dynamo] Support object creation of classes with custom __new__ (#132977 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132977 Approved by: https://github.com/jansel	2024-08-16 03:09:23 +00:00
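
Note on commit 5ed3b70d09 (#133627) in the list above: the message says that some upper bounds arrive as `sys.maxsize - 1` rather than as `sympy.oo`, so the runtime-assertion code did not recognize them as "no real bound" and kept them. The following is only a minimal sketch of that idea, not the PyTorch implementation: `math.inf` stands in for `sympy.oo`, and the function names and the `bounds` dictionary are hypothetical.

```python
import math
import sys


def is_effectively_unbounded(upper):
    """Return True when an upper bound carries no real constraint.

    Assumption for this sketch: math.inf plays the role of sympy.oo,
    and anything at or above sys.maxsize - 1 is treated the same way.
    """
    return upper == math.inf or upper >= sys.maxsize - 1


def runtime_upper_bound_checks(bounds):
    """Keep only the symbols whose upper bound is worth asserting at runtime.

    `bounds` maps a (hypothetical) symbol name to its upper bound.
    """
    return {
        name: upper
        for name, upper in bounds.items()
        if not is_effectively_unbounded(upper)
    }


example = {"u0": sys.maxsize - 1, "s0": math.inf, "s1": 1024}
kept = runtime_upper_bound_checks(example)
```

Under these assumptions only `s1` keeps an upper-bound check, since both `u0` and `s0` are treated as unbounded.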

2160 changed files with 93539 additions and 52133 deletions

6

.ci/docker/aotriton_version.txt


 @ -1,5 +1,5 @@
 .6b
 .7b
 manylinux_2_17
 rocm6.2
 f07e8a1cb1f99627eb6d77f5c0e9295c775f3c7
 e4ab195d2bd19e939c675a13280c29714c6ef9f2cf420690da150fa0cac043b1
 be04068c3c0857a4cfd17d7e39e71d0423ebac2
 e9e1959d23b93d78a08fcc5f868125dc3854dece32fd9458be9ef4467982291

									
										61

.ci/docker/build.sh
									
												
				@ -92,7 +92,7 @@ _UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b

				# from scratch

				case "$image" in

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.4.0

				    CUDA_VERSION=12.4.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				@ -120,7 +120,7 @@ case "$image" in

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.4.0

				    CUDA_VERSION=12.4.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				@ -165,7 +165,7 @@ case "$image" in

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.4.0

				    CUDA_VERSION=12.4.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=9

				@ -194,7 +194,7 @@ case "$image" in

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.4.0

				    CUDA_VERSION=12.4.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				@ -222,7 +222,7 @@ case "$image" in

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.4.0

				    CUDA_VERSION=12.4.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				@ -236,7 +236,7 @@ case "$image" in

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-py3-clang10-onnx)

				    ANACONDA_PYTHON_VERSION=3.8

				    ANACONDA_PYTHON_VERSION=3.9

				    CLANG_VERSION=10

				    PROTOBUF=yes

				    DB=yes

				@ -245,7 +245,7 @@ case "$image" in

				    ONNX=yes

				    ;;

				  pytorch-linux-focal-py3-clang9-android-ndk-r21e)

				    ANACONDA_PYTHON_VERSION=3.8

				    ANACONDA_PYTHON_VERSION=3.9

				    CLANG_VERSION=9

				    LLVMDEV=yes

				    PROTOBUF=yes

				@ -254,8 +254,8 @@ case "$image" in

				    GRADLE_VERSION=6.8.3

				    NINJA_VERSION=1.9.0

				    ;;

				  pytorch-linux-focal-py3.8-clang10)

				    ANACONDA_PYTHON_VERSION=3.8

				  pytorch-linux-focal-py3.9-clang10)

				    ANACONDA_PYTHON_VERSION=3.9

				    CLANG_VERSION=10

				    PROTOBUF=yes

				    DB=yes

				@ -276,8 +276,8 @@ case "$image" in

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-py3.8-gcc9)

				    ANACONDA_PYTHON_VERSION=3.8

				  pytorch-linux-focal-py3.9-gcc9)

				    ANACONDA_PYTHON_VERSION=3.9

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				@ -286,18 +286,7 @@ case "$image" in

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-rocm-n-1-py3)

				    ANACONDA_PYTHON_VERSION=3.8

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    ROCM_VERSION=6.0

				    NINJA_VERSION=1.9.0

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-rocm-n-py3)

				    ANACONDA_PYTHON_VERSION=3.8

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				@ -307,8 +296,19 @@ case "$image" in

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-rocm-n-py3)

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    ROCM_VERSION=6.2

				    NINJA_VERSION=1.9.0

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-xpu-2024.0-py3)

				    ANACONDA_PYTHON_VERSION=3.8

				    ANACONDA_PYTHON_VERSION=3.9

				    GCC_VERSION=11

				    PROTOBUF=yes

				    DB=yes

				@ -318,8 +318,8 @@ case "$image" in

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				    pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks)

				    ANACONDA_PYTHON_VERSION=3.8

				    pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)

				    ANACONDA_PYTHON_VERSION=3.9

				    GCC_VERSION=11

				    PROTOBUF=yes

				    DB=yes

				@ -330,8 +330,8 @@ case "$image" in

				    DOCS=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-cuda11.8-cudnn9-py3.8-clang12)

				    ANACONDA_PYTHON_VERSION=3.8

				  pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-clang12)

				    ANACONDA_PYTHON_VERSION=3.9

				    CUDA_VERSION=11.8

				    CUDNN_VERSION=9

				    CLANG_VERSION=12

				@ -355,8 +355,8 @@ case "$image" in

				    CONDA_CMAKE=yes

				    VISION=yes

				    ;;

				  pytorch-linux-jammy-py3.8-gcc11)

				    ANACONDA_PYTHON_VERSION=3.8

				  pytorch-linux-jammy-py3.9-gcc11)

				    ANACONDA_PYTHON_VERSION=3.9

				    GCC_VERSION=11

				    PROTOBUF=yes

				    DB=yes

				@ -379,6 +379,7 @@ case "$image" in

				    GCC_VERSION=11

				    CONDA_CMAKE=yes

				    HALIDE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-linter)

				    # TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.

									
										4

.ci/docker/centos-rocm/Dockerfile
									
												
				@ -108,10 +108,10 @@ ENV CMAKE_C_COMPILER cc

				ENV CMAKE_CXX_COMPILER c++

				COPY ./common/install_triton.sh install_triton.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/triton-rocm.txt triton-rocm.txt

				COPY ci_commit_pins/triton.txt triton.txt

				COPY triton_version.txt triton_version.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt

				RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt

				# Install AOTriton (Early fail)

				COPY ./aotriton_version.txt aotriton_version.txt

2

.ci/docker/ci_commit_pins/executorch.txt


 @ -1 +1 @@
 e9bab8c5956249e75a0f187bf8075df97ca2555
 cd1c833b079adb324871dcbbe75b43d42ffc0ade

2

.ci/docker/ci_commit_pins/halide.txt


 @ -1 +1 @@
 fec6d3ebc73e7a19eba1663e9b0ba8ab2d
 c12871f336fe6f57b55d6a297f13ef209161b

2

.ci/docker/ci_commit_pins/timm.txt


 @ -1 +1 @@
 b907b4d45a4713cbc425cbf224c46089fd514
 ac3470188b914c5d7a5058a7e28b9eb685a62427

1

.ci/docker/ci_commit_pins/triton-rocm.txt


 @ -1 +0,0 @@
 21eae954efa5bf584da70324b640288c3ee7aede

2

.ci/docker/ci_commit_pins/triton-xpu.txt


 @ -1 +1 @@
 b2f15840e0d70eec50d84c7a0575cb835524def
 b14bf5593cf58a8541f3e6b9125600a867d4ef

2

.ci/docker/ci_commit_pins/triton.txt


 @ -1 +1 @@
 dedb7bdf339a3546896d4820366ca562c586bfa0
 fe38ffd73c2ac6ed6323b554205186696631c6f

									
										4

.ci/docker/common/install_aotriton.sh
									
												
				@ -4,12 +4,12 @@ set -ex

				source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"

				TARBALL='aotriton.tar.bz2'

				TARBALL='aotriton.tar.gz'

				# This read command alwasy returns with exit code 1

				read -d "\n" VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < aotriton_version.txt || true

				ARCH=$(uname -m)

				AOTRITON_INSTALL_PREFIX="$1"

				AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.bz2"

				AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.gz"

				cd "${AOTRITON_INSTALL_PREFIX}"

				# Must use -L to follow redirects

									
										33

.ci/docker/common/install_conda.sh
									
												
				@ -5,32 +5,22 @@ set -ex

				# Optionally install conda

				if [ -n "$ANACONDA_PYTHON_VERSION" ]; then

				  BASE_URL="https://repo.anaconda.com/miniconda"

				  CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"

				  if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then

				    BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"

				    CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"

				  fi

				  MAJOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 1)

				  MINOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 2)

				if [[ $(uname -m) == "aarch64" ]]; then

				  BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"

				  case "$MAJOR_PYTHON_VERSION" in

				    3)

				      CONDA_FILE="Miniforge3-Linux-aarch64.sh"

				    ;;

				    3);;

				    *)

				      echo "Unsupported ANACONDA_PYTHON_VERSION: $ANACONDA_PYTHON_VERSION"

				      exit 1

				      ;;

				  esac

				else

				  case "$MAJOR_PYTHON_VERSION" in

				    3)

				      CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"

				    ;;

				    *)

				      echo "Unsupported ANACONDA_PYTHON_VERSION: $ANACONDA_PYTHON_VERSION"

				      exit 1

				      ;;

				  esac

				fi

				  mkdir -p /opt/conda

				  chown jenkins:jenkins /opt/conda

				@ -78,19 +68,20 @@ fi

				    CONDA_COMMON_DEPS="astunparse pyyaml setuptools openblas==0.3.25=*openmp* ninja==1.11.1 scons==4.5.2"

				    if [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then

				      conda_install numpy=1.24.4 ${CONDA_COMMON_DEPS}

				      NUMPY_VERSION=1.24.4

				    else

				      conda_install numpy=1.26.2 ${CONDA_COMMON_DEPS}

				      NUMPY_VERSION=1.26.2

				    fi

				  else

				    CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"

				    if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.13" ]; then

				      conda_install numpy=1.26.0 ${CONDA_COMMON_DEPS}

				      NUMPY_VERSION=1.26.0

				    else

				      conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}

				      NUMPY_VERSION=1.21.2

				    fi

				  fi

				  conda_install ${CONDA_COMMON_DEPS}

				  # Install llvm-8 as it is required to compile llvmlite-0.30.0 from source

				  # and libpython-static for torch deploy

				@ -112,7 +103,7 @@ fi

				  # Install some other packages, including those needed for Python test reporting

				  pip_install -r /opt/conda/requirements-ci.txt

				  pip_install numpy=="$NUMPY_VERSION"

				  pip_install -U scikit-learn

				  if [ -n "$DOCS" ]; then

									
										25

.ci/docker/common/install_cpython.sh
									
												
				@ -7,7 +7,7 @@ PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/hea

				GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py

				# Python versions to be installed in /opt/$VERSION_NO

				CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0"}

				CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t"}

				function check_var {

				    if [ -z "$1" ]; then

				@ -22,6 +22,13 @@ function do_cpython_build {

				    check_var $py_ver

				    check_var $py_folder

				    tar -xzf Python-$py_ver.tgz

				    local additional_flags=""

				    if [ "$py_ver" == "3.13.0t" ]; then

				        additional_flags=" --disable-gil"

				        mv cpython-3.13/ cpython-3.13t/

				    fi

				    pushd $py_folder

				    local prefix="/opt/_internal/cpython-${py_ver}"

				@ -37,8 +44,10 @@ function do_cpython_build {

				        local openssl_flags="--with-openssl=${WITH_OPENSSL} --with-openssl-rpath=auto"

				    fi

				    # -Wformat added for https://bugs.python.org/issue17547 on Python 2.6

				    CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} > /dev/null

				    CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} ${additional_flags} > /dev/null

				    make -j40 > /dev/null

				    make install > /dev/null

				@ -58,7 +67,8 @@ function do_cpython_build {

				    if [ -e ${prefix}/bin/pip3 ] && [ ! -e ${prefix}/bin/pip ]; then

				        ln -s pip3 ${prefix}/bin/pip

				    fi

				    ${prefix}/bin/pip install wheel==0.34.2

				    # install setuptools since python 3.12 is required to use distutils

				    ${prefix}/bin/pip install wheel==0.34.2 setuptools==68.2.2

				    local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")

				    ln -s ${prefix} /opt/python/${abi_tag}

				}

				@ -68,7 +78,14 @@ function build_cpython {

				    check_var $py_ver

				    check_var $PYTHON_DOWNLOAD_URL

				    local py_ver_folder=$py_ver

				    if [ "$py_ver" = "3.13.0" ]; then

				    if [ "$py_ver" = "3.13.0t" ]; then

				        PY_VER_SHORT="3.13"

				        PYT_VER_SHORT="3.13t"

				        check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH

				        wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz

				        do_cpython_build $py_ver cpython-$PYT_VER_SHORT

				    elif [ "$py_ver" = "3.13.0" ]; then

				        PY_VER_SHORT="3.13"

				        check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH

				        wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz
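Reviewer note: the `install_cpython.sh` hunks above treat `3.13.0t` as a free-threaded build of 3.13 by appending `--disable-gil` to `./configure`. A minimal sketch of that mapping follows. The helper names `extra_configure_flags` and `base_version` are hypothetical and not in the script, and the sketch generalizes to any version with a trailing `t`, whereas the diff only checks for the literal `3.13.0t`.

```shell
#!/bin/sh
# Hypothetical helpers, not part of install_cpython.sh.

# Print the extra ./configure flags for a version string.
# Assumption: a trailing "t" marks a free-threaded (no-GIL) build.
extra_configure_flags() {
    case "$1" in
        *t) echo "--disable-gil" ;;
        *)  echo "" ;;
    esac
}

# Strip the trailing "t" (if any) to recover the upstream version number.
base_version() {
    echo "${1%t}"
}

for ver in 3.12.0 3.13.0 3.13.0t; do
    echo "version=$ver base=$(base_version "$ver") flags=$(extra_configure_flags "$ver")"
done
```

Keeping the suffix handling in one place would avoid repeating the `3.13.0t` string comparison in both `do_cpython_build` and `build_cpython`, though the diff as written works without it.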

									
										25

.ci/docker/common/install_cuda.sh
									
												
				@ -27,6 +27,17 @@ function install_cusparselt_052 {

				    rm -rf tmp_cusparselt

				}

				function install_cusparselt_062 {

				    # cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				    mkdir tmp_cusparselt && pushd tmp_cusparselt

				    wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.6.2.3-archive.tar.xz

				    tar xf libcusparse_lt-linux-x86_64-0.6.2.3-archive.tar.xz

				    cp -a libcusparse_lt-linux-x86_64-0.6.2.3-archive/include/* /usr/local/cuda/include/

				    cp -a libcusparse_lt-linux-x86_64-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/

				    popd

				    rm -rf tmp_cusparselt

				}

				function install_118 {

				    echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.4.0"

				    rm -rf /usr/local/cuda-11.8 /usr/local/cuda

				@ -94,13 +105,13 @@ function install_121 {

				}

				function install_124 {

				  echo "Installing CUDA 12.4 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"

				  echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"

				  rm -rf /usr/local/cuda-12.4 /usr/local/cuda

				  # install CUDA 12.4.0 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run

				  chmod +x cuda_12.4.0_550.54.14_linux.run

				  ./cuda_12.4.0_550.54.14_linux.run --toolkit --silent

				  rm -f cuda_12.4.0_550.54.14_linux.run

				  # install CUDA 12.4.1 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run

				  chmod +x cuda_12.4.1_550.54.15_linux.run

				  ./cuda_12.4.1_550.54.15_linux.run --toolkit --silent

				  rm -f cuda_12.4.1_550.54.15_linux.run

				  rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda

				  # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				@ -121,7 +132,7 @@ function install_124 {

				  cd ..

				  rm -rf nccl

				  install_cusparselt_052

				  install_cusparselt_062

				  ldconfig

				}

									
										12

.ci/docker/common/install_cuda_aarch64.sh
									
												
				@ -17,13 +17,13 @@ function install_cusparselt_052 {

				}

				function install_124 {

				  echo "Installing CUDA 12.4 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"

				  echo "Installing CUDA 12.4.1 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"

				  rm -rf /usr/local/cuda-12.4 /usr/local/cuda

				  # install CUDA 12.4.0 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux_sbsa.run

				  chmod +x cuda_12.4.0_550.54.14_linux_sbsa.run

				  ./cuda_12.4.0_550.54.14_linux_sbsa.run --toolkit --silent

				  rm -f cuda_12.4.0_550.54.14_linux_sbsa.run

				  # install CUDA 12.4.1 in the same container

				  wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run

				  chmod +x cuda_12.4.1_550.54.15_linux_sbsa.run

				  ./cuda_12.4.1_550.54.15_linux_sbsa.run --toolkit --silent

				  rm -f cuda_12.4.1_550.54.15_linux_sbsa.run

				  rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda

				  # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

									
										25

.ci/docker/common/install_cudss.sh
									
										Normal file
									
												
				@ -0,0 +1,25 @@

				#!/bin/bash

				set -ex

				# cudss license: https://docs.nvidia.com/cuda/cudss/license.html

				mkdir tmp_cudss && cd tmp_cudss

				if [[ ${CUDA_VERSION:0:4} =~ ^12\.[1-4]$ ]]; then

				    arch_path='sbsa'

				    export TARGETARCH=${TARGETARCH:-$(uname -m)}

				    if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then

				        arch_path='x86_64'

				    fi

				    CUDSS_NAME="libcudss-linux-${arch_path}-0.3.0.9_cuda12-archive"

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudss/redist/libcudss/linux-${arch_path}/${CUDSS_NAME}.tar.xz

				    # only for cuda 12

				    tar xf ${CUDSS_NAME}.tar.xz

				    cp -a ${CUDSS_NAME}/include/* /usr/local/cuda/include/

				    cp -a ${CUDSS_NAME}/lib/* /usr/local/cuda/lib64/

				fi

				cd ..

				rm -rf tmp_cudss

				ldconfig
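Reviewer note: the new `install_cudss.sh` picks an archive name from two inputs, the CUDA version prefix and the target architecture. The sketch below isolates that selection as a pure function so it can be checked without downloading anything. The function name `cudss_archive_name` is hypothetical. It uses a POSIX `case` pattern instead of the script's `${CUDA_VERSION:0:4} =~ ^12\.[1-4]$` test, so it differs on one edge: the script's four-character prefix would treat a string like `12.10` as `12.1`, while this sketch rejects it.

```shell
#!/bin/sh
# Hypothetical helper, not part of install_cudss.sh.
# Prints the cuDSS archive name for a CUDA version and architecture,
# or prints nothing when the CUDA version is outside 12.1 to 12.4.
cudss_archive_name() {
    cuda_version="$1"
    target_arch="$2"

    case "$cuda_version" in
        12.[1-4]|12.[1-4].*) ;;   # supported: 12.1 through 12.4, any patch level
        *) return 0 ;;            # unsupported: print nothing
    esac

    # Default to the Arm server (sbsa) archive, as the script does.
    arch_path='sbsa'
    if [ "$target_arch" = 'amd64' ] || [ "$target_arch" = 'x86_64' ]; then
        arch_path='x86_64'
    fi

    echo "libcudss-linux-${arch_path}-0.3.0.9_cuda12-archive"
}

cudss_archive_name 12.4.1 x86_64
cudss_archive_name 12.1 aarch64
```

Note that the script itself silently does nothing for unsupported CUDA versions; the sketch mirrors that by printing an empty result rather than failing.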

									
										10

.ci/docker/common/install_cusparselt.sh
									
												
				@ -5,7 +5,15 @@ set -ex

				# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				mkdir tmp_cusparselt && cd tmp_cusparselt

				if [[ ${CUDA_VERSION:0:4} =~ ^12\.[1-4]$ ]]; then

				if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-6]$ ]]; then

				    arch_path='sbsa'

				    export TARGETARCH=${TARGETARCH:-$(uname -m)}

				    if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then

				        arch_path='x86_64'

				    fi

				    CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.2.3-archive"

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz

				elif [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then

				    arch_path='sbsa'

				    export TARGETARCH=${TARGETARCH:-$(uname -m)}

				    if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then

									
										51

.ci/docker/common/install_miopen.sh
									
												
				@ -10,6 +10,21 @@ if [[ -z $ROCM_VERSION ]]; then

				    exit 1;

				fi

				IS_UBUNTU=0

				ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')

				case "$ID" in

				  ubuntu)

				    IS_UBUNTU=1

				    ;;

				  centos)

				    IS_UBUNTU=0

				    ;;

				  *)

				    echo "Unable to determine OS..."

				    exit 1

				    ;;

				esac

				# To make version comparison easier, create an integer representation.

				save_IFS="$IFS"

				IFS=. ROCM_VERSION_ARRAY=(${ROCM_VERSION})

				@ -57,9 +72,11 @@ MIOPEN_CMAKE_COMMON_FLAGS="

				-DMIOPEN_BUILD_DRIVER=OFF

				"

				# Pull MIOpen repo and set DMIOPEN_EMBED_DB based on ROCm version

				if [[ $ROCM_INT -ge 60200 ]] && [[ $ROCM_INT -lt 60300 ]]; then

				    echo "ROCm 6.2 MIOpen does not need any patches, do not build from source"

				if [[ $ROCM_INT -ge 60300 ]]; then

				    echo "ROCm 6.3+ MIOpen does not need any patches, do not build from source"

				    exit 0

				elif [[ $ROCM_INT -ge 60200 ]] && [[ $ROCM_INT -lt 60300 ]]; then

				    MIOPEN_BRANCH="release/rocm-rel-6.2-staging"

				elif [[ $ROCM_INT -ge 60100 ]] && [[ $ROCM_INT -lt 60200 ]]; then

				    echo "ROCm 6.1 MIOpen does not need any patches, do not build from source"

				    exit 0

				@ -93,12 +110,21 @@ else

				    exit 1

				fi

				yum remove -y miopen-hip

				if [[ ${IS_UBUNTU} == 1 ]]; then

				  apt-get remove -y miopen-hip

				else

				  yum remove -y miopen-hip

				fi

				git clone https://github.com/ROCm/MIOpen -b ${MIOPEN_BRANCH}

				pushd MIOpen

				# remove .git to save disk space since CI runner was running out

				rm -rf .git

				# Don't build CK to save docker build time

				if [[ $ROCM_INT -ge 60200 ]]; then

				    sed -i '/composable_kernel/d' requirements.txt

				fi

				# Don't build MLIR to save docker build time

				# since we are disabling MLIR backend for MIOpen anyway

				if [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then

				@ -111,10 +137,15 @@ cmake -P install_deps.cmake --minimum

				# clean up since CI runner was running out of disk space

				rm -rf /tmp/*

				yum clean all

				rm -rf /var/cache/yum

				rm -rf /var/lib/yum/yumdb

				rm -rf /var/lib/yum/history

				if [[ ${IS_UBUNTU} == 1 ]]; then

				  apt-get autoclean && apt-get clean

				  rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

				else

				  yum clean all

				  rm -rf /var/cache/yum

				  rm -rf /var/lib/yum/yumdb

				  rm -rf /var/lib/yum/history

				fi

				## Build MIOpen

				mkdir -p build

				@ -131,7 +162,11 @@ make -j $(nproc) package

				# clean up since CI runner was running out of disk space

				rm -rf /usr/local/cget

				yum install -y miopen-*.rpm

				if [[ ${IS_UBUNTU} == 1 ]]; then

				  sudo dpkg -i miopen-hip*.deb

				else

				  yum install -y miopen-*.rpm

				fi

				popd

				rm -rf MIOpen
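Reviewer note: the `install_miopen.sh` hunks compare `ROCM_INT` against thresholds such as `60200` and `60300`, but the lines that compute it are outside the diff context. The sketch below assumes the encoding is `major * 10000 + minor * 100 + patch`, which is consistent with those thresholds but is not confirmed by the visible lines. Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; the integer encoding is an assumption inferred
# from the 60100 / 60200 / 60300 thresholds visible in the diff.

rocm_version_to_int() {
    old_ifs="$IFS"
    IFS=.
    # Split "6.2.1" into positional parameters 6, 2, 1.
    set -- $1
    IFS="$old_ifs"
    major="${1:-0}"
    minor="${2:-0}"
    patch="${3:-0}"
    echo $((major * 10000 + minor * 100 + patch))
}

# Mirror the post-change branching: 6.3+ and 6.1 skip the source build,
# 6.2 builds from the staging branch named in the diff.
miopen_action() {
    v=$(rocm_version_to_int "$1")
    if [ "$v" -ge 60300 ]; then
        echo "skip"
    elif [ "$v" -ge 60200 ]; then
        echo "release/rocm-rel-6.2-staging"
    elif [ "$v" -ge 60100 ]; then
        echo "skip"
    else
        echo "other"   # older versions are handled by branches not shown in the diff
    fi
}

for ver in 6.1 6.2.0 6.3.1; do
    echo "$ver -> $(rocm_version_to_int "$ver") -> $(miopen_action "$ver")"
done
```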

									
										9

.ci/docker/common/install_onnx.sh
									
												
				@ -15,7 +15,7 @@ pip_install \

				  flatbuffers==2.0 \

				  mock==5.0.1 \

				  ninja==1.10.2 \

				  networkx==2.0 \

				  networkx==2.5 \

				  numpy==1.24.2

				# ONNXRuntime should be installed before installing

				@ -30,10 +30,9 @@ pip_install \

				pip_install coloredlogs packaging

				pip_install onnxruntime==1.18

				pip_install onnx==1.16.0

				# pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@3e869ef8ccf19b5ebd21c10d3e9c267c9a9fa729" --no-deps

				pip_install onnxscript==0.1.0.dev20240613 --no-deps

				pip_install onnxruntime==1.18.1

				pip_install onnx==1.16.2

				pip_install onnxscript==0.1.0.dev20240831 --no-deps

				# required by onnxscript

				pip_install ml_dtypes

									
										25

.ci/docker/common/install_triton.sh
									
												
				@ -12,10 +12,7 @@ conda_reinstall() {

				  as_jenkins conda install -q -n py_$ANACONDA_PYTHON_VERSION -y --force-reinstall $*

				}

				if [ -n "${ROCM_VERSION}" ]; then

				  TRITON_REPO="https://github.com/openai/triton"

				  TRITON_TEXT_FILE="triton-rocm"

				elif [ -n "${XPU_VERSION}" ]; then

				if [ -n "${XPU_VERSION}" ]; then

				  TRITON_REPO="https://github.com/intel/intel-xpu-backend-for-triton"

				  TRITON_TEXT_FILE="triton-xpu"

				else

				@ -41,19 +38,33 @@ if [ -z "${MAX_JOBS}" ]; then

				    export MAX_JOBS=$(nproc)

				fi

				# Git checkout triton

				mkdir /var/lib/jenkins/triton

				chown -R jenkins /var/lib/jenkins/triton

				chgrp -R jenkins /var/lib/jenkins/triton

				pushd /var/lib/jenkins/

				as_jenkins git clone ${TRITON_REPO} triton

				cd triton

				as_jenkins git checkout ${TRITON_PINNED_COMMIT}

				cd python

				# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527

				as_jenkins sed -i -e 's/https:\/\/tritonlang.blob.core.windows.net\/llvm-builds/https:\/\/oaitriton.blob.core.windows.net\/public\/llvm-builds/g' setup.py

				if [ -n "${UBUNTU_VERSION}" ] && [ -n "${GCC_VERSION}" ] && [[ "${GCC_VERSION}" == "7" ]]; then

				  # Triton needs at least gcc-9 to build

				  apt-get install -y g++-9

				  CXX=g++-9 pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"

				  CXX=g++-9 pip_install -e .

				elif [ -n "${UBUNTU_VERSION}" ] && [ -n "${CLANG_VERSION}" ]; then

				  # Triton needs <filesystem> which surprisingly is not available with clang-9 toolchain

				  add-apt-repository -y ppa:ubuntu-toolchain-r/test

				  apt-get install -y g++-9

				  CXX=g++-9 pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"

				  CXX=g++-9 pip_install -e .

				else

				  pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"

				  pip_install -e .

				fi

				if [ -n "${CONDA_CMAKE}" ]; then

									
										20

.ci/docker/common/install_xpu.sh
									
												
				@ -16,11 +16,11 @@ function install_ubuntu() {

				    apt-get update -y

				    apt-get install -y gpg-agent wget

				    # To add the online network package repository for the GPU Driver LTS releases

				    # To add the online network package repository for the GPU Driver

				    wget -qO - https://repositories.intel.com/gpu/intel-graphics.key \

				        | gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg

				    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] \

				        https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}/lts/2350 unified" \

				        https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}${XPU_DRIVER_VERSION} unified" \

				        | tee /etc/apt/sources.list.d/intel-gpu-${VERSION_CODENAME}.list

				    # To add the online network network package repository for the Intel Support Packages

				    wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \

				@ -68,9 +68,9 @@ function install_rhel() {

				    fi

				    dnf install -y 'dnf-command(config-manager)'

				    # To add the online network package repository for the GPU Driver LTS releases

				    # To add the online network package repository for the GPU Driver

				    dnf config-manager --add-repo \

				        https://repositories.intel.com/gpu/rhel/${VERSION_ID}/lts/2350/unified/intel-gpu-${VERSION_ID}.repo

				        https://repositories.intel.com/gpu/rhel/${VERSION_ID}${XPU_DRIVER_VERSION}/unified/intel-gpu-${VERSION_ID}.repo

				    # To add the online network network package repository for the Intel Support Packages

				    tee > /etc/yum.repos.d/intel-for-pytorch-gpu-dev.repo << EOF

				[intel-for-pytorch-gpu-dev]

				@ -85,7 +85,7 @@ EOF

				    # The xpu-smi packages

				    dnf install -y xpu-smi

				    # Compute and Media Runtimes

				    dnf install -y \

				    dnf install --skip-broken -y \

				        intel-opencl intel-media intel-mediasdk libmfxgen1 libvpl2\

				        level-zero intel-level-zero-gpu mesa-dri-drivers mesa-vulkan-drivers \

				        mesa-vdpau-drivers libdrm mesa-libEGL mesa-libgbm mesa-libGL \

				@ -114,9 +114,9 @@ function install_sles() {

				        exit

				    fi

				    # To add the online network package repository for the GPU Driver LTS releases

				    # To add the online network package repository for the GPU Driver

				    zypper addrepo -f -r \

				        https://repositories.intel.com/gpu/sles/${VERSION_SP}/lts/2350/unified/intel-gpu-${VERSION_SP}.repo

				        https://repositories.intel.com/gpu/sles/${VERSION_SP}${XPU_DRIVER_VERSION}/unified/intel-gpu-${VERSION_SP}.repo

				    rpm --import https://repositories.intel.com/gpu/intel-graphics.key

				    # To add the online network network package repository for the Intel Support Packages

				    zypper addrepo https://yum.repos.intel.com/intel-for-pytorch-gpu-dev intel-for-pytorch-gpu-dev

				@ -135,6 +135,12 @@ function install_sles() {

				}

				# Default use GPU driver LTS releases

				XPU_DRIVER_VERSION="/lts/2350"

				if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then

				    # Use GPU driver rolling releases

				    XPU_DRIVER_VERSION=""

				fi

				# The installation depends on the base OS

				ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
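Reviewer note: the `install_xpu.sh` change replaces the hard-coded `/lts/2350` path segment with `${XPU_DRIVER_VERSION}`, which is emptied when `XPU_DRIVER_TYPE` is `rolling` in any letter case. The sketch below shows how the two settings change the RHEL repository URL. The function names are hypothetical, the `8.8` version ID is only an illustrative input, and `tr` stands in for the script's Bash-only `${XPU_DRIVER_TYPE,,}` lowercasing so the sketch also runs under plain `sh`.

```shell
#!/bin/sh
# Hypothetical helpers, not part of install_xpu.sh.

# Print the repository path suffix for a driver type.
# Anything other than "rolling" (case-insensitive) falls back to the LTS path.
xpu_repo_suffix() {
    driver_type=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    if [ "$driver_type" = "rolling" ]; then
        echo ""
    else
        echo "/lts/2350"
    fi
}

# Compose the RHEL repo URL the same way the diff does.
xpu_rhel_repo_url() {
    version_id="$1"
    suffix=$(xpu_repo_suffix "$2")
    echo "https://repositories.intel.com/gpu/rhel/${version_id}${suffix}/unified/intel-gpu-${version_id}.repo"
}

xpu_rhel_repo_url 8.8 ""
xpu_rhel_repo_url 8.8 ROLLING
```

This is why `Dockerfile_2_28` can opt in with `ENV XPU_DRIVER_TYPE ROLLING` in upper case while images that leave the variable unset keep the LTS driver.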

									
										6

.ci/docker/conda/build.sh
									
												
				@ -37,6 +37,12 @@ esac

				(

				  set -x

				  # TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712

				  # is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.

				  sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service

				  sudo systemctl daemon-reload

				  sudo systemctl restart docker

				  docker build \

				    --target final \

				    --progress plain \

									
										1

.ci/docker/manywheel/Dockerfile
									
												
				@ -10,6 +10,7 @@ ENV LANG en_US.UTF-8

				ENV LANGUAGE en_US.UTF-8

				ARG DEVTOOLSET_VERSION=9

				# Note: This is required patch since CentOS have reached EOL

				# otherwise any yum install setp will fail

				RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo

4

.ci/docker/manywheel/Dockerfile_2_28


 @ -145,9 +145,13 @@ ADD ./common/install_miopen.sh install_miopen.sh
 RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
 FROM cpu_final as xpu_final
 # XPU CD use rolling driver
 ENV XPU_DRIVER_TYPE ROLLING
 # cmake-3.28.4 from pip
 RUN python3 -m pip install --upgrade pip && \
     python3 -mpip install cmake==3.28.4
 # Install setuptools and wheel for python 3.13
 RUN /opt/python/cp313-cp313/bin/python -m pip install setuptools wheel
 ADD ./common/install_xpu.sh install_xpu.sh
 RUN bash ./install_xpu.sh && rm install_xpu.sh
 RUN pushd /opt/_internal && tar -xJf static-libs-for-embedding-only.tar.xz && popd

									
										9

.ci/docker/manywheel/build.sh
									
												View File
												
				@ -124,7 +124,14 @@ if [[ -n ${MANY_LINUX_VERSION} && -z ${DOCKERFILE_SUFFIX} ]]; then

				fi

				(

				    set -x

				    DOCKER_BUILDKIT=1 docker build \

				    # TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712

				    # is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.

				    sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service

				    sudo systemctl daemon-reload

				    sudo systemctl restart docker

				    DOCKER_BUILDKIT=1 docker build  \

				        ${DOCKER_GPU_BUILD_ARG} \

				        --build-arg "GPU_IMAGE=${GPU_IMAGE}" \

				        --target "${TARGET}" \

32

.ci/docker/requirements-ci.txt


 @ -30,9 +30,14 @@ dill==0.3.7
 #Pinned versions: 0.3.7
 #test that import: dynamo/test_replay_record.py test_dataloader.py test_datapipe.py test_serialization.py
 expecttest==0.1.6
 expecttest==0.2.1
 #Description: method for writing tests where test framework auto populates
 # the expected output based on previous runs
 #Pinned versions: 0.2.1
 #test that import:
 fbscribelogger==0.1.6
 #Description: write to scribe from authenticated jobs on CI
 #Pinned versions: 0.1.6
 #test that import:
 @ -85,7 +90,7 @@ librosa>=0.6.2 ; python_version < "3.11"
 #Pinned versions:
 #test that import:
 mypy==1.10.0
 mypy==1.11.2
 # Pin MyPy version because new errors are likely to appear with each release
 #Description: linter
 #Pinned versions: 1.10.0
 @ -104,7 +109,7 @@ networkx==2.8.8
 #test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py
 numba==0.49.0 ; python_version < "3.9"
 numba==0.54.1 ; python_version == "3.9"
 numba==0.55.2 ; python_version == "3.9"
 numba==0.55.2 ; python_version == "3.10"
 #Description: Just-In-Time Compiler for Numerical Functions
 #Pinned versions: 0.54.1, 0.49.0, <=0.49.1
 @ -218,7 +223,7 @@ pygments==2.15.0
 #test that import:
 scikit-image==0.19.3 ; python_version < "3.10"
 scikit-image==0.20.0 ; python_version >= "3.10"
 scikit-image==0.22.0 ; python_version >= "3.10"
 #Description: image processing routines
 #Pinned versions:
 #test that import: test_nn.py
 @ -269,6 +274,10 @@ lintrunner==0.12.5
 #Pinned versions: 0.12.5
 #test that import:
 redis>=4.0.0
 #Description: redis database
 #test that import: anything that tests OSS caching/mocking (inductor/test_codecache.py, inductor/test_max_autotune.py)
 rockset==1.0.3
 #Description: queries Rockset
 #Pinned versions: 1.0.3
 @ -318,3 +327,18 @@ sympy==1.13.1 ; python_version >= "3.9"
 #Description: Required by coremltools, also pinned in .github/requirements/pip-requirements-macOS.txt
 #Pinned versions:
 #test that import:
 onnx==1.16.1
 #Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
 #Pinned versions:
 #test that import:
 onnxscript==0.1.0.dev20240817
 #Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
 #Pinned versions:
 #test that import:
 parameterized==0.8.1
 #Description: Parameterizes unittests, both the tests themselves and the entire testing class
 #Pinned versions:
 #test that import:

2

.ci/docker/triton_version.txt


 @ -1 +1 @@
 .0.0
 .1.0

									
										6

.ci/docker/ubuntu-cuda/Dockerfile
									
												
				@ -156,6 +156,12 @@ COPY ./common/install_cusparselt.sh install_cusparselt.sh

				RUN bash install_cusparselt.sh

				RUN rm install_cusparselt.sh

				# Install CUDSS

				ARG CUDA_VERSION

				COPY ./common/install_cudss.sh install_cudss.sh

				RUN bash install_cudss.sh

				RUN rm install_cudss.sh

				# Delete /usr/local/cuda-11.X/cuda-11.X symlinks

				RUN if [ -h /usr/local/cuda-11.6/cuda-11.6 ]; then rm /usr/local/cuda-11.6/cuda-11.6; fi

				RUN if [ -h /usr/local/cuda-11.7/cuda-11.7 ]; then rm /usr/local/cuda-11.7/cuda-11.7; fi

									
										9

.ci/docker/ubuntu-rocm/Dockerfile
									
												
				@ -68,6 +68,8 @@ RUN rm install_rocm.sh

				COPY ./common/install_rocm_magma.sh install_rocm_magma.sh

				RUN bash ./install_rocm_magma.sh

				RUN rm install_rocm_magma.sh

				ADD ./common/install_miopen.sh install_miopen.sh

				RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh

				ENV ROCM_PATH /opt/rocm

				ENV PATH /opt/rocm/bin:$PATH

				ENV PATH /opt/rocm/hcc/bin:$PATH

				@ -100,10 +102,10 @@ ARG TRITON

				# try to reach out to S3, which docker build runners don't have access

				COPY ./common/install_triton.sh install_triton.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/triton-rocm.txt triton-rocm.txt

				COPY ci_commit_pins/triton.txt triton.txt

				COPY triton_version.txt triton_version.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt

				RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt

				# Install AOTriton

				COPY ./aotriton_version.txt aotriton_version.txt

				@ -121,5 +123,8 @@ RUN bash ./install_cache.sh && rm install_cache.sh

				ARG BUILD_ENVIRONMENT

				ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}

				# Install LLVM dev version (Defined in the pytorch/builder github repository)

				COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm

				USER jenkins

				CMD ["bash"]

									
										1

.ci/docker/ubuntu-xpu/Dockerfile
									
												
				@ -30,6 +30,7 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh

				ARG ANACONDA_PYTHON_VERSION

				ARG CONDA_CMAKE

				ARG DOCS

				ARG BUILD_ENVIRONMENT

				ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION

				ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH

				ENV DOCS=$DOCS

									
										29

.ci/pytorch/build.sh
									
												
				@ -49,13 +49,8 @@ if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then

				fi

				# Enable LLVM dependency for TensorExpr testing

				if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then

				  export USE_LLVM=/opt/rocm/llvm

				  export LLVM_DIR=/opt/rocm/llvm/lib/cmake/llvm

				else

				  export USE_LLVM=/opt/llvm

				  export LLVM_DIR=/opt/llvm/lib/cmake/llvm

				fi

				export USE_LLVM=/opt/llvm

				export LLVM_DIR=/opt/llvm/lib/cmake/llvm

				if [[ "$BUILD_ENVIRONMENT" == *executorch* ]]; then

				  # To build test_edge_op_registration

				@ -237,7 +232,7 @@ fi

				# Do not change workspace permissions for ROCm CI jobs

				# as it can leave workspace with bad permissions for cancelled jobs

				if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then

				if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* ]]; then

				  # Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)

				  WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")

				  cleanup_workspace() {

				@ -283,11 +278,11 @@ else

				    # set only when building other architectures

				    # or building non-XLA tests.

				    if [[ "$BUILD_ENVIRONMENT" != *rocm*  &&

				          "$BUILD_ENVIRONMENT" != *s390x*   &&

				          "$BUILD_ENVIRONMENT" != *xla* ]]; then

				      if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then

				        # Install numpy-2.0 release candidate for builds

				        # Which should be backward compatible with Numpy-1.X

				        python -mpip install --pre numpy==2.0.0rc1

				        # Install numpy-2.0.2 for builds which are backward compatible with 1.X

				        python -mpip install --pre numpy==2.0.2

				      fi

				      WERROR=1 python setup.py clean

				@ -346,11 +341,11 @@ else

				    CUSTOM_OP_BUILD="${CUSTOM_TEST_ARTIFACT_BUILD_DIR}/custom-op-build"

				    CUSTOM_OP_TEST="$PWD/test/custom_operator"

				    python --version

				    SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"

				    SITE_PACKAGES="$(python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))')"

				    mkdir -p "$CUSTOM_OP_BUILD"

				    pushd "$CUSTOM_OP_BUILD"

				    cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				    cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				          -DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"

				    make VERBOSE=1

				    popd

				@ -360,10 +355,10 @@ else

				    JIT_HOOK_BUILD="${CUSTOM_TEST_ARTIFACT_BUILD_DIR}/jit-hook-build"

				    JIT_HOOK_TEST="$PWD/test/jit_hooks"

				    python --version

				    SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"

				    SITE_PACKAGES="$(python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))')"

				    mkdir -p "$JIT_HOOK_BUILD"

				    pushd "$JIT_HOOK_BUILD"

				    cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				    cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				          -DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"

				    make VERBOSE=1

				    popd

				@ -375,7 +370,7 @@ else

				    python --version

				    mkdir -p "$CUSTOM_BACKEND_BUILD"

				    pushd "$CUSTOM_BACKEND_BUILD"

				    cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				    cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				          -DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"

				    make VERBOSE=1

				    popd

				@ -408,6 +403,6 @@ fi

				# snadampal: skipping it till sccache support added for aarch64

				# https://github.com/pytorch/pytorch/issues/121559

				if [[ "$BUILD_ENVIRONMENT" != *aarch64* ]]; then

				if [[ "$BUILD_ENVIRONMENT" != *aarch64* &&  "$BUILD_ENVIRONMENT" != *s390x* ]]; then

				  print_sccache_stats

				fi
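For illustration, the `site`-based `SITE_PACKAGES` one-liner in the hunks above can be unpacked as the following Python sketch. It assumes the intent is a semicolon-separated list that CMake accepts as `CMAKE_PREFIX_PATH`, which would explain why the matching `cmake` lines pass `$SITE_PACKAGES` alone. The helper name `cmake_prefix_path` is hypothetical and does not appear in the script.

```python
import site


def cmake_prefix_path(site_dirs):
    """Join every site-packages directory, followed by each directory's
    torch/ subdirectory, into one ';'-separated string."""
    site_dirs = list(site_dirs)
    return ";".join(site_dirs + [d + "/torch" for d in site_dirs])


# Same expression as the `python -c` one-liner, applied to this interpreter.
print(cmake_prefix_path(site.getsitepackages()))
```

Unlike `distutils.sysconfig.get_python_lib()`, which returns a single path, `site.getsitepackages()` returns a list, so the result can contain more than one prefix. The `distutils` module was also removed in Python 3.12.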

									
										10

.ci/pytorch/create_test_cert.py
									
												
				@ -1,4 +1,4 @@

				from datetime import datetime, timedelta

				from datetime import datetime, timedelta, timezone

				from tempfile import mkdtemp

				from cryptography import x509

				@ -42,10 +42,10 @@ def create_cert(path, C, ST, L, O, key):

				        .issuer_name(issuer)

				        .public_key(key.public_key())

				        .serial_number(x509.random_serial_number())

				        .not_valid_before(datetime.utcnow())

				        .not_valid_before(datetime.now(timezone.utc))

				        .not_valid_after(

				            # Our certificate will be valid for 10 days

				            datetime.utcnow()

				            datetime.now(timezone.utc)

				            + timedelta(days=10)

				        )

				        .add_extension(

				@ -88,10 +88,10 @@ def sign_certificate_request(path, csr_cert, ca_cert, private_ca_key):

				        .issuer_name(ca_cert.subject)

				        .public_key(csr_cert.public_key())

				        .serial_number(x509.random_serial_number())

				        .not_valid_before(datetime.utcnow())

				        .not_valid_before(datetime.now(timezone.utc))

				        .not_valid_after(

				            # Our certificate will be valid for 10 days

				            datetime.utcnow()

				            datetime.now(timezone.utc)

				            + timedelta(days=10)

				            # Sign our certificate with our private key

				        )
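As a minimal sketch of what the two hunks above change: `datetime.utcnow()` returns a naive datetime (no `tzinfo`) and is deprecated since Python 3.12, while `datetime.now(timezone.utc)` returns a timezone-aware one. The helper `validity_window` is hypothetical. It takes a single timestamp for both bounds, whereas the script calls `now()` twice.

```python
from datetime import datetime, timedelta, timezone


def validity_window(days=10):
    """Return (not_before, not_after) as timezone-aware UTC datetimes."""
    not_before = datetime.now(timezone.utc)
    return not_before, not_before + timedelta(days=days)


start, end = validity_window()
print(start.tzinfo, end - start)
```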

									
										19

.ci/pytorch/macos-test.sh
									
												
				@ -9,15 +9,13 @@ if [[ -n "$CONDA_ENV" ]]; then

				  export PATH="$CONDA_ENV/bin":$PATH

				fi

				# Test that OpenMP is enabled for non-arm64 build

				if [[ ${BUILD_ENVIRONMENT} != *arm64* ]]; then

				  pushd test

				  if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then

				    echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False"

				    exit 1

				  fi

				  popd

				# Test that OpenMP is enabled

				pushd test

				if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then

				  echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False"

				  exit 1

				fi

				popd

				setup_test_python() {

				  # The CircleCI worker hostname doesn't resolve to an address.

				@ -27,8 +25,9 @@ setup_test_python() {

				  echo "Ninja version: $(ninja --version)"

				  echo "Python version: $(which python) ($(python --version))"

				  # Increase default limit on open file handles from 256 to 1024

				  ulimit -n 1024

				  # Set the limit on open file handles to 16384

				  # might help with intermittent compiler test failures

				  ulimit -n 16384

				}

				test_python_all() {

									
										55

.ci/pytorch/test.sh
									
												
				@ -375,9 +375,8 @@ test_inductor_cpp_wrapper_abi_compatible() {

				  mkdir -p "$TEST_REPORTS_DIR"

				  echo "Testing Inductor cpp wrapper mode with TORCHINDUCTOR_ABI_COMPATIBLE=1"

				  # cpu stack allocation causes segfault and needs more investigation

				  PYTORCH_TESTING_DEVICE_ONLY_FOR="" python test/run_test.py --include inductor/test_cpu_cpp_wrapper

				  python test/run_test.py --include inductor/test_cuda_cpp_wrapper

				  python test/run_test.py --include inductor/test_cuda_cpp_wrapper inductor/test_cpu_repro

				  TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \

				    --training --inductor --disable-cudagraphs --only vit_base_patch16_224 \

				@ -397,11 +396,13 @@ DYNAMO_BENCHMARK_FLAGS=()

				pr_time_benchmarks() {

				  pip_install --user "fbscribelogger"

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  mkdir -p "$TEST_REPORTS_DIR"

				  PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks source benchmarks/dynamo/pr_time_benchmarks/benchmark_runner.sh "$TEST_REPORTS_DIR/pr_time_benchmarks_after.txt" "benchmarks/dynamo/pr_time_benchmarks/benchmarks"

				  PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks source benchmarks/dynamo/pr_time_benchmarks/benchmark_runner.sh "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv" "benchmarks/dynamo/pr_time_benchmarks/benchmarks"

				  echo "benchmark results on current PR: "

				  cat  "$TEST_REPORTS_DIR/pr_time_benchmarks_after.txt"

				  cat  "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv"

				}

				@ -504,6 +505,12 @@ test_perf_for_dashboard() {

				            --output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_${device}_${target}.csv"

				      fi

				      if [[ "$DASHBOARD_TAG" == *aotinductor-true* ]] && [[ "$mode" == "inference" ]]; then

				        if [[ "$target" == "accuracy" ]]; then

				          # Also collect Export pass rate and display as a separate row

				          $TASKSET python "benchmarks/dynamo/$suite.py" \

				              "${target_flag[@]}" --"$mode" --"$dtype" --export --disable-cudagraphs "$@" \

				              --output "$TEST_REPORTS_DIR/${backend}_export_${suite}_${dtype}_${mode}_${device}_${target}.csv"

				        fi

				        TORCHINDUCTOR_ABI_COMPATIBLE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \

				            "${target_flag[@]}" --"$mode" --"$dtype" --export-aot-inductor --disable-cudagraphs "$@" \

				            --output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_${device}_${target}.csv"

				@ -567,10 +574,10 @@ test_single_dynamo_benchmark() {

				    fi

				    if [[ "${TEST_CONFIG}" == *_avx2* ]]; then

				      TEST_CONFIG=${TEST_CONFIG::-5}

				      TEST_CONFIG=${TEST_CONFIG//_avx2/}

				    fi

				    if [[ "${TEST_CONFIG}" == *_avx512* ]]; then

				      TEST_CONFIG=${TEST_CONFIG::-7}

				      TEST_CONFIG=${TEST_CONFIG//_avx512/}

				    fi

				    python "benchmarks/dynamo/$suite.py" \

				      --ci --accuracy --timing --explain \

				@ -588,6 +595,9 @@ test_single_dynamo_benchmark() {

				test_inductor_micro_benchmark() {

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  if [[ "${TEST_CONFIG}" == *cpu* ]]; then

				    test_inductor_set_cpu_affinity

				  fi

				  python benchmarks/gpt_fast/benchmark.py --output "${TEST_REPORTS_DIR}/gpt_fast_benchmark.csv"

				}

				@ -657,8 +667,7 @@ test_inductor_torchbench_smoketest_perf() {

				  # https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314,

				  # and thus we lower its threshold to reduce flakiness. If this continues to be a problem,

				  # we switch to use some other model.

				  # lowering threshold from 4.9 to 4.7 for cu124. Will bump it up after cuda 12.4.0->12.4.1 update

				  python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.7

				  python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.9

				  # Check memory compression ratio for a few models

				  for test in hf_Albert timm_vision_transformer; do

				@ -682,7 +691,7 @@ test_inductor_torchbench_smoketest_perf() {

				}

				test_inductor_get_core_number() {

				  if [[ "${TEST_CONFIG}" == *aarch64 ]]; then

				  if [[ "${TEST_CONFIG}" == *aarch64* ]]; then

				    echo "$(($(lscpu | grep 'Cluster(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per cluster:' | awk '{print $4}')))"

				  else

				    echo "$(($(lscpu | grep 'Socket(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')))"

				@ -692,11 +701,16 @@ test_inductor_get_core_number() {

				test_inductor_set_cpu_affinity(){

				  #set jemalloc

				  JEMALLOC_LIB="$(find /usr/lib -name libjemalloc.so.2)"

				  IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"

				  export LD_PRELOAD="$JEMALLOC_LIB":"$IOMP_LIB":"$LD_PRELOAD"

				  export LD_PRELOAD="$JEMALLOC_LIB":"$LD_PRELOAD"

				  export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"

				  export KMP_AFFINITY=granularity=fine,compact,1,0

				  export KMP_BLOCKTIME=1

				  if [[ "${TEST_CONFIG}" != *aarch64* ]]; then

				    # Use Intel OpenMP for x86

				    IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"

				    export LD_PRELOAD="$IOMP_LIB":"$LD_PRELOAD"

				    export KMP_AFFINITY=granularity=fine,compact,1,0

				    export KMP_BLOCKTIME=1

				  fi

				  cores=$(test_inductor_get_core_number)

				  export OMP_NUM_THREADS=$cores

				  end_core=$((cores-1))

				@ -1368,14 +1382,16 @@ test_executorch() {

				  assert_git_not_dirty

				}

				test_linux_aarch64(){

				test_linux_aarch64() {

				  python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \

				       test_transformers test_multiprocessing test_numpy_interop --verbose

				        test_transformers test_multiprocessing test_numpy_interop \

				        --shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose

				  # Dynamo tests

				  python test/run_test.py --include dynamo/test_compile dynamo/test_backends dynamo/test_comptime dynamo/test_config \

				       dynamo/test_functions dynamo/test_fx_passes_pre_grad dynamo/test_interop dynamo/test_model_output dynamo/test_modules \

				       dynamo/test_optimizers dynamo/test_recompile_ux dynamo/test_recompiles --verbose

				       dynamo/test_optimizers dynamo/test_recompile_ux dynamo/test_recompiles \

				       --shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose

				  # Inductor tests

				  python test/run_test.py --include inductor/test_torchinductor inductor/test_benchmark_fusion inductor/test_codecache \

				@ -1385,7 +1401,8 @@ test_linux_aarch64(){

				       inductor/test_max_autotune inductor/test_memory_planning inductor/test_metrics inductor/test_multi_kernel inductor/test_pad_mm \

				       inductor/test_pattern_matcher inductor/test_perf inductor/test_profiler inductor/test_select_algorithm inductor/test_smoke \

				       inductor/test_split_cat_fx_passes inductor/test_standalone_compile inductor/test_torchinductor \

				       inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes --verbose

				       inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes inductor/test_memory \

				       --shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose

				}

				if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then

				@ -1467,9 +1484,7 @@ elif [[ "${TEST_CONFIG}" == *inductor* ]]; then

				  install_torchvision

				  test_inductor_shard "${SHARD_NUMBER}"

				  if [[ "${SHARD_NUMBER}" == 1 ]]; then

				    if [[ "${BUILD_ENVIRONMENT}" != linux-jammy-py3.8-gcc11-build ]]; then

				      # Temporarily skip test_inductor_aoti due to https://github.com/pytorch/pytorch/issues/130311

				      test_inductor_aoti

				    if [[ "${BUILD_ENVIRONMENT}" != linux-jammy-py3.9-gcc11-build ]]; then

				      test_inductor_distributed

				    fi

				  fi
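Two of the test.sh hunks above change string matching: `${TEST_CONFIG::-5}` becomes `${TEST_CONFIG//_avx2/}`, and the pattern `*aarch64` becomes `*aarch64*`. The sketch below mirrors both in Python purely for illustration. The config strings are hypothetical, and `fnmatchcase` only approximates Bash's `[[ ... == pattern ]]` matching for simple patterns like these.

```python
from fnmatch import fnmatchcase


def strip_fixed(config, n):
    # Mirrors ${TEST_CONFIG::-n}: drop the last n characters, whatever they are.
    return config[:-n]


def strip_marker(config, marker):
    # Mirrors ${TEST_CONFIG//marker/}: remove every occurrence of marker.
    return config.replace(marker, "")


suffix_last = "inductor_torchbench_cpu_avx2"  # hypothetical: marker at the end
suffix_mid = "inductor_avx2_torchbench_cpu"   # hypothetical: marker in the middle

print(strip_fixed(suffix_last, 5))
print(strip_fixed(suffix_mid, 5))       # cuts unrelated characters
print(strip_marker(suffix_mid, "_avx2"))

# A trailing-only glob misses configs with text after "aarch64".
print(fnmatchcase("cpu_aarch64_inductor", "*aarch64"))
print(fnmatchcase("cpu_aarch64_inductor", "*aarch64*"))
```

Fixed-length truncation is correct only when the marker is the final suffix. Substitution and the two-sided glob do not depend on position.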

									
										23

.ci/pytorch/win-test-helpers/build_pytorch.bat
									
												
				@ -24,6 +24,12 @@ call %INSTALLER_DIR%\install_sccache.bat

				if errorlevel 1 goto fail

				if not errorlevel 0 goto fail

				if "%USE_XPU%"=="1" (

				  :: Install xpu support packages

				  call %INSTALLER_DIR%\install_xpu.bat

				  if errorlevel 1 exit /b 1

				)

				:: Miniconda has been installed as part of the Windows AMI with all the dependencies.

				:: We just need to activate it here

				call %INSTALLER_DIR%\activate_miniconda3.bat

				@ -43,6 +49,16 @@ if "%VC_VERSION%" == "" (

				)

				if errorlevel 1 goto fail

				if not errorlevel 0 goto fail

				if "%USE_XPU%"=="1" (

				  :: Activate xpu environment - VS env is required for xpu

				  call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"

				  if errorlevel 1 exit /b 1

				  :: Reduce build time. Only have MTL self-hosted runner now

				  SET TORCH_XPU_ARCH_LIST=xe-lpg

				  SET USE_KINETO=0

				)

				@echo on

				popd

				@ -65,13 +81,6 @@ set CUDA_PATH_V%VERSION_SUFFIX%=%CUDA_PATH%

				set CUDNN_LIB_DIR=%CUDA_PATH%\lib\x64

				set CUDA_TOOLKIT_ROOT_DIR=%CUDA_PATH%

				set CUDNN_ROOT_DIR=%CUDA_PATH%

				set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt

				set PATH=%CUDA_PATH%\bin;%CUDA_PATH%\libnvvp;%PATH%

				set CUDNN_LIB_DIR=%CUDA_PATH%\lib\x64

				set CUDA_TOOLKIT_ROOT_DIR=%CUDA_PATH%

				set CUDNN_ROOT_DIR=%CUDA_PATH%

				set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt

				set PATH=%CUDA_PATH%\bin;%CUDA_PATH%\libnvvp;%PATH%

				:cuda_build_end

									
										91

.ci/pytorch/win-test-helpers/installation-helpers/install_xpu.bat
									
										Normal file
									
												
				@ -0,0 +1,91 @@

				@echo on

				REM Description: Install Intel Support Packages on Windows

				REM BKM reference: https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpu/2-5.html

				set XPU_INSTALL_MODE=%~1

				if "%XPU_INSTALL_MODE%"=="" goto xpu_bundle_install_start

				if "%XPU_INSTALL_MODE%"=="bundle" goto xpu_bundle_install_start

				if "%XPU_INSTALL_MODE%"=="driver" goto xpu_driver_install_start

				if "%XPU_INSTALL_MODE%"=="all" goto xpu_driver_install_start

				:arg_error

				echo Illegal XPU installation mode. The value can be "bundle"/"driver"/"all"

				echo If keep the value as space, will use default "bundle" mode

				exit /b 1

				:xpu_driver_install_start

				:: TODO Need more testing for driver installation

				set XPU_DRIVER_LINK=https://downloadmirror.intel.com/830975/gfx_win_101.5972.exe

				curl -o xpu_driver.exe --retry 3 --retry-all-errors -k %XPU_DRIVER_LINK%

				echo "XPU Driver installing..."

				start /wait "Intel XPU Driver Installer" "xpu_driver.exe"

				if errorlevel 1 exit /b 1

				del xpu_driver.exe

				if "%XPU_INSTALL_MODE%"=="driver" goto xpu_install_end

				:xpu_bundle_install_start

				set XPU_BUNDLE_PARENT_DIR=C:\Program Files (x86)\Intel\oneAPI

				set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d1a91e2-e8b8-40a5-8c7f-5db768a6a60c/w_intel-for-pytorch-gpu-dev_p_0.5.3.37_offline.exe

				set XPU_PTI_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d1a91e2-e8b8-40a5-8c7f-5db768a6a60c/w_intel-pti-dev_p_0.9.0.37_offline.exe

				set XPU_BUNDLE_VERSION=0.5.3+31

				set XPU_PTI_VERSION=0.9.0+36

				set XPU_BUNDLE_PRODUCT_NAME=intel.oneapi.win.intel-for-pytorch-gpu-dev.product

				set XPU_PTI_PRODUCT_NAME=intel.oneapi.win.intel-pti-dev.product

				set XPU_BUNDLE_INSTALLED=0

				set XPU_PTI_INSTALLED=0

				set XPU_BUNDLE_UNINSTALL=0

				set XPU_PTI_UNINSTALL=0

				:: Check if XPU bundle is target version or already installed

				if exist "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" goto xpu_bundle_ver_check

				goto xpu_bundle_install

				:xpu_bundle_ver_check

				"%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --list-products > xpu_bundle_installed_ver.log

				for /f "tokens=1,2" %%a in (xpu_bundle_installed_ver.log) do (

				    if "%%a"=="%XPU_BUNDLE_PRODUCT_NAME%" (

				        echo %%a Installed Version: %%b

				        set XPU_BUNDLE_INSTALLED=1

				        if not "%XPU_BUNDLE_VERSION%"=="%%b" (

				            start /wait "Installer Title" "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --action=remove --eula=accept --silent --product-id %XPU_BUNDLE_PRODUCT_NAME% --product-ver %%b --log-dir uninstall_bundle

				            set XPU_BUNDLE_UNINSTALL=1

				        )

				    )

				    if "%%a"=="%XPU_PTI_PRODUCT_NAME%" (

				        echo %%a Installed Version: %%b

				        set XPU_PTI_INSTALLED=1

				        if not "%XPU_PTI_VERSION%"=="%%b" (

				            start /wait "Installer Title" "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --action=remove --eula=accept --silent --product-id %XPU_PTI_PRODUCT_NAME% --product-ver %%b --log-dir uninstall_bundle

				            set XPU_PTI_UNINSTALL=1

				        )

				    )

				)

				if errorlevel 1 exit /b 1

				if exist xpu_bundle_installed_ver.log del xpu_bundle_installed_ver.log

				if "%XPU_BUNDLE_INSTALLED%"=="0" goto xpu_bundle_install

				if "%XPU_BUNDLE_UNINSTALL%"=="1" goto xpu_bundle_install

				if "%XPU_PTI_INSTALLED%"=="0" goto xpu_pti_install

				if "%XPU_PTI_UNINSTALL%"=="1" goto xpu_pti_install

				goto xpu_install_end

				:xpu_bundle_install

				curl -o xpu_bundle.exe --retry 3 --retry-all-errors -k %XPU_BUNDLE_URL%

				echo "XPU Bundle installing..."

				start /wait "Intel Pytorch Bundle Installer" "xpu_bundle.exe" --action=install --eula=accept --silent --log-dir install_bundle

				if errorlevel 1 exit /b 1

				del xpu_bundle.exe

				:xpu_pti_install

				curl -o xpu_pti.exe --retry 3 --retry-all-errors -k %XPU_PTI_URL%

				echo "XPU PTI installing..."

				start /wait "Intel PTI Installer" "xpu_pti.exe" --action=install --eula=accept --silent --log-dir install_bundle

				if errorlevel 1 exit /b 1

				del xpu_pti.exe

				:xpu_install_end

									
										1

.ci/pytorch/win-test-helpers/setup_pytorch_env.bat
									
												
				@ -40,7 +40,6 @@ set CUDA_PATH_V%VERSION_SUFFIX%=%CUDA_PATH%

				set CUDNN_LIB_DIR=%CUDA_PATH%\lib\x64

				set CUDA_TOOLKIT_ROOT_DIR=%CUDA_PATH%

				set CUDNN_ROOT_DIR=%CUDA_PATH%

				set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt

				set PATH=%CUDA_PATH%\bin;%CUDA_PATH%\libnvvp;%PATH%

				set NUMBAPRO_CUDALIB=%CUDA_PATH%\bin

				set NUMBAPRO_LIBDEVICE=%CUDA_PATH%\nvvm\libdevice

									
										2

.ci/pytorch/win-test-helpers/test_custom_backend.bat
									
												
				@ -31,6 +31,6 @@ if ERRORLEVEL 1 exit /b 1

				:: Run tests C++-side and load the exported script module.

				cd build

				set PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64;%TMP_DIR_WIN%\build\torch\lib;%PATH%

				set PATH=%TMP_DIR_WIN%\build\torch\lib;%PATH%

				test_custom_backend.exe model.pt

				if ERRORLEVEL 1 exit /b 1

									
										2

.ci/pytorch/win-test-helpers/test_custom_script_ops.bat
									
												
				@ -31,6 +31,6 @@ if ERRORLEVEL 1 exit /b 1

				:: Run tests C++-side and load the exported script module.

				cd build

				set PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64;%TMP_DIR_WIN%\build\torch\lib;%PATH%

				set PATH=%TMP_DIR_WIN%\build\torch\lib;%PATH%

				test_custom_ops.exe model.pt

				if ERRORLEVEL 1 exit /b 1

									
										2

.ci/pytorch/win-test-helpers/test_libtorch.bat
									
												
				@ -5,7 +5,7 @@ if errorlevel 1 exit /b 1

				set CWD=%cd%

				set CPP_TESTS_DIR=%TMP_DIR_WIN%\build\torch\bin

				set PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64;%TMP_DIR_WIN%\build\torch\lib;%PATH%

				set PATH=%TMP_DIR_WIN%\build\torch\lib;%PATH%

				set TORCH_CPP_TEST_MNIST_PATH=%CWD%\test\cpp\api\mnist

				python tools\download_mnist.py --quiet -d %TORCH_CPP_TEST_MNIST_PATH%

									
										6

.ci/pytorch/win-test.sh
									
												
				@ -40,6 +40,12 @@ python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==

				# Install Z3 optional dependency for Windows builds.

				python -m pip install z3-solver==4.12.2.0

				# Install tlparse for test\dynamo\test_structured_trace.py UTs.

				python -m pip install tlparse==0.3.25

				# Install parameterized

				python -m pip install parameterized==0.8.1

				run_tests() {

				    # Run nvidia-smi if available

				    for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe; do

									
										11

.circleci/scripts/binary_linux_test.sh
									
												
				@ -116,15 +116,14 @@ if [[ "$PACKAGE_TYPE" == libtorch ]]; then

				  cd /tmp/libtorch

				fi

				if [[ "$GPU_ARCH_TYPE" == xpu ]]; then

				  # Workaround for __mkl_tmp_MOD unbound variable issue, refer https://github.com/pytorch/pytorch/issues/130543

				  set +u

				  source /opt/intel/oneapi/pytorch-gpu-dev-0.5/oneapi-vars.sh

				fi

				# Test the package

				/builder/check_binary.sh

				if [[ "\$GPU_ARCH_TYPE" != *s390x* && "\$GPU_ARCH_TYPE" != *xpu* && "\$GPU_ARCH_TYPE" != *rocm*  && "$PACKAGE_TYPE" != libtorch ]]; then

				  # Exclude s390, xpu, rocm and libtorch builds from smoke testing

				  python /builder/test/smoke_test/smoke_test.py --package=torchonly --torch-compile-check disabled

				fi

				# Clean temp files

				cd /builder && git clean -ffdx

									
										6

.circleci/scripts/binary_populate_env.sh
									
												
				@ -90,7 +90,7 @@ fi

				if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*rocm.* && $(uname) == "Linux" ]]; then

				    TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"

				    if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then

				        TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-rocm.txt)

				        TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)

				        TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"

				    fi

				    if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then

				@ -102,10 +102,10 @@ fi

				# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton xpu package

				if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*xpu.* && $(uname) == "Linux" ]]; then

				    TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}"

				    TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"

				    if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then

				        TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-xpu.txt)

				        TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}+${TRITON_SHORTHASH}"

				        TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}+${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"

				    fi

				    if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then

				        export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"

									
										5

.circleci/scripts/binary_windows_build.sh
									
												
				@ -10,6 +10,11 @@ export SCCACHE_BUCKET=ossci-compiler-cache

				export SCCACHE_IGNORE_SERVER_IO_ERROR=1

				export VC_YEAR=2019

				if [[ "$DESIRED_CUDA" == 'xpu' ]]; then

				    export VC_YEAR=2022

				    export USE_SCCACHE=0

				fi

				echo "Free space on filesystem before build:"

				df -h

									
										4

.circleci/scripts/binary_windows_test.sh
									
												
				@ -6,6 +6,10 @@ source "${BINARY_ENV_FILE:-/c/w/env}"

				export CUDA_VERSION="${DESIRED_CUDA/cu/}"

				export VC_YEAR=2019

				if [[ "$DESIRED_CUDA" == 'xpu' ]]; then

				    export VC_YEAR=2022

				fi

				pushd "$BUILDER_ROOT"

				./windows/internal/smoke_test.bat

2

.flake8


 @ -57,7 +57,7 @@ per-file-ignores =
     torch/distributed/_tensor/_collective_utils.py: TOR901
     # This is a full package that happens to live within the test
     # folder, so ok to skip
     test/cpp_extensions/open_registration_extension/pytorch_openreg/__init__.py: TOR901
     test/cpp_extensions/open_registration_extension/pytorch_openreg/_aten_impl.py: TOR901
 optional-ascii-coding = True
 exclude =
     ./.git,

									
										30

.github/actionlint.yaml
									
										vendored
									
												
				@ -3,18 +3,20 @@ self-hosted-runner:

				    # GitHub hosted x86 Linux runners

				    - linux.20_04.4x

				    - linux.20_04.16x

				    # Repo-specific LF hosted ARC runners

				    - linux.large.arc

				    # Organization-wide AWS Linux Runners

				    - linux.large

				    - linux.2xlarge

				    - linux.4xlarge

				    - linux.9xlarge.ephemeral

				    - am2.linux.9xlarge.ephemeral

				    - linux.12xlarge

				    - linux.12xlarge.ephemeral

				    - linux.24xlarge

				    - linux.24xlarge.ephemeral

				    - linux.arm64.2xlarge

				    - linux.arm64.2xlarge.ephemeral

				    - linux.arm64.m7g.4xlarge

				    - linux.arm64.m7g.4xlarge.ephemeral

				    - linux.4xlarge.nvidia.gpu

				    - linux.8xlarge.nvidia.gpu

				    - linux.16xlarge.nvidia.gpu

				@ -30,34 +32,12 @@ self-hosted-runner:

				    - lf.linux.8xlarge.nvidia.gpu

				    - lf.linux.16xlarge.nvidia.gpu

				    - lf.linux.g5.4xlarge.nvidia.gpu

				    # Organization-wide AWS Linux Runners with new Amazon 2023 AMI

				    - amz2023.linux.large

				    - amz2023.linux.2xlarge

				    - amz2023.linux.4xlarge

				    - amz2023.linux.12xlarge

				    - amz2023.linux.24xlarge

				    - amz2023.linux.arm64.2xlarge

				    - amz2023.linux.arm64.m7g.4xlarge

				    - amz2023.linux.4xlarge.nvidia.gpu

				    - amz2023.linux.8xlarge.nvidia.gpu

				    - amz2023.linux.16xlarge.nvidia.gpu

				    - amz2023.linux.g5.4xlarge.nvidia.gpu

				    # Pytorch/pytorch AWS Linux Runners with the new Amazon 2023 AMI on Linux Foundation account

				    - amz2023.lf.linux.large

				    - amz2023.lf.linux.2xlarge

				    - amz2023.lf.linux.4xlarge

				    - amz2023.lf.linux.12xlarge

				    - amz2023.lf.linux.24xlarge

				    - amz2023.lf.linux.arm64.2xlarge

				    - amz2023.lf.linux.4xlarge.nvidia.gpu

				    - amz2023.lf.linux.8xlarge.nvidia.gpu

				    - amz2023.lf.linux.16xlarge.nvidia.gpu

				    - amz2023.lf.linux.g5.4xlarge.nvidia.gpu

				    # Repo-specific IBM hosted S390x runner

				    - linux.s390x

				    # Organization wide AWS Windows runners

				    - windows.g4dn.xlarge

				    - windows.g4dn.xlarge.nonephemeral

				    - windows.4xlarge

				    - windows.4xlarge.nonephemeral

				    - windows.8xlarge.nvidia.gpu

				    - windows.8xlarge.nvidia.gpu.nonephemeral

									
										2

.github/actions/filter-test-configs/action.yml
									
										vendored
									
												
				@ -57,7 +57,7 @@ outputs:

				runs:

				  using: composite

				  steps:

				    - uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482

				    - uses: nick-fields/retry@v3.0.0

				      name: Setup dependencies

				      env:

				        GITHUB_TOKEN: ${{ inputs.github-token }}

									
										2

.github/actions/pytest-cache-download/action.yml
									
										vendored
									
												
				@ -17,7 +17,7 @@ inputs:

				runs:

				  using: composite

				  steps:

				    - uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482

				    - uses: nick-fields/retry@v3.0.0

				      name: Setup dependencies

				      with:

				        shell: bash

									
										2

.github/actions/pytest-cache-upload/action.yml
									
										vendored
									
												
				@ -24,7 +24,7 @@ inputs:

				runs:

				  using: composite

				  steps:

				    - uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482

				    - uses: nick-fields/retry@v3.0.0

				      name: Setup dependencies

				      with:

				        shell: bash

									
										2

.github/actions/setup-linux/action.yml
									
										vendored
									
												
				@ -44,7 +44,7 @@ runs:

				        fi

				    - name: Log in to ECR

				      uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482

				      uses: nick-fields/retry@v3.0.0

				      env:

				        AWS_RETRY_MODE: standard

				        AWS_MAX_ATTEMPTS: "5"

									
										2

.github/actions/teardown-win/action.yml
									
										vendored
									
												
				@ -31,7 +31,7 @@ runs:

				    # retry this step several times, similar to how the checkout-pytorch GHA does

				    - name: Cleanup workspace

				      if: always()

				      uses: nick-fields/retry@v2.8.2

				      uses: nick-fields/retry@v3.0.0

				      env:

				        EXTRA_DELETE_DIR: ${{ inputs.extra-delete-dir }}

				      with:

2

.github/ci_commit_pins/audio.txt vendored


 @ -1 +1 @@
 b3f6f511f2a1082bd56b13a3f6794e7fc3ba4862
 ba696ea3dfec4cbe693bf06a84c75dc196077f5b

									
										39

.github/label_to_label.yml
									
										vendored
									
												
				@ -1,13 +1,50 @@

				# Use this to auto apply labels based on other labels.  Applies to both PRs and

				# issues. Currently only supports any and all

				- any:

				  - "module: custom operators"

				  - "module: opcheck"

				  then:

				  - "module: custom-operators"

				- any:

				  - "module: custom-operators"

				  - "module: functionalization"

				  - "module: aotdispatch"

				  - "module: higher order operators"

				  - "module: fakeTensor"

				  - "module: ProxyTensor"

				  - "module: library"

				  - "module: reinplacing"

				  then:

				  - "module: pt2-dispatcher"

				- any:

				  - "module: vmap"

				  then:

				  - "module: functorch"

				- any:

				  - "module: reinplacing"

				  then:

				  - "module: inductor"

				- any:

				  - "module: pt2 optimizer"

				  then:

				  - "module: dynamo"

				- any:

				  - "module: flex attention"

				  then:

				  - "module: higher order operators"

				- any:

				  - "module: aotinductor"

				  then:

				  - "oncall: export"

				- any:

				  - "module: dynamo"

				  - "module: pt2-dispatcher"

				  - "module: inductor"

				  - "module: aotinductor"

				  - "module: cudagraphs"

				  - "oncall: export"

				  - "module: startup-tracing-compile"

				  - "module: compiled autograd"

				  - "module: flex attention"

				  - "module: dynamic shapes"

				  then:

				  - "oncall: pt2"

									
										160

.github/lf-canary-scale-config.yml
									
										vendored
									
												
				@ -7,10 +7,14 @@

				#   runners. Runners listed here will be available as self hosted

				#   runners, configuration is directly pulled from the main branch.

				#

				# NOTE (Apr, 5, 2021): Linux runners are currently all an amazonlinux2

				#

				# NOTE (Jan 5, 2021): Linux runners are all non-ephemeral to reduce the amount of CreateInstances calls

				#                     to avoid RequestLimitExceeded issues

				# NOTES:

				#  - Linux runners are by default non-ephemeral to reduce the amount of CreateInstances calls

				#    to avoid RequestLimitExceeded issues

				#  - When updating this file, run the following command to validate the YAML and to generate

				#    corresponding versions of scale-config for the pytorch/pytorch repo and merge the

				#    pytorch/pytorch changes before merging these changes.

				#    `python .github/scripts/validate_scale_config.py --test-infra-repo-root [path_to_test-infra_root] --pytorch-repo-root [path_to_pytorch_root]``

				#

				# TODO: Add some documentation on how the auto-scaling works

				#

				@ -31,53 +35,36 @@ runner_types:

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.10xlarge.avx2:

				    disk_size: 200

				    instance_type: m4.10xlarge

				    is_ephemeral: false

				    max_available: 450

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.24xl.spr-metal:

				    disk_size: 200

				    instance_type: c7i.metal-24xl

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.16xlarge.spr:

				    disk_size: 200

				    instance_type: c7i.16xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.9xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.9xlarge

				    is_ephemeral: true

				    max_available: 50

				    os: linux

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.c.linux.12xlarge.ephemeral:

				@ -86,187 +73,140 @@ runner_types:

				    is_ephemeral: true

				    max_available: 300

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.16xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.16xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.24xlarge:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: false

				    max_available: 500

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.24xlarge.ephemeral:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: true

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.2xlarge:

				    disk_size: 150

				    instance_type: c5.2xlarge

				    is_ephemeral: false

				    max_available: 3120

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.4xlarge:

				    disk_size: 150

				    instance_type: c5.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.8xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.8xlarge

				    is_ephemeral: false

				    max_available: 400

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.g4dn.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.12xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.g4dn.metal.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.metal

				    is_ephemeral: false

				    max_available: 300

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.g5.48xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.48xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.g5.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.12xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.g5.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 2400

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.g6.4xlarge.experimental.nvidia.gpu:

				    disk_size: 150

				    instance_type: g6.4xlarge

				    is_ephemeral: false

				    max_available: 50

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.large:

				    max_available: 1200

				    disk_size: 15

				    instance_type: c5.large

				    is_ephemeral: false

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.c.linux.arm64.2xlarge:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				    ami: al2023-ami-2023.5.202*-kernel-6.1-arm64

				  lf.c.linux.arm64.m7g.4xlarge:

				    disk_size: 256

				    instance_type: m7g.4xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				    ami: al2023-ami-2023.5.202*-kernel-6.1-arm64

				  lf.c.linux.arm64.2xlarge.ephemeral:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: true

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.202*-kernel-6.1-arm64

				  lf.c.linux.arm64.m7g.4xlarge.ephemeral:

				    disk_size: 256

				    instance_type: m7g.4xlarge

				    is_ephemeral: true

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.202*-kernel-6.1-arm64

				  lf.c.linux.arm64.m7g.metal:

				    disk_size: 256

				    instance_type: m7g.metal

				    is_ephemeral: false

				    max_available: 100

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				    ami: al2023-ami-2023.5.202*-kernel-6.1-arm64

				  lf.c.windows.g4dn.xlarge:

				    disk_size: 256

				    instance_type: g4dn.xlarge
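Note: the hunks above show a per-runner `variants` map with fully pinned AMI names alongside a single `ami` entry that contains a wildcard (`202*`). The diff does not show how the autoscaler resolves that wildcard. The sketch below assumes plain glob matching against a list of available image names and picks the lexicographically greatest match, which for these date-stamped names is the newest one. `pick_ami` and the candidate list are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Optional

# Wildcard pattern taken from the x86_64 runner entries above.
AMI_PATTERN = "al2023-ami-2023.5.202*-kernel-6.1-x86_64"


def pick_ami(pattern: str, available: Iterable[str]) -> Optional[str]:
    """Return the greatest image name matching the glob pattern, or None."""
    matches = sorted(name for name in available if fnmatchcase(name, pattern))
    return matches[-1] if matches else None


# Hypothetical list of image names; the first and third appear in the diff.
candidates = [
    "al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64",
    "al2023-ami-2023.5.20240805.0-kernel-6.1-x86_64",
    "amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs",
    "al2023-ami-2023.5.20240701.0-kernel-6.1-arm64",
]
selected = pick_ami(AMI_PATTERN, candidates)
```

The arm64 runners use the same pattern with an `-arm64` suffix, so the same helper would apply with a different pattern string.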

									
										160

.github/lf-scale-config.yml
									
												
				@ -7,10 +7,14 @@

				#   runners. Runners listed here will be available as self hosted

				#   runners, configuration is directly pulled from the main branch.

				#

				# NOTE (Apr, 5, 2021): Linux runners are currently all an amazonlinux2

				#

				# NOTE (Jan 5, 2021): Linux runners are all non-ephemeral to reduce the amount of CreateInstaces calls

				#                     to avoid RequestLimitExceeded issues

				# NOTES:

				#  - Linux runners are by default non-ephemeral to reduce the amount of CreateInstaces calls

				#    to avoid RequestLimitExceeded issues

				#  - When updating this file, run the following command to validate the YAML and to generate

				#    corresponding versions of scale-config for the pytorch/pytorch repo and merge the

				#    pytorch/pytorch changes before merging these changes.

				#    `python .github/scripts/validate_scale_config.py --test-infra-repo-root [path_to_test-infra_root] --pytorch-repo-root [path_to_pytorch_root]``

				#

				# TODO: Add some documentation on how the auto-scaling works

				#

				@ -31,53 +35,36 @@ runner_types:

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.10xlarge.avx2:

				    disk_size: 200

				    instance_type: m4.10xlarge

				    is_ephemeral: false

				    max_available: 450

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.24xl.spr-metal:

				    disk_size: 200

				    instance_type: c7i.metal-24xl

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.16xlarge.spr:

				    disk_size: 200

				    instance_type: c7i.16xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.9xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.9xlarge

				    is_ephemeral: true

				    max_available: 50

				    os: linux

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				  lf.linux.12xlarge.ephemeral:

				@ -86,187 +73,140 @@ runner_types:

				    is_ephemeral: true

				    max_available: 300

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.16xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.16xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.24xlarge:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: false

				    max_available: 500

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.24xlarge.ephemeral:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: true

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.2xlarge:

				    disk_size: 150

				    instance_type: c5.2xlarge

				    is_ephemeral: false

				    max_available: 3120

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.4xlarge:

				    disk_size: 150

				    instance_type: c5.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.8xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.8xlarge

				    is_ephemeral: false

				    max_available: 400

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.g4dn.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.12xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.g4dn.metal.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.metal

				    is_ephemeral: false

				    max_available: 300

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.g5.48xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.48xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.g5.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.12xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.g5.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 2400

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.g6.4xlarge.experimental.nvidia.gpu:

				    disk_size: 150

				    instance_type: g6.4xlarge

				    is_ephemeral: false

				    max_available: 50

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.large:

				    max_available: 1200

				    disk_size: 15

				    instance_type: c5.large

				    is_ephemeral: false

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs

				    ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64

				  lf.linux.arm64.2xlarge:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				    ami: al2023-ami-2023.5.202*-kernel-6.1-arm64

				  lf.linux.arm64.m7g.4xlarge:

				    disk_size: 256

				    instance_type: m7g.4xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				    ami: al2023-ami-2023.5.202*-kernel-6.1-arm64

				  lf.linux.arm64.2xlarge.ephemeral:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: true

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.202*-kernel-6.1-arm64

				  lf.linux.arm64.m7g.4xlarge.ephemeral:

				    disk_size: 256

				    instance_type: m7g.4xlarge

				    is_ephemeral: true

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.202*-kernel-6.1-arm64

				  lf.linux.arm64.m7g.metal:

				    disk_size: 256

				    instance_type: m7g.metal

				    is_ephemeral: false

				    max_available: 100

				    os: linux

				    variants:

				      amz2023:

				        ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

				      am2:

				        ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2

				    ami: al2023-ami-2023.5.202*-kernel-6.1-arm64

				  lf.windows.g4dn.xlarge:

				    disk_size: 256

				    instance_type: g4dn.xlarge

									
										19

.github/merge_rules.yaml
									
												
				@ -86,6 +86,18 @@

				  - pull

				  - inductor

				- name: OSS CI / pytorchbot / slow tests

				  patterns:

				  - test/slow_tests.json

				  approved_by:

				  - pytorchbot

				  ignore_flaky_failures: false

				  mandatory_checks_name:

				  - EasyCLA

				  - Lint

				  - pull

				  - slow

				- name: OSS CI /pytorchbot / Executorch

				  patterns:

				  - .ci/docker/ci_commit_pins/executorch.txt

				@ -107,8 +119,8 @@

				  mandatory_checks_name:

				  - EasyCLA

				  - Lint

				  - pull / linux-focal-py3_8-clang9-xla / build

				  - pull / linux-focal-py3_8-clang9-xla / test (xla, 1, 1, linux.12xlarge)

				  - pull / linux-focal-py3_9-clang9-xla / build

				  - pull / linux-focal-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge)

				- name: Documentation

				  patterns:

				@ -282,9 +294,11 @@

				  - torch/_C/_distributed*

				  - torch/csrc/distributed/**

				  - torch/testing/_internal/distributed/**

				  - torch/multiprocessing/**

				  - test/distributed/**

				  - test/cpp/dist_autograd/**

				  - test/cpp/rpc/**

				  - test/*multiprocessing*

				  approved_by:

				  - wconstab

				  - mrshenli

				@ -530,6 +544,7 @@

				  - anijain2305

				  - bdhirsh

				  - zou3519

				  - isuruf

				  mandatory_checks_name:

				  - EasyCLA

				  - Lint
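Note: each merge rule above pairs file `patterns` with `approved_by` and `mandatory_checks_name`. The sketch below shows one plausible reading of the new "slow tests" rule: every changed file must match a pattern, at least one approver must be listed, and every mandatory check must have passed. The merge bot's actual matching code is not part of this diff; `rule_allows_merge` is a hypothetical helper, and `fnmatchcase` is used as an approximation in which `*` also matches `/`.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List, Set

# The rule added in this diff, transcribed as Python data.
SLOW_TESTS_RULE: Dict[str, List[str]] = {
    "patterns": ["test/slow_tests.json"],
    "approved_by": ["pytorchbot"],
    "mandatory_checks_name": ["EasyCLA", "Lint", "pull", "slow"],
}


def rule_allows_merge(
    rule: Dict[str, List[str]],
    changed_files: Iterable[str],
    approvers: Iterable[str],
    passed_checks: Set[str],
) -> bool:
    """Assumed semantics: all files match, one approver listed, all checks passed."""
    files_ok = all(
        any(fnmatchcase(path, pattern) for pattern in rule["patterns"])
        for path in changed_files
    )
    approved = any(name in rule["approved_by"] for name in approvers)
    checks_ok = all(check in passed_checks for check in rule["mandatory_checks_name"])
    return files_ok and approved and checks_ok
```

With this reading, a PR that touches `test/slow_tests.json` plus any other file would not qualify for the rule.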

									
										5

.github/nitpicks.yml
									
									
										Normal file
									
												
				@ -0,0 +1,5 @@

				- markdown: |

				    ## Attention! native_functions.yaml was changed

				    If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks).  Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart.  See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.

				  pathFilter:

				    - 'aten/src/ATen/native/native_functions.yaml'

									
										1

.github/pytorch-probot.yml
									
												
				@ -9,6 +9,7 @@ ciflow_push_tags:

				- ciflow/inductor-rocm

				- ciflow/inductor-perf-compare

				- ciflow/inductor-micro-benchmark

				- ciflow/inductor-micro-benchmark-cpu-x86

				- ciflow/inductor-cu124

				- ciflow/linux-aarch64

				- ciflow/mps

2

.github/requirements/conda-env-iOS.txt

@@ -4,4 +4,4 @@ ninja=1.10.2
 numpy=1.23.3
 pyyaml=6.0
 setuptools=68.2.2
-typing-extensions=4.9.0
+typing-extensions=4.11.0

4

.github/requirements/pip-requirements-macOS.txt

@@ -1,6 +1,7 @@
 boto3==1.19.12
 hypothesis==6.56.4
-expecttest==0.1.6
+expecttest==0.2.1
+fbscribelogger==0.1.6
 librosa>=0.6.2
 mpmath==1.3.0
 networkx==2.8.7
@@ -30,3 +31,4 @@ optree==0.12.1
 # NB: test_hparams_* from test_tensorboard is failing with protobuf 5.26.0 in
 # which the stringify metadata is wrong when escaping double quote
 protobuf==3.20.2
+parameterized==0.8.1

									
										26

.github/scripts/build_triton_wheel.py
									
												
				@ -15,9 +15,7 @@ REPO_DIR = SCRIPT_DIR.parent.parent

				def read_triton_pin(device: str = "cuda") -> str:

				    triton_file = "triton.txt"

				    if device == "rocm":

				        triton_file = "triton-rocm.txt"

				    elif device == "xpu":

				    if device == "xpu":

				        triton_file = "triton-xpu.txt"

				    with open(REPO_DIR / ".ci" / "docker" / "ci_commit_pins" / triton_file) as f:

				        return f.read().strip()

				@ -50,6 +48,25 @@ def patch_init_py(

				        f.write(orig)

				# TODO: remove patch_setup_py() once we have a proper fix for https://github.com/triton-lang/triton/issues/4527

				def patch_setup_py(path: Path) -> None:

				    with open(path) as f:

				        orig = f.read()

				    try:

				        orig = check_and_replace(

				            orig,

				            "https://tritonlang.blob.core.windows.net/llvm-builds/",

				            "https://oaitriton.blob.core.windows.net/public/llvm-builds/",

				        )

				        with open(path, "w") as f:

				            f.write(orig)

				    except RuntimeError as e:

				        print(

				            f"Applying patch_setup_py() for llvm-build package failed: {e}.",

				            "If you are trying to build a newer version of Triton, you can ignore this.",

				        )

				def build_triton(

				    *,

				    version: str,

				@ -91,6 +108,9 @@ def build_triton(

				        else:

				            check_call(["git", "checkout", commit_hash], cwd=triton_basedir)

				        # TODO: remove this and patch_setup_py() once we have a proper fix for https://github.com/triton-lang/triton/issues/4527

				        patch_setup_py(triton_pythondir / "setup.py")

				        if build_conda:

				            with open(triton_basedir / "meta.yaml", "w") as meta:

				                print(
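Note: `patch_setup_py()` above relies on a `check_and_replace` helper whose body is not part of this diff. The sketch below assumes the helper raises `RuntimeError` when the source string is absent, which is consistent with the `except RuntimeError` branch in the hunk. The `check_and_replace` defined here is a hypothetical stand-in, and `patch_setup_text` works on a string rather than a file so it can be run without a Triton checkout.

```python
# URLs copied from the hunk above.
OLD_URL = "https://tritonlang.blob.core.windows.net/llvm-builds/"
NEW_URL = "https://oaitriton.blob.core.windows.net/public/llvm-builds/"


def check_and_replace(text: str, src: str, dst: str) -> str:
    """Hypothetical stand-in: fail loudly if `src` is missing, else replace it."""
    if src not in text:
        raise RuntimeError(f"Can't find {src} in the input")
    return text.replace(src, dst)


def patch_setup_text(text: str) -> str:
    """Rewrite the LLVM download URL, leaving the text untouched if it is absent."""
    try:
        return check_and_replace(text, OLD_URL, NEW_URL)
    except RuntimeError as e:
        # Newer Triton versions may no longer contain the old URL.
        print(f"Patch not applied: {e}")
        return text
```

Treating a missing URL as a warning rather than an error matches the message in the hunk, which says the failure can be ignored when building a newer Triton.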

									
										11

.github/scripts/check_labels.py
									
												
				@ -27,6 +27,12 @@ def parse_args() -> Any:

				    parser = ArgumentParser("Check PR labels")

				    parser.add_argument("pr_num", type=int)

				    # add a flag to return a non-zero exit code if the PR does not have the required labels

				    parser.add_argument(

				        "--exit-non-zero",

				        action="store_true",

				        help="Return a non-zero exit code if the PR does not have the required labels",

				    )

				    return parser.parse_args()

				@ -41,10 +47,13 @@ def main() -> None:

				        if not has_required_labels(pr):

				            print(LABEL_ERR_MSG)

				            add_label_err_comment(pr)

				            if args.exit_non_zero:

				                sys.exit(1)

				        else:

				            delete_all_label_err_comments(pr)

				    except Exception as e:

				        pass

				        if args.exit_non_zero:

				            sys.exit(1)

				    sys.exit(0)
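Note: the new `--exit-non-zero` flag makes the script's exit status reflect the label check instead of always returning 0. The sketch below isolates that decision so it can be exercised without the GitHub API. `exit_code` is a hypothetical helper and the `has_required_labels` argument stands in for the real check on the PR.

```python
from argparse import ArgumentParser
from typing import List


def exit_code(argv: List[str], has_required_labels: bool) -> int:
    """Return the status the script would exit with for the given arguments."""
    parser = ArgumentParser("Check PR labels")
    parser.add_argument("pr_num", type=int)
    parser.add_argument(
        "--exit-non-zero",
        action="store_true",
        help="Return a non-zero exit code if the PR does not have the required labels",
    )
    args = parser.parse_args(argv)
    # argparse exposes the dashed flag as the attribute `exit_non_zero`.
    if not has_required_labels and args.exit_non_zero:
        return 1
    return 0
```

Without the flag the function returns 0 even when labels are missing, which mirrors the unconditional `sys.exit(0)` at the end of `main()`.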

									
										3

.github/scripts/cherry_pick.py
									
												
				@ -169,7 +169,8 @@ def create_cherry_pick_branch(

				    repo.create_branch_and_checkout(branch=cherry_pick_branch)

				    # We might want to support ghstack later

				    repo._run_git("cherry-pick", "-x", "-X", "theirs", commit_sha)

				    # We don't want to resolve conflicts here.

				    repo._run_git("cherry-pick", "-x", commit_sha)

				    repo.push(branch=cherry_pick_branch, dry_run=False)

				    return cherry_pick_branch

									
										79

.github/scripts/generate_binary_build_matrix.py
									
												
				@ -18,13 +18,13 @@ from typing import Dict, List, Optional, Tuple

				CUDA_ARCHES = ["11.8", "12.1", "12.4"]

				CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1", "12.4": "12.4.0"}

				CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1", "12.4": "12.4.1"}

				CUDA_ARCHES_CUDNN_VERSION = {"11.8": "9", "12.1": "9", "12.4": "9"}

				ROCM_ARCHES = ["6.0", "6.1"]

				ROCM_ARCHES = ["6.1", "6.2"]

				XPU_ARCHES = ["xpu"]

				@ -68,18 +68,18 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {

				        "nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'"

				    ),

				    "12.4": (

				        "nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'"

				        "nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'"

				    ),

				}
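Note: each value in `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` above is one long string of requirements joined by ` | `, each with an environment marker after `;`. The sketch below shows how such a string could be split to inspect the pinned versions. It assumes `|` is used only as the separator, and `split_requirements` and `pinned_versions` are hypothetical helpers, not functions from the script.

```python
from typing import Dict, List

# Two of the 12.4 entries from the hunk above, joined the same way.
EXTRA_12_4 = (
    "nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "
    "nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'"
)


def split_requirements(value: str) -> List[str]:
    """Split the joined string into individual requirement specifiers."""
    return [part.strip() for part in value.split("|") if part.strip()]


def pinned_versions(value: str) -> Dict[str, str]:
    """Map each package name to its pinned version, ignoring the marker."""
    pins: Dict[str, str] = {}
    for requirement in split_requirements(value):
        # Drop the environment marker first; it also contains '=='.
        spec = requirement.split(";", 1)[0].strip()
        name, _, version = spec.partition("==")
        pins[name] = version
    return pins
```

A check like this makes it easy to confirm that the CUDA 12.4 runtime packages were all moved from `12.4.99` to `12.4.127` together.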

				@ -325,6 +325,7 @@ def generate_wheels_matrix(

				    os: str,

				    arches: Optional[List[str]] = None,

				    python_versions: Optional[List[str]] = None,

				    use_split_build: bool = False,

				) -> List[Dict[str, str]]:

				    package_type = "wheel"

				    if os == "linux" or os == "linux-aarch64" or os == "linux-s390x":

				@ -340,7 +341,7 @@ def generate_wheels_matrix(

				        if os == "linux":

				            arches += CPU_CXX11_ABI_ARCH + CUDA_ARCHES + ROCM_ARCHES + XPU_ARCHES

				        elif os == "windows":

				            arches += CUDA_ARCHES

				            arches += CUDA_ARCHES + XPU_ARCHES

				        elif os == "linux-aarch64":

				            # Only want the one arch as the CPU type is different and

				            # uses different build/test scripts

				@ -365,13 +366,23 @@ def generate_wheels_matrix(

				                else arch_version

				            )

				            # TODO: Enable python 3.13 on rocm, xpu, aarch64, windows

				            # TODO: Enable python 3.13 on rocm, aarch64, windows

				            if (

				                gpu_arch_type in ["rocm", "xpu"] or os != "linux"

				                gpu_arch_type == "rocm" or (os != "linux" and os != "linux-s390x")

				            ) and python_version == "3.13":

				                continue

				            if use_split_build and (

				                arch_version not in ["12.4", "12.1", "11.8", "cpu"] or os != "linux"

				            ):

				                raise RuntimeError(

				                    "Split build is only supported on linux with cuda 12.4, 12.1, 11.8, and cpu.\n"

				                    f"Currently attempting to build on arch version {arch_version} and os {os}.\n"

				                    "Please modify the matrix generation to exclude this combination."

				                )

				            # 12.1 linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install

				            if (

				                arch_version in ["12.4", "12.1", "11.8"]

				                and os == "linux"

				@ -385,6 +396,7 @@ def generate_wheels_matrix(

				                        "desired_cuda": translate_desired_cuda(

				                            gpu_arch_type, gpu_arch_version

				                        ),

				                        "use_split_build": "True" if use_split_build else "False",

				                        "devtoolset": (

				                            "cxx11-abi" if arch_version == "cuda-aarch64" else ""

				                        ),

				@ -400,7 +412,8 @@ def generate_wheels_matrix(

				                        ),

				                    }

				                )

				                if arch_version != "cuda-aarch64":

				                # Special build to use on Colab. Python 3.11 for 12.1 CUDA

				                if python_version == "3.11" and arch_version == "12.1":

				                    ret.append(

				                        {

				                            "python_version": python_version,

				@ -409,40 +422,16 @@ def generate_wheels_matrix(

				                            "desired_cuda": translate_desired_cuda(

				                                gpu_arch_type, gpu_arch_version

				                            ),

				                            "use_split_build": "True",

				                            "use_split_build": "True" if use_split_build else "False",

				                            "devtoolset": "",

				                            "container_image": WHEEL_CONTAINER_IMAGES[arch_version],

				                            "package_type": package_type,

				                            "pytorch_extra_install_requirements": (

				                                PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version]  # fmt: skip

				                                if os != "linux-aarch64"

				                                else ""

				                            ),

				                            "build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-split".replace(  # noqa: B950

				                            "pytorch_extra_install_requirements": "",

				                            "build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-full".replace(  # noqa: B950

				                                ".", "_"

				                            ),

				                        }

				                    )

				                    # Special build building to use on Colab. PyThon 3.10 for 12.1 CUDA

				                    if python_version == "3.10" and arch_version == "12.1":

				                        ret.append(

				                            {

				                                "python_version": python_version,

				                                "gpu_arch_type": gpu_arch_type,

				                                "gpu_arch_version": gpu_arch_version,

				                                "desired_cuda": translate_desired_cuda(

				                                    gpu_arch_type, gpu_arch_version

				                                ),

				                                "use_split_build": "False",

				                                "devtoolset": "",

				                                "container_image": WHEEL_CONTAINER_IMAGES[arch_version],

				                                "package_type": package_type,

				                                "pytorch_extra_install_requirements": "",

				                                "build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-full".replace(  # noqa: B950

				                                    ".", "_"

				                                ),

				                            }

				                        )

				            else:

				                ret.append(

				                    {

				@ -452,6 +441,7 @@ def generate_wheels_matrix(

				                        "desired_cuda": translate_desired_cuda(

				                            gpu_arch_type, gpu_arch_version

				                        ),

				                        "use_split_build": "True" if use_split_build else "False",

				                        "devtoolset": (

				                            "cxx11-abi" if arch_version == "cpu-cxx11-abi" else ""

				                        ),

				@ -462,11 +452,12 @@ def generate_wheels_matrix(

				                        ),

				                        "pytorch_extra_install_requirements": (

				                            PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.1"]  # fmt: skip

				                            if os != "linux"

				                            if os != "linux" and gpu_arch_type != "xpu"

				                            else ""

				                        ),

				                    }

				                )

				    return ret
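The split-build guard and the `-full` build name in the hunks above can be sketched in isolation. This is a minimal sketch, not the script itself: `check_split_build`, `full_build_name`, `SPLIT_BUILD_ARCHES`, and the `os_name` parameter are hypothetical names introduced here for illustration, and the supported arch list is copied from the diff.

```python
# Hypothetical constant: the arch versions the diff lists as split-build capable.
SPLIT_BUILD_ARCHES = ["12.4", "12.1", "11.8", "cpu"]


def check_split_build(arch_version: str, os_name: str, use_split_build: bool) -> None:
    """Raise if a split build is requested for an unsupported combination."""
    if use_split_build and (
        arch_version not in SPLIT_BUILD_ARCHES or os_name != "linux"
    ):
        raise RuntimeError(
            "Split build is only supported on linux with cuda 12.4, 12.1, 11.8, and cpu.\n"
            f"Currently attempting to build on arch version {arch_version} and os {os_name}."
        )


def full_build_name(
    package_type: str, python_version: str, gpu_arch_type: str, gpu_arch_version: str
) -> str:
    """Build the '-full' name the same way the diff does: dots become underscores."""
    name = f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-full"
    return name.replace(".", "_")


if __name__ == "__main__":
    check_split_build("12.4", "linux", use_split_build=True)  # passes silently
    print(full_build_name("manywheel", "3.11", "cuda", "12.1"))
```

Raising early keeps an unsupported combination from silently producing a matrix entry that no container image can build.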

									
.github/scripts/generate_ci_workflows.py
				@ -61,6 +61,7 @@ class BinaryBuildWorkflow:

				    # Mainly for macos

				    cross_compile_arm64: bool = False

				    macos_runner: str = "macos-14-xlarge"

				    use_split_build: bool = False

				    def __post_init__(self) -> None:

				        if self.abi_version:

				@ -69,6 +70,9 @@ class BinaryBuildWorkflow:

				            )

				        else:

				            self.build_environment = f"{self.os}-binary-{self.package_type}"

				        if self.use_split_build:

				            # added to distinguish concurrency groups

				            self.build_environment += "-split"

				    def generate_workflow_file(self, workflow_template: jinja2.Template) -> None:

				        output_file_path = (

				@ -110,6 +114,20 @@ LINUX_BINARY_BUILD_WORFKLOWS = [

				            isolated_workflow=True,

				        ),

				    ),

				    BinaryBuildWorkflow(

				        os=OperatingSystem.LINUX,

				        package_type="manywheel",

				        build_configs=generate_binary_build_matrix.generate_wheels_matrix(

				            OperatingSystem.LINUX,

				            use_split_build=True,

				            arches=["11.8", "12.1", "12.4", "cpu"],

				        ),

				        ciflow_config=CIFlowConfig(

				            labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},

				            isolated_workflow=True,

				        ),

				        use_split_build=True,

				    ),

				    BinaryBuildWorkflow(

				        os=OperatingSystem.LINUX,

				        package_type="conda",

				@ -158,10 +176,25 @@ LINUX_BINARY_SMOKE_WORKFLOWS = [

				        build_configs=generate_binary_build_matrix.generate_wheels_matrix(

				            OperatingSystem.LINUX,

				            arches=["11.8", "12.1", "12.4"],

				            python_versions=["3.8"],

				            python_versions=["3.9"],

				        ),

				        branches="main",

				    ),

				    BinaryBuildWorkflow(

				        os=OperatingSystem.LINUX,

				        package_type="manywheel",

				        build_configs=generate_binary_build_matrix.generate_wheels_matrix(

				            OperatingSystem.LINUX,

				            arches=["11.8", "12.1", "12.4"],

				            python_versions=["3.9"],

				            use_split_build=True,

				        ),

				        ciflow_config=CIFlowConfig(

				            labels={LABEL_CIFLOW_PERIODIC},

				        ),

				        branches="main",

				        use_split_build=True,

				    ),

				    BinaryBuildWorkflow(

				        os=OperatingSystem.LINUX,

				        package_type="libtorch",

									
.github/scripts/github_utils.py
				@ -46,16 +46,24 @@ def gh_fetch_url_and_headers(

				        with urlopen(Request(url, headers=headers, data=data_, method=method)) as conn:

				            return conn.headers, reader(conn)

				    except HTTPError as err:

				        if err.code == 403 and all(

				            key in err.headers for key in ["X-RateLimit-Limit", "X-RateLimit-Used"]

				        if (

				            err.code == 403

				            and all(

				                key in err.headers

				                for key in ["X-RateLimit-Limit", "X-RateLimit-Remaining"]

				            )

				            and int(err.headers["X-RateLimit-Remaining"]) == 0

				        ):

				            print(

				                f"""Rate limit exceeded:

				                f"""{url}

				                Rate limit exceeded:

				                Used: {err.headers['X-RateLimit-Used']}

				                Limit: {err.headers['X-RateLimit-Limit']}

				                Remaining: {err.headers['X-RateLimit-Remaining']}

				                Resets at: {err.headers['x-RateLimit-Reset']}"""

				            )

				        else:

				            print(f"Error fetching {url} {err}")

				        raise

				@ -160,6 +168,14 @@ def gh_post_commit_comment(

				    )

				def gh_close_pr(org: str, repo: str, pr_num: int, dry_run: bool = False) -> None:

				    url = f"{GITHUB_API_URL}/repos/{org}/{repo}/pulls/{pr_num}"

				    if dry_run:

				        print(f"Dry run closing PR {pr_num}")

				    else:

				        gh_fetch_url(url, method="PATCH", data={"state": "closed"})

				def gh_delete_comment(org: str, repo: str, comment_id: int) -> None:

				    url = f"{GITHUB_API_URL}/repos/{org}/{repo}/issues/comments/{comment_id}"

				    gh_fetch_url(url, method="DELETE")
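The tightened rate-limit condition in `gh_fetch_url_and_headers` can be read as a single predicate: a 403 counts as a rate-limit failure only when both headers are present and the remaining quota is zero. The sketch below assumes a plain mapping of headers; `is_rate_limit_exhausted` is a hypothetical helper, not part of `github_utils.py`. Unlike the header object on a real `HTTPError`, a plain `dict` is case-sensitive.

```python
from typing import Mapping


def is_rate_limit_exhausted(status_code: int, headers: Mapping[str, str]) -> bool:
    """Return True only for a 403 whose headers show no remaining quota."""
    required = ("X-RateLimit-Limit", "X-RateLimit-Remaining")
    return (
        status_code == 403
        and all(key in headers for key in required)
        and int(headers["X-RateLimit-Remaining"]) == 0
    )


if __name__ == "__main__":
    exhausted = {"X-RateLimit-Limit": "5000", "X-RateLimit-Remaining": "0"}
    print(is_rate_limit_exhausted(403, exhausted))
```

Checking `X-RateLimit-Remaining` avoids reporting "Rate limit exceeded" for 403 responses that have another cause, such as missing permissions.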

									
.github/scripts/lintrunner.sh
				@ -17,6 +17,11 @@ if [[ -d "${CACHE_DIRECTORY}" ]]; then

				    cp -r "${CACHE_DIRECTORY}" . || true

				fi

				# if lintrunner is not installed, install it

				if ! command -v lintrunner &> /dev/null; then

				    python3 -m pip install lintrunner==0.12.5

				fi

				# This has already been cached in the docker image

				lintrunner init 2> /dev/null

				@ -33,7 +38,7 @@ python3 torch/utils/data/datapipes/gen_pyi.py

				RC=0

				# Run lintrunner on all files

				if ! lintrunner --force-color --all-files --tee-json=lint.json ${ADDITIONAL_LINTRUNNER_ARGS} 2> /dev/null; then

				if ! lintrunner --force-color --tee-json=lint.json ${ADDITIONAL_LINTRUNNER_ARGS} 2> /dev/null; then

				    echo ""

				    echo -e "\e[1m\e[36mYou can reproduce these results locally by using \`lintrunner -m origin/main\`. (If you don't get the same results, run \'lintrunner init\' to update your local linter)\e[0m"

				    echo -e "\e[1m\e[36mSee https://github.com/pytorch/pytorch/wiki/lintrunner for setup instructions.\e[0m"

									
.github/scripts/runner_determinator.py
				@ -3,49 +3,94 @@

				"""

				This runner determinator is used to determine which set of runners to run a

				GitHub job on. It uses the first comment of a GitHub issue (by default

				https://github.com/pytorch/test-infra/issues/5132) as a user list to determine

				which users will get their jobs to run on experimental runners. This user list

				is also a comma separated list of additional features or experiments which the

				user could be opted in to.

				https://github.com/pytorch/test-infra/issues/5132) to define the configuration

				of which runners should be used to run which job.

				The configuration has two parts, the settings and a list of opted-in users,

				separated by a line containing "---".  If the line is not present, the

				settings are considered to be empty with only the second part, the user

				list, defined.

				The first part is a YAML block that defines the rollout settings. This can be

				used to define any settings that are needed to determine which runners to use.

				Its fields are defined by the Settings class below.

				The second part is a list of users who are explicitly opted in to the LF fleet.

				The user list is also a comma separated list of additional features or

				experiments which the user could be opted in to.

				The user list has the following rules:

				- Users are GitHub usernames with the @ prefix

				- If the first line is a "*" then all users will use the new runners

				- If the first line is a "!" then all users will use the old runners

				- Users are GitHub usernames, which must start with the @ prefix

				- Each user is also a comma-separated list of features/experiments to enable

				- A "#" prefix indicates the user is opted out of the new runners but is opting

				  into features/experiments.

				- A "#" prefix opts the user out of all experiments

				Example user list:

				Example config:

				    # A list of experiments that can be opted into.

				    # This defines the behavior they'll induce when opted into.

				    # Expected syntax is:

				    #   [experiment_name]: # Name of the experiment. Also used for the label prefix.

				    #      rollout_perc: [int] # % of workflows to run with this experiment when users are not opted in.

				    @User1

				    @User2,amz2023

				    #@UserOptOutOfNewRunner,amz2023

				    experiments:

				      lf:

				        rollout_perc: 25

				    ---

				    # Opt-ins:

				    # Users can opt into the LF fleet by adding their GitHub username to this list

				    # and specifying experiments to enable in a comma-separated list.

				    # Experiments should be from the above list.

				    @User1,lf,split_build

				    @User2,lf

				    @User3,split_build

				"""

				import logging

				import os

				import random

				from argparse import ArgumentParser

				from logging import LogRecord

				from typing import Any, Iterable

				from typing import Any, Dict, Iterable, List, NamedTuple, Tuple

				import yaml

				from github import Auth, Github

				from github.Issue import Issue

				WORKFLOW_LABEL_META = ""  # use meta runners

				DEFAULT_LABEL_PREFIX = ""  # use meta runners

				WORKFLOW_LABEL_LF = "lf."  # use runners from the linux foundation

				WORKFLOW_LABEL_LF_CANARY = "lf.c."  # use canary runners from the linux foundation

				RUNNER_AMI_LEGACY = ""

				RUNNER_AMI_AMZ2023 = "amz2023"

				GITHUB_OUTPUT = os.getenv("GITHUB_OUTPUT", "")

				GH_OUTPUT_KEY_AMI = "runner-ami"

				GH_OUTPUT_KEY_LABEL_TYPE = "label-type"

				SETTING_EXPERIMENTS = "experiments"

				LF_FLEET_EXPERIMENT = "lf"

				CANARY_FLEET_SUFFIX = ".c"

				class Experiment(NamedTuple):

				    rollout_perc: float = (

				        0  # Percentage of workflows to experiment on when user is not opted-in.

				    )

				    # Add more fields as needed

				class Settings(NamedTuple):

				    """

				    Settings for the experiments that can be opted into.

				    """

				    experiments: Dict[str, Experiment] = {}

				class ColorFormatter(logging.Formatter):

				    """Color codes the log messages based on the log level"""

				@ -137,11 +182,14 @@ def get_issue(gh: Github, repo: str, issue_num: int) -> Issue:

				def get_potential_pr_author(

				    gh: Github, repo: str, username: str, ref_type: str, ref_name: str

				    github_token: str, repo: str, username: str, ref_type: str, ref_name: str

				) -> str:

				    # If the trigger was a new tag added by a bot, this is a ciflow case

				    # Fetch the actual username from the original PR. The PR number is

				    # embedded in the tag name: ciflow/<name>/<pr-number>

				    gh = get_gh_client(github_token)

				    if username == "pytorch-bot[bot]" and ref_type == "tag":

				        split_tag = ref_name.split("/")

				        if (

				@ -163,126 +211,233 @@ def get_potential_pr_author(

				def is_exception_branch(branch: str) -> bool:

				    """

				    Branches that get opted out of all experiments and should always use Meta runners

				    """

				    return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}

				def get_workflow_type(issue: Issue, workflow_requestors: Iterable[str]) -> str:

				def load_yaml(yaml_text: str) -> Any:

				    try:

				        first_comment = issue.get_comments()[0].body.strip("\n\t ")

				        if first_comment[0] == "!":

				            log.info("LF Workflows are disabled for everyone. Using meta runners.")

				            return WORKFLOW_LABEL_META

				        elif first_comment[0] == "*":

				            log.info("LF Workflows are enabled for everyone. Using LF runners.")

				            return WORKFLOW_LABEL_LF

				        else:

				            all_opted_in_users = {

				                usr_raw.strip("\n\t@ ").split(",")[0]

				                for usr_raw in first_comment.split()

				            }

				            opted_in_requestors = {

				                usr for usr in workflow_requestors if usr in all_opted_in_users

				            }

				            if opted_in_requestors:

				                log.info(

				                    f"LF Workflows are enabled for {', '.join(opted_in_requestors)}. Using LF runners."

				                )

				                return WORKFLOW_LABEL_LF

				            else:

				                log.info(

				                    f"LF Workflows are disabled for {', '.join(workflow_requestors)}. Using meta runners."

				                )

				                return WORKFLOW_LABEL_META

				    except Exception as e:

				        log.error(

				            f"Failed to get determine workflow type. Falling back to meta runners. Exception: {e}"

				        )

				        return WORKFLOW_LABEL_META

				        data = yaml.safe_load(yaml_text)

				        return data

				    except yaml.YAMLError as exc:

				        log.exception("Error loading YAML")

				        raise

				def get_optin_feature(

				    issue: Issue, workflow_requestors: Iterable[str], feature: str, fallback: str

				def extract_settings_user_opt_in_from_text(rollout_state: str) -> Tuple[str, str]:

				    """

				    Extracts the text with settings, if any, and the opted in users from the rollout state.

				    If the issue body contains "---" then the text above that is the settings

				    and the text below is the list of opted in users.

				    If it doesn't contain "---" then the settings are empty and the rest is the users.

				    """

				    rollout_state_parts = rollout_state.split("---")

				    if len(rollout_state_parts) >= 2:

				        return rollout_state_parts[0], rollout_state_parts[1]

				    else:

				        return "", rollout_state

				class UserOptins(Dict[str, List[str]]):

				    """

				    Dictionary of users with a list of features they have opted into

				    """

				def parse_user_opt_in_from_text(user_optin_text: str) -> UserOptins:

				    """

				    Parse the user opt-in text into a key value pair of username and the list of features they have opted into

				    Users are GitHub usernames with the @ prefix. Each user is also a comma-separated list of features/experiments to enable.

				        - Example line: "@User1,lf,split_build"

				        - A "#" prefix indicates the user is opted out of all experiments

				    """

				    optins = UserOptins()

				    for user in user_optin_text.split("\n"):

				        user = user.strip("\r\n\t -")

				        if not user or not user.startswith("@"):

				            # Not a valid user. Skip

				            continue

				        if user:

				            usr_name = user.split(",")[0].strip("@")

				            optins[usr_name] = [exp.strip(" ") for exp in user.split(",")[1:]]

				    return optins

				def parse_settings_from_text(settings_text: str) -> Settings:

				    """

				    Parse the experiments from the settings text into a Settings object

				    """

				    try:

				        if settings_text:

				            # Escape the backtick as well so that we can have the settings in a code block on the GH issue

				            # for easy reading

				            # Note: Using ascii for the backtick so that the cat step in _runner-determinator.yml doesn't choke on

				            #       the backtick character in shell commands.

				            backtick = chr(96)  # backtick character

				            settings_text = settings_text.strip(f"\r\n\t{backtick} ")

				            settings = load_yaml(settings_text)

				            # For now we just load experiments. We can expand this if/when we add more settings

				            experiments = {}

				            for exp_name, exp_settings in settings.get(SETTING_EXPERIMENTS).items():

				                valid_settings = {}

				                for setting in exp_settings:

				                    if setting not in Experiment._fields:

				                        log.warning(

				                            f"Unexpected setting in experiment: {setting} = {exp_settings[setting]}"

				                        )

				                    else:

				                        valid_settings[setting] = exp_settings[setting]

				                experiments[exp_name] = Experiment(**valid_settings)

				            return Settings(experiments)

				    except Exception:

				        log.exception("Failed to parse settings")

				    return Settings()

				def parse_settings(rollout_state: str) -> Settings:

				    """

				    Parse settings, if any, from the rollout state.

				    If the issue body contains "---" then the text above that is the settings

				    and the text below is the list of opted in users.

				    If it doesn't contain "---" then the settings are empty and the default values are used.

				    """

				    settings_text, _ = extract_settings_user_opt_in_from_text(rollout_state)

				    return parse_settings_from_text(settings_text)

				def parse_users(rollout_state: str) -> UserOptins:

				    """

				    Parse users from the rollout state.

				    """

				    _, users_text = extract_settings_user_opt_in_from_text(rollout_state)

				    return parse_user_opt_in_from_text(users_text)

				def is_user_opted_in(user: str, user_optins: UserOptins, experiment_name: str) -> bool:

				    """

				    Check if a user is opted into an experiment

				    """

				    return experiment_name in user_optins.get(user, [])

				def get_runner_prefix(

				    rollout_state: str, workflow_requestors: Iterable[str], is_canary: bool = False

				) -> str:

				    try:

				        first_comment = issue.get_comments()[0].body.strip("\n\t ")

				        userlist = {u.lstrip("#").strip("\n\t@ ") for u in first_comment.split()}

				        all_opted_in_users = set()

				        for user in userlist:

				            for i in user.split(","):

				                if i == feature:

				                    all_opted_in_users.add(user.split(",")[0])

				        opted_in_requestors = {

				            usr for usr in workflow_requestors if usr in all_opted_in_users

				        }

				    settings = parse_settings(rollout_state)

				    user_optins = parse_users(rollout_state)

				        if opted_in_requestors:

				            log.info(

				                f"Feature {feature} is enabled for {', '.join(opted_in_requestors)}. Using feature {feature}."

				            )

				            return feature

				        else:

				            log.info(

				                f"Feature {feature} is disabled for {', '.join(workflow_requestors)}. Using fallback \"{fallback}\"."

				            )

				            return fallback

				    fleet_prefix = ""

				    prefixes = []

				    for experiment_name, experiment_settings in settings.experiments.items():

				        enabled = False

				    except Exception as e:

				        # Is any workflow_requestor opted in to this experiment?

				        opted_in_users = [

				            requestor

				            for requestor in workflow_requestors

				            if is_user_opted_in(requestor, user_optins, experiment_name)

				        ]

				        if opted_in_users:

				            log.info(

				                f"{', '.join(opted_in_users)} have opted into experiment {experiment_name}."

				            )

				            enabled = True

				        elif experiment_settings.rollout_perc:

				            # If no user is opted in, then we randomly enable the experiment based on the rollout percentage

				            if random.uniform(0, 100) <= experiment_settings.rollout_perc:

				                log.info(

				                    f"Based on rollout percentage of {experiment_settings.rollout_perc}%, enabling experiment {experiment_name}."

				                )

				                enabled = True

				        if enabled:

				            label = experiment_name

				            if experiment_name == LF_FLEET_EXPERIMENT:

				                # We give some special treatment to the "lf" experiment since it determines the fleet we use

				                #  - If it's enabled, then we always list its prefix first

				                #  - If we're in the canary branch, then we append ".c" to the lf prefix

				                if is_canary:

				                    label += CANARY_FLEET_SUFFIX

				                fleet_prefix = label

				            else:

				                prefixes.append(label)

				    if len(prefixes) > 1:

				        log.error(

				            f'Failed to determine if user has opted-in to feature {feature}. Using fallback "{fallback}". Exception: {e}'

				            f"Only a fleet and one other experiment can be enabled for a job at any time. Enabling {prefixes[0]} and ignoring the rest, which are {', '.join(prefixes[1:])}"

				        )

				        return fallback

				        prefixes = prefixes[:1]

				    # Fleet always comes first

				    if fleet_prefix:

				        prefixes.insert(0, fleet_prefix)

				    return ".".join(prefixes) + "." if prefixes else ""

				def get_rollout_state_from_issue(github_token: str, repo: str, issue_num: int) -> str:

				    """

				    Gets the first comment of the issue, which contains the desired rollout state.

				    The default issue we use - https://github.com/pytorch/test-infra/issues/5132

				    """

				    gh = get_gh_client(github_token)

				    issue = get_issue(gh, repo, issue_num)

				    return str(issue.get_comments()[0].body.strip("\n\t "))

				def main() -> None:

				    args = parse_args()

				    if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):

				        log.info(f"Exception branch: '{args.github_branch}', using meta runners")

				        label_type = WORKFLOW_LABEL_META

				        runner_ami = RUNNER_AMI_LEGACY

				        log.info(

				            f"Exception branch: '{args.github_branch}', using Meta runners and no experiments."

				        )

				        runner_label_prefix = DEFAULT_LABEL_PREFIX

				    else:

				        try:

				            gh = get_gh_client(args.github_token)

				            # The default issue we use - https://github.com/pytorch/test-infra/issues/5132

				            issue = get_issue(gh, args.github_issue_repo, args.github_issue)

				            rollout_state = get_rollout_state_from_issue(

				                args.github_token, args.github_issue_repo, args.github_issue

				            )

				            username = get_potential_pr_author(

				                gh,

				                args.github_token,

				                args.github_repo,

				                args.github_actor,

				                args.github_ref_type,

				                args.github_branch,

				            )

				            label_type = get_workflow_type(

				                issue,

				                (

				                    args.github_issue_owner,

				                    username,

				                ),

				            )

				            runner_ami = get_optin_feature(

				                issue=issue,

				                workflow_requestors=(

				                    args.github_issue_owner,

				                    username,

				                ),

				                feature=RUNNER_AMI_AMZ2023,

				                fallback=RUNNER_AMI_LEGACY,

				            is_canary = args.github_repo == "pytorch/pytorch-canary"

				            runner_label_prefix = get_runner_prefix(

				                rollout_state, (args.github_issue_owner, username), is_canary

				            )

				        except Exception as e:

				            log.error(

				                f"Failed to get issue. Falling back to meta runners. Exception: {e}"

				                f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"

				            )

				            label_type = WORKFLOW_LABEL_META

				            runner_ami = RUNNER_AMI_LEGACY

				    # For Canary builds use canary runners

				    if args.github_repo == "pytorch/pytorch-canary" and label_type == WORKFLOW_LABEL_LF:

				        label_type = WORKFLOW_LABEL_LF_CANARY

				    set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, label_type)

				    set_github_output(GH_OUTPUT_KEY_AMI, runner_ami)

				    set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, runner_label_prefix)

				if __name__ == "__main__":
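The opt-in path of the new runner determinator can be traced end to end with a condensed sketch. It follows the diff's logic for splitting the rollout state on `---`, parsing `@user,feature,...` lines, and assembling the label prefix. It deliberately leaves out the YAML settings parsing and the random `rollout_perc` sampling, so experiments are passed in as a plain list of names. The function names `split_rollout_state`, `parse_opt_ins`, and `runner_prefix` are hypothetical, introduced for this illustration only.

```python
from typing import Dict, Iterable, List, Tuple


def split_rollout_state(rollout_state: str) -> Tuple[str, str]:
    """Text above '---' is settings, text below is the user list."""
    parts = rollout_state.split("---")
    if len(parts) >= 2:
        return parts[0], parts[1]
    return "", rollout_state


def parse_opt_ins(users_text: str) -> Dict[str, List[str]]:
    """Map each '@user' line to its experiments; '#'-prefixed lines are skipped."""
    optins: Dict[str, List[str]] = {}
    for line in users_text.split("\n"):
        line = line.strip("\r\n\t -")
        if not line.startswith("@"):
            continue
        name, *features = line.split(",")
        optins[name.strip("@")] = [feature.strip(" ") for feature in features]
    return optins


def runner_prefix(
    experiments: Iterable[str],
    optins: Dict[str, List[str]],
    requestors: Iterable[str],
    is_canary: bool = False,
) -> str:
    """Fleet ('lf') comes first, followed by at most one other experiment."""
    requestors = list(requestors)
    fleet = ""
    others: List[str] = []
    for name in experiments:
        if not any(name in optins.get(user, []) for user in requestors):
            continue  # opt-in only; percentage-based rollout is omitted here
        if name == "lf":
            fleet = name + (".c" if is_canary else "")
        else:
            others.append(name)
    prefixes = ([fleet] if fleet else []) + others[:1]
    return ".".join(prefixes) + "." if prefixes else ""


if __name__ == "__main__":
    state = "experiments:\n  lf:\n    rollout_perc: 0\n---\n@User1,lf,split_build\n@User2,lf\n"
    _, users_text = split_rollout_state(state)
    print(runner_prefix(["lf", "split_build"], parse_opt_ins(users_text), ["User1"]))
```

The resulting string is meant to be prepended to a runner label, which is why a non-empty prefix always ends with a dot.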

									
.github/scripts/s390x-ci/README.md
				@ -3,7 +3,7 @@

				## Install prerequisites.

				```

				$ sudo dnf install docker

				$ sudo dnf install podman podman-docker jq

				```

				## Add services.

				@ -27,23 +27,48 @@ $ sudo systemctl enable --now qemu-user-static

				## Rebuild the image

				In order to build or update the `iiilinuxibmcom/actions-runner` image, e.g. to get the

				latest OS security fixes, use the following commands:

				First, build the s390x builder image `docker.io/pytorch/manylinuxs390x-builder`

				using the following commands:

				```

				$ cd ~

				$ git clone https://github.com/pytorch/pytorch

				$ cd pytorch

				$ git submodule update --init --recursive

				$ GPU_ARCH_TYPE=cpu-s390x "$(pwd)/.ci/docker/manywheel/build.sh" manylinuxs390x-builder

				$ docker image tag localhost/pytorch/manylinuxs390x-builder docker.io/pytorch/manylinuxs390x-builder:cpu-s390x

				$ docker image save -o ~/manywheel-s390x.tar docker.io/pytorch/manylinuxs390x-builder:cpu-s390x

				```

				The next step is to build the `actions-runner` image:

				```

				$ cd self-hosted-builder

				$ sudo docker build \

				      --build-arg repo=<owner>/<name> \

				      --build-arg token=<***> \

				      --pull \

				      -f actions-runner.Dockerfile \

				      -t iiilinuxibmcom/actions-runner \

				      -t iiilinuxibmcom/actions-runner.<name> \

				      .

				```

				If it fails, ensure that selinux doesn't prevent it from working.

				If there are failures, ensure that SELinux doesn't prevent it from working.

				In the worst case, SELinux can be disabled with `setenforce 0`.

				Now prepare all necessary files for runner registration:

				```

				$ sudo mkdir -p /etc/actions-runner/<name>

				$ sudo chmod 700 /etc/actions-runner/<name>

				$ sudo /bin/cp <github_app_private_key_file> /etc/actions-runner/<name>/key_private.pem

				$ sudo echo <github_app_id> | sudo tee /etc/actions-runner/<name>/appid.env

				$ sudo echo <github_app_install_id> | sudo tee /etc/actions-runner/<name>/installid.env

				$ sudo echo NAME=<worker_name> | sudo tee    /etc/actions-runner/<name>/env

				$ sudo echo ORG=<github_org>   | sudo tee -a /etc/actions-runner/<name>/env

				$ cd self-hosted-builder

				$ sudo /bin/cp helpers/*.sh /usr/local/bin/

				$ sudo chmod 755 /usr/local/bin/app_token.sh /usr/local/bin/gh_token_generator.sh

				```

				## Autostart the runner.

				```

									
										33

.github/scripts/s390x-ci/self-hosted-builder/actions-runner.Dockerfile
				@ -1,12 +1,12 @@

				# Self-Hosted IBM Z Github Actions Runner.

				# Temporary image: amd64 dependencies.

				FROM docker.io/amd64/ubuntu:22.04 as ld-prefix

				FROM docker.io/amd64/ubuntu:23.10 as ld-prefix

				ENV DEBIAN_FRONTEND=noninteractive

				RUN apt-get update && apt-get -y install ca-certificates libicu70 libssl3

				RUN apt-get update && apt-get -y install ca-certificates libicu72 libssl3

				# Main image.

				FROM docker.io/s390x/ubuntu:22.04

				FROM docker.io/s390x/ubuntu:23.10

				# Packages for pytorch building and testing.

				ENV DEBIAN_FRONTEND=noninteractive

				@ -16,6 +16,7 @@ RUN apt-get update && apt-get -y install \

				        gcc \

				        git \

				        jq \

				        zip \

				        libxml2-dev \

				        libxslt-dev \

				        ninja-build \

				@ -43,24 +44,28 @@ COPY fs/ /

				RUN chmod +x /usr/bin/actions-runner /usr/bin/entrypoint

				# install podman

				RUN apt -y install podman podman-docker

				# amd64 Github Actions Runner.

				RUN useradd -m actions-runner

				USER actions-runner

				WORKDIR /home/actions-runner

				RUN curl -L https://github.com/actions/runner/releases/download/v2.309.0/actions-runner-linux-x64-2.309.0.tar.gz | tar -xz

				# repository

				ARG repo

				# set up python virtual environment which is later used by runner.

				# build workflows use "python -m pip install ...",

				# and it doesn't work for non-root user

				RUN virtualenv --system-site-packages venv

				# repository token

				ARG token

				# copy prebuilt manywheel docker image for builds and tests

				# build command is:

				# GPU_ARCH_TYPE=cpu-s390x "$(pwd)/manywheel/build_docker.sh"

				# and save command is:

				# docker image save -o manywheel-s390x.tar pytorch/manylinuxs390x-builder:cpu-s390x

				#

				COPY --chown=actions-runner:actions-runner manywheel-s390x.tar /home/actions-runner/manywheel-s390x.tar

				RUN ./config.sh \

				        --unattended \

				        --url "https://github.com/${repo}" \

				        --token "${token}" \

				        --no-default-labels \

				        --labels self-hosted,linux.s390x

				RUN curl -L https://github.com/actions/runner/releases/download/v2.317.0/actions-runner-linux-x64-2.317.0.tar.gz | tar -xz

				ENTRYPOINT ["/usr/bin/entrypoint"]

				CMD ["/usr/bin/actions-runner"]

									
										6

.github/scripts/s390x-ci/self-hosted-builder/actions-runner@.service
				@ -8,12 +8,16 @@ StartLimitIntervalSec=0

				Type=simple

				Restart=always

				ExecStartPre=-/usr/bin/docker rm --force actions-runner.%i

				ExecStartPre=-/usr/local/bin/gh_token_generator.sh /etc/actions-runner/%i/appid.env /etc/actions-runner/%i/installid.env /etc/actions-runner/%i/key_private.pem /etc/actions-runner/%i/ghtoken.env

				ExecStart=/usr/bin/docker run \

				              --env-file=/etc/actions-runner/%i/env \

				              --env-file=/etc/actions-runner/%i/ghtoken.env \

				              --init \

				              --interactive \

				              --name=actions-runner.%i \

				              --rm \

				              iiilinuxibmcom/actions-runner

				              --privileged \

				              iiilinuxibmcom/actions-runner.%i

				ExecStop=/bin/sh -c "docker exec actions-runner.%i kill -INT -- -1"

				ExecStop=/bin/sh -c "docker wait actions-runner.%i"

				ExecStop=/bin/sh -c "docker rm actions-runner.%i"

42

.github/scripts/s390x-ci/self-hosted-builder/fs/usr/bin/actions-runner

 @ -2,5 +2,45 @@
 set -e -u
 # first import docker image
 if [ -f ./manywheel-s390x.tar ] ; then
         docker image load --input manywheel-s390x.tar
         docker image tag docker.io/pytorch/manylinuxs390x-builder:cpu-s390x docker.io/pytorch/manylinuxs390x-builder:cpu-s390x-main
         rm -f manywheel-s390x.tar
 fi
 token_file=registration-token.json
 # Generate registration token
 curl \
         -X POST \
         -H "Accept: application/vnd.github.v3+json" \
         -H "Authorization: Bearer ${ACCESS_TOKEN}" \
         "https://api.github.com/orgs/${ORG}/actions/runners/registration-token" \
         -o "$token_file"
 unset ACCESS_TOKEN
 # register runner as ephemeral runner
 # it does one job, stops and unregisters
 registration_token=$(jq --raw-output .token "$token_file")
 ./config.sh \
         --unattended \
         --ephemeral \
         --url "https://github.com/${ORG}" \
         --token "${registration_token}" \
         --name "${NAME}" \
         --no-default-labels \
         --labels self-hosted,linux.s390x
 unset registration_token
 rm -f "$token_file"
 # enter into python virtual environment.
 # build workflows use "python -m pip install ...",
 # and it doesn't work for non-root user
 source venv/bin/activate
 # Run one job.
 ./run.sh --once
 ./run.sh
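
The registration flow in the script above can be sketched in Python for readers who want to inspect the request without running the shell script. This is a minimal sketch, not the runner's implementation: the function names `build_registration_request` and `config_args` are hypothetical, the request is built but never sent, and the URL, headers, and `config.sh` flags are copied from the script.

```python
import urllib.request


def build_registration_request(org: str, access_token: str) -> urllib.request.Request:
    """Build (but do not send) the POST that asks GitHub for a runner registration token."""
    url = f"https://api.github.com/orgs/{org}/actions/runners/registration-token"
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Accept": "application/vnd.github.v3+json",
            "Authorization": f"Bearer {access_token}",
        },
    )


def config_args(org: str, registration_token: str, name: str) -> list:
    """Arguments the script passes to ./config.sh to register an ephemeral runner."""
    return [
        "./config.sh", "--unattended", "--ephemeral",
        "--url", f"https://github.com/{org}",
        "--token", registration_token,
        "--name", name,
        "--no-default-labels",
        "--labels", "self-hosted,linux.s390x",
    ]


if __name__ == "__main__":
    # Placeholder values only; a real access token comes from ghtoken.env.
    req = build_registration_request("example-org", "example-token")
    print(req.get_method(), req.full_url)
    print(" ".join(config_args("example-org", "<registration-token>", "worker-1")))
```

Because the runner is registered with `--ephemeral`, it takes one job and then unregisters, so the systemd unit's `Restart=always` is what brings up a fresh container for the next job.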

									
										84

.github/scripts/s390x-ci/self-hosted-builder/helpers/app_token.sh
Executable file
				@ -0,0 +1,84 @@

				#!/usr/bin/env bash

				#

				# Request an ACCESS_TOKEN to be used by a GitHub APP

				# Environment variable that need to be set up:

				# * APP_ID, the GitHub's app ID

				# * INSTALL_ID, the Github's app's installation ID

				# * APP_PRIVATE_KEY, the content of GitHub app's private key in PEM format.

				#

				# https://github.com/orgs/community/discussions/24743#discussioncomment-3245300

				#

				set -o pipefail

				_GITHUB_HOST=${GITHUB_HOST:="github.com"}

				# If URL is not github.com then use the enterprise api endpoint

				if [[ ${GITHUB_HOST} = "github.com" ]]; then

				  URI="https://api.${_GITHUB_HOST}"

				else

				  URI="https://${_GITHUB_HOST}/api/v3"

				fi

				API_VERSION=v3

				API_HEADER="Accept: application/vnd.github.${API_VERSION}+json"

				CONTENT_LENGTH_HEADER="Content-Length: 0"

				APP_INSTALLATIONS_URI="${URI}/app/installations"

				# JWT parameters based off

				# https://docs.github.com/en/developers/apps/building-github-apps/authenticating-with-github-apps#authenticating-as-a-github-app

				#

				# JWT token issuance and expiration parameters

				JWT_IAT_DRIFT=60

				JWT_EXP_DELTA=600

				JWT_JOSE_HEADER='{

				    "alg": "RS256",

				    "typ": "JWT"

				}'

				build_jwt_payload() {

				    now=$(date +%s)

				    iat=$((now - JWT_IAT_DRIFT))

				    jq -c \

				        --arg iat_str "${iat}" \

				        --arg exp_delta_str "${JWT_EXP_DELTA}" \

				        --arg app_id_str "${APP_ID}" \

				    '

				        ($iat_str | tonumber) as $iat

				        | ($exp_delta_str | tonumber) as $exp_delta

				        | ($app_id_str | tonumber) as $app_id

				        | .iat = $iat

				        | .exp = ($iat + $exp_delta)

				        | .iss = $app_id

				    ' <<< "{}" | tr -d '\n'

				}

				base64url() {

				    base64 | tr '+/' '-_' | tr -d '=\n'

				}

				rs256_sign() {

				    openssl dgst -binary -sha256 -sign <(echo "$1")

				}

				request_access_token() {

				    jwt_payload=$(build_jwt_payload)

				    encoded_jwt_parts=$(base64url <<<"${JWT_JOSE_HEADER}").$(base64url <<<"${jwt_payload}")

				    encoded_mac=$(echo -n "$encoded_jwt_parts" | rs256_sign "${APP_PRIVATE_KEY}" | base64url)

				    generated_jwt="${encoded_jwt_parts}.${encoded_mac}"

				    auth_header="Authorization: Bearer ${generated_jwt}"

				    app_installations_response=$(curl -sX POST \

				        -H "${auth_header}" \

				        -H "${API_HEADER}" \

				        --header "X-GitHub-Api-Version: 2022-11-28" \

				        --url "https://api.github.com/app/installations/${INSTALL_ID}/access_tokens" \

				    )

				    echo "$app_installations_response" | jq --raw-output '.token'

				}

				request_access_token
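
The JWT assembly in `app_token.sh` may be easier to follow as a short Python sketch. It mirrors the script's `build_jwt_payload` and `base64url` steps using the same constants, and it deliberately stops before the RS256 signature, which the script delegates to `openssl`. The function `unsigned_jwt` is a hypothetical name, and the encoded bytes are not identical to the shell output because the shell here-strings append a trailing newline before encoding.

```python
import base64
import json
import time
from typing import Optional

JWT_IAT_DRIFT = 60    # seconds subtracted from "now" to tolerate clock skew
JWT_EXP_DELTA = 600   # token lifetime in seconds, counted from iat


def base64url(data: bytes) -> str:
    """URL-safe base64 with the '=' padding stripped, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def build_jwt_payload(app_id: int, now: int) -> dict:
    """Claims for authenticating as a GitHub App: issued-at, expiry, issuer."""
    iat = now - JWT_IAT_DRIFT
    return {"iat": iat, "exp": iat + JWT_EXP_DELTA, "iss": app_id}


def unsigned_jwt(app_id: int, now: Optional[int] = None) -> str:
    """Return '<header>.<payload>'; the RS256 signature still has to be appended."""
    if now is None:
        now = int(time.time())
    header = {"alg": "RS256", "typ": "JWT"}
    parts = (header, build_jwt_payload(app_id, now))
    return ".".join(
        base64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in parts
    )


if __name__ == "__main__":
    # 12345 is a placeholder app ID, not a real GitHub App.
    print(unsigned_jwt(12345))
```

Note that `exp` is computed from the back-dated `iat`, so the token is valid for roughly nine minutes from the moment it is created, which stays under the ten-minute limit GitHub places on App JWTs.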

									
										10

.github/scripts/s390x-ci/self-hosted-builder/helpers/gh_token_generator.sh
Executable file
				@ -0,0 +1,10 @@

				#!/usr/bin/env bash

				SCRIPT_DIR=$(dirname "$0")

				APP_ID=$1

				INSTALL_ID=$2

				APP_PRIVATE_KEY=$3

				DST_FILE="$4"

				ACCESS_TOKEN="$(APP_ID="$(<"${APP_ID}")" INSTALL_ID="$(<"${INSTALL_ID}")" APP_PRIVATE_KEY="$(<"${APP_PRIVATE_KEY}")" "${SCRIPT_DIR}/app_token.sh")"

				echo "ACCESS_TOKEN=${ACCESS_TOKEN}" > "${DST_FILE}"

									
										35

.github/scripts/sync_distributed_folder_prototype.sh
				@ -1,35 +0,0 @@

				#!/bin/bash

				set -eoux pipefail

				SYNC_BRANCH=pytorch-stable-prototype

				git config user.email "fake@example.com"

				git config user.name  "PyTorch Stable Bot"

				git fetch origin main

				git fetch origin "$SYNC_BRANCH"

				git checkout "$SYNC_BRANCH"

				# Using a hardcoded SHA here is a massive speedup as we can skip the entire history of the pytorch GitHub repo.

				# This specific SHA was chosen as it was before the "branch point" of the stable branch

				for SHA in $(git log ba3b05fdf37ddbc3c301294d6a560a816335e717..origin/main --pretty="%h" -- torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed)

				do

				    # `git merge-base --is-ancestor` exits with code 0 if the given SHA is an ancestor, and non-0 otherwise

				    if git merge-base --is-ancestor $SHA HEAD || [[ $(git log --grep="(cherry picked from commit $SHA") ]]

				    then

				        echo "Skipping $SHA"

				        continue

				    fi

				    echo "Copying $SHA"

				    git cherry-pick -x "$SHA" -X theirs

				    git reset --soft HEAD~1

				    git add torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed

				    git checkout .

				    git commit --reuse-message=HEAD@{1}

				    git clean -f

				done

				if [[ "${WITH_PUSH}" == true ]]; then

				  git push

				fi

									
										2

.github/scripts/tag_docker_images_for_release.py
				@ -51,6 +51,8 @@ def main() -> None:

				    for platform_image in platform_images:  # type: ignore[attr-defined]

				        for arch in platform_image.keys():  # type: ignore[attr-defined]

				            if arch == "cpu-s390x":

				                continue

				            tag_image(

				                platform_image[arch],  # type: ignore[index]

				                default_tag,

									
										1

.github/scripts/test_check_labels.py
				@ -18,6 +18,7 @@ def mock_parse_args() -> object:

				    class Object:

				        def __init__(self) -> None:

				            self.pr_num = 76123

				            self.exit_non_zero = False

				    return Object()

									
										237

.github/scripts/test_runner_determinator.py
Normal file
				@ -0,0 +1,237 @@

				from unittest import main, TestCase

				from unittest.mock import Mock, patch

				import runner_determinator as rd

				class TestRunnerDeterminatorIssueParser(TestCase):

				    def test_parse_settings(self) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 25

				            otherExp:

				                rollout_perc: 0

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        settings = rd.parse_settings(settings_text)

				        self.assertTupleEqual(

				            rd.Experiment(rollout_perc=25),

				            settings.experiments["lf"],

				            "lf settings not parsed correctly",

				        )

				        self.assertTupleEqual(

				            rd.Experiment(rollout_perc=0),

				            settings.experiments["otherExp"],

				            "otherExp settings not parsed correctly",

				        )

				    def test_parse_settings_in_code_block(self) -> None:

				        settings_text = """

				        ```

				        experiments:

				            lf:

				                rollout_perc: 25

				            otherExp:

				                rollout_perc: 0

				        ```

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        settings = rd.parse_settings(settings_text)

				        self.assertTupleEqual(

				            rd.Experiment(rollout_perc=25),

				            settings.experiments["lf"],

				            "lf settings not parsed correctly",

				        )

				        self.assertTupleEqual(

				            rd.Experiment(rollout_perc=0),

				            settings.experiments["otherExp"],

				            "otherExp settings not parsed correctly",

				        )

				    def test_parse_users(self) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 0

				            otherExp:

				                rollout_perc: 0

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        users = rd.parse_users(settings_text)

				        self.assertDictEqual(

				            {"User1": ["lf"], "User2": ["lf", "otherExp"]},

				            users,

				            "Users not parsed correctly",

				        )

				    def test_parse_users_without_settings(self) -> None:

				        settings_text = """

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        users = rd.parse_users(settings_text)

				        self.assertDictEqual(

				            {"User1": ["lf"], "User2": ["lf", "otherExp"]},

				            users,

				            "Users not parsed correctly",

				        )

				class TestRunnerDeterminatorGetRunnerPrefix(TestCase):

				    def test_opted_in_user(self) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 0

				            otherExp:

				                rollout_perc: 0

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        prefix = rd.get_runner_prefix(settings_text, ["User1"])

				        self.assertEqual("lf.", prefix, "Runner prefix not correct for User1")

				    def test_opted_in_user_two_experiments(self) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 0

				            otherExp:

				                rollout_perc: 0

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        prefix = rd.get_runner_prefix(settings_text, ["User2"])

				        self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for User2")

				    @patch("random.uniform", return_value=50)

				    def test_opted_out_user(self, mock_uniform: Mock) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 25

				            otherExp:

				                rollout_perc: 25

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        prefix = rd.get_runner_prefix(settings_text, ["User3"])

				        self.assertEqual("", prefix, "Runner prefix not correct for user")

				    @patch("random.uniform", return_value=10)

				    def test_opted_out_user_was_pulled_in_by_rollout(self, mock_uniform: Mock) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 25

				            otherExp:

				                rollout_perc: 25

				        ---

				        Users:

				        @User1,lf

				        @User2,lf,otherExp

				        """

				        # User3 is opted out, but is pulled into both experiments by the 10% rollout

				        prefix = rd.get_runner_prefix(settings_text, ["User3"])

				        self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")

				    def test_lf_prefix_always_comes_first(self) -> None:

				        settings_text = """

				        experiments:

				            otherExp:

				                rollout_perc: 0

				            lf:

				                rollout_perc: 0

				        ---

				        Users:

				        @User1,lf

				        @User2,otherExp,lf

				        """

				        prefix = rd.get_runner_prefix(settings_text, ["User2"])

				        self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")

				    def test_ignores_commented_users(self) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 0

				            otherExp:

				                rollout_perc: 0

				        ---

				        Users:

				        #@User1,lf

				        @User2,lf,otherExp

				        """

				        prefix = rd.get_runner_prefix(settings_text, ["User1"])

				        self.assertEqual("", prefix, "Runner prefix not correct for user")

				    def test_ignores_extra_experiments(self) -> None:

				        settings_text = """

				        experiments:

				            lf:

				                rollout_perc: 0

				            otherExp:

				                rollout_perc: 0

				            foo:

				                rollout_perc: 0

				        ---

				        Users:

				        @User1,lf,otherExp,foo

				        """

				        prefix = rd.get_runner_prefix(settings_text, ["User1"])

				        self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")

				if __name__ == "__main__":

				    main()
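
The tests above pin down the user-list format and the rollout check well enough to sketch both. The following is a hypothetical reconstruction inferred only from the test inputs and expected values; it is not the code in `runner_determinator.py`, and the names `parse_users_sketch` and `pulled_in_by_rollout` as well as the strict `<` comparison are assumptions.

```python
import random
from typing import Dict, List, Optional


def parse_users_sketch(settings_text: str) -> Dict[str, List[str]]:
    """Map each opted-in user to the experiments listed after their name."""
    # Assumption: the user list is whatever follows the last '---' separator,
    # or the whole text when no separator is present.
    users_text = settings_text.rsplit("---", 1)[-1]
    users: Dict[str, List[str]] = {}
    for line in users_text.splitlines():
        line = line.strip()
        if not line.startswith("@"):
            continue  # skips blank lines, the "Users:" header and "#@..." comments
        name, *experiments = [part.strip() for part in line[1:].split(",")]
        if name:
            users[name] = [exp for exp in experiments if exp]
    return users


def pulled_in_by_rollout(rollout_perc: float, draw: Optional[float] = None) -> bool:
    """Decide whether a user who did not opt in still gets the experiment."""
    if draw is None:
        draw = random.uniform(0, 100)
    # Assumption: strict comparison, so rollout_perc = 0 never selects anyone.
    return draw < rollout_perc


if __name__ == "__main__":
    text = """
    ---
    Users:
    #@User1,lf
    @User2,lf,otherExp
    """
    print(parse_users_sketch(text))
    print(pulled_in_by_rollout(25, draw=10), pulled_in_by_rollout(25, draw=50))
```

Passing the random draw as a parameter plays the same role as patching `random.uniform` in the tests: it makes the rollout decision deterministic.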

									
										48

.github/scripts/trymerge.py
				@ -36,6 +36,7 @@ from warnings import warn

				import yaml

				from github_utils import (

				    gh_close_pr,

				    gh_fetch_json_list,

				    gh_fetch_merge_base,

				    gh_fetch_url,

				@ -1174,11 +1175,11 @@ class GitHubPR:

				            for pr in additional_merged_prs:

				                pr.add_numbered_label(MERGE_COMPLETE_LABEL, dry_run)

				        if comment_id and self.pr_num:

				            # When the merge process reaches this part, we can assume that the commit

				            # has been successfully pushed to trunk

				            merge_commit_sha = repo.rev_parse(name=REMOTE_MAIN_BRANCH)

				        # When the merge process reaches this part, we can assume that the commit

				        # has been successfully pushed to trunk

				        merge_commit_sha = repo.rev_parse(name=self.default_branch())

				        if comment_id and self.pr_num:

				            # Finally, upload the record to Rockset. The list of pending and failed

				            # checks are at the time of the merge

				            save_merge_record(

				@ -1203,6 +1204,17 @@ class GitHubPR:

				        else:

				            print("Missing comment ID or PR number, couldn't upload to Rockset")

				        # Usually Github will see that the commit has "resolves <pr_num>" in the

				        # commit message and close the PR, but sometimes it doesn't, leading to

				        # confusion.  When it doesn't, we close it manually.

				        time.sleep(60)  # Give Github some time to close the PR

				        manually_close_merged_pr(

				            pr=self,

				            additional_merged_prs=additional_merged_prs,

				            merge_commit_sha=merge_commit_sha,

				            dry_run=dry_run,

				        )

				    def merge_changes(

				        self,

				        repo: GitRepo,

				@ -1503,6 +1515,34 @@ def checks_to_markdown_bullets(

				    ]

				def manually_close_merged_pr(

				    pr: GitHubPR,

				    additional_merged_prs: List[GitHubPR],

				    merge_commit_sha: str,

				    dry_run: bool,

				) -> None:

				    def _comment_and_close(pr: GitHubPR, comment: str) -> None:

				        pr = GitHubPR(pr.org, pr.project, pr.pr_num)  # Refresh the PR

				        if not pr.is_closed():

				            gh_post_pr_comment(pr.org, pr.project, pr.pr_num, comment, dry_run)

				            gh_close_pr(pr.org, pr.project, pr.pr_num, dry_run)

				    message = (

				        f"This PR (#{pr.pr_num}) was merged in {merge_commit_sha} but it is still open, likely due to a Github bug, "

				        "so mergebot is closing it manually.  If you think this is a mistake, please feel free to reopen and contact Dev Infra."

				    )

				    _comment_and_close(pr, message)

				    for additional_pr in additional_merged_prs:

				        message = (

				            f"This PR (#{additional_pr.pr_num}) was merged as part of PR #{pr.pr_num} in the stack under {merge_commit_sha} "

				            "but it is still open, likely due to a Github bug, so mergebot is closing it manually. "

				            "If you think this is a mistake, please feel free to reopen and contact Dev Infra."

				        )

				        _comment_and_close(additional_pr, message)

				    print(f"PR {pr.pr_num} and all additional PRs in the stack have been closed.")

				@retries_decorator()

				def save_merge_record(

				    comment_id: int,

4

.github/templates/common.yml.j2

 @ -1,7 +1,7 @@
 {%- set upload_artifact_s3_action = "seemethere/upload-artifact-s3@v5" -%}
 {%- set download_artifact_s3_action = "seemethere/download-artifact-s3@v4" -%}
 {%- set upload_artifact_action = "actions/upload-artifact@v3" -%}
 {%- set download_artifact_action = "actions/download-artifact@v3" -%}
 {%- set upload_artifact_action = "actions/upload-artifact@v4.4.0" -%}
 {%- set download_artifact_action = "actions/download-artifact@v4.1.7" -%}
 {%- set timeout_minutes = 240 -%}

28

.github/templates/linux_binary_build_workflow.yml.j2

 @ -52,23 +52,32 @@ env:
 !{{ common.concurrency(build_environment) }}
 jobs:
   get-label-type:
     name: get-label-type
     uses: ./.github/workflows/_runner-determinator.yml
     with:
       triggering_actor: ${{ github.triggering_actor }}
       issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
       curr_branch: ${{ github.head_ref || github.ref_name }}
       curr_ref_type: ${{ github.ref_type }}
 {%- for config in build_configs %}
   !{{ config["build_name"] }}-build:
     if: ${{ github.repository_owner == 'pytorch' }}
     uses: ./.github/workflows/_binary-build-linux.yml
     needs: get-label-type
     with:!{{ upload.binary_env_as_input(config) }}
       {%- if "aarch64" in build_environment %}
       runner_prefix: amz2023.
       runs_on: linux.arm64.m7g.4xlarge
       runs_on: linux.arm64.m7g.4xlarge.ephemeral
       ALPINE_IMAGE: "arm64v8/alpine"
       {%- elif "s390x" in build_environment %}
       runs_on: linux.s390x
       ALPINE_IMAGE: "docker.io/s390x/alpine"
       {%- elif "conda" in build_environment and config["gpu_arch_type"] == "cuda" %}
       runner_prefix: amz2023.
       runs_on: linux.24xlarge
       runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
       runs_on: linux.24xlarge.ephemeral
       {%- else %}
       runner_prefix: amz2023.
       runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
       {%- endif %}
       build_name: !{{ config["build_name"] }}
       build_environment: !{{ build_environment }}
 @ -84,14 +93,15 @@ jobs:
   {%- if config["gpu_arch_type"] != "cuda-aarch64" %}
   !{{ config["build_name"] }}-test:  # Testing
     if: ${{ github.repository_owner == 'pytorch' }}
     needs: !{{ config["build_name"] }}-build
     needs:
       - !{{ config["build_name"] }}-build
       - get-label-type
     {%- if config["gpu_arch_type"] not in ["rocm", "xpu"] %}
     uses: ./.github/workflows/_binary-test-linux.yml
     with:!{{ upload.binary_env_as_input(config) }}
       build_name: !{{ config["build_name"] }}
       build_environment: !{{ build_environment }}
       {%- if "aarch64" in build_environment %}
       runner_prefix: amz2023.
       runs_on: linux.arm64.2xlarge
       ALPINE_IMAGE: "arm64v8/alpine"
       {%- elif "s390x" in build_environment %}
 @ -100,10 +110,10 @@ jobs:
       {%- elif config["gpu_arch_type"] == "rocm" %}
       runs_on: linux.rocm.gpu
       {%- elif config["gpu_arch_type"] == "cuda" %}
       runner_prefix: amz2023.
       runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
       runs_on: linux.4xlarge.nvidia.gpu
       {%- else %}
       runner_prefix: amz2023.
       runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
       runs_on: linux.4xlarge
       {%- endif %}
     secrets:

7

.github/templates/macos_binary_build_workflow.yml.j2

 @ -64,9 +64,6 @@ jobs:
     {%- if config.pytorch_extra_install_requirements is defined and config.pytorch_extra_install_requirements|d('')|length > 0  %}
       PYTORCH_EXTRA_INSTALL_REQUIREMENTS: !{{ config.pytorch_extra_install_requirements }}
     {%- endif %}
       # For sccache access (only on non-forked PRs)
       AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
       AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
     steps:
       !{{ set_runner_specific_vars() }}
       - name: Install conda and dependencies
 @ -84,7 +81,7 @@ jobs:
       !{{ common.checkout(deep_clone=False, directory="pytorch") }}
       !{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
       - name: Install sccache (only for non-forked PRs, and pushes to trunk)
         uses: nick-fields/retry@v2.8.2
         uses: nick-fields/retry@v3.0.0
         if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }}
         with:
           timeout_minutes: 5
 @ -104,7 +101,7 @@ jobs:
           # shellcheck disable=SC1091
           source "${RUNNER_TEMP}/anaconda/bin/activate"
           "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh"
       - uses: actions/upload-artifact@v3
       - uses: actions/upload-artifact@v4.4.0
         if: always()
         with:
           name: !{{ config["build_name"] }}

2

.github/templates/upload.yml.j2

 @ -45,7 +45,7 @@
   {%- if is_windows %}
       # This is a dummy value for libtorch to work correctly with our batch scripts
       # without this value pip does not get installed for some reason
       DESIRED_PYTHON: "3.8"
       DESIRED_PYTHON: "3.9"
   {%- endif %}
 {%- else %}

26

.github/templates/windows_binary_build_workflow.yml.j2

 @ -53,10 +53,24 @@ env:
 !{{ common.concurrency(build_environment) }}
 jobs:
   get-label-type:
     name: get-label-type
     uses: ./.github/workflows/_runner-determinator.yml
     with:
       triggering_actor: ${{ github.triggering_actor }}
       issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
       curr_branch: ${{ github.head_ref || github.ref_name }}
       curr_ref_type: ${{ github.ref_type }}
 {%- for config in build_configs %}
   !{{ config["build_name"] }}-build:
     if: ${{ github.repository_owner == 'pytorch' }}
     runs-on: windows.4xlarge.nonephemeral
     needs: get-label-type
     {%- if branches == "nightly" %}
     runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge"
     {%- else %}
     runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
     {%- endif %}
     timeout-minutes: !{{ common.timeout_minutes }}
     !{{ upload.binary_env(config, True) }}
     {%- if config.pytorch_extra_install_requirements is defined and config.pytorch_extra_install_requirements|d('')|length > 0  %}
 @ -85,15 +99,17 @@ jobs:
       !{{ common.wait_and_kill_ssh_windows('pytorch') }}
   !{{ config["build_name"] }}-test:  # Testing
     if: ${{ github.repository_owner == 'pytorch' }}
     needs: !{{ config["build_name"] }}-build
     needs:
       - !{{ config["build_name"] }}-build
       - get-label-type
 {%- if config["gpu_arch_type"] == "cuda" %}
 {%- if branches == "nightly" %}
     runs-on: windows.8xlarge.nvidia.gpu
     runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.g4dn.xlarge"
 {%- else %}
     runs-on: windows.8xlarge.nvidia.gpu.nonephemeral
     runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.g4dn.xlarge.nonephemeral"
 {%- endif %}
 {%- else %}
     runs-on: windows.4xlarge.nonephemeral
     runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
 {%- endif %}
     timeout-minutes: !{{ common.timeout_minutes }}
     !{{ upload.binary_env(config, True) }}
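
The new `runs-on` values in the Windows test job are the `get-label-type` output concatenated with a base runner label. A minimal Python sketch of that selection, assuming the prefix is either empty or a dotted string such as `lf.` (the form asserted in `test_runner_determinator.py`); the function name `windows_test_runner` is hypothetical and exists only for illustration.

```python
def windows_test_runner(label_type: str, gpu_arch_type: str, branches: str) -> str:
    """Mirror the template's choice of runner label for the Windows test job."""
    if gpu_arch_type == "cuda":
        base = "windows.g4dn.xlarge"
        if branches != "nightly":
            # Non-nightly workflows stay on non-ephemeral GPU runners.
            base += ".nonephemeral"
    else:
        base = "windows.4xlarge.nonephemeral"
    # The prefix already ends with '.', so plain concatenation is enough.
    return f"{label_type}{base}"


if __name__ == "__main__":
    print(windows_test_runner("lf.", "cuda", "nightly"))
    print(windows_test_runner("", "cpu", "main"))
```

An empty prefix leaves the label unchanged, so jobs keep their existing runner labels when no experiment applies.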

									
										4

.github/workflows/_binary-build-linux.yml
				@ -18,7 +18,7 @@ on:

				        description: prefix for runner label

				      runs_on:

				        required: false

				        default: linux.12xlarge

				        default: linux.12xlarge.ephemeral

				        type: string

				        description: Hardware to run this "build" job on, linux.12xlarge or linux.arm64.2xlarge.

				      timeout-minutes:

				@ -283,7 +283,7 @@ jobs:

				          # Ensure the working directory gets chowned back to the current user

				          docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .

				      - uses: actions/upload-artifact@v3

				      - uses: actions/upload-artifact@v4.4.0

				        if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' }}

				        with:

				          name: ${{ inputs.build_name }}

									
										2

.github/workflows/_binary-test-linux.yml
				@ -210,7 +210,7 @@ jobs:

				      - name: Download Build Artifacts

				        if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' }}

				        uses: actions/download-artifact@v3

				        uses: actions/download-artifact@v4.1.7

				        with:

				          name: ${{ inputs.build_name }}

				          path: "${{ runner.temp }}/artifacts/"

									
										2

.github/workflows/_binary-upload.yml
				@ -126,7 +126,7 @@ jobs:

				        # NB: When the previous build job is skipped, there won't be any artifacts and

				        # this step will fail. Binary build jobs can only be skipped on CI, not nightly

				        continue-on-error: true

				        uses: actions/download-artifact@v3

				        uses: actions/download-artifact@v4.1.7

				        with:

				          name: ${{ inputs.build_name }}

				          path: "${{ runner.temp }}/artifacts/"

									
										4

.github/workflows/_buck-build-test.yml
				@ -64,7 +64,7 @@ jobs:

				          environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}

				      - name: Install Buck

				        uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482

				        uses: nick-fields/retry@v3.0.0

				        with:

				          timeout_minutes: 10

				          max_attempts: 5

				@ -74,7 +74,7 @@ jobs:

				            sudo apt install ./buck.2021.01.12.01_all.deb

				      - name: Download third party libraries and generate wrappers

				        uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482

				        uses: nick-fields/retry@v3.0.0

				        with:

				          timeout_minutes: 10

				          max_attempts: 5

									
										6

.github/workflows/_ios-build-test.yml
									
												
				@ -92,7 +92,7 @@ jobs:

				          fi

				      - name: Install brew dependencies

				        uses: nick-fields/retry@v2.8.2

				        uses: nick-fields/retry@v3.0.0

				        with:

				          timeout_minutes: 5

				          max_attempts: 3

				@ -109,7 +109,7 @@ jobs:

				          pip-requirements-file: .github/requirements/pip-requirements-iOS.txt

				      - name: Setup Fastlane

				        uses: nick-fields/retry@v2.8.2

				        uses: nick-fields/retry@v3.0.0

				        with:

				          timeout_minutes: 5

				          max_attempts: 3

				@ -292,7 +292,7 @@ jobs:

				          bundler-cache: true

				      - name: Download arm64 artifacts

				        uses: actions/download-artifact@v3

				        uses: actions/download-artifact@v4.1.7

				        with:

				          name: pytorch-ios-build-artifacts-arm64

									
										70

.github/workflows/_linux-build.yml
									
												
				@ -82,6 +82,10 @@ on:

				        required: false

				        description: |

				          HF Auth token to avoid rate limits when downloading models or datasets from hub

				      SCRIBE_GRAPHQL_ACCESS_TOKEN:

				        required: false

				        description: |

				          FB app token to write to scribe endpoint

				    outputs:

				@ -94,6 +98,7 @@ on:

				jobs:

				  build:

				    environment: ${{ github.ref == 'refs/heads/main' && 'scribe-protected' || startsWith(github.ref, 'refs/heads/release/') && 'scribe-protected' || contains(github.event.pull_request.labels.*.name, 'ci-scribe') && 'scribe-pr' || '' }}

				    # Don't run on forked repos

				    if: github.repository_owner == 'pytorch'

				    runs-on: ${{ inputs.runner_prefix}}${{ inputs.runner }}

				@ -104,6 +109,7 @@ jobs:

				    steps:

				      - name: Setup SSH (Click me for login details)

				        uses: pytorch/test-infra/.github/actions/setup-ssh@main

				        if: inputs.build-environment != 'linux-s390x-binary-manywheel'

				        with:

				          github-secret: ${{ secrets.GITHUB_TOKEN }}

				@ -113,13 +119,16 @@ jobs:

				      # checkout. In other cases you should prefer a local checkout.

				      - name: Checkout PyTorch

				        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main

				        with:

				          no-sudo: ${{ inputs.build-environment == 'linux-s390x-binary-manywheel' }}

				      - name: Setup Linux

				        uses: ./.github/actions/setup-linux

				        if: inputs.build-environment != 'linux-s390x-binary-manywheel'

				      - name: configure aws credentials

				        uses: aws-actions/configure-aws-credentials@v3

				        if: ${{ inputs.aws-role-to-assume != '' }}

				        if: ${{ inputs.aws-role-to-assume != '' && inputs.build-environment != 'linux-s390x-binary-manywheel' }}

				        with:

				          role-to-assume: ${{ inputs.aws-role-to-assume }}

				          role-session-name: gha-linux-build

				@ -128,11 +137,13 @@ jobs:

				      - name: Calculate docker image

				        id: calculate-docker-image

				        uses: pytorch/test-infra/.github/actions/calculate-docker-image@main

				        if: inputs.build-environment != 'linux-s390x-binary-manywheel'

				        with:

				          docker-image-name: ${{ inputs.docker-image-name }}

				      - name: Use following to pull public copy of the image

				        id: print-ghcr-mirror

				        if: inputs.build-environment != 'linux-s390x-binary-manywheel'

				        env:

				          ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}

				        shell: bash

				@ -142,6 +153,7 @@ jobs:

				      - name: Pull docker image

				        uses: pytorch/test-infra/.github/actions/pull-docker-image@main

				        if: inputs.build-environment != 'linux-s390x-binary-manywheel'

				        with:

				          docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}

				@ -169,6 +181,7 @@ jobs:

				      - name: Download pytest cache

				        uses: ./.github/actions/pytest-cache-download

				        continue-on-error: true

				        if: inputs.build-environment != 'linux-s390x-binary-manywheel'

				        with:

				          cache_dir: .pytest_cache

				          job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}

				@ -190,13 +203,29 @@ jobs:

				          PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }}

				          TORCH_CUDA_ARCH_LIST: ${{ inputs.cuda-arch-list }}

				          DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}

				          DOCKER_IMAGE_S390X: ${{ inputs.docker-image-name }}

				          XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}

				          DEBUG: ${{ inputs.build-with-debug && '1' || '0' }}

				          OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}

				          HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

				          SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}

				          USE_SPLIT_BUILD: ${{ inputs.use_split_build }}

				        run: |

				          if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then

				            JENKINS_USER=

				            USED_IMAGE="${DOCKER_IMAGE_S390X}"

				            # since some steps are skipped on s390x, if they are necessary, run them here

				            env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"

				            env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"

				          else

				            JENKINS_USER="--user jenkins"

				            USED_IMAGE="${DOCKER_IMAGE}"

				          fi

				          # detached container should get cleaned up by teardown_ec2_linux

				          # Used for JENKINS_USER, which can be empty

				          # shellcheck disable=SC2086

				          container_name=$(docker run \

				            -e BUILD_ENVIRONMENT \

				            -e MAX_JOBS="$(nproc --ignore=2)" \

				@ -219,10 +248,10 @@ jobs:

				            --cap-add=SYS_PTRACE \

				            --tty \

				            --detach \

				            --user jenkins \

				            ${JENKINS_USER} \

				            -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \

				            -w /var/lib/jenkins/workspace \

				            "${DOCKER_IMAGE}"

				            "${USED_IMAGE}"

				          )

				          docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'

				@ -233,7 +262,7 @@ jobs:

				      - name: Store PyTorch Build Artifacts on S3

				        uses: seemethere/upload-artifact-s3@v5

				        if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && !inputs.use_split_build

				        if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && !inputs.use_split_build && inputs.build-environment != 'linux-s390x-binary-manywheel'

				        with:

				          name: ${{ inputs.build-environment }}

				          retention-days: 14

				@ -243,7 +272,7 @@ jobs:

				      - name: Store PyTorch Build Artifacts on S3 for split build

				        uses: seemethere/upload-artifact-s3@v5

				        if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build

				        if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build && inputs.build-environment != 'linux-s390x-binary-manywheel'

				        with:

				          name: ${{ inputs.build-environment }}-experimental-split-build

				          retention-days: 14

				@ -251,8 +280,26 @@ jobs:

				          path: artifacts.zip

				          s3-bucket: ${{ inputs.s3-bucket }}

				      - name: Store PyTorch Build Artifacts for s390x

				        uses: actions/upload-artifact@v3

				        if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && !inputs.use_split_build && inputs.build-environment == 'linux-s390x-binary-manywheel'

				        with:

				          name: ${{ inputs.build-environment }}

				          retention-days: 14

				          if-no-files-found: error

				          path: artifacts.zip

				      - name: Store PyTorch Build Artifacts for s390x for split build

				        uses: actions/upload-artifact@v3

				        if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build && inputs.build-environment == 'linux-s390x-binary-manywheel'

				        with:

				          name: ${{ inputs.build-environment }}-experimental-split-build

				          retention-days: 14

				          if-no-files-found: error

				          path: artifacts.zip

				      - name: Upload sccache stats

				        if: steps.build.outcome != 'skipped'

				        if: steps.build.outcome != 'skipped' && inputs.build-environment != 'linux-s390x-binary-manywheel'

				        uses: seemethere/upload-artifact-s3@v5

				        with:

				          s3-prefix: |

				@ -264,4 +311,13 @@ jobs:

				      - name: Teardown Linux

				        uses: pytorch/test-infra/.github/actions/teardown-linux@main

				        if: always()

				        if: always() && inputs.build-environment != 'linux-s390x-binary-manywheel'

				      - name: Cleanup docker

				        if: always() && inputs.build-environment == 'linux-s390x-binary-manywheel'

				        shell: bash

				        run: |

				          # on s390x stop the container for clean worker stop

				          # ignore expansion of "docker ps -q" since it could be empty

				          # shellcheck disable=SC2046

				          docker stop $(docker ps -q) || true

									
										2

.github/workflows/_linux-test.yml
									
												
				@ -62,12 +62,12 @@ env:

				jobs:

				  test:

				    environment: ${{ github.ref == 'refs/heads/main' && 'prod-branch-main' || '' }}

				    # Don't run on forked repos or empty test matrix

				    if: github.repository_owner == 'pytorch' && toJSON(fromJSON(inputs.test-matrix).include) != '[]'

				    strategy:

				      matrix: ${{ fromJSON(inputs.test-matrix) }}

				      fail-fast: false

				    environment: ${{ github.ref == 'refs/heads/main' && 'scribe-protected' || startsWith(github.ref, 'refs/heads/release/') && 'scribe-protected' || contains(github.event.pull_request.labels.*.name, 'ci-scribe') && 'scribe-pr' || '' }}

				    runs-on: ${{ matrix.runner }}

				    timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}

				    steps:
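The new `environment:` expression chains `&&` and `||` to act as a ternary. As a rough illustration only, the same decision can be written as a small Python function. The name `select_environment` and its parameters are hypothetical and not part of the workflow, and exact-case string matching is a simplification of how GitHub expressions compare strings.

```python
from typing import Iterable


def select_environment(ref: str, pr_labels: Iterable[str]) -> str:
    """Mirror the chained `&&` / `||` expression from the workflow (illustrative only)."""
    # main and release branches both map to the protected environment
    if ref == "refs/heads/main" or ref.startswith("refs/heads/release/"):
        return "scribe-protected"
    # pull requests carrying the ci-scribe label map to the PR environment
    if "ci-scribe" in set(pr_labels):
        return "scribe-pr"
    # everything else runs without a deployment environment
    return ""


print(select_environment("refs/heads/main", []))
print(select_environment("refs/pull/123/merge", ["ci-scribe"]))
print(repr(select_environment("refs/pull/123/merge", ["topic: not user facing"])))
```

The first matching branch wins, which is the same short-circuit order the expression follows from left to right.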

									
										4

.github/workflows/_mac-build.yml
									
												
				@ -104,7 +104,7 @@ jobs:

				          pip-requirements-file: .github/requirements/pip-requirements-${{ runner.os }}.txt

				      - name: Install sccache (only for non-forked PRs, and pushes to trunk)

				        uses: nick-fields/retry@v2.8.2

				        uses: nick-fields/retry@v3.0.0

				        if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }}

				        with:

				          timeout_minutes: 5

				@ -139,7 +139,7 @@ jobs:

				            else

				              # The runner has access to the S3 bucket via IAM profile without the need

				              # for any credential

				              echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}"

				              echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}"0

				              echo "SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${GITHUB_ENV}"

				            fi

									
										22

.github/workflows/_mac-test-mps.yml
									
												
				@ -88,6 +88,13 @@ jobs:

				          environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}

				          pip-requirements-file: .github/requirements/pip-requirements-${{ runner.os }}.txt

				      - name: Get workflow job id

				        id: get-job-id

				        uses: ./.github/actions/get-workflow-job-id

				        if: always()

				        with:

				          github-token: ${{ secrets.GITHUB_TOKEN }}

				      - name: Install PyTorch and run MPS tests

				        id: test

				        env:

				@ -103,6 +110,14 @@ jobs:

				          NO_TEST_TIMEOUT: ${{ needs.filter.outputs.ci-no-test-timeout }}

				          NO_TD: ${{ needs.filter.outputs.ci-no-td }}

				          PIP_REQUIREMENTS_FILE: .github/requirements/pip-requirements-${{ runner.os }}.txt

				          GITHUB_REPOSITORY: ${{ github.repository }}

				          GITHUB_WORKFLOW: ${{ github.workflow }}

				          GITHUB_JOB: ${{ github.job }}

				          GITHUB_RUN_ID: ${{ github.run_id }}

				          GITHUB_RUN_NUMBER: ${{ github.run_number }}

				          GITHUB_RUN_ATTEMPT: ${{ github.run_attempt }}

				          JOB_ID: ${{ steps.get-job-id.outputs.job-id }}

				          JOB_NAME: ${{ steps.get-job-id.outputs.job-name }}

				          REENABLED_ISSUES: ${{ needs.filter.outputs.reenabled-issues }}

				        run: |

				          # shellcheck disable=SC1090

				@ -144,13 +159,6 @@ jobs:

				        run: |

				          cat test/**/*_toprint.log || true

				      - name: Get workflow job id

				        id: get-job-id

				        uses: ./.github/actions/get-workflow-job-id

				        if: always()

				        with:

				          github-token: ${{ secrets.GITHUB_TOKEN }}

				      - name: Upload test artifacts

				        uses: ./.github/actions/upload-test-artifacts

				        if: always() && steps.test.conclusion && steps.test.conclusion != 'skipped'

									
										2

.github/workflows/_run_android_tests.yml
									
												
				@ -68,7 +68,7 @@ jobs:

				          environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}.txt

				      - name: Install NDK

				        uses: nick-fields/retry@v2.8.2

				        uses: nick-fields/retry@v3.0.0

				        with:

				          timeout_minutes: 5

				          max_attempts: 3

									
										374

.github/workflows/_runner-determinator.yml
									
												
				@ -62,49 +62,94 @@ jobs:

				          """

				          This runner determinator is used to determine which set of runners to run a

				          GitHub job on. It uses the first comment of a GitHub issue (by default

				          https://github.com/pytorch/test-infra/issues/5132) as a user list to determine

				          which users will get their jobs to run on experimental runners. This user list

				          is also a comma separated list of additional features or experiments which the

				          user could be opted in to.

				          https://github.com/pytorch/test-infra/issues/5132) to define the configuration

				          of which runners should be used to run which job.

				          The configuration has two parts, the settings and a list of opted-in users,

				          separated by a line containing "---".  If the line is not present, the

				          settings are considered to be empty with only the second part, the user

				          list, defined.

				          The first part is a YAML block that defines the rollout settings. This can be

				          used to define any settings that are needed to determine which runners to use.

				          Its fields are defined by the Settings class below.

				          The second part is a list of users who are explicitly opted in to the LF fleet.

				          The user list is also a comma separated list of additional features or

				          experiments which the user could be opted in to.

				          The user list has the following rules:

				          - Users are GitHub usernames with the @ prefix

				          - If the first line is a "*" then all users will use the new runners

				          - If the first line is a "!" then all users will use the old runners

				          - Users are GitHub usernames, which must start with the @ prefix

				          - Each user is also a comma-separated list of features/experiments to enable

				          - A "#" prefix indicates the user is opted out of the new runners but is opting

				            into features/experiments.

				          - A "#" prefix opts the user out of all experiments

				          Example user list:

				          Example config:

				              # A list of experiments that can be opted into.

				              # This defines the behavior they'll induce when opted into.

				              # Expected syntax is:

				              #   [experiment_name]: # Name of the experiment. Also used for the label prefix.

				              #      rollout_perc: [int] # % of workflows to run with this experiment when users are not opted in.

				              @User1

				              @User2,amz2023

				              #@UserOptOutOfNewRunner,amz2023

				              experiments:

				                lf:

				                  rollout_perc: 25

				              ---

				              # Opt-ins:

				              # Users can opt into the LF fleet by adding their GitHub username to this list

				              # and specifying experiments to enable in a comma-separated list.

				              # Experiments should be from the above list.

				              @User1,lf,split_build

				              @User2,lf

				              @User3,split_build

				          """

				          import logging

				          import os

				          import random

				          from argparse import ArgumentParser

				          from logging import LogRecord

				          from typing import Any, Iterable

				          from typing import Any, Dict, Iterable, List, NamedTuple, Tuple

				          import yaml

				          from github import Auth, Github

				          from github.Issue import Issue

				          WORKFLOW_LABEL_META = ""  # use meta runners

				          DEFAULT_LABEL_PREFIX = ""  # use meta runners

				          WORKFLOW_LABEL_LF = "lf."  # use runners from the linux foundation

				          WORKFLOW_LABEL_LF_CANARY = "lf.c."  # use canary runners from the linux foundation

				          RUNNER_AMI_LEGACY = ""

				          RUNNER_AMI_AMZ2023 = "amz2023"

				          GITHUB_OUTPUT = os.getenv("GITHUB_OUTPUT", "")

				          GH_OUTPUT_KEY_AMI = "runner-ami"

				          GH_OUTPUT_KEY_LABEL_TYPE = "label-type"

				          SETTING_EXPERIMENTS = "experiments"

				          LF_FLEET_EXPERIMENT = "lf"

				          CANARY_FLEET_SUFFIX = ".c"

				          class Experiment(NamedTuple):

				              rollout_perc: float = (

				                  0  # Percentage of workflows to experiment on when user is not opted-in.

				              )

				              # Add more fields as needed

				          class Settings(NamedTuple):

				              """

				              Settings for the experiments that can be opted into.

				              """

				              experiments: Dict[str, Experiment] = {}

				          class ColorFormatter(logging.Formatter):

				              """Color codes the log messages based on the log level"""

				@ -196,11 +241,14 @@ jobs:

				          def get_potential_pr_author(

				              gh: Github, repo: str, username: str, ref_type: str, ref_name: str

				              github_token: str, repo: str, username: str, ref_type: str, ref_name: str

				          ) -> str:

				              # If the trigger was a new tag added by a bot, this is a ciflow case

				              # Fetch the actual username from the original PR. The PR number is

				              # embedded in the tag name: ciflow/<name>/<pr-number>

				              gh = get_gh_client(github_token)

				              if username == "pytorch-bot[bot]" and ref_type == "tag":

				                  split_tag = ref_name.split("/")

				                  if (

				@ -222,130 +270,238 @@ jobs:

				          def is_exception_branch(branch: str) -> bool:

				              """

				              Branches that get opted out of all experiments and should always use Meta runners

				              """

				              return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}

				          def get_workflow_type(issue: Issue, workflow_requestors: Iterable[str]) -> str:

				          def load_yaml(yaml_text: str) -> Any:

				              try:

				                  first_comment = issue.get_comments()[0].body.strip("\n\t ")

				                  if first_comment[0] == "!":

				                      log.info("LF Workflows are disabled for everyone. Using meta runners.")

				                      return WORKFLOW_LABEL_META

				                  elif first_comment[0] == "*":

				                      log.info("LF Workflows are enabled for everyone. Using LF runners.")

				                      return WORKFLOW_LABEL_LF

				                  else:

				                      all_opted_in_users = {

				                          usr_raw.strip("\n\t@ ").split(",")[0]

				                          for usr_raw in first_comment.split()

				                      }

				                      opted_in_requestors = {

				                          usr for usr in workflow_requestors if usr in all_opted_in_users

				                      }

				                      if opted_in_requestors:

				                          log.info(

				                              f"LF Workflows are enabled for {', '.join(opted_in_requestors)}. Using LF runners."

				                          )

				                          return WORKFLOW_LABEL_LF

				                      else:

				                          log.info(

				                              f"LF Workflows are disabled for {', '.join(workflow_requestors)}. Using meta runners."

				                          )

				                          return WORKFLOW_LABEL_META

				              except Exception as e:

				                  log.error(

				                      f"Failed to get determine workflow type. Falling back to meta runners. Exception: {e}"

				                  )

				                  return WORKFLOW_LABEL_META

				                  data = yaml.safe_load(yaml_text)

				                  return data

				              except yaml.YAMLError as exc:

				                  log.exception("Error loading YAML")

				                  raise

				          def get_optin_feature(

				              issue: Issue, workflow_requestors: Iterable[str], feature: str, fallback: str

				          def extract_settings_user_opt_in_from_text(rollout_state: str) -> Tuple[str, str]:

				              """

				              Extracts the text with settings, if any, and the opted in users from the rollout state.

				              If the issue body contains "---" then the text above that is the settings

				              and the text below is the list of opted in users.

				              If it doesn't contain "---" then the settings are empty and the rest is the users.

				              """

				              rollout_state_parts = rollout_state.split("---")

				              if len(rollout_state_parts) >= 2:

				                  return rollout_state_parts[0], rollout_state_parts[1]

				              else:

				                  return "", rollout_state

				          class UserOptins(Dict[str, List[str]]):

				              """

				              Dictionary of users with a list of features they have opted into

				              """

				          def parse_user_opt_in_from_text(user_optin_text: str) -> UserOptins:

				              """

				              Parse the user opt-in text into a key value pair of username and the list of features they have opted into

				              Users are GitHub usernames with the @ prefix. Each user is also a comma-separated list of features/experiments to enable.

				                  - Example line: "@User1,lf,split_build"

				                  - A "#" prefix indicates the user is opted out of all experiments

				              """

				              optins = UserOptins()

				              for user in user_optin_text.split("\n"):

				                  user = user.strip("\r\n\t -")

				                  if not user or not user.startswith("@"):

				                      # Not a valid user. Skip

				                      continue

				                  if user:

				                      usr_name = user.split(",")[0].strip("@")

				                      optins[usr_name] = [exp.strip(" ") for exp in user.split(",")[1:]]

				              return optins
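To make the two-part format concrete, here is a minimal standalone sketch of how a sample rollout state splits on `---` and how the opt-in lines parse. It deliberately skips the YAML settings step so that it needs only the standard library. The helper names `split_rollout_state` and `parse_optins` and the `SAMPLE` text are hypothetical stand-ins for illustration, not the functions in this file.

```python
from typing import Dict, List, Tuple


def split_rollout_state(text: str) -> Tuple[str, str]:
    """Return (settings_text, users_text); settings are empty when '---' is absent."""
    parts = text.split("---")
    if len(parts) >= 2:
        return parts[0], parts[1]
    return "", text


def parse_optins(users_text: str) -> Dict[str, List[str]]:
    """Map each '@user,exp1,exp2' line to its list of experiments."""
    optins: Dict[str, List[str]] = {}
    for line in users_text.split("\n"):
        line = line.strip("\r\n\t -")
        if not line.startswith("@"):
            continue  # blank lines, comments, and '#'-prefixed opt-outs are skipped
        name, *experiments = line.split(",")
        optins[name.strip("@")] = [exp.strip(" ") for exp in experiments]
    return optins


# Hypothetical issue text following the format described in the docstring
SAMPLE = """\
experiments:
  lf:
    rollout_perc: 25
---
# Opt-ins:
@User1,lf,split_build
@User2,lf
#@User3,split_build
"""

settings_text, users_text = split_rollout_state(SAMPLE)
optins = parse_optins(users_text)
print(optins)  # {'User1': ['lf', 'split_build'], 'User2': ['lf']}
```

Because a `#` prefix makes the line fail the `@` check, an opted-out user never enters the dictionary, so every later membership test for that user returns no experiments.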

				          def parse_settings_from_text(settings_text: str) -> Settings:

				              """

				              Parse the experiments from the issue body into a Settings object

				              """

				              try:

				                  if settings_text:

				                      # Escape the backtick as well so that we can have the settings in a code block on the GH issue

				                      # for easy reading

				                      # Note: Using ascii for the backtick so that the cat step in _runner-determinator.yml doesn't choke on

				                      #       the backtick character in shell commands.

				                      backtick = chr(96)  # backtick character

				                      settings_text = settings_text.strip(f"\r\n\t{backtick} ")

				                      settings = load_yaml(settings_text)

				                      # For now we just load experiments. We can expand this if/when we add more settings

				                      experiments = {}

				                      for exp_name, exp_settings in settings.get(SETTING_EXPERIMENTS).items():

				                          valid_settings = {}

				                          for setting in exp_settings:

				                              if setting not in Experiment._fields:

				                                  log.warning(

				                                      f"Unexpected setting in experiment: {setting} = {exp_settings[setting]}"

				                                  )

				                              else:

				                                  valid_settings[setting] = exp_settings[setting]

				                          experiments[exp_name] = Experiment(**valid_settings)

				                      return Settings(experiments)

				              except Exception:

				                  log.exception("Failed to parse settings")

				              return Settings()

				          def parse_settings(rollout_state: str) -> Settings:

				              """

				              Parse settings, if any, from the rollout state.

				              If the issue body contains "---" then the text above that is the settings

				              and the text below is the list of opted in users.

				              If it doesn't contain "---" then the settings are empty and the default values are used.

				              """

				              settings_text, _ = extract_settings_user_opt_in_from_text(rollout_state)

				              return parse_settings_from_text(settings_text)

				          def parse_users(rollout_state: str) -> UserOptins:

				              """

				              Parse users from the rollout state.

				              """

				              _, users_text = extract_settings_user_opt_in_from_text(rollout_state)

				              return parse_user_opt_in_from_text(users_text)

				          def is_user_opted_in(user: str, user_optins: UserOptins, experiment_name: str) -> bool:

				              """

				              Check if a user is opted into an experiment

				              """

				              return experiment_name in user_optins.get(user, [])

				          def get_runner_prefix(

				              rollout_state: str, workflow_requestors: Iterable[str], is_canary: bool = False

				          ) -> str:

				              try:

				                  first_comment = issue.get_comments()[0].body.strip("\n\t ")

				                  userlist = {u.lstrip("#").strip("\n\t@ ") for u in first_comment.split()}

				                  all_opted_in_users = set()

				                  for user in userlist:

				                      for i in user.split(","):

				                          if i == feature:

				                              all_opted_in_users.add(user.split(",")[0])

				                  opted_in_requestors = {

				                      usr for usr in workflow_requestors if usr in all_opted_in_users

				                  }

				              settings = parse_settings(rollout_state)

				              user_optins = parse_users(rollout_state)

				                  if opted_in_requestors:

				                      log.info(

				                          f"Feature {feature} is enabled for {', '.join(opted_in_requestors)}. Using feature {feature}."

				                      )

				                      return feature

				                  else:

				                      log.info(

				                          f"Feature {feature} is disabled for {', '.join(workflow_requestors)}. Using fallback \"{fallback}\"."

				                      )

				                      return fallback

				              fleet_prefix = ""

				              prefixes = []

				              for experiment_name, experiment_settings in settings.experiments.items():

				                  enabled = False

				              except Exception as e:

				                  # Is any workflow_requestor opted in to this experiment?

				                  opted_in_users = [

				                      requestor

				                      for requestor in workflow_requestors

				                      if is_user_opted_in(requestor, user_optins, experiment_name)

				                  ]

				                  if opted_in_users:

				                      log.info(

				                          f"{', '.join(opted_in_users)} have opted into experiment {experiment_name}."

				                      )

				                      enabled = True

				                  elif experiment_settings.rollout_perc:

				                      # If no user is opted in, then we randomly enable the experiment based on the rollout percentage

				                      if random.uniform(0, 100) <= experiment_settings.rollout_perc:

				                          log.info(

				                              f"Based on rollout percentage of {experiment_settings.rollout_perc}%, enabling experiment {experiment_name}."

				                          )

				                          enabled = True

				                  if enabled:

				                      label = experiment_name

				                      if experiment_name == LF_FLEET_EXPERIMENT:

				                          # We give some special treatment to the "lf" experiment since it determines the fleet we use

				                          #  - If it's enabled, then we always list its prefix first

				                          #  - If we're in the canary branch, then we append ".c" to the lf prefix

				                          if is_canary:

				                              label += CANARY_FLEET_SUFFIX

				                          fleet_prefix = label

				                      else:

				                          prefixes.append(label)

				              if len(prefixes) > 1:

				                  log.error(

				                      f'Failed to determine if user has opted-in to feature {feature}. Using fallback "{fallback}". Exception: {e}'

				                      f"Only a fleet and one other experiment can be enabled for a job at any time. Enabling {prefixes[0]} and ignoring the rest, which are {', '.join(prefixes[1:])}"

				                  )

				                  return fallback

				                  prefixes = prefixes[:1]

				              # Fleet always comes first

				              if fleet_prefix:

				                  prefixes.insert(0, fleet_prefix)

				              return ".".join(prefixes) + "." if prefixes else ""

				          def get_rollout_state_from_issue(github_token: str, repo: str, issue_num: int) -> str:

				              """

				              Gets the first comment of the issue, which contains the desired rollout state.

				              The default issue we use - https://github.com/pytorch/test-infra/issues/5132

				              """

				              gh = get_gh_client(github_token)

				              issue = get_issue(gh, repo, issue_num)

				              return str(issue.get_comments()[0].body.strip("\n\t "))

				          def main() -> None:

				              args = parse_args()

				              if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):

				                  log.info(f"Exception branch: '{args.github_branch}', using meta runners")

				                  label_type = WORKFLOW_LABEL_META

				                  runner_ami = RUNNER_AMI_LEGACY

				                  log.info(

				                      f"Exception branch: '{args.github_branch}', using Meta runners and no experiments."

				                  )

				                  runner_label_prefix = DEFAULT_LABEL_PREFIX

				              else:

				                  try:

				                      gh = get_gh_client(args.github_token)

				                      # The default issue we use - https://github.com/pytorch/test-infra/issues/5132

				                      issue = get_issue(gh, args.github_issue_repo, args.github_issue)

				                      rollout_state = get_rollout_state_from_issue(

				                          args.github_token, args.github_issue_repo, args.github_issue

				                      )

				                      username = get_potential_pr_author(

				                          gh,

				                          args.github_token,

				                          args.github_repo,

				                          args.github_actor,

				                          args.github_ref_type,

				                          args.github_branch,

				                      )

				                      label_type = get_workflow_type(

				                          issue,

				                          (

				                              args.github_issue_owner,

				                              username,

				                          ),

				                      )

				                      runner_ami = get_optin_feature(

				                          issue=issue,

				                          workflow_requestors=(

				                              args.github_issue_owner,

				                              username,

				                          ),

				                          feature=RUNNER_AMI_AMZ2023,

				                          fallback=RUNNER_AMI_LEGACY,

				                      is_canary = args.github_repo == "pytorch/pytorch-canary"

				                      runner_label_prefix = get_runner_prefix(

				                          rollout_state, (args.github_issue_owner, username), is_canary

				                      )

				                  except Exception as e:

				                      log.error(

				                          f"Failed to get issue. Falling back to meta runners. Exception: {e}"

				                          f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"

				                      )

				                      label_type = WORKFLOW_LABEL_META

				                      runner_ami = RUNNER_AMI_LEGACY

				              # For Canary builds use canary runners

				              if args.github_repo == "pytorch/pytorch-canary" and label_type == WORKFLOW_LABEL_LF:

				                  label_type = WORKFLOW_LABEL_LF_CANARY

				              set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, label_type)

				              set_github_output(GH_OUTPUT_KEY_AMI, runner_ami)

				              set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, runner_label_prefix)

				          if __name__ == "__main__":

				              main()

				          EOF

				          cat runner_determinator.py
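The opt-in and prefix logic in the script above can be condensed into a small sketch. This is not the script itself: `is_enabled` and `compose_runner_prefix` are hypothetical helper names, the `draw` parameter exists only so the random roll can be replaced in tests, and the example values `"lf.c"` and `"amz2023"` are assumptions chosen for illustration.

```python
import random
from typing import Callable, List, Optional


def is_enabled(
    opted_in: bool,
    rollout_perc: float,
    draw: Callable[[], float] = lambda: random.uniform(0, 100),
) -> bool:
    """Hypothetical helper: an explicit opt-in wins, otherwise roll against the rollout percentage."""
    if opted_in:
        return True
    # A rollout percentage of 0 (or unset) never enables the experiment.
    return bool(rollout_perc) and draw() <= rollout_perc


def compose_runner_prefix(
    fleet_prefix: Optional[str], experiment_prefixes: List[str]
) -> str:
    """Hypothetical helper mirroring the tail of get_runner_prefix shown in the diff."""
    # Only one non-fleet experiment may be active: keep the first, ignore the rest.
    prefixes = experiment_prefixes[:1]
    # The fleet prefix, when present, always comes first.
    if fleet_prefix:
        prefixes.insert(0, fleet_prefix)
    # Join with "." and add a trailing "." so the result can be prepended to a runner label.
    return ".".join(prefixes) + "." if prefixes else ""


# Assumed example values; the resulting prefix is prepended to a runner name,
# as the workflows in this diff do with "linux.9xlarge.ephemeral".
label = compose_runner_prefix("lf.c", ["amz2023", "other"]) + "linux.9xlarge.ephemeral"
```

An empty prefix leaves the runner label unchanged, which corresponds to the fallback path where no experiments are enabled.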

									
13 .github/workflows/_win-build.yml vendored
												
				@ -11,6 +11,16 @@ on:

				        required: true

				        type: string

				        description: What CUDA version to build with, "cpu" for none.

				      use-xpu:

				        required: false

				        type: boolean

				        default: false

				        description: If set, build with XPU support.

				      vc-year:

				        required: false

				        type: string

				        default: "2019"

				        description: The Visual Studio year to use for building.

				      build-with-debug:

				        required: false

				        type: boolean

				@ -141,7 +151,7 @@ jobs:

				          SCCACHE_REGION: us-east-1

				          VC_PRODUCT: "BuildTools"

				          VC_VERSION: ""

				          VC_YEAR: "2019"

				          VC_YEAR: "${{ inputs.vc-year }}"

				          ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"

				          AWS_DEFAULT_REGION: us-east-1

				          PR_NUMBER: ${{ github.event.pull_request.number }}

				@ -149,6 +159,7 @@ jobs:

				          DEBUG: ${{ inputs.build-with-debug && '1' || '0' }}

				          TORCH_CUDA_ARCH_LIST: "8.6"

				          USE_CUDA: ${{ inputs.cuda-version != 'cpu' && '1' || '0' }}

				          USE_XPU: ${{ inputs.use-xpu == true && '1' || '0' }}

				          OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}

				        run: |

				          .ci/pytorch/win-build.sh

									
2 .github/workflows/_win-test.yml vendored
												
				@ -87,7 +87,7 @@ jobs:

				      # TODO: Move to a requirements.txt file for windows

				      - name: Install pip dependencies

				        uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482

				        uses: nick-fields/retry@v3.0.0

				        with:

				          shell: bash

				          timeout_minutes: 5

									
14 .github/workflows/build-conda-images.yml vendored
												
				@ -11,20 +11,18 @@ on:

				      # Release candidate tags look like: v1.11.0-rc1

				      - v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+

				    paths:

				      - conda/Dockerfile

				      - 'common/*'

				      - '.ci/docker/conda/*'

				      - '.ci/docker/common/*'

				      - .github/workflows/build-conda-images.yml

				  pull_request:

				    paths:

				      - conda/Dockerfile

				      - 'common/*'

				      - '.ci/docker/conda/*'

				      - '.ci/docker/common/*'

				      - .github/workflows/build-conda-images.yml

				env:

				  DOCKER_REGISTRY: "docker.io"

				  DOCKER_BUILDKIT: 1

				  DOCKER_ID: ${{ secrets.DOCKER_ID }}

				  DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				  WITH_PUSH: ${{ github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release')) }}

				concurrency:

				@ -33,6 +31,7 @@ concurrency:

				jobs:

				  build-docker:

				    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}

				    runs-on: linux.9xlarge.ephemeral

				    strategy:

				      matrix:

				@ -54,6 +53,9 @@ jobs:

				            push: true

				      - name: Authenticate if WITH_PUSH

				        if: env.WITH_PUSH == 'true'

				        env:

				          DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				          DOCKER_ID: ${{ secrets.DOCKER_ID }}

				        run: |

				          if [[ "${WITH_PUSH}" == true ]]; then

				            echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin

									
32 .github/workflows/build-libtorch-images.yml vendored
												
				@ -22,8 +22,6 @@ on:

				env:

				  DOCKER_REGISTRY: "docker.io"

				  DOCKER_BUILDKIT: 1

				  DOCKER_ID: ${{ secrets.DOCKER_ID }}

				  DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				  WITH_PUSH: ${{ github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release')) }}

				concurrency:

				@ -31,8 +29,19 @@ concurrency:

				  cancel-in-progress: true

				jobs:

				  get-label-type:

				    name: get-label-type

				    uses: ./.github/workflows/_runner-determinator.yml

				    with:

				      triggering_actor: ${{ github.triggering_actor }}

				      issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}

				      curr_branch: ${{ github.head_ref || github.ref_name }}

				      curr_ref_type: ${{ github.ref_type }}

				  build-docker-cuda:

				    runs-on: linux.9xlarge.ephemeral

				    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}

				    needs: get-label-type

				    runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"

				    strategy:

				      matrix:

				        cuda_version: ["12.4", "12.1", "11.8"]

				@ -54,6 +63,9 @@ jobs:

				            push: true

				      - name: Authenticate if WITH_PUSH

				        if: env.WITH_PUSH == 'true'

				        env:

				          DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				          DOCKER_ID: ${{ secrets.DOCKER_ID }}

				        run: |

				          if [[ "${WITH_PUSH}" == true ]]; then

				            echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin

				@ -63,7 +75,9 @@ jobs:

				        run: |

				          .ci/docker/libtorch/build.sh libtorch-cxx11-builder:cuda${{matrix.cuda_version}}

				  build-docker-rocm:

				    runs-on: linux.9xlarge.ephemeral

				    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}

				    needs: get-label-type

				    runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"

				    strategy:

				      matrix:

				        rocm_version: ["6.1", "6.2"]

				@ -85,6 +99,9 @@ jobs:

				            push: true

				      - name: Authenticate if WITH_PUSH

				        if: env.WITH_PUSH == 'true'

				        env:

				          DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				          DOCKER_ID: ${{ secrets.DOCKER_ID }}

				        run: |

				          if [[ "${WITH_PUSH}" == true ]]; then

				            echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin

				@ -94,7 +111,9 @@ jobs:

				        run: |

				          .ci/docker/libtorch/build.sh libtorch-cxx11-builder:rocm${{matrix.rocm_version}}

				  build-docker-cpu:

				    runs-on: linux.9xlarge.ephemeral

				    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}

				    needs: get-label-type

				    runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"

				    steps:

				      - name: Checkout PyTorch

				        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main

				@ -110,6 +129,9 @@ jobs:

				            push: true

				      - name: Authenticate if WITH_PUSH

				        if: env.WITH_PUSH == 'true'

				        env:

				          DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				          DOCKER_ID: ${{ secrets.DOCKER_ID }}

				        run: |

				          if [[ "${WITH_PUSH}" == true ]]; then

				            echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin

									
86 .github/workflows/build-manywheel-images.yml vendored
												
				@ -12,11 +12,13 @@ on:

				      - v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+

				    paths:

				      - '.ci/docker/manywheel/*'

				      - '.ci/docker/manywheel/build_scripts/*'

				      - '.ci/docker/common/*'

				      - .github/workflows/build-manywheel-images.yml

				  pull_request:

				    paths:

				      - '.ci/docker/manywheel/*'

				      - '.ci/docker/manywheel/build_scripts/*'

				      - '.ci/docker/common/*'

				      - .github/workflows/build-manywheel-images.yml

				@ -24,8 +26,6 @@ on:

				env:

				  DOCKER_REGISTRY: "docker.io"

				  DOCKER_BUILDKIT: 1

				  DOCKER_ID: ${{ secrets.DOCKER_ID }}

				  DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				  WITH_PUSH: ${{ github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release')) }}

				concurrency:

				@ -33,8 +33,19 @@ concurrency:

				  cancel-in-progress: true

				jobs:

				  get-label-type:

				    name: get-label-type

				    uses: ./.github/workflows/_runner-determinator.yml

				    with:

				      triggering_actor: ${{ github.triggering_actor }}

				      issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}

				      curr_branch: ${{ github.head_ref || github.ref_name }}

				      curr_ref_type: ${{ github.ref_type }}

				  build-docker-cuda:

				    runs-on: linux.9xlarge.ephemeral

				    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}

				    needs: get-label-type

				    runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"

				    strategy:

				      matrix:

				        cuda_version: ["12.4", "12.1", "11.8"]

				@ -58,6 +69,9 @@ jobs:

				            push: true

				      - name: Authenticate if WITH_PUSH

				        if: env.WITH_PUSH == 'true'

				        env:

				          DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				          DOCKER_ID: ${{ secrets.DOCKER_ID }}

				        run: |

				          if [[ "${WITH_PUSH}" == true ]]; then

				            echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin

				@ -68,7 +82,9 @@ jobs:

				          .ci/docker/manywheel/build.sh manylinux-builder:cuda${{matrix.cuda_version}}

				  # NOTE: manylinux_2_28 are still experimental, see https://github.com/pytorch/pytorch/issues/123649

				  build-docker-cuda-manylinux_2_28:

				    runs-on: linux.9xlarge.ephemeral

				    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}

				    needs: get-label-type

				    runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"

				    strategy:

				      matrix:

				        cuda_version: ["12.4", "12.1", "11.8"]

				@ -92,6 +108,9 @@ jobs:

				            push: true

				      - name: Authenticate if WITH_PUSH

				        if: env.WITH_PUSH == 'true'

				        env:

				          DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				          DOCKER_ID: ${{ secrets.DOCKER_ID }}

				        run: |

				          if [[ "${WITH_PUSH}" == true ]]; then

				            echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin

				@ -101,7 +120,9 @@ jobs:

				        run: |

				          .ci/docker/manywheel/build.sh manylinux2_28-builder:cuda${{matrix.cuda_version}}

				  build-docker-cuda-aarch64:

				    runs-on: linux.arm64.2xlarge

				    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}

				    needs: get-label-type

				    runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.2xlarge.ephemeral"

				    strategy:

				      matrix:

				        cuda_version: ["12.4"]

				@ -121,6 +142,9 @@ jobs:

				            push: true

				      - name: Authenticate if WITH_PUSH

				        if: env.WITH_PUSH == 'true'

				        env:

				          DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				          DOCKER_ID: ${{ secrets.DOCKER_ID }}

				        run: |

				          if [[ "${WITH_PUSH}" == true ]]; then

				            echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin

				@ -130,7 +154,9 @@ jobs:

				        run: |

				          .ci/docker/manywheel/build.sh manylinuxaarch64-builder:cuda${{matrix.cuda_version}}

				  build-docker-rocm:

				    runs-on: linux.9xlarge.ephemeral

				    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}

				    needs: get-label-type

				    runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"

				    strategy:

				      matrix:

				        rocm_version: ["6.1", "6.2"]

				@ -152,6 +178,9 @@ jobs:

				            push: true

				      - name: Authenticate if WITH_PUSH

				        if: env.WITH_PUSH == 'true'

				        env:

				          DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				          DOCKER_ID: ${{ secrets.DOCKER_ID }}

				        run: |

				          if [[ "${WITH_PUSH}" == true ]]; then

				            echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin

				@ -161,7 +190,9 @@ jobs:

				        run: |

				          .ci/docker/manywheel/build.sh manylinux-builder:rocm${{matrix.rocm_version}}

				  build-docker-cpu:

				    runs-on: linux.9xlarge.ephemeral

				    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}

				    needs: get-label-type

				    runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"

				    steps:

				      - name: Checkout PyTorch

				        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main

				@ -177,6 +208,9 @@ jobs:

				            push: true

				      - name: Authenticate if WITH_PUSH

				        if: env.WITH_PUSH == 'true'

				        env:

				          DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				          DOCKER_ID: ${{ secrets.DOCKER_ID }}

				        run: |

				          if [[ "${WITH_PUSH}" == true ]]; then

				            echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin

				@ -186,7 +220,9 @@ jobs:

				        run: |

				          .ci/docker/manywheel/build.sh manylinux-builder:cpu

				  build-docker-cpu-manylinux_2_28:

				    runs-on: linux.9xlarge.ephemeral

				    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}

				    needs: get-label-type

				    runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"

				    env:

				      GPU_ARCH_TYPE: cpu-manylinux_2_28

				    steps:

				@ -204,6 +240,9 @@ jobs:

				            push: true

				      - name: Authenticate if WITH_PUSH

				        if: env.WITH_PUSH == 'true'

				        env:

				          DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				          DOCKER_ID: ${{ secrets.DOCKER_ID }}

				        run: |

				          if [[ "${WITH_PUSH}" == true ]]; then

				            echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin

				@ -213,7 +252,9 @@ jobs:

				        run: |

				          .ci/docker/manywheel/build.sh manylinux2_28-builder:cpu

				  build-docker-cpu-aarch64:

				    runs-on: linux.arm64.2xlarge

				    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}

				    needs: get-label-type

				    runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.2xlarge.ephemeral"

				    env:

				      GPU_ARCH_TYPE: cpu-aarch64

				    steps:

				@ -231,6 +272,9 @@ jobs:

				            push: true

				      - name: Authenticate if WITH_PUSH

				        if: env.WITH_PUSH == 'true'

				        env:

				          DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				          DOCKER_ID: ${{ secrets.DOCKER_ID }}

				        run: |

				          if [[ "${WITH_PUSH}" == true ]]; then

				            echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin

				@ -240,7 +284,9 @@ jobs:

				        run: |

				          .ci/docker/manywheel/build.sh manylinuxaarch64-builder:cpu-aarch64

				  build-docker-cpu-aarch64-2_28:

				    runs-on: linux.arm64.2xlarge

				    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}

				    needs: get-label-type

				    runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.2xlarge.ephemeral"

				    env:

				      GPU_ARCH_TYPE: cpu-aarch64-2_28

				    steps:

				@ -258,16 +304,24 @@ jobs:

				            push: true

				      - name: Authenticate if WITH_PUSH

				        if: env.WITH_PUSH == 'true'

				        env:

				          DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				          DOCKER_ID: ${{ secrets.DOCKER_ID }}

				        run: |

				          if [[ "${WITH_PUSH}" == true ]]; then

				            echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin

				          fi

				      - name: Build Docker Image

				        if: env.WITH_PUSH == 'true'

				        env:

				          DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				          DOCKER_ID: ${{ secrets.DOCKER_ID }}

				        run: |

				          .ci/docker/manywheel/build.sh manylinux2_28_aarch64-builder:cpu-aarch64

				  build-docker-cpu-cxx11-abi:

				    runs-on: linux.9xlarge.ephemeral

				    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}

				    needs: get-label-type

				    runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"

				    env:

				      GPU_ARCH_TYPE: cpu-cxx11-abi

				    steps:

				@ -285,6 +339,9 @@ jobs:

				            push: true

				      - name: Authenticate if WITH_PUSH

				        if: env.WITH_PUSH == 'true'

				        env:

				          DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				          DOCKER_ID: ${{ secrets.DOCKER_ID }}

				        run: |

				          if [[ "${WITH_PUSH}" == true ]]; then

				            echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin

				@ -294,7 +351,9 @@ jobs:

				        run: |

				          .ci/docker/manywheel/build.sh manylinuxcxx11-abi-builder:cpu-cxx11-abi

				  build-docker-xpu:

				    runs-on: linux.9xlarge.ephemeral

				    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}

				    needs: get-label-type

				    runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"

				    env:

				      GPU_ARCH_TYPE: xpu

				    steps:

				@ -312,6 +371,9 @@ jobs:

				            push: true

				      - name: Authenticate if WITH_PUSH

				        if: env.WITH_PUSH == 'true'

				        env:

				          DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}

				          DOCKER_ID: ${{ secrets.DOCKER_ID }}

				        run: |

				          if [[ "${WITH_PUSH}" == true ]]; then

				            echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin

Compare commits

1691 Commits fastmath_b ... cslpull92

6 .ci/docker/aotriton_version.txt
61 .ci/docker/build.sh
4 .ci/docker/centos-rocm/Dockerfile
2 .ci/docker/ci_commit_pins/executorch.txt
2 .ci/docker/ci_commit_pins/halide.txt
2 .ci/docker/ci_commit_pins/timm.txt
1 .ci/docker/ci_commit_pins/triton-rocm.txt
2 .ci/docker/ci_commit_pins/triton-xpu.txt
2 .ci/docker/ci_commit_pins/triton.txt
4 .ci/docker/common/install_aotriton.sh
33 .ci/docker/common/install_conda.sh
25 .ci/docker/common/install_cpython.sh
25 .ci/docker/common/install_cuda.sh
12 .ci/docker/common/install_cuda_aarch64.sh
25 .ci/docker/common/install_cudss.sh Normal file
10 .ci/docker/common/install_cusparselt.sh
51 .ci/docker/common/install_miopen.sh
9 .ci/docker/common/install_onnx.sh
25 .ci/docker/common/install_triton.sh
20 .ci/docker/common/install_xpu.sh
6 .ci/docker/conda/build.sh
1 .ci/docker/manywheel/Dockerfile
4 .ci/docker/manywheel/Dockerfile_2_28
9 .ci/docker/manywheel/build.sh
32 .ci/docker/requirements-ci.txt
2 .ci/docker/triton_version.txt
6 .ci/docker/ubuntu-cuda/Dockerfile
9 .ci/docker/ubuntu-rocm/Dockerfile
1 .ci/docker/ubuntu-xpu/Dockerfile
29 .ci/pytorch/build.sh
10 .ci/pytorch/create_test_cert.py
19 .ci/pytorch/macos-test.sh
55 .ci/pytorch/test.sh
23 .ci/pytorch/win-test-helpers/build_pytorch.bat
91 .ci/pytorch/win-test-helpers/installation-helpers/install_xpu.bat Normal file
1 .ci/pytorch/win-test-helpers/setup_pytorch_env.bat
2 .ci/pytorch/win-test-helpers/test_custom_backend.bat
2 .ci/pytorch/win-test-helpers/test_custom_script_ops.bat
2 .ci/pytorch/win-test-helpers/test_libtorch.bat
6 .ci/pytorch/win-test.sh
11 .circleci/scripts/binary_linux_test.sh
6 .circleci/scripts/binary_populate_env.sh
5 .circleci/scripts/binary_windows_build.sh
4 .circleci/scripts/binary_windows_test.sh
2 .flake8
30 .github/actionlint.yaml vendored
2 .github/actions/filter-test-configs/action.yml vendored
2 .github/actions/pytest-cache-download/action.yml vendored
2 .github/actions/pytest-cache-upload/action.yml vendored
2 .github/actions/setup-linux/action.yml vendored
2 .github/actions/teardown-win/action.yml vendored
2 .github/ci_commit_pins/audio.txt vendored
39 .github/label_to_label.yml vendored
160 .github/lf-canary-scale-config.yml vendored
160 .github/lf-scale-config.yml vendored
19 .github/merge_rules.yaml vendored
5 .github/nitpicks.yml vendored Normal file
1 .github/pytorch-probot.yml vendored
2 .github/requirements/conda-env-iOS.txt vendored
4 .github/requirements/pip-requirements-macOS.txt vendored
26 .github/scripts/build_triton_wheel.py vendored
11 .github/scripts/check_labels.py vendored
3 .github/scripts/cherry_pick.py vendored
79 .github/scripts/generate_binary_build_matrix.py vendored
35 .github/scripts/generate_ci_workflows.py vendored
22 .github/scripts/github_utils.py vendored
7 .github/scripts/lintrunner.sh vendored
373 .github/scripts/runner_determinator.py vendored
39 .github/scripts/s390x-ci/README.md vendored
33 .github/scripts/s390x-ci/self-hosted-builder/actions-runner.Dockerfile vendored
6 .github/scripts/s390x-ci/self-hosted-builder/actions-runner@.service vendored
42 .github/scripts/s390x-ci/self-hosted-builder/fs/usr/bin/actions-runner vendored
84 .github/scripts/s390x-ci/self-hosted-builder/helpers/app_token.sh vendored Executable file
10 .github/scripts/s390x-ci/self-hosted-builder/helpers/gh_token_generator.sh vendored Executable file
35 .github/scripts/sync_distributed_folder_prototype.sh vendored
2 .github/scripts/tag_docker_images_for_release.py vendored
1 .github/scripts/test_check_labels.py vendored
237 .github/scripts/test_runner_determinator.py vendored Normal file


48 .github/scripts/trymerge.py vendored