pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 21:49:24 +08:00

Author	SHA1	Message	Date
cyy	9538bf4e7c	[2/N] Remove inclusion of c10/util/string_utils.h (#128372 ) Follows #128300. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128372 Approved by: https://github.com/aaronenyeshi	2024-06-12 01:18:20 +00:00
cyy	219da29dfd	[7/N] Remove unused functions (#128407 ) Follows #128309 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128407 Approved by: https://github.com/ezyang	2024-06-12 01:10:33 +00:00
cyy	fb013ecb24	Remove unused private List::ptr_to_first_element (#128405 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128405 Approved by: https://github.com/ezyang	2024-06-12 01:07:14 +00:00
Kurman Karabukaev	6af4c6acad	Migrate test to internal base class, fixes (#128367 ) Summary: ## Remove etcd deps converted tests to non-etcd based rdzv handler so that tests don't have dependency on etcd server ## Adopt pytorch test conventions - test starts with `test_TESTS.py` - Test base class is torch.testing._internal.common_utils.TestCase - include __main__ handler ## reduce test timing (used to take > 300 seconds): 3.05s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_init_method_env_with_torchelastic 2.59s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_init_method_tcp_with_torchelastic 2.33s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_worker_raise_exception 2.33s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_run_path 2.30s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_launch_auto_configurations 2.24s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_is_torchelastic_launched_with_logs_spec_defined 2.24s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_is_torchelastic_launched 2.17s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_multiple_agents 2.12s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic 2.08s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_gpu_launch_configurations 1.32s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_standalone 1.05s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_launch_number_configurations 1.05s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_with_env_vars 1.05s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_python 1.05s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_python_caffe2_bc 1.04s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_bash 1.03s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_default_nproc 0.04s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_logs_logs_spec_entrypoint_must_be_defined 0.01s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_agent_raise_exception 0.01s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_shutdown Test Plan: pytest --durations=0 test/distributed/launcher/run_test.py Differential Revision: D58388182 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128367 Approved by: https://github.com/d4l3k	2024-06-12 01:03:40 +00:00
Bin Bao	786c24a4cd	[inductor] Always realize sigmoid for CPU (#128339 ) Summary: Currently the CPU backend prefers to always realize exp because it's a heavy op on CPU. For the same reason, we need to realize sigmoid as well. This solves a problem in llama2 inference where exp was repeated many times in an inner loop. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128339 Approved by: https://github.com/eellison, https://github.com/helloguo, https://github.com/jansel, https://github.com/jgong5, https://github.com/peterbell10	2024-06-12 00:46:33 +00:00
PyTorch MergeBot	5d8c7f39d4	Revert "Introduce int_oo (#127693 )" This reverts commit 9cab5987bdeb66df8efbc581b3469bfe300e168c. Reverted https://github.com/pytorch/pytorch/pull/127693 on behalf of https://github.com/clee2000 due to sorry executorch CI is a bit weird regarding pins, I'll make a chat with mergen with the choices of what to do and how it'll affect executorch CI, reverting for now to prevent more divergences in the meantime ([comment](https://github.com/pytorch/pytorch/pull/127693#issuecomment-2161775400))	2024-06-11 23:36:08 +00:00
PyTorch MergeBot	c9c1fed065	Revert "Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374 )" This reverts commit c13e03c87428b986972a48d8fc78dbffc2579f63. Reverted https://github.com/pytorch/pytorch/pull/128374 on behalf of https://github.com/clee2000 due to sorry I need to revert this in order to revert something else, to remerge, just rebase and fix the merge conflict ([comment](https://github.com/pytorch/pytorch/pull/128374#issuecomment-2161772864))	2024-06-11 23:34:03 +00:00
Andrew Hoblitzell	94fea82d66	init sub comment (#128082 ) Fixes #127905 ### Description Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function ### Checklist - [x] The issue that is being fixed is referred in the description - [x] Only one issue is addressed in this pull request - [x] Labels from the issue that this PR is fixing are added to this pull request - [x] No unnecessary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/128082 Approved by: https://github.com/titaiwangms	2024-06-11 22:42:35 +00:00
Andrea Frittoli	447173198b	Add docstring for the torch.fx.operator_schemas.create_type_hint func… (#128139 ) Fixes: #127916 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128139 Approved by: https://github.com/SherlockNoMad	2024-06-11 22:42:11 +00:00
angelayi	b79d056e76	[export] Fix unflattener for preserving modules containing unused inputs (#128260 ) Currently the unflattener fails if the module whose signature it is preserving contains unused inputs/outputs. This also fixes unflattener issues in D57829276. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128260 Approved by: https://github.com/pianpwk	2024-06-11 22:32:08 +00:00
Chirag Pandya	eb567b1f40	Pass params to dump_nccl_trace_pickle (#128307 ) Summary: Pass parameters from request to dump_nccl_trace_pickle handler. The supported parameters + value are all lowercase. includecollectives={true, false} includestacktraces={true, false} onlyactive={true, false} Example post is: /handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true Test Plan: unit tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/128307 Approved by: https://github.com/d4l3k ghstack dependencies: #128191	2024-06-11 22:28:53 +00:00
Chirag Pandya	1dd2431f86	[Test] Add test for only_active flag (#128191 ) Summary: Add a unit test for the only_active flag to _dump_nccl_trace API call. With this flag, we only expect active records to be returned. Test Plan: Unit test. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/128191 Approved by: https://github.com/d4l3k	2024-06-11 22:26:01 +00:00
Andrew Hoblitzell	5fcb5f0c8b	init reshape_from_tensor_shape comment (#128171 ) Fixes #127897 ### Description Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function ### Checklist - [x] The issue that is being fixed is referred in the description - [x] Only one issue is addressed in this pull request - [x] Labels from the issue that this PR is fixing are added to this pull request - [x] No unnecessary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/128171 Approved by: https://github.com/titaiwangms	2024-06-11 21:56:33 +00:00
rzou	a55d0d9718	Fix side effect pruning (#128028 ) Summary: The previous side effect pruning algorithm would keep many dead cell variables alive. For example, in https://github.com/pytorch/pytorch/issues/125078, the compiled function has one return but there were three in the Dynamo graph due to two dead cell variables not being pruned away. This PR adds a corrected algorithm. "new cell variables" are alive if they can be reached from one of the following: 1. any of the tx.symbolic_locals or tx.stack (that is, if they are involved in a return from the function or intermediate variable during a graph break). Example: an alive NestedUserFunctionVariable 2. "mutations to pre-existing objects". Example: appending a NestedUserFunctionVariable to a global list The new algorithm reflects this, but please let me know if there are more cases to handle. Test Plan: - existing tests (afaict, test/dynamo/test_python_autograd is the best SideEffects test case we have) - see in test/dynamo/test_higher_order_ops that the expecttests changed -- the functorch dynamo graphs no longer return dead cellvars. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028 Approved by: https://github.com/jansel	2024-06-11 21:40:48 +00:00
Andrew Gu	8c1247cffb	[BE] Fixed CPU autocast warning (#127774 ) This PR fixes ``` /data/users/andgu/pytorch/torch/utils/checkpoint.py:1398: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127774 Approved by: https://github.com/soulitzer, https://github.com/Skylion007, https://github.com/tianyu-l	2024-06-11 21:33:35 +00:00
Will Feng	70a1e85718	[Traceable FSDP2] Use custom ops for AllGather copy-in / copy-out and ReduceScatter copy-in (#127856 ) Making these operations into custom ops helps Inductor identify these ops and enforce the FSDP communication op ordering. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127856 Approved by: https://github.com/awgu	2024-06-11 20:15:03 +00:00
PyTorch MergeBot	adb699189b	Revert "[RELAND][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578 )" This reverts commit b2d602306a9eb19e30328cbaee941c874f8148a9. Reverted https://github.com/pytorch/pytorch/pull/126578 on behalf of https://github.com/clee2000 due to failed internal test D58394084. Author has forward fix but includes external changes so reverting is a bit easier to coordinate ([comment](https://github.com/pytorch/pytorch/pull/126578#issuecomment-2161481839))	2024-06-11 19:41:41 +00:00
eqy	45dccfddcd	[cuDNN][SDPA] Support different key, value dimension in cuDNN SDPA (#128350 ) CC @vedaanta-nvidia @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/128350 Approved by: https://github.com/Skylion007	2024-06-11 19:22:21 +00:00
yuqingj	3e09123797	Enable UFMT on test_nestedtensor.py (#128359 ) split it into two PRs since it is more than 2k lines of change Pull Request resolved: https://github.com/pytorch/pytorch/pull/128359 Approved by: https://github.com/davidberard98	2024-06-11 19:14:04 +00:00
BowenBao	61f922c2ca	Fix 'get_real_value' on placeholder nodes (#127698 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127698 Approved by: https://github.com/jansel ghstack dependencies: #127695, #127696	2024-06-11 18:57:25 +00:00
BowenBao	984b1a8c35	Fix 'get_attr' call in dynamo 'run_node' (#127696 ) Fixes #124858 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127696 Approved by: https://github.com/jansel ghstack dependencies: #127695	2024-06-11 18:57:25 +00:00
Jing Xu	205410cb44	add xpu to torch.tensors (#127280 ) As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to torch.tensors doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127280 Approved by: https://github.com/svekars	2024-06-11 18:13:01 +00:00
Eddie Yan	cac7a22b92	[cuDNN][Quantization] Don't print when plan finalization fails in cuDNN quantization backend (#128177 ) Similar in spirit to #125790, hopefully addresses failures seen for cuDNN 9.1 upgrade: https://github.com/pytorch/pytorch/pull/128166 CC @nWEIdia @atalman Pull Request resolved: https://github.com/pytorch/pytorch/pull/128177 Approved by: https://github.com/nWEIdia, https://github.com/Skylion007	2024-06-11 18:09:25 +00:00
Wanchao Liang	8a09940a54	[inductor] fix compile time regression by caching get_gpu_type (#128363 ) We observed a significant compile time regression in torchtitan when turning on 2D parallel + torch.compile recently, so I decided to get a deeper understanding of why. It turns out this affects all training jobs that have functional collectives captured in the graph, not only 2D parallel (2D parallel was just the job that happened to have collectives captured in the TP region). The root cause is that when doing inductor lowering, we call the comm analysis pass to get an estimated collective time for each collective node in the graph; for each call to check the collective node, we call `get_gpu_type()`, which under the hood calls `torch.utils.collect_env.run` to get the GPU info. However, this call is super expensive! The reason is that this call effectively spawns a new process and calls `nvidia-smi` to get the GPU info, so the cost is linear in the number of collective nodes in the graph. See https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py#L75 The fix is to add an LRU cache to the function, so that we only call this once and reuse the cached results afterwards. The torchtitan benchmark shows: * before this fix: 2D parallel + fp8 compile time: 6min + * after this fix: 2D parallel + fp8 compile time: 2min 48s (more than 100% improvement) There is more room to improve the compile time, but this PR is trying to fix the biggest regression I found so far. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128363 Approved by: https://github.com/yf225	2024-06-11 18:02:13 +00:00
PyTorch MergeBot	1d233b8f50	Revert "Make nn.Module state_dict load_state_dict pre-hook and state_dict post hook public (#126704 )" This reverts commit c38b3381a12a0ec033dd417827c530c4474b8165. Reverted https://github.com/pytorch/pytorch/pull/126704 on behalf of https://github.com/clee2000 due to broke internal typecheck D58394110 (which probably means the code wouldn't work either but I guess it didn't run on the diff). Probably an easy fix? ([comment](https://github.com/pytorch/pytorch/pull/126704#issuecomment-2161299193))	2024-06-11 17:45:20 +00:00
PyTorch MergeBot	491c4a5dcb	Revert "Make sure #126704 is BC for torch.save-ed `nn.Module` (#128344 )" This reverts commit 841d87177a900c2bbd59b6589165189141c4e8bb. Reverted https://github.com/pytorch/pytorch/pull/128344 on behalf of https://github.com/clee2000 due to broke internal typecheck D58394110 (which probably means the code wouldn't work either but I guess it didn't run on the diff). Probably an easy fix? ([comment](https://github.com/pytorch/pytorch/pull/126704#issuecomment-2161299193))	2024-06-11 17:45:20 +00:00
Angela Yi	4345d98663	[dynamo] Fix for #127696 (#128358 ) Test Plan: `buck2 test @//mode/dev-nosan //executorch/exir/backend/...` https://www.internalfb.com/intern/testinfra/testrun/12666373989243932 Differential Revision: D58384518 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128358 Approved by: https://github.com/ydwu4	2024-06-11 16:43:15 +00:00
ankurneog	a838e90964	Add Intel Gaudi device/HPU to auto load in instantiate_device_type_tests (#126970 ) ### Motivation The Intel Gaudi accelerator (device name hpu) is seen to have a good pass rate with the PyTorch framework UTs; however, because it is an out-of-tree device, we face challenges in adapting the device to natively run the existing PyTorch UTs under pytorch/test. The UTs are nevertheless a good indicator of the device stack health, and as such we run them regularly with adaptations. Although we can add the Gaudi/HPU device to generate the device-specific tests using the TORCH_TEST_DEVICES environment variable, we miss out on a lot of features such as executing for specific dtypes, skipping and overriding opInfo. With significant changes introduced in every PyTorch release, maintaining these adaptations becomes difficult and time consuming. Hence with this PR we introduce the Gaudi device in the common_device_type framework, so that the tests are instantiated for Gaudi when the library is loaded. The eventual goal is to introduce Gaudi out-of-tree support as equivalent to in-tree devices ### Changes Add HPUTestBase of type DeviceTypeTestBase specifying appropriate attributes for Gaudi/HPU. Include code to check if the Intel Gaudi Software library is loaded and, if so, add the device to the list of devices considered for instantiation of device type tests ### Additional Context Please refer to the following RFC: https://github.com/pytorch/rfcs/pull/63/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/126970 Approved by: https://github.com/albanD	2024-06-11 16:35:17 +00:00
David Berard	29081059b6	[Static Runtime] Fix & run gen_static_runtime_ops (#128299 ) gen_static_runtime_ops hasn't been updated in a while. In preparation for https://github.com/pytorch/pytorch/pull/127675 in which I need to re-run the codegen step for cumprod, I want to land these changes beforehand in case there are any other issues that arise. I added a number of ops to the blocklist: ``` + "_nested_tensor_storage_offsets", + "_nested_get_values", # no CPU backend + "_nested_get_values_copy", # no CPU backend + "_nested_view_from_jagged", # testing needs to be patched + "_nested_view_from_jagged_copy", # testing needs to be patched + "_nested_view_from_buffer", # testing needs to be patched + "_nested_view_from_buffer_copy", # testing needs to be patched + "_int_mm", # testing needs to be patched + "_to_sparse_csc", # testing needs to be patched + "_to_sparse_csr", # testing needs to be patched + "segment_reduce", # testing needs to be patched ``` Most of these are added just because testing doesn't work right now. Additionally, a few `fft` ops seem to have been removed from native_functions.yaml; I'm guessing it's unlikely FFT would have been used in many real models though. Differential Revision: [D58329403](https://our.internmc.facebook.com/intern/diff/D58329403/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128299 Approved by: https://github.com/YuqingJ	2024-06-11 16:27:39 +00:00
Nikita Shulga	f8c45996d5	[MPS] Make erfinv compilable for bfloat16 (#128375 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128375 Approved by: https://github.com/Skylion007 ghstack dependencies: #128373	2024-06-11 16:04:11 +00:00
Aaron Orenstein	c13e03c874	Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128374 Approved by: https://github.com/Skylion007	2024-06-11 15:58:28 +00:00
Nikita Shulga	053930e194	[MPS][BE] Remove code duplication (#128373 ) Use `scalarToMetalTypeString` instead of `getMetalType` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128373 Approved by: https://github.com/Skylion007	2024-06-11 15:58:04 +00:00
Huamin Li	9a38cae299	[AOTI] Switch to use shim v2 (#127674 ) Differential Revision: D56709309 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127674 Approved by: https://github.com/desertfire	2024-06-11 15:01:25 +00:00
kareem mohiddeen shaik	55901fb3da	[fx] Preserve Fx graph node order in partitioner across runs (#115621 ) Fixes #ISSUE_NUMBER partitioner generates different graph in recompilation on each run Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621 Approved by: https://github.com/ezyang	2024-06-11 14:04:52 +00:00
IvanKobzarev	fc77fdca6f	[guard_size_oblivious] Add gso ExpandUtils:_sym_to (#128224 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128224 Approved by: https://github.com/ezyang	2024-06-11 14:01:34 +00:00
FFFrog	648625b230	Make TraceUtils.h to be device-agnostic (#126969 ) Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files. In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969 Approved by: https://github.com/c-p-i-o	2024-06-11 08:38:07 +00:00
Peter Bell	207c2248a8	[inductor] Fix lowering full with SymBool value (#128213 ) Fixes #128161, fixes #128095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128213 Approved by: https://github.com/lezcano	2024-06-11 08:33:35 +00:00
Colin L Reliability Rice	a206dcc79e	fb_memcache: Move to fbcode from thirdparty (#128174 ) Summary: The fb_memcache injections location and path is changing. Test Plan: Existing tests should pass. Reviewed By: bertmaher, oulgen Differential Revision: D57973772 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128174 Approved by: https://github.com/oulgen	2024-06-11 07:46:12 +00:00
Animesh Jain	f2d7f235a6	[dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269 ) Fixes https://github.com/pytorch/pytorch/issues/101168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128269 Approved by: https://github.com/jansel ghstack dependencies: #128295, #126578, #128268, #128254	2024-06-11 07:09:04 +00:00
Michael Lazos	402b289f3b	Properly register parameter for binary folding test (#128356 ) This PR properly registers the tensor used in the module compute as a parameter. This bug was hidden previously because all tensors on the nn modules would be considered constant by dynamo, with inlining NN modules, this is no longer the case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128356 Approved by: https://github.com/anijain2305 ghstack dependencies: #128355	2024-06-11 06:48:26 +00:00
Michael Lazos	a32157c67c	Mark params static if inlining modules and freezing (#128355 ) Today inlining builtin nn modules is not compatible with parameter freezing. Freezing parameters and then constant folding them through the graph relies on the assumption that they will not be inputs and will be static across calls to the same graph. When inlining builtin nn modules this assumption is broken and we reuse the same graph for different instances of the same nn module. There are three options: 1) abandon constant folding, 2) create a dispatcher layer (like cudagraphs) which will dispatch to the correct constant-folded graph for each distinct set of parameters, or 3) recompile. This PR implements option 3 by introducing guards on the parameter pointers. This was chosen because freezing is relatively rare and performance sensitive. Option 2 had many more unknowns and option 1 is not viable due to the drop in performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128355 Approved by: https://github.com/anijain2305	2024-06-11 06:48:26 +00:00
Lourenco Matos	24e7f29099	Lowering for avg_pool_3d_backward (Fixes:#127101) (#127722 ) We implemented a lowering for the avg_pool3d_backward operation and created tests for it. We ran some benchmarks and achieved the following results: ``` [-------------- avgpool_3d_backwards --------------] \| Decomposed \| Eager 16 threads: ---------------------------------------- (3, 5, 400, 200, 200) \| 6061 \| 11160 (3, 5, 300, 200, 200) \| 4547 \| 8372 (3, 5, 200, 200, 200) \| 3032 \| 5585 (3, 5, 300, 300, 300) \| 10100 \| 18840 (3, 5, 100, 100, 100) \| 381 \| 703 (3, 5, 100, 300, 200) \| 2270 \| 4190 (8, 8, 128, 128, 128) \| 3397 \| 6253 (2, 3, 150, 150, 150) \| 520 \| 947 (1, 3, 128, 128, 128) \| 161 \| 299 (8, 16, 64, 64, 64) \| 851 \| 1569 (1, 1, 50, 50, 50) \| 17 \| 11 (3, 5, 20, 40, 40) \| 17 \| 30 (3, 5, 10, 20, 20) \| 17 \| 11 (1, 1, 10, 10, 10) \| 16 \| 11 (3, 5, 5, 10, 10) \| 17 \| 11 (3, 5, 2, 5, 5) \| 17 \| 11 ``` These were run on an RTX 3050, so we were not able to allocate larger tensors due to memory limitations. We believe it would be beneficial to benchmark this on more recent hardware, just to check if the performance holds up with larger sizes. Furthermore, we also refactored code from adaptive_avg_pool2d and adaptive_max_pool2d, to reduce code duplication. We diffed the kernels and they are identical. Fixes #127101 Co-authored-by: Martim Mendes <martimccmendes@tecnico.ulisboa.pt> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127722 Approved by: https://github.com/jansel	2024-06-11 06:39:04 +00:00
Oguz Ulgen	5b5d269d34	Speed up fx graph iteration by implementing it in C++ (#128288 ) Before this change ``` python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py iterating over 100000000 FX nodes took 19.5s (5132266 nodes/s) ``` After this change ``` python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py iterating over 100000000 FX nodes took 3.4s (29114001 nodes/s) ``` 5.7x improvement Differential Revision: [D58343997](https://our.internmc.facebook.com/intern/diff/D58343997) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128288 Approved by: https://github.com/jansel, https://github.com/albanD	2024-06-11 05:48:31 +00:00
PyTorch MergeBot	fa88f390a0	Revert "[inductor] enable fx graph cache on torchbench (#128239 )" This reverts commit 734e8f6ad7e7f0fa0341fb658f1f986225173f5f. Reverted https://github.com/pytorch/pytorch/pull/128239 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to surface a bunch of inductor failures in trunk `734e8f6ad7` ([comment](https://github.com/pytorch/pytorch/pull/128239#issuecomment-2159789242))	2024-06-11 04:53:38 +00:00
Ke Wen	fe39c07826	[pipelining][doc] Remove duplicated words (#128368 ) "for execution" is used in both step titles Pull Request resolved: https://github.com/pytorch/pytorch/pull/128368 Approved by: https://github.com/wconstab ghstack dependencies: #128361	2024-06-11 04:52:57 +00:00
Wang, Eikan	cba195c8ed	Support aten operations with out tensor (#124926 ) This PR intends to support the aten operations with the `out` tensor. Currently, the AOT compile always does NOT keep input tensor mutations. According to the comments, this is because it has not encountered such a use case. > For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to. However, for aten operations, it is popular that the `out` tensor is an input parameter and needs to be mutated. This PR intends to support it by adding a `keep_inference_input_mutations` flag to `aot_inductor.keep_inference_input_mutations`. This flag can provide flexibility to the callee in deciding whether the AOT compile needs to keep input tensor mutations in the graph. Take `clamp` as an example as follows. ```python out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0) inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0) min_tensor = inp_tensor - 0.05 max_tensor = inp_tensor + 0.05 torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor) ``` W/O this PR ```python def forward(self): arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]"; arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec) clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None return (clamp_max, clamp_max) ``` W/ this PR ```python def forward(self): arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]"; arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec) clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max); arg3_1 = clamp_max = None return (copy_,) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi	2024-06-11 04:35:27 +00:00
Edward Z. Yang	16e67be7f1	Also preserve unbacked SymInts when partitioning as backward inputs (#128338 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128338 Approved by: https://github.com/IvanKobzarev	2024-06-11 04:27:09 +00:00
zengxian	7afffdf48b	[CI] Comment hf_T5_generate, hf_GPT2 and timm_efficientnet in inductor cpu smoketest for performance unstable issue (#127588 ) Fixes #126993 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127588 Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/desertfire	2024-06-11 03:12:11 +00:00
Animesh Jain	ca45649eb5	[easy][dynamo][inline work] Fix test with inlining inbuilt nn modules (#128254 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128254 Approved by: https://github.com/williamwen42 ghstack dependencies: #128295, #126578, #128268	2024-06-11 03:02:51 +00:00
Animesh Jain	665e568381	[inductor][inlining nn module] Skip batchnorm version check test for inlining (#128268 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128268 Approved by: https://github.com/zou3519 ghstack dependencies: #128295, #126578	2024-06-11 03:02:51 +00:00
Ke Wen	4077cdd589	[pipelining][doc] Update arg list of pipeline API (#128361 ) And document the use of `build_stage` API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128361 Approved by: https://github.com/wconstab	2024-06-11 02:55:17 +00:00
cyy	e4bd0adca5	[6/N] Remove unused functions (#128309 ) Follows #127185 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128309 Approved by: https://github.com/ezyang	2024-06-11 02:46:33 +00:00
eellison	793df7b7cb	Prevent expansion of cat indexing to avoid int64 intermediate (#127815 ) Fix for https://github.com/pytorch/pytorch/issues/127652 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127815 Approved by: https://github.com/shunting314, https://github.com/peterbell10	2024-06-11 02:41:07 +00:00
Andrew Hoblitzell	d1d9bc7aa6	init add comment (#128083 ) Fixes #127898 ### Description Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function ### Checklist - [x] The issue that is being fixed is referred in the description - [x] Only one issue is addressed in this pull request - [x] Labels from the issue that this PR is fixing are added to this pull request - [x] No unnecessary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/128083 Approved by: https://github.com/titaiwangms	2024-06-11 02:37:04 +00:00
Mikayla Gawarecki	841d87177a	Make sure #126704 is BC for torch.save-ed `nn.Module` (#128344 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128344 Approved by: https://github.com/albanD ghstack dependencies: #126906, #126704	2024-06-11 02:26:06 +00:00
Arun Pa	3b555ba477	Add docstring for torch.utils.data.datapipes.decoder.basichandlers (#128018 ) Fixes #127912 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128018 Approved by: https://github.com/andrewkho	2024-06-11 01:32:45 +00:00
Sam Larsen	734e8f6ad7	[inductor] enable fx graph cache on torchbench (#128239 ) Summary: We've already enabled for timm and huggingface, but we had failures saving cache entries for moco. It looks like https://github.com/pytorch/pytorch/pull/128052 has fixed that issue, so we can enable for torchbench. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128239 Approved by: https://github.com/oulgen	2024-06-11 00:40:31 +00:00
cyy	99f5a85a09	[Clang Tidy] Fix misc-header-include-cycle errors in clang-tidy and ignore some files (#127233 ) Since there are such cycles in libfmt and PyTorch, which are detected by clang-tidy. ``` /home/cyy/pytorch/third_party/fmt/include/fmt/format-inl.h:25:10: error: circular header file dependency detected while including 'format.h', please check the include path [misc-header-include-cycle,-warnings-as-errors] 25 \| #include "format.h" \| ^ /home/cyy/pytorch/third_party/fmt/include/fmt/format.h:4530:12: note: 'format-inl.h' included from here 4530 \| # include "format-inl.h" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127233 Approved by: https://github.com/ezyang	2024-06-10 23:49:58 +00:00
Jun Luo	f843ccbb1a	[MTIA] Add set_device support (#128040 ) Summary: Support set_device API in MTIA backend. Reviewed By: gnahzg Differential Revision: D58089498 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128040 Approved by: https://github.com/gnahzg	2024-06-10 23:42:52 +00:00
cyy	30875953a4	[1/N] Remove inclusion of c10/util/string_utils.h (#128300 ) As a first step to remove it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128300 Approved by: https://github.com/ezyang, https://github.com/eqy	2024-06-10 23:40:47 +00:00
cyy	2126ae186e	Remove caffe2/perfkernels files (#128186 ) These files are not used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128186 Approved by: https://github.com/ezyang, https://github.com/r-barnes	2024-06-10 23:40:18 +00:00
Jiashen Cao	739aa224ec	[Fix] Parameter un/lifting issues in the TorchScript to ExportedProgram converter (#127975 ) This PR fixes issues related to parameters and inputs lifting in the converter. #### Issue 1 ``` > Graph[linear.weights, bias.weights, x.1] %1 ... %2 ... %3 = CreateObject() > Block 0[] %linear.0 = GetAttr(linear)[%3] > Block 0.0[] %weight.0 = GetAttr(weights)[%linear.0] > Block 1[] ... ``` * Model parameters for the top level module should be unlifted, while parameters from sub-blocks should be lifted. #### Fixes * Bottom-up traversal (i.e., start from the inner most block) to figure out which parameters to be lifted for sub-blocks. #### Test Plan * Add test cases for nested block without control flow `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_param` * Add test cases for nested block with control flow `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_if_and_param` #### Outcome ##### TorchScript ``` graph(%x.1 : Float(3, strides=[1], requires_grad=0, device=cpu), %m1.m1.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu), %m1.m1.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu), %m1.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu), %m1.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu), %m1.m2.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu), %m1.m2.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu), %linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu), %linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu), %m2.m1.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu), %m2.m1.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu), %m2.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu), %m2.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu), %m2.m2.linear.weight : Float(3, 3, 
strides=[3, 1], requires_grad=0, device=cpu), %m2.m2.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu)): %15 : __torch__.export.test_converter.___torch_mangle_14.SuperNestedM1 = prim::CreateObject() %16 : NoneType = prim::Constant(), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %17 : int = prim::Constant[value=1](), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:34 %18 : Tensor = aten::max(%x.1), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:19 %19 : Tensor = aten::gt(%18, %17), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:19 %20 : bool = aten::Bool(%19), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:19 %21 : Tensor = prim::If(%20), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:16 block0(): %linear.6 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%15), scope: export.test_converter.SuperNestedM1:: %m1.1 : __torch__.export.test_converter.___torch_mangle_15.NestedM = prim::GetAttr[name="m1"](%15), scope: export.test_converter.SuperNestedM1:: %24 : Tensor = aten::sum(%x.1, %16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19 %25 : Tensor = aten::gt(%24, %17), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19 %26 : bool = aten::Bool(%25), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19 %27 : Tensor = prim::If(%26), scope: 
export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:16 block0(): %linear.10 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %m1.3 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m1"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %linear.12 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %weight.4 : Tensor = prim::GetAttr[name="weight"](%linear.12), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %bias.4 : Tensor = prim::GetAttr[name="bias"](%linear.12), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %33 : Tensor = aten::linear(%x.1, %weight.4, %bias.4), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 %weight.6 : Tensor = prim::GetAttr[name="weight"](%linear.10), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %bias.6 : Tensor = prim::GetAttr[name="bias"](%linear.10), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %36 : Tensor = aten::linear(%33, %weight.6, %bias.6), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 -> (%36) block1(): %linear.14 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %m2.3 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m2"](%m1.1), scope: 
export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %linear.16 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %weight.8 : Tensor = prim::GetAttr[name="weight"](%linear.16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %bias.8 : Tensor = prim::GetAttr[name="bias"](%linear.16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %42 : Tensor = aten::linear(%x.1, %weight.8, %bias.8), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 %weight.2 : Tensor = prim::GetAttr[name="weight"](%linear.14), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %bias.2 : Tensor = prim::GetAttr[name="bias"](%linear.14), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %45 : Tensor = aten::linear(%42, %weight.2, %bias.2), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 -> (%45) %weight.10 : Tensor = prim::GetAttr[name="weight"](%linear.6), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear %bias.10 : Tensor = prim::GetAttr[name="bias"](%linear.6), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear %48 : Tensor = aten::linear(%27, %weight.10, %bias.10), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 -> (%48) block1(): %linear.8 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%15), scope: export.test_converter.SuperNestedM1:: %m2.1 : __torch__.export.test_converter.___torch_mangle_15.NestedM = prim::GetAttr[name="m2"](%15), 
scope: export.test_converter.SuperNestedM1:: %51 : Tensor = aten::sum(%x.1, %16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19 %52 : Tensor = aten::gt(%51, %17), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19 %53 : bool = aten::Bool(%52), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19 %54 : Tensor = prim::If(%53), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:16 block0(): %linear.1 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %m1 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m1"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %linear.5 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %weight.1 : Tensor = prim::GetAttr[name="weight"](%linear.5), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %bias.1 : Tensor = prim::GetAttr[name="bias"](%linear.5), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %60 : Tensor = aten::linear(%x.1, %weight.1, %bias.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 %weight.3 : Tensor = prim::GetAttr[name="weight"](%linear.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %bias.3 : Tensor = 
prim::GetAttr[name="bias"](%linear.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %63 : Tensor = aten::linear(%60, %weight.3, %bias.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 -> (%63) block1(): %linear.3 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %m2 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m2"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %linear : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %weight.5 : Tensor = prim::GetAttr[name="weight"](%linear), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %bias.5 : Tensor = prim::GetAttr[name="bias"](%linear), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %69 : Tensor = aten::linear(%x.1, %weight.5, %bias.5), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 %weight.12 : Tensor = prim::GetAttr[name="weight"](%linear.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %bias.12 : Tensor = prim::GetAttr[name="bias"](%linear.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %72 : Tensor = aten::linear(%69, %weight.12, %bias.12), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 -> (%72) %weight : Tensor = prim::GetAttr[name="weight"](%linear.8), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear %bias : 
Tensor = prim::GetAttr[name="bias"](%linear.8), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear %75 : Tensor = aten::linear(%54, %weight, %bias), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 -> (%75) return (%21) ``` ##### ExportedProgram ``` ExportedProgram: class GraphModule(torch.nn.Module): def forward(self, p_linear_weight: "f32[3, 3]", p_linear_bias: "f32[3]", p_m1_linear_weight: "f32[3, 3]", p_m1_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m1_m1_linear_bias: "f32[3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]", p_m2_linear_weight: "f32[3, 3]", p_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]", p_m2_m2_linear_bias: "f32[3]", x_1: "f32[3]"): # No stacktrace found for following nodes max_1: "f32[]" = torch.ops.aten.max.default(x_1) gt: "b8[]" = torch.ops.aten.gt.Scalar(max_1, 1); max_1 = None # File: <eval_with_key>.137:23 in forward, code: cond = torch.ops.higher_order.cond(l_args_0_, cond_true_2, cond_false_2, [l_args_3_0_, l_args_3_13_, l_args_3_5_, l_args_3_12_, l_args_3_14_, l_args_3_1_, l_args_3_3_, l_args_3_4_, l_args_3_7_, l_args_3_10_, l_args_3_11_, l_args_3_2_, l_args_3_6_, l_args_3_8_, l_args_3_9_]); l_args_0_ = cond_true_2 = cond_false_2 = l_args_3_0_ = l_args_3_13_ = l_args_3_5_ = l_args_3_12_ = l_args_3_14_ = l_args_3_1_ = l_args_3_3_ = l_args_3_4_ = l_args_3_7_ = l_args_3_10_ = l_args_3_11_ = l_args_3_2_ = l_args_3_6_ = l_args_3_8_ = l_args_3_9_ = None true_graph_0 = self.true_graph_0 false_graph_0 = self.false_graph_0 conditional = torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [p_linear_weight, p_linear_bias, x_1, p_m1_linear_weight, p_m1_m1_linear_bias, p_m1_linear_bias, p_m1_m2_linear_weight, p_m1_m2_linear_bias, p_m1_m1_linear_weight, p_m2_m2_linear_bias, 
p_m2_m1_linear_weight, p_m2_linear_weight, p_m2_m1_linear_bias, p_m2_m2_linear_weight, p_m2_linear_bias]); gt = true_graph_0 = false_graph_0 = p_linear_weight = p_linear_bias = x_1 = p_m1_linear_weight = p_m1_m1_linear_bias = p_m1_linear_bias = p_m1_m2_linear_weight = p_m1_m2_linear_bias = p_m1_m1_linear_weight = p_m2_m2_linear_bias = p_m2_m1_linear_weight = p_m2_linear_weight = p_m2_m1_linear_bias = p_m2_m2_linear_weight = p_m2_linear_bias = None getitem: "f32[3]" = conditional[0]; conditional = None return (getitem,) class <lambda>(torch.nn.Module): def forward(self, p_linear_weight: "f32[3, 3]", p_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_linear_weight: "f32[3, 3]", p_m1_m1_linear_bias: "f32[3]", p_m1_linear_bias: "f32[3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]", p_m2_linear_bias: "f32[3]"): # File: <eval_with_key>.134:8 in forward, code: sum_default = torch.ops.aten.sum.default(l_args_3_5__1, dtype = None) sum_1: "f32[]" = torch.ops.aten.sum.default(x_1) # File: <eval_with_key>.134:9 in forward, code: gt_scalar = torch.ops.aten.gt.Scalar(sum_default, 1); sum_default = None gt: "b8[]" = torch.ops.aten.gt.Scalar(sum_1, 1); sum_1 = None # File: <eval_with_key>.134:12 in forward, code: cond = torch.ops.higher_order.cond(gt_scalar, cond_true_0, cond_false_0, [l_args_3_12__true_branch, l_args_3_1__true_branch, l_args_3_5__1, l_args_3_14__true_branch, l_args_3_7__true_branch, l_args_3_3__true_branch, l_args_3_4__true_branch]); gt_scalar = cond_true_0 = cond_false_0 = l_args_3_12__true_branch = l_args_3_1__true_branch = l_args_3_5__1 = l_args_3_14__true_branch = l_args_3_7__true_branch = l_args_3_3__true_branch = l_args_3_4__true_branch = None true_graph_0 = self.true_graph_0 false_graph_0 = self.false_graph_0 conditional = 
torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [p_m1_linear_weight, p_m1_linear_bias, x_1, p_m1_m1_linear_bias, p_m1_m1_linear_weight, p_m1_m2_linear_weight, p_m1_m2_linear_bias]); gt = true_graph_0 = false_graph_0 = p_m1_linear_weight = p_m1_linear_bias = x_1 = p_m1_m1_linear_bias = p_m1_m1_linear_weight = p_m1_m2_linear_weight = p_m1_m2_linear_bias = None getitem: "f32[3]" = conditional[0]; conditional = None # File: <eval_with_key>.134:14 in forward, code: linear_default = torch.ops.aten.linear.default(getitem, l_args_3_0__1, l_args_3_13__1); getitem = l_args_3_0__1 = l_args_3_13__1 = None linear: "f32[3]" = torch.ops.aten.linear.default(getitem, p_linear_weight, p_linear_bias); getitem = p_linear_weight = p_linear_bias = None return (linear,) class <lambda>(torch.nn.Module): def forward(self, p_m1_linear_weight: "f32[3, 3]", p_m1_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_m1_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]"): # File: <eval_with_key>.130:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_7__true_branch, l_args_3_14__true_branch); l_args_3_5__1 = l_args_3_7__true_branch = l_args_3_14__true_branch = None linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m1_m1_linear_weight, p_m1_m1_linear_bias); x_1 = p_m1_m1_linear_weight = p_m1_m1_linear_bias = None # File: <eval_with_key>.130:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_12__1, l_args_3_1__1); linear_default = l_args_3_12__1 = l_args_3_1__1 = None linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m1_linear_weight, p_m1_linear_bias); linear = p_m1_linear_weight = p_m1_linear_bias = None return (linear_1,) class <lambda>(torch.nn.Module): def forward(self, p_m1_linear_weight: "f32[3, 3]", p_m1_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_m1_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", 
p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]"): # File: <eval_with_key>.131:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_3__false_branch, l_args_3_4__false_branch); l_args_3_5__1 = l_args_3_3__false_branch = l_args_3_4__false_branch = None linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m1_m2_linear_weight, p_m1_m2_linear_bias); x_1 = p_m1_m2_linear_weight = p_m1_m2_linear_bias = None # File: <eval_with_key>.131:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_12__1, l_args_3_1__1); linear_default = l_args_3_12__1 = l_args_3_1__1 = None linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m1_linear_weight, p_m1_linear_bias); linear = p_m1_linear_weight = p_m1_linear_bias = None return (linear_1,) class <lambda>(torch.nn.Module): def forward(self, p_linear_weight: "f32[3, 3]", p_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_linear_weight: "f32[3, 3]", p_m1_m1_linear_bias: "f32[3]", p_m1_linear_bias: "f32[3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]", p_m2_linear_bias: "f32[3]"): # File: <eval_with_key>.135:8 in forward, code: sum_default = torch.ops.aten.sum.default(l_args_3_5__1, dtype = None) sum_1: "f32[]" = torch.ops.aten.sum.default(x_1) # File: <eval_with_key>.135:9 in forward, code: gt_scalar = torch.ops.aten.gt.Scalar(sum_default, 1); sum_default = None gt: "b8[]" = torch.ops.aten.gt.Scalar(sum_1, 1); sum_1 = None # File: <eval_with_key>.135:12 in forward, code: cond = torch.ops.higher_order.cond(gt_scalar, cond_true_1, cond_false_1, [l_args_3_2__false_branch, l_args_3_5__1, l_args_3_9__false_branch, l_args_3_11__false_branch, l_args_3_6__false_branch, l_args_3_10__false_branch, l_args_3_8__false_branch]); 
gt_scalar = cond_true_1 = cond_false_1 = l_args_3_2__false_branch = l_args_3_5__1 = l_args_3_9__false_branch = l_args_3_11__false_branch = l_args_3_6__false_branch = l_args_3_10__false_branch = l_args_3_8__false_branch = None true_graph_0 = self.true_graph_0 false_graph_0 = self.false_graph_0 conditional = torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [p_m2_linear_weight, x_1, p_m2_linear_bias, p_m2_m1_linear_weight, p_m2_m1_linear_bias, p_m2_m2_linear_bias, p_m2_m2_linear_weight]); gt = true_graph_0 = false_graph_0 = p_m2_linear_weight = x_1 = p_m2_linear_bias = p_m2_m1_linear_weight = p_m2_m1_linear_bias = p_m2_m2_linear_bias = p_m2_m2_linear_weight = None getitem: "f32[3]" = conditional[0]; conditional = None # File: <eval_with_key>.135:14 in forward, code: linear_default = torch.ops.aten.linear.default(getitem, l_args_3_0__1, l_args_3_13__1); getitem = l_args_3_0__1 = l_args_3_13__1 = None linear: "f32[3]" = torch.ops.aten.linear.default(getitem, p_linear_weight, p_linear_bias); getitem = p_linear_weight = p_linear_bias = None return (linear,) class <lambda>(torch.nn.Module): def forward(self, p_m2_linear_weight: "f32[3, 3]", x_1: "f32[3]", p_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]"): # File: <eval_with_key>.132:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_11__true_branch, l_args_3_6__true_branch); l_args_3_5__1 = l_args_3_11__true_branch = l_args_3_6__true_branch = None linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m2_m1_linear_weight, p_m2_m1_linear_bias); x_1 = p_m2_m1_linear_weight = p_m2_m1_linear_bias = None # File: <eval_with_key>.132:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_2__1, l_args_3_9__1); linear_default = l_args_3_2__1 = l_args_3_9__1 = None linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, 
p_m2_linear_weight, p_m2_linear_bias); linear = p_m2_linear_weight = p_m2_linear_bias = None return (linear_1,) class <lambda>(torch.nn.Module): def forward(self, p_m2_linear_weight: "f32[3, 3]", x_1: "f32[3]", p_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]"): # File: <eval_with_key>.133:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_8__false_branch, l_args_3_10__false_branch); l_args_3_5__1 = l_args_3_8__false_branch = l_args_3_10__false_branch = None linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m2_m2_linear_weight, p_m2_m2_linear_bias); x_1 = p_m2_m2_linear_weight = p_m2_m2_linear_bias = None # File: <eval_with_key>.133:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_2__1, l_args_3_9__1); linear_default = l_args_3_2__1 = l_args_3_9__1 = None linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m2_linear_weight, p_m2_linear_bias); linear = p_m2_linear_weight = p_m2_linear_bias = None return (linear_1,) Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_linear_weight'), target='linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_linear_bias'), target='linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_linear_weight'), target='m1.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_linear_bias'), target='m1.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m1_linear_weight'), target='m1.m1.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m1_linear_bias'), target='m1.m1.linear.bias', persistent=None), 
InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m2_linear_weight'), target='m1.m2.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m2_linear_bias'), target='m1.m2.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_linear_weight'), target='m2.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_linear_bias'), target='m2.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m1_linear_weight'), target='m2.m1.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m1_linear_bias'), target='m2.m1.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m2_linear_weight'), target='m2.m2.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m2_linear_bias'), target='m2.m2.linear.bias', persistent=None), InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x_1'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='getitem'), target=None)]) Range constraints: {} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127975 Approved by: https://github.com/angelayi, https://github.com/ydwu4	2024-06-10 23:24:16 +00:00
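The bottom-up traversal described in the fix above can be sketched in a few lines. This is an illustrative model only, not the converter's code: the `Block` class, its fields, and `params_to_lift` are hypothetical names. The assumption is that a sub-block must receive, as explicit inputs, every parameter used by itself or by any block nested inside it, while the top-level block keeps its parameters unlifted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Block:
    """Hypothetical stand-in for a TorchScript block."""
    name: str
    params_used: Set[str] = field(default_factory=set)  # parameters read directly in this block
    children: List["Block"] = field(default_factory=list)


def params_to_lift(root: Block) -> Dict[str, Set[str]]:
    """Map each sub-block name to the parameters it needs as lifted inputs."""
    lifted: Dict[str, Set[str]] = {}

    def visit(block: Block, is_root: bool) -> Set[str]:
        needed = set(block.params_used)
        # Children are resolved first, so the innermost blocks are decided
        # before their parents (bottom-up).
        for child in block.children:
            needed |= visit(child, is_root=False)
        if not is_root:
            lifted[block.name] = needed
        return needed

    visit(root, is_root=True)
    return lifted


inner_true = Block("if0.true/if1.true", {"m1.m1.linear.weight", "m1.m1.linear.bias"})
inner_false = Block("if0.true/if1.false", {"m1.m2.linear.weight", "m1.m2.linear.bias"})
outer_true = Block("if0.true", {"linear.weight", "linear.bias"}, [inner_true, inner_false])
graph = Block("graph", set(), [outer_true])

result = params_to_lift(graph)
for block_name, params in result.items():
    print(block_name, sorted(params))
```

In this model the outer branch ends up with the union of its own parameters and those of both nested branches, which matches the shape of the `cond` operand lists in the ExportedProgram output above.
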
Animesh Jain	b2d602306a	[RELAND][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578 ) Tracing through `__init__` is important because it initializes (calls STORE_ATTR) on members. By doing that, we kick in the mutation tracking for these objects. So, things like mutating `_modules` etc is tracked automatically. Fixes https://github.com/pytorch/pytorch/issues/111837 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578 Approved by: https://github.com/jansel ghstack dependencies: #128295	2024-06-10 23:11:04 +00:00
Animesh Jain	05711eece9	[dynamo][inlining inbuilt modules] Ensure BC for nn_module_stack (#128295 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128295 Approved by: https://github.com/ydwu4	2024-06-10 23:11:04 +00:00
Yidi Wu	a287ff75d0	Use init_torchbind_implementations in inductor torchbind tests. (#128341 ) Summary: To unify how we load the torch bind libraries for testing. Test Plan: Existing tests. Differential Revision: D58372372 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128341 Approved by: https://github.com/angelayi	2024-06-10 23:02:48 +00:00
PyTorch MergeBot	4bbadeee8a	Revert "Set simdlen based on ATEN_CPU_CAPABILITY (#123514 )" This reverts commit b66e3f0957b96b058c9b632ca60833d9717a9d8a. Reverted https://github.com/pytorch/pytorch/pull/123514 on behalf of https://github.com/clee2000 due to broke test/inductor/test_torchinductor.py::CpuTests::test_new_cpp_build_logical_cpu on periodic test on the no gpu tests `b66e3f0957` https://github.com/pytorch/pytorch/actions/runs/9453518547/job/26040077301 ([comment](https://github.com/pytorch/pytorch/pull/123514#issuecomment-2159433432))	2024-06-10 22:46:01 +00:00
Simon Fan	2176ef7dfa	[compiled autograd] support .backward(inputs=) (#128252 ) Autograd already marks nodes as needed or not before calling compiled autograd, so our worklist already skips nodes not specified in the `inputs` kwarg. For the .backward(inputs=) case, I'm keeping the grads as outputs, just like for .grad(inputs=); this is to still guard on graph_output when we collect the nodes. This does not get DCE'd right now, and is ignored in the post graph bytecode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128252 Approved by: https://github.com/jansel	2024-06-10 22:20:51 +00:00
loganthomas	583a56d5a8	DOC: add docstring to construct_and_record_rdzv_event() (#128189 ) Fixes #127902 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128189 Approved by: https://github.com/kurman	2024-06-10 22:17:33 +00:00
Mikayla Gawarecki	c38b3381a1	Make nn.Module state_dict load_state_dict pre-hook and state_dict post hook public (#126704 ) Fixes https://github.com/pytorch/pytorch/issues/75287 and https://github.com/pytorch/pytorch/issues/117437 - `nn.Module._register_state_dict_hook` --> add public `nn.Module.register_state_dict_post_hook` - Add a test as this API was previously untested - `nn.Module._register_load_state_dict_pre_hook` --> add public `nn.Module.register_load_state_dict_pre_hook` (remove the `with_module` flag, default it to `True`) ~- For consistency with optimizer `load_state_dict_pre_hook` raised by @janeyx99, allow the pre-hook to return a new `state_dict`~ - Document issue pointed out by https://github.com/pytorch/pytorch/issues/117437 regarding the `_register_state_dict_hook` semantics of returning a new state_dict only being respected for the root for the private hook - Remove this for the public `register_state_dict_post_hook` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126704 Approved by: https://github.com/albanD ghstack dependencies: #126906	2024-06-10 21:50:17 +00:00
Mikayla Gawarecki	a2d4fea872	[easy] Move state_dict hooks tests to test_module_hooks and decorate tests that call load_state_dict with swap (#126906 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126906 Approved by: https://github.com/albanD	2024-06-10 21:50:17 +00:00
Edward Z. Yang	58083ffb10	Improve unbacked reasoning involving has internal overlap (#128332 ) Fixes https://github.com/pytorch/pytorch/issues/122477 Partially addresses https://github.com/pytorch/pytorch/issues/116336 This PR is slightly overkill: not only does it disable the overlap test when there are unbacked SymInts, it also improves the is non-overlapping and dense test for some more unbacked situations. We technically don't need the latter change, but I was already deep in the sauce and just went ahead and did it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128332 Approved by: https://github.com/lezcano	2024-06-10 21:49:38 +00:00
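For readers unfamiliar with the "non-overlapping and dense" test mentioned above, here is a simplified sketch for concrete integer sizes and strides. It is not the code changed by the PR, and the function name is chosen for illustration. With unbacked SymInts, comparisons such as `size >= 2` or `stride != expected_stride` cannot be evaluated directly, which is the difficulty the commit message refers to.

```python
from typing import Sequence


def is_non_overlapping_and_dense(sizes: Sequence[int], strides: Sequence[int]) -> bool:
    """Check that the layout covers a contiguous range of memory exactly once.

    Dimensions of size 0 or 1 are ignored because their strides never affect
    which addresses are touched.
    """
    dims = sorted((stride, size) for size, stride in zip(sizes, strides) if size >= 2)
    expected_stride = 1
    for stride, size in dims:
        # After sorting by stride, each stride must equal the number of
        # elements spanned by all faster-moving dimensions.
        if stride != expected_stride:
            return False
        expected_stride *= size
    return True


print(is_non_overlapping_and_dense((2, 3), (3, 1)))  # row-major layout
print(is_non_overlapping_and_dense((2, 3), (1, 2)))  # column-major layout
print(is_non_overlapping_and_dense((2, 3), (4, 1)))  # gap between rows
print(is_non_overlapping_and_dense((2, 3), (0, 1)))  # rows alias each other
```
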
Andrea Frittoli	6630dcd53c	Add docstring for the torch.serialization.default_restore_location function (#128132 ) Fixes: #127887 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128132 Approved by: https://github.com/mikaylagawarecki	2024-06-10 21:33:56 +00:00
laithsakka	3a2d0755a4	enable test_ParameterList with dynamo if nn module inlining enabled only (#128308 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128308 Approved by: https://github.com/anijain2305	2024-06-10 21:25:40 +00:00
IvanKobzarev	b459713ca7	[aota] compiled forward outputs requires_grad alignment with eager (#128016 ) Original issue: https://github.com/pytorch/pytorch/issues/114338 We assume only two mutually exclusive scenarios: 1. Running the compiled region for training (any input has requires_grad) - Produced differentiable outputs should have requires_grad. 2. Running the compiled region for inference (no input has requires_grad) - No output has requires_grad. Even if the user runs the region under no_grad() but passes an input Tensor with requires_grad, we take the training scenario (1). With the current state that means: 1/ needs_autograd should not check torch.is_grad_enabled(), only whether any input requires_grad 2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under `with torch.enable_grad()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128016 Approved by: https://github.com/bdhirsh	2024-06-10 20:51:22 +00:00
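The decision rule in the two scenarios above can be modeled without PyTorch. The sketch below is an assumption-laden illustration, not AOTAutograd's code: `FakeInput` and `outputs_require_grad` are hypothetical names, and only `needs_autograd` echoes a name from the commit message. The point it shows is that the global grad mode is deliberately not an input to the decision.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class FakeInput:
    """Hypothetical stand-in for a tensor; only the flag matters here."""
    requires_grad: bool


def needs_autograd(inputs: Sequence[FakeInput]) -> bool:
    # Training scenario if any input requires grad. Note that there is no
    # grad-mode parameter: running under no_grad() does not change the answer.
    return any(t.requires_grad for t in inputs)


def outputs_require_grad(
    inputs: Sequence[FakeInput], differentiable: Sequence[bool]
) -> List[bool]:
    """Expected requires_grad flag for each output of the compiled region."""
    training = needs_autograd(inputs)
    # Training: differentiable outputs require grad. Inference: none do.
    return [training and is_diff for is_diff in differentiable]


training_flags = outputs_require_grad(
    [FakeInput(True), FakeInput(False)], differentiable=[True, False]
)
inference_flags = outputs_require_grad(
    [FakeInput(False), FakeInput(False)], differentiable=[True, False]
)
print(training_flags, inference_flags)
```
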
Guilherme Leobas	4460e481bc	Disable jacrev/jacfwd/hessian if compiling with dynamo (#128255 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128255 Approved by: https://github.com/zou3519	2024-06-10 20:47:53 +00:00
PyTorch MergeBot	90bb510ece	Revert "Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 )" This reverts commit 348b181a97abc2e636a6c18e5880a78e5d1dab94. Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/clee2000 due to sorry I think https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456 is still relevant, I will reach out to them to see what needs to be done in internal to get this remerged ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2159248859))	2024-06-10 20:44:42 +00:00
Xiaodong Wang	38e0a0440c	[AMD] Default to hipblaslt in gemm (#127944 ) Summary: It has been a constant pain that we have to specify env var to go with the hipblaslt path. The default path is very slow on MI300. Therefore, let's default to hipblaslt. Differential Revision: D58150764 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127944 Approved by: https://github.com/aaronenyeshi, https://github.com/houseroad	2024-06-10 19:55:21 +00:00
Aaron Orenstein	946f554c8f	Flip default value for mypy disallow_untyped_defs [10+1/11] (#128293 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128293 Approved by: https://github.com/oulgen	2024-06-10 19:32:44 +00:00
Nikita Shulga	55646554b7	[EZ] Fix typos in SECURITY.md (#128340 ) permisisons -> permissions lates -> latest Pull Request resolved: https://github.com/pytorch/pytorch/pull/128340 Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/kit1980	2024-06-10 19:21:39 +00:00
Edward Z. Yang	9cab5987bd	Introduce int_oo (#127693 ) In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range. After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better. But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. test/test_sympy_utils.py describes some basic properties of the number, and torch/utils/_sympy/numbers.py has the actual implementation. The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments. Fixes https://github.com/pytorch/pytorch/issues/127396 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693 Approved by: https://github.com/lezcano ghstack dependencies: #126905	2024-06-10 19:09:53 +00:00
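The motivation in the commit above can be illustrated with a small standalone sketch. `ToyIntInfinity` below is a hypothetical stand-in invented for this note; it is not the `int_oo` implementation in torch/utils/_sympy/numbers.py and only models two of its advertised properties (it reports `is_integer`, and it absorbs size-increasing arithmetic).

```python
import sys


class ToyIntInfinity:
    """Hypothetical symbolic upper bound that still claims to be an integer."""

    is_integer = True

    def __mul__(self, other):
        # Multiplying by a positive int leaves the bound unchanged.
        if isinstance(other, int) and other > 0:
            return self
        return NotImplemented

    __rmul__ = __mul__

    def __add__(self, other):
        # Adding any finite int leaves the bound unchanged.
        if isinstance(other, int):
            return self
        return NotImplemented

    __radd__ = __add__

    def __repr__(self):
        return "toy_int_oo"


# Concrete sentinel bound: doubling it leaves the machine-word range, so
# further bound arithmetic is done on arbitrary-precision integers.
concrete_upper = sys.maxsize - 1
doubled = concrete_upper * 2
print(doubled > sys.maxsize)

# Symbolic bound: the same computation stays a single cheap object.
toy_int_oo = ToyIntInfinity()
symbolic_doubled = toy_int_oo * 2
print(symbolic_doubled, symbolic_doubled.is_integer)
```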
PyTorch MergeBot	db2fa7b827	Revert "[export] FIx unflattener for preserving modules containing unused inputs (#128260 )" This reverts commit 093a4ff5f859ccbbd8ba62dd189f76e5faadfb04. Reverted https://github.com/pytorch/pytorch/pull/128260 on behalf of https://github.com/angelayi due to breaking windows test ([comment](https://github.com/pytorch/pytorch/pull/128260#issuecomment-2159050726))	2024-06-10 18:42:33 +00:00
angelayi	093a4ff5f8	[export] FIx unflattener for preserving modules containing unused inputs (#128260 ) Currently the unflattener fails if the module whose signature it is preserving contains unused inputs/outputs. This also fixes unflattener issues in D57829276. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128260 Approved by: https://github.com/pianpwk	2024-06-10 18:39:33 +00:00
Sam Larsen	fa8ec8e718	[dynamo] handle hashable exceptions in trace_rules lookup (#128078 ) Summary: Found during user empathy day when attempting to hash a fractions.Fraction object before it was fully constructed. See https://github.com/pytorch/pytorch/issues/128075 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128078 Approved by: https://github.com/anijain2305	2024-06-10 18:23:22 +00:00
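A minimal sketch of the failure mode described above, assuming a dictionary-style lookup keyed by arbitrary objects. `PartiallyBuilt` and `safe_lookup` are hypothetical names made up for this note; they are not taken from Dynamo's trace_rules code, and the real fix may catch a narrower set of exceptions.

```python
class PartiallyBuilt:
    """Hypothetical object whose __hash__ needs an attribute set late in construction."""

    def __hash__(self):
        # Raises AttributeError until _value has been assigned.
        return hash(self._value)


def safe_lookup(table, key, default=None):
    """Look up key in table, treating a failing __hash__ as 'not found'."""
    try:
        return table.get(key, default)
    except Exception:
        # hash(key) raised, so the key cannot be in the table.
        return default


# Simulate an object observed before its initializer has run.
half_built = object.__new__(PartiallyBuilt)
print(safe_lookup({"known": 1}, half_built, default="fallback"))
print(safe_lookup({"known": 1}, "known"))
```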
Aaron Enye Shi	136bdb96cb	Update Kineto submodule with fix to test_basic_chrome_trace (#128333 ) Summary: We've updated the sort_index in Kineto chrome traces to support device ids up to 16 devices. This should make chrome trace rows be ordered in the same way as CUDA. We need to update the unit test as well. Test Plan: Ran locally the changing test: ``` $ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:test_profiler_cuda -- --exact 'caffe2/test:test_profiler_cuda - test_basic_chrome_trace (profiler.test_profiler.TestProfiler)' File changed: fbcode//caffe2/third_party/kineto.submodule.txt Buck UI: https://www.internalfb.com/buck2/f4fd1e9a-99f1-4422-aeed-b54903c64146 Test UI: https://www.internalfb.com/intern/testinfra/testrun/16888498639845776 Network: Up: 5.4KiB Down: 8.6KiB (reSessionID-0329120e-7fa2-4bc0-b539-7e58058f8fce) Jobs completed: 6. Time elapsed: 1:01.2s. Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Differential Revision: D58362964 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/128333 Approved by: https://github.com/Skylion007	2024-06-10 18:12:34 +00:00
Andrea Frittoli	83941482f7	Add docstring for the torch.distributed.elastic.utils.distributed.get_free_port function (#128133 ) Fixes: #127914 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128133 Approved by: https://github.com/H-Huang	2024-06-10 18:10:58 +00:00
Menglu Yu	08d038f8a8	[PT2] Fix a typo and lint problem (#128258 ) Summary: Titled Test Plan: see signal Differential Revision: D58310169 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128258 Approved by: https://github.com/dshi7, https://github.com/Yuzhen11	2024-06-10 18:03:40 +00:00
Shengbao Zheng	46948300a2	[c10d] integrate PMI NCCL initialization to NCCL-PG (#128243 ) Summary: Move broadcastUniqueID check to NCCLUtils Differential Revision: D58273755 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128243 Approved by: https://github.com/wconstab	2024-06-10 17:20:03 +00:00
Chirag Pandya	ab3a0b192a	[RFC] add per-collective timeout value in flight recorder (#128190 ) Summary: Add timeout value field on every collected record. Test Plan: Unit tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/128190 Approved by: https://github.com/wconstab	2024-06-10 17:12:57 +00:00
Edward Z. Yang	8e482e909b	Add some guard to size oblivious has_internal_overlap (#128328 ) This doesn't actually help on https://github.com/pytorch/pytorch/issues/122477 but I noticed this modest improvement so sure, why not. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128328 Approved by: https://github.com/Skylion007	2024-06-10 17:11:26 +00:00
Sheng Fu	7b9c5e0e3f	Turn on GraphTransformObserver for inductor (#127962 ) The FX graphs for some PT2 models are very complicated, Inductor usually goes through many passes of graph optimization to generate the final FX graph. It’s very difficult to see the change in each pass, and check if the optimized graph is correct and optimal. GraphTransformObserver is an observer listening to all add/erase node events on GraphModule during a graph transform pass, and save the changed nodes. When the pass is done and if there is any change in the graph, GraphTransformObserver will save the SVG files of the input graph and the output graph for that pass. This PR is to enable GraphTransformObserver for inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127962 Approved by: https://github.com/jansel	2024-06-10 16:49:02 +00:00
PyTorch MergeBot	ca561d639b	Revert "Fix 'get_attr' call in dynamo 'run_node' (#127696 )" This reverts commit b741819b0580204e6a6b60c62ce44dacaf7787c8. Reverted https://github.com/pytorch/pytorch/pull/127696 on behalf of https://github.com/clee2000 due to broke (executorch?) internal tests D58295865 ([comment](https://github.com/pytorch/pytorch/pull/127696#issuecomment-2158820093))	2024-06-10 16:29:20 +00:00
PyTorch MergeBot	d22287d1ad	Revert "Fix 'get_real_value' on placeholder nodes (#127698 )" This reverts commit 19b31d899a78a6806314bcc73b88172dabf0c26e. Reverted https://github.com/pytorch/pytorch/pull/127698 on behalf of https://github.com/clee2000 due to broke (executorch?) internal tests D58295865 ([comment](https://github.com/pytorch/pytorch/pull/127696#issuecomment-2158820093))	2024-06-10 16:29:20 +00:00
PyTorch MergeBot	3b73f5de3a	Revert "Add OpInfo entry for alias_copy (#127232 ) (#128142 )" This reverts commit 04da6aeb61f4d57bf73ed1054dd897abbcceca83. Reverted https://github.com/pytorch/pytorch/pull/128142 on behalf of https://github.com/DanilBaibak due to The changes broke the test_output_match_alias_copy_cpu_complex64 test. ([comment](https://github.com/pytorch/pytorch/pull/128142#issuecomment-2158793878))	2024-06-10 16:17:16 +00:00
Isuru Fernando	c993f1b37f	Fix edge cases for gather in inductor (#126893 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126893 Approved by: https://github.com/peterbell10 ghstack dependencies: #126876	2024-06-10 15:31:03 +00:00
Tom Ritchford	04da6aeb61	Add OpInfo entry for alias_copy (#127232 ) (#128142 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142 Approved by: https://github.com/lezcano	2024-06-10 15:01:53 +00:00
CaoE	b66e3f0957	Set simdlen based on ATEN_CPU_CAPABILITY (#123514 ) It is part of https://github.com/pytorch/pytorch/issues/123224. Set simdlen based on the environment ATEN_CPU_CAPABILITY to control CPU vec ISA like eager. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123514 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-06-10 09:02:14 +00:00
Xu Han	df43d5843e	fix miss isa bool check (#128274 ) The new cpp builder missed the ISA bool (dry-compile) check. @jgong5 found that it was missing, so I submitted this PR to fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128274 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-06-10 02:45:46 +00:00
cyy	26f6a87ae9	[5/N] Remove unused functions (#127185 ) Follows #128193 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127185 Approved by: https://github.com/ezyang	2024-06-10 01:57:49 +00:00
Peter Bell	d3817d8a60	Don't create python tuple when _maybe_handle_torch_function is called from C++ (#128187 ) Marginal overhead reduction when calling through the `torch.ops` API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128187 Approved by: https://github.com/lezcano ghstack dependencies: #128183, #128184, #128185	2024-06-10 00:16:59 +00:00
Peter Bell	cd2ad29afe	[inductor] Reduce binding overhead of _reinterpret_tensor (#128185 ) Going through the dispatcher + pybind11 + torch.ops adds about 2 us overhead per call compared to `PyArgParser`. Note that views of inputs are reconstructed by AOTAutograd before being returned to the python code, so dispatching for autograd's sake shouldn't be required here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128185 Approved by: https://github.com/lezcano ghstack dependencies: #128183, #128184	2024-06-09 23:33:03 +00:00
Peter Bell	253fa9c711	[AOTAutograd] Remove runtime import from view replay function (#128184 ) `gen_alias_from_base` spends about ~0.5 us in this import statement, which is called for each view in the graph output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128184 Approved by: https://github.com/lezcano ghstack dependencies: #128183	2024-06-09 23:33:03 +00:00
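The kind of overhead removed by the commit above can be reproduced with a generic micro-benchmark. This is a sketch of the general technique (hoisting an import out of a hot function), not the actual change to `gen_alias_from_base`; the function names are hypothetical and the absolute timings depend on the machine.

```python
import math  # hoisted: resolved once, at module import time
import timeit


def with_runtime_import(x):
    # The import statement runs on every call: a sys.modules lookup plus a
    # local name binding, even though the module is already loaded.
    import math
    return math.floor(x)


def with_hoisted_import(x):
    # Uses the module-level name; no import machinery on the hot path.
    return math.floor(x)


calls = 100_000
runtime_cost = timeit.timeit(lambda: with_runtime_import(1.5), number=calls)
hoisted_cost = timeit.timeit(lambda: with_hoisted_import(1.5), number=calls)
print(f"runtime import: {runtime_cost / calls * 1e6:.3f} us per call")
print(f"hoisted import: {hoisted_cost / calls * 1e6:.3f} us per call")
```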
Peter Bell	55b2a0a002	[AOTAutograd] Use _set_grad_enabled instead of no_grad (#128183 ) This saves ~1us of overhead from each inductor graph call. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128183 Approved by: https://github.com/lezcano	2024-06-09 23:33:03 +00:00
Masahiro Hiramori	5e7377e044	[Dynamo][TVM] Make the `opt_level` parameter adjustable (#127876 ) Fixes #127874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127876 Approved by: https://github.com/jansel	2024-06-09 21:38:00 +00:00
Shuqiang Zhang	c7e2c9c37e	[c10d][doc] add a doc page for NCCL ENVs (#128235 ) Addressing issue: https://github.com/pytorch/pytorch/issues/128204 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128235 Approved by: https://github.com/wconstab	2024-06-09 16:08:38 +00:00
Chirag Pandya	0bf2fe522a	[RFC] Provide optional switches to _dump_nccl_trace (#127651 ) Summary: Data from PyTorch distributed is mostly useful during initial stages of model development. Provide options to reduce data sent/dumped. `_dump_nccl_trace` takes 3 optional switches. Default as before returns everything - `includeCollectives`: option to also include collectives: Default is True. - `includeStacktraces`: option to include stack traces in collectives. Default is True. - `onlyActive`: option to only send active collective work - i.e. not completed. Default is False (i.e. send everything) Test Plan: Unit tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/127651 Approved by: https://github.com/wconstab	2024-06-09 14:00:57 +00:00
PyTorch MergeBot	75b0720a97	Revert "Use hidden visibility in OBJECTCXX files (#127265 )" This reverts commit 669560d51aa1e81ebd09e2aa8288d0d314407d82. Reverted https://github.com/pytorch/pytorch/pull/127265 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I suspect that it causes this failure https://github.com/pytorch/vision/issues/8478 on vision where its C++ extension could not be loaded on macOS ([comment](https://github.com/pytorch/pytorch/pull/127265#issuecomment-2156401838))	2024-06-09 09:05:17 +00:00
eqy	4c971932e8	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-09 06:53:34 +00:00
Edward Z. Yang	3964a3ec73	Complete revamp of float/promotion sympy handling (#126905 ) At a high level, the idea behind this PR is: * Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.) * Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers. The story begins in torch/utils/_sympy/functions.py. Here, I make some changes to how we represent certain operations in sympy expressions: * FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing). * ModularIndexing, LShift, RShift now assert they are given integer inputs. * Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver * TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53 beyond what first coercing the integer to floats and then doing true division. Trunc is split to TruncToFloat and TruncToInt. * Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result. 
* RoundDecimal updated to consistently only ever return a float * Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing) In torch/__init__.py, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations. Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information. We also need to introduce some new op handlers in torch/_inductor/ops_handler.py: * `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy * `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv` These changes have consequences. First, we need to make some administrative changes: * Actually wire up these Sympy functions from SymInt/SymFloat in torch/fx/experimental/sym_node.py, including the new promotion rules (promote2) * Add support for new Sympy functions in torch/utils/_sympy/interp.py, torch/utils/_sympy/reference.py * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this to fix tests here * Add printer support for the Sympy functions in torch/_inductor/codegen/common.py, torch/_inductor/codegen/cpp_utils.py, torch/_inductor/codegen/triton.py. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. 
TODO: The additions here are not exhaustive yet * Update ValueRanges logic to use new sympy functions in torch/utils/_sympy/value_ranges.py. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions. In torch/fx/experimental/symbolic_shapes.py we need to make some symbolic reasoning adjustments: * Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now * `_assert_bound_is_rational` is no more, we no longer generate rational bounds * Don't intersect non-int value ranges with the `int_range` * Support more sympy Functions for guard SYMPY_INTERP * Assert the type of value range is consistent with the variable type The new asserts uncovered necessary bug fixes: * torch/_inductor/codegen/cpp.py, torch/_inductor/select_algorithm.py, torch/_inductor/sizevars.py - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions * torch/_inductor/utils.py - make sure you actually pass in sympy.Expr to these functions * torch/_inductor/ir.py - make_contiguous_strides_for takes int/SymInt, not sympy.Expr! * torch/export/dynamic_shapes.py - don't use infinity to represent int ranges, instead use sys.maxsize - 1 Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at test/test_proxy_tensor.py Reland notes. 
This requires this internal fbcode diff https://www.internalfb.com/phabricator/paste/view/P1403322587 but I cannot prepare the diff codev due to https://fb.workplace.com/groups/osssupport/posts/26343544518600814/ It also requires this Executorch PR https://github.com/pytorch/executorch/pull/3911 but the ET PR can be landed prior to this landing. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905 Approved by: https://github.com/xadupre, https://github.com/lezcano	2024-06-09 06:20:25 +00:00
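The precision remark about IntTrueDiv in the commit above can be checked in plain Python, whose `int / int` division computes the correctly rounded quotient rather than converting each operand to a float first. The sketch below uses only built-in arithmetic and does not touch the Sympy functions named in the commit; the operands are chosen for this note so that the numerator is not exactly representable as a double.

```python
# Exact quotient is 2**53 + 1, which lies beyond the range where doubles
# can represent every integer.
numerator = (2**53 + 1) * 3
denominator = 3

# Python-style integer true division: rounds the exact quotient once.
precise = numerator / denominator

# Coerce-then-divide: float(numerator) is already rounded, and the
# division rounds a second time.
coerced = float(numerator) / float(denominator)

print(precise)
print(coerced)
print(precise == coerced)
```

The two results differ for these operands, which is the gap a dedicated `int_truediv` handler is meant to close relative to casting to floats and then running `truediv`.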
PyTorch UpdateBot	31c3fa6cf5	[audio hash update] update the pinned audio hash (#128178 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128178 Approved by: https://github.com/pytorchbot	2024-06-09 04:29:04 +00:00
cyy	7bfd1db53a	[4/N] Change static functions in headers to inline (#128286 ) Follows #128194. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128286 Approved by: https://github.com/Skylion007, https://github.com/XuehaiPan	2024-06-09 03:08:53 +00:00
Anshul Sinha	f681e3689b	[dtensor][experiment] experimenting with displaying distributed model parameters and printing sharding info (#127987 ) Summary Example code to display distributed model parameters and verify them against ground truth. Also prints sharding information. Test Plan torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/display_sharding_example.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127987 Approved by: https://github.com/XilunWu ghstack dependencies: #127358, #127360, #127630	2024-06-09 00:14:07 +00:00
Anshul Sinha	2c2cf1d779	[dtensor][experiment] experimenting with displaying model parameters (#127630 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): Summary Example code to display model parameters and verify them against ground truth. Also expanded on moduletracker to accomplish this. Test Plan python3 torch/distributed/_tensor/examples/display_sharding_example.py * #127987 * __->__ #127630 * #127360 * #127358 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127630 Approved by: https://github.com/XilunWu ghstack dependencies: #127358, #127360	2024-06-09 00:14:07 +00:00
Xinya Zhang	d34075e0bd	Add Efficient Attention support on ROCM (#124885 ) This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation Known limitations: - Only supports MI200/MI300X GPUs - Does not support varlen - Does not support `CausalVariant` - Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null - Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCM. This PR also uses a different approach of installing AOTriton binary instead of building it from source in the base docker image. More details on motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129 `PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" results with this change. [Previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 error, 0 failure, 55229 skipped out of 75517 tests in total (the XML report does not contain total number of passed tests). Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885 Approved by: https://github.com/malfet	2024-06-08 22:41:05 +00:00
James Wu	6e7a23475d	[easy] Run autograd if any mutations on inputs that require grad (#128229 ) If any inputs are mutated that require grad, even if all the outputs don't require grad, we should still run autograd with a backwards graph. This fixes two tests: test_input_mutation_alias_everything and test_view_detach. Fixes #128035 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128229 Approved by: https://github.com/aorenste	2024-06-08 21:18:38 +00:00
Will Feng	aee154edbe	[Traceable FSDP2] Make FSDPParam._unsharded_param creation traceable (#127245 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127245 Approved by: https://github.com/awgu	2024-06-08 21:10:15 +00:00
Pritam Damania	0dd55ee159	Fix bug in _update_process_group API (#128262 ) `local_used_map_` was undefined in case of `find_unused_parameters=False`, this resulted in an error when we ran `local_used_map_.fill_(0);` Added a unit test as well Pull Request resolved: https://github.com/pytorch/pytorch/pull/128262 Approved by: https://github.com/awgu	2024-06-08 19:52:24 +00:00
Animesh Jain	3494f3f991	[dynamo] Skip inlining builtin nn modules for torch.compile inside cond (#128247 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128247 Approved by: https://github.com/ydwu4 ghstack dependencies: #128246	2024-06-08 19:20:00 +00:00
Animesh Jain	33972dfd58	[easy][inline-inbuilt-nn-modules] Fix expected graph for control flow test (#128246 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128246 Approved by: https://github.com/ydwu4	2024-06-08 19:20:00 +00:00
Aaron Orenstein	57536286e2	Flip default value for mypy disallow_untyped_defs [10/11] (#127847 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127847 Approved by: https://github.com/oulgen ghstack dependencies: #127842, #127843, #127844, #127845, #127846	2024-06-08 18:50:06 +00:00
Aaron Orenstein	8db9dfa2d7	Flip default value for mypy disallow_untyped_defs [9/11] (#127846 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127846 Approved by: https://github.com/ezyang ghstack dependencies: #127842, #127843, #127844, #127845	2024-06-08 18:50:06 +00:00
Aaron Orenstein	27f9d3b0a1	Flip default value for mypy disallow_untyped_defs [8/11] (#127845 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127845 Approved by: https://github.com/oulgen ghstack dependencies: #127842, #127843, #127844	2024-06-08 18:49:56 +00:00
Aaron Orenstein	038b927590	Flip default value for mypy disallow_untyped_defs [7/11] (#127844 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127844 Approved by: https://github.com/oulgen ghstack dependencies: #127842, #127843	2024-06-08 18:49:45 +00:00
Aaron Orenstein	7c12cc7ce4	Flip default value for mypy disallow_untyped_defs [6/11] (#127843 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843 Approved by: https://github.com/oulgen ghstack dependencies: #127842	2024-06-08 18:49:29 +00:00
Aaron Orenstein	3a0d088517	Flip default value for mypy disallow_untyped_defs [5/11] (#127842 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127842 Approved by: https://github.com/oulgen	2024-06-08 18:49:18 +00:00
Aaron Orenstein	62bcdc0ac9	Flip default value for mypy disallow_untyped_defs [4/11] (#127841 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127841 Approved by: https://github.com/oulgen	2024-06-08 18:36:48 +00:00
Aaron Orenstein	afe15d2d2f	Flip default value for mypy disallow_untyped_defs [3/11] (#127840 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127840 Approved by: https://github.com/oulgen	2024-06-08 18:28:01 +00:00
Aaron Orenstein	ea614fb2b1	Flip default value for mypy disallow_untyped_defs [2/11] (#127839 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127839 Approved by: https://github.com/oulgen	2024-06-08 18:23:08 +00:00
Aaron Orenstein	dcfa7702c3	Flip default value for mypy disallow_untyped_defs [1/11] (#127838 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127838 Approved by: https://github.com/oulgen	2024-06-08 18:16:33 +00:00
Chien-Chin Huang	2369c719d4	[DSD][BE] Cleanup unused variables and rename variables to avoid exposure to the users (#128249 ) These APIs and variables should not be exposed to users as they are designed to be used internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128249 Approved by: https://github.com/wz337	2024-06-08 17:12:17 +00:00
PyTorch MergeBot	02a901f1e9	Revert "[RFC] Provide optional switches to _dump_nccl_trace (#127651 )" This reverts commit 0a761f0627130e739f0e2748e3f71a0c347552c4. Reverted https://github.com/pytorch/pytorch/pull/127651 on behalf of https://github.com/atalman due to Breaks internal CI ([comment](https://github.com/pytorch/pytorch/pull/127651#issuecomment-2156076838))	2024-06-08 15:30:04 +00:00
PyTorch MergeBot	57a24c4fdb	Revert "[RFC] add per-collective timeout value in flight recorder (#128190 )" This reverts commit 09cccbc1c74c9d1157c1caca5526e79ee9b7ea01. Reverted https://github.com/pytorch/pytorch/pull/128190 on behalf of https://github.com/atalman due to Sorry need to revert this, in conflict with https://github.com/pytorch/pytorch/pull/127651 that needs reverting ([comment](https://github.com/pytorch/pytorch/pull/128190#issuecomment-2156075318))	2024-06-08 15:25:27 +00:00
Xuehai Pan	348b181a97	Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 ) This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690 Approved by: https://github.com/Skylion007	2024-06-08 15:25:03 +00:00
Bin Bao	917387f66d	[AOTI] fix a constant tensor device move issue (#128265 ) Summary: When copying a constant tensor to another device, `.to` returns a fake tensor and causes a problem when a real tensor is expected. Test Plan: CI Differential Revision: D58313034 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128265 Approved by: https://github.com/chenyang78	2024-06-08 13:23:49 +00:00
cyy	695502ca65	[3/N] Change static functions in headers to inline (#128194 ) Follows #127764 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128194 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2024-06-08 08:06:31 +00:00
Edward Z. Yang	73d6ec2db6	Increase verbosity of FX graph dumps (#128042 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128042 Approved by: https://github.com/aorenste	2024-06-08 07:24:58 +00:00
Ke Wen	0e6c204642	[pipelining] Friendly error message when not traceable (#128276 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128276 Approved by: https://github.com/H-Huang	2024-06-08 06:36:11 +00:00
PyTorch MergeBot	44371bd432	Revert "[dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578 )" This reverts commit 7ede78f9f5d7e6c993faa1a70a5f0b0eaec5640d. Reverted https://github.com/pytorch/pytorch/pull/126578 on behalf of https://github.com/anijain2305 due to pippy tests fail ([comment](https://github.com/pytorch/pytorch/pull/126578#issuecomment-2155836555))	2024-06-08 06:35:34 +00:00
PyTorch MergeBot	6e13c7e874	Revert "[dynamo] Support if cond on UnspecializedNNModuleVariable and add inline tests (#128158 )" This reverts commit 747fc35ff54154ddec2a5ab5661f57c28d65c591. Reverted https://github.com/pytorch/pytorch/pull/128158 on behalf of https://github.com/anijain2305 due to pippy tests fail ([comment](https://github.com/pytorch/pytorch/pull/128158#issuecomment-2155835787))	2024-06-08 06:32:28 +00:00
PyTorch MergeBot	94165dba7b	Revert "[dynamo] Inline the getattr of fx graph and proxy graph (#128172 )" This reverts commit 662a78f957fb89e53ebeba7deb880561e10ecaf6. Reverted https://github.com/pytorch/pytorch/pull/128172 on behalf of https://github.com/anijain2305 due to pippy tests fail ([comment](https://github.com/pytorch/pytorch/pull/128172#issuecomment-2155835201))	2024-06-08 06:29:36 +00:00
Wanchao Liang	8a0bc8c9ee	[fsdp2] simplify fsdp_param logic with DTensorSpec (#128242 ) as titled, we can use a single DTensorSpec to save the SPMD sharding spec, plus the global shape/stride to simplify the FSDPParam logic Pull Request resolved: https://github.com/pytorch/pytorch/pull/128242 Approved by: https://github.com/awgu	2024-06-08 05:56:41 +00:00
Shaz Qadeer	cbb7e3053f	View specialization (#127641 ) This PR adds specialization shortcuts for converting n-d to 1-d and 1-d to 2-d views. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127641 Approved by: https://github.com/ezyang	2024-06-08 05:52:52 +00:00
chilli	310f80995b	Added memory budget to partitioner (#126320 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126320 Approved by: https://github.com/shunting314	2024-06-08 05:52:40 +00:00
chilli	ffc202a1b9	Added remove_noop_ops to joint_graph_passes (#124451 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124451 Approved by: https://github.com/ezyang, https://github.com/fmassa	2024-06-08 05:48:11 +00:00
Wanchao Liang	c446851829	[fsdp2] update foreach_reduce accumulate_grad (#128117 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128117 Approved by: https://github.com/awgu	2024-06-08 05:13:57 +00:00
Ke Wen	613c7d270d	[pipelining] Format doc (#128279 ) - Should use two dots around `var` - Wrap lines - Add section cross ref Pull Request resolved: https://github.com/pytorch/pytorch/pull/128279 Approved by: https://github.com/H-Huang ghstack dependencies: #128273, #128278	2024-06-08 04:59:04 +00:00
Ke Wen	2e42671619	[pipelining] Rename to stage.py and schedules.py (#128278 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128278 Approved by: https://github.com/H-Huang ghstack dependencies: #128273	2024-06-08 04:42:35 +00:00
Ke Wen	0e3fe694d1	[pipelining] Restore a stage constructor for tracer path (#128273 ) In case the user modified the stage module out of place, such as `mod = DDP(mod)` or `mod = torch.compile(mod)`, they need a stage builder other than `pipe.build_stage()`. This PR provides an API to do so: ``` def build_stage( stage_module, stage_index, pipe.info(), ... ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128273 Approved by: https://github.com/wconstab	2024-06-08 04:42:35 +00:00
Wu, Chunyuan	8a45cf4c64	[AOTI] align data_size of the constants (#127610 ) https://github.com/pytorch/pytorch/pull/124272 set the alignment on `consts_o`, but if the `data_size` of a tensor in `consts_o` is not divisible by the alignment, the tensors that follow it are no longer aligned, resulting in poor performance on CPU. We align the `data_size` as well in this PR and pad the serialized bytes. Since the `size` of the tensor, not the `data_size`, is used when creating a tensor from the serialized bytes ([link](`f4d7cdc5e6/torch/csrc/inductor/aoti_runtime/model.h (L236-L259)`)), there won't be a correctness issue. `data_size` is only used to record the [bytes_read](`f4d7cdc5e6/torch/csrc/inductor/aoti_runtime/model.h (L217)`). This PR will improve the performance on CPU for 4 models in HF, 7 models in TIMM and 1 model in Torchbench. For the unit test, I add a bias value whose original `data_size` is not divisible by the alignment to test the correctness: ``` constants_info_[0].dtype = static_cast<int32_t>(at::kFloat); constants_info_[0].data_size = 64; # was 40 before this PR constants_info_[0].shape = {10}; constants_info_[1].dtype = static_cast<int32_t>(at::kFloat); ...... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127610 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-08 04:31:00 +00:00
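The padding arithmetic described in the commit above can be illustrated with a minimal sketch. It assumes a 64-byte alignment, inferred only from the `40 -> 64` change in the unit test; `align_up` and `ALIGNMENT` are hypothetical names for illustration, not identifiers from the AOTI runtime.

```python
def align_up(nbytes: int, alignment: int) -> int:
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


# Assumed alignment; chosen so that the example matches the 40 -> 64 change.
ALIGNMENT = 64

raw_size = 10 * 4                       # ten float32 values, shape {10}
padded_size = align_up(raw_size, ALIGNMENT)
padding = padded_size - raw_size        # bytes of padding appended after the tensor

print(raw_size, padded_size, padding)
```

Because every constant's recorded size becomes a multiple of the alignment, each following constant starts on an aligned offset as long as the first one does.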
Iris Z	1d84c7e100	[DeviceMesh] Update get_group and add get_all_groups (#128097 ) Fixes #121984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128097 Approved by: https://github.com/wconstab, https://github.com/wanchaol	2024-06-08 04:28:56 +00:00
Jason Ansel	6e5c2a1a3b	[inductor] Add missing files to torch_key (#128230 ) Previously all subdirs (like torch.inductor.codegen) were not hashed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128230 Approved by: https://github.com/oulgen	2024-06-08 03:26:48 +00:00
Yidi Wu	6220602943	[torchbind] support query schema of methods (#128267 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128267 Approved by: https://github.com/angelayi	2024-06-08 03:20:44 +00:00
PyTorch MergeBot	0ef5229569	Revert "Change lerp decomp to use aten.as_strided_copy instead of prims.copy_strided (#128030 )" This reverts commit fdf1666b20f63e4acf01798f009e478d997a7f7f. Reverted https://github.com/pytorch/pytorch/pull/128030 on behalf of https://github.com/nWEIdia due to breaking cuda12.1 test_cuda, see HUD https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor ([comment](https://github.com/pytorch/pytorch/pull/128030#issuecomment-2155764546))	2024-06-08 02:34:06 +00:00
Will Constable	f9508b4c1f	[pipelining] Update Pipelining Docs (#128236 ) ---- - Bring PipelineStage/Schedule more front-and-center - provide details on how to manually construct PipelineStage - move tracer example and manual example below so the high-level flow (e2e) is closer to the top Pull Request resolved: https://github.com/pytorch/pytorch/pull/128236 Approved by: https://github.com/H-Huang ghstack dependencies: #128201, #128228	2024-06-08 02:03:46 +00:00
Andrew Hoblitzell	fe74bbd6f0	init sigmoid comments (#127983 ) Fixes #127913 ### Description Add docstring to `torch/onnx/symbolic_opset9.py`:`sigmoid` function ### Checklist - [x] The issue that is being fixed is referred in the description - [x] Only one issue is addressed in this pull request - [x] Labels from the issue that this PR is fixing are added to this pull request - [x] No unnecessary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/127983 Approved by: https://github.com/xadupre	2024-06-08 01:48:00 +00:00
Ke Wen	921aa194c7	[pipelining] Move modify_graph_op_device to _IR.py (#128241 ) This part is more IR related. Thus moving from `PipelineStage` constructor to `pipe.build_stage(..., device, ...)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128241 Approved by: https://github.com/wconstab ghstack dependencies: #128240	2024-06-08 01:35:07 +00:00
Ke Wen	ad96f991a5	[pipelining] Add pipe.build_stage() (#128240 ) The `PipelineStage` name was given to the manual side, so this PR adds a method under `Pipe` to create a PipelineStage. Moved `PipeInfo` to utils.py to avoid a circular dependency between `_IR` and `PipelineStage`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128240 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2024-06-08 01:26:02 +00:00
Li-Huai (Allan) Lin	5ef081031e	[MPS] Include MPSGraphVenturaOps.h for complex types on macOS 12 (#127859 ) Fixes this on macOS 12: ``` /Users/qqaatw/Forks/pytorch/aten/src/ATen/native/mps/operations/FastFourierTransform.mm:108:60: error: use of undeclared identifier 'MPSDataTypeComplexFloat16'; did you mean 'MPSDataTypeFloat16'? (inputTensor.dataType == MPSDataTypeFloat16) ? MPSDataTypeComplexFloat16 : MPSDataTypeComplexFloat32; ^~~~~~~~~~~~~~~~~~~~~~~~~ MPSDataTypeFloat16 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127859 Approved by: https://github.com/kulinseth	2024-06-08 00:54:30 +00:00
Alnis Murtovi	647815049e	Inductor: Allow small sizes of m for mixed mm autotuning (#127663 ) For mixed mm with small sizes of m, such as in the example provided in #127056, being able to set BLOCK_M to 16 leads to better performance. This PR introduces kernel configs that are specific to mixed mm by extending the mm configs with two configs that work well for the example provided in #127056. I am excluding configs with (BLOCK_M=16, BLOCK_K=16, BLOCK_N=64) because triton crashes when this config is used. For the example in #127056: - Without my changes, skip_triton is evaluated to true which disables autotuning. On my machine I achieve 146GB/s. - If autotuning is enabled, but BLOCK_M>=32, I achieve 614 GB/s. - With the changes in this PR (i.e. autotuning enabled and BLOCK_M=16), I achieve 772 GB/s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127663 Approved by: https://github.com/Chillee	2024-06-08 00:46:16 +00:00
cyy	ef2b5ed500	[4/N] Remove unused functions (#128193 ) Follows #128179 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128193 Approved by: https://github.com/ezyang	2024-06-08 00:09:26 +00:00
Animesh Jain	39dd4740e6	[inductor][dynamo-inline-nn-modules] Fix test with inlining flag (#128200 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128200 Approved by: https://github.com/Skylion007 ghstack dependencies: #128001, #126578, #128158, #128172	2024-06-07 23:51:58 +00:00
Howard Huang	bef586111a	[pipelining] pipelining.rst updates (#128228 ) fix some nits and add `PipelineStage` (manual) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128228 Approved by: https://github.com/wconstab ghstack dependencies: #128201	2024-06-07 23:29:54 +00:00
Chirag Pandya	09cccbc1c7	[RFC] add per-collective timeout value in flight recorder (#128190 ) Summary: Add timeout value field on every collected record. Test Plan: Unit tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/128190 Approved by: https://github.com/wconstab	2024-06-07 23:29:35 +00:00
Catherine Lee	11f2d8e823	Move inductor cuda 124 jobs to a separate workflow that is not triggered by ciflow/inductor (#128250 ) https://github.com/pytorch/pytorch/pull/127825 The majority of the g5 runner usage comes from inductor (it's something like 2x everything else). In the past week, inductor ran about 1300 times on PRs and 300 times on main. Inductor-periodic ran 50 times on main, so the previous move from inductor -> inductor-periodic only results in 250 fewer runs. I was under the impression that cu124 is currently experimental and that eventually we'll need to switch to it, so this will stay until we switch or inductor uses far fewer runners. Are we expected to be able to handle two versions of CUDA in CI? Because currently we cannot, at least not comfortably. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128250 Approved by: https://github.com/huydhn	2024-06-07 23:01:52 +00:00
laithsakka	5b3624117a	update test_issue175 to handle inline_inbuilt_nn_modules (#128026 ) With inlining, the output graph has more function calls; reflect that in the test that counts the number of function calls. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128026 Approved by: https://github.com/anijain2305 ghstack dependencies: #127553	2024-06-07 22:07:16 +00:00
Xu Han	ba81c3c290	[inductor] add cpp builder code. (take 2) (#125849 ) Fully manual rebase of the code of PR: https://github.com/pytorch/pytorch/pull/124045 The old PR seems to have crashed due to too many commits and too many rebases. Please reference: https://github.com/pytorch/pytorch/pull/124045#issuecomment-2103744588 ------- It is the first step of RFC https://github.com/pytorch/pytorch/issues/124245. Changes: 1. Add cpp builder code; the new cpp_builder supports Windows OS. 2. Add a CPU ISA checker which is cross-OS and exported from backend cpuinfo. 3. Switch the compiler ISA checker to the new cpp builder. 4. CppCodeCache uses the new ISA checker. 5. Add a temporary `test_new_cpp_build_logical` UT to help with the transfer to the new code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125849 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-07 20:49:58 +00:00
dshi7	3a620a0f65	bug fix of dynamo_timed in cprofile (#128203 ) fb-only: "Entire Frame" was missing before this change. Before: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f565966006-TrainingApplication/20240527/rank_0/5_0_1/compilation_metrics_23.html After: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f569854578-TrainingApplication/20240606/rank_0/0_0_0/compilation_metrics_16.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/128203 Approved by: https://github.com/Chillee	2024-06-07 20:47:27 +00:00
Catherine Lee	8892ddaacc	[TD] Test removal on sm86 (#127131 ) Yolo I'm excited to break CI :') Pull Request resolved: https://github.com/pytorch/pytorch/pull/127131 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi	2024-06-07 20:19:18 +00:00
angelayi	fdf1666b20	Change lerp decomp to use aten.as_strided_copy instead of prims.copy_strided (#128030 ) aten.lerp decomposition causes prims::copy_strided to appear in the graph, which is not core aten. Internal ref: https://fb.workplace.com/groups/pytorch.edge.users/permalink/1525644288305859/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/128030 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2024-06-07 20:12:52 +00:00
Ke Wen	e647ea55a3	[pipelining] redirect README to document (#128205 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128205 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2024-06-07 19:34:52 +00:00
Howard Huang	dcb63fcedb	[pipelining] Remove num_microbatches from stage (#128201 ) This is similar to https://github.com/pytorch/pytorch/pull/127979, but instead of removing `num_microbatches` from schedule, we remove it from `PipelineStage`. This also means that during `PipelineSchedule` init we need to setup the buffers for the stage(s). Pull Request resolved: https://github.com/pytorch/pytorch/pull/128201 Approved by: https://github.com/kwen2501	2024-06-07 18:56:44 +00:00
Aaron Gokaslan	cafbcb6376	[BE]: Update ruff to 0.4.8 (#128214 ) Updates ruff to 0.4.8. Some minor fixes, but notably it is 10% faster on a microbenchmark and should further reduce local and CI runtime of the linter. Also includes a few bugfixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128214 Approved by: https://github.com/ezyang	2024-06-07 18:41:35 +00:00
Will Constable	8ca4cefc7d	[C10D] Ensure gil is not released when calling toPyBytes (#128212 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128212 Approved by: https://github.com/Skylion007, https://github.com/XilunWu	2024-06-07 18:24:10 +00:00
_daohang_	0a6df4fca6	delete inductor config.trace.compile_profile (#127143 ) https://fb.workplace.com/groups/257735836456307/posts/687858786777341/?comment_id=687861123443774&reply_comment_id=687865486776671 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127143 Approved by: https://github.com/Chillee	2024-06-07 18:05:50 +00:00
Xu Zhao	82d7a36a27	Added torchao nightly workflow (#128152 ) Summary: Add torchao benchmark workflow, upload the artifacts to GHA. X-link: https://github.com/pytorch/benchmark/pull/2273 Test Plan: ``` python run_benchmark.py torchao --ci ``` Differential Revision: D58140479 Pulled By: xuzhao9 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128152 Approved by: https://github.com/jerryzh168	2024-06-07 17:52:15 +00:00
Shunting Zhang	0c7f4353e5	[inductor] simplify indexing (#127661 ) This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002 We found that the cause of the bad perf for the int8_unpack kernel is sub-optimal indexing. In this PR we introduce 2 indexing optimizations: 1. expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2` will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chance to simplify loops for `x1 * 2048 + x2`. 2. merge ModularIndexing pairs: `ModularIndexing(ModularIndexing(x, 1, a), 1, b)` can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b. With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us). Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661 Approved by: https://github.com/jansel	2024-06-07 17:51:30 +00:00
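The two rewrites in the commit above can be sanity-checked numerically. The sketch below is not inductor code: it assumes `ModularIndexing(x, div, mod)` means `(x // div) % mod` on non-negative integers, and `modular_indexing` and the two `check_*` functions are hypothetical helpers that brute-force small ranges.

```python
def modular_indexing(x: int, div: int, mod: int) -> int:
    """Assumed semantics of ModularIndexing: floor-divide, then take the remainder."""
    return (x // div) % mod


def check_floordiv_expansion(max_x1: int = 8, max_x2: int = 2048) -> bool:
    """x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2 for non-negative integers."""
    return all(
        x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2
        for x1 in range(max_x1)
        for x2 in range(max_x2)
    )


def check_modular_merge(b: int = 8, multiples=(1, 2, 3, 5), max_x: int = 500) -> bool:
    """Nested modular indexing collapses when a is a multiple of b."""
    return all(
        modular_indexing(modular_indexing(x, 1, k * b), 1, b) == modular_indexing(x, 1, b)
        for k in multiples
        for x in range(max_x)
    )


print(check_floordiv_expansion(), check_modular_merge())
```

The first identity holds because `x1 * 2048` is even, so it passes through the floor division by 2 unchanged. The second holds because reducing modulo a multiple of `b` does not change the remainder modulo `b`.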
Animesh Jain	662a78f957	[dynamo] Inline the getattr of fx graph and proxy graph (#128172 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128172 Approved by: https://github.com/yanboliang ghstack dependencies: #128001, #126578, #128158	2024-06-07 17:14:58 +00:00
BowenBao	19b31d899a	Fix 'get_real_value' on placeholder nodes (#127698 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127698 Approved by: https://github.com/jansel ghstack dependencies: #127695, #127696	2024-06-07 17:13:43 +00:00
BowenBao	b741819b05	Fix 'get_attr' call in dynamo 'run_node' (#127696 ) Fixes #124858 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127696 Approved by: https://github.com/jansel ghstack dependencies: #127695	2024-06-07 17:13:43 +00:00
BowenBao	3aa623d407	Fix assume_constant_result for UnspecializedNNModuleVariable methods (#127695 ) Fixes #127509 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127695 Approved by: https://github.com/jansel	2024-06-07 17:13:43 +00:00
Zain Rizvi	754e6d4ad0	Make jobs with LF runners still pass lint (#128175 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128175 Approved by: https://github.com/huydhn	2024-06-07 17:13:04 +00:00
Xilun Wu	85758fa5ae	[c10d][TCPStore] make TCPStore server use libuv by default (#127957 ) Summary This PR switches the default TCPStore server backend to a new implementation that utilizes [`libuv`](https://github.com/libuv/libuv) for significantly lower initialization time and better scalability. We hope this improvement will give users a much shorter startup time in large-scale jobs. Eventually, we hope to fully replace the old TCPStore backend implementation with the libuv one. What it changes This PR changes the underlying TCPStore server backend to `libuv` if users don't explicitly specify to use the old TCPStore server. This change is not supposed to be noticeable to users except for significantly faster TCPStore startup for large-scale jobs. One thing to note is that we do not support the initialization approach where the user passes in a socket for the libuv backend. We plan to support it as a next step, but we chose to disable it until it is fully tested. If you are initializing TCPStore with this approach, see the next section to keep using the old TCPStore server. Fallback/Remain using the old TCPStore server For users who want to stay with the old TCPStore backend, there are 3 ways: 1. If the user is directly instantiating a TCPStore object, the user can pass in the argument `use_libuv=False` to use the old TCPStore server backend, e.g. `store = torch.distributed.TCPStore(..., use_libuv=False)`. 2. Or, specify the TCPStore backend option in `init_method` when calling the default ProcessGroup init, e.g. `torch.distributed.init_process_group(..., init_method="{YOUR_RENDEZVOUS_METHOD}://{YOUR_HOSTNAME}:{YOUR_PORT}?use_libuv=0")` 3. Or, the user can set the environment variable `USE_LIBUV` to `"0"` when launching. These 3 approaches are in order of precedence. 
That is, if the user specifies `use_libuv=0` in `init_method` and also sets the environment var `USE_LIBUV="1"`, the former will take effect and the TCPStore backend instantiated will be the old one instead of the one using libuv. Operating Systems Compatibility From the CI signals, we believe the new implementation has the same behavior as the old TCPStore server on all supported platforms. If you notice any behavior discrepancy, please file an issue with the `oncall: distributed` label. Test Plan `pytest test/distributed/test_store.py` note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time. `test/distributed/elastic/utils/distributed_test.py` TODO 1. Update the doc at - https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store - https://pytorch.org/docs/stable/distributed.html#tcp-initialization 2. Make torch elastic rendezvous use the libuv TCPStore as well. See `torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py` cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @kurman 3. Test if the libuv backend is okay with initialization with a socket. Change `LibUvTCPStoreTest::test_take_over_listen_socket`. 
Differential Revision: [D58259591](https://our.internmc.facebook.com/intern/diff/D58259591) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127957 Approved by: https://github.com/kurman ghstack dependencies: #127956	2024-06-07 16:53:01 +00:00
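The precedence order stated in the commit above (explicit `use_libuv` argument, then the `use_libuv` query option in `init_method`, then the `USE_LIBUV` environment variable, then the new libuv default) can be sketched as a small resolver. This is an illustration only: `resolve_use_libuv` is a hypothetical helper, not a PyTorch API, and the host and port in the usage lines are made up.

```python
from urllib.parse import parse_qs, urlparse


def resolve_use_libuv(explicit=None, init_method=None, environ=None) -> bool:
    """Hypothetical resolver mirroring the precedence described in the commit."""
    # 1. An explicit use_libuv argument wins over everything else.
    if explicit is not None:
        return bool(explicit)
    # 2. Next, a use_libuv option in the init_method query string.
    if init_method:
        query = parse_qs(urlparse(init_method).query)
        if "use_libuv" in query:
            return query["use_libuv"][0] == "1"
    # 3. Then the USE_LIBUV environment variable.
    env = environ if environ is not None else {}
    if "USE_LIBUV" in env:
        return env["USE_LIBUV"] == "1"
    # 4. Otherwise the libuv backend is the default.
    return True


# init_method asks for the old backend, the environment asks for libuv:
# the init_method option takes effect, as in the commit's example.
print(resolve_use_libuv(init_method="tcp://localhost:29500?use_libuv=0",
                        environ={"USE_LIBUV": "1"}))
```

Passing the environment as a plain mapping keeps the sketch side-effect free; a real caller would presumably pass `os.environ`.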
Xilun Wu	6c824cd9fb	[BE][c10d] fix use of TORCH_ERROR in TCPStore libuv backend (#127956 ) Summary The use of TORCH_ERROR in TCPStore libuv backend code needs update. Differential Revision: [D58259589](https://our.internmc.facebook.com/intern/diff/D58259589) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127956 Approved by: https://github.com/shuqiangzhang, https://github.com/cyyever	2024-06-07 16:53:01 +00:00
Howard Huang	b9b89ed638	[pipelining] fix LoopedBFS (#127796 ) # Issues Currently two issues need to be fixed with LoopedBFS: 1. The wrap-around send operation to the looped-around stage blocks and will cause a hang. For some reason this doesn't surface on a single node, but on multihost it surfaces as a hang. 2. When microbatches are popped off in `backward_one_chunk`, it automatically uses the `bwd_chunk_id` starting from 0. This works for interleaved 1f1b and 1f1b, but for loopedBFS we want to pop starting from `num_microbatches - 1`. Same needs to be fixed for gpipe? # Changes - Update LoopedBFS implementation to share `_step_microbatches` with `Interleaved1F1B` - Also share the tests between the two schedules for varying num_microbatches, local_stages, and world_sizes - Update `backward_one_chunk` to optionally take a `bwd_chunk_id` argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127796 Approved by: https://github.com/wconstab	2024-06-07 16:46:38 +00:00
Mu-Chu Lee	d9696ea624	[AOTInductor] [Tooling] Update NaN and INF Checker for AOTInductor (#127574 ) Summary: 1. Integrate NaN and INF checker with existing config, controllable by env var. 2. Move inject point of NaN & INF checker earlier, this could prevent buffer freeing before check. 3. Inject debugging code in Kernel level, which prevents us trying to read buffers that are fused inplace and into a single kernel. Test Plan: Debugging utility. Test and check by existing tests with env var: ``` TORCHINDUCTOR_NAN_ASSERTS=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 python test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCuda.test_seq_non_abi_compatible_cuda ``` Reviewed By: ColinPeppler Differential Revision: D57989176 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127574 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-06-07 16:46:26 +00:00
Prachi Gupta	fc6e3ff96d	[ROCm] Update triton pin to fix libtanh issue (#125396 ) There were some internal build issues related to tanh when we moved to upstream triton in ROCm. These issues were fixed by the following triton commit: https://github.com/triton-lang/triton/pull/3810 . This PR moves the triton pin to incorporate that change. Added some skips for unit tests that regressed due to the triton commit bump in this PR. Needs https://github.com/pytorch/pytorch/pull/127968 since this PR introduces a triton dependency on llnl-hatchet, which doesn't have py3.12 wheels available currently. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125396 Approved by: https://github.com/pruthvistony, https://github.com/malfet	2024-06-07 16:23:04 +00:00
PyTorch MergeBot	128952625b	Revert "Added memory budget to partitioner (#126320 )" This reverts commit 2184cdd29128a924583e4702489177f83fb8270a. Reverted https://github.com/pytorch/pytorch/pull/126320 on behalf of https://github.com/ZainRizvi due to The new test_ac.py fails on ROCm machines ([comment](https://github.com/pytorch/pytorch/pull/126320#issuecomment-2155141886))	2024-06-07 16:15:03 +00:00
cyy	c219fa5eb9	[3/N] Remove unused functions (#128179 ) Following https://github.com/pytorch/pytorch/pull/128005, this PR continues to remove unused functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128179 Approved by: https://github.com/ezyang	2024-06-07 16:13:16 +00:00
Sam Larsen	8d16a73f0f	Manipulate triton_hash_with_backend so that it doesn't contain any keywords (#128159 ) Summary: See https://github.com/pytorch/pytorch/issues/127637 where "def" appears in the backend_hash and causes a problem. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128159 Approved by: https://github.com/jansel	2024-06-07 16:10:44 +00:00
Sam Larsen	852b7b4c99	[inductor] Enable subprocess-based parallel compile as the default (#126817 ) Differential Revision: [D58239826](https://our.internmc.facebook.com/intern/diff/D58239826) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126817 Approved by: https://github.com/eellison ghstack dependencies: #128037, #128086	2024-06-07 16:10:11 +00:00
PyTorch MergeBot	ac51f782fe	Revert "Complete revamp of float/promotion sympy handling (#126905 )" This reverts commit 2f7cfecd86009a9d396fdbdcdfb4ba7a005db16b. Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/atalman due to Sorry need to revert - failing internally ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2155118778))	2024-06-07 16:01:46 +00:00
PyTorch MergeBot	23c156cd2d	Revert "[inductor] simplify indexing (#127661 )" This reverts commit 901226ae837bd4629b34735c84a3481c4988bb5b. Reverted https://github.com/pytorch/pytorch/pull/127661 on behalf of https://github.com/atalman due to Sorry reverting because in conflict with https://github.com/pytorch/pytorch/pull/126905 which needs to be reverted, will be relanding it ([comment](https://github.com/pytorch/pytorch/pull/127661#issuecomment-2155115388))	2024-06-07 15:58:36 +00:00
cyy	a1b664adeb	Add default values to PyTorchMemEffAttention::AttentionKernel::Params members (#112215 ) Default values were added to Params in order to eliminate CUDA warnings like ``` and the implicitly-defined constructor does not initialize ‘PyTorchMemEffAttention::AttentionKernel<float, cutlass::arch::Sm80, true, 64, 64, 64, true, true>::accum_t PyTorchMemEffAttention::AttentionKernel<float, cutlass::arch::Sm80, true, 64, 64, 64, true, true>::Params::scale’ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/112215 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-06-07 15:54:07 +00:00
Ke Wen	3090667cf9	[pipelining] pipeline() taking microbatch as example input (#128163 ) Changed the API of `pipeline()` to take microbatch instead of full batch as example args. Main purpose is to: - make this API more atomic; - decouple tracing frontend from runtime info like `num_chunks`. Side effects: - Creates opportunity for varying `num_chunks` of schedules with the same `pipe` object. - User has to create example microbatch input. - Chunk spec stuff are now all moved to runtime side. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128163 Approved by: https://github.com/H-Huang	2024-06-07 15:51:53 +00:00
PyTorch MergeBot	224b4339e5	Revert "Make ValueRange repr less chatty by default (#128043 )" This reverts commit f0dd11df5534ae074ad2d090e6700576a22719d6. Reverted https://github.com/pytorch/pytorch/pull/128043 on behalf of https://github.com/atalman due to Sorry reverting because in conflict with [#126905](https://github.com/pytorch/pytorch/pull/126905) which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/128043#issuecomment-2155091732))	2024-06-07 15:43:39 +00:00
James Wu	6e75024ff0	Run TestAOTAutograd with dynamo (#128047 ) My goal is to run these tests with the autograd cache on, but first I want them running with dynamo. These tests already caught an interesting issue so I thought it would be helpful to just have them. Next up I'll have a second subclass of these tests, run them twice, and expect a cache hit the second time from autograd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128047 Approved by: https://github.com/ezyang	2024-06-07 15:42:28 +00:00
GdoongMathew	771be55bb0	Documenting `torch.onnx.operator.shape_as_tensor` (#128051 ) Fixes #127890 This PR adds docstring to the `torch.onnx.operator.shape_as_tensor` function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128051 Approved by: https://github.com/xadupre	2024-06-07 15:20:18 +00:00
zabboud	3f9798a4fd	add docstring to masked_fill, expand, select, unsqueeze, cat fns (#128055 ) Fixes #127891 Fixes #127893 Fixes #127894 Fixes #127907 Fixes #127910 ## Description Add docstring to `masked_fill`, `expand`, `select`, `unsqueeze`, and `cat` functions in torch.onnx.symbolic_opset9.py remaining pydocstyle errors: 257 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128055 Approved by: https://github.com/xadupre	2024-06-07 15:17:22 +00:00
Howard Huang	543a870943	[pipelining] Rename ManualPipelineStage -> PipelineStage (#128157 ) Renaming ManualPipelineStage to remove the "Manual" part. I needed to replace the existing `PipelineStage` which takes in the `pipe` argument, so I have renamed that to `TracerPipelineStage`. @kwen2501 will remove this entirely in favor of adding a util to `Pipe` to just create the stage directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128157 Approved by: https://github.com/wconstab	2024-06-07 09:24:16 +00:00
Will Feng	5f81265572	[Traceable FSDP2] Return early from _register_post_backward_hook when compile (#127864 ) Dynamo doesn't support `RegisterPostBackwardFunction` very well yet. This PR skips it and relies on `root_post_backward_callback` under compile. We will improve `RegisterPostBackwardFunction` support in Q3. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127864 Approved by: https://github.com/awgu	2024-06-07 09:19:07 +00:00
chunyuan	7efaeb1494	[AOTI] docs: add suggestion to turn on freezing on CPU (#128010 ) With https://github.com/pytorch/pytorch/pull/124350 landed, it is now suggested in AOTI to turn on freezing on CPU to get better performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128010 Approved by: https://github.com/desertfire	2024-06-07 08:57:02 +00:00
Pian Pawakapan	0c16800b4a	[pipelining] include lifted constants in input_to_state (#128173 ) Previous PR only looked at state dict to determine inputs to state, missing out on lifted tensors Pull Request resolved: https://github.com/pytorch/pytorch/pull/128173 Approved by: https://github.com/kwen2501	2024-06-07 08:40:54 +00:00
Ke Wen	01601ebd41	Retire torch.distributed.pipeline (#127354 ) Actually retiring module after deprecation warning for a while. The new supported module is: torch.distributed.pipelining. Please migrate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354 Approved by: https://github.com/wconstab	2024-06-07 08:11:58 +00:00
Jason Ansel	70724bdbfe	Bugfix for nondeterministic torch_key (#128111 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128111 Approved by: https://github.com/oulgen	2024-06-07 07:17:39 +00:00
Simon Fan	00c6ca4459	[compiled autograd][cudagraphs] Inputs runtime wrapper to move cpu scalars to cuda (#125382 ) CPU scalars are most commonly used for the philox random seed. Right now, any CPU input will skip cudagraphing the entire graph. We need both the traced graph and the runtime inputs to be cudaified. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125382 Approved by: https://github.com/jansel	2024-06-07 07:12:46 +00:00
Ke Wen	190f06d468	[pipelining] Lower _configure_data_parallel_mode to stage (#127946 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127946 Approved by: https://github.com/wconstab ghstack dependencies: #127935	2024-06-07 07:06:23 +00:00
Will Feng	a448b3ae95	[Traceable FSDP2] Check hasattr('fsdp_pre_all_gather') only when not compile (#127855 ) Dynamo doesn't support `hasattr(inner_tensor, "fsdp_post_all_gather")` yet. We will work on this support in Q3. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127855 Approved by: https://github.com/awgu	2024-06-07 06:36:40 +00:00
Sun, Jiayi	2ff312359c	skip hf_T5_generate in dynamic shape test (#121129 ) As reported in https://github.com/pytorch/pytorch/issues/119434, `hf_T5_generate` fails with dynamic shape testing, so this PR skips the dynamic batch size testing of this model. * Error msg is ``` File "/home/jiayisun/pytorch/torch/_dynamo/guards.py", line 705, in SHAPE_ENV guards = output_graph.shape_env.produce_guards( File "/home/jiayisun/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3253, in produce_guards raise ConstraintViolationError( torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs_tensor'].size()[0])! For more information, run with TORCH_LOGS="+dynamic". - Not all values of RelaxedUnspecConstraint(L['inputs_tensor'].size()[0]) are valid because L['inputs_tensor'].size()[0] was inferred to be a constant (4). ``` * Root cause: This error happens while creating the guard for this [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L561): `scores += position_bias_masked` I ran it with TORCH_LOGS="+dynamic" and got the key line: `I0305 00:21:00.849974 140376923287424 torch/fx/experimental/symbolic_shapes.py:3963] [6/0_1] eval Eq(s0, 4) [guard added] at miniconda3/envs/pt2/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py:561 in forward (_refs/__init__.py:403 in _broadcast_shapes)` The reason for this error is that the batch dimension of `inputs_tensor` in the dynamic batch size test is marked as dynamic shape `s0`, so the batch dimension of `scores` generated by a series of operations with `inputs_tensor` is also `s0`. However, because the function creating `attention_mask` runs in Python rather than in Dynamo, the batch dimension of `attention_mask` is the real shape `4`, and the batch dimension of `position_bias_masked` generated by a series of operations with `attention_mask` is also the real shape `4`, not the dynamic shape `s0`. 
The line `scores += position_bias_masked` requires creating a guard that checks whether the batch dimension of `scores` always equals the batch dimension of `position_bias_masked`, i.e. Eq(s0, 4), and that is where the error happens. So the root cause is that the function creating `attention_mask` runs in Python rather than in Dynamo. It is not in Dynamo because Dynamo has a graph break on this function (at this [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L476): `is_pad_token_in_inputs = (pad_token_id is not None) and (pad_token_id in inputs)`) due to the following error: `torch._dynamo.exc.Unsupported: Tensor.item` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121129 Approved by: https://github.com/leslie-fang-intel, https://github.com/ezyang	2024-06-07 06:28:29 +00:00
Stonepia	d943357a21	[XPU] Add xpu support of `make triton` (#126513 ) This PR is to add XPU support for `make triton`. If a user wishes to use Triton with XPU support, the user needs to install the [intel-xpu-backend-for-triton](https://github.com/intel/intel-xpu-backend-for-triton). This PR allows the user to easily install Triton for xpu backend support: ``` # clone the pytorch repo export USE_XPU=1 make triton ``` The XPU version of triton will always be built from the source. It will cat the commit id from `.ci/docker/ci_commit_pins/triton-xpu.txt`, for example, `b8c64f64c18d8cac598b3adb355c21e7439c21de`. So the final call would be like: ``` pip install --force-reinstall "git+https://github.com/intel/intel-xpu-backend-for-triton@b8c64f64c18d8cac598b3adb355c21e7439c21de#subdirectory=python" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126513 Approved by: https://github.com/EikanWang, https://github.com/atalman	2024-06-07 06:25:47 +00:00
laithsakka	68cc63ae27	introduce skipIfNNModuleInlined and skip test_cpu_cuda_module_after_dynamo (#128023 ) See https://github.com/pytorch/pytorch/issues/127636 for details. TL;DR: when inlining is enabled, we create a fake tensor while tracing in dynamo and try to perform aten.add.Tensor between two tensors on different devices; without inlining we do not hit that operation during tracing. ``` Failed running call_function <built-in function add>((FakeTensor(..., size=(20, 20), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(20, 20))), *{}): Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices cpu, cuda:0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128023 Approved by: https://github.com/anijain2305 ghstack dependencies: #127487, #127553	2024-06-07 06:00:33 +00:00
laithsakka	7e48d6a497	reset dynamo in test_do_not_skip_side_effects unit test loop to avoid dynamo cache limit hit (#127487 ) Fixes https://github.com/pytorch/pytorch/issues/127483 When nn module inlining is enabled, all recompilations are counted against the same frame, so we hit the cache limit in test_do_not_skip_side_effects. Without inlining, hitting a new Object Model is not considered a recompilation, as explained in https://github.com/pytorch/pytorch/issues/127483. For this test we do not really care about cache size, so I reset dynamo in the main loop. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127487 Approved by: https://github.com/anijain2305	2024-06-07 06:00:33 +00:00
Sam Larsen	dc8e3c2e90	[inductor] subproc parallel compile: initialize future before sending work to the pool (#128086 ) Summary: I got reports of intermittent failures in CI and the logs show errors like this: ``` CRITICAL:concurrent.futures:Future 139789013754560 in unexpected state: FINISHED ``` I can't repro locally, but seems clear that we should initialize the future _before_ sending work to the subprocess pool since it could finish before we call set_running_or_notify_cancel() Differential Revision: [D58239829](https://our.internmc.facebook.com/intern/diff/D58239829) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128086 Approved by: https://github.com/jansel ghstack dependencies: #128037	2024-06-07 04:17:35 +00:00
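A minimal standard-library sketch of the ordering issue this commit describes. The names `submit_fixed`, `_work`, and `racy_order_fails` are hypothetical, and a plain thread stands in for the subprocess pool; this is an illustration under those assumptions, not the inductor code.

```python
import threading
from concurrent.futures import Future


def _work(future: Future, x: int) -> None:
    # Stand-in for the worker side: completes the future with a result.
    future.set_result(x * 2)


def submit_fixed(x: int) -> Future:
    future: Future = Future()
    # Move the future to RUNNING *before* any work is dispatched, so a fast
    # worker can never finish it while it is still PENDING.
    future.set_running_or_notify_cancel()
    threading.Thread(target=_work, args=(future, x)).start()
    return future


def racy_order_fails() -> bool:
    # Simulate the worker finishing first: marking an already-finished
    # future as running raises RuntimeError and logs a CRITICAL message.
    future: Future = Future()
    future.set_result(1)
    try:
        future.set_running_or_notify_cancel()
    except RuntimeError:
        return True
    return False
```

`racy_order_fails()` reproduces the "unexpected state: FINISHED" condition deterministically, whereas in the reported CI failures it depended on timing.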
Sam Larsen	6a2bf48cfa	[inductor] subproc parallel-compile: start thread last in init (#128037 ) Summary: Observed on an internal workload: the helper thread started and attempted to access member variables before they were initialized. Differential Revision: [D58239827](https://our.internmc.facebook.com/intern/diff/D58239827) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128037 Approved by: https://github.com/Skylion007, https://github.com/eellison	2024-06-07 04:17:35 +00:00
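The "start thread last" pattern from this commit can be sketched as follows. `Helper` and its attributes are hypothetical names chosen for illustration; the point is only that every attribute the thread reads is assigned before `start()` is called.

```python
import threading


class Helper:
    """Hypothetical helper whose background thread reads instance state."""

    def __init__(self) -> None:
        self.ready = threading.Event()
        self.seen = []
        self.items = [1, 2, 3]
        # Start the thread only after every attribute it reads is set;
        # starting earlier could let _run() hit a missing attribute.
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self) -> None:
        # Reads self.items and self.seen, both assigned before start().
        self.seen.extend(self.items)
        self.ready.set()
```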
Sam Larsen	e8e0bdf541	[inductor] parallel-compile: call triton_key() before forking (#127639 ) Summary: A user reported severe slowdown on a workload when using parallel compile. The issue is that in some environments, the process affinity changes after forking such that all forked subprocesses use a single logical processor. Described here: https://github.com/pytorch/pytorch/issues/99625. That requires a separate fix, but during debugging we noticed that we can at least optimize the expensive call to triton_key() before forking. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127639 Approved by: https://github.com/eellison, https://github.com/anijain2305	2024-06-07 04:12:57 +00:00
Ke Wen	96806b1777	[pipelining][doc] Add frontend description and change tracer example (#128070 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128070 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2024-06-07 04:09:36 +00:00
Wanchao Liang	3df53c2a8f	[dtensor] directly return local_tensor under no_grad (#128145 ) as titled, skip the autograd function and directly return the local_tensor if it's under no_grad context, this would avoid creating views Pull Request resolved: https://github.com/pytorch/pytorch/pull/128145 Approved by: https://github.com/awgu ghstack dependencies: #128112	2024-06-07 04:01:47 +00:00
Animesh Jain	747fc35ff5	[dynamo] Support if cond on UnspecializedNNModuleVariable and add inline tests (#128158 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128158 Approved by: https://github.com/jansel ghstack dependencies: #128001, #126578	2024-06-07 03:50:33 +00:00
Aidyn-A	5e5bbdb35e	[DDP] Bucket handling: make first bucket size equal to bucket_cap_mb if it was set (#121640 ) The first DDP bucket is always created with size `dist._DEFAULT_FIRST_BUCKET_BYTES` (1 MiB) by default, regardless of `bucket_cap_mb`. The proposal is to use `bucket_cap_mb` as the single bucket size, including for the first bucket, if the user supplied it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121640 Approved by: https://github.com/wanchaol	2024-06-07 03:33:33 +00:00
Ke Wen	4d0ece8196	[pipelining] Consolidate chunk counting between stage and schedule (#127935 ) We used to have two backward chunk id counting systems, one at schedule level, the other at stage level. (Which makes safety dependent on the two advancing hand-in-hand.) This PR consolidates the counting system to the schedule side only, which would pass `mb_index` to the following stage calls: `forward_one_chunk` `backward_one_chunk` `get_bwd_send_ops` ... Pull Request resolved: https://github.com/pytorch/pytorch/pull/127935 Approved by: https://github.com/H-Huang	2024-06-07 03:33:18 +00:00
Brian Hirsh	476bfe6cce	fix torch.compile with triton kernels under inference_mode (#124489 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124489 Approved by: https://github.com/albanD	2024-06-07 03:29:37 +00:00
Pian Pawakapan	50155e825b	[export] provide refine function for automatically accepting dynamic shapes suggested fixes (#127436 ) Summary: Part of the work helping export's automatic dynamic shapes / dynamic shapes refining based on suggested fixes. Introduces a util function refine_dynamic_shapes_from_suggested_fixes() that takes the error message from a ConstraintViolationError message containing suggested dynamic shapes fixes, along with the original dynamic shapes spec, and returns the new spec. Written so that the suggested fixes from export can be directly parsed and used. Example usage for the automatic dynamic shapes workflow: ``` # export, fail, parse & refine suggested fixes, re-export try: export(model, inps, dynamic_shapes=dynamic_shapes) except torch._dynamo.exc.UserError as exc: new_shapes = refine_dynamic_shapes_from_suggested_fixes(exc.msg, dynamic_shapes) export(model, inps, dynamic_shapes=new_shapes) ``` For examples of behavior, see the added test and docstring. Will take suggestions for renaming the function to something else 😅 Test Plan: test_export tests Differential Revision: D57409142 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127436 Approved by: https://github.com/avikchaudhuri	2024-06-07 03:29:06 +00:00
Mikayla Gawarecki	65aa16f968	Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814 )" (#128170 ) https://github.com/pytorch/pytorch/issues/128165 :( This reverts commit a7b1dd82ff3063894fc665ab0c424815231c10e6. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128170 Approved by: https://github.com/drisspg, https://github.com/albanD	2024-06-07 01:44:14 +00:00
GdoongMathew	f99409903c	Documenting `torch.distributions.utils.clamp_probs` (#128136 ) Fixes https://github.com/pytorch/pytorch/issues/127889 This PR adds docstring to the `torch.distributions.utils.clamp_probs` function. Co-authored-by: Svetlana Karslioglu <svekars@meta.com> Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128136 Approved by: https://github.com/janeyx99, https://github.com/svekars, https://github.com/malfet	2024-06-07 00:49:41 +00:00
Oguz Ulgen	740cd0559f	Filter non input symexprs from codecache guards (#128052 ) Summary: Dynamo lifts all symexprs that appear in the inputs to top level which means that we do not need to look at guards that contain symexprs that do not appear in the inputs. Prune them. Test Plan: added two new tests Differential Revision: D58200476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128052 Approved by: https://github.com/ezyang, https://github.com/masnesral	2024-06-07 00:48:49 +00:00
Kazuaki Ishizaki	117ab34891	Documenting the torch.utils.collect_env.get_pretty_env_info function (#128123 ) Fixes #127888 This PR adds docstring to the `torch.utils.collect_env.get_pretty_env_info` function Pull Request resolved: https://github.com/pytorch/pytorch/pull/128123 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-06-07 00:43:18 +00:00
Shunting Zhang	901226ae83	[inductor] simplify indexing (#127661 ) This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002 We found that the cause of bad perf for the int8_unpack kernel is sub-optimal indexing. In this PR we introduce 2 indexing optimizations: 1. expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2` will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chance to simplify loops for `x1 * 2048 + x2`. 2. merge ModularIndexing pairs: `ModularIndexing(ModularIndexing(x, 1, a), 1, b)` can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b. With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us). Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661 Approved by: https://github.com/jansel	2024-06-06 23:57:45 +00:00
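The two rewrites in this commit can be sanity-checked by brute force in plain Python. This sketch assumes `ModularIndexing(x, div, mod)` means `(x // div) % mod` on non-negative integers; `modular_indexing` and the two `check_*` helpers are hypothetical names, not inductor APIs.

```python
def modular_indexing(x: int, div: int, mod: int) -> int:
    # Assumed semantics of ModularIndexing: (x // div) % mod.
    return (x // div) % mod


def check_floordiv_expansion(limit: int = 64) -> bool:
    # x1 * 1024 + x2 // 2  ==  (x1 * 2048 + x2) // 2 for non-negative ints.
    return all(
        x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2
        for x1 in range(limit)
        for x2 in range(limit)
    )


def check_modular_merge(a: int, b: int, limit: int = 1000) -> bool:
    # Nested ModularIndexing collapses to the outer one when a % b == 0.
    return all(
        modular_indexing(modular_indexing(x, 1, a), 1, b)
        == modular_indexing(x, 1, b)
        for x in range(limit)
    )
```

For instance, `check_modular_merge(12, 4)` holds while `check_modular_merge(10, 4)` does not, which is why the merge requires `a` to be a multiple of `b`.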
Animesh Jain	7ede78f9f5	[dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578 ) Tracing through `__init__` is important because it initializes (calls STORE_ATTR) on members. By doing that, we kick in the mutation tracking for these objects. So, things like mutating `_modules` etc is tracked automatically. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578 Approved by: https://github.com/jansel ghstack dependencies: #128001	2024-06-06 23:05:49 +00:00
Animesh Jain	e5b3387166	[dynamo] Bugfix for nn parameter construction (#128001 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128001 Approved by: https://github.com/jansel	2024-06-06 23:05:49 +00:00
brightonanc	6dfdce92ba	Fixed typos in the complex numbers portion of the autograd docs (#127948 ) This PR fixes several typos in the complex numbers section of the docs for autograd. Only documentation was altered. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127948 Approved by: https://github.com/soulitzer	2024-06-06 22:47:04 +00:00
Jiashen Cao	56a3d276fe	Handle custom op during TorchScript to ExportedProgram conversion (#127580 ) #### Description Handle custom ops during TorchScript to ExportedProgram conversion ```python torch.library.define( "mylib::foo", "(Tensor x) -> Tensor", lib=lib, ) # PyTorch custom op implementation @torch.library.impl( "mylib::foo", "CompositeExplicitAutograd", lib=lib, ) def foo_impl(x): return x + x # Meta function of the custom op. @torch.library.impl_abstract( "mylib::foo", lib=lib, ) def foo_meta(x): return x + x class M(torch.nn.Module): def forward(self, x): return torch.ops.mylib.foo(x) ``` #### Test Plan * Add a test case where custom op is called and converted. `pytest test/export/test_converter.py -s -k test_ts2ep_converter_custom_op` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127580 Approved by: https://github.com/angelayi	2024-06-06 22:06:51 +00:00
joncrall	80fa2778ed	Update types for verbose in lr_scheduler (#127943 ) I'm currently locked into jsonargparse version 4.19.0, and it complains when used in combination with LightningCLI (v2.0.8). This is because it cares about the types declared in google style docstrings. This causes a problem when it tries to parse how it should cast arguments to construct an instance of an LRScheduler class because the docstrings declare the "verbose" parameter as a bool, but the defaults recently changed to a string "deprecated". This means the type should really be `bool \| str`. This PR adds a `\| str` to the docstring type in each learning rate scheduler class. This will prevent jsonargparse from complaining. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127943 Approved by: https://github.com/janeyx99	2024-06-06 21:59:22 +00:00
Chirag Pandya	0a761f0627	[RFC] Provide optional switches to _dump_nccl_trace (#127651 ) Summary: Data from PyTorch distributed is mostly useful during initial stages of model development. Provide options to reduce data sent/dumped. `_dump_nccl_trace` takes 3 optional switches. Default as before returns everything - `includeCollectives`: option to also include collectives: Default is True. - `includeStacktraces`: option to include stack traces in collectives. Default is True. - `onlyActive`: option to only send active collective work - i.e. not completed. Default is False (i.e. send everything) Test Plan: Unit tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/127651 Approved by: https://github.com/wconstab	2024-06-06 21:59:09 +00:00
Eddie Yan	54fe2d0e89	[cuDNN][quantization] skip qlinear test in cuDNN v9.1.0 (#128166 ) #120006 only very recently unskipped this test 3 days ago so we don't consider it a blocker for cuDNNv9 for now CC @atalman Pull Request resolved: https://github.com/pytorch/pytorch/pull/128166 Approved by: https://github.com/atalman, https://github.com/nWEIdia	2024-06-06 21:43:29 +00:00
Andrea Frittoli	04272a0e12	Add docstring for the torch.ao.quantization.utils.get_combined_dict function (#128127 ) Fixes: #127906 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128127 Approved by: https://github.com/jerryzh168	2024-06-06 21:22:09 +00:00
Howard Huang	baaa914bf7	[small] test clean up (#128079 ) remove unnecessary line: https://github.com/pytorch/pytorch/issues/123733 add main so test can be run `python ...`: https://github.com/pytorch/pytorch/issues/124906 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128079 Approved by: https://github.com/awgu	2024-06-06 21:21:40 +00:00
Andrew M. James	9554300436	[inductor][codegen] Codegen constexpr globals and constexpr annotated globals correctly. (#126195 ) [Triton #3762](https://github.com/triton-lang/triton/pull/3762) disallows access to globals which are not `tl.constexpr`. Triton has always treated captured globals this way, but it now requires that this be explicit in user code. Updated codegen to make sure these variables are defined before writing the kernel source when compiling a user defined triton kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126195 Approved by: https://github.com/alexbaden, https://github.com/bertmaher	2024-06-06 20:50:11 +00:00
chilli	2184cdd291	Added memory budget to partitioner (#126320 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126320 Approved by: https://github.com/shunting314	2024-06-06 20:32:29 +00:00
atalman	7e059b3c95	Add a call to validate docker images after build step is complete (#127768 ) Adds validation to docker images. As discussed here: https://github.com/pytorch/pytorch/issues/125879 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127768 Approved by: https://github.com/huydhn, https://github.com/Skylion007	2024-06-06 20:25:39 +00:00
Masahiro Hiramori	e8670f6aea	[Dynamo][TVM] Support macOS and Linux/aarch64 platforms (#128124 ) Fixes #128122 With this fix, I've confirmed that the repro works on the platforms below. - macOS 14.5 (arm64) - Ubuntu 20.04.6 LTS (GNU/Linux 5.10.120-tegra aarch64) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128124 Approved by: https://github.com/malfet	2024-06-06 19:47:11 +00:00
Eddie Yan	de4f8b9946	[BE]: Update cudnn to 9.1.0.70 (#123475 ) cuDNN has managed to upload cu11 and cu12 wheels for ~~9.0.0.312~~ 9.1.0.70, so trying this out... CC @Skylion007 @malfet Co-authored-by: Wei Wang <weiwan@nvidia.com> Co-authored-by: atalman <atalman@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123475 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/nWEIdia, https://github.com/atalman	2024-06-06 18:45:22 +00:00
Catherine Lee	fba21edf5b	[CI] Ensure inductor/test_cpu_cpp_wrapper is actually run in inductor_cpp_wrapper_abi_compatible (#126717 ) `inductor/test_cpu_cpp_wrapper` is not actually being run in `inductor_cpp_wrapper_abi_compatible` test config The cpu device type gets removed in `d28868c7e8/torch/testing/_internal/common_device_type.py (L733)` so `d28868c7e8/test/inductor/test_cpu_cpp_wrapper.py (L396)` returns false. Feel free to make a PR with a different way to do this (a better RUN_CPU check?) Add a skip for a failing test. I am not equipped to fix it Pull Request resolved: https://github.com/pytorch/pytorch/pull/126717 Approved by: https://github.com/ZainRizvi	2024-06-06 18:23:52 +00:00
Catherine Lee	936225d7b2	[mergebot] Fix pending unstable jobs being viewed as failed (#128080 ) https://github.com/pytorch/pytorch/pull/128038#issuecomment-2150802030 In the above, pending unstable jobs get put into the ok_failed_checks list, and because there are a lot of unstable jobs, it exceeds the threshold and merge fails. I don't think unstable jobs should be considered in the ok failed checks threshold, only flaky and broken trunk jobs should be considered there. Change looks big, but main thing is that unstable jobs don't get included in the check for how many flaky failures there are. The other changes are mostly renames so things are clearer Pull Request resolved: https://github.com/pytorch/pytorch/pull/128080 Approved by: https://github.com/huydhn	2024-06-06 18:22:20 +00:00
Andrew Gu	32fb68960e	[FSDP2] Added experimental warning to `unshard` API (#128138 ) There is still ongoing discussion on how this API should work. Current approach: - The pre-all-gather ops run in the default stream and the all-gather is called from the default stream with `async_op=True`. - Pros: - The all-gather input and output tensors are allocated in the default stream, so there is no increased memory fragmentation across stream pools. - There is no need for additional CUDA synchronization. The API is self-contained. - Cons: - The pre-all-gather ops (e.g. cast from fp32 -> bf16 and all-gather copy-in device copies) cannot overlap with other default stream compute. The biggest concern here is for CPU offloading, the H2D copies cannot overlap. Alternative approach: - Follow the default implicit prefetching approach, where the pre-all-gather ops and all-gather run in separate streams. - Pros: - The pre-all-gather ops can overlap with default stream compute. - Cons: - We require an API that should be called after the last optimizer step (namely, last op that modified sharded parameters) and before the first `unshard` call that has the all-gather streams wait for the default stream. The API is no longer self-contained and now has a complementary API. - The all-gather input and output tensors are allocated in separate streams (not the default stream), so there can be increased memory fragmentation across pools. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128138 Approved by: https://github.com/wanchaol ghstack dependencies: #128100	2024-06-06 18:18:42 +00:00
laithsakka	78a6b0c479	update test_reformer_train test to handle nn module inlining (#127467 ) number of call nodes increase due to inlining before inlining: ``` class GraphModule(torch.nn.Module): def forward(self, function_ctx, cat: "f32[1, s0, 512]"): # No stacktrace found for following nodes _set_grad_enabled = torch._C._set_grad_enabled(False) # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:283 in backward, code: grad_attn_output, grad_hidden_states = torch.chunk( chunk = torch.chunk(cat, 2, dim = -1); cat = None getitem: "f32[1, s0, 256]" = chunk[0] getitem_1: "f32[1, s0, 256]" = chunk[1]; chunk = None # No stacktrace found for following nodes _set_grad_enabled_1 = torch._C._set_grad_enabled(True) return (getitem_1, None) ``` after inlining: ``` class GraphModule(torch.nn.Module): def forward(self, s0: "Sym(s0)", L_hidden_states_: "f32[1, s0, 256]", L_self_layers_0_weight: "f32[256, 256]", L_self_layers_0_bias: "f32[256]", L_self_layer_norm_weight: "f32[512]", L_self_layer_norm_bias: "f32[512]", L_self_layer_norm_normalized_shape_0_: "Sym(512)"): l_hidden_states_ = L_hidden_states_ l_self_layers_0_weight = L_self_layers_0_weight l_self_layers_0_bias = L_self_layers_0_bias l_self_layer_norm_weight = L_self_layer_norm_weight l_self_layer_norm_bias = L_self_layer_norm_bias l_self_layer_norm_normalized_shape_0_ = L_self_layer_norm_normalized_shape_0_ # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:332 in forward, code: hidden_states = torch.cat([hidden_states, hidden_states], dim=-1) hidden_states: "f32[1, s0, 512]" = torch.cat([l_hidden_states_, l_hidden_states_], dim = -1); l_hidden_states_ = None # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:333 in forward, code: hidden_states = _ReversibleFunction.apply( function_ctx = torch.autograd.function.FunctionCtx() # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:258 in forward, code: hidden_states, attn_output = 
torch.chunk(hidden_states, 2, dim=-1) chunk = torch.chunk(hidden_states, 2, dim = -1); hidden_states = None hidden_states_1: "f32[1, s0, 256]" = chunk[0] attn_output: "f32[1, s0, 256]" = chunk[1]; chunk = None # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/linear.py:116 in forward, code: return F.linear(input, self.weight, self.bias) attn_output_1: "f32[1, s0, 256]" = torch._C._nn.linear(attn_output, l_self_layers_0_weight, l_self_layers_0_bias); attn_output = l_self_layers_0_weight = l_self_layers_0_bias = None # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:272 in forward, code: ctx.save_for_backward(attn_output.detach(), hidden_states.detach()) detach: "f32[1, s0, 256]" = attn_output_1.detach() detach_1: "f32[1, s0, 256]" = hidden_states_1.detach() # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:279 in forward, code: return torch.cat([attn_output, hidden_states], dim=-1) hidden_states_2: "f32[1, s0, 512]" = torch.cat([attn_output_1, hidden_states_1], dim = -1); attn_output_1 = hidden_states_1 = None # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/normalization.py:201 in forward, code: return F.layer_norm( hidden_states_3: "f32[1, s0, 512]" = torch.nn.functional.layer_norm(hidden_states_2, (l_self_layer_norm_normalized_shape_0_,), l_self_layer_norm_weight, l_self_layer_norm_bias, 1e-12); hidden_states_2 = l_self_layer_norm_normalized_shape_0_ = l_self_layer_norm_weight = l_self_layer_norm_bias = None # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:352 in forward, code: hidden_states = torch.nn.functional.dropout( hidden_states_4: "f32[1, s0, 512]" = torch.nn.functional.dropout(hidden_states_3, p = 0.5, training = True); hidden_states_3 = None return (hidden_states_4,) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127467 Approved by: https://github.com/anijain2305 ghstack dependencies: #126444, #127146, #127424, #127440	2024-06-06 17:56:36 +00:00
Yu, Guangye	304956e1fb	Switch to torch.float16 on XPU AMP mode (#127741 ) # Motivation Previously, the default dtype for AMP on XPU was aligned with the CPU. To align with other GPUs, we intend to change the default dtype for AMP to `torch.float16`. This change aims to save users the effort of converting models from `torch.float16` to `torch.bfloat16`, or vice versa when they want to run the model on different types of GPUs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127741 Approved by: https://github.com/EikanWang, https://github.com/albanD	2024-06-06 17:40:13 +00:00
Yifu Wang	1d0c1087dd	Allow overriding per-dim group options via _MeshEnv.set_dim_group_options (#126599 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126599 Approved by: https://github.com/wanchaol ghstack dependencies: #126598	2024-06-06 17:18:12 +00:00
Pritam Damania	e9c5144cbc	Fix bug in update_process_group DDP API (#128092 ) Fix bug in `_update_process_group` DDP API where we didn't correctly reset `local_used_map_` and a few other variables. This resulted in errors like `Encountered gradient which is undefined, but still allreduced by...` Added a unit test as well that reproduced the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128092 Approved by: https://github.com/awgu, https://github.com/fegin	2024-06-06 17:10:42 +00:00
albanD	2ffdf556ea	Add back API that some people rely on in torch.cuda.amp.grad_scaler namespace (#128056 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128056 Approved by: https://github.com/kit1980, https://github.com/eqy	2024-06-06 17:02:32 +00:00
Aaron Gokaslan	2d47385f0f	[BE]: Enable ruff TCH rules and autofixes for better imports (#127688 ) Automated fixes to put imports that are only used in type hints into TYPE_CHECKING imports. This also enables the RUFF TCH rules which will automatically apply autofixes to move imports in and out of TYPE_CHECKING blocks as needed in the future, this will make the initial PyTorch import faster and will reduce cyclic dependencies. Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127688 Approved by: https://github.com/XuehaiPan, https://github.com/ezyang, https://github.com/malfet	2024-06-06 16:55:58 +00:00
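A small sketch of the import pattern this commit's autofixes produce, using only the standard library. `halve` is a hypothetical function and `Fraction` was picked arbitrarily as a type that is needed only for annotations.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static type checkers, never at runtime.
    from fractions import Fraction


def halve(value: "Fraction") -> "Fraction":
    # The annotations are strings, so Fraction need not be imported at runtime.
    return value / 2
```

Because the guarded import never runs at runtime, this module does not pull in `fractions` itself, which is the mechanism behind the import-time and cyclic-dependency benefits the message mentions.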
Wanchao Liang	4f87f47ea1	[dtensor] reuse DTensorSpec as much as possible (#128112 ) As titled: given that our DTensorSpec is immutable, we can always reuse the spec if the input/output have the same tensor metadata. This helps in two ways: 1. We don't need to re-calculate the hash every time we produce a DTensorSpec, which reduces runtime operator overhead. 2. It reduces the DTensor construction overhead. A local benchmark on an 800-parameter clip_grad_norm shows that for foreach_norm the CPU overhead drops from 11ms to 7.8ms (around 30% improvement). Pull Request resolved: https://github.com/pytorch/pytorch/pull/128112 Approved by: https://github.com/awgu	2024-06-06 16:55:50 +00:00
Edward Z. Yang	f0dd11df55	Make ValueRange repr less chatty by default (#128043 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128043 Approved by: https://github.com/lezcano	2024-06-06 16:42:48 +00:00
eqy	0de6d2427f	Bump tolerances for `inductor/test_efficient_conv_bn_eval.py::EfficientConvBNEvalCudaTests::test_basic_cuda` attempt 2 (#128048 ) CC @nWEIdia @huydhn @Skylion007 Same thing but also bump backward tolerances... Pull Request resolved: https://github.com/pytorch/pytorch/pull/128048 Approved by: https://github.com/Skylion007	2024-06-06 16:17:43 +00:00
PyTorch MergeBot	a5b86a1ec0	Revert "FP8 rowwise scaling (#125204 )" This reverts commit 5dc912822913b3d90f4938891c7eca722a057cf1. Reverted https://github.com/pytorch/pytorch/pull/125204 on behalf of https://github.com/atalman due to Sorry need to revert this failing, on internal CI. I suggest to reimport this and try to land internally resolving all issues ([comment](https://github.com/pytorch/pytorch/pull/125204#issuecomment-2152905513))	2024-06-06 16:12:34 +00:00
Joona Havukainen	a5ba9b2858	Fix for addcdiv contiguous problem (#124442 ) Fixes issue number #118115 Co-authored-by: Siddharth Kotapati <skotapati@apple.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124442 Approved by: https://github.com/kulinseth	2024-06-06 16:09:18 +00:00
PyTorch MergeBot	c58d3af3b4	Revert "Add OpInfo entry for alias_copy (#127232 )" This reverts commit 457df212e1c6e1aa4f1eb2ad6ee292052d7c07e1. Reverted https://github.com/pytorch/pytorch/pull/127232 on behalf of https://github.com/clee2000 due to broke [onnx](https://github.com/pytorch/pytorch/actions/runs/9397057801/job/25880181144) and [mps](https://github.com/pytorch/pytorch/actions/runs/9397057805/job/25879818705) tests, [hud link](`457df212e1`) , base is 15 days old, the onnx test xfailed on the pr but the xfail was removed so if you rebase itll surface, mps build failed so no mps tests were run on the pr ([comment](https://github.com/pytorch/pytorch/pull/127232#issuecomment-2152848758))	2024-06-06 15:44:47 +00:00
Jithun Nair	9d849d4312	Disable py3.12 nightly wheel builds for ROCm (#127968 ) Triton commit bump PR https://github.com/pytorch/pytorch/pull/125396 reverted due to missing llnl-hatchet dependency for triton. Workaround is to disable py3.12 binary build jobs for ROCm on PyTorch CI until llnl-hatchet publishes py3.12 wheels on [PyPI](https://pypi.org/project/llnl-hatchet/#files) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127968 Approved by: https://github.com/atalman, https://github.com/pruthvistony	2024-06-06 15:17:35 +00:00
PyTorch MergeBot	48a54146e7	Revert "[dynamo] Support ndarray.dtype attribute access (#124490 )" This reverts commit 4adee71155bec4e419bac32be2cbc1763bc6c98f. Reverted https://github.com/pytorch/pytorch/pull/124490 on behalf of https://github.com/atalman due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/124490#issuecomment-2152664749))	2024-06-06 14:21:29 +00:00
Hengwen Tong	f08fd8e9e3	Remove redundant device guard in Resize.h (#126498 ) In https://github.com/pytorch/pytorch/pull/113386 a device guard was [inserted](https://github.com/pytorch/pytorch/pull/113386/files#diff-2691af3a999b3a8f4a0f635aabcd8edf0ffeda501edfa9366648e8a89de12a90R30). The newly inserted device guard has a clearer and more confined scope, whereas it is hard to tell the exact purpose and scope of the [old device guard](`78ffe49a3f/aten/src/ATen/native/cuda/Resize.h (L41)`). Removing the old guard has a negligible positive performance impact and makes the code more understandable. Thanks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126498 Approved by: https://github.com/eqy, https://github.com/lezcano	2024-06-06 13:01:42 +00:00
Xuehai Pan	c97e3ebb96	Fix wrongly exposed variables in `torch/__init__.py` (#127795 ) <img width="609" alt="image" src="https://github.com/pytorch/pytorch/assets/16078332/964c6707-1856-4c2c-8cd8-ce1d96d38d36"> This PR removes temporary variables in `torch/__init__.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127795 Approved by: https://github.com/albanD	2024-06-06 08:31:41 +00:00
Tom Ritchford	457df212e1	Add OpInfo entry for alias_copy (#127232 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127232 Approved by: https://github.com/lezcano	2024-06-06 07:46:26 +00:00
Michael Lazos	f5328542b5	Allow multiple cudagraph recordings per compiled graph (#126822 ) ### Introduction/Problem Today when dynamo traces a builtin nn module (nn.Linear for example) it will specially handle parameters of that module by storing them as constant attributes of the graph. This requires that dynamo guard on the ID of the NNModule because if the instance of the module changes, we need to retrace and recollect the new parameters as attributes of the graph. This creates a 1:1 compiled graph to cudagraph relationship. With hierarchical compilation, dynamo will treat builtin nn modules like any other code. This reduces complexity and critically, if there are multiple identical layers in a model, we only need to compile one of those layers once, and reuse the same compiled artifact for each layer. This introduces a problem for the current approach to parameter handling. Since the parameters could now possibly change across calls to the compiled artifact, these need to be inputs to the graph instead of attributes. This introduces a problem for cudagraphs - previously cudagraphs was guaranteed that the parameters of builtin NN Modules would be constant across calls, but now since the compiled artifact needs to be agnostic to the actual instance of the NN module being used these parameter memory locations may vary. Previously cudagraphs simply copies varying inputs to cudagraph owned memory, but since the parameters are quite large, this is catastrophic for performance. ### Solution To avoid this performance cliff, this PR allows cudagraphs to re-record a new cudagraph if only parameters change. Metadata about which arguments are parameters are propagated from AOT Autograd to compile_fx, and these indices are passed to cudagraphs. If these memory locations change, a new graph is recorded vs previously where this would be an error (because this previously should not happen). This enables a 1:many compiled graph to cudagraph relationship. 
Across similar modules we will re-record cudagraphs and dispatch the correct graph if parameter pointers match when the cudagraph is executed. ### Next steps (if needed) It is theoretically possible that a user passes Parameters that change frequently as inputs to model code - if this is a common issue this design allows for dynamo to pass metadata indicating which parameters were created in a builtin NN Module context to only permit those parameters to have the multi-cudagraph behavior, but this PR does not implement this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126822 Approved by: https://github.com/eellison ghstack dependencies: #126820, #126821	2024-06-06 06:39:59 +00:00
Michael Lazos	5a3bea1e88	Remove unused arg to GraphLowering (#126821 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126821 Approved by: https://github.com/eellison ghstack dependencies: #126820	2024-06-06 06:39:59 +00:00
Michael Lazos	70ba6f0ab6	Collect static parameter metadata in aot (#126820 ) Collect the indices of the static parameters to pass down to cudagraphs in order to re-record if necessary. This location was chosen in order to allow us to restrict this (if needed) in the future by setting metadata in dynamo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126820 Approved by: https://github.com/bdhirsh	2024-06-06 06:39:50 +00:00
Andrew Gu	c8ff1cd387	[FSDP2] Changed `test_register_forward_method` to use multiprocess test (#128100 ) The test seems to be flaky due to multi-threaded process group. This PR converts the test to use normal multi-process `ProcessGroupNCCL` to fix the flakiness. This PR closes https://github.com/pytorch/pytorch/issues/126851. Interestingly, the original MTPG version passes for me on devgpu. Either way, the new version also passes on devgpu, so we can see in CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128100 Approved by: https://github.com/weifengpy	2024-06-06 06:34:02 +00:00
Michael Lazos	638f543ac2	Enable single nadam test (#128087 ) https://github.com/pytorch/pytorch/issues/117150 has been fixed Pull Request resolved: https://github.com/pytorch/pytorch/pull/128087 Approved by: https://github.com/xmfan	2024-06-06 06:25:00 +00:00
Jiashen Cao	cd42b95047	Handle aten::__contains__ during TorchScript to ExportedProgram conversion (#127544 ) #### Description Add support for converting `prim::__contains__` from TorchScript IR to ExportedProgram, e.g., ```python class MIn(torch.nn.Module): def forward(self, x: torch.Tensor): return x.dtype in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] ``` #### Test Plan * Add test cases to cover both contains IR resulted from primitive types or Tensor. `pytest test/export/test_converter.py -s -k test_ts2ep_converter_contains` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127544 Approved by: https://github.com/angelayi	2024-06-06 05:00:13 +00:00
cyy	68eb771265	[2/N] Remove unused test functions (#128005 ) Following #127881, this PR continues to remove unused test functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128005 Approved by: https://github.com/ezyang	2024-06-06 03:41:32 +00:00
Edward Z. Yang	2f7cfecd86	Complete revamp of float/promotion sympy handling (#126905 ) At a high level, the idea behind this PR is: * Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.) * Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers. The story begins in torch/utils/_sympy/functions.py. Here, I make some changes to how we represent certain operations in sympy expressions: * FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing). * ModularIndexing, LShift, RShift now assert they are given integer inputs. * Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver * TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2*53 beyond what first coercing the integer to floats and then doing true division. Trunc is split to TruncToFloat and TruncToInt. * Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result. 
* RoundDecimal updated to consistently only ever return a float * Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing) In torch/__init__.py, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations. Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information. We also need to introduce some new op handlers in torch/_inductor/ops_handler.py: * `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy * `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv` These changes have consequences. First, we need to make some administrative changes: * Actually wire up these Sympy functions from SymInt/SymFloat in torch/fx/experimental/sym_node.py, including the new promotion rules (promote2) * Add support for new Sympy functions in torch/utils/_sympy/interp.py, torch/utils/_sympy/reference.py * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this to fix tests here * Add printer support for the Sympy functions in torch/_inductor/codegen/common.py, torch/_inductor/codegen/cpp_utils.py, torch/_inductor/codegen/triton.py. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. 
TODO: The additions here are not exhaustive yet * Update ValueRanges logic to use new sympy functions in torch/utils/_sympy/value_ranges.py. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions. In torch/fx/experimental/symbolic_shapes.py we need to make some symbolic reasoning adjustments: * Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now * `_assert_bound_is_rational` is no more, we no longer generate rational bounds * Don't intersect non-int value ranges with the `int_range` * Support more sympy Functions for guard SYMPY_INTERP * Assert the type of value range is consistent with the variable type The new asserts uncovered necessary bug fixes: * torch/_inductor/codegen/cpp.py, torch/_inductor/select_algorithm.py, torch/_inductor/sizevars.py - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions * torch/_inductor/utils.py - make sure you actually pass in sympy.Expr to these functions * torch/_inductor/ir.py - make_contiguous_strides_for takes int/SymInt, not sympy.Expr! * torch/export/dynamic_shapes.py - don't use infinity to represent int ranges, instead use sys.maxsize - 1 Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at test/test_proxy_tensor.py Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905 Approved by: https://github.com/xadupre, https://github.com/lezcano	2024-06-06 02:29:45 +00:00
Janani Sriram	c1a43a69e4	[NestedTensor] Add error checks for unbind operator coverage when ragged_idx != 1 (#128058 ) Summary: Add the following error checks for the `unbind` operator on `NestedTensor`s when `ragged_idx != 1`: - The current implementation allows the creation of `NestedTensor` instances from the class definition with an `offsets` tensor that applies to a dimension other than the jagged dimension. This diff ensures that `unbind` fails when the `offsets` exceed the length of the jagged dimension. Test Plan: Added the following unit tests: `test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`. ``` test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok ``` Reviewed By: davidberard98 Differential Revision: D57989082 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128058 Approved by: https://github.com/davidberard98	2024-06-06 01:56:12 +00:00
PyTorch MergeBot	9795c4224b	Revert "[DDP] Bucket handling: make first bucket size equal to bucket_cap_mb if it was set (#121640 )" This reverts commit e98662bed99df57b7d79f9fc1cbe670afc303235. Reverted https://github.com/pytorch/pytorch/pull/121640 on behalf of https://github.com/clee2000 due to Sorry but it looks like you're failing `distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_coalesced_op`. The build failed so the tests didn't run, consider rebasing, there have been a couple of PRs lately related to cudnn so you probably are either based on a bad or too old of a commit `e98662bed9` https://github.com/pytorch/pytorch/actions/runs/9392731942/job/25868060913 ([comment](https://github.com/pytorch/pytorch/pull/121640#issuecomment-2151258585))	2024-06-06 01:50:18 +00:00
sdp	b4a0161449	Build SYCL kernels for ATen XPU ops on Native Windows (take 2) (#127390 ) Original PR https://github.com/pytorch/pytorch/pull/126725 is closed due to bad rebase. ------- As proposed in https://github.com/pytorch/pytorch/issues/126719, we are enabling PyTorch XPU on Native Windows on Intel GPU. This PR enables XPU build on Windows as the first step of #126719: - Enable `USE_XPU` build on Windows using MSVC as host compiler. The use of MSVC as host compiler seamlessly aligns with the existing PyTorch build on Windows. - Build oneDNN GPU library on Windows. Co-authored-by: Yu, Guangye <guangye.yu@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127390 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/ezyang	2024-06-06 01:41:06 +00:00
Kazuaki Ishizaki	6adcf21b2b	Documenting the torch.cuda.nccl.version function (#128022 ) Fixes #127892 This PR adds docstring to the torch.cuda.nccl.version function Pull Request resolved: https://github.com/pytorch/pytorch/pull/128022 Approved by: https://github.com/malfet	2024-06-06 01:13:07 +00:00
Edward Z. Yang	bf2c05352e	Make length == stop size oblivious too (#128050 ) This doesn't do anything right now (need some other PRs to activate) but since it edits a header file it would be better to land this earlier. Context: https://github.com/pytorch/pytorch/pull/127693 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128050 Approved by: https://github.com/Skylion007, https://github.com/lezcano	2024-06-06 01:09:37 +00:00
Adam J. Stewart	80d34217c6	Typo fixes: et al. (#127811 ) "et al." is short for _et alia_ and should be abbreviated with a period on the second word. Noticed this typo when reading through the SGD docs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127811 Approved by: https://github.com/janeyx99	2024-06-06 01:03:25 +00:00
Edward Z. Yang	d3ad84c38f	Use pexpr, not texpr in Triton launch codegen (#128038 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128038 Approved by: https://github.com/Skylion007	2024-06-06 00:45:59 +00:00
albanD	8bcebc8dae	Add runtime dependency on setuptools for cpp_extensions (#127921 ) As per title, since it was removed from the builtin Python binary in 3.12 and we use it in `torch.utils.cpp_extension.*`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127921 Approved by: https://github.com/Skylion007	2024-06-05 23:59:38 +00:00
cyy	2fd75667b4	[Caffe2]Remove Caffe2 scripts and benchmarks (#126747 ) Due to removal of Caffe2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126747 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-06-05 23:46:31 +00:00
Aidyn-A	e98662bed9	[DDP] Bucket handling: make first bucket size equal to bucket_cap_mb if it was set (#121640 ) The first DDP bucket is always created with the size `dist._DEFAULT_FIRST_BUCKET_BYTES` (1 MiB) by default, regardless of `bucket_cap_mb`. The proposal is to use `bucket_cap_mb` as the one main bucket size if it was supplied by the user. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121640 Approved by: https://github.com/wanchaol	2024-06-05 23:44:54 +00:00
Tristan Rice	ffaea656b5	WorkerServer: add support for binding to TCP (#127986 ) This adds support for the WorkerServer binding to TCP as well as the existing unix socket support. ```py server = _WorkerServer("", 1234) ``` Test plan: Added unit test ``` python test/distributed/elastic/test_control_plane.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127986 Approved by: https://github.com/c-p-i-o	2024-06-05 22:56:32 +00:00
Xuehai Pan	a7c596870d	[BE][Eazy] remove `torch.torch.xxx` usages (#127800 ) NB: `torch` is exposed in `torch/__init__.py`. So there can be `torch.torch.torch.xxx`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127800 Approved by: https://github.com/peterbell10, https://github.com/kit1980, https://github.com/malfet	2024-06-05 21:53:49 +00:00
titaiwangms	4123323eff	[ONNX] Single function for torch.onnx.export and torch.onnx.dynamo_export (#127974 ) Add `dynamo: bool = True` as a switch in `torch.onnx.export` to provide users an option to try `torch.onnx.dynamo_export`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127974 Approved by: https://github.com/justinchuby	2024-06-05 21:27:46 +00:00
Catherine Lee	01694eaa56	Move cuda 12.4 jobs to periodic for both pull and inductor (#127825 ) Moves 12.4 sm86/a10g jobs in pull to trunk. Moves 12.4 cuda non sm86 jobs to periodic. Moves 12.4 jobs in inductor to inductor-periodic, except inductor_timm, which seems to give important signal. There has been a lot of queueing for cuda runners due to the addition of jobs for cuda 12.4, so move those jobs to other workflows that are run less often. Co-authored-by: Andrey Talman <atalman@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127825 Approved by: https://github.com/ZainRizvi, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/malfet	2024-06-05 21:01:36 +00:00
Animesh Jain	8184cd85fc	[fake tensor] Set _is_param for base fake tensors for views (#127823 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127823 Approved by: https://github.com/eellison, https://github.com/ezyang ghstack dependencies: #127972	2024-06-05 20:26:52 +00:00
Animesh Jain	626dc934d1	[dynamo][pippy] Hotfix for nn_module_stack for pippy usecase (#127972 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127972 Approved by: https://github.com/ydwu4	2024-06-05 20:14:50 +00:00
rk7697	72e863df27	Update _learnable_fake_quantize.py (#127993 ) Remove sentence "For literature references, please see the class _LearnableFakeQuantizePerTensorOp." and add "s" to "support" (Possibly) Fixes #99107 (But not sure, sorry) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127993 Approved by: https://github.com/jerryzh168	2024-06-05 20:02:33 +00:00
atalman	6e545392cd	Move nongpu workflows from trunk to periodic (#128049 ) We don't need to run them on every PR. These are used to test for graceful degradation of GPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128049 Approved by: https://github.com/clee2000	2024-06-05 18:31:26 +00:00
rzou	6412c6060c	[reland] Refresh OpOverloadPacket if a new OpOverload gets added (#128000 ) If a user accesses an OpOverloadPacket, then creates a new OpOverload, then uses the OpOverloadPacket, the new OpOverload never gets hit. This is because OpOverloadPacket caches OpOverloads when it is constructed. This PR fixes the problem by "refreshing" the OpOverloadPacket if a new OpOverload gets constructed and the OpOverloadPacket exists. Test Plan: - new tests This is the third land attempt. The first one was reverted for breaking internal tests, the second was reverted for being erroneously suspected of causing a perf regression. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128000 Approved by: https://github.com/albanD	2024-06-05 17:57:09 +00:00
Chien-Chin Huang	bb68b54be0	[BE][ptd_fb_test][1/N] Enable testslide (#127512 ) This change enables Testslide, which gives us more readable output, import time, etc. The PR was previously stamped as https://github.com/pytorch/pytorch/pull/126460, but the old PR has some ghexport issues. Differential Revision: [D57919583](https://our.internmc.facebook.com/intern/diff/D57919583/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127512 Approved by: https://github.com/wz337, https://github.com/Skylion007	2024-06-05 17:45:15 +00:00
Arun Pa	3acbfd602e	Document torch.utils.collect_env.get_env_info function (#128021 ) Fixes #127911 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128021 Approved by: https://github.com/malfet	2024-06-05 17:44:47 +00:00
willfengg	6454e95824	[FSDP2] enable CI for torch.compile(root Transformer) (#127832 ) This CI showcases that FSDP2 works with a `torch.compile` root model, since FSDP1 can do the same. Compiling root Transformer without AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group`. Compiling root Transformer with AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127832 Approved by: https://github.com/awgu	2024-06-05 17:29:46 +00:00
Andrew M. James	4adee71155	[dynamo] Support ndarray.dtype attribute access (#124490 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124490 Approved by: https://github.com/lezcano ghstack dependencies: #125717	2024-06-05 17:20:01 +00:00
Chien-Chin Huang	a9cc147fa1	[DSD][FSDP1] Deprecate FSDP.state_dict_type and redirect users to DSD (#127794 ) Summary: As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/127794 Approved by: https://github.com/awgu ghstack dependencies: #127793	2024-06-05 16:55:05 +00:00
Peter Bell	9acc19f8da	[inductor] Take absolute value of strides when picking loop order (#127425 ) Fixes #126860 The stride hint is found by comparing the value of the indexing expression evaluated at `idx` set to all zeros and at `idx[dim] = 1`. This causes a problem for padded inputs where 0 and 1 are still in the padded region. In particular, for reflection padding this causes the stride to be negative. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127425 Approved by: https://github.com/lezcano	2024-06-05 16:48:22 +00:00
Chien-Chin Huang	22964d1007	[DSD] Deprecate submodules feature for DSD (#127793 ) Summary: Getting a partial state_dict and setting the state_dict with the type Dict[nn.Module, Dict[str, Any]] is too complicated and can confuse users. The feature can be achieved by simple pre-processing and post-processing by users, so this PR adds a deprecation warning to the feature. The previous PR, https://github.com/pytorch/pytorch/pull/127070, assumed no one was using the feature and removed it without a grace period. This seems to be too aggressive and caused some concerns. This PR adds the deprecation warning and tests. We will remove the support in 2.5. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127793 Approved by: https://github.com/LucasLLC	2024-06-05 16:31:29 +00:00
drisspg	5dc9128229	FP8 rowwise scaling (#125204 ) # Summary This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met: - `x`'s scale should be a 1-dimensional tensor of length `M`. - `y`'s scale should be a 1-dimensional tensor of length `N`. It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row". The following two PRs were required to enable local builds: - [PR #126185](https://github.com/pytorch/pytorch/pull/126185) - [PR #125523](https://github.com/pytorch/pytorch/pull/125523) ### Todo We still do not build our Python wheels with this architecture. @ptrblck @malfet, should we replace `sm_90` with `sm_90a`? The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit: https://github.com/pytorch/pytorch/pull/125204/files#r1586986954 #### ifdef I tried to use: `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \ defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do this. Kernel Credit: @jwfromm Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204 Approved by: https://github.com/lw, https://github.com/malfet	2024-06-05 15:46:40 +00:00
Jiashen Cao	4f9fcd7156	Handle unpacking during TorchScript to ExportedProgram conversion (#127419 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127419 Approved by: https://github.com/angelayi	2024-06-05 15:27:13 +00:00
cyy	9f2c4b9342	Replace with standard type traits in torch/csrc (#127852 ) In preparation to clean up more type traits. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127852 Approved by: https://github.com/ezyang	2024-06-05 15:22:48 +00:00
cyy	3d617333e7	Simplify CMake code (#127683 ) Due to the recent adoption of find(python), it is possible to further simplify some CMake code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127683 Approved by: https://github.com/ezyang	2024-06-05 15:17:31 +00:00
cyy	df75a9dc80	Remove Caffe2/onnx (#127991 ) Remove Caffe2/onnx since it is not used. Other tiny fixes are also applied. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127991 Approved by: https://github.com/ezyang	2024-06-05 15:10:12 +00:00
Nikita Shulga	d48c25c7d1	[BE] Fix missing-prototypes errors in Metal backend (#127994 ) By declaring a bunch of functions static. Removed `USE_PYTORCH_METAL` from the list of flags that suppress `-Werror=missing-prototypes`. This will prevent regressions like the ones reported in https://github.com/pytorch/pytorch/issues/127942 from sneaking past CI, which builds PyTorch with Metal support. Use nested namespaces. Remove spurious semicolon after TORCH_LIBRARY declaration. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127994 Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi	2024-06-05 14:58:19 +00:00
Huy Do	8992141dba	Restore MPS testing on MacOS 13 and m2 metal (#127853 ) The runners are ready now https://github.com/organizations/pytorch/settings/actions/runners?qr=label%3Amacos-m1-13, we want to keep some MacOS 13 runner for mps coverage until MacOS 15 is out. This also fixes the `macos-m2-14` mistake from https://github.com/pytorch/pytorch/pull/127582. The current `macos-m2-14` runner is on 14.2 while our `macos-m1-14` has 14.4. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127853 Approved by: https://github.com/malfet	2024-06-05 14:44:00 +00:00
Andrew M. James	879d01afcb	[dynamo][numpy] Add unsigned integer dtypes (#125717 ) We should support these to whatever extent we can. The corresponding `torch.uint<w>` types are defined, so I don't see an issue with generating the various casting rules and allowing them to trace. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125717 Approved by: https://github.com/lezcano	2024-06-05 14:33:47 +00:00
hippocookie	4ce5322a1f	Enable UFMT on test_shape_ops.py test_show_pickle.py test_sort_and_select.py (#127165 ) Fixes some files in #123062 Run lintrunner on files: test_shape_ops.py test_show_pickle.py test_sort_and_select.py ```bash $ lintrunner --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127165 Approved by: https://github.com/ezyang	2024-06-05 14:31:26 +00:00
Weizhuo Zhang	faabda4fc9	[Inductor] Skip model_fail_to_load and eager_fail_to_run models in inductor benchmarks test (#127210 ) Aligned with test-infra repo, we skipped `model_fail_to_load` and `eager_fail_to_run` models Refer code logic: `d3b79778f8/torchci/rockset/inductor/__sql/compilers_benchmark_performance.sql (L57-L58)` ```SQL WHERE filename LIKE '%_accuracy' AND filename LIKE CONCAT( '%_', : dtypes, '_', : mode, '_', : device, '_%' ) AND _event_time >= PARSE_DATETIME_ISO8601(:startTime) AND _event_time < PARSE_DATETIME_ISO8601(:stopTime) AND (workflow_id = :workflowId OR :workflowId = 0) AND accuracy != 'model_fail_to_load' AND accuracy != 'eager_fail_to_run' ), ``` Comp Item \| Compiler \| suite \| Before \| After fix -- \| -- \| -- \| -- \| -- Pass Rate \| Inductor \| torchbench \| 96%, 80/83 \| 100%, 80/80 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127210 Approved by: https://github.com/jansel	2024-06-05 14:23:09 +00:00
weiyusheng	c3949b20a1	Opt model save and load (#126374 ) ## save&load support for OptimizedModule [Issue Description](https://github.com/pytorch/pytorch/pull/101651) English is not my native language; please excuse typing errors. This pr is based on commit b9588101c4d3411b107fdc860acfa8a72c642f91\ I'll do something with the merge conflicts later ### test result for test/dynamo Conclusion:\ It performs the same as before as far as I can see. ENV(CPU only):\ platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.5.0\ configfile: pytest.ini\ plugins: anyio-3.7.1, cpp-2.3.0, flakefinder-1.1.0, xdist-3.3.1, xdoctest-1.1.0, metadata-3.1.1, html-4.1.1, hypothesis-5.35.1, rerunfailures-14.0 #### before this pr: [before](https://github.com/pytorch/pytorch/files/15329370/before.md) #### after this pr: [after](https://github.com/pytorch/pytorch/files/15329376/after.md) ### some changes 1. add test_save_and_load to test/dynamo/test_modules.py with & without "backend='inductor'" 2. add \_\_reduce\_\_ function to OptimizedModule and derived classes of _TorchDynamoContext for pickling & unpickling 3. change the wrappers into wrapper classes ( including convert_frame_assert, convert_frame, catch_errors_wrapper in torch/_dynamo/convert_frame.py & wrap_backend_debug in torch/_dynamo/repro/after_dynamo.py ) 4. change self.output.compiler_fn into innermost_fn(self.output.compiler_fn) in torch/_dynamo/symbolic_convert.py to get the origin compiler_fn and to avoid the "compiler_fn is not eager" condition Pull Request resolved: https://github.com/pytorch/pytorch/pull/126374 Approved by: https://github.com/msaroufim, https://github.com/jansel	2024-06-05 13:01:16 +00:00
PyTorch MergeBot	9a8ab778d3	Revert "[BE]: Update cudnn to 9.1.0.70 (#123475 )" This reverts commit c490046693e77e254664e19d940e9b05a1da18ef. Reverted https://github.com/pytorch/pytorch/pull/123475 on behalf of https://github.com/huydhn due to CUDA trunk jobs are pretty red after this change, and the forward fix https://github.com/pytorch/pytorch/pull/127984 does not look working ([comment](https://github.com/pytorch/pytorch/pull/123475#issuecomment-2149258430))	2024-06-05 08:59:53 +00:00
ibartol	bb2de3b101	Fixed broken link and removed unfinished sentence from issue #126367 (#127938 ) Fixes #126367. ## Description Fixed a broken link in the pytorch/docs/source/torch.compiler_faq.rst doc and deleted a few words that were extra according to the issue tagged above. ## Checklist - [X] The issue that is being fixed is referred to in the description - [X] Only one issue is addressed in this pull request - [X] Labels from the issue that this PR is fixing are added to this pull request - [X] No unnecessary issues are included in this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/127938 Approved by: https://github.com/msaroufim	2024-06-05 07:37:32 +00:00
dan_the_3rd	4a384d813b	[SDPA/memeff] Backport changes from xFormers to PT (#127090 ) Backporting a few fixes from xFormers: * Bug fixes for local attention (which is not exposed in PT at the moment) * Massively reduced memory usage on the BW pass (see also https://github.com/facebookresearch/xformers/pull/1028) Essentially this will also make xFormers build process much easier, as we will be able to use mem-eff from PyTorch (if the user has a recent enough version) rather than building it at xFormers install time The goal is to have the source of truth for these files in PT moving forward, and remove them from xFormers eventually once our users have a recent-enough version of PT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127090 Approved by: https://github.com/drisspg	2024-06-05 07:33:27 +00:00
cyy	b054470db2	Remove unused functions (#127881 ) Some unused functions detected by g++ warnings can be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127881 Approved by: https://github.com/zou3519	2024-06-05 05:21:24 +00:00
Shuqiang Zhang	30788739f4	[c10d] add a simple test to demonstrate the user usage of collectives (#127665 ) Summary: While experimenting with the unit tests, it seemed useful to add a simple example of a user function that works with different subclasses of _ControlCollectives, and to test that the user function executes correctly. Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/127665 Approved by: https://github.com/d4l3k	2024-06-05 04:32:11 +00:00
Pian Pawakapan	e505132797	[export] track TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS for export runtime asserts (#127554 ) Track TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS=1 in export so it doesn't omit runtime asserts. Differential Revision: D57978699 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127554 Approved by: https://github.com/tugsbayasgalan	2024-06-05 04:16:54 +00:00
PyTorch MergeBot	d5cb5d623a	Revert "Complete revamp of float/promotion sympy handling (#126905 )" This reverts commit fb696ef3aa34e20c0fef1c0210a397abd3ea5885. Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/ezyang due to internal user reported ceiling equality simplification problem, I have a plan ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2148805840))	2024-06-05 03:57:58 +00:00
Howard Huang	55a4ef80c4	[pipelining] test pipeline_order in schedule (#127559 ) Add a unit test that validates the pipeline order for different `num_stages`, `num_microbatches`, `num_world_size` combinations. It does not actually run the schedule; it only checks that the order in which microbatches are processed is valid, so it does not require GPUs or multiple processes. Will add more combinations and negative tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127559 Approved by: https://github.com/wconstab ghstack dependencies: #127084, #127332	2024-06-05 03:51:27 +00:00
Nikita Shulga	71e684bfae	[BE][Mac] Add missing prototypes (#127988 ) Really confused how CI did not catch this one, but this triggers missing-prototype errors when compiled from scratch on MacOS Sonoma using clang-15. Fixes https://github.com/pytorch/pytorch/issues/127942 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127988 Approved by: https://github.com/Skylion007, https://github.com/huydhn	2024-06-05 02:16:50 +00:00
cyy	ce4436944c	Fix IOS builds (#127985 ) IOS builds fail these days, fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127985 Approved by: https://github.com/ezyang	2024-06-05 02:14:43 +00:00
Mikayla Gawarecki	a135776307	Remove tensor subclass detection logic from weights_only unpickler (#127808 ) Remove logic to auto-detect and allow subclasses that did not override certain methods from the weights_only unpickler from https://github.com/pytorch/pytorch/pull/124331 for 2.4 release Subclasses should be loadable using `torch.serialization.add_safe_globals` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127808 Approved by: https://github.com/malfet	2024-06-05 02:14:30 +00:00
Feng Yuan	8e496046e5	Update torch-xpu-ops pin (ATen XPU implementation) (#127879 ) Support AMP GradScaler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127879 Approved by: https://github.com/EikanWang	2024-06-05 02:13:46 +00:00
Dmovic	6c07e2c930	fix redundant tensor (#127850 ) As title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127850 Approved by: https://github.com/mikaylagawarecki	2024-06-05 02:03:02 +00:00
Cory Modlin	8830b81208	[c10d] Add commCreateFromRanks to c10d (#127421 ) (#127982 ) This is a duplicate of: https://github.com/pytorch/pytorch/pull/127421 which we can't merge. its landed internally already Summary: `ncclCommCreateFromRanks` - described in this [document](https://docs.google.com/document/d/1QIRkAO4SAQ6eFBpxE51JmRKRAH2bwAHn8OIj69XuFqQ/edit#heading=h.5g71oqe3soez), replaces `ncclCommSplit` in NCCLX versions 2.21.5+. The difference is that `ncclCommCreateFromRanks` is given a list of active ranks and is collective only over those ranks as opposed to `ncclCommSplit` for which you give it a color for every rank including NO_COLOR for inactive ranks and the collective is over the entire world. This diff connects `ncclCommCreateFromRanks` to `c10d` `ncclCommSplit` will still be available at the NCCL API but, in this diff, is not used starting at version 2.21.5 Split the python test and implementation of `split()` for internal FB and external OSS builds. The diff defines `"USE_C10D_NCCL_FBCODE"` as a compiler option. When defined, we use the version of split in the newly created `NCCLUtils.cpp` in the `fb` directory. The `fb` directory is not shipit-ed to github. The same API is used for `split()` in both the `ncclx` and `nccl` versions adding `ranks` to the API. This argument is not used in the `nccl` version nor in the 2.18 `ncclx` version where `ncclCommSplit()` is used instead of `ncclCommCreateFromRanks()` in `ncclx` This diff was squashed with D57343946 - see D57343946 for additional review comments. Test Plan: for 2.18.3-1 and 2.21.5-1 versions: ``` buck2 run fbcode//mode/opt -c param.use_nccl=True -c fbcode.nvcc_arch=a100 -c hpc_comms.use_ncclx="$VERSION" -c fbcode.enable_gpu_sections=true fbcode//caffe2/test/distributed/fb:test_comm_split_subgroup_x ``` ``` BUILD SUCCEEDED ... 
ok ---------------------------------------------------------------------- Ran 1 test in 10.210s OK ~/scripts ``` OSS build: `[cmodlin@devgpu003.vll5 ~/fbsource/third-party/ncclx/v2.21.5-1 (e56338cfa)]$ ./maint/oss_build.sh` OSS build output: ``` ... ncclCommHash 197dce9b413e2775 nccl commDesc example_pg Dump from comm 0x4708aa0 rings: [[0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0]] Dump from comm 0x4708aa0 commDesc: example_pg Dump from comm 0x4708aa0 nRanks: 1 Dump from comm 0x4708aa0 nNodes: 1 Dump from comm 0x4708aa0 node: 0 Dump from comm 0x4708aa0 localRanks: 1 Dump from comm 0x4708aa0 localRank: 0 Dump from comm 0x4708aa0 rank: 0 Dump from comm 0x4708aa0 commHash: "197dce9b413e2775" 2024-05-24T09:02:54.385543 devgpu003:3040664:3040744 [0][AsyncJob]ctran/backends/ib/CtranIb.cc:143 NCCL WARN CTRAN-IB : No active device found. 2024-05-24T09:02:54.385607 devgpu003:3040664:3040744 [0][AsyncJob]ctran/mapper/CtranMapper.cc:187 NCCL WARN CTRAN: IB backend not enabled Created NCCL_SPLIT_TYPE_NODE type splitComm 0x11c76d0, rank 0 ~/fbsource/third-party/ncclx/v2.21.5-1 ``` Reviewed By: wconstab, wesbland Differential Revision: D56907877 Fixes #ISSUE_NUMBER Co-authored-by: Cory Modlin <cmodlin@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127982 Approved by: https://github.com/izaitsevfb	2024-06-05 00:19:52 +00:00
Howard Huang	7fdfb88f03	[pipelining] rewrite interleaved 1f1b (#127332 ) ## Context Interleaved 1F1B has multiple points in the schedule where communication is both criss-crossed across ranks leading to hangs due to 1. looped nature of schedules, 2. batched nature of forward + backward in 1f1b phase. <img width="1370" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/a07c2b1d-8a99-420b-9ba3-32a0115d228b"> In the current implementation, it is difficult to fix these hangs since it requires `dist.recv` from a prior point in time, but each rank operates on its own step schedule and does not have knowledge of other ranks operations to perform the `recv` prior to their own `send`. ## New implementation The new implementation is split into 2 parts: 1. Creating the pipeline order. Each rank will create the timestep normalized ordering of all schedule actions across all ranks. This is created once during the initialization of the schedule class. The timestep between each rank is normalized as each rank can only have 1 computation action (forward or backward) during that timestep. <img width="1065" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/196f2347-7ff4-49cf-903b-d8db97d1156f"> 3. Executing the pipeline order. Once the pipeline order is determined, execution is simple because as each rank will perform its send to its peer (based on whether they did forward and backward). Now that each rank has a global understanding of the schedule, they can check their previous and next neighbor ranks to see if they need to recv any activations/gradients from them. Therefore, during execution, each rank is aligned and executing the same time step. ## Benefits - Implementation is faster since 1f1b computation can now be split up in two time steps, 1 for forward and 1 for backward. 
- Debugging is easier since we can now determine which timestep each rank is hung on - Testing is easier since we can just validate the pipeline order, without running the schedule. This allows us to test on a large number of ranks without actually needing the GPUs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127332 Approved by: https://github.com/wconstab ghstack dependencies: #127084	2024-06-04 23:46:05 +00:00
Shunting Zhang	1f67cfd437	[inductor] raise tolerance for cspdarknet (#127949 ) cspdarknet was previously flaky, but after https://github.com/pytorch/pytorch/pull/127367 it fails consistently. This is probably due to a small numerical change from that PR, which makes inductor generate different code because of different loop orders. Raise the tolerance to pass CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127949 Approved by: https://github.com/atalman, https://github.com/nWEIdia, https://github.com/eqy	2024-06-04 23:28:20 +00:00
PyTorch MergeBot	907cb28f67	Revert "Inductor: Allow small sizes of m for mixed mm autotuning (#127663 )" This reverts commit d8d0bf264a736c7fb3cd17799a1c1aba4addf8d9. Reverted https://github.com/pytorch/pytorch/pull/127663 on behalf of https://github.com/soulitzer due to breaks torch ao CI, see: https://github.com/pytorch/pytorch/issues/127924 ([comment](https://github.com/pytorch/pytorch/pull/127663#issuecomment-2148554128))	2024-06-04 23:06:43 +00:00
Jiashen Cao	f4b05ce683	Add registry for TorchScript to ExportedProgram conversion (#127464 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127464 Approved by: https://github.com/ydwu4, https://github.com/angelayi	2024-06-04 22:53:00 +00:00
rzou	0eb9ec958a	Revert "Inductor respects strides for custom ops by default (#126986 )" (#127923 ) This reverts commit dd64ca2a02434944ecbc8f3e186d44ba81e3cb26. There's a silent incorrectness bug with needs_fixed_stride_order=True and mutable custom ops, so it's better to flip the default back to avoid silent incorrectness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127923 Approved by: https://github.com/williamwen42	2024-06-04 22:25:45 +00:00
Svetlana Karslioglu	20f966a8e0	Ignore undocumented PipelineSchedule.step (#127955 ) Ignore undocumented PipelineSchedule.step to fix doc build: https://github.com/pytorch/pytorch/actions/runs/9372492435/job/25805861083?pr=127938#step:11:1284 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127955 Approved by: https://github.com/kit1980	2024-06-04 22:11:09 +00:00
Mikayla Gawarecki	a7b1dd82ff	Default XLA to use swap_tensors path in nn.Module._apply (#126814 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814 Approved by: https://github.com/JackCaoG, https://github.com/albanD ghstack dependencies: #127313	2024-06-04 21:40:49 +00:00
Ting Lu	1b704a160f	Add linker script optimization flag to CMAKE rule for CUDA ARM wheel (#127514 ) Original PR - https://github.com/pytorch/pytorch/pull/127220 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127514 Approved by: https://github.com/Aidyn-A, https://github.com/atalman	2024-06-04 20:51:44 +00:00
PyTorch MergeBot	6dc0a291b9	Revert "[dynamo] Bugfix for nn parameter construction (#127806 )" This reverts commit f27c4dd862bf79f37019ef277957cd577d57b66f. Reverted https://github.com/pytorch/pytorch/pull/127806 on behalf of https://github.com/PaliC due to causing nn tests to fail ([comment](https://github.com/pytorch/pytorch/pull/127806#issuecomment-2148393903))	2024-06-04 20:51:41 +00:00
Tristan Rice	597922ba21	Reapply "distributed debug handlers (#126601 )" (#127805 ) This reverts commit 7646825c3eb687030c4f873b01312be0eed80174. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127805 Approved by: https://github.com/PaliC	2024-06-04 19:44:30 +00:00
Anshul Sinha	e76b28c765	[dtensor][debug] added c10d alltoall_ and alltoall_base_ to CommDebugMode (#127360 ) Summary Added c10d alltoall_ and alltoall_base tracing to CommDebugMode and edited test case in test_comm_mode to include added features. Test Plan pytest test/distributed/_tensor/debug/test_comm_mode.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127360 Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/yifuwang ghstack dependencies: #127358	2024-06-04 18:29:48 +00:00
Anshul Sinha	01e6d1cae4	[dtensor][debug] added c10d reduce_scatter_ and reduce_scatter_tensor_coalesced tracing_ to CommDebugMode (#127358 ) Summary Added c10d reduce_scatter_ and reduce_scatter_tensor_coalesced tracing to CommDebugMode and edited test case in test_comm_mode to include added features. Test Plan pytest test/distributed/_tensor/debug/test_comm_mode.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127358 Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/yifuwang	2024-06-04 18:29:48 +00:00
PyTorch MergeBot	9a25ff77af	Revert "[inductor] Enable subprocess-based parallel compile as the default (#126817 )" This reverts commit cf77e7dd9770caf65e898ac2ee82045aa0408e30. Reverted https://github.com/pytorch/pytorch/pull/126817 on behalf of https://github.com/huydhn due to There are lots of flaky inductor failure showing up in trunk after this commit `cf77e7dd97`, so I am trying to revert this to see if this helps ([comment](https://github.com/pytorch/pytorch/pull/126817#issuecomment-2148143502))	2024-06-04 18:26:12 +00:00
Animesh Jain	f27c4dd862	[dynamo] Bugfix for nn parameter construction (#127806 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127806 Approved by: https://github.com/jansel ghstack dependencies: #127785, #127802	2024-06-04 18:25:46 +00:00
Animesh Jain	569c5e72e7	[dynamo] Unspec nn module when global backward hooks are present (#127802 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127802 Approved by: https://github.com/jansel ghstack dependencies: #127785	2024-06-04 18:25:46 +00:00
Animesh Jain	c7e936a56a	[dynamo] Tensorvariable - track grad with _grad field (#127785 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127785 Approved by: https://github.com/jansel	2024-06-04 18:25:46 +00:00
Shan19900305	3bcc3cddb5	Using scalarType instead string in function _group_tensors_by_device_and_dtype. (#127869 ) torch.dtype can now pass through pybind11, so modify the function _group_tensors_by_device_and_dtype to use the scalar type directly, without converting between torch.dtype and string on the Python and C++ sides. @ezyang @bdhirsh Pull Request resolved: https://github.com/pytorch/pytorch/pull/127869 Approved by: https://github.com/ezyang	2024-06-04 18:19:33 +00:00
PyTorch MergeBot	0ff60236ab	Revert "Retire torch.distributed.pipeline (#127354 )" This reverts commit b9c058c203ee38032594f898f27cd8404f113a63. Reverted https://github.com/pytorch/pytorch/pull/127354 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the doc build failure looks legit `b9c058c203` ([comment](https://github.com/pytorch/pytorch/pull/127354#issuecomment-2148133982))	2024-06-04 18:19:31 +00:00
chuanqiw	627d2cd87d	[CI] disable td for xpu ci test by default (#127611 ) Because td is enabled by default for the xpu CI tests, a large share of test cases (75%) have been skipped in CI. This let some failures escape CI, for example issue #127539. This PR depends on PR #127595 landing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127611 Approved by: https://github.com/etaf, https://github.com/atalman	2024-06-04 17:15:10 +00:00
Hu Niu	36e9b71613	Enable UFMT on test/test_jit_fuser_te.py (#127759 ) Part of #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127759 Approved by: https://github.com/ezyang	2024-06-04 16:56:03 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	ff32f6c93b	Use freshly traced jit-traced module to be used in export analysis (#127577 ) Summary: When we export already traced module, it seems to be modifying some global state causing the traced modules to fail to run. For now, we are only logging for test cases, so it is probs ok to trace fresh copy to be used in export for now. Test Plan: CI Differential Revision: D57983518 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127577 Approved by: https://github.com/pianpwk	2024-06-04 16:54:23 +00:00
Eddie Yan	c490046693	[BE]: Update cudnn to 9.1.0.70 (#123475 ) cuDNN has managed to upload cu11 and cu12 wheels for ~~9.0.0.312~~ 9.1.0.70, so trying this out... CC @Skylion007 @malfet Co-authored-by: Wei Wang <weiwan@nvidia.com> Co-authored-by: atalman <atalman@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123475 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/nWEIdia	2024-06-04 16:33:06 +00:00
Aaron Orenstein	97ea2b5d83	documentation for pattern_matcher.py (#127459 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127459 Approved by: https://github.com/oulgen ghstack dependencies: #127457, #127458	2024-06-04 15:24:47 +00:00
Aaron Orenstein	7a60a75256	Add typing annotations to pattern_matcher.py (#127458 ) Turn on `mypy: disallow-untyped-defs` in pattern_matcher.py and fix the fallout. There are still a bunch of `type: ignore` annotations which should eventually be ironed out. In the process, found a bug: #127457 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127458 Approved by: https://github.com/Skylion007 ghstack dependencies: #127457	2024-06-04 15:24:47 +00:00
Aaron Orenstein	9adfa143d7	fix post_grad pattern (#127457 ) The lowering pattern built by cuda_and_enabled_mixed_mm_and_not_int8() was using ListOf() incorrectly - ListOf() is meant to represent a single repeating pattern - but cuda_and_enabled_mixed_mm_and_not_int8() was passing two patterns - I think based on the comment it's trying to build a sequence which would be represented by an actual list, not ListOf(). The behavior of the existing pattern would be to pass the second pattern as the `partial` parameter of `ListOf` which is meant to be a boolean - so it's almost certainly not what was intended. I tried changing it to be what I thought was the intended behavior but then the resnet152 test failed accuracy - so I'm just preserving the existing behavior with the correct parameter types. Found when adding annotations to pattern_matcher.py (#127458) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127457 Approved by: https://github.com/oulgen	2024-06-04 15:24:41 +00:00
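The argument-binding problem described in the commit above can be illustrated with a minimal sketch. `ListOfSketch` is a hypothetical stand-in that only mirrors the `(pattern, partial)` parameter order mentioned in the message; it is not the real class from pattern_matcher.py, and the string placeholders stand in for pattern objects.

```python
class ListOfSketch:
    """Hypothetical stand-in: one repeating pattern plus a boolean `partial` flag."""

    def __init__(self, pattern, partial=False):
        self.pattern = pattern
        self.partial = partial


# Placeholders for two pattern objects (assumption: any objects would do here).
first, second = "pattern_a", "pattern_b"

# Passing two patterns positionally binds the second one to `partial`.
# It is then treated as a (truthy) flag, not as the next element of a sequence.
wrapped = ListOfSketch(first, second)
print(wrapped.pattern, repr(wrapped.partial), bool(wrapped.partial))

# A sequence of two patterns would instead be an ordinary Python list.
sequence = [first, second]
print(sequence)
```

With a keyword-only or type-checked `partial` parameter this kind of call would be rejected early, which is consistent with the bug surfacing while annotations were being added.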
cyy	f8c6d43524	Concat namespaces and other fixes in torch/csrc/utils (#127833 ) It contains formatting and other minor fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127833 Approved by: https://github.com/ezyang	2024-06-04 15:12:45 +00:00
Valeriu	91461601b6	[TORCH_FA2_flash_api] Update total_q to the reshaped query 0th dimension (#127524 ) There is a difference (and a bug) between TORCH_FA2_flash_api:mha_varlen_fwd and FA2_flash_api:mha_varlen_fwd at the query transposition (GQA) step. ``` at::Tensor temp_q = q; if (seqlenq_ngroups_swapped) { temp_q = q.reshape( ... ... } const int total_q = q.sizes()[0]; CHECK_SHAPE(temp_q, total_q, num_heads, head_size_og); ``` When doing query transposition we need to update total_q to the reshaped query 0th dimension, i.e.: ``` const int total_q = temp_q.sizes()[0]; ``` The original FA2_flash_api:mha_varlen_fwd does not introduce a new variable temp_q but overwrites the q value directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127524 Approved by: https://github.com/drisspg	2024-06-04 14:44:45 +00:00
IvanKobzarev	c209fbdc53	[inductor] Fix missing unbacked def for unbacked in input expr (#127770 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127770 Approved by: https://github.com/ezyang	2024-06-04 14:43:01 +00:00
cyy	059cae6176	[Caffe2] Remove Caffe2 proto and other files (#127655 ) Remove Caffe2 proto files altogether. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127655 Approved by: https://github.com/ezyang	2024-06-04 14:22:21 +00:00
PyTorch MergeBot	4c074a9b8b	Revert "[torchbind] always fakify script object by default in non-strict export (#127116 )" This reverts commit c27882ffa8c1c7e4cf8ebc6c2f879e5b6c8814ad. Reverted https://github.com/pytorch/pytorch/pull/127116 on behalf of https://github.com/atalman due to Failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/127116#issuecomment-2147459339))	2024-06-04 12:53:19 +00:00
Edward Z. Yang	fb696ef3aa	Complete revamp of float/promotion sympy handling (#126905 ) At a high level, the idea behind this PR is: * Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.) * Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers. The story begins in torch/utils/_sympy/functions.py. Here, I make some changes to how we represent certain operations in sympy expressions: * FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing). * ModularIndexing, LShift, RShift now assert they are given integer inputs. * Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver * TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2*53 beyond what first coercing the integer to floats and then doing true division. Trunc is split to TruncToFloat and TruncToInt. * Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result. 
* RoundDecimal updated to consistently only ever return a float * Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing) In torch/__init__.py, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations. Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information. We also need to introduce some new op handlers in torch/_inductor/ops_handler.py: * `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy * `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv` These changes have consequences. First, we need to make some administrative changes: * Actually wire up these Sympy functions from SymInt/SymFloat in torch/fx/experimental/sym_node.py, including the new promotion rules (promote2) * Add support for new Sympy functions in torch/utils/_sympy/interp.py, torch/utils/_sympy/reference.py * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this to fix tests here * Add printer support for the Sympy functions in torch/_inductor/codegen/common.py, torch/_inductor/codegen/cpp_utils.py, torch/_inductor/codegen/triton.py. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. 
TODO: The additions here are not exhaustive yet * Update ValueRanges logic to use new sympy functions in torch/utils/_sympy/value_ranges.py. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions. In torch/fx/experimental/symbolic_shapes.py we need to make some symbolic reasoning adjustments: * Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now * `_assert_bound_is_rational` is no more, we no longer generate rational bounds * Don't intersect non-int value ranges with the `int_range` * Support more sympy Functions for guard SYMPY_INTERP * Assert the type of value range is consistent with the variable type The new asserts uncovered necessary bug fixes: * torch/_inductor/codegen/cpp.py, torch/_inductor/select_algorithm.py, torch/_inductor/sizevars.py - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions * torch/_inductor/utils.py - make sure you actually pass in sympy.Expr to these functions * torch/_inductor/ir.py - make_contiguous_strides_for takes int/SymInt, not sympy.Expr! * torch/export/dynamic_shapes.py - don't use infinity to represent int ranges, instead use sys.maxsize - 1 Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at test/test_proxy_tensor.py Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905 Approved by: https://github.com/xadupre, https://github.com/lezcano	2024-06-04 11:47:32 +00:00
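The precision remark about IntTrueDiv in the commit above (inputs at or beyond what the message writes as `2*53`, presumably meaning 2**53) can be checked in plain Python. This is a minimal sketch, not code from the PR: the helper names are hypothetical, and it relies only on the fact that Python's `int / int` rounds the exact quotient once, while coercing to float first rounds twice.

```python
def float_truediv(a: int, b: int) -> float:
    """Coerce both integers to float first, then divide (two rounding steps)."""
    return float(a) / float(b)


def int_truediv(a: int, b: int) -> float:
    """Python-style integer true division: the exact quotient is rounded once."""
    return a / b


# The exact quotient is 2**53 + 1, which is not representable as a float.
a = 3 * (2**53 + 1)
b = 3

direct = int_truediv(a, b)
via_floats = float_truediv(a, b)

# The two results land on different neighbouring floats.
print(direct, via_floats, direct == via_floats)
```

For operands below 2**53 both paths agree, because every such integer converts to float exactly.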
Jack Taylor	db515b6ac7	[ROCm] Fix error in torch.cuda initialisation if amdsmi is not available (#127528 ) Reported in https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/15874 When nvml_count is set via `9f73c65b8f/torch/cuda/__init__.py (L834)` If amdsmi is not available this will throw an error ``` File "python3.10/site-packages/torch/cuda/__init__.py", line 634, in _raw_device_count_amdsmi except amdsmi.AmdSmiException as e: NameError: name 'amdsmi' is not defined ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127528 Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/pruthvistony, https://github.com/atalman	2024-06-04 11:16:02 +00:00
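The failure mode in the commit above (an `except` clause naming an attribute of a module that was never imported) can be reproduced with a small stdlib-only sketch. The module name `hypothetical_smi`, its `device_count` function, and its `SmiException` class are made up for illustration and do not correspond to the real `amdsmi` API; the guard shown is one possible fix, not necessarily the one used in the PR.

```python
import importlib


def try_import(name):
    """Return the module if it can be imported, else None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# "hypothetical_smi" is a made-up optional dependency that is not installed.
smi = try_import("hypothetical_smi")


def device_count_guarded() -> int:
    """Check availability first, so the module's names are never touched."""
    if smi is None:
        return -1
    try:
        return smi.device_count()  # hypothetical call
    except smi.SmiException:  # hypothetical exception type
        return -1


def device_count_unguarded() -> int:
    """Mimics the bug: the except clause itself refers to an undefined name."""
    try:
        return missing_module.device_count()  # NameError: name is undefined
    except missing_module.SmiException:  # evaluated while handling: NameError again
        return -1
```

The expression in an `except` clause is only evaluated once an exception reaches it, which is why this kind of bug stays hidden until the optional module is actually missing.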
Andrew Gu	49048e7f26	[FSDP2] Fixed variable shadowing of `module` (#127776 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127776 Approved by: https://github.com/wanchaol ghstack dependencies: #127771	2024-06-04 10:27:34 +00:00
Yifu Wang	f325b39303	Introduce Inductor passes to micro-pipeline all-gather-matmul and matmul-reduce-scatter in certain cases (#126598 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126598 Approved by: https://github.com/wanchaol	2024-06-04 09:06:56 +00:00
Sam Larsen	cf77e7dd97	[inductor] Enable subprocess-based parallel compile as the default (#126817 ) Differential Revision: [D58056502](https://our.internmc.facebook.com/intern/diff/D58056502) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126817 Approved by: https://github.com/eellison	2024-06-04 07:48:32 +00:00
Ke Wen	b9c058c203	Retire torch.distributed.pipeline (#127354 ) Retires the module for real, now that the deprecation warning has been in place for a while. The newly supported module is torch.distributed.pipelining. Please migrate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354 Approved by: https://github.com/wconstab	2024-06-04 07:03:26 +00:00
Ke Wen	6abca6a564	[export][unflatten] More strictly respect scope when removing inputs (#127607 ) Code snippet from TorchTitan (LLaMa): ``` for layer in self.layers.values(): h = layer(h, self.freqs_cis) ``` `self.freqs_cis` is a buffer of root module (`self`). It is also an explicit arg in the call signature of original `layer` modules. If not respecting scope -- `freqs_cis`'s scope only corresponds to root -- `_sink_param` can remove `freqs_cis` from `layer`'s call signature, resulting in runtime error. There are two fixes in this PR: 1. We filter out the `inputs_to_state` corresponding to the current scope, using existing code that does prefix matching. 2. We delay the removal of param inputs from `call_module` nodes' `args`, till `_sink_param` call on that submodule returns. The return now returns information on which input is actually removed by the submodule, thus more accurate than just doing: ``` for node in call_module_nodes: node.args = tuple(filter(lambda n: n.name not in inputs_to_state, node.args)) ``` Before the PR: ![Screenshot 2024-05-31 at 1 40 24 AM](https://github.com/pytorch/pytorch/assets/6676466/a2e06b18-44d5-40ca-b242-0edab45075b7) After the PR: ![Screenshot 2024-05-31 at 1 43 41 AM](https://github.com/pytorch/pytorch/assets/6676466/b72afb94-cdfa-420d-b88b-29a92bf2a0c0) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127607 Approved by: https://github.com/pianpwk	2024-06-04 06:43:54 +00:00
Masahiro Hiramori	e216df48c8	[Dynamo][TVM] Fix ignored `trials` argument for MetaSchedule (#127747 ) Fixes #127746 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127747 Approved by: https://github.com/jansel	2024-06-04 06:13:02 +00:00
Andrew Gu	2122c9e2a9	[BE] Enabled lintrunner on torch/distributed/utils.py (#127771 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127771 Approved by: https://github.com/wanchaol, https://github.com/Skylion007	2024-06-04 06:10:33 +00:00
Ke Wen	ef77f2ca4a	[pipelining] Simple 1F1B schedule (#127673 ) ![Screenshot 2024-05-31 at 9 13 18 PM](https://github.com/pytorch/pytorch/assets/6676466/ecf3ca24-33a6-4188-9f7c-df6e96311caa) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127673 Approved by: https://github.com/wconstab	2024-06-04 06:09:51 +00:00
satheeshhab	f4b77ce8e2	Masked scale meta function registration #119984 (#127389 ) Fixes #119984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127389 Approved by: https://github.com/cpuhrsch	2024-06-04 06:09:17 +00:00
cyy	e7cb43a2d2	Check unused variables in tests (#127498 ) Enables unused variable checks in CMake. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127498 Approved by: https://github.com/ezyang	2024-06-04 05:35:25 +00:00
Boyuan Feng	2ad0e4197d	[ts-migration] support aten::__is__, aten::__isnot__, aten::__not__, profiler::_record_function_enter_new, profiler::_record_function_exit (#127656 ) Support more ops in ts converter and add unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127656 Approved by: https://github.com/SherlockNoMad	2024-06-04 04:51:29 +00:00
Yanbo Liang	8d153e0bab	[Inductor] Add FlexAttention backward kernel dynamic shape tests (#127728 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127728 Approved by: https://github.com/Chillee	2024-06-04 04:32:03 +00:00
Yanbo Liang	e793ae220f	[Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127678 Approved by: https://github.com/Chillee	2024-06-04 04:27:24 +00:00
Nikita Shulga	dae757c971	Specify supported OS matrix (#127816 ) Windows-10 or newer manylinux-2014 MacOS-11 or newer (but only on Apple Silicon) Fixes https://github.com/pytorch/pytorch/issues/126679 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127816 Approved by: https://github.com/kit1980, https://github.com/huydhn	2024-06-04 04:25:41 +00:00
Will Constable	22368eac10	[FSDP2] Fix submesh slicing to enable 3D parallelism (#127585 ) Ensures the sharded parameters are created on a submesh that excludes the Pipeline Parallelism dimension. Also cleans up the logic for storing placements so that it no longer considers the outer / global dims. Since we store an 'spmd' submesh, we can avoid this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127585 Approved by: https://github.com/wanchaol	2024-06-04 04:24:09 +00:00
Yanbo Liang	69f5b66132	[Inductor] FlexAttention backward kernel optimization (#127208 ) BWD Speedups (before this PR): ``` \| Type \| Speedup \| shape \| score_mod \| dtype \| \|---------\|-----------\|-------------------\|---------------\|----------------\| \| Average \| 0.211 \| \| \| \| \| Max \| 0.364 \| (16, 16, 512, 64) \| relative_bias \| torch.bfloat16 \| \| Min \| 0.044 \| (2, 16, 4096, 64) \| causal_mask \| torch.bfloat16 \| ``` BWD Speedups (after this PR, though not optimizing block size yet): ``` \| Type \| Speedup \| shape \| score_mod \| dtype \| \|---------\|-----------\|--------------------\|---------------\|----------------\| \| Average \| 0.484 \| \| \| \| \| Max \| 0.626 \| (2, 16, 512, 256) \| head_bias \| torch.bfloat16 \| \| Min \| 0.355 \| (8, 16, 4096, 128) \| relative_bias \| torch.bfloat16 \| ``` A few things remain to be done as follow-ups: * Optimize the default block size on A100/H100. * Support different seqlen for Q and K/V. * Support dynamic shapes for backward. * Enhance unit tests to check there is no ```nan``` value in any grad. I think we should make some changes to ```test_padded_dense_causal``` because it has invalid inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127208 Approved by: https://github.com/Chillee	2024-06-04 04:22:41 +00:00
Aaron Gokaslan	2498ef7490	Fix scheduler typehints (#127769 ) Fixes scheduler typehints Pull Request resolved: https://github.com/pytorch/pytorch/pull/127769 Approved by: https://github.com/jansel	2024-06-04 04:19:06 +00:00
Xilun Wu	6580a18f86	[c10d][BE] fix test_init_pg_and_rpc_with_same_socket (#127654 ) Summary: Fix `test_init_pg_and_rpc_with_same_socket` in `test/distributed/test_store.py`, which missed a call to destroy the created ProcessGroup before exiting the test function. This led to an "init PG twice" error in the test. Test Plan: `pytest test/distributed/test_store.py -s -k test_init_pg_and_rpc_with_same_socket` and `ciflow/periodic`, since this test is included in `.ci/pytorch/multigpu-test.sh` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127654 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-06-04 04:00:28 +00:00
Menglu Yu	7e906ec9e5	[PT2][Optimus] Improve group batch fusion with same parent/users fusion enablement (#127648 ) Summary: Currently, we fuse ops in arbitrary places. Here we enable fusion of ops with the same parent/users, to enable potential follow-up split-cat elimination. Context https://docs.google.com/document/d/1MSZY23wKD2keW2Z-DfAI1DscDERHKjOJAnuB5bxa06I/edit Test Plan: # local reproduce ``` buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "pm_cmf" --flow_id 559694026 ``` P1386889671 Differential Revision: D58037636 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127648 Approved by: https://github.com/jackiexu1992	2024-06-04 03:41:44 +00:00
mori360	c32fe6b279	[FSDP] keep paras in torch.distributed.checkpoint.state_dict.set_optimizer_state_dict (#127644 ) Fixes https://github.com/pytorch/pytorch/issues/126948 In the previous code in the `_load_optim_state_dict` function, under the `info.broadcast_from_rank0` condition, `optim_state_dict` holds the parameters based on `optim`. The changes here aim to synchronize the parameters that differ. Unit tests are in `test_state_dict.py`, in `test_optim_state_dict_para_matching`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127644 Approved by: https://github.com/fegin	2024-06-04 03:32:22 +00:00
Kiuk Chung	4d0386ce1c	[torch/jit-runtime] Add explicit include of <chrono> to torch/jit/run… (#127779 ) Added an explicit include to `<chrono>` in `jit/runtime/logging.h` since `std::chrono::time_point<std::chrono::high_resolution_clock>` is directly referenced in the header. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127779 Approved by: https://github.com/albanD	2024-06-04 02:12:17 +00:00
Nikita Shulga	ddef7c350f	Add comments about runner labels (#127827 ) To distinguish between org-wide and repo-specific runners, as well as highlight where they are hosted (by DevInfra, LF or various partners). Delete unused `bm-runner`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127827 Approved by: https://github.com/huydhn	2024-06-04 02:06:43 +00:00
Chien-Chin Huang	1208347d09	[inductor][ez] fix loop ordering test (#127807 ) I didn't realize that the main block is not being run when inductor tests are being run in FBCode via remote GPUs. This is a quick fix. I've tested it in both OSS and FBCode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127807 Approved by: https://github.com/eellison, https://github.com/jansel	2024-06-04 01:14:34 +00:00
Jirka	41033a4274	PyPI: fix link to images to be rendered (#127798 ) This addresses long-pending issues on PyPI. The [package description](https://pypi.org/project/torch/2.3.0/) is the repo's README, but unlike GitHub's rendering, PyPI only accepts raw images linked via Markdown image syntax. ![image](https://github.com/pytorch/pytorch/assets/6035284/1d8e51d5-c8c1-4f92-b323-f7684879adb4) This minor link edit turns the images into raw images so that they render correctly on PyPI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127798 Approved by: https://github.com/albanD	2024-06-04 00:59:58 +00:00
cyy	05fa05cbae	[2/N] Change static functions in headers to inline (#127764 ) Follows #127727 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127764 Approved by: https://github.com/Skylion007	2024-06-04 00:49:04 +00:00
haozhe.zhu	dbf39a6e63	[inductor] fix linear_add_bias path (#127597 ) Previously, the `linear_add_bias` path did not work. This PR fixes it and adds more unit tests for it. Test Plan: ``` python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_add_bias ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127597 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-04 00:39:01 +00:00
Joel Schlosser	b42cfcabc4	Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946 ) PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`: * `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()` * `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()` CPU impls for these new ATen ops will be added in a follow-up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946 Approved by: https://github.com/davidberard98	2024-06-03 23:41:54 +00:00
eqy	ac568fc007	[CUDNN] Remove defunct cuDNN V8 API build flag (#120006 ) The flag basically does nothing following #95722 Let's see if the quantization tests break CC @malfet @atalmanagement Pull Request resolved: https://github.com/pytorch/pytorch/pull/120006 Approved by: https://github.com/malfet	2024-06-03 22:42:05 +00:00
Jeff Daily	0e7bd7fedd	[ROCm] TunableOp improvements (#124362 ) - use less memory; smaller default hipblaslt workspace size - options to avoid cache effects - icache flush option - rotating buffers during tuning - python APIs - unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124362 Approved by: https://github.com/xw285cornell	2024-06-03 22:30:11 +00:00
Scott Wolchok	0f1f0d3015	Onboard ARM bfloat16 to gemv fast path (#127484 ) Summary: Used bfloat16 dot support from #127477 to write a bfloat16 transposed fast path and integrated it. Test Plan: Ran https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py before and after on my Apple M1 Pro. Before: ``` mv_nt torch.float32 6.77 usec mv_nt torch.float16 8.24 usec mv_nt torch.bfloat16 184.74 usec mv_ta torch.float32 5.71 usec mv_ta torch.float16 27.95 usec mv_ta torch.bfloat16 98.06 usec notrans torch.float32 5.55 usec notrans torch.float16 25.11 usec notrans torch.bfloat16 63.55 usec trans_a torch.float32 5.62 usec trans_a torch.float16 74.48 usec trans_a torch.bfloat16 313.19 usec trans_b torch.float32 5.68 usec trans_b torch.float16 8.18 usec trans_b torch.bfloat16 14.96 usec ``` After: ``` mv_nt torch.float32 5.40 usec mv_nt torch.float16 8.25 usec mv_nt torch.bfloat16 12.81 usec mv_ta torch.float32 5.69 usec mv_ta torch.float16 27.94 usec mv_ta torch.bfloat16 98.18 usec notrans torch.float32 5.60 usec notrans torch.float16 25.17 usec notrans torch.bfloat16 63.22 usec trans_a torch.float32 5.61 usec trans_a torch.float16 69.32 usec trans_a torch.bfloat16 316.62 usec trans_b torch.float32 5.60 usec trans_b torch.float16 8.09 usec trans_b torch.bfloat16 14.61 usec ``` Note large improvement in mv_nt torch.bfloat16 case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127484 Approved by: https://github.com/malfet ghstack dependencies: #127477, #127478	2024-06-03 22:14:16 +00:00
Scott Wolchok	f6ca822366	Patch ARM Half use_gemv_fast_path gate to avoid kernel duplication (#127478 ) Summary: The existing code didn't gate the fast path, so the fast path had to duplicate the stock kernel. Now we gate it and delete the duplicate kernel. Test Plan: Existing tests. Flipped the TORCH_INTERNAL_ASSERT_DEBUG_ONLY to non-debug and forced to fail (locally) to make sure we had test coverage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127478 Approved by: https://github.com/malfet ghstack dependencies: #127477	2024-06-03 22:14:16 +00:00
Scott Wolchok	6faa3d5f18	Onboard ARM bfloat16 to gemm-by-dot-product-for-gemm_transa_ infrastructure (#127477 ) Summary: This gets us a baseline level of reasonable performance for bfloat16 matrix-vector and matrix-matrix multiplication on my Apple M1. I've intentionally left using intrinsics for future work. Test Plan: Used https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py (modified to run larger sizes) to benchmark a range of LLM-interesting matrix-vector and matrix-matrix sizes on my Apple M1 Pro. bfloat16 performance is improved across the board (except possibly for very small cases) and now exceeds float32 performance (as it should) for the matrix-vector cases. Before: ``` Matrix-vector: m=8, n=128, k=1 ==================== trans_b torch.float32 0.75 usec trans_b torch.float16 0.71 usec trans_b torch.bfloat16 0.81 usec m=128, n=8, k=1 ==================== trans_b torch.float32 0.75 usec trans_b torch.float16 0.93 usec trans_b torch.bfloat16 0.98 usec m=4096, n=4096, k=1 ==================== trans_b torch.float32 2194.31 usec trans_b torch.float16 661.27 usec trans_b torch.bfloat16 3758.42 usec m=11008, n=4096, k=1 ==================== trans_b torch.float32 5792.04 usec trans_b torch.float16 1789.98 usec trans_b torch.bfloat16 10120.67 usec m=4096, n=11008, k=1 ==================== trans_b torch.float32 6101.22 usec trans_b torch.float16 1927.34 usec trans_b torch.bfloat16 10469.47 usec m=32000, n=4096, k=1 ==================== trans_b torch.float32 18353.20 usec trans_b torch.float16 5161.06 usec trans_b torch.bfloat16 29601.69 usec Matrix-matrix (prompt len 4: m=8, n=128, k=4 ==================== trans_b torch.float32 2.14 usec trans_b torch.float16 0.85 usec trans_b torch.bfloat16 1.19 usec m=128, n=8, k=4 ==================== trans_b torch.float32 1.47 usec trans_b torch.float16 1.85 usec trans_b torch.bfloat16 1.75 usec m=4096, n=4096, k=4 ==================== trans_b torch.float32 4416.40 usec trans_b torch.float16 
2688.36 usec trans_b torch.bfloat16 14987.33 usec m=11008, n=4096, k=4 ==================== trans_b torch.float32 6140.24 usec trans_b torch.float16 7467.26 usec trans_b torch.bfloat16 40295.52 usec m=4096, n=11008, k=4 ==================== trans_b torch.float32 6143.10 usec trans_b torch.float16 7298.04 usec trans_b torch.bfloat16 41393.43 usec m=32000, n=4096, k=4 ==================== trans_b torch.float32 17650.72 usec trans_b torch.float16 21346.63 usec trans_b torch.bfloat16 116849.98 usec Matrix-matrix (prompt len 8: m=8, n=128, k=8 ==================== trans_b torch.float32 1.05 usec trans_b torch.float16 1.03 usec trans_b torch.bfloat16 1.69 usec m=128, n=8, k=8 ==================== trans_b torch.float32 2.05 usec trans_b torch.float16 3.08 usec trans_b torch.bfloat16 2.95 usec m=4096, n=4096, k=8 ==================== trans_b torch.float32 2323.99 usec trans_b torch.float16 5265.45 usec trans_b torch.bfloat16 29942.40 usec m=11008, n=4096, k=8 ==================== trans_b torch.float32 6202.01 usec trans_b torch.float16 14677.90 usec trans_b torch.bfloat16 80625.18 usec m=4096, n=11008, k=8 ==================== trans_b torch.float32 6112.05 usec trans_b torch.float16 14340.52 usec trans_b torch.bfloat16 82799.99 usec m=32000, n=4096, k=8 ==================== trans_b torch.float32 17650.65 usec trans_b torch.float16 42551.43 usec trans_b torch.bfloat16 236081.08 usec Matrix-matrix (prompt len 16: m=8, n=128, k=16 ==================== trans_b torch.float32 1.26 usec trans_b torch.float16 1.34 usec trans_b torch.bfloat16 2.69 usec m=128, n=8, k=16 ==================== trans_b torch.float32 1.60 usec trans_b torch.float16 5.81 usec trans_b torch.bfloat16 5.34 usec m=4096, n=4096, k=16 ==================== trans_b torch.float32 2328.05 usec trans_b torch.float16 10526.58 usec trans_b torch.bfloat16 60028.28 usec m=11008, n=4096, k=16 ==================== trans_b torch.float32 6243.35 usec trans_b torch.float16 28505.08 usec trans_b torch.bfloat16 163670.15 usec 
m=4096, n=11008, k=16 ==================== trans_b torch.float32 5870.11 usec trans_b torch.float16 28597.89 usec trans_b torch.bfloat16 165404.88 usec m=32000, n=4096, k=16 ==================== trans_b torch.float32 17746.27 usec trans_b torch.float16 83393.87 usec trans_b torch.bfloat16 472313.13 usec Matrix-matrix (prompt len 32: m=8, n=128, k=32 ==================== trans_b torch.float32 1.35 usec trans_b torch.float16 2.01 usec trans_b torch.bfloat16 4.68 usec m=128, n=8, k=32 ==================== trans_b torch.float32 1.19 usec trans_b torch.float16 10.98 usec trans_b torch.bfloat16 10.13 usec m=4096, n=4096, k=32 ==================== trans_b torch.float32 2525.29 usec trans_b torch.float16 23106.71 usec trans_b torch.bfloat16 122987.04 usec m=11008, n=4096, k=32 ==================== trans_b torch.float32 6131.34 usec trans_b torch.float16 57537.41 usec trans_b torch.bfloat16 327825.00 usec m=4096, n=11008, k=32 ==================== trans_b torch.float32 6395.01 usec trans_b torch.float16 57456.33 usec trans_b torch.bfloat16 331325.58 usec m=32000, n=4096, k=32 ==================== trans_b torch.float32 19078.68 usec trans_b torch.float16 167735.08 usec trans_b torch.bfloat16 975736.88 usec Matrix-matrix (prompt len 128: m=8, n=128, k=128 ==================== trans_b torch.float32 2.40 usec trans_b torch.float16 6.07 usec trans_b torch.bfloat16 16.83 usec m=128, n=8, k=128 ==================== trans_b torch.float32 1.78 usec trans_b torch.float16 40.35 usec trans_b torch.bfloat16 37.21 usec m=4096, n=4096, k=128 ==================== trans_b torch.float32 4827.60 usec trans_b torch.float16 84341.24 usec trans_b torch.bfloat16 478917.75 usec m=11008, n=4096, k=128 ==================== trans_b torch.float32 11879.96 usec trans_b torch.float16 226484.33 usec trans_b torch.bfloat16 1289465.50 usec m=4096, n=11008, k=128 ==================== trans_b torch.float32 10707.75 usec trans_b torch.float16 229200.58 usec trans_b torch.bfloat16 1327416.67 usec m=32000, 
n=4096, k=128 ==================== trans_b torch.float32 33306.32 usec trans_b torch.float16 662898.21 usec trans_b torch.bfloat16 3815866.63 usec ``` After: ``` Matrix-vector: m=8, n=128, k=1 ==================== trans_b torch.float32 0.77 usec trans_b torch.float16 0.72 usec trans_b torch.bfloat16 0.77 usec m=128, n=8, k=1 ==================== trans_b torch.float32 0.73 usec trans_b torch.float16 0.93 usec trans_b torch.bfloat16 1.56 usec m=4096, n=4096, k=1 ==================== trans_b torch.float32 2195.22 usec trans_b torch.float16 675.40 usec trans_b torch.bfloat16 1038.29 usec m=11008, n=4096, k=1 ==================== trans_b torch.float32 5980.27 usec trans_b torch.float16 1806.08 usec trans_b torch.bfloat16 2756.46 usec m=4096, n=11008, k=1 ==================== trans_b torch.float32 6339.95 usec trans_b torch.float16 1844.71 usec trans_b torch.bfloat16 2726.52 usec m=32000, n=4096, k=1 ==================== trans_b torch.float32 18137.17 usec trans_b torch.float16 6020.75 usec trans_b torch.bfloat16 8612.89 usec Matrix-matrix (prompt len 4: m=8, n=128, k=4 ==================== trans_b torch.float32 2.24 usec trans_b torch.float16 0.91 usec trans_b torch.bfloat16 1.07 usec m=128, n=8, k=4 ==================== trans_b torch.float32 1.58 usec trans_b torch.float16 1.96 usec trans_b torch.bfloat16 2.11 usec m=4096, n=4096, k=4 ==================== trans_b torch.float32 4583.43 usec trans_b torch.float16 3014.04 usec trans_b torch.bfloat16 4434.04 usec m=11008, n=4096, k=4 ==================== trans_b torch.float32 6245.55 usec trans_b torch.float16 7513.82 usec trans_b torch.bfloat16 11207.80 usec m=4096, n=11008, k=4 ==================== trans_b torch.float32 6096.22 usec trans_b torch.float16 7688.82 usec trans_b torch.bfloat16 11143.72 usec m=32000, n=4096, k=4 ==================== trans_b torch.float32 17982.88 usec trans_b torch.float16 22001.28 usec trans_b torch.bfloat16 32470.62 usec Matrix-matrix (prompt len 8: m=8, n=128, k=8 ==================== 
trans_b torch.float32 1.05 usec trans_b torch.float16 1.02 usec trans_b torch.bfloat16 1.44 usec m=128, n=8, k=8 ==================== trans_b torch.float32 2.07 usec trans_b torch.float16 3.10 usec trans_b torch.bfloat16 3.38 usec m=4096, n=4096, k=8 ==================== trans_b torch.float32 2245.43 usec trans_b torch.float16 5597.87 usec trans_b torch.bfloat16 8775.08 usec m=11008, n=4096, k=8 ==================== trans_b torch.float32 6227.68 usec trans_b torch.float16 15102.41 usec trans_b torch.bfloat16 22457.37 usec m=4096, n=11008, k=8 ==================== trans_b torch.float32 6082.16 usec trans_b torch.float16 15131.57 usec trans_b torch.bfloat16 21860.15 usec m=32000, n=4096, k=8 ==================== trans_b torch.float32 19659.00 usec trans_b torch.float16 45075.64 usec trans_b torch.bfloat16 67746.75 usec Matrix-matrix (prompt len 16: m=8, n=128, k=16 ==================== trans_b torch.float32 1.31 usec trans_b torch.float16 1.41 usec trans_b torch.bfloat16 2.04 usec m=128, n=8, k=16 ==================== trans_b torch.float32 1.66 usec trans_b torch.float16 5.76 usec trans_b torch.bfloat16 6.37 usec m=4096, n=4096, k=16 ==================== trans_b torch.float32 2271.34 usec trans_b torch.float16 11198.46 usec trans_b torch.bfloat16 16893.54 usec m=11008, n=4096, k=16 ==================== trans_b torch.float32 6266.85 usec trans_b torch.float16 29342.49 usec trans_b torch.bfloat16 45159.22 usec m=4096, n=11008, k=16 ==================== trans_b torch.float32 5999.16 usec trans_b torch.float16 29157.43 usec trans_b torch.bfloat16 43295.81 usec m=32000, n=4096, k=16 ==================== trans_b torch.float32 18028.83 usec trans_b torch.float16 89626.88 usec trans_b torch.bfloat16 128164.62 usec Matrix-matrix (prompt len 32: m=8, n=128, k=32 ==================== trans_b torch.float32 1.38 usec trans_b torch.float16 2.03 usec trans_b torch.bfloat16 3.29 usec m=128, n=8, k=32 ==================== trans_b torch.float32 1.24 usec trans_b torch.float16 10.58 
usec trans_b torch.bfloat16 11.97 usec m=4096, n=4096, k=32 ==================== trans_b torch.float32 2591.56 usec trans_b torch.float16 21683.62 usec trans_b torch.bfloat16 32657.68 usec m=11008, n=4096, k=32 ==================== trans_b torch.float32 6468.43 usec trans_b torch.float16 57811.33 usec trans_b torch.bfloat16 89263.21 usec m=4096, n=11008, k=32 ==================== trans_b torch.float32 6034.74 usec trans_b torch.float16 59372.56 usec trans_b torch.bfloat16 88107.85 usec m=32000, n=4096, k=32 ==================== trans_b torch.float32 18609.27 usec trans_b torch.float16 167298.00 usec trans_b torch.bfloat16 255116.37 usec Matrix-matrix (prompt len 128: m=8, n=128, k=128 ==================== trans_b torch.float32 2.44 usec trans_b torch.float16 6.11 usec trans_b torch.bfloat16 10.92 usec m=128, n=8, k=128 ==================== trans_b torch.float32 1.80 usec trans_b torch.float16 40.26 usec trans_b torch.bfloat16 44.82 usec m=4096, n=4096, k=128 ==================== trans_b torch.float32 4773.29 usec trans_b torch.float16 84458.54 usec trans_b torch.bfloat16 131248.58 usec m=11008, n=4096, k=128 ==================== trans_b torch.float32 12249.16 usec trans_b torch.float16 234411.87 usec trans_b torch.bfloat16 351970.71 usec m=4096, n=11008, k=128 ==================== trans_b torch.float32 11439.24 usec trans_b torch.float16 233347.04 usec trans_b torch.bfloat16 354475.96 usec m=32000, n=4096, k=128 ==================== trans_b torch.float32 33803.03 usec trans_b torch.float16 688157.54 usec trans_b torch.bfloat16 1048221.42 usec ``` Also ran the stock configuration; it was unchanged, indicating that we need to integrate this path with torch.mv separately, which will come in a follow-up PR.l Pull Request resolved: https://github.com/pytorch/pytorch/pull/127477 Approved by: https://github.com/malfet	2024-06-03 22:14:10 +00:00
Xuehai Pan	01fc22056a	[BE] enable UFMT for `torch/masked/` (#127715 ) Part of #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127715 Approved by: https://github.com/cpuhrsch	2024-06-03 22:01:49 +00:00
Xiaodong Wang	406532f864	[AMD] Fix power_draw api (#127729 ) Summary: average_socket_power only gives me NA, so we need to change it to current_socket_power. Test Plan: Before this change, `torch.cuda.power_draw` gives me NA; after it, it gives me the right power reading (e.g. 441). Differential Revision: D58047484 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127729 Approved by: https://github.com/nmacchioni, https://github.com/eqy	2024-06-03 21:46:50 +00:00
Yidi Wu	c27882ffa8	[torchbind] always fakify script object by default in non-strict export (#127116 ) This diff can be risky for internal tests: any torchbind class that hasn't registered a fake class will fail and we should fix them. We've gained some confidence that this can work e2e by implementing FakeTensorQueue for TBE models in sigmoid with [D54210823](https://www.internalfb.com/diff/D54210823). Differential Revision: [D57991002](https://our.internmc.facebook.com/intern/diff/D57991002) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127116 Approved by: https://github.com/zou3519 ghstack dependencies: #127113, #127114	2024-06-03 21:38:57 +00:00
Yidi Wu	3efac92888	[torchbind] support torch.compile with aot_eager backend (#127114 ) Differential Revision: [D57991001](https://our.internmc.facebook.com/intern/diff/D57991001) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127114 Approved by: https://github.com/zou3519 ghstack dependencies: #127113	2024-06-03 21:38:57 +00:00
Yidi Wu	c6dc624690	[torchbind] remove test cases that don't fakify script objects (#127113 ) As titled. Differential Revision: [D57991003](https://our.internmc.facebook.com/intern/diff/D57991003) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127113 Approved by: https://github.com/zou3519	2024-06-03 21:38:50 +00:00
Zain Huda	6d4ec9b2ec	[RFC] Introduce Checkpointable for DCP (#127540 ) (#127628 ) Summary: # Introduce Checkpointable interface for DCP to support arbitrary tensor subclasses for checkpointing Authors: * zainhuda ## Summary This diff adds a CheckpointableTensor interface to allow for future compatibility for any tensor subclass with DCP in a clean and maintainable way. ## Motivation For TorchRec sharding migration from ShardedTensor to DTensor, we create a tensor subclass that is stored by DTensor to support TorchRec's sharding schemes (e.g., empty shards, multiple shards on a rank). ## Proposed Implementation See the CheckpointableTensor interface implementation, in which we introduce the minimal set of methods needed to be compatible with DCP. These methods are expected to be implemented by any tensor subclass, which is then checkpointable by DCP. ## Drawbacks No drawbacks; it extends functionality in a clean and maintainable way. ## Alternatives The alternative design was to create paths that check for certain attributes in tensor subclasses, which can get messy and makes it hard to maintain or understand why a check was there in the first place. Test Plan: Sandcastle cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k LucasLLC Differential Revision: D57970603 Pulled By: iamzainhuda Pull Request resolved: https://github.com/pytorch/pytorch/pull/127628 Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/fegin	2024-06-03 21:21:55 +00:00
Edward Z. Yang	a4064da8ca	Always simplify sympy expressions before printing. (#127543 ) This is important because if a replacement has happened during inductor lowering, we may have stale symbols in sympy expressions that we need to replace away. Do this at the very end. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127543 Approved by: https://github.com/lezcano	2024-06-03 20:36:14 +00:00
Xinya Zhang	ef9451ac8d	Move the build of AOTriton to base ROCM docker image. (#127012 ) Mitigates #126111 AOTriton, as a math library, takes a long time to build. However, the library is not changing as fast as PyTorch itself, so it is not cost-efficient to build it for every CI check. This PR moves the build of AOTriton from PyTorch to its base docker image, avoiding duplicated, long builds. Pre-this-PR: * PyTorch base docker build job duration: 1.1-1.3h * PyTorch build job duration: 1.4-1.5hr (includes AOTriton build time of 1hr6min on a linux.2xlarge node) Post-this-PR: * PyTorch base docker build job duration: 1.3h (includes AOTriton build time of 20min on a linux.12xlarge node) * PyTorch build job duration: <20 min Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127012 Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/huydhn	2024-06-03 20:35:22 +00:00
Ke Wen	941316f821	[pipelining] Stress test schedules with multi iters (#127475 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127475 Approved by: https://github.com/wconstab	2024-06-03 20:24:07 +00:00
Xiangyang (Mark) Guo	db9d457a3f	Use sleef on macOS Apple silicon by default (#126509 ) Use sleef ~~for aarch64~~ on macOS Apple silicon by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126509 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-06-03 19:33:06 +00:00
PyTorch MergeBot	2fc907971a	Revert "[Inductor] FlexAttention backward kernel optimization (#127208 )" This reverts commit f7171313abf14d9501a330457140b2f8a01c9985. Reverted https://github.com/pytorch/pytorch/pull/127208 on behalf of https://github.com/yanboliang due to test_flex_attention is failing internally ([comment](https://github.com/pytorch/pytorch/pull/127208#issuecomment-2145830810))	2024-06-03 18:13:27 +00:00
PyTorch MergeBot	3f45fa63f2	Revert "[Inductor] Add FlexAttention backward kernel dynamic shape tests (#127728 )" This reverts commit 10e3406ea5d115a54a7d753d33110762eb6c07ff. Reverted https://github.com/pytorch/pytorch/pull/127728 on behalf of https://github.com/yanboliang due to Internal breakage of https://github.com/pytorch/pytorch/pull/127208 hence reverting ([comment](https://github.com/pytorch/pytorch/pull/127728#issuecomment-2145822667))	2024-06-03 18:10:46 +00:00
PyTorch MergeBot	c35b65715c	Revert "[Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678 )" This reverts commit e2e3ca94ccce1c0abbfd75ac0368793e1756c268. Reverted https://github.com/pytorch/pytorch/pull/127678 on behalf of https://github.com/atalman due to Internal breakage of https://github.com/pytorch/pytorch/pull/127208 hence reverting ([comment](https://github.com/pytorch/pytorch/pull/127678#issuecomment-2145821489))	2024-06-03 18:07:57 +00:00
GdoongMathew	3437177e2b	Quick Fix on #126854 , deepcopy `lr` and other possible `base_parameters` (#127190 ) * Apply `deepcopy` to every base parameters (`initial_lr`, `max_lr`) when instantiating `LRScheduler`. Fixes #126854 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127190 Approved by: https://github.com/janeyx99	2024-06-03 18:06:31 +00:00
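The aliasing problem that the `deepcopy` fix above addresses can be sketched without PyTorch. This is a minimal illustration only: `SketchScheduler` is a hypothetical stand-in, not the `LRScheduler` implementation, and a one-element list plays the role of a mutable `lr` (such as a tensor).

```python
import copy


class SketchScheduler:
    """Hypothetical stand-in for a scheduler; NOT the torch.optim API."""

    def __init__(self, param_groups):
        for group in param_groups:
            # Deep-copy the base parameter so later in-place updates to
            # group["lr"] cannot silently change the recorded initial value.
            group.setdefault("initial_lr", copy.deepcopy(group["lr"]))
        self.base_lrs = [group["initial_lr"] for group in param_groups]


# A one-element list stands in for a mutable learning rate (assumption).
groups = [{"lr": [0.1]}]
sched = SketchScheduler(groups)
groups[0]["lr"][0] *= 0.5  # in-place update, as an optimizer step might do

print(sched.base_lrs[0], groups[0]["lr"])
```

Without the `deepcopy`, `initial_lr` and `lr` would be the same object, and the in-place update would change both.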
Alnis Murtovi	d8d0bf264a	Inductor: Allow small sizes of m for mixed mm autotuning (#127663 ) For mixed mm with small sizes of m, such as in the example provided in #127056, being able to set BLOCK_M to 16 leads to better performance. This PR introduces kernel configs that are specific to mixed mm by extending the mm configs with two configs that work well for the example provided in #127056. I am excluding configs with (BLOCK_M=16, BLOCK_K=16, BLOCK_N=64) because triton crashes when this config is used. For the example in #127056: - Without my changes, skip_triton is evaluated to true which disables autotuning. On my machine I achieve 146GB/s. - If autotuning is enabled, but BLOCK_M>=32, I achieve 614 GB/s. - With the changes in this PR (i.e. autotuning enabled and BLOCK_M=16), I achieve 772 GB/s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127663 Approved by: https://github.com/Chillee	2024-06-03 17:53:48 +00:00
Janani Sriram	7c3740d388	[NestedTensor] Extend coverage for unbind when ragged_idx != 1 (#127493 ) Summary: Extend coverage for the `NestedTensor` `unbind` operator to cases in which `ragged_idx != 1`. Currently, the `unbind` operator in the `NestedTensor` class splits a tensor along the 0-th dimension, where the `ragged_idx` property, which controls the jagged dimension upon which `unbind` splits, is 1. This diff extends support for `ragged_idx != 1` in `NestedTensor`s, allowing `unbind` to split a tensor along a jagged dimension greater than 0 for `NestedTensor`s with and without the `lengths` property. Test Plan: Added the following unit tests: `test_unbind_ragged_idx_equals_2_cpu`, `test_unbind_ragged_idx_equals_3_cpu`, and `test_unbind_ragged_idx_equals_last_dim_cpu` verify that `unbind` works for all jagged dimensions greater than 1, for `NestedTensor`s without `lengths`. ``` test_unbind_ragged_idx_equals_2_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok test_unbind_ragged_idx_equals_3_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok test_unbind_ragged_idx_equals_last_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok ``` `test_unbind_with_lengths_cpu` and `test_unbind_with_lengths_ragged_idx_equals_1_cpu` verify that `unbind` works when the jagged dimension is 1, for `NestedTensor`s with `lengths`. ``` test_unbind_with_lengths_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok test_unbind_with_lengths_ragged_idx_equals_1_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok ``` `test_unbind_with_lengths_ragged_idx_equals_2_cpu` and `test_unbind_with_lengths_ragged_idx_equals_3_cpu` verify that `unbind` works when the jagged dimension is greater than 1, for `NestedTensor`s with `lengths`. ``` test_unbind_with_lengths_ragged_idx_equals_2_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok test_unbind_with_lengths_ragged_idx_equals_3_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... 
ok ``` `test_unbind_with_lengths_ragged_idx_equals_0_cpu` verifies that `unbind` fails when the jagged dimension is 0 (the batch dimension), for `NestedTensor`s with `lengths`. ``` test_unbind_with_lengths_ragged_idx_equals_0_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok ``` `test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`. ``` test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok ``` `test_unbind_with_wrong_lengths_cpu` verifies that `unbind` fails when the lengths exceed the limitations set by offsets, for `NestedTensor`s with `lengths`. ``` test_unbind_with_wrong_lengths_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok ``` Differential Revision: D57942686 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127493 Approved by: https://github.com/davidberard98	2024-06-03 17:46:12 +00:00
angelayi	4d32de14b6	[export] Handle serializing duplicate getitem nodes (#127633 ) We ran into a graph that looks something like the following, where we have 2 getitem calls to the same index (%getitem, %getitem_2 both query topk[0]): ``` graph(): %x : [num_users=1] = placeholder[target=x] %topk : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%x, 2), kwargs = {}) %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%topk, 0), kwargs = {}) %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%topk, 1), kwargs = {}) %getitem_2 : [num_users=1] = call_function[target=operator.getitem](args = (%topk, 0), kwargs = {}) %mul_tensor : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%getitem, %getitem_2), kwargs = {}) %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_tensor, 2), kwargs = {}) return (mul, getitem_1) ``` The duplicate getitem call gets created during a pass.. so there are a couple of solutions: 1. Change serializer to support the case of duplicate getitem calls 2. Change the pass so that it doesn’t produce duplicate getitem calls 3. Add a pass which dedups the getitem calls As a framework, we should do 1 and 3 (through a CSE pass). This PR implements solution 1. However, the serializer currently does some special handling for getitem nodes -- instead of directly serializing the getitem nodes, we serialize the output of the node that outputting a list of tensors (the %topk node in this example) into a list nodes for each output ([%getitem, %getitem_1]). This fails when we have duplicate getitem nodes to the same index (%getitem_2), since we do not record that duplicate getitem node anywhere. So, the solution this PR takes is that the serializer will deduplicate the getitem nodes (%getitem_2 will be replaced with %getitem). 
This would result in a semantically correct graph, but not necessarily one that is node-to-node identical to the original fx graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127633 Approved by: https://github.com/ydwu4	2024-06-03 17:25:51 +00:00
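The deduplication idea described in this commit message can be sketched in plain Python. The sketch assumes each getitem node is represented as a hypothetical `(name, source, index)` tuple; it does not use the real `torch.fx` or serializer data structures.

```python
def dedup_getitem(nodes):
    """Map duplicate getitem nodes onto the first node with the same (source, index).

    `nodes` is a list of hypothetical (name, source, index) tuples in graph order.
    Returns the names that are kept and a {duplicate_name: kept_name} mapping.
    """
    first_seen = {}    # (source, index) -> name of the first getitem node
    kept = []          # names of nodes that survive deduplication
    replacements = {}  # duplicate name -> name it should be replaced with
    for name, source, index in nodes:
        key = (source, index)
        if key in first_seen:
            replacements[name] = first_seen[key]
        else:
            first_seen[key] = name
            kept.append(name)
    return kept, replacements


# Mirrors the example graph: %getitem and %getitem_2 both query topk[0].
kept, replacements = dedup_getitem([
    ("getitem", "topk", 0),
    ("getitem_1", "topk", 1),
    ("getitem_2", "topk", 0),
])
print(kept, replacements)
```

Every use of a duplicate name is then rewritten to the kept name, which is why the result is semantically equivalent but not node-for-node identical.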
Aaron Gokaslan	12c4a2c297	[BE]: Apply PLR1736 fixes (unnecessary index lookup) (#127716 ) Applies the PLR1736 preview rule with some more autofixes to cut down on unnecessary accesses. Added a noqa since that test is actually testing the dunder method. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127716 Approved by: https://github.com/ezyang	2024-06-03 17:22:13 +00:00
Wanchao Liang	21144ce570	[dtensor] implement scatter op with simple replication (#126713 ) As titled, implement the torch.scatter op with a simple replication strategy; need to follow up and see if we could actually support any sharding pattern Pull Request resolved: https://github.com/pytorch/pytorch/pull/126713 Approved by: https://github.com/tianyu-l ghstack dependencies: #126712	2024-06-03 16:16:28 +00:00
Wanchao Liang	ded580a594	[dtensor] standardize multi mesh-dim strategy with utils (#126712 ) This PR standardizes the multi mesh-dim strategy generation by unifying a util that expands a single mesh-dim strategy into a multi mesh-dim strategy, to make strategy generation simpler Pull Request resolved: https://github.com/pytorch/pytorch/pull/126712 Approved by: https://github.com/tianyu-l	2024-06-03 16:16:28 +00:00
PyTorch MergeBot	d1fad416a8	Revert "Add aten._unsafe_masked_index (#116491 )" This reverts commit f03f8bc901a6c9038308a6353e8d280f4b5628f5. Reverted https://github.com/pytorch/pytorch/pull/116491 on behalf of https://github.com/PaliC due to breaking onnx tests ([comment](https://github.com/pytorch/pytorch/pull/116491#issuecomment-2145557724))	2024-06-03 15:51:50 +00:00
atalman	53f001c599	Revert "correct BLAS input (#126200 )" (#127762 ) This reverts commit ea13e9a097aaa875a2b404822579b7f8b62ea291. Looks like this could have caused: https://github.com/pytorch/pytorch/actions/runs/9346105069/job/25722431775#step:17:984 Aarch64 tests failures: ``` + echo 'Checking that MKLDNN is available on aarch64' Checking that MKLDNN is available on aarch64 + pushd /tmp /tmp / + python -c 'import torch; exit(0 if torch.backends.mkldnn.is_available() else 1)' Error: Process completed with exit code 1. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127762 Approved by: https://github.com/PaliC, https://github.com/malfet	2024-06-03 15:49:48 +00:00
Shuqiang Zhang	8677508167	[c10d] guard gpu context during abort (#127363 ) This is a mitigation for an internal out-of-memory issue on GPU0 that happened during comms abort; this PR was tested internally and fixed the out-of-memory issue. Note: this is supposed to be a mitigation only, as the ideal fix should be within the NCCL comm libs, which should just set the right CUDA context before any CUDA call and restore it to its exact previous state. ncclCommDestroy/ncclCommAbort -> commReclaim -> commDestroySync (https://fburl.com/code/pori1tka) In commDestroySync, it thinks that the "current device context" is not the same as the comm's device context. It tries to: 1) save the current context 2) set the comm's device context 3) clean up things 4) restore the "previously stored context" by another cudaSetDevice. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127363 Approved by: https://github.com/wconstab	2024-06-03 15:41:11 +00:00
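The save / set / clean up / restore sequence listed above is the usual device-guard pattern. The following is a rough sketch of that pattern only, under the assumption of a hypothetical module-level "current device" variable; it does not call CUDA, NCCL, or c10d, and `get_device`, `set_device`, and `device_guard` are illustrative names.

```python
from contextlib import contextmanager

_current_device = 0  # hypothetical stand-in for the process's current device


def get_device():
    return _current_device


def set_device(device):
    global _current_device
    _current_device = device


@contextmanager
def device_guard(device):
    """Switch to `device` for the duration of the block, then restore."""
    previous = get_device()   # 1) save the current context
    set_device(device)        # 2) set the communicator's device
    try:
        yield                 # 3) clean-up work runs here
    finally:
        set_device(previous)  # 4) restore the previously stored context


with device_guard(3):
    inside = get_device()
after = get_device()
print(inside, after)
```

The `finally` clause is the point of the guard: the previous device is restored even if the clean-up step raises.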
Aidyn-A	430cdfc0ac	[ATen][Native] fixes sparse SPMV on aarch64 (#127642 ) Fixes #127491 In #127491 result was allocated as `result = at::empty(...)`, which does not guarantee `result` being filled by zeros, therefore `torch.mv` was producing non-finite values. This happened mainly because the corner case (`beta = 0`) of `addmv` was not taken care of, as it should be just like in any other `addmv`/`addmm`: `923edef31c/aten/src/ATen/native/mkl/SparseBlasImpl.cpp (L307-L311)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127642 Approved by: https://github.com/malfet	2024-06-03 15:38:27 +00:00
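Why the `beta = 0` corner case matters can be shown with a small pure-Python sketch of `result = beta * result + alpha * (mat @ vec)`. This is not the ATen or MKL code path; `addmv_naive` and `addmv_safe` are hypothetical helpers, and NaN is used to simulate the arbitrary contents of an uninitialized (`at::empty`-style) buffer.

```python
import math


def matvec(mat, vec):
    return [sum(a * x for a, x in zip(row, vec)) for row in mat]


def addmv_naive(result, mat, vec, beta, alpha):
    # Always reads `result`, so garbage in the buffer leaks through:
    # 0 * NaN is NaN, not 0.
    return [beta * r + alpha * p for r, p in zip(result, matvec(mat, vec))]


def addmv_safe(result, mat, vec, beta, alpha):
    # Corner case: when beta == 0 the old contents must be ignored entirely.
    if beta == 0:
        return [alpha * p for p in matvec(mat, vec)]
    return [beta * r + alpha * p for r, p in zip(result, matvec(mat, vec))]


uninitialized = [math.nan, math.nan]  # simulated uninitialized memory
mat = [[1.0, 2.0], [3.0, 4.0]]
vec = [1.0, 1.0]

naive = addmv_naive(uninitialized, mat, vec, beta=0.0, alpha=1.0)
safe = addmv_safe(uninitialized, mat, vec, beta=0.0, alpha=1.0)
print(naive, safe)
```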
Zain Rizvi	badf898df2	Remove unstable ARC jobs (#127563 ) Disable these jobs since we're no longer trying to enable ARC Pull Request resolved: https://github.com/pytorch/pytorch/pull/127563 Approved by: https://github.com/huydhn	2024-06-03 15:30:06 +00:00
James Wu	63d7ffe121	Retry of D58015187 Move AsyncCompile to a different file (#127691 ) Summary: This is a retry of https://github.com/pytorch/pytorch/pull/127545/files and D58015187, fixing the internal test that also imported codecache Test Plan: Same tests as CI in github, plus sandcastle for internal unit tests should pass now Differential Revision: D58054611 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127691 Approved by: https://github.com/oulgen	2024-06-03 15:29:41 +00:00
PaliC	3f8b8f08c8	[Split Build] Make libtorch_global_deps accessible from libtorch wheel (#127570 ) Title Pull Request resolved: https://github.com/pytorch/pytorch/pull/127570 Approved by: https://github.com/atalman, https://github.com/malfet	2024-06-03 15:14:29 +00:00
PyTorch MergeBot	d05cddfe23	Revert "FP8 rowwise scaling (#125204 )" This reverts commit 923edef31c7f3e98a14625724f2019b1422dcb26. Reverted https://github.com/pytorch/pytorch/pull/125204 on behalf of https://github.com/atalman due to Broke nightlies and internal tests ([comment](https://github.com/pytorch/pytorch/pull/125204#issuecomment-2145422196))	2024-06-03 15:00:21 +00:00
Isuru Fernando	f03f8bc901	Add aten._unsafe_masked_index (#116491 ) To generate masked indexing operations that would generate masked loads in triton code Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491 Approved by: https://github.com/lezcano, https://github.com/peterbell10	2024-06-03 14:44:03 +00:00
Edward Z. Yang	d6963e769c	Force Inductor output code to be dumped even if it fails to compile (#127700 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127700 Approved by: https://github.com/oulgen	2024-06-03 14:06:53 +00:00
Daniil Kutz	f343f98710	[jit] Validate mobile module fields parsed by flatbuffer loader (#127437 ) Fixes an error in the `torch.jit.load` Python API function that causes a crash in the C-backend of PyTorch. The mobile module is successfully parsed from flatbuffer format, but its fields are used without any validation. Fixes #127434 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127437 Approved by: https://github.com/davidberard98	2024-06-03 08:48:12 +00:00
Xilun Wu	e017b56c0c	[dtensor] local_map UX change: keep func signature and be compatible with Tensor input (#126924 ) Summary This PR has 2 parts of change in `local_map`: 1. regulates the way user can access `DeviceMesh` inside the `func` argument of `local_map`. This means `local_map` will strictly follow the `func` signature without implicitly passing any argument to `func`. If user wants to use `DeviceMesh` inside `func`, this mesh must be explicitly passed to `func` as an argument by user. For example, ``` def user_function(device_mesh, /, args, kwargs): USER CODE HERE local_func = local_map(func=user_function, ...) dtensor_out = local_func(device_mesh, dtensor_input, ...) ``` Before this PR, user code was like: ``` def user_function(device_mesh, /, args, kwargs): USER CODE HERE local_func = local_map(func=user_function, ...) dtensor_out = local_func(dtensor_input, ...) # local_map passes mesh implicitly for user ``` 2. `local_map` now supports mix use of `torch.Tensor` and `DTensor` in argument: - Pure torch.Tensor case: no `DTensor` argument is passed in, all tensor arguments are `torch.Tensor`. Bypass the `in_placements` check and unwrapping steps. The output will not be wrapped into `DTensor` but directly returned. - Pure DTensor case: no `torch.Tensor` argument is passed in, all tensor arguments are `DTensor`. This follows the default rule: `in_placements` check, unwrapping arguments, pass into `func`, wrapping the `torch.Tensor` output into `DTensor` if the `out_placements` is not `None`. - Mix of the above two: some arguments are `torch.Tensor` while some are `DTensor`. Only perform `in_placements` check and unwrapping on `DTensor` arguments. For output processing, it's the same as Pure DTensor case. Test** `pytest test/distributed/_tensor/experimental/test_local_map.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126924 Approved by: https://github.com/wanchaol	2024-06-03 08:41:59 +00:00
diwei sun	2d1ad0c31a	[CI] Add freezing for cpu inductor accuracy test in inductor CI (#124715 ) This PR enables '--freezing' when running the dynamo accuracy check in CI. Background: issue [#124286](https://github.com/pytorch/pytorch/issues/124286) is not captured by CI since freezing is not enabled for cpu-inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124715 Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman, https://github.com/desertfire	2024-06-03 07:37:30 +00:00
Yanbo Liang	10e3406ea5	[Inductor] Add FlexAttention backward kernel dynamic shape tests (#127728 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127728 Approved by: https://github.com/Chillee	2024-06-03 07:15:46 +00:00
Chien-Chin Huang	6d21685b45	[DSD] Fixes various bugs for broadcast_from_rank0 (#127635 ) Fixes https://github.com/pytorch/pytorch/issues/126285 Summary: 1. Fixes https://github.com/pytorch/pytorch/issues/126285 2. Broadcast one tensor at a time to avoid OOM. 3. Add some docstrings Pull Request resolved: https://github.com/pytorch/pytorch/pull/127635 Approved by: https://github.com/weifengpy	2024-06-03 06:35:21 +00:00
Feng Yuan	48846cd164	Update torch-xpu-ops pin (ATen XPU implementation) (#127730 ) Regular bi-weekly pin update. 1. Porting operator relative PyTorch unit tests. The existing operators in torch-xpu-ops are covered by, 1) Operator specific test, like test_binary_ufuncs.py. 2) Operator common test, like test_ops.py. 2. Bugfixing under the latest PyTorch unit test scope, https://github.com/intel/torch-xpu-ops/tree/release/2.4/test/xpu. Totally 297 ATen operators are implemented in torch-xpu-ops. https://github.com/intel/torch-xpu-ops/blob/release/2.4/yaml/xpu_functions.yaml Pull Request resolved: https://github.com/pytorch/pytorch/pull/127730 Approved by: https://github.com/EikanWang	2024-06-03 05:55:00 +00:00
Yanbo Liang	e2e3ca94cc	[Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127678 Approved by: https://github.com/Chillee	2024-06-03 04:35:50 +00:00
cyy	288df042c5	[1/N] Change static functions in headers to inline (#127727 ) So that it may fix some tricky linking issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127727 Approved by: https://github.com/ezyang	2024-06-03 04:34:36 +00:00
cyy	1b182ea0d2	Remove c10::guts::{conjunction,disjunction} (#127726 ) They are not used in PyTorch OSS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127726 Approved by: https://github.com/ezyang	2024-06-03 04:06:21 +00:00
leslie-fang-intel	3399ad8d9d	[Inductor][CPP] Add UT for bitwise right shift (#127731 ) Summary Per the discussion in https://github.com/pytorch/pytorch/issues/127310, `bitwise_right_shift` failed in Torch 2.1 but passes with the latest PyTorch. Add the UT in this PR to ensure correctness. TestPlan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_bitwise_right_shift ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127731 Approved by: https://github.com/Skylion007	2024-06-03 04:05:41 +00:00
Yanbo Liang	7e97b33fbb	[Dynamo] Log backward graph compilation metrics (#126629 ) Fixes #125313 Compilation metric logs for the code example at #125313: ``` %s CompilationMetrics(compile_id='0/0', frame_key='1', co_name='forward', co_filename='/data/users/ybliang/debug/debug2.py', co_firstlineno=10, cache_size=0, accumulated_cache_size=0, guard_count=11, shape_env_guard_count=0, graph_op_count=1, graph_node_count=3, graph_input_count=1, start_time=1716247236.6165977, entire_frame_compile_time_s=7.926939964294434, backend_compile_time_s=7.887059926986694, inductor_compile_time_s=4.108498811721802, code_gen_time_s=3.97833514213562, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=set(), compliant_custom_ops=set(), restart_reasons={"'skip function graph_break in file /home/ybliang/local/pytorch/torch/_dynamo/decorators.py'"}, dynamo_time_before_restart_s=0.025330543518066406, has_guarded_code=True, is_fwd=True) %s CompilationMetrics(compile_id='1/0', frame_key='2', co_name='torch_dynamo_resume_in_forward_at_12', co_filename='/data/users/ybliang/debug/debug2.py', co_firstlineno=12, cache_size=0, accumulated_cache_size=0, guard_count=10, shape_env_guard_count=0, graph_op_count=2, graph_node_count=5, graph_input_count=1, start_time=1716247244.544928, entire_frame_compile_time_s=0.10148310661315918, backend_compile_time_s=0.08753013610839844, inductor_compile_time_s=0.03691983222961426, code_gen_time_s=0.022417306900024414, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=set(), compliant_custom_ops=set(), restart_reasons=set(), dynamo_time_before_restart_s=0.0, has_guarded_code=True, is_fwd=True) tensor([[-0.1622, -0.0000, -0.0000, 0.5643, -0.0000, 0.0000, -0.5087, 0.0914, -0.0000, -0.0421]], grad_fn=<CompiledFunctionBackward>) %s CompilationMetrics(compile_id='1/0', frame_key=None, co_name=None, co_filename=None, co_firstlineno=None, 
cache_size=None, accumulated_cache_size=None, guard_count=None, shape_env_guard_count=None, graph_op_count=None, graph_node_count=None, graph_input_count=None, start_time=None, entire_frame_compile_time_s=None, backend_compile_time_s=None, inductor_compile_time_s=0.026738643646240234, code_gen_time_s=0.016446352005004883, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=None, compliant_custom_ops=None, restart_reasons=None, dynamo_time_before_restart_s=None, has_guarded_code=None, is_fwd=False) %s CompilationMetrics(compile_id='0/0', frame_key=None, co_name=None, co_filename=None, co_firstlineno=None, cache_size=None, accumulated_cache_size=None, guard_count=None, shape_env_guard_count=None, graph_op_count=None, graph_node_count=None, graph_input_count=None, start_time=None, entire_frame_compile_time_s=None, backend_compile_time_s=None, inductor_compile_time_s=0.14563536643981934, code_gen_time_s=0.08652091026306152, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=None, compliant_custom_ops=None, restart_reasons=None, dynamo_time_before_restart_s=None, has_guarded_code=None, is_fwd=False) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126629 Approved by: https://github.com/ezyang	2024-06-03 03:55:33 +00:00
PyTorch MergeBot	84776d7597	Revert "[BE]: Update mypy to 1.10.0 (#127717 )" This reverts commit 30213ab0a7b27277e76ea9dd707ce629a63d91ee. Reverted https://github.com/pytorch/pytorch/pull/127717 on behalf of https://github.com/huydhn due to I am not sure why but the failures look legit and they are showing up in trunk `30213ab0a7` ([comment](https://github.com/pytorch/pytorch/pull/127717#issuecomment-2144183347))	2024-06-03 02:52:47 +00:00
bigning	e57f51b80f	Update _dedup_save_plans.py (#126569 ) To resolve https://github.com/pytorch/pytorch/issues/125740, save each tensor on the lowest rank. Fixes #125740 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126569 Approved by: https://github.com/LucasLLC	2024-06-03 01:55:03 +00:00
Yash Rathore	fec8ef8c17	[Aten][BlasKernel] Add function prototype to fix compiler error (#127719 ) Adds a prototype for function `fp16_dot_with_fp32_arith()` in `aten/src/ATen/native/BlasKernel.cpp`. Without this patch the build fails on Apple silicon/macOS (CPU) with the error `no previous prototype for function 'fp16_dot_with_fp32_arith' [-Werror,-Wmissing-prototypes]`. The function cannot be marked `static` because its use is not limited to this file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127719 Approved by: https://github.com/Skylion007	2024-06-02 23:41:43 +00:00
Xuehai Pan	8b08b0f340	[BE] enable ruff rule `Q` from flake8-quotes (#127713 ) Enable [ruff rule `Q`](https://docs.astral.sh/ruff/rules/#flake8-quotes-q) from flake8-quotes. Fixes: - [avoidable-escaped-quote (Q003)](https://docs.astral.sh/ruff/rules/avoidable-escaped-quote/#avoidable-escaped-quote-q003) - [unnecessary-escaped-quote (Q004)](https://docs.astral.sh/ruff/rules/unnecessary-escaped-quote/#unnecessary-escaped-quote-q004) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127713 Approved by: https://github.com/ezyang	2024-06-02 23:25:26 +00:00
Edward Z. Yang	139b9c6529	Avoid reference cycle in inner closure (#127711 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127711 Approved by: https://github.com/Skylion007, https://github.com/izaitsevfb	2024-06-02 21:28:46 +00:00
Aaron Gokaslan	30213ab0a7	[BE]: Update mypy to 1.10.0 (#127717 ) Updates mypy to the latest and greatest. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127717 Approved by: https://github.com/ezyang	2024-06-02 21:07:23 +00:00
Kiuk Chung	fb53cd6497	[aten_cuda/flash_attn] Add typename to template argument Kernel_trait… (#127634 ) Adds the `typename` keyword to the template argument `Kernel_traits::TiledMma` and `Kernel_traits::TiledMmaSdP` (which are dependent type names) when calling the template function `pytorch_flash::convert_layout_acc_Aregs`. Without `typename` flash_attention kernels do not compile with Clang under C++20 since Clang compiles the entire .cu file in a single pass as opposed to NVCC which split compiles the host and device code. Adding `typename` seems to be OK under NVCC based on CI cuda builds succeeding. Below is the excerpt of the compilation error: ``` third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/static_switch.h:46:24: note: expanded from macro 'ALIBI_SWITCH' 46 \| #define ALIBI_SWITCH BOOL_SWITCH \| ^ third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_launch_template.h:132:5: note: in instantiation of function template specialization 'pytorch_flash::run_flash_bwd_seqk_parallel<pytorch_flash::Flash_bwd_ke rnel_traits<160, 64, 64, 8, 4, 4, 4, false, true>, true>' requested here 132 \| run_flash_bwd_seqk_parallel<Kernel_traits, Is_dropout>(params, stream); \| ^ third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_launch_template.h:280:13: note: in instantiation of function template specialization 'pytorch_flash::run_flash_bwd<pytorch_flash::Flash_bwd_kernel_traits<1 60, 64, 64, 8, 4, 4, 4, false, true>, true>' requested here 280 \| run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 64, 64, 8, 4, 4, 4, false, true, T>, Is_dropout>(params, stream); \| ^ third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/static_switch.h:36:26: note: expanded from macro 'DROPOUT_SWITCH' 36 \| #define DROPOUT_SWITCH BOOL_SWITCH \| ^ third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim160_fp16_sm80.cu:12:5: note: in instantiation of function 
template specialization 'pytorch_flash::run_mha_bwd_hdim160<cutlass::half_t>' request ed here 12 \| run_mha_bwd_hdim160<cutlass::half_t>(params, stream); \| ^ In file included from third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim160_fp16_sm80.cu:7: In file included from third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_launch_template.h:12: third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_kernel.h:543:86: error: missing 'typename' prior to dependent type name 'Flash_bwd_kernel_traits<160, 64, 64, 8, 4, 4, 4, false, true>::TiledMmaSdP' 543 \| Tensor tPrP = make_tensor(rP.data(), pytorch_flash::convert_layout_acc_Aregs<Kernel_traits::TiledMmaSdP>(rP.layout())); ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127634 Approved by: https://github.com/Skylion007	2024-06-02 16:25:02 +00:00
rzou	08653fe355	Beef up the allow_in_graph docs (#127117 ) We make the following changes: - most of the time when someone uses allow_in_graph, they actually wanted to make a custom op. We add a link to the custom ops landing page and explain the differences between allow_in_graph and custom ops. - we warn people against using allow_in_graph footguns and document them. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/127117 Approved by: https://github.com/jansel, https://github.com/albanD	2024-06-02 15:00:46 +00:00
Aaron Gokaslan	e24a87ed8d	[BE][Ez]: Apply PYI059 - Generic always come last (#127685 ) Generic baseclass should always be last or unexpected issues can occur, especially in non-stub files (such as with MRO). Applies autofixes from the preview PYI059 rule to fix the issues in the codebase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127685 Approved by: https://github.com/ezyang	2024-06-02 13:38:58 +00:00
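The effect of base-class order on the method resolution order (MRO) can be seen with a small standard-library example. `Base`, `GenericFirst`, and `GenericLast` are illustrative names, not classes from the PyTorch codebase; the sketch only shows where `Generic` lands in the MRO, not any specific failure.

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class Base:
    pass


class GenericFirst(Generic[T], Base):  # the ordering that PYI059 flags
    pass


class GenericLast(Base, Generic[T]):   # the ordering the rule prefers
    pass


first_mro = [cls.__name__ for cls in GenericFirst.__mro__]
last_mro = [cls.__name__ for cls in GenericLast.__mro__]
print(first_mro)
print(last_mro)
```

With `Generic[...]` listed last, the concrete base classes come before `Generic` in the lookup chain.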
Aaron Gokaslan	c2547dfcc3	[BE][Ez]: Enable ruff PYI019 (#127684 ) Tells pytorch to use typing_extensions.Self when it's able to. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127684 Approved by: https://github.com/ezyang	2024-06-02 13:38:33 +00:00
Xuehai Pan	67ef2683d9	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#127689 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings that their messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. Resolves #126888 - #126888 This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689 Approved by: https://github.com/Skylion007	2024-06-02 12:30:43 +00:00
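The second half of this change (adding `category=FutureWarning` where the category was missing) can be illustrated with the standard `warnings` module alone. The function names below are hypothetical; the `typing_extensions.deprecated` decorator is deliberately left out so that the sketch depends only on the standard library.

```python
import warnings


def old_api_default():
    # No category given: warnings.warn falls back to UserWarning.
    warnings.warn("old_api is deprecated")


def old_api_future():
    # Explicit category, as the commit describes for deprecation messages.
    warnings.warn("old_api is deprecated", category=FutureWarning, stacklevel=2)


def categories(fn):
    """Call `fn` and return the categories of the warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        fn()
    return [w.category for w in caught]


print(categories(old_api_default), categories(old_api_future))
```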
Sheng Fu	c1dd3a615f	Implement Graph Transform Observer (#127427 ) Summary: Implement Graph Transform Observer Differential Revision: D57887518 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127427 Approved by: https://github.com/angelayi	2024-06-02 06:49:47 +00:00
cyy	4e7f497bb3	[Submodule] Remove ios-cmake (#127694 ) It has not been updated for a long time and CI iOS builds don't rely on it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127694 Approved by: https://github.com/ezyang	2024-06-02 04:40:21 +00:00
Michael Lazos	2129903aa3	Properly detect nested torch function args (#127496 ) Dynamo was not detecting nested torch function classes in containers. This was due to pytree compatibility for variable trackers being removed. Fixes https://github.com/pytorch/pytorch/issues/127174 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127496 Approved by: https://github.com/anijain2305	2024-06-02 03:43:22 +00:00
Colin Peppler	16578e8584	[symbolic shapes] if symbol not in var_ranges default to unknown range (#127681 ) Purpose of this PR is to get around this error: https://github.com/pytorch/pytorch/issues/127677 Differential Revision: D58048558 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127681 Approved by: https://github.com/lezcano	2024-06-02 02:28:40 +00:00
titaiwangms	4fd777ed59	[ONNX] Add quantized layer norm op to opset 17 (#127640 ) Fixes #126160 Continue #126555 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127640 Approved by: https://github.com/justinchuby	2024-06-02 02:10:02 +00:00
xinan.lin	c19ad112f6	[Inductor UT][Intel GPU] Skip test case which doesn't currently work on the XPU stack but newly re-enabled by community. (#127629 ) The Inductor UT test/inductor/test_triton_heuristics.py:test_artificial_zgrid, which was previously skipped, was recently enabled by the PR https://github.com/pytorch/pytorch/pull/127448. However, the test doesn't currently work on the XPU stack: it hangs on the GPU, so this PR skips the test for Intel GPU instead of marking it as an expected failure. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127629 Approved by: https://github.com/EikanWang, https://github.com/peterbell10	2024-06-02 01:00:33 +00:00
Boyuan Feng	2cef2fc2b4	[ts migration] support aten::dim, aten::len, aten::__getitem__ (#127593 ) - Add support for aten::dim, aten::len, aten::__getitem__ for torchscript to export converter. - Add unit tests Co-authored-by: cyy <cyyever@outlook.com> Co-authored-by: Menglu Yu <mengluy@meta.com> Co-authored-by: Animesh Jain <anijain@umich.edu> Co-authored-by: Simon Fan <xmfan@meta.com> Co-authored-by: Zain Rizvi <ZainR@meta.com> Co-authored-by: Tugsbayasgalan (Tugsuu) Manlaibaatar <tmanlaibaatar@meta.com> Co-authored-by: titaiwangms <titaiwang@microsoft.com> Co-authored-by: Yueming Hao <yhao@meta.com> Co-authored-by: IvanKobzarev <ivan.kobzarev@gmail.com> Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com> Co-authored-by: Edward Z. Yang <ezyang@meta.com> Co-authored-by: Bin Bao <binbao@meta.com> Co-authored-by: Feny Patel <fenypatel@meta.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: xinan.lin <xinan.lin@intel.com> Co-authored-by: Zain Huda <zainhuda@meta.com> Co-authored-by: Chien-Chin Huang <chienchin@fb.com> Co-authored-by: Wei Wang <weiwan@nvidia.com> Co-authored-by: Jason Ansel <jansel@meta.com> Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Co-authored-by: Iris Z <31293777+wz337@users.noreply.github.com> Co-authored-by: Wang, Eikan <eikan.wang@intel.com> Co-authored-by: angelayi <yiangela7@gmail.com> Co-authored-by: Svetlana Karslioglu <svekars@meta.com> Co-authored-by: Yanbo Liang <ybliang8@gmail.com> Co-authored-by: Catherine Lee <csl@fb.com> Co-authored-by: Kwanghoon An <kwanghoon@meta.com> Co-authored-by: Brian Hirsh <hirsheybar@fb.com> Co-authored-by: Robert Mast <rmast@live.nl> Co-authored-by: drisspg <drisspguessous@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127593 Approved by: https://github.com/SherlockNoMad, https://github.com/malfet	2024-06-02 00:36:33 +00:00
Oguz Ulgen	0d9e527c4d	Remove tensor storage_offset/storage_bytes from the cache key (#127319 ) Summary: We observed differences in these fields and inductor does not specialize on them so it is safe to remove them from the key. Test Plan: CI Reviewed By: masnesral Differential Revision: D57871276 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127319 Approved by: https://github.com/masnesral	2024-06-02 00:28:43 +00:00
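The idea in the commit above, dropping fields from a cache key when the compiled output does not depend on them, can be sketched in plain Python. This is an illustrative sketch only: `cache_key`, the metadata dictionary layout, and the hashing scheme are assumptions, not Inductor's actual cache implementation.

```python
import hashlib
import json

# Fields assumed, for illustration, not to influence the generated code.
IGNORED_FIELDS = ("storage_offset", "storage_bytes")


def cache_key(tensor_meta):
    """Hash tensor metadata, skipping fields the compiler does not specialize on."""
    relevant = {k: v for k, v in tensor_meta.items() if k not in IGNORED_FIELDS}
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(relevant, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this scheme, two tensors that differ only in an ignored field share a key and therefore hit the same cache entry, while a change in shape or dtype still produces a different key.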
eqy	2e779166eb	[Functorch][cuDNN] Bump tolerances for `test_vmapjvpvjp` (#127355 ) cuDNN can select a winograd kernel for this case which slightly affects tolerances... Pull Request resolved: https://github.com/pytorch/pytorch/pull/127355 Approved by: https://github.com/zou3519, https://github.com/Skylion007	2024-06-01 21:22:55 +00:00
Sam Larsen	6e2e09f6cc	[inductor] fix redis-related env vars in remote_cache.py (#127583 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127583 Approved by: https://github.com/oulgen	2024-06-01 19:55:25 +00:00
Wei Wang	b505e86475	[Inductor][CI][CUDA 12.4] Update dynamic_inductor_timm_training.csv - change gluon_inception_v3 from fail_accuracy to pass (#127672 ) From the HUD, most of the time the "X" is due to "improved_accuracy" for gluon_inception_v3. ![image](https://github.com/pytorch/pytorch/assets/143543872/d4f70377-2756-4921-872d-587426f00302) https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor_timm Pull Request resolved: https://github.com/pytorch/pytorch/pull/127672 Approved by: https://github.com/eqy, https://github.com/Skylion007	2024-06-01 19:12:43 +00:00
PyTorch MergeBot	17dea09b15	Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814 )" This reverts commit bfdec93395f675a0e5a59e95aef9104ac8f5081a. Reverted https://github.com/pytorch/pytorch/pull/126814 on behalf of https://github.com/izaitsevfb due to suspicious build instructions count regression, see [D58015016](https://www.internalfb.com/diff/D58015016) ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2143545818))	2024-06-01 18:46:16 +00:00
PyTorch MergeBot	82cd7a7dab	Revert "Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819 )" This reverts commit fa426b096b3635daab6ce26b44d50f3baab5a4e5. Reverted https://github.com/pytorch/pytorch/pull/126819 on behalf of https://github.com/izaitsevfb due to suspicious build instructions count regression, see [D58015016](https://www.internalfb.com/diff/D58015016) ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2143545818))	2024-06-01 18:46:16 +00:00
Lucas Pasqualin	42312a52b3	[DSD] Adds type_check param to copy state dict utils (#127417 ) [DSD] Adds type_check param to copy state dict utils. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127417 Approved by: https://github.com/fegin	2024-06-01 17:50:52 +00:00
Aaron Gokaslan	edffb28d39	[BE][Ez]: Enable B019 - flags memory leaks through LRU cache on method (#127686 ) Flags potential mem leaks through LRUCache and will hopefully make future contributors rethink this pattern which can cause memleaks. noqas the violations we currently have (should be fixed later) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127686 Approved by: https://github.com/c-p-i-o	2024-06-01 17:19:24 +00:00
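The pattern that B019 flags can be reproduced with the standard library alone. In this minimal sketch (the `Leaky` class is hypothetical), the cache lives on the function object shared by all instances and stores `self` as part of each key, so a cached instance stays alive after its last external reference is dropped.

```python
import functools
import gc
import weakref


class Leaky:
    @functools.lru_cache(maxsize=None)  # the pattern B019 warns about
    def compute(self, x):
        return x * 2


def instance_survives_deletion():
    """Return True if the cache keeps the instance alive after `del`."""
    obj = Leaky()
    obj.compute(1)  # the cache key now holds a strong reference to obj
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is not None


def instance_freed_after_clear():
    """Return True if clearing the cache releases the instance."""
    obj = Leaky()
    obj.compute(1)
    ref = weakref.ref(obj)
    del obj
    Leaky.compute.cache_clear()
    gc.collect()
    return ref() is None
```

Common ways around this are caching on a module-level function that takes only hashable values, or building a per-instance cache in `__init__` so that it is collected together with the instance.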
PyTorch MergeBot	22f392ba40	Revert "[easy?] Move AsyncCompile to a different file (#127235 )" This reverts commit f58fc16e8f059232f452a333f32e14ff681e12af. Reverted https://github.com/pytorch/pytorch/pull/127235 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see [D58015187](https://www.internalfb.com/diff/D58015187) ([comment](https://github.com/pytorch/pytorch/pull/127235#issuecomment-2143518610))	2024-06-01 17:16:16 +00:00
PyTorch MergeBot	d49dc8f4b8	Revert "Add noqa to prevent lint warnings (#127545 )" This reverts commit f9937afd4f87fbb4844642ae2f587b13b5caa08c. Reverted https://github.com/pytorch/pytorch/pull/127545 on behalf of https://github.com/izaitsevfb due to reverting to unblock the revert of #127545 ([comment](https://github.com/pytorch/pytorch/pull/127545#issuecomment-2143517711))	2024-06-01 17:12:46 +00:00
PyTorch MergeBot	114c752b14	Revert "Improve MAGMA conditional macro in BatchLinearAlgebra.cpp (#127495 )" This reverts commit ee08cf57924a4230edad3101666890d8fe050c75. Reverted https://github.com/pytorch/pytorch/pull/127495 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/127495#issuecomment-2143508218))	2024-06-01 16:39:06 +00:00
Animesh Jain	efcea2d2fd	[dynamo] Support __getitem__ on NNModuleVariable __dict__ (#126956 ) Moves further along (but still fails) for the testcase in https://github.com/pytorch/pytorch/pull/126875 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126956 Approved by: https://github.com/jansel ghstack dependencies: #126923	2024-06-01 15:22:45 +00:00
Jane Xu	4129c3e596	Let us find out why we wrote foreach meta regs (#127623 ) Turns out it was for no reason!...well, after realizing that these ops are all CompositeExplicit, their meta impls come for free. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127623 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #127412	2024-06-01 13:58:18 +00:00
Jane Xu	ac60bdaf01	Allow slow foreach to run for any backend, not just CPU (#127412 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127412 Approved by: https://github.com/albanD	2024-06-01 13:58:18 +00:00
Animesh Jain	4aa7a1efcf	[dynamo] Initial exception handling support (#126923 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126923 Approved by: https://github.com/williamwen42, https://github.com/jansel	2024-06-01 13:00:32 +00:00
Bin Bao	25994a7ed1	[AOTI] Fix a bug when mutated buffer meets .to (#127671 ) Summary: Before this change, the added unit test will trigger: `AssertionError: Can not find the original value for L__self____tensor_constant0_cuda0`. The reason is GraphLowering.constant_name could rename a constant with a device suffix but AOTI requires that new name being registered properly. Differential Revision: [D58047165](https://our.internmc.facebook.com/intern/diff/D58047165) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127671 Approved by: https://github.com/ColinPeppler, https://github.com/22quinn	2024-06-01 12:30:56 +00:00
haozhe.zhu	c3be459f26	[inductor] fix mkldnn linear binary fusion check ut (#127296 ) In this PR: (1) Fix the unary fusion for bf16 conv/linear. Previously we registered the same fusion pattern for `bf16` and `fp16`, and we did not check the dtype while matching the pattern. As a result, the `fp16` case matched the `bf16` pattern, but in the later replacement we found an unexpected float16 and did not fuse. We fix it by checking dtypes so that the `fp16` case no longer matches the `bf16` pattern. ``` def _is_valid_computation_unary_fusion(computation_op, lowp_dtype=None): def fn(match): matched = _is_single_computation_op(computation_op, lowp_dtype)(match) # previously we do not check lowp_dtype here ``` This was not exposed before because we only checked the match count, which was correct anyway because the pattern did match. To address this, we add a check on the number of `generated_kernel`. If the op is not fused, there will be an additional kernel to compute the post op. (2) Previously the UT ``` python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_binary ``` did not check the fusion status; fix it in this PR. (3) Extend `test_conv_binary` to test with lp. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127296 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel	2024-06-01 11:10:29 +00:00
Shan19900305	e62925930f	Clear dest impl extra_meta_ info when shallow_copy_from src impl to dest impl. (#127616 ) tensorA.data = tensorB calls the shallow_copy_from function to copy tensorB's metadata and storage into tensorA. If tensorB's extra_meta_ is nullptr, tensorA keeps its old extra_meta_, which contaminates the new metadata in tensorA. @ezyang @bdhirsh Pull Request resolved: https://github.com/pytorch/pytorch/pull/127616 Approved by: https://github.com/ezyang	2024-06-01 06:54:32 +00:00
Alex Baden	554265d450	[Inductor]: Use new device-agnostic libdevice import from triton.language (#127348 ) Triton refactored `libdevice` in `5e6952d8c5`. While both imports still appear to work under CUDA, this change is required to pull the correct libdevice variants under the Intel XPU backend. I am working on developing a test that catches this behavior. The easiest path would be to enable `test/inductor/test_triton_kernels.py` under the XPU backend, but a different group at Intel manages that test and I need to see if they already have an enabling plan. I am not sure the double `libdevice` import (see line 22 where I have the nolint flag) is really necessary but have yet to find a conclusive test case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127348 Approved by: https://github.com/etaf, https://github.com/peterbell10	2024-06-01 06:15:33 +00:00
Huy Do	7ef7c265d4	Ack codecvt_utf8_utf16 as a deprecated func in C++17 (#127659 ) https://en.cppreference.com/w/cpp/header/codecvt. This starts to fail on MacOS after migrating it to MacOS 14 with a newer toolchain. For example `57baae9c9b`. As there is no clear alternative to the deprecated function yet, I just ack the warning to fix the build and complete the migration https://github.com/pytorch/pytorch/issues/127490 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127659 Approved by: https://github.com/kit1980, https://github.com/atalman	2024-06-01 04:31:39 +00:00
a-gardner1	3c1cf03fde	Add fake impl for aten.unique_dim (#126561 ) Follow-up to #113118 and #124306. Developed in coordination with the solution to https://github.com/microsoft/onnxscript/pull/1547 This PR adds the missing fake tensor implementation for `aten.unique_dim`, thus enabling tracing and compilation of `torch.unique` when `dim` is not None. Local testing has proceeded with the following simple script (provided that one has checked out the changes in https://github.com/microsoft/onnxscript/pull/1547): ```python import onnx import onnxruntime as ort import logging import numpy as np onnx_program = torch.onnx.dynamo_export( lambda x: torch.unique(x, dim=0, return_inverse=True), torch.arange(10), export_options=torch.onnx.ExportOptions( dynamic_shapes=True, diagnostic_options=torch.onnx.DiagnosticOptions( verbosity_level=logging.DEBUG))) onnx_program.save("torch_unique.onnx") onnx_inputs = onnx_program.adapt_torch_inputs_to_onnx(torch.arange(10)) onnx_outputs = onnx_program(*onnx_inputs) loaded_onnx_program = onnx.load("torch_unique.onnx") onnx.checker.check_model(loaded_onnx_program) ort_session = ort.InferenceSession("torch_unique.onnx") inputs = np.random.randint(0, 10, 10) print(f"Inputs: {inputs}") outputs = ort_session.run(None, { "l_x_": inputs }) print(f"Outputs: {outputs}") print("Success") ``` Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126561 Approved by: https://github.com/ezyang	2024-06-01 04:03:10 +00:00
Wang, Eikan	25447ba241	Always Link libtorch and libtorch_cpu to ensure the functionality for AOT mode (#127381 ) Fix #126763: The root cause is that the produced library does not link any torch library because the vec ISA is invalid, and then it cannot run into another path without linking `libtorch` and `libtorch_cpu`. https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codecache.py#L1637-L1642 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127381 Approved by: https://github.com/desertfire	2024-06-01 01:47:41 +00:00
Masaki Kozuki	df53cc7114	[reland] "[reland] `_foreach_copy` with different src/dst dtypes" (#127186 ) Fixes #115171 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127186 Approved by: https://github.com/ezyang	2024-06-01 01:25:10 +00:00
Huamin Li	ff8042bcfb	Enable AOTI shim v2 build and add into libtorch (#125211 ) Summary: Follow up of https://github.com/pytorch/pytorch/pull/125087 This diff will create shim v2 header and cpp file and corresponding build Differential Revision: D56617546 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125211 Approved by: https://github.com/desertfire	2024-05-31 23:56:11 +00:00
Zain Rizvi	a8c9b26534	[BE] Fix dependabot security errors (#127567 ) Fixes https://github.com/pytorch/pytorch/security/dependabot/36 and https://github.com/pytorch/pytorch/security/dependabot/37 by deleting spurious dependency Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127567 Approved by: https://github.com/malfet	2024-05-31 23:00:07 +00:00
Yanbo Liang	f7171313ab	[Inductor] FlexAttention backward kernel optimization (#127208 ) BWD Speedups (before this PR): ``` \| Type \| Speedup \| shape \| score_mod \| dtype \| \|---------\|-----------\|-------------------\|---------------\|----------------\| \| Average \| 0.211 \| \| \| \| \| Max \| 0.364 \| (16, 16, 512, 64) \| relative_bias \| torch.bfloat16 \| \| Min \| 0.044 \| (2, 16, 4096, 64) \| causal_mask \| torch.bfloat16 \| ``` BWD Speedups (after this PR, though not optimizing block size yet): ``` \| Type \| Speedup \| shape \| score_mod \| dtype \| \|---------\|-----------\|--------------------\|---------------\|----------------\| \| Average \| 0.484 \| \| \| \| \| Max \| 0.626 \| (2, 16, 512, 256) \| head_bias \| torch.bfloat16 \| \| Min \| 0.355 \| (8, 16, 4096, 128) \| relative_bias \| torch.bfloat16 \| ``` There are a few things need to do as follow-ups: * Optimized default block size on A100/H100. * Support different seqlen for Q and K/V. * Support dynamic shapes for backward. * Enhance unit tests to check there is no ```nan``` value in any grad. I think we should make some changes to ```test_padded_dense_causal``` because it has invalid inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127208 Approved by: https://github.com/Chillee	2024-05-31 22:56:10 +00:00
Huy Do	57baae9c9b	Migrating CI/CD jobs to macOS 14 (#127582 ) We have half the fleet on macOS 14 already and it has been running fine so far https://github.com/pytorch/pytorch/issues/127490. So, I'm preparing the final push to replace the rest of them. This also switches release build from 13 to 14 (GitHub runners) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127582 Approved by: https://github.com/atalman	2024-05-31 22:30:59 +00:00
Zain Rizvi	02248b73eb	[EZ] Port over all test-infra scale configs to lf runners (#127645 ) Follow up to https://github.com/pytorch/pytorch/pull/127578 Since GPU builds seem to be working correctly, porting over all remaining scale configs from [the org-wide scale config file](https://github.com/pytorch/test-infra/blob/main/.github/scale-config.yml) The naming convention here is all temporary. We'll figure out something better before completing the migration Pull Request resolved: https://github.com/pytorch/pytorch/pull/127645 Approved by: https://github.com/malfet	2024-05-31 22:24:41 +00:00
Lucas Pasqualin	bb1468d506	Updates state dict in state dict loader (#127617 ) Fixes #125096 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127617 Approved by: https://github.com/Skylion007, https://github.com/fegin	2024-05-31 21:59:10 +00:00
David Berard	f33beb767d	[NestedTensor] Use maybe_mark_dynamic instead of mark_dynamic (#127453 ) Fixes #127097 TL;DR: dimensions marked with mark_dynamic can result in assertion failures if the marked-dynamic dimensions get specialized. In NJT, we don't care _that_ much that a dimension is marked as dynamic. So instead, mark with `maybe_mark_dynamic` which suggests that a dimension should be dynamic, but doesn't fail if the dimension gets specialized. Background: NJT marks the values tensor as dynamic: `49ad90349d/torch/nested/_internal/nested_tensor.py (L122)` It does this for two reasons: 1. Conceptual: We know that this dimension _should_ be dynamic; it's a nested tensor, so the sequence lengths will _probably_ vary between batches in the common case. Therefore, we should compile it as dynamic to prevent needing a recompile to trigger automatic dynamic shapes. 2. Implementation detail: Right now we run into issues with torch.compile / tensor_unflatten / other details when the dimensions are not marked as dynamic. We have some attempts to remove this (e.g. https://github.com/pytorch/pytorch/pull/126563) but while testing this I wasn't able to get all tests to pass, so there could be potential regressions here if we removed the mark_dynamic. Justification for this change 1. Conceptual: AFAIK, we don't care enough about the dynamism of this dimension to error out if we specialize. We'd prefer that we don't have to recompile to get automatic dynamic shapes, but it's also better to not have this issue (and not to force the user to go hunt down all the other equivalent shapes to mark them as dynamic as well). This solution allows us to suggest the dynamism but not force it. 2. Implementation detail: This still marks the dimension as symbolic at the beginning of dynamo tracing, so we will (probably) avoid a lot of the issues we run into when we completely remove the `mark_dynamic` decorators. 
Differential Revision: [D57933779](https://our.internmc.facebook.com/intern/diff/D57933779) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127453 Approved by: https://github.com/soulitzer, https://github.com/YuqingJ	2024-05-31 21:32:12 +00:00
albanD	6bfc6e0875	Add back private function torch.cuda.amp.autocast_mode._cast (#127433 ) This is unfortunately used in a few places in the wild: https://github.com/search?q=torch.cuda.amp.autocast_mode._cast&type=code Pull Request resolved: https://github.com/pytorch/pytorch/pull/127433 Approved by: https://github.com/zou3519, https://github.com/guangyey	2024-05-31 20:48:15 +00:00
drisspg	923edef31c	FP8 rowwise scaling (#125204 ) # Summary This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met: - `x`'s scale should be a 1-dimensional tensor of length `M`. - `y`'s scale should be a 1-dimensional tensor of length `N`. It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row". The following two PRs were required to enable local builds: - [PR #126185](https://github.com/pytorch/pytorch/pull/126185) - [PR #125523](https://github.com/pytorch/pytorch/pull/125523) ### Todo We still do not build our Python wheels with this architecture. @ptrblck @malfet, should we replace `sm_90` with `sm_90a`? The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit: https://github.com/pytorch/pytorch/pull/125204/files#r1586986954 #### ifdef I tried to use : `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \ defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this.. so I am not really sure of the right way to do this Kernel Credit: @jwfromm Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204 Approved by: https://github.com/lw	2024-05-31 20:09:08 +00:00
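The shape conditions in the commit above can be illustrated with a pure-Python reference. This is a sketch of the arithmetic only, not the CUDA kernel: `rowwise_scaled_matmul` is a hypothetical name, and treating the scales as multipliers applied to the accumulated product (`scale_x[i] * scale_y[j]`) is an assumption about the semantics rather than something the commit message states.

```python
def rowwise_scaled_matmul(x, y, scale_x, scale_y):
    """Reference matmul for x of shape [M, K] and y of shape [K, N].

    scale_x must have length M (one scale per row of x) and
    scale_y must have length N (one scale per column of y).
    """
    m, k = len(x), len(x[0])
    n = len(y[0])
    if len(y) != k:
        raise ValueError("inner dimensions of x and y do not match")
    if len(scale_x) != m:
        raise ValueError("scale_x must have length M")
    if len(scale_y) != n:
        raise ValueError("scale_y must have length N")

    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(x[i][p] * y[p][j] for p in range(k))
            # Assumed semantics: scales are applied to the accumulated result.
            out[i][j] = scale_x[i] * scale_y[j] * acc
    return out
```

Because each output element picks up exactly one factor from `scale_x` and one from `scale_y`, the scaling can be applied after accumulation instead of to every input element.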
Edward Z. Yang	806e6257f3	Unconditionally assign symbolic shapes as locals (#127486 ) Internal xref: https://fb.workplace.com/groups/1405155842844877/posts/8493858177307906 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127486 Approved by: https://github.com/albanD	2024-05-31 20:01:44 +00:00
PyTorch MergeBot	033e733021	Revert "[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 )" This reverts commit 749a132fb0a8325cbad4734a563aa459ca611991. Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))	2024-05-31 19:47:24 +00:00
Robert Mast	ea13e9a097	correct BLAS input (#126200 ) Fixes #32407 With this little correction to Dependencies.cmake it is possible to build an MKL-free version of Pytorch up from version v2.0.0 by explicitly choosing another MKL-free BLAS. This pullrequest fulfills the "if not already present" part of the original comment in Dependencies.cmake: "setting default preferred BLAS options if not already present." It's tested with this Action-.yml: ``` name: Build PyTorch v2.0.0 without AVX on: push: branches: - v2.0.0 pull_request: branches: - v2.0.0 jobs: build: runs-on: ubuntu-20.04 defaults: run: shell: bash -el {0} steps: - name: Checkout repository uses: actions/checkout@v4 with: #repository: 'pytorch/pytorch' #ref: 'v2.3.0' submodules: 'recursive' - uses: conda-incubator/setup-miniconda@v3 with: auto-activate-base: true activate-environment: true python-version: 3.10.13 - name: Install Dependencies - Common - Linux 2 run: \| conda info conda list conda install nomkl conda install astunparse numpy ninja pyyaml setuptools cmake cffi typing_extensions future six requests dataclasses export PYTORCH_CPU_CAPABILITY=cpu export ATEN_CPU_CAPABILITY_DEFAULT=cpu export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"} export ATEN_CPU_CAPABILITY=default export USE_NNPACK=0 export MAX_JOBS=4 export USE_CUDA=0 export USE_ROCM=0 export BLAS=OpenBLAS export CMAKE_ARGS="-D CMAKE_BUILD_TYPE=Release -D USE_AVX=OFF -D USE_NNPACK=OFF -D C_HAS_AVX_2=OFF -D C_HAS_AVX2_2=OFF -D CXX_HAS_AVX_2=OFF -D CXX_HAS_AVX2_2=OFF -D CAFFE2_COMPILER_SUPPORTS_AVX512_EXTENSIONS=OFF -DPYTHON_INCLUDE_DIR=$(python -c "import sysconfig; print(sysconfig.get_path('include'))") -DPYTHON_LIBRARY=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))") -DPYTHON_EXECUTABLE:FILEPATH=`which python`" pip install build wheel typing_extensions python setup.py bdist_wheel - name: Archive production artifacts uses: actions/upload-artifact@v4 with: name: dist-without-markdown path: \| 
dist !dist/*/.md ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126200 Approved by: https://github.com/jgong5, https://github.com/kit1980	2024-05-31 19:38:42 +00:00
PyTorch MergeBot	bbf892dd58	Revert "Add back private function torch.cuda.amp.autocast_mode._cast (#127433 )" This reverts commit 6e0eeecc7cd4dc389683e35d1f2e34738e09e597. Reverted https://github.com/pytorch/pytorch/pull/127433 on behalf of https://github.com/fbgheith due to depends on https://github.com/pytorch/pytorch/pull/126898 which is failing internally and needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/127433#issuecomment-2142869610))	2024-05-31 19:35:15 +00:00
Bin Bao	1103444870	[AOTI] Add back include_pytorch for specifying link paths (#126802 ) Summary: Running the dashboard with the cpp wrapper mode sometimes hits errors like "undefined symbol: aoti_torch_empty_stride", although it cannot be reproduced locally and seems to happen only on the dashboard CI. Differential Revision: [D57911442](https://our.internmc.facebook.com/intern/diff/D57911442) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126802 Approved by: https://github.com/chenyang78 ghstack dependencies: #126916, #127037	2024-05-31 19:32:52 +00:00
Brian Hirsh	8af1c655e5	improve eager overhead of _disable_dynamo (#127325 ) it seems like `_disable_dynamo` actually has a fair amount of overhead (especially when it was added to `DTensor.__new__`): this change speeds up @wanchaol 's repro from 0.380 -> 0.312s: P1378202570 (that repro runs a vanilla MLP using 2D parallelism, and calls the DTensor constructor 1280 times). It looks like most of the slowdown is in the fact that we are repeatedly running `import torch._dynamo` and constructing an instance of `torch._dynamo.disable(fn, recursive)` on every call to the constructor - this PR caches it on the first invocation. ~~Update: I realized I cannot use `torch.compiler.is_compiling` to know when to fast-path, because when we hit a graph break, cpython will be running so it will return False.~~ ~~As a test / potential fix, I added a new config, `torch._dynamo.config._is_compiling` that is set to True always inside a compiled region (even on frames that are run by cpython). This definitely seems to do what I want in terms of knowing when to fastpath and avoid overhead - although interested in feedback on how reasonable this is~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127325 Approved by: https://github.com/wanchaol, https://github.com/anijain2305	2024-05-31 19:30:47 +00:00
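The "cache it on the first invocation" approach described above can be sketched without PyTorch. This is not the actual `_disable_dynamo` code: `expensive_wrap` is a hypothetical stand-in for the costly import-and-wrap step, and `stats` exists only to count how often it runs.

```python
import functools

# Counts how many times the expensive wrapper is built (illustration only).
stats = {"builds": 0}


def expensive_wrap(fn):
    """Hypothetical stand-in for a costly import plus wrapper construction."""
    stats["builds"] += 1

    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        return fn(*args, **kwargs)

    return wrapped


def lazily_wrapped(fn):
    """Build the expensive wrapper on the first call, then reuse it."""
    cached = None

    @functools.wraps(fn)
    def inner(*args, **kwargs):
        nonlocal cached
        if cached is None:
            cached = expensive_wrap(fn)  # paid once, not on every call
        return cached(*args, **kwargs)

    return inner
```

Building the wrapper lazily, rather than at decoration time, keeps the costly step off the import path while still paying for it only once per decorated function.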
Kwanghoon An	b704c7cf0f	Re trying Support min/max carry over for eager mode from_float method (#127576 ) Summary: Original commit changeset: 2605900516c8 Original Phabricator Diff: D57977896 Test Plan: Re enabling due to prod failure Reviewed By: jerryzh168 Differential Revision: D57978925 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127576 Approved by: https://github.com/jerryzh168	2024-05-31 19:08:07 +00:00
Catherine Lee	121c55d8d1	Old branch deletion script to also delete old ciflow tags (#127625 ) Change branch deletion script to also delete left over ciflow tags that the bot doesn't get to, as well as the one created by triggering a workflow on HUD Example run https://github.com/pytorch/pytorch/actions/runs/9322082915/job/25662376463?pr=127625 (didn't actually delete the tag, but lists what tags it would delete) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127625 Approved by: https://github.com/huydhn	2024-05-31 18:54:54 +00:00
Yanbo Liang	0be06b08fc	[GPT-fast benchmark] Merge GPT-fast and micro benchmark output as one CSV file (#127586 ) Consolidate GPT-fast models benchmark with micro-benchmark, and save output as one CSV file with the same format as https://github.com/pytorch/pytorch/pull/126754#issue-2307296847. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127586 Approved by: https://github.com/Chillee	2024-05-31 18:50:49 +00:00
Svetlana Karslioglu	4a0d96e496	Add a GH action to autolabel docathon PRs (#127569 ) To ease oncall burden for the docathon PR reviewers and ensure all PRs are correctly labeled, adding this GH action that will look for the issue number in the PR and if that issue has a docathon-h1-2024 label, then it would propagate the labels from the issues into the PR. It should not conflict with the existing labelers because we use ``pull_request.add_to_labels`` - credit @kit1980. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127569 Approved by: https://github.com/kit1980	2024-05-31 17:57:07 +00:00
angelayi	b2f5fd8efb	[ts_converter] Basic support for prim::If conversion (#127336 ) Script module: ``` graph(%self : __torch__.M, %x.1 : Tensor, %y.1 : Tensor): %11 : int = prim::Constant[value=1]() %5 : bool = aten::Bool(%x.1) # /data/users/angelayi/pytorch2/test/export/test_converter.py:27:19 %21 : Tensor = prim::If(%5) # /data/users/angelayi/pytorch2/test/export/test_converter.py:27:16 block0(): %8 : Tensor = aten::mul(%y.1, %y.1) # /data/users/angelayi/pytorch2/test/export/test_converter.py:28:27 -> (%8) block1(): %12 : Tensor = aten::add(%y.1, %y.1, %11) # /data/users/angelayi/pytorch2/test/export/test_converter.py:30:27 -> (%12) return (%21) ``` ExportedProgram: ``` ExportedProgram: class GraphModule(torch.nn.Module): def forward(self, x_1: "b8[]", y_1: "i64[]"): # File: <eval_with_key>.23:9 in forward, code: cond = torch.ops.higher_order.cond(l_args_0_, cond_true_0, cond_false_0, [l_args_3_0_]); l_args_0_ = cond_true_0 = cond_false_0 = l_args_3_0_ = None true_graph_0 = self.true_graph_0 false_graph_0 = self.false_graph_0 conditional = torch.ops.higher_order.cond(x_1, true_graph_0, false_graph_0, [y_1]); x_1 = true_graph_0 = false_graph_0 = y_1 = None return (conditional,) class <lambda>(torch.nn.Module): def forward(self, y_1: "i64[]"): # File: <eval_with_key>.20:6 in forward, code: mul_tensor = torch.ops.aten.mul.Tensor(l_args_3_0__1, l_args_3_0__1); l_args_3_0__1 = None mul: "i64[]" = torch.ops.aten.mul.Tensor(y_1, y_1); y_1 = None return mul class <lambda>(torch.nn.Module): def forward(self, y_1: "i64[]"): # File: <eval_with_key>.21:6 in forward, code: add_tensor = torch.ops.aten.add.Tensor(l_args_3_0__1, l_args_3_0__1, alpha = 1); l_args_3_0__1 = None add: "i64[]" = torch.ops.aten.add.Tensor(y_1, y_1); y_1 = None return add ``` This PR also adds support for TupleIndex and incorporates some changes from https://github.com/pytorch/pytorch/pull/127341 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127336 Approved by: 
https://github.com/BoyuanFeng	2024-05-31 17:46:16 +00:00
cyy	3e66052e16	Improve python3 discovery code in CMake (#127600 ) The improvement is based on my comments in #124613 and it also fixes the current linux-s390x-binary-manywheel CI failures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127600 Approved by: https://github.com/Skylion007	2024-05-31 17:29:06 +00:00
Wang, Eikan	8d7393cb5e	Update triton-xpu commit pin merge rules for XPU (#127203 ) Add the ".ci/docker/ci_commit_pins/triton-xpu.txt" to the XPU merge rules. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127203 Approved by: https://github.com/atalman	2024-05-31 17:19:19 +00:00
Iris Z	1699edaabb	[DeviceMesh] Adding nD slicing support back (#127465 ) Fixes #126530 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465 Approved by: https://github.com/wconstab, https://github.com/wanchaol	2024-05-31 17:06:36 +00:00
Aaron Gokaslan	8bf2c0a203	[BE][Ez]: Update ruff to 0.4.6 (#127614 ) Update ruff linter to 0.4.6. Uneventful PR that fixes bugs and reduces false positives. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127614 Approved by: https://github.com/albanD	2024-05-31 17:01:50 +00:00
PyTorch MergeBot	58b461d57a	Revert "[ROCm] Update triton pin to fix libtanh issue (#125396 )" This reverts commit 19333d1eb9b8965edd6c8a52fd59b5c67b4fb523. Reverted https://github.com/pytorch/pytorch/pull/125396 on behalf of https://github.com/atalman due to Broke nightly builds ([comment](https://github.com/pytorch/pytorch/pull/125396#issuecomment-2142638237))	2024-05-31 16:51:39 +00:00
Jason Ansel	225ec08e35	Fix typo in .ci/docker/ubuntu-cuda/Dockerfile (#127503 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127503 Approved by: https://github.com/nWEIdia, https://github.com/Skylion007	2024-05-31 16:50:35 +00:00
Wei Wang	67f0807042	[Inductor] [CI] [CUDA] Skip the failed models and tests the better way (#127150 ) Address subtasks in https://github.com/pytorch/pytorch/issues/126692 After enabling the disabled shards, the following two models regressed (for cu124 configuration): dynamic_inductor_timm_training.csv cspdarknet53,pass,7 (expected) \| cspdarknet53,fail_accuracy,7 (actual) eca_botnext26ts_256,pass,7 (expected) \| eca_botnext26ts_256,fail_accuracy,7 (actual) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127150 Approved by: https://github.com/huydhn, https://github.com/eqy, https://github.com/atalman	2024-05-31 16:35:57 +00:00
Chien-Chin Huang	64c581a1d4	[DSD] Make distributed state_dict support torch.distributed is not initialized case (#127385 ) Fixes https://github.com/pytorch/pytorch/issues/124942 Summary: Allow DSD to load a regular optimizer state_dict and to be used when torch.distributed.is_initialized() is False. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127385 Approved by: https://github.com/wz337 ghstack dependencies: #127070, #127071, #127384	2024-05-31 16:28:16 +00:00
Chien-Chin Huang	8b4ad3a8d9	[DSD] Unify the API signatures of set_model_state_dict and set_optimizer_state_dict (#127384 ) Summary: Allow the optim_state_dict argument to be a positional argument. This makes sense since it is a required argument, and it makes the function signature consistent with set_model_state_dict without causing BC issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127384 Approved by: https://github.com/wz337 ghstack dependencies: #127070, #127071	2024-05-31 16:24:51 +00:00
Chien-Chin Huang	bd868eeb28	[DSD] Support flattening the optimizer state_dict when saving and unflattening when loading (#127071 ) Fixes https://github.com/pytorch/pytorch/issues/126595 What does this PR do? This PR flattens the optimizer state_dict when saving and unflattens it when loading, similar to what TorchRec does. The current `get_optimizer_state_dict()` converts the parameter IDs to FQNs in order to avoid any conflict with different optimizers on different ranks. The currently returned optimizer state_dict looks like the following: ``` { "state": { "layer1.weight": {"step": 10, "exp_avg": SomeTensor, "exp_avg_sq": SomeTensor}, "layer2.weight": {"step": 10, "exp_avg": SomeTensor, "exp_avg_sq": SomeTensor}, }, "param_group": [ {"lr": 0.0, "betas": (0.9, 0.95), ..., "params": ["layer1.weight", "layer2.weight"]} ] } ``` While this avoids the conflict and supports the use case of merging multiple optimizers (e.g., optimizer in backward), the current optimizer state_dict still cannot support MPMD (e.g., pipeline parallelism). The root cause is `param_group`. `param_group` cannot generate unique keys during saving: DCP will flatten the dict, but for `param_group` it will get keys like `param_group.lr` or `param_group.params`. These keys will conflict when using pipeline parallelism. This PR flattens the optimizer state_dict into the following form: ``` { "state.layer1.weight.step": 10, "state.layer2.weight.step": 10, "state.layer1.weight.exp_avg": SomeTensor, "state.layer2.weight.exp_avg": SomeTensor, "state.layer1.weight.exp_avg_sq": SomeTensor, "state.layer2.weight.exp_avg_sq": SomeTensor, "param_group.layer1.weight.lr" : 0.1, "param_group.layer2.weight.lr" : 0.1, "param_group.layer1.weight.betas" : (0.9, 0.95), "param_group.layer2.weight.betas" : (0.9, 0.95), } ``` This allows distributed state_dict (DSD) to support MPMD (e.g., pipeline parallelism). Pros and Cons Pros 1. Can support optimizer resharding (e.g., changing the parallelisms from 3D to 2D or changing the number of workers). 2. Users don't need to manually add a prefix for each optimizer. 3. Allows users to merge the optimizer states easily. One use case is loop-based pipeline parallelism. Cons 1. The implementation makes a strong assumption about the structure of `param_groups` and its values. If the assumption changes or some customized optimizers do not meet it, the implementation will break. 2. There will be extra values saved in the checkpoints. The assumption here is that `param_group` generally contains scalars, which are cheap to save. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127071 Approved by: https://github.com/wconstab, https://github.com/wz337 ghstack dependencies: #127070	2024-05-31 16:20:36 +00:00
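The flattening described in #127071 can be illustrated with a small pure-Python sketch. This is not the DSD implementation: the helper name `flatten_optim_state_dict` is hypothetical, plain numbers stand in for tensors, and the `"param_group"` key simply follows the example in the commit message.

```python
from typing import Any, Dict


def flatten_optim_state_dict(osd: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: flatten a nested optimizer state_dict into unique FQN-based keys."""
    flat: Dict[str, Any] = {}
    # Per-parameter state becomes "state.<fqn>.<name>".
    for fqn, state in osd["state"].items():
        for name, value in state.items():
            flat[f"state.{fqn}.{name}"] = value
    # Each hyperparameter is duplicated per parameter as "param_group.<fqn>.<name>",
    # so keys stay unique even when several optimizers or pipeline stages are merged.
    for group in osd["param_group"]:
        hparams = {k: v for k, v in group.items() if k != "params"}
        for fqn in group["params"]:
            for name, value in hparams.items():
                flat[f"param_group.{fqn}.{name}"] = value
    return flat


nested = {
    "state": {
        "layer1.weight": {"step": 10},
        "layer2.weight": {"step": 10},
    },
    "param_group": [
        {"lr": 0.1, "betas": (0.9, 0.95), "params": ["layer1.weight", "layer2.weight"]}
    ],
}
flat = flatten_optim_state_dict(nested)
for key in sorted(flat):
    print(key, flat[key])
```

Duplicating the hyperparameters per parameter is what produces the "extra values saved in the checkpoints" listed under Cons: every scalar in a group is stored once per parameter rather than once per group.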
Chien-Chin Huang	6b1b8d0193	[DSD] Remove the support of Dict[nn.Module, Dict[str, Any]] state_dict (#127070 ) Summary: This is a very complicated signature that is hard for users to reason about. Remove support for this feature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127070 Approved by: https://github.com/wz337	2024-05-31 16:16:05 +00:00
Zain Huda	a010fa9e24	[DCP] Fix variable spelling (#127565 ) Summary: tsia Test Plan: sandcastle Differential Revision: D57983752 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127565 Approved by: https://github.com/wz337, https://github.com/fegin	2024-05-31 15:32:08 +00:00
xinan.lin	75e7588f47	[Inductor UT] Fix expected failure but pass for test case on Intel GPU. (#127595 ) The XPU expected-failure test case `TritonCodeGenTests.test_codegen_config_option_dont_assume_alignment` should have been marked as expected to pass after PR #126261 merged, but because the test was flaky, the case was skipped when landing that PR. The "expected failure but passed" error then surfaced in the periodic test: https://github.com/pytorch/pytorch/actions/runs/9302864965/job/25605549183#step:14:2082. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127595 Approved by: https://github.com/EikanWang, https://github.com/chuanqi129, https://github.com/peterbell10, https://github.com/atalman	2024-05-31 15:32:00 +00:00
Mikayla Gawarecki	4644def434	Update docstring for weights_only (#127575 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127575 Approved by: https://github.com/malfet	2024-05-31 14:27:31 +00:00
Feny Patel	cddb8dbebe	add workloadd events to pytorch (#127415 ) Summary: add workloadd events to pytorch Test Plan: CIs Differential Revision: D57914472 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127415 Approved by: https://github.com/sraikund16	2024-05-31 14:25:44 +00:00
Bin Bao	10a92b5f84	[AOTI] Fix a bool value codegen issue when calling custom ops (#127398 ) Summary: fixes https://github.com/pytorch/pytorch/issues/127392 Differential Revision: [D57911527](https://our.internmc.facebook.com/intern/diff/D57911527) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127398 Approved by: https://github.com/angelayi, https://github.com/chenyang78 ghstack dependencies: #126916, #127037	2024-05-31 14:01:36 +00:00
Bin Bao	17c5b6508b	[AOTI] Support _CollectiveKernel in the cpp wrapper mode (#127037 ) Summary: _CollectiveKernel appears in TorchBench moco training. It's a special Fallback op that requires extra care. Differential Revision: [D57911441](https://our.internmc.facebook.com/intern/diff/D57911441) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127037 Approved by: https://github.com/malfet ghstack dependencies: #126916	2024-05-31 13:58:50 +00:00
Bin Bao	413b81789f	[AOTI][refactor] Unify val_to_arg_str and val_to_cpp_arg_str (#126916 ) Summary: Now fallback argument type information has been passed, so time to unify val_to_arg_str and val_to_cpp_arg_str Differential Revision: [D57907751](https://our.internmc.facebook.com/intern/diff/D57907751) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126916 Approved by: https://github.com/chenyang78	2024-05-31 13:56:11 +00:00
Edward Z. Yang	aaef7b29e9	Only register _inductor_test ops if not running with deploy (#127557 ) Internal xref: https://fb.workplace.com/groups/1405155842844877/posts/8498194410207616 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127557 Approved by: https://github.com/zou3519	2024-05-31 13:34:23 +00:00
PyTorch MergeBot	029b3ec775	Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 )" This reverts commit dae33a4961addb5847dbb362e7bb907bbfc64929. Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/PaliC due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/126068#issuecomment-2141992307))	2024-05-31 12:33:25 +00:00
cyy	a6bae1f6db	Remove more caffe2 files (#127511 ) Remove more caffe2 files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127511 Approved by: https://github.com/r-barnes	2024-05-31 11:26:27 +00:00
IvanKobzarev	df0c69f32d	[inductor] Add fallback for collectives size estimation for unbacked (#127562 ) Differential Revision: [D57982928](https://our.internmc.facebook.com/intern/diff/D57982928) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127562 Approved by: https://github.com/yifuwang	2024-05-31 11:15:46 +00:00
Animesh Jain	f4d7cdc5e6	[dynamo] Add current instruction to BlockStackEntry (#127482 ) Will be used by exception handling in later PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127482 Approved by: https://github.com/jansel	2024-05-31 08:58:53 +00:00
Yueming Hao	2a03bf5a14	[inductor] fix grid z bug for large grid (#127448 ) Fixes #123210 `2f3d3ddd70/torch/_inductor/runtime/triton_heuristics.py (L1733-L1753)` If a kernel's y_grid is larger than 65535, it will be split into multiple z grids. The above grid_fn does this split before the kernel launch; however, the computations for yoffset and the y_grid are incorrect. For example, if we have xy numel of `(1*XBLOCK, 65537*YBLOCK)`, this function will return an [xyz]_grid of (1, 32768, 2). XBLOCK and YBLOCK here are used for the following `get_grid_dim`. Let's use their default values (4, 1024). `2f3d3ddd70/torch/_inductor/runtime/triton_heuristics.py (L1734)` [xyz]_grid = (1, 32768, 2) means the workload is divided into two z grids. Because the triton kernel generation still follows the xy dimensions, an example generated kernel is shown below. ```python @triton.jit def triton_(in_ptr0, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr): ynumel = 65537*1024 xnumel = 1*4 yoffset = tl.program_id(1) * (tl.program_id(2) + 1) * YBLOCK yindex = yoffset + tl.arange(0, YBLOCK)[None, :] ymask = yindex < ynumel xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = xindex < xnumel x2 = xindex y0 = yindex % 128 y1 = (yindex // 128) y3 = yindex tmp0 = tl.load(in_ptr0 + (y0 + (128*x2) + (512*y1)), xmask, eviction_policy='evict_last') tl.store(out_ptr0 + (x2 + (4*y3)), tmp0, xmask) ``` For a triton block with xyz index (0, 0, 1), its yoffset and xoffset are both 0 based on the computations `yoffset = tl.program_id(1) * (tl.program_id(2) + 1) * YBLOCK` and `xoffset = tl.program_id(0) * XBLOCK`. So, this triton block will access the very first elements of the input. However, the correct yoffset should be `(y_index + z_index * y_grid) * YBLOCK`, which is the starting position of the 2nd z grid. 
At the same time, because we used `y_grid = y_grid // div` to compute the maximum number of elements in the y dimension, the y_grid is 32768. The total number of y grids is then 32768*2 = 65536, which is less than the actual 65537 y grids. So, we should use `y_grid = ceildiv(y_grid, div)` to compute the y grid so that the remaining grids are covered. #123210 is not about AOTInductor; the root cause is the triton kernel generated by torchinductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127448 Approved by: https://github.com/eellison	2024-05-31 08:01:34 +00:00
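The arithmetic in #127448 can be checked with a small pure-Python sketch. It does not call Triton or Inductor; `split_y_grid` and `covered_blocks` are hypothetical names, and computing `div` as `ceildiv(y_blocks, 65535)` is an assumption chosen to reproduce the (1, 32768, 2) grid quoted in the commit message.

```python
from typing import Set, Tuple

MAX_Y_GRID = 65535  # y-dimension grid limit cited in the commit message


def ceildiv(a: int, b: int) -> int:
    """Integer division rounding up."""
    return -(-a // b)


def split_y_grid(y_blocks: int, round_up: bool) -> Tuple[int, int]:
    """Hypothetical helper: return (y_grid, z_grid) for y_blocks blocks along y."""
    if y_blocks <= MAX_Y_GRID:
        return y_blocks, 1
    div = ceildiv(y_blocks, MAX_Y_GRID)  # assumed way of picking the number of z grids
    y_grid = ceildiv(y_blocks, div) if round_up else y_blocks // div
    return y_grid, div


def covered_blocks(y_grid: int, z_grid: int, fixed: bool) -> Set[int]:
    """Block indices reached by every (program_id(1), program_id(2)) pair."""
    if fixed:
        # Corrected offset: y_index + z_index * y_grid
        return {y + z * y_grid for z in range(z_grid) for y in range(y_grid)}
    # Offset from the kernel quoted above: program_id(1) * (program_id(2) + 1)
    return {y * (z + 1) for z in range(z_grid) for y in range(y_grid)}


Y_BLOCKS = 65537
needed = set(range(Y_BLOCKS))
old_grid = split_y_grid(Y_BLOCKS, round_up=False)
new_grid = split_y_grid(Y_BLOCKS, round_up=True)
missed_old = needed - covered_blocks(*old_grid, fixed=False)
missed_new = needed - covered_blocks(*new_grid, fixed=True)
print("old grid:", old_grid, "blocks missed:", len(missed_old))
print("new grid:", new_grid, "blocks missed:", len(missed_new))
```

With rounding up, the grid slightly over-provisions (32769 * 2 = 65538 blocks for 65537 needed); the `ymask = yindex < ynumel` mask in the generated kernel is what keeps the surplus block from touching memory.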
titaiwangms	4935a019e4	[ONNX] Update decomposition table to core ATen ops (#127353 ) Fixes #125894 Prior to this PR, some ATen core ops were missing from the decomposition table because we thought they might be decomposed into prim ops, as they are under _refs. This PR adds them back according to `f6ef832e87/torch/_decomp/__init__.py (L253)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127353 Approved by: https://github.com/justinchuby	2024-05-31 06:35:47 +00:00
cyy	0c5faee372	Replace python::python with Python::Module (#127485 ) Use found Python::Module target Pull Request resolved: https://github.com/pytorch/pytorch/pull/127485 Approved by: https://github.com/ezyang	2024-05-31 05:57:05 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	b5e85b8ecc	Add deferred_runtime_assertion pass after run_decompositions (#127305 ) Summary: We also want to reinsert the deferred_runtime passes after run_decompositions. Test Plan: CI Reviewed By: zhxchen17 Differential Revision: D57802237 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127305 Approved by: https://github.com/BoyuanFeng	2024-05-31 05:45:28 +00:00
Zain Rizvi	ae47152ca8	Expand supported labels to most self-hosted linux pull.yml workflows (#127578 ) Initial set of runners added in https://github.com/pytorch/pytorch/pull/127566 seem to be working. Expanding to include more machine types, especially GPU machines Pull Request resolved: https://github.com/pytorch/pytorch/pull/127578 Approved by: https://github.com/huydhn	2024-05-31 05:40:16 +00:00
Simon Fan	ec098b88b6	[compiled autograd] torch.compile API (#125880 ) - enter existing compiled autograd ctx manager before entering torch.compile frames Pull Request resolved: https://github.com/pytorch/pytorch/pull/125880 Approved by: https://github.com/jansel	2024-05-31 04:38:20 +00:00
cyy	ee08cf5792	Improve MAGMA conditional macro in BatchLinearAlgebra.cpp (#127495 ) Unnecessary TORCH_CHECK(false) are changed to macro coverage as mentioned in #127371 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127495 Approved by: https://github.com/ezyang	2024-05-31 04:27:20 +00:00
Animesh Jain	159632aecd	[dynamo] Support hasattr on BuiltinVariable (#127372 ) Fixes https://github.com/pytorch/pytorch/issues/127172 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127372 Approved by: https://github.com/williamwen42, https://github.com/yanboliang ghstack dependencies: #127377	2024-05-31 04:23:56 +00:00
Animesh Jain	bb6bfd9ad8	[dynamo][compile-time] Cache the child guard managers (#127377 ) Reduces the compile time of the MobileBertForMaskedLM model from 39 seconds to 26 seconds. This was a regression introduced by #125202. Before that PR, compile time was 24 seconds. The extra two seconds are just because we are going through an enormous number of guards. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127377 Approved by: https://github.com/jansel	2024-05-31 04:23:56 +00:00
Menglu Yu	f264745ff1	[interformer] batch pointwise op + unbind stack pass in post grad (#126959 ) Summary: Tested on H100 with single GPU, and the bs is set to 64. Test Plan: # local script ``` buck2 run mode/opt scripts/jackiexu0313/pt2:uniarch_perf_benchmark -- single-module-benchmark --provider interformer --enable_pt2 True --batch_size 64 ``` baseline: P1370993922 \| Metric \| Value \| \|:-------------------\|:-------------\| \| Latency \| 120.84 ms \| \| Model size \| 5.93 G bytes \| \| Flops/example \| 62.22 GB \| \| TFLOPS \| 32.95 \| \| MFU \| 4.12% \| \| Activation/example \| 128.17 MB \| proposal: P1371676068 config ``` torch._inductor.config.pre_grad_fusion_options = {} torch._inductor.config.post_grad_fusion_options = { "batch_aten_mul": {"min_fuse_set_size": 50}, "batch_aten_sigmoid": {"min_fuse_set_size": 50}, "batch_aten_relu": {"min_fuse_set_size": 50}, "batch_linear_post_grad": {"min_fuse_set_size": 50}, "unbind_stack_aten_pass": {}, } ``` \| Metric \| Value \| \|:-------------------\|:-------------\| \| Latency \| 117.30 ms \| \| Model size \| 5.93 G bytes \| \| Flops/example \| 62.65 GB \| \| TFLOPS \| 34.18 \| \| MFU \| 4.27% \| \| Activation/example \| 163.12 MB \| Differential Revision: D57595173 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126959 Approved by: https://github.com/jackiexu1992	2024-05-31 03:54:43 +00:00
cyy	8629f9b3f2	Remove more unused variables in tests (#127510 ) Follows #127379 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127510 Approved by: https://github.com/Skylion007, https://github.com/r-barnes	2024-05-31 03:39:45 +00:00
Edward Z. Yang	0aaac68c57	Add structured logging for tensor fakeification (#126879 ) This adds dumps of MetaTensorDesc and MetaStorageDesc to structured logs when they are triggered from Dynamo. The logs look like this: ``` V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:195] {"describe_storage": {"id": 0, "describer_id": 0, "size": 32}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0} V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:220] {"describe_tensor": {"id": 0, "ndim": 1, "dtype": "torch.float32", "device": "device(type='cpu')", "size": [8], "is_leaf": true, "stride": [1], "storage": 0, "view_func": "<built-in method _view_func_unsafe of Tensor object at 0x7f882959e840>", "describer_id": 0}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0} V0522 08:13:25.268000 140224882566144 torch/_subclasses/meta_utils.py:1594] {"describe_source": {"describer_id": 0, "id": 0, "source": "L['x']"}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0} ``` The `describer_id` is used to disambiguate ids. We expect it to be unique per frame id, but if there is a bug it possibly is not. Note you will get redundant dumps when evaluation restarts. tlparse can use this to give a visualization of input tensors to a model, you could also use this to generate example inputs to run graphs on. Some care is taken to avoid redumping the tensor metadata multiple times, which would happen ordinarily because AOTAutograd refakifies everything after Dynamo, to deal with metadata mutation. Partially fixes https://github.com/pytorch/pytorch/issues/126644 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126879 Approved by: https://github.com/jamesjwu	2024-05-31 01:58:44 +00:00
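As a rough illustration of how the structured log lines quoted in #126879 could be consumed, the sketch below splits one sample line into its logging prefix and JSON payload. The `parse_structured_line` helper is hypothetical and assumes the layout shown in the commit message (prefix ending in `] `, followed by a single JSON object); tlparse itself may work differently.

```python
import json
from typing import Any, Dict, Tuple


def parse_structured_line(line: str) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical helper: split a log line into (prefix, decoded JSON payload)."""
    # The first "] " ends the "V<date> <time> <thread> <file>:<line>" prefix.
    prefix, sep, payload = line.partition("] ")
    if not sep:
        raise ValueError("no '] ' separator found")
    return prefix, json.loads(payload)


# Sample line copied from the commit message above.
SAMPLE = (
    "V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:195] "
    '{"describe_storage": {"id": 0, "describer_id": 0, "size": 32}, '
    '"frame_id": 0, "frame_compile_id": 0, "attempt": 0}'
)

prefix, record = parse_structured_line(SAMPLE)
# Each record carries exactly one "describe_*" key plus frame metadata.
kind = next(key for key in record if key.startswith("describe_"))
print(kind, record[kind], "frame", record["frame_id"])
```

Grouping records by `(frame_id, describer_id)` would then be one way to keep ids from different describers apart, in line with the note that `describer_id` disambiguates ids.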
Pian Pawakapan	b1792a622d	[pipelining] handle param aliasing (#127471 ) Adds support for parameter aliasing in pipelining. Does this by reading the state_dict, and creating a map of id -> valid tensor FQNs (to be used in _sink_params). Assigns additional FQN attributes that may be used, runs _sink_params(), and then deletes unused attributes. Shares some similarity with how export's unflattener does it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127471 Approved by: https://github.com/kwen2501	2024-05-31 01:52:57 +00:00
Shunting Zhang	d535de1747	[inductor] remove reordering_reindex (#127367 ) This fixes the loop ordering issue for avg_pool2d here (https://github.com/pytorch/pytorch/issues/126255#issuecomment-2117931529). The reason we cannot fuse the 2 kernels for avg_pool2d is ComputedBuffer.iter_reordering_reindex. Take a simpler example: ``` def f(x, y): """ Add a matmul since inductor may force layout for output. """ return (x.sum(dim=-1) + 1) @ y # Make the first 2 dimensions unable to merge on purpose so that # ComputedBuffer.iter_reordering_reindex will be updated. x = rand_strided([20, 20, 30], [30, 900, 1], device="cuda") y = torch.randn(20, 20) ``` Suppose x.sum is stored to x2. The computed buffer for x2 will remember that we have reordered its first and second dimensions (i.e. loop order [1, 0]). Later on, when we decide the loop order for x2 when computing 'x2 + 1', we pick loop order [1, 0] according to the stride analysis. We then use the saved ComputedBuffer.iter_reordering_reindex to further reorder the loops. The net effect is that we use loop order [0, 1], which prevents the pointwise kernel from fusing with the reduction kernel. I feel that we don't need ComputedBuffer.iter_reordering_reindex, and test results show that removing it has a neutral impact on the dashboard [link](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2022%20May%202024%2017%3A30%3A29%20GMT&stopTime=Wed%2C%2029%20May%202024%2017%3A30%3A29%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/shunting314/153/head&lCommit=195f42cf1a414d2d1a0422b8a081a85ff52b7d20&rBranch=main&rCommit=d6e3e89804c4063827ea21ffcd3d865e5fe365d9) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127367 Approved by: https://github.com/jansel	2024-05-31 01:36:43 +00:00
PyTorch MergeBot	7646825c3e	Revert "distributed debug handlers (#126601 )" This reverts commit 3d541835d509910fceca00fc5a916e9718c391d8. Reverted https://github.com/pytorch/pytorch/pull/126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](https://github.com/pytorch/pytorch/pull/126601#issuecomment-2141076987))	2024-05-31 01:21:24 +00:00
cyy	d44daebdbc	[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051 Approved by: https://github.com/cpuhrsch, https://github.com/malfet	2024-05-31 01:20:45 +00:00
feifan	da9fb670d2	Nadam support the flag for "maximize" (#127214 ) Fixes https://github.com/pytorch/pytorch/issues/126642 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127214 Approved by: https://github.com/janeyx99	2024-05-31 01:11:16 +00:00
PyTorch MergeBot	f6e303fa47	Revert "[DeviceMesh] Adding nD slicing support back (#127465 )" This reverts commit e72232f8f032b970b74da18200678b3a4617bf95. Reverted https://github.com/pytorch/pytorch/pull/127465 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint `e72232f8f0`, the error does not look like a trivial fix, so I revert the change for a forward fix ([comment](https://github.com/pytorch/pytorch/pull/127465#issuecomment-2141051630))	2024-05-31 00:43:13 +00:00
atalman	af5ed05416	Include triton in py3.12 binaries (#127547 ) Additional Builder PR: https://github.com/pytorch/builder/pull/1846/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127547 Approved by: https://github.com/williamwen42	2024-05-31 00:30:10 +00:00
Benson Ma	fc73d07e5e	[c10d] Decorate methods in `NCCLUtils.hpp` with `TORCH_API` (#127550 ) Summary: User-defined PyTorch modules that use `C10D_NCCL_CHECK` run into undefined symbol errors when loaded by `torch.library.load()`, because the required symbols have not been exported. This change exports the symbols needed to resolve those runtime errors. Test Plan: PyTorch CI Differential Revision: D57977944 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127550 Approved by: https://github.com/Skylion007	2024-05-31 00:17:25 +00:00
Sergii Dymchenko	a2bff4dc8c	Fix lint (#127584 ) Trivial fix after https://github.com/pytorch/pytorch/pull/124678 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127584 Approved by: https://github.com/huydhn	2024-05-31 00:00:11 +00:00
wz337	e72232f8f0	[DeviceMesh] Adding nD slicing support back (#127465 ) Fixes #126530 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465 Approved by: https://github.com/wconstab	2024-05-30 23:55:21 +00:00
Shuqiang Zhang	214dd44608	[c10d] add Work's numel to logger for debugging purposes (#127468 ) Summary: We have seen some cases where all ranks call into a collective but it gets stuck, probably due to incorrect tensor sizes. This adds the size info to logging for debugging. It also takes this chance to consolidate all logger-related status metrics into one struct. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127468 Approved by: https://github.com/wconstab	2024-05-30 23:32:33 +00:00
Scott Wolchok	620ec081ec	Extract inner loops into separate function for ARM64 fp16_dot_with_fp32_arith (#127476 ) Summary: Preparing to generalize to bf16. (This should not be committed unless the following bf16 PR is committed!) Test Plan: Spot-checked llm_experiments benchmark result to make sure it didn't regress. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127476 Approved by: https://github.com/malfet ghstack dependencies: #127435, #127451	2024-05-30 23:28:17 +00:00
Scott Wolchok	603bde1de3	Use efficient ARM fp16 dot product for gemm_transa_ general case (#127451 ) Summary: This doesn't change the overall gemm algorithm away from repeated dot products, just uses our efficient fp16 dot product developed for the gemv case. It seems to improve performance for every prompt length I tested. Test Plan: Use https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py , edited to test only the trans_b (really gemm_transa_) case for the sizes outlined in the output. Before: ``` Matrix-vector: m=8, n=128, k=1 ==================== trans_b torch.float32 1.05 usec trans_b torch.float16 0.97 usec trans_b torch.bfloat16 1.06 usec m=128, n=8, k=1 ==================== trans_b torch.float32 0.80 usec trans_b torch.float16 0.97 usec trans_b torch.bfloat16 1.00 usec m=4096, n=4096, k=1 ==================== trans_b torch.float32 2160.75 usec trans_b torch.float16 659.77 usec trans_b torch.bfloat16 3800.13 usec m=11008, n=4096, k=1 ==================== trans_b torch.float32 6343.68 usec trans_b torch.float16 1789.42 usec trans_b torch.bfloat16 10098.34 usec m=4096, n=11008, k=1 ==================== trans_b torch.float32 6217.20 usec trans_b torch.float16 1874.47 usec trans_b torch.bfloat16 10490.30 usec m=32000, n=4096, k=1 ==================== trans_b torch.float32 17934.45 usec trans_b torch.float16 5323.81 usec trans_b torch.bfloat16 29320.80 usec Matrix-matrix (prompt len 4: m=8, n=128, k=4 ==================== trans_b torch.float32 2.40 usec trans_b torch.float16 1.22 usec trans_b torch.bfloat16 1.22 usec m=128, n=8, k=4 ==================== trans_b torch.float32 1.52 usec trans_b torch.float16 1.33 usec trans_b torch.bfloat16 1.77 usec m=4096, n=4096, k=4 ==================== trans_b torch.float32 4317.09 usec trans_b torch.float16 15541.04 usec trans_b torch.bfloat16 15032.29 usec m=11008, n=4096, k=4 ==================== trans_b torch.float32 6191.19 usec trans_b torch.float16 40436.29 usec trans_b torch.bfloat16 
40626.93 usec m=4096, n=11008, k=4 ==================== trans_b torch.float32 6049.22 usec trans_b torch.float16 42367.16 usec trans_b torch.bfloat16 42482.43 usec m=32000, n=4096, k=4 ==================== trans_b torch.float32 17611.36 usec trans_b torch.float16 117368.54 usec trans_b torch.bfloat16 116958.85 usec Matrix-matrix (prompt len 8: m=8, n=128, k=8 ==================== trans_b torch.float32 1.04 usec trans_b torch.float16 1.71 usec trans_b torch.bfloat16 1.74 usec m=128, n=8, k=8 ==================== trans_b torch.float32 2.10 usec trans_b torch.float16 2.01 usec trans_b torch.bfloat16 2.91 usec m=4096, n=4096, k=8 ==================== trans_b torch.float32 2456.23 usec trans_b torch.float16 30112.76 usec trans_b torch.bfloat16 29941.58 usec m=11008, n=4096, k=8 ==================== trans_b torch.float32 6236.12 usec trans_b torch.float16 80361.22 usec trans_b torch.bfloat16 80466.64 usec m=4096, n=11008, k=8 ==================== trans_b torch.float32 6236.10 usec trans_b torch.float16 82990.74 usec trans_b torch.bfloat16 83899.80 usec m=32000, n=4096, k=8 ==================== trans_b torch.float32 17606.43 usec trans_b torch.float16 234397.38 usec trans_b torch.bfloat16 237057.29 usec Matrix-matrix (prompt len 16: m=8, n=128, k=16 ==================== trans_b torch.float32 1.31 usec trans_b torch.float16 2.67 usec trans_b torch.bfloat16 2.72 usec m=128, n=8, k=16 ==================== trans_b torch.float32 1.66 usec trans_b torch.float16 3.36 usec trans_b torch.bfloat16 5.18 usec m=4096, n=4096, k=16 ==================== trans_b torch.float32 2504.24 usec trans_b torch.float16 60896.53 usec trans_b torch.bfloat16 59852.49 usec m=11008, n=4096, k=16 ==================== trans_b torch.float32 6407.11 usec trans_b torch.float16 163294.92 usec trans_b torch.bfloat16 161199.10 usec m=4096, n=11008, k=16 ==================== trans_b torch.float32 6132.30 usec trans_b torch.float16 167244.77 usec trans_b torch.bfloat16 170064.35 usec m=32000, n=4096, k=16 
==================== trans_b torch.float32 17635.56 usec trans_b torch.float16 475020.00 usec trans_b torch.bfloat16 476332.29 usec Matrix-matrix (prompt len 32: m=8, n=128, k=32 ==================== trans_b torch.float32 1.40 usec trans_b torch.float16 4.67 usec trans_b torch.bfloat16 4.80 usec m=128, n=8, k=32 ==================== trans_b torch.float32 1.24 usec trans_b torch.float16 6.10 usec trans_b torch.bfloat16 10.03 usec m=4096, n=4096, k=32 ==================== trans_b torch.float32 2660.63 usec trans_b torch.float16 122436.04 usec trans_b torch.bfloat16 121687.96 usec m=11008, n=4096, k=32 ==================== trans_b torch.float32 6405.60 usec trans_b torch.float16 324708.42 usec trans_b torch.bfloat16 324866.67 usec m=4096, n=11008, k=32 ==================== trans_b torch.float32 6566.74 usec trans_b torch.float16 330801.04 usec trans_b torch.bfloat16 332561.79 usec m=32000, n=4096, k=32 ==================== trans_b torch.float32 18610.84 usec trans_b torch.float16 944578.75 usec trans_b torch.bfloat16 940674.33 usec Matrix-matrix (prompt len 128: m=8, n=128, k=128 ==================== trans_b torch.float32 2.48 usec trans_b torch.float16 16.43 usec trans_b torch.bfloat16 17.11 usec m=128, n=8, k=128 ==================== trans_b torch.float32 1.83 usec trans_b torch.float16 22.31 usec trans_b torch.bfloat16 37.00 usec m=4096, n=4096, k=128 ==================== trans_b torch.float32 4806.59 usec trans_b torch.float16 485338.83 usec trans_b torch.bfloat16 478835.08 usec m=11008, n=4096, k=128 ==================== trans_b torch.float32 12109.51 usec trans_b torch.float16 1300928.58 usec trans_b torch.bfloat16 1293181.63 usec m=4096, n=11008, k=128 ==================== trans_b torch.float32 11223.70 usec trans_b torch.float16 1326119.92 usec trans_b torch.bfloat16 1330395.12 usec m=32000, n=4096, k=128 ==================== trans_b torch.float32 33485.34 usec trans_b torch.float16 3869227.17 usec trans_b torch.bfloat16 3792905.00 usec ``` After: ``` 
Matrix-vector: m=8, n=128, k=1 ==================== trans_b torch.float32 0.75 usec trans_b torch.float16 0.71 usec trans_b torch.bfloat16 0.81 usec m=128, n=8, k=1 ==================== trans_b torch.float32 0.75 usec trans_b torch.float16 0.93 usec trans_b torch.bfloat16 0.98 usec m=4096, n=4096, k=1 ==================== trans_b torch.float32 2194.31 usec trans_b torch.float16 661.27 usec trans_b torch.bfloat16 3758.42 usec m=11008, n=4096, k=1 ==================== trans_b torch.float32 5792.04 usec trans_b torch.float16 1789.98 usec trans_b torch.bfloat16 10120.67 usec m=4096, n=11008, k=1 ==================== trans_b torch.float32 6101.22 usec trans_b torch.float16 1927.34 usec trans_b torch.bfloat16 10469.47 usec m=32000, n=4096, k=1 ==================== trans_b torch.float32 18353.20 usec trans_b torch.float16 5161.06 usec trans_b torch.bfloat16 29601.69 usec Matrix-matrix (prompt len 4: m=8, n=128, k=4 ==================== trans_b torch.float32 2.14 usec trans_b torch.float16 0.85 usec trans_b torch.bfloat16 1.19 usec m=128, n=8, k=4 ==================== trans_b torch.float32 1.47 usec trans_b torch.float16 1.85 usec trans_b torch.bfloat16 1.75 usec m=4096, n=4096, k=4 ==================== trans_b torch.float32 4416.40 usec trans_b torch.float16 2688.36 usec trans_b torch.bfloat16 14987.33 usec m=11008, n=4096, k=4 ==================== trans_b torch.float32 6140.24 usec trans_b torch.float16 7467.26 usec trans_b torch.bfloat16 40295.52 usec m=4096, n=11008, k=4 ==================== trans_b torch.float32 6143.10 usec trans_b torch.float16 7298.04 usec trans_b torch.bfloat16 41393.43 usec m=32000, n=4096, k=4 ==================== trans_b torch.float32 17650.72 usec trans_b torch.float16 21346.63 usec trans_b torch.bfloat16 116849.98 usec Matrix-matrix (prompt len 8: m=8, n=128, k=8 ==================== trans_b torch.float32 1.05 usec trans_b torch.float16 1.03 usec trans_b torch.bfloat16 1.69 usec m=128, n=8, k=8 ==================== trans_b torch.float32 2.05 
usec trans_b torch.float16 3.08 usec trans_b torch.bfloat16 2.95 usec m=4096, n=4096, k=8 ==================== trans_b torch.float32 2323.99 usec trans_b torch.float16 5265.45 usec trans_b torch.bfloat16 29942.40 usec m=11008, n=4096, k=8 ==================== trans_b torch.float32 6202.01 usec trans_b torch.float16 14677.90 usec trans_b torch.bfloat16 80625.18 usec m=4096, n=11008, k=8 ==================== trans_b torch.float32 6112.05 usec trans_b torch.float16 14340.52 usec trans_b torch.bfloat16 82799.99 usec m=32000, n=4096, k=8 ==================== trans_b torch.float32 17650.65 usec trans_b torch.float16 42551.43 usec trans_b torch.bfloat16 236081.08 usec Matrix-matrix (prompt len 16: m=8, n=128, k=16 ==================== trans_b torch.float32 1.26 usec trans_b torch.float16 1.34 usec trans_b torch.bfloat16 2.69 usec m=128, n=8, k=16 ==================== trans_b torch.float32 1.60 usec trans_b torch.float16 5.81 usec trans_b torch.bfloat16 5.34 usec m=4096, n=4096, k=16 ==================== trans_b torch.float32 2328.05 usec trans_b torch.float16 10526.58 usec trans_b torch.bfloat16 60028.28 usec m=11008, n=4096, k=16 ==================== trans_b torch.float32 6243.35 usec trans_b torch.float16 28505.08 usec trans_b torch.bfloat16 163670.15 usec m=4096, n=11008, k=16 ==================== trans_b torch.float32 5870.11 usec trans_b torch.float16 28597.89 usec trans_b torch.bfloat16 165404.88 usec m=32000, n=4096, k=16 ==================== trans_b torch.float32 17746.27 usec trans_b torch.float16 83393.87 usec trans_b torch.bfloat16 472313.13 usec Matrix-matrix (prompt len 32: m=8, n=128, k=32 ==================== trans_b torch.float32 1.35 usec trans_b torch.float16 2.01 usec trans_b torch.bfloat16 4.68 usec m=128, n=8, k=32 ==================== trans_b torch.float32 1.19 usec trans_b torch.float16 10.98 usec trans_b torch.bfloat16 10.13 usec m=4096, n=4096, k=32 ==================== trans_b torch.float32 2525.29 usec trans_b torch.float16 23106.71 usec trans_b 
torch.bfloat16 122987.04 usec m=11008, n=4096, k=32 ==================== trans_b torch.float32 6131.34 usec trans_b torch.float16 57537.41 usec trans_b torch.bfloat16 327825.00 usec m=4096, n=11008, k=32 ==================== trans_b torch.float32 6395.01 usec trans_b torch.float16 57456.33 usec trans_b torch.bfloat16 331325.58 usec m=32000, n=4096, k=32 ==================== trans_b torch.float32 19078.68 usec trans_b torch.float16 167735.08 usec trans_b torch.bfloat16 975736.88 usec Matrix-matrix (prompt len 128: m=8, n=128, k=128 ==================== trans_b torch.float32 2.40 usec trans_b torch.float16 6.07 usec trans_b torch.bfloat16 16.83 usec m=128, n=8, k=128 ==================== trans_b torch.float32 1.78 usec trans_b torch.float16 40.35 usec trans_b torch.bfloat16 37.21 usec m=4096, n=4096, k=128 ==================== trans_b torch.float32 4827.60 usec trans_b torch.float16 84341.24 usec trans_b torch.bfloat16 478917.75 usec m=11008, n=4096, k=128 ==================== trans_b torch.float32 11879.96 usec trans_b torch.float16 226484.33 usec trans_b torch.bfloat16 1289465.50 usec m=4096, n=11008, k=128 ==================== trans_b torch.float32 10707.75 usec trans_b torch.float16 229200.58 usec trans_b torch.bfloat16 1327416.67 usec m=32000, n=4096, k=128 ==================== trans_b torch.float32 33306.32 usec trans_b torch.float16 662898.21 usec trans_b torch.bfloat16 3815866.63 usec ``` torch.float16 performance seems to be improved for all except the m=128, n=8, k=128 case, where it is roughly neutral. This case motivated the addition of the "first-tier tail fixup" in the dot kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127451 Approved by: https://github.com/malfet ghstack dependencies: #127435	2024-05-30 23:28:17 +00:00
Scott Wolchok	74b89b9283	Extract dot-product functions from fp16_gemv_trans gemv kernels (#127435 ) Summary: Refactoring step before we attempt to use these to implement a less bad fp16 GEMM. Test Plan: Existing tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127435 Approved by: https://github.com/malfet	2024-05-30 23:28:17 +00:00
eellison	a3c00e4331	[Easy] Move V.fake_mode inside of replace_by_example (#127494 ) Was writing docs and saw that we always have this duplicated usage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127494 Approved by: https://github.com/shunting314, https://github.com/aorenste	2024-05-30 23:23:42 +00:00
Rohan Varma	f9a1bc2c65	[FSDP] Remove _sync_module_states (#124678 ) Remove this unused API Differential Revision: [D56445639](https://our.internmc.facebook.com/intern/diff/D56445639/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124678 Approved by: https://github.com/awgu	2024-05-30 23:02:09 +00:00
laithsakka	029af29e6d	support operator.index function (#127440 ) Fix https://github.com/pytorch/pytorch/issues/127426 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127440 Approved by: https://github.com/mlazos ghstack dependencies: #126444, #127146, #127424	2024-05-30 22:44:18 +00:00
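For context, `operator.index(obj)` is the standard-library function that returns `obj.__index__()`, so it accepts integers and integer-like objects and rejects floats. The sketch below shows only that plain-Python behavior. The `Size` class and `take_first` helper are hypothetical names used for illustration, and nothing here reflects how Dynamo handles the call internally.

```python
import operator


class Size:
    """Hypothetical integer-like wrapper used only for illustration."""

    def __init__(self, value):
        self.value = value

    def __index__(self):
        # operator.index() calls this to obtain a real int.
        return self.value


def take_first(items, n):
    # operator.index() converts integer-like objects and rejects floats.
    return items[: operator.index(n)]


first_two = take_first([10, 20, 30], Size(2))

try:
    take_first([10, 20, 30], 2.0)
    float_rejected = False
except TypeError:
    float_rejected = True
```

Code of this shape is the kind of user function the commit title refers to; whether a given call traces without a graph break depends on the PyTorch version in use.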
Huy Do	3b88c27c46	Mark DynamicShapesExportTests::test_retracibility_dict_container_inp_out as slow (#127558 ) Same as https://github.com/pytorch/pytorch/pull/117896, another slowpoke `DynamicShapesExportTests::test_retracibility_dict_container_inp_out` has recently shown up on macOS. For example, https://ossci-raw-job-status.s3.amazonaws.com/log/25585713394 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127558 Approved by: https://github.com/clee2000	2024-05-30 22:40:48 +00:00
PyTorch MergeBot	e02971fcfb	Revert "Enable UFMT on test_shape_ops.py test_show_pickle.py test_sort_and_select.py (#127165 )" This reverts commit a288b95d4e5ceed327c5bdb9696331aa87688d60. Reverted https://github.com/pytorch/pytorch/pull/127165 on behalf of https://github.com/atalman due to lint is failing ([comment](https://github.com/pytorch/pytorch/pull/127165#issuecomment-2140930658))	2024-05-30 22:06:46 +00:00
Peter Bell	4ee003abdf	[inductor] Repeat should not return a view (#127533 ) Fixes #127474 `as_strided` unwraps views and looks at the underlying storage, so it isn't legal to lower `repeat`, which should return a new storage, into a view. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127533 Approved by: https://github.com/lezcano	2024-05-30 21:38:59 +00:00
hippocookie	a288b95d4e	Enable UFMT on test_shape_ops.py test_show_pickle.py test_sort_and_select.py (#127165 ) Fixes some files in #123062 Run lintrunner on files: test_shape_ops.py test_show_pickle.py test_sort_and_select.py ```bash $ lintrunner --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127165 Approved by: https://github.com/ezyang	2024-05-30 21:34:16 +00:00
dilililiwhy	f471482eb2	Try to include NCCL related header file with macro USE_C10D_NCCL (#127501 ) Fixes #ISSUE_NUMBER Try to include NCCL related header file with macro USE_C10D_NCCL, so that third-party device compilation will not be interrupted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127501 Approved by: https://github.com/ezyang	2024-05-30 21:33:41 +00:00
Xuehai Pan	6849b80411	Add `ninja` as dev dependency (#127380 ) `ninja` is required to build C++ extensions in tests. ```pytb ERROR: test_autograd_cpp_node (__main__.TestCompiledAutograd) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/PanXuehai/Projects/pytorch/torch/testing/_internal/common_utils.py", line 2741, in wrapper method(*args, **kwargs) File "test/inductor/test_compiled_autograd.py", line 1061, in test_autograd_cpp_node module = torch.utils.cpp_extension.load_inline( File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1643, in load_inline return _jit_compile( File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1718, in _jit_compile _write_ninja_file_and_build_library( File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1800, in _write_ninja_file_and_build_library verify_ninja_availability() File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1849, in verify_ninja_availability raise RuntimeError("Ninja is required to load C++ extensions") RuntimeError: Ninja is required to load C++ extensions To execute this test, run the following from the base repo dir: python test/inductor/test_compiled_autograd.py -k TestCompiledAutograd.test_autograd_cpp_node ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127380 Approved by: https://github.com/ezyang	2024-05-30 21:22:42 +00:00
Xu Zhao	094183dba6	[torchbench][pt2] Enable Huggingface and Timm models for internal buck runner (#127460 ) Summary: Add huggingface and timm model runs to the internal pt2 benchmark runner. Test Plan: Testing huggingface model: ``` $ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BlenderbotSmallForCausalLM --performance --training --device=cuda --amp 33/ 33 +0 frames 2s 13 graphs 13 graph calls 0/ -12 = 0% ops 0% time ``` Testing timm model: ``` $ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only coat_lite_mini --performance --training --device=cuda --amp loading model: 0it [00:11, ?it/s] cuda train coat_lite_mini 8/ 8 +0 frames 4s 2 graphs 2 graph calls 0/ -1 = 0% ops 0% time ``` Differential Revision: D57930582 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127460 Approved by: https://github.com/HDCharles, https://github.com/huydhn	2024-05-30 21:18:28 +00:00
cyy	bf2f5e70dd	Fix warnings in SmallVector (#127250 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127250 Approved by: https://github.com/ezyang	2024-05-30 21:13:20 +00:00
Zain Rizvi	ad1b18ab2f	Add repo-specific scale config files (#127566 ) Part of moving pytorch/pytorch CI infra to a Linux Foundation-run AWS account. For self-hosted runners that can run jobs from just a single repo, the runner scalers expect them to be stored in the repo itself. These scale-config files define how the Linux Foundation's self-hosted runners are configured. These will apply to runners that are available only to the pytorch/pytorch and pytorch/pytorch-canary repos Pull Request resolved: https://github.com/pytorch/pytorch/pull/127566 Approved by: https://github.com/zxiiro, https://github.com/huydhn, https://github.com/atalman	2024-05-30 21:08:45 +00:00
PyTorch MergeBot	846f79e61a	Revert "Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199 )" This reverts commit 18a3f781e6382e2222d7c30c18136267407f9953. Reverted https://github.com/pytorch/pytorch/pull/127199 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing MacOS trunk job `18a3f781e6 (25619618844)` ([comment](https://github.com/pytorch/pytorch/pull/127199#issuecomment-2140834363))	2024-05-30 20:45:31 +00:00
Howard Huang	cce2192396	[pipelining] Support calling multiple recv fwd/bwd ops (#127084 ) Currently, only a single `get_fwd_recv_ops` or `get_bwd_recv_ops` can be called before `forward_one_chunk` and `backward_one_chunk` since they both share the same chunk_id counter. This creates a separate `recv_chunk_id` counter so that recvs can be accumulated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127084 Approved by: https://github.com/wconstab	2024-05-30 20:15:52 +00:00
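The counter split described in this commit can be pictured with a small toy model. This is a minimal sketch, not the real pipelining stage class: `ToyStage` and `fwd_chunk_id` are hypothetical names, and only `recv_chunk_id`, `get_fwd_recv_ops`, and `forward_one_chunk` are borrowed from the commit message. It assumes that receives and compute each advance their own index.

```python
class ToyStage:
    """Hypothetical stand-in for a pipeline stage; not the real PyTorch class."""

    def __init__(self):
        self.fwd_chunk_id = 0   # advanced by forward_one_chunk
        self.recv_chunk_id = 0  # advanced by get_fwd_recv_ops

    def get_fwd_recv_ops(self):
        # Each call describes the receive for the next chunk not yet posted.
        op = f"recv(chunk={self.recv_chunk_id})"
        self.recv_chunk_id += 1
        return op

    def forward_one_chunk(self):
        # Compute consumes chunks in order, regardless of how many
        # receives have already been posted.
        done = f"forward(chunk={self.fwd_chunk_id})"
        self.fwd_chunk_id += 1
        return done


stage = ToyStage()
recvs = [stage.get_fwd_recv_ops() for _ in range(3)]  # accumulate three recvs
first = stage.forward_one_chunk()
```

With a single shared counter, the three receive calls would all have referred to chunk 0 until a forward step ran; with two counters they refer to chunks 0, 1, and 2.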
Howard Huang	aa3d041830	[pipelining] Fix block comments for doc rendering (#127418 ) Previous: <img width="915" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/14626937-7d79-4a7a-9d0b-3fcfe64b4667"> <img width="926" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/58ab009c-3f93-46d7-a04f-499a2a0ba390"> New: https://docs-preview.pytorch.org/pytorch/pytorch/127418/distributed.pipelining.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/127418 Approved by: https://github.com/wconstab	2024-05-30 20:10:07 +00:00
Boyuan Feng	ff23c5b7d7	[cudagraph] improve log for mutating static input tensor addresses (#127145 ) Summary: This diff adds more logging for cudagraph when a static input tensor address mutates. For each placeholder whose static input tensor address mutates, we log its name, the changed data pointer address, and the input stack trace. Since some placeholders may have an empty stack trace, we find the first user with a non-empty stack trace and print that stack trace instead. Test Plan: buck2 run fbcode//caffe2/test/inductor:cudagraph_trees -- --r test_static_inputs_address_mutation_log Differential Revision: D57805118 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127145 Approved by: https://github.com/eellison	2024-05-30 19:57:32 +00:00
Prachi Gupta	19333d1eb9	[ROCm] Update triton pin to fix libtanh issue (#125396 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/125396 Approved by: https://github.com/pruthvistony, https://github.com/nmacchioni	2024-05-30 19:26:58 +00:00
Rohan Varma	2cb6f20867	Warn env vars only once during program (#127046 ) This avoids logs being excessively noisy in some training runs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127046 Approved by: https://github.com/kwen2501, https://github.com/wconstab	2024-05-30 19:10:53 +00:00
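One common way to get "warn once per program" behavior is to remember which names have already been reported. The following is a minimal sketch of that general pattern, not the code from this PR: `warn_env_var_once`, `_warned`, and the `TOY_FLAG` variable are all hypothetical.

```python
import os
import warnings

_warned = set()  # names already reported in this process


def warn_env_var_once(name):
    """Warn about an environment variable the first time it is seen."""
    if name in os.environ and name not in _warned:
        _warned.add(name)
        warnings.warn(f"{name} is set and may change behavior")
        return True
    return False


os.environ["TOY_FLAG"] = "1"
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # disable Python's own de-duplication
    results = [warn_env_var_once("TOY_FLAG") for _ in range(3)]
```

Only the first of the three calls emits a warning, which is the effect that keeps long training logs readable.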
Boyuan Feng	4afc5c7bb9	[torchscript] Handle prim::device and prim::dtype (#127466 ) - Support prim::device and prim::dtype during torchscript migration to export - Add unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/127466 Approved by: https://github.com/SherlockNoMad	2024-05-30 18:35:44 +00:00
Mikayla Gawarecki	fa426b096b	Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126819 Approved by: https://github.com/albanD ghstack dependencies: #127313, #126814	2024-05-30 18:28:13 +00:00
Mikayla Gawarecki	bfdec93395	Default XLA to use swap_tensors path in nn.Module._apply (#126814 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814 Approved by: https://github.com/JackCaoG, https://github.com/albanD ghstack dependencies: #127313	2024-05-30 18:28:13 +00:00
Ali Waheed	39cf2f8e66	Added sorting notes for eig/eigvals (#127492 ) Fixes #58034 @lezcano , Added suggested comments for eig and eigvals in the documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127492 Approved by: https://github.com/lezcano, https://github.com/kit1980	2024-05-30 18:13:22 +00:00
Chen Lai	7827afca14	Copy the constant folding pass to the pass under export/passes folder (#127456 ) It's a generic pass and I'm trying to find a good place to host it. It's currently needed by quantization flow. See context in D55930580, it's too much effort to land a fix in the inductor folder. Differential Revision: [D57934182](https://our.internmc.facebook.com/intern/diff/D57934182/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127456 Approved by: https://github.com/angelayi	2024-05-30 18:04:08 +00:00
James Wu	f9937afd4f	Add noqa to prevent lint warnings (#127545 ) This is to prevent the import from being removed due to unused import. What's annoying about this is that it's not consistently running: lintrunner doesn't warn me on this PR even without the comment, but it does on other PRs Pull Request resolved: https://github.com/pytorch/pytorch/pull/127545 Approved by: https://github.com/masnesral	2024-05-30 17:56:49 +00:00
PyTorch MergeBot	12d6446507	Revert "[inductor] fix mkldnn linear binary fusion check ut (#127296 )" This reverts commit cdeb242fc977210e211fd77b217320205c9f4042. Reverted https://github.com/pytorch/pytorch/pull/127296 on behalf of https://github.com/huydhn due to Sorry for reverting your change but one of the tests is failing on trunk ROCm. Please help fix and reland the change https://github.com/pytorch/pytorch/actions/runs/9302535020/job/25606932572 ([comment](https://github.com/pytorch/pytorch/pull/127296#issuecomment-2140334323))	2024-05-30 17:18:23 +00:00
PyTorch MergeBot	e9a6bbbf7c	Revert "[CI] add xpu test in periodic workflow (#126410 )" This reverts commit 30d98611a3a35287c47ded9647f0b4c81fbdf036. Reverted https://github.com/pytorch/pytorch/pull/126410 on behalf of https://github.com/malfet due to Let's sync up on the test strategy/policies here ([comment](https://github.com/pytorch/pytorch/pull/126410#issuecomment-2140269549))	2024-05-30 17:01:02 +00:00
cyy	8777443d73	Remove FindMatlabMex.cmake (#127414 ) It is not used anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127414 Approved by: https://github.com/ezyang	2024-05-30 16:26:35 +00:00
Daniil Kutz	b506d37331	Fix multiple errors while parsing NativeFunctions from YAML (#127413 ) Fixing multiple errors in parse_native_yaml when loading NativeFunctions from Yaml file. Add assertions that validates parsed data. Fixes #127404, #127405, #127406, #127407, #127408, #127409, #127410, #127411 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127413 Approved by: https://github.com/ezyang	2024-05-30 16:25:04 +00:00
PyTorch MergeBot	ea5c17de90	Revert "Add torchao nightly testing workflow (#126885 )" This reverts commit d938170314fa89acaad6b06fbbaac6b98f1e618f. Reverted https://github.com/pytorch/pytorch/pull/126885 on behalf of https://github.com/atalman due to Broke inductor periodic test ([comment](https://github.com/pytorch/pytorch/pull/126885#issuecomment-2140139486))	2024-05-30 16:23:06 +00:00
cyy	be7be9fa16	[Distributed] [8/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#125102 ) This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following https://github.com/pytorch/pytorch/pull/124987. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125102 Approved by: https://github.com/ezyang	2024-05-30 16:19:53 +00:00
Shunting Zhang	576c5ef1dd	[inductor] fix some tests in test_max_autotune.py (#127472 ) Fix https://github.com/pytorch/pytorch/issues/126176 . We should not use torch.empty to generate input data if we are going to do any accuracy test. torch.empty may return NaN. In that case both the reference and the actual result may contain NaN at the same index, but `NaN != NaN`, so the test fails. Also, whether torch.empty returns NaN is not deterministic; it may depend on other tests running earlier. Generating random data instead of calling torch.empty fixes the problem. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127472 Approved by: https://github.com/eellison, https://github.com/jansel	2024-05-30 16:04:48 +00:00
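The failure mode is easy to reproduce without PyTorch, because `NaN != NaN` is a property of IEEE 754 floats. In this plain-Python sketch an explicit `math.nan` stands in for whatever uninitialized memory might contain. `all_equal` is a hypothetical strict comparison, not the helper used by the test suite.

```python
import math
import random


def all_equal(actual, expected):
    # Elementwise equality, as a strict accuracy check would do.
    return all(a == e for a, e in zip(actual, expected))


# Uninitialized memory may happen to hold NaN; model that explicitly.
garbage = [1.0, math.nan, 3.0]
reference = list(garbage)
result = list(garbage)
nan_case_passes = all_equal(result, reference)  # False: NaN never equals NaN

# Seeded random inputs are finite, so identical results compare equal.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(3)]
random_case_passes = all_equal(list(data), list(data))
```

The two lists in the first case are bit-for-bit identical, yet the comparison still fails, which is why the tests switched to randomly generated inputs.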
Aaron Orenstein	d2df0f56a3	Fix compilation_latency regression caused by #127060 (#127326 ) It seems that while #127060 improved the speed for tacotron2 it introduced a compilation_latency regression for some of the TIMM benchmarks. The original change was to precompute the Dep metadata - but apparently some benchmarks have few enough overlaps that precomputing O(n) deps was slower than ignoring O(n^2) deps. So change it to go back to computing the Dep metadata on demand but to then cache the result. `dm_nfnet_f0` was a good example because on the dashboard it showed an increase from 140s -> 154s. ``` python benchmarks/dynamo/timm_models.py --performance --cold-start-latency --training --amp --backend inductor --dynamic-shapes --dynamic-batch-only --device cuda --total-partitions 5 --partition-id 1 --output timm-0.csv --only dm_nfnet_f0 ``` Looking at the compilation_latency result. On viable (d6e3e8980): 172.777958 176.725071 177.907955 On viable with #127060 and #127061 fully backed out: 158.305166 158.688560 160.791187 On viable w/ this change: 160.094164 160.201845 161.752157 I think that's probably close enough considering the variance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127326 Approved by: https://github.com/oulgen	2024-05-30 15:37:08 +00:00
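The "compute on demand, then cache" approach described here can be sketched with `functools.cached_property`. This illustrates the general trade-off only: `ToyNode`, `dep_metadata`, and the `computed` counter are hypothetical and do not mirror Inductor's actual scheduler types.

```python
from functools import cached_property


class ToyNode:
    """Hypothetical scheduler node; not Inductor's real data structure."""

    computed = 0  # counts how often the expensive step actually runs

    def __init__(self, reads):
        self.reads = reads

    @cached_property
    def dep_metadata(self):
        # Runs on first access only; the result is stored on the instance.
        ToyNode.computed += 1
        return frozenset(self.reads)


nodes = [ToyNode(["a", "b"]), ToyNode(["c"]), ToyNode(["d"])]
# Only the first node is ever queried, and it is queried twice.
first = nodes[0].dep_metadata
second = nodes[0].dep_metadata
```

Eager precomputation would have done the work for all three nodes. Here it runs once, for the one node that was actually inspected, and repeat lookups reuse the stored result.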
rzou	ffe506e853	Better graph break msg (and warning) on Dynamo x Python C++ extension (#127301 ) Dynamo graph breaks on Python C/C++ extensions (e.g. pybinded functions). The usual way to handle this is to turn those extensions into custom ops. This PR adds a nicer graph break message and also changes it to unconditionally warn on this graph break (because graph break messages are usually not visible). Fixes https://github.com/pytorch/pytorch/issues/126799 Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/127301 Approved by: https://github.com/jansel ghstack dependencies: #127291, #127292, #127400, #127423	2024-05-30 14:54:29 +00:00
rzou	c9beea13ac	Rewrite existing links to custom ops gdocs with the landing page (#127423 ) NB: these links will be live after the docs build happens, which is once a day. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/127423 Approved by: https://github.com/jansel, https://github.com/williamwen42 ghstack dependencies: #127291, #127292, #127400	2024-05-30 14:54:29 +00:00
lezcano	18a3f781e6	Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199 ) We don't need to generate so many samples for these very expensive ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127199 Approved by: https://github.com/peterbell10, https://github.com/zou3519 ghstack dependencies: #125580	2024-05-30 14:45:58 +00:00
lezcano	48538d3d14	Implement svd_lowrank and pca_lowrank for complex numbers (#125580 ) We fix a number of bugs previously present in the complex implementation. We also heavily simplify the implementation, using, among other things, that we now have conjugate views. I saw there is a comment regarding how slow some checks on this function are. As such, I removed quite a few of the combinations of inputs to make the OpInfo lighter. I still left a couple relevant examples to not regress coverage though. Fixes https://github.com/pytorch/pytorch/issues/122188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125580 Approved by: https://github.com/pearu, https://github.com/peterbell10	2024-05-30 14:45:58 +00:00
Isuru Fernando	3fb8a0b627	Fix nextafter in inductor CPP codegen (#126876 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126876 Approved by: https://github.com/peterbell10, https://github.com/jgong5	2024-05-30 14:08:16 +00:00
PyTorch MergeBot	ce63b676f3	Revert "[compiled autograd] torch.compile API (#125880 )" This reverts commit e1c322112a3d7b128b42e27f68bc9a714bfd9a09. Reverted https://github.com/pytorch/pytorch/pull/125880 on behalf of https://github.com/atalman due to sorry your PR broke lint, need to revert ([comment](https://github.com/pytorch/pytorch/pull/125880#issuecomment-2139605376))	2024-05-30 13:53:31 +00:00
albanD	6e0eeecc7c	Add back private function torch.cuda.amp.autocast_mode._cast (#127433 ) This is unfortunately used in a few places in the wild: https://github.com/search?q=torch.cuda.amp.autocast_mode._cast&type=code Pull Request resolved: https://github.com/pytorch/pytorch/pull/127433 Approved by: https://github.com/zou3519, https://github.com/guangyey	2024-05-30 13:29:23 +00:00
Sam Larsen	3f5d8636aa	[inductor] Copy RedisRemoteCacheBackend into pytorch (#127480 ) Summary: We need an implementation of RedisRemoteCacheBackend with the same API that we're using for FbMemcacheRemoteFxGraphCacheBackend. So we'll stop using the Triton implementation and adapt a version for use by inductor. I also renamed parameters and cache entries to match our cache terminology. Test Plan: Ran this command twice and inspected log output to ensure I got cache hits: ``` TORCH_LOGS=+torch._inductor.codecache TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 python benchmarks/dynamo/torchbench.py --performance --inductor --device cuda --training --amp --print-compilation-time --only dcgan ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127480 Approved by: https://github.com/oulgen	2024-05-30 13:08:10 +00:00
haozhe.zhu	cdeb242fc9	[inductor] fix mkldnn linear binary fusion check ut (#127296 ) In this PR: (1) Fix the unary fusion for bf16 conv/linear. Previously we registered the same fusion pattern for `bf16` and `fp16`, and we did not check the dtype while matching the pattern. As a result, the `fp16` case matched the `bf16` pattern, but in the later replacement we found an unexpected float16, so we did not fuse them. We fix it by checking dtypes so that the `fp16` case no longer matches the `bf16` pattern. ``` def _is_valid_computation_unary_fusion(computation_op, lowp_dtype=None): def fn(match): matched = _is_single_computation_op(computation_op, lowp_dtype)(match) # previously we do not check lowp_dtype here ``` This was not exposed before because we only checked the match count, and the match count was correct anyway because the pattern did match. To address this, we add a check on the number of `generated_kernel`. If the op is not fused, there will be an additional kernel to compute the post op. (2) Previously the ut ``` python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_binary ``` did not check the fusion status; fix it in this PR. (3) Extend `test_conv_binary` to test with lp. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127296 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel	2024-05-30 12:29:36 +00:00
Dmitry Rogozhkin	9f73c65b8f	xpu: pass MAX_JOBS building xpu_mkldnn_proj (#126562 ) mkldnn is quite a big project, and MAX_JOBS support is essential when building on a system with a large number of CPUs and limited memory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126562 Approved by: https://github.com/jgong5, https://github.com/guangyey, https://github.com/albanD	2024-05-30 12:10:33 +00:00
chuanqiw	30d98611a3	[CI] add xpu test in periodic workflow (#126410 ) Works for https://github.com/pytorch/pytorch/issues/114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126410 Approved by: https://github.com/EikanWang, https://github.com/atalman	2024-05-30 12:10:15 +00:00
Yifu Wang	1071437169	Introduce cuda_p2p based fused_all_gather_matmul and fused_matmul_reduce_scatter (#126634 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126634 Approved by: https://github.com/Chillee, https://github.com/wanchaol	2024-05-30 12:10:11 +00:00
titaiwangms	705346bf8d	[ONNX] Skip optimizer when it fails (#127349 ) continue #127039 (1) Skip optimizer when it fails (2) Update onnx, ort, and onnx-script (3) The update to onnx-script results in the actual optimizer and rewriter enabling in this PR, and https://github.com/pytorch/pytorch/pull/123379 did not update onnx-script. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127349 Approved by: https://github.com/justinchuby	2024-05-30 07:08:45 +00:00
Mikayla Gawarecki	cd06ae0cb8	Relax use_count constraints for swap_tensors when AccumulateGrad holds a reference (#127313 ) ### Before this PR: `torch.utils.swap_tensors(a, b)` required the `use_count` of `a` and `b` to be 1 ```python a = torch.randn(2, 3, requires_grad=True) b = torch.randn(2, 4) out = a * 2 out.sum().backward() # Calling swap_tensors here would fail due to the reference held by AccumulateGrad node, which is not cleaned up after backward # torch.utils.swap_tensors(a, b) del out # Calling swap_tensors here would pass torch.utils.swap_tensors(a, b) ``` ### After this PR: `torch.utils.swap_tensors(a, b)` requires the `use_count` of `a` and `b` to be 1 or 2 IF the second reference is held by `AccumulateGrad` A pre-hook will be registered on the `AccumulateGrad` node so that it will fail if it is called (i.e. if user attempts to backward through the graph). ```python a = torch.randn(2, 3, requires_grad=True) b = torch.randn(2, 4) out = a * 2 out.sum().backward() # Calling swap_tensors here is ok torch.utils.swap_tensors(a, b) # If we ever backward to the AccumulateGrad node it will error that it was poisoned by swap_tensors ``` ### Application to `nn.Module` This issue is especially pertinent in context of `nn.Module` where parameters will have `AccumulateGrad` nodes initialized after forward. Specifically, this is intended to address https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127777866. Previously, this would fail at the `m.cpu()` but we want users to be able to do something like the following, and instead raise an error if the user ever attempts to backward through the poisoned `AccumulateGrad` node ```python import torch import torch.nn as nn m = nn.Linear(3, 5) inp = torch.randn(2, 3) out = m(inp) out.sum().backward() m.cpu() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127313 Approved by: https://github.com/soulitzer	2024-05-30 07:06:55 +00:00
William Wen	d44ab8ba6d	[dynamo] utility to generate bytecode from template function (#127359 ) This will be helpful in reducing some of the hardcoded and python-version-dependent bytecode generation in various places in dynamo - e.g. resume function generation and object reconstruction. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127359 Approved by: https://github.com/jansel ghstack dependencies: #127329	2024-05-30 06:37:32 +00:00
Alex Baden	5d316c81be	[Inductor] Add 0 initialization to Triton masked loads (#127311 ) For a masked `tl.load` operation, the Triton language specifies that values masked out (i.e. where the mask evaluates to false) are undefined in the output of the load. Triton provides an optional `other` parameter which, when included, provides an explicit value to use for masked out values from the load. If the output from a masked load without the `other` parameter is used in a conditional, unexpected behavior can occur. Despite the language specification, all Triton backends currently in use by PyTorch Inductor (NVIDIA, AMD, and Intel) 0-initialize masked loads if `other` is not present (we recently changed the Intel backend behavior to match NVIDIA and AMD because that's what our users expect, even if we are not following the Triton spec to a tee). This PR attempts to "future-proof" Inductor for new backends (or perhaps changes in the current backends? - we did not see any performance change from 0-initializing in the Intel XPU backend but one could imagine compiler optimizations to remove paths that depend on undefined) to add an explicit `other` in instances where later conditionals depend on the `tl.load` output. I also removed an exception to `other` behavior for boolean loads, which was put in place for a Triton bug that should be fixed. I added `other` to the getting started documentation as a clue that masked load behavior requires explicit initialization, even though I don't expect `undef` values to cause the example code to fail if the underlying output is not 0-initialized. Finally, I added `other` to the `make_load` function in `select_algorithm.py`, though I wasn't able to determine if that function was actually being called. Fixes #126535 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127311 Approved by: https://github.com/jansel	2024-05-30 04:50:54 +00:00
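To make the role of `other` concrete, here is a plain-Python model of a masked load. It is only an analogy under stated assumptions: `masked_load` and the `UNDEFINED` sentinel are hypothetical, the code does not call Triton, and real backends may return arbitrary bits rather than a recognizable marker for masked-out lanes.

```python
UNDEFINED = object()  # marks lanes whose value the spec leaves unspecified


def masked_load(buffer, offsets, mask, other=UNDEFINED):
    """Toy model of a masked load: masked-out lanes receive `other`."""
    return [buffer[o] if m else other for o, m in zip(offsets, mask)]


buffer = [5, 7, 9]
offsets = [0, 1, 2, 3]                      # the last offset is out of bounds
mask = [o < len(buffer) for o in offsets]   # guards the out-of-bounds lane

loose = masked_load(buffer, offsets, mask)           # last lane is undefined
safe = masked_load(buffer, offsets, mask, other=0)   # last lane is 0

# A later conditional on the loaded values is only well defined for `safe`.
count_positive = sum(1 for v in safe if v > 0)
```

Passing `other=0` gives the later comparison a defined input on every lane, which is the property the PR makes explicit instead of relying on backend behavior.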
laithsakka	3947731887	enable test_parameter_free_dynamic_shapes test when nn module inlining is on (#127424 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127424 Approved by: https://github.com/mlazos ghstack dependencies: #126444, #127146	2024-05-30 04:20:07 +00:00
Anshul Sinha	15cc9f2e7e	[dtensor][be] added checksAssert function and refactored test cases (#127356 ) Summary Added c10d checksAsserts functions to reduce written lines of code and refactored test cases. Merged one test case into another. Test Plan pytest test/distributed/_tensor/debug/test_comm_mode.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127356 Approved by: https://github.com/XilunWu ghstack dependencies: #127025, #127029, #127040, #127134, #127334	2024-05-30 03:48:17 +00:00
Anshul Sinha	998f38814c	[dtensor][debug] added c10d allgather, allgather_coalesced, and allgather_into_tensor_coalesced tracing to CommDebugMode (#127334 ) Summary Added c10d allgather, allgather_coalesced, and allgather_into_tensor_coalesced tracing to CommDebugMode and edited test case in test_comm_mode to include added features. Test Plan pytest test/distributed/_tensor/debug/test_comm_mode.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127334 Approved by: https://github.com/XilunWu, https://github.com/yifuwang ghstack dependencies: #127025, #127029, #127040, #127134	2024-05-30 03:48:17 +00:00
James Wu	f58fc16e8f	[easy?] Move AsyncCompile to a different file (#127235 ) By moving AsyncCompile to its own file, we can import codecache without running the side effects of AsyncCompile. This will be important for AOTAutogradCaching, where we want to share some implementation details with codecache.py without spawning new processes. To conservatively maintain the same behavior elsewhere, every time we import codecache, I've added an import to torch._inductor.async_compile (except in autograd_cache.py, where the explicit goal is to not do this) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127235 Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/masnesral	2024-05-30 02:43:02 +00:00
chilli	e0fc1ab625	Forward fix for templates + views (#127446 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127446 Approved by: https://github.com/eellison	2024-05-30 02:34:35 +00:00
Tristan Rice	3d541835d5	distributed debug handlers (#126601 ) This adds debug handlers as described in: * https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy) * https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy) This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR. This adds 2 handlers out of the box: * `/handler/ping` for testing purposes * `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder Test plan: ``` python test/distributed/elastic/test_control_plane.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126601 Approved by: https://github.com/kurman, https://github.com/c-p-i-o	2024-05-30 02:21:08 +00:00
Simon Fan	e1c322112a	[compiled autograd] torch.compile API (#125880 ) - enter existing compiled autograd ctx manager before entering torch.compile frames Pull Request resolved: https://github.com/pytorch/pytorch/pull/125880 Approved by: https://github.com/jansel	2024-05-30 02:10:06 +00:00
SandishKumarHN	da39461d61	[optim] Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py (#126418 ) this PR address the comments in this PR #124904 - Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py - Combine _grad_scaling_autocast_fused_optimizers into test_grad_scaling_autocast_fused_optimizers - Move to OptimizerInfo framework. - For failing tests test_grad_scaling_autocast_fused_optimizers AdamW_cuda_float32, Adam_cuda_float32 - Added toleranceOverride in this PR - created a issue #127000 ``` > (c2env) [sandish@devgpu166.ash6 ~/pytorch (refactoroptimizers)]$ python test/test_cuda.py -k test_grad_scaling_autocast_fused_optimizers -v /home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system. warnings.warn( /home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system. warnings.warn( test_grad_scaling_autocast_fused_optimizers_Adagrad_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True} {'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'lr': 0.1, 'fused': True} {'lr': 0.1, 'fused': True} {'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True} {'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True} {'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True} {'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'lr': tensor(0.0010), 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_AdamW_cpu_float32 (__main__.TestCudaOptimsCPU) ... 
{'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_Adam_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_SGD_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_Adagrad_cuda_float32 (__main__.TestCudaOptimsCUDA) ... skipped 'cuda is not supported for fused on Adagrad' test_grad_scaling_autocast_fused_optimizers_AdamW_cuda_float32 (__main__.TestCudaOptimsCUDA) ... 
{'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'capturable': True, 'fused': True} {'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_Adam_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'capturable': True, 'fused': True} {'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_SGD_cuda_float32 (__main__.TestCudaOptimsCUDA) ... 
{'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} ok ---------------------------------------------------------------------- Ran 8 tests in 16.117s OK (skipped=1) > lintrunner test/test_cuda.py ---------------------------------------------------------------------- ok No lint issues. > lintrunner torch/testing/_internal/common_optimizers.py ---------------------------------------------------------------------- ok No lint issues. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126418 Approved by: https://github.com/janeyx99	2024-05-30 01:47:41 +00:00
PyTorch MergeBot	67739d8c6f	Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051 )" This reverts commit 699db7988d84d163ebb6919f78885e4630182a7a. Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2138496995))	2024-05-30 01:16:57 +00:00
rzou	1abcac9dab	New Custom Ops Documentation landing page (#127400 ) We create a new landing page for PyTorch custom ops (suggested by jansel). All of our error messages will link here, and I'll work with the docs team to see if we can boost SEO for this page. NB: the landing page links some non-searchable webpages. Two of those (the Python custom ops tutorial and C++ custom ops tutorial) will turn into actual webpages when PyTorch 2.4 comes around. I'll make the third one (the Custom Operators Manual) once it stabilizes (we continuously add new things to it and the length means that we might want to create a custom website for it to make the presentation more ingestible). Test Plan: - view docs preview. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127400 Approved by: https://github.com/jansel ghstack dependencies: #127291, #127292	2024-05-30 01:06:04 +00:00
saadelkouari	49ad90349d	Correct error message for aten::_local_scalar_dense on meta tensor (#124554 ) registering a meta for aten::_local_scalar_dense with a different error message. Fixes pytorch#119588 Co-authored-by: Edward Z. Yang <ezyang@mit.edu> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124554 Approved by: https://github.com/ezyang	2024-05-30 00:50:29 +00:00
Jiashen Cao	d66f12674c	Handle tuple and dict during TorchScript to ExportedProgram conversion (#127341 ) * Add some test cases for testing List, Tuple, and Dict * Refactor the conversion code slightly * Add a logic to handle Dict Pull Request resolved: https://github.com/pytorch/pytorch/pull/127341 Approved by: https://github.com/SherlockNoMad, https://github.com/angelayi	2024-05-30 00:08:09 +00:00
Lei Ding	f14dc3bde8	Fix check message (#126951 ) As title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126951 Approved by: https://github.com/Skylion007, https://github.com/kit1980	2024-05-29 23:58:09 +00:00
Edward Z. Yang	76fc58c160	Document the legacy constructor for Tensor (#122625 ) Fixes https://github.com/pytorch/pytorch/issues/122408 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122625 Approved by: https://github.com/albanD	2024-05-29 23:23:19 +00:00
Shan19900305	7931eee5c5	Support torch.dtype as parameter in pybind11 cpp extension. (#126865 ) Support torch.dtype as parameter in pybind11 cpp extension. Example: ` cpp_extension.my_ops(self, other, torch.dtype) ` @ezyang @bdhirsh Co-authored-by: Edward Z. Yang <ezyang@mit.edu> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126865 Approved by: https://github.com/ezyang	2024-05-29 23:19:32 +00:00
cyy	8ea1dc8748	Use Python::NumPy target (#127399 ) Now that we use FindPython, use it again for numpy detection. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127399 Approved by: https://github.com/malfet	2024-05-29 23:17:58 +00:00
lezcano	0fa2c5b049	Fix mask propagation in the presence of where (#125574 ) Before, when calling ops.where, masks were not properly propagated. We now restrict the optimisation to `ops.masked`, which I think is what the original code intended to do. I'm not 100% sure that even in the masked case this code is not introducing some bugs, but this is a strict improvement over the previous state. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125574 Approved by: https://github.com/peterbell10 ghstack dependencies: #114471, #126783	2024-05-29 23:17:41 +00:00
Darshan Sanghani	15a7916c0e	Ability to capture Process Groups information into Execution Traces (#126995 ) Contains a method added to the ExecutionTraceObserver class to record the snapshot of the current process group config upon tracing start. Unit test: ``` (pytorch) [dsang@devgpu021.nha2 ~/github/pytorch-fork (viable/strict)]$ touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_execution_trace /home/dsang/github/pytorch-fork/torch/distributed/optim/__init__.py:28: UserWarning: TorchScript support for functional optimizers isdeprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead. warn("TorchScript support for functional optimizers is" test_ddp_profiling_execution_trace (__main__.TestDistBackendWithSpawn.test_ddp_profiling_execution_trace) ... /home/dsang/github/pytorch-fork/torch/distributed/optim/__init__.py:28: UserWarning: TorchScript support for functional optimizers isdeprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead. warn("TorchScript support for functional optimizers is" /home/dsang/github/pytorch-fork/torch/distributed/optim/__init__.py:28: UserWarning: TorchScript support for functional optimizers isdeprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead. warn("TorchScript support for functional optimizers is" NCCL version 2.20.5+cuda12.0 [rank1]:[W523 16:06:01.705774398 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. 
Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) [rank0]:[W523 16:06:01.705905760 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) [rank1]:[W523 16:06:01.715182258 execution_trace_observer.cpp:819] Enabling Execution Trace Observer printing pg info into trace [rank0]:[W523 16:06:01.715841805 execution_trace_observer.cpp:819] Enabling Execution Trace Observer printing pg info into trace [rank1]:[W523 16:06:01.727881877 execution_trace_observer.cpp:831] Disabling Execution Trace Observer [rank0]:[W523 16:06:01.728792871 execution_trace_observer.cpp:831] Disabling Execution Trace Observer Execution trace saved at /tmp/tmpdsov4ngi.et.json [{'id': 3, 'name': '## process_group:init ##', 'ctrl_deps': 2, 'inputs': {'values': ['[{"pg_name": "0", "pg_desc": "default_pg", "backend_config": "cuda:nccl", "ranks": [], "group_size": 2, "group_count": 1}]'], 'shapes': [[]], 'types': ['String']}, 'outputs': {'values': [], 'shapes': [], 'types': []}, 'attrs': [{'name': 'rf_id', 'type': 'uint64', 'value': 1}, {'name': 'fw_parent', 'type': 'uint64', 'value': 0}, {'name': 'seq_id', 'type': 'int64', 'value': -1}, {'name': 'scope', 'type': 'uint64', 'value': 7}, {'name': 'tid', 'type': 'uint64', 'value': 1}, {'name': 'fw_tid', 'type': 'uint64', 'value': 0}, {'name': 'op_schema', 'type': 'string', 'value': ''}, {'name': 'kernel_backend', 'type': 'string', 'value': ''}, {'name': 'kernel_file', 'type': 'string', 'value': 
''}]}] Execution trace saved at /tmp/tmpsdiqy6az.et.json [{'id': 3, 'name': '## process_group:init ##', 'ctrl_deps': 2, 'inputs': {'values': ['[{"pg_name": "0", "pg_desc": "default_pg", "backend_config": "cuda:nccl", "ranks": [], "group_size": 2, "group_count": 1}]'], 'shapes': [[]], 'types': ['String']}, 'outputs': {'values': [], 'shapes': [], 'types': []}, 'attrs': [{'name': 'rf_id', 'type': 'uint64', 'value': 1}, {'name': 'fw_parent', 'type': 'uint64', 'value': 0}, {'name': 'seq_id', 'type': 'int64', 'value': -1}, {'name': 'scope', 'type': 'uint64', 'value': 7}, {'name': 'tid', 'type': 'uint64', 'value': 1}, {'name': 'fw_tid', 'type': 'uint64', 'value': 0}, {'name': 'op_schema', 'type': 'string', 'value': ''}, {'name': 'kernel_backend', 'type': 'string', 'value': ''}, {'name': 'kernel_file', 'type': 'string', 'value': ''}]}] ok ---------------------------------------------------------------------- Ran 1 test in 24.447s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126995 Approved by: https://github.com/briancoutinho, https://github.com/sraikund16	2024-05-29 23:16:17 +00:00
Nikita Shulga	3174e6cb8e	[Temp][CI] Run older MPS tests/Mac builds on MacOS 13 (#127428 ) To avoid ambiguity while migration outlined in https://github.com/pytorch-labs/pytorch-gha-infra/pull/399 is in progress. Otherwise, MPS jobs for Ventura can be accidentally scheduled on Sonoma or builds, which might result in flaky failures on trunk. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127428 Approved by: https://github.com/huydhn	2024-05-29 22:58:41 +00:00
PaliC	9257a0698b	[Split Build] Load dependencies from libtorch in __init__.py (#126826 ) This PR makes it such that we search for a libtorch wheel when initializing pytorch in order to find the necessary shared libraries. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126826 Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/ZainRizvi	2024-05-29 22:03:50 +00:00
Wanchao Liang	b0ef363972	[dtensor] rename _Partial -> Partial for all imports (#127420 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127420 Approved by: https://github.com/awgu	2024-05-29 21:42:40 +00:00
Catherine Lee	d99b115eb3	Fix delete old branches workflow (#127442 ) The ubuntu runner started using 2.45.1 (prev 2.43.2), which includes `1f49f7506f` (changes +00:00 timezone to Z) Python versions prior to 3.11 do not support Z when parsing isoformat, so update the workflow to use 3.11 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127442 Approved by: https://github.com/huydhn, https://github.com/malfet	2024-05-29 21:17:09 +00:00
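The `Z` suffix problem described in the entry above can also be sidestepped on older interpreters by normalizing the suffix before parsing. A minimal sketch, assuming timestamps shaped like `2024-05-29T21:17:09Z`; the helper name `parse_git_timestamp` is hypothetical and is not taken from the workflow itself:

```python
from datetime import datetime, timezone


def parse_git_timestamp(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    # datetime.fromisoformat() only accepts 'Z' from Python 3.11 onward,
    # so rewrite it as the equivalent '+00:00' offset first.
    if stamp.endswith("Z"):
        stamp = stamp[:-1] + "+00:00"
    return datetime.fromisoformat(stamp)


parsed = parse_git_timestamp("2024-05-29T21:17:09Z")
# The parsed value is timezone-aware and equal to the same instant in UTC.
print(parsed == datetime(2024, 5, 29, 21, 17, 9, tzinfo=timezone.utc))
```

The PR took the other route and moved the workflow to Python 3.11; the normalization shown here is only an alternative for code that must keep running on earlier versions.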
JackCaoG	38a33c3202	don't call .item in onehot for XLA (#127335 ) We found that `nn.functional.one_hot` will cause a graph break due to the item call in the native implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127335 Approved by: https://github.com/ezyang	2024-05-29 20:37:26 +00:00
cyy	84b5aa9a68	[Caffe2] [Reland] Remove Caffe2 proto files (#127394 ) Reland of #126134, which was reverted due to the wrong base. Now that #126705 has been relanded, it's time to reland this one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127394 Approved by: https://github.com/r-barnes	2024-05-29 20:37:02 +00:00
Yuanhao Ji	92d081e228	[Docs] Add `str` type to `cuda.get_device_name()` and `cuda.get_device_capability()` function (#126743 ) Fixes #126400 The `get_device_name()` and `get_device_capability()` allow passing in a string, but it's not stated in the doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126743 Approved by: https://github.com/eqy, https://github.com/kit1980	2024-05-29 20:09:52 +00:00
Kwanghoon An	24a4bfdcc2	[AdaRound] Make versatile for data / extra param for callback function (#126891 ) Summary: For sequential speech models, there can be cases where model(data) does not work correctly as the feed-forward call: a speech model may use a different type of Criterion (a.k.a. loss function) that feeds data to individual components such as the encoder, predictor, and joiner. Hence we need an extra parameter for passing a feed-forward wrapper. Differential Revision: D57680391 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126891 Approved by: https://github.com/jerryzh168	2024-05-29 20:05:27 +00:00
Kwanghoon An	c404b2968c	Support min/max carry over for eager mode from_float method (#127309 ) Summary: After QAT is completed, or when a pre-tuned weight observer is provided by a tunable PTQ algorithm, the observer should not be overwritten again from the given weight, at least never for static QAT. By design, dynamic QAT also does not need to re-run the weight observer. This is a fix. Test Plan: Signals Differential Revision: D57747749 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127309 Approved by: https://github.com/jerryzh168	2024-05-29 19:33:26 +00:00
Sam Larsen	82a370ae3a	Revert "Refresh OpOverloadPacket if a new OpOverload gets added (#126863 )" (#127366 ) This reverts commit ed734178abc99bc1d83ad2c61d3a1e4d4f5d20c8. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127366 Approved by: https://github.com/zou3519	2024-05-29 19:26:06 +00:00
Jane Xu	05e99154ee	Allow int vals to go down the fastpath for _foreach_max (#127303 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127303 Approved by: https://github.com/albanD ghstack dependencies: #127187	2024-05-29 19:08:58 +00:00
Jane Xu	601c5e085d	Add _foreach_max (#127187 ) This PR adds _foreach_max support, the second reduction foreach op we have :D I did have to change the autogen slightly for foreach. I can promise that the existing foreach ops' derivative behavior has not changed as I've added a skip list for the harder requirement I am setting (that the arg list should match in length). I needed to add this requirement as there is another wrong max (the one that does take in a dim for reduction) that keeps getting matched first. Caveats! - We do not fast path if the shapes, dtypes, device, the regular shebang for foreach are not met. We fall back to slowpath! - MORE IMPORTANTLY, we also do not fast path for int8 and int16 and bool, but that's really a skill issue on my end as I've hardcoded -INFINITY into the CUDA kernels, and -INFINITY is not defined for small ints. It'd be nice to know how to do this properly, but that work can also come later. - This does NOT support empty Tensors in the list, because the original max op also does not support empty Tensors. ~I think this should be allowed though, and this PR may come later.~ I understand why this is not allowed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127187 Approved by: https://github.com/albanD	2024-05-29 19:08:58 +00:00
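The int8/int16/bool caveat in the entry above comes down to the identity element of a max reduction: floating-point types can start from negative infinity, while fixed-width integers have no such value and would need their minimum representable value instead. A plain-Python sketch of that idea, not the CUDA kernel from the PR; `max_identity`, `reduce_max`, and the dtype strings are illustrative assumptions:

```python
import math

# Bit widths for signed integer types; the string names are illustrative only.
SIGNED_INT_BITS = {"int8": 8, "int16": 16, "int32": 32, "int64": 64}


def max_identity(dtype: str):
    """Return a starting value that never beats a real element under max()."""
    if dtype in SIGNED_INT_BITS:
        # Smallest two's-complement value, e.g. -128 for int8.
        return -(2 ** (SIGNED_INT_BITS[dtype] - 1))
    if dtype == "bool":
        return False
    # Floating-point types can use negative infinity directly.
    return -math.inf


def reduce_max(values, dtype: str):
    """Fold max() over values, seeded with the dtype's identity element."""
    acc = max_identity(dtype)
    for value in values:
        acc = max(acc, value)
    return acc


print(max_identity("int8"), reduce_max([-5, -7], "int8"))
```

Seeding the accumulator per dtype, rather than hardcoding one constant, is one possible way a reduction could cover small integer types; whether the PyTorch kernels adopt it is not stated in the source.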
Nikita Shulga	90f4b3fcb2	PyTorch Distributed security assumptions (#127403 ) To highlight, that PyTorch Distributed should only be used in a trusted environment and never on the nodes with open network access, which is very similar in spirit to https://github.com/tensorflow/tensorflow/blob/master/SECURITY.md#running-a-tensorflow-server Thanks to @Xbalien and @K1ingzzz for drawing attention to missing documentation on distributed workloads security assumptions Pull Request resolved: https://github.com/pytorch/pytorch/pull/127403 Approved by: https://github.com/wconstab	2024-05-29 19:08:20 +00:00
laithsakka	5196ef1b59	support builtin id function on user defined object variables. (#127146 ) Fix: https://github.com/pytorch/pytorch/pull/127146 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127146 Approved by: https://github.com/anijain2305 ghstack dependencies: #126444	2024-05-29 19:00:37 +00:00
lancerts	ff65b18fcf	Update the is_causal explanation in the SDPA doc (#127209 ) Fixes #126873 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127209 Approved by: https://github.com/drisspg	2024-05-29 18:53:17 +00:00
cyy	9cc0d56fdc	Remove unused variables in tests (#127379 ) Reland test fixes in #127161 and reduce reduce_ops_test into floating point types. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127379 Approved by: https://github.com/ezyang	2024-05-29 18:30:51 +00:00
Xu Zhao	d938170314	Add torchao nightly testing workflow (#126885 ) Add and test torchao nightly testing workflow. This workflow will be triggered under the following conditions: 1. If the PR has ciflow/torchao label 2. Manual trigger It will run the torchao benchmark on torchbench/timm/huggingface model workloads with 5 configs (noquant, autoquant, int8dynamic, int8weightonly, int4weightonly). The output will be updated to the PT2 Dashboard: https://hud.pytorch.org/benchmark/compilers Pull Request resolved: https://github.com/pytorch/pytorch/pull/126885 Approved by: https://github.com/huydhn	2024-05-29 18:22:29 +00:00
Scott Wolchok	090a031d6f	Use bit_cast instead of UB type-pun-via-union in Half.h (#127321 ) Summary: Type punning via union has undefined behavior due to the strict aliasing rule. bit_cast does the same thing safely (using memcpy under the hood). Test Plan: CI Godbolt demonstrates that doing this via memcpy still generates the same instructions: https://godbolt.org/z/PhePzd4Ex Pull Request resolved: https://github.com/pytorch/pytorch/pull/127321 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2024-05-29 17:43:50 +00:00
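The idea in the entry above, reinterpreting a value's bits by copying its bytes instead of aliasing through a union, can be illustrated outside C++. A small Python sketch using the standard `struct` module; the helper names are hypothetical and this is not the code in `Half.h`:

```python
import struct


def half_to_bits(value: float) -> int:
    """Return the IEEE 754 binary16 bit pattern of value as an integer."""
    # Pack as a half-precision float ('e'), then reread the same two bytes
    # as an unsigned 16-bit integer ('H'): a byte copy, not a type pun.
    return struct.unpack("<H", struct.pack("<e", value))[0]


def bits_to_half(bits: int) -> float:
    """Inverse of half_to_bits: rebuild the float from its bit pattern."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


print(hex(half_to_bits(1.0)))   # 0x3c00
print(bits_to_half(0xC000))     # -2.0
```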
Danielle Pintz	8b5cbb7c68	Improve NLLLoss docs (#127346 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127346 Approved by: https://github.com/mikaylagawarecki	2024-05-29 17:29:06 +00:00
rzou	28de9143a3	opcheck should be usable without optional dependencies (#127292 ) This PR excises opcheck's dependency on torch.testing._internal.common_utils, (which comes with dependencies on expecttest and hypothesis). We do this by moving what we need to torch.testing._utils and adding a test for it. Fixes #126870, #126871 Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/127292 Approved by: https://github.com/williamwen42 ghstack dependencies: #127291	2024-05-29 17:17:49 +00:00
Pian Pawakapan	8a31c2aa84	[export] allow complex guards as runtime asserts (#127129 ) With the current state of export's dynamic shapes, we struggle with guards and constraints that are beyond the current dynamic shapes language, expressed with dims and derived dims. While we can compile and guarantee correctness for guards within the current language (e.g. min/max ranges, linear relationships, integer divisibility) we struggle to dynamically compile guards which extend beyond that. For these "complex" guards, we typically do either of the following: 1) raise a constraint violation error, along the lines of "not all values of <symbol> in the specified range satisfy <guard>", with or without suggested fixes, 2) specialize to the provided static values and suggest removing dynamism, or 3) fail compilation due to some arbitrary unsupported case. Previous [work](https://github.com/pytorch/pytorch/pull/124949) went towards resolving this by disabling forced specializations, instead allowing the user to fail at runtime with incorrect inputs. In this PR, relying on [hybrid backed-unbacked symints](https://github.com/pytorch/pytorch/issues/121749), [deferred runtime asserts](https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/runtime_assert.py), and the function [_is_supported_equivalence()](`d7de4c9d80/torch/fx/experimental/symbolic_shapes.py (L1824)`), we add a flag `_allow_complex_guards_as_runtime_asserts` which allows the user to compile exported programs containing these guards and maintain dynamism, while adding correctness checks as runtime assertions in the graph. Hybrid backed-unbacked symints allow us to easily bypass "implicit" guards emitted from computation - guards that we ~expect to be true. 
Popular examples revolve around reshapes: ``` # reshape def forward(self, x, y): # x: [s0, s1], y: [s2] return x.reshape([-1]) + y # guard s0 * s1 = s2 This leads to the following exported program class GraphModule(torch.nn.Module): def forward(self, x: "f32[s0, s1]", y: "f32[s2]"): sym_size_int: "Sym(s2)" = torch.ops.aten.sym_size.int(y, 0) mul: "Sym(-s2)" = -1 * sym_size_int; sym_size_int = None sym_size_int_1: "Sym(s0)" = torch.ops.aten.sym_size.int(x, 0) sym_size_int_2: "Sym(s1)" = torch.ops.aten.sym_size.int(x, 1) mul_1: "Sym(s0*s1)" = sym_size_int_1 * sym_size_int_2; sym_size_int_1 = sym_size_int_2 = None add: "Sym(s0*s1 - s2)" = mul + mul_1; mul = mul_1 = None eq: "Sym(Eq(s0*s1 - s2, 0))" = add == 0; add = None _assert_scalar = torch.ops.aten._assert_scalar.default(eq, "Runtime assertion failed for expression Eq(s0*s1 - s2, 0) on node 'eq'"); eq = None view: "f32[s0*s1]" = torch.ops.aten.view.default(x, [-1]); x = None add_1: "f32[s0*s1]" = torch.ops.aten.add.Tensor(view, y); view = y = None return (add_1,) ``` Another case is symbol divisibility: ``` def forward(self, x): # x: [s0, s1] return x.reshape([-1, x.shape[0] - 1]) # Eq(Mod(s0*s1, s0 - 1), 0) ``` Applying deferred runtime asserts also helps dynamic compilation for "explicit" complex guards that typically cause problems for export. For example we can generate runtime asserts for not-equal guards, and complex conditions like the following: ``` class Foo(torch.nn.Module): def forward(self, x, y): # check that negation of first guard also shows up as runtime assertion if x.shape[0] == y.shape[0]: # False return x + y elif x.shape[0] == y.shape[0] ** 3: # False return x + 2, y + 3 elif x.shape[0] ** 2 == y.shape[0] * 3: # True return x * 2.0, y * 3.0 ``` For the above graph we will generate 3 runtime assertions: the negation of the first 2, and the 3rd condition as a guard. 
One additional benefit here over the current state of exported programs is that this adds further correctness guarantees - previously with explicit complex guards, if compilation succeeded, the guards would be ignored at runtime, treated as given. As shown above, the runtime asserts appear as math ops in the graph, generated by the sympy interpreter, resulting in an _assert_scalar call. There is an option to avoid adding these asserts into the graph, by setting `TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS=1`. This results in the "original" computation graph, with dynamism, and any incorrect inputs will fail on ops during runtime. Further work could go into prettifying the printer, so the majority of the graph isn't guard-related. Ideally this PR would subsume and remove the recently added [_disable_forced_specializations](https://github.com/pytorch/pytorch/pull/124949) flag, but that flag still handles one additional case of specialization: single-variable equalities where the symbol is solvable for a concrete value: see this [PR](https://github.com/pytorch/pytorch/pull/126925) This PR doesn't change any behavior around data-dependent errors/unbacked symints yet, that could be further work. NOTE: will take naming change suggestions for the flag :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127129 Approved by: https://github.com/avikchaudhuri	2024-05-29 17:15:25 +00:00
Richard Barnes	cc6e72d882	Drop caffe2 core tests and some other stuff (#127089 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127089 Approved by: https://github.com/Skylion007	2024-05-29 17:11:45 +00:00
cyy	e8e327ba82	Cover clang-tidy to torch/csrc/onnx/init.cpp (#127393 ) Enabling it will not cause issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127393 Approved by: https://github.com/Skylion007	2024-05-29 17:05:28 +00:00
cyy	7de1352457	[1/N] Replace exceptions with static_assert(false) in some templates (#127371 ) This PR tries to report some failures at build time. Once the build fails, it generally indicates that we can wrap the code inside some conditional macros, and it is a hint to further reduce the built code size. The sizeof operations were used to ensure that the assertion depends on specific template instantiations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127371 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2024-05-29 16:14:00 +00:00
cyy	c69562caf9	[Caffe2]Remove more caffe2 files (#126628 ) They are not used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126628 Approved by: https://github.com/albanD	2024-05-29 16:08:48 +00:00
Andrew M. James	80a8fc07b2	[dynamo] Handle np.iinfo/finfo/dtype as input (#124482 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124482 Approved by: https://github.com/lezcano ghstack dependencies: #124481	2024-05-29 16:00:15 +00:00
Derek	9a8e8101a8	Fix wording in nn.Linear docstring. (#127240 ) Definition (Linear Transformation): A mapping $T : V \to W$ between $F$-vector spaces $V,W$ is called a linear transformation if and only if a) $T(u+v)=T(u)+T(v)$, b) $T(cv)=cT(v)$ for all $u, v \in V$, $c \in F$. Consequently, $T(0_V)=0_W$. Thus $x \mapsto xA^T+b$ for nonzero $b$ is not a linear transformation, but is often referred to as an affine linear transformation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127240 Approved by: https://github.com/soulitzer, https://github.com/albanD	2024-05-29 14:55:40 +00:00
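The definition above can be checked numerically. The following is a small sketch in the scalar case (a 1-D stand-in for $x \mapsto xA^T+b$); the function name `affine` and the sample values are chosen here purely for illustration.

```python
def affine(x, a, b):
    """Scalar version of x -> x * a + b (the 1-D case of x A^T + b)."""
    return x * a + b


def is_linear_on_samples(a, b, u=3.0, v=4.0, c=5.0):
    """Check additivity, homogeneity and T(0) = 0 on a few sample values."""
    additive = affine(u + v, a, b) == affine(u, a, b) + affine(v, a, b)
    homogeneous = affine(c * v, a, b) == c * affine(v, a, b)
    maps_zero_to_zero = affine(0.0, a, b) == 0.0
    return additive and homogeneous and maps_zero_to_zero


print(is_linear_on_samples(a=2.0, b=0.0))  # zero bias: the properties hold
print(is_linear_on_samples(a=2.0, b=1.0))  # nonzero bias: they fail
```

Passing the sample check does not prove linearity in general; failing it, as the nonzero-bias case does, is enough to show the map is not linear.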
Andrew M. James	ade075444f	[dynamo] Support numpy.dtype (#124481 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124481 Approved by: https://github.com/lezcano	2024-05-29 14:45:14 +00:00
Aaron Gokaslan	bf966588f1	[BE][Ez]: Update cudnn_frontend submodule to v1.4.0 (#127175 ) Updates the cudnn_frontend submodule to the latest 1.4.0 version. Should be a straightforward, header-only submodule update. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127175 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-05-29 14:23:38 +00:00
Nikita Shulga	0910429d72	[BE][CMake] Use FindPython module (#124613 ) FindPythonInterp and FindPythonLibs have been deprecated since cmake-3.12. Replace `PYTHON_EXECUTABLE` with `Python_EXECUTABLE` everywhere (CMake variable names are case-sensitive). This makes PyTorch buildable with the python3 binary shipped with Xcode on macOS. TODO: Get rid of `FindNumpy`, as it's part of the Python package. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124613 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2024-05-29 13:17:35 +00:00
Bin Bao	942d9abd66	[AOTI] Update reinplace to cover mutated buffer (#127297 ) Summary: Unlike JIT Inductor, AOTI currently unlifts weights and buffers from input args, so the reinplace pass didn't really work for AOTI because it only checks mutation on placeholders, which led to excessive memory copies for kv_cache updates in LLM models. This PR removes those memory copies and roughly offers a 2x speedup. In the future, we will revert the unlift logic in AOTI and make the behavior consistent with JIT Inductor. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127297 Approved by: https://github.com/peterbell10, https://github.com/chenyang78	2024-05-29 13:07:53 +00:00
Richard Barnes	af69a52f06	Reapply "Remove more of caffe2 (#126705 )" (#127317 ) This reverts commit 00fe0a0d795680ade029fc552f33fffed75c0250. Originally was unnecessarily reverted by an oncall. Landing again. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127317 Approved by: https://github.com/izaitsevfb	2024-05-29 12:20:25 +00:00
Xuehai Pan	749a132fb0	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR. UPDATE: Use `FutureWarning` instead of `DeprecationWarning`. Resolves #126888 - #126888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898 Approved by: https://github.com/albanD	2024-05-29 12:09:27 +00:00
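A minimal sketch of the fallback path described above (an explicit `category=FutureWarning` on `warnings.warn`), using only the standard library. The functions `old_api` and `new_api` are hypothetical names for this example; the `typing_extensions.deprecated` decorator path is not shown.

```python
import warnings


def new_api():
    """Replacement entry point (hypothetical)."""
    return 42


def old_api():
    """Deprecated entry point that warns with an explicit category."""
    warnings.warn(
        "old_api is deprecated, use new_api instead",
        category=FutureWarning,
        stacklevel=2,  # attribute the warning to the caller of old_api
    )
    return new_api()


# Record warnings so the category can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api()

print(result, [w.category.__name__ for w in caught])
```

One reason to prefer `FutureWarning` here: under Python's default warning filters, `DeprecationWarning` is shown only when triggered by code in `__main__`, while `FutureWarning` is shown to end users by default.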
cyy	699db7988d	[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051 Approved by: https://github.com/cpuhrsch, https://github.com/malfet	2024-05-29 11:58:03 +00:00
Valeriu	02b1cdab23	[Sync torch_FA2 and FA2 flash_api] + [Expose seqused_k & alibi_slopes arguments] (#126520 ) 1. Expose seqused_k & alibi_slopes arguments: - This can be used when your sequence length k is not the full extent of the tensor. This is useful for kv cache scenarios and was not previously supported in the FA2 TORCH integration. We need these arguments for external xformers lib call to the _flash_attention_forward API. Before: ``` std::optional<Tensor> seqused_k = c10::nullopt; std::optional<Tensor> alibi_slopes = c10::nullopt; ``` After: ``` _flash_attention_forward(... std::optional<Tensor>& seqused_k, std::optional<Tensor>& alibi_slopes, ``` 2. There is a difference between the TORCH_FA2_flash_api:mha_fwd and FA2_flash_api:mha_fwd (same for mha_varlen_fwd) at the query transposition (GQA) step. The CHECK_SHAPE is applied on the original query vs the reshaped query. This causes an error (because of the shape constraint) for such inputs: ``` q = torch.randn([7, 1, 4, 256], dtype=torch.bfloat16, device='cuda') k = torch.randn([7, 51, 1, 256], dtype=torch.bfloat16, device='cuda') v = torch.randn([7, 51, 1, 256], dtype=torch.bfloat16, device='cuda') ``` ![image](https://github.com/pytorch/pytorch/assets/927999/77ea6bf6-b6e9-4f3f-96a9-8d952956ddd9) - i've modified the code as little as possible, but if you prefer a more verbose change like the following, dont hesitate to tell me: ``` at::Tensor swapped_q = seqlenq_ngroups_swapped ? q.reshape({batch_size, num_heads_k, num_heads / num_heads_k, head_size_og}).transpose(1, 2) : q; if (seqlenq_ngroups_swapped) { seqlen_q = num_heads / num_heads_k; num_heads = num_heads_k; } CHECK_SHAPE(swapped_q, batch_size, seqlen_q, num_heads, head_size_og); ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126520 Approved by: https://github.com/drisspg	2024-05-29 11:54:44 +00:00
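To make the shape mismatch concrete, here is a shape-only sketch in plain Python of the check described above, using the example input sizes from this message. The helper `check_query_shapes` is hypothetical, and the swap condition is deliberately simplified (an assumption): the real condition in the flash API involves more terms than `seqlen_q == 1` and `num_heads > num_heads_k`.

```python
def check_query_shapes(q_shape, num_heads_k):
    """Compare the original and the swapped query shape with the expected shape.

    Shape-only illustration; the swap condition below is a simplified assumption.
    """
    batch_size, seqlen_q, num_heads, head_size = q_shape
    swapped = seqlen_q == 1 and num_heads > num_heads_k
    swapped_q_shape = q_shape
    if swapped:
        ngroups = num_heads // num_heads_k
        # reshape to (batch, num_heads_k, ngroups, head), then transpose(1, 2)
        swapped_q_shape = (batch_size, ngroups, num_heads_k, head_size)
        seqlen_q, num_heads = ngroups, num_heads_k
    expected = (batch_size, seqlen_q, num_heads, head_size)
    return {
        "expected": expected,
        "original_q_ok": tuple(q_shape) == expected,
        "swapped_q_ok": swapped_q_shape == expected,
    }


# q: [7, 1, 4, 256], k/v: [7, 51, 1, 256] -> num_heads_k = 1
print(check_query_shapes((7, 1, 4, 256), num_heads_k=1))
```

With these sizes the expected shape becomes `(7, 4, 1, 256)`, so a shape check against the original `q` fails while a check against the reshaped query passes.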
Jiong Gong	dae33a4961	[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 ) As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068 Approved by: https://github.com/jansel ghstack dependencies: #124021, #126019	2024-05-29 11:15:41 +00:00
WeihaoCui	65af1a9c26	FIX the document of distributed.new_group() (#122703 ) Currently, the documentation of distributed.new_group() says that it returns `None` when the current rank is not in the newly created process group. However, it actually returns `GroupMember.NON_GROUP_MEMBER`. I have checked the code and think it is more appropriate to fix the documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122703 Approved by: https://github.com/wconstab, https://github.com/kwen2501	2024-05-29 09:40:25 +00:00
Sam Larsen	6c81856dca	[inductor] Add a subprocess-based parallel compile (#126816 ) Summary: Adds a "safe" parallel compile implementation that a) Popens a sub-process with an entry point we control, and b) Uses a ProcessPoolExecutor in that sub-process to perform parallel compiles. This change essentially squashes these two implementations from jansel, but removes the "thread-based" approach since benchmarking revealed that compile-time performance was poor compared to the existing impl: https://github.com/pytorch/pytorch/pull/124682 https://github.com/pytorch/pytorch/pull/122941 This PR adds the implementation, but defaults to the existing "fork". I'll submit a separate change to enable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126816 Approved by: https://github.com/jansel	2024-05-29 09:40:21 +00:00
Jiong Gong	92bc444ee3	[inductor][cpp] epilogue support for gemm template (#126019 ) As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019 Approved by: https://github.com/jansel ghstack dependencies: #124021	2024-05-29 09:12:03 +00:00
lezcano	00999fd8b1	Prefer flip over index_select (#126783 ) It's faster and has a lower memory footprint in eager. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126783 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #114471	2024-05-29 09:10:25 +00:00
lezcano	8a21532e53	Fix constant propagation pass (#114471 ) This pass was broken in a number of ways, as we were not generating asserts whenever we took it, even though we need to. While doing so, we found that the analysis we were using for choosing whether to generate asserts or not for dynamic shapes was completely broken. Eliminating indirect indexing in this way allows for a number of optimisations. In particular, we can now fuse against these kernels (indirect indexing disallows fusions). The new strategy is as follows:
- We always propagate sympy expressions if we can.
- If an expression was an indirect_indexing, we call `check_bounds`
- We also call `check_bounds` within `CSEProxy.indirect_indexing`
- The checks are issued in the buffer where they would go if they were used in a load
- This makes them always be codegen'd before the loads and stores
- In the case of stores, they will be generated potentially much earlier than the stores themselves, which is fine.

We add quite a few asserts to preexisting tests to strengthen them. In particular, we make sure that issuing an assert plays well with all kinds of C++ vectorisation. For now, we rely on the logic within `_maybe_evaluate_static` to prove these bounds. This logic is rather limited though. In the future, we might want to rely on Z3 here to be able to prove bounds in a more general way. Supersedes https://github.com/pytorch/pytorch/pull/113068 Fixes https://github.com/pytorch/pytorch/issues/121251 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114471 Approved by: https://github.com/peterbell10	2024-05-29 09:10:25 +00:00
Animesh Jain	51b22d9cf2	[dynamo] Support enum construction (#127364 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127364 Approved by: https://github.com/yanboliang ghstack dependencies: #127263	2024-05-29 08:09:21 +00:00
Jason Ansel	ad7700bfdb	[inductor] Misc changes (#127307 ) Pulling unrelated changes out of the larger halide PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/127307 Approved by: https://github.com/yanboliang	2024-05-29 08:00:06 +00:00
Jiong Gong	cef776bcd1	[inductor][cpp] GEMM template (infra and fp32) (#124021 ) This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure: Template abstractions similar to the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template: This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance: The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements with follow-up PRs, including optimizations in kernels as well as fusions. The perf gains are only observed for a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details. 
Static shapes

| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12
cait_m36_384: 1.18x

Dynamic shapes

| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021 Approved by: https://github.com/jansel	2024-05-29 07:37:41 +00:00
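For readers unfamiliar with the blocking structure this message describes, the following pure-Python sketch shows the general idea of an outer blocked loop nest calling a micro-kernel on tiles. It is not the `CppPackedGemmTemplate` or `CppMicroGemm` code: there is no weight prepacking, threading, or vectorization, and the names `blocked_gemm`, `micro_gemm` and the block-size parameters are hypothetical.

```python
def micro_gemm(a, b, c, m0, m1, n0, n1, k0, k1):
    """Accumulate the tile C[m0:m1, n0:n1] += A[m0:m1, k0:k1] @ B[k0:k1, n0:n1]."""
    for i in range(m0, m1):
        for j in range(n0, n1):
            acc = c[i][j]
            for p in range(k0, k1):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc


def blocked_gemm(a, b, m_block=2, n_block=2, k_block=2):
    """Compute C = A @ B by looping over blocks and calling the micro-kernel.

    Mirrors the constraint in the text: N must be a multiple of the
    (here hypothetical) register block n_block, while M may be any size.
    """
    m, k = len(a), len(a[0])
    n = len(b[0])
    assert len(b) == k, "inner dimensions must match"
    assert n % n_block == 0, "N must be a multiple of the register block"
    c = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, m_block):          # M blocks (tail block may be short)
        m1 = min(m0 + m_block, m)
        for n0 in range(0, n, n_block):      # N blocks (always full)
            for k0 in range(0, k, k_block):  # K blocks (tail block may be short)
                k1 = min(k0 + k_block, k)
                micro_gemm(a, b, c, m0, m1, n0, n0 + n_block, k0, k1)
    return c


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]        # M = 3, K = 3
B = [[1, 0, 2, 1], [0, 1, 3, 1], [1, 1, 0, 2]]  # K = 3, N = 4
print(blocked_gemm(A, B))
```

Keeping the micro-kernel separate from the loop nest is the design point worth noting: the outer loops decide which tiles to visit, and the inner kernel is where architecture-specific code would go.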
William Wen	719589c9bf	[dynamo] move bytecode tests from test_misc to new bytecode test file (#127329 ) Also merge with bytecode hook test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127329 Approved by: https://github.com/yanboliang, https://github.com/jansel	2024-05-29 06:10:59 +00:00
Wanchao Liang	a60b06bd2b	[dtensor] update public API docs (#127340 ) This PR updates the API documentation for the public-facing APIs. Each API needs more examples, but the plan is to add them in a separate PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127340 Approved by: https://github.com/wz337 ghstack dependencies: #127338, #127339	2024-05-29 05:18:47 +00:00
Wanchao Liang	2c9a420da3	[dtensor] move some modules to private namespace (#127339 ) as titled, moving some modules that are mainly for DTensor private usage to be a private module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127339 Approved by: https://github.com/awgu ghstack dependencies: #127338	2024-05-29 05:18:47 +00:00
Wanchao Liang	72ef2555e3	[dtensor] make Partial placement public (#127338 ) As titled, partial placement is standardized right now, and I think we would want to expose this as a public API to allow users to annotate the sharding layout more easily, given that we already have use cases internally/externally that use Partial. Keeping the old _Partial name for a while for BC reasons. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127338 Approved by: https://github.com/awgu, https://github.com/wz337, https://github.com/kwen2501	2024-05-29 05:18:47 +00:00
William Wen	5359af0c7e	[dynamo] wrap GraphModule exceptions in dynamo-wrapped tests (#126341 ) Better approach to https://github.com/pytorch/pytorch/pull/126197 to catch issues like https://github.com/pytorch/pytorch/issues/125568. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126341 Approved by: https://github.com/anijain2305, https://github.com/jansel	2024-05-29 05:18:04 +00:00
laithsakka	cdf2133186	Add compile time profiler for non fbcode targets (#126904 ) This is a tool that allow profiling compile time using strobelight profiler, its a meta only tool. but works on non-fbcode targets. A follow up diff will unify this with caffe2/fb/strobelight/compile_time_profiler.py. example test: ``` run python tools/strobelight/examples/compile_time_profile_example.py ``` ``` python torch/utils/_strobelight/examples/compile_time_profile_example.py strobelight_compile_time_profiler, line 61, 2024-05-23 10:49:28,101, INFO: compile time strobelight profiling enabled strobelight_compile_time_profiler, line 93, 2024-05-23 10:49:28,102, INFO: Unique sample tag for this run is: 2024-05-23-10:49:282334638devvm4561.ash0.facebook.com strobelight_compile_time_profiler, line 94, 2024-05-23 10:49:28,102, INFO: You can use the following link to access the strobelight profile at the end of the run: https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22purposes%22%3A[]%2C%22end%22%3A%22now%22%2C%22start%22%3A%22-30%20days%22%2C%22filterMode%22%3A%22DEFAULT%22%2C%22modifiers%22%3A[]%2C%22sampleCols%22%3A[]%2C%22cols%22%3A[%22namespace_id%22%2C%22namespace_process_id%22]%2C%22derivedCols%22%3A[]%2C%22mappedCols%22%3A[]%2C%22enumCols%22%3A[]%2C%22return_remainder%22%3Afalse%2C%22should_pivot%22%3Afalse%2C%22is_timeseries%22%3Afalse%2C%22hideEmptyColumns%22%3Afalse%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22compare%22%3A%22none%22%2C%22samplingRatio%22%3A%221%22%2C%22metric%22%3A%22count%22%2C%22aggregation_field%22%3A%22async_stack_complete%22%2C%22top%22%3A10000%2C%22aggregateList%22%3A[]%2C%22param_dimensions%22%3A[%7B%22dim%22%3A%22py_async_stack%22%2C%22op%22%3A%22edge%22%2C%22param%22%3A%220%22%2C%22anchor%22%3A%220%22%7D]%2C%22order%22%3A%22weight%22%2C%22order_desc%22%3Atrue%2C%22constraints%22%3A[[%7B%22column%22%3A%22sample_tags%22%2C%22op%22%3A%22all%22%2C%22value%22%3A[%22[%5C%222024-05-23-10:49:282
334638devvm4561.ash0.facebook.com%5C%22]%22]%7D]]%2C%22c_constraints%22%3A[[]]%2C%22b_constraints%22%3A[[]]%2C%22ignoreGroupByInComparison%22%3Afalse%7D&view=GraphProfilerView&&normalized=1712358002&pool=uber strobelight_function_profiler, line 241, 2024-05-23 10:49:34,943, INFO: strobelight run id is: 3507039740348330 strobelight_function_profiler, line 243, 2024-05-23 10:50:00,907, INFO: strobelight profiling running strobelight_function_profiler, line 224, 2024-05-23 10:50:02,741, INFO: strobelight profiling stopped strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Total samples: 7 strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/75cxdro3 strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qsgydsee strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:06,174, INFO: 1 strobelight success runs out of 1 non-recursive compilation events. strobelight_function_profiler, line 241, 2024-05-23 10:50:08,137, INFO: strobelight run id is: 8721740011604497 strobelight_function_profiler, line 243, 2024-05-23 10:50:34,801, INFO: strobelight profiling running strobelight_function_profiler, line 224, 2024-05-23 10:50:36,803, INFO: strobelight profiling stopped strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Total samples: 3 strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qmi2ucwp strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/7fjkhs9i strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:41,289, INFO: 2 strobelight success runs out of 2 non-recursive compilation events. 
strobelight_function_profiler, line 241, 2024-05-23 10:50:43,597, INFO: strobelight run id is: 1932476082259558 strobelight_function_profiler, line 243, 2024-05-23 10:51:09,791, INFO: strobelight profiling running strobelight_function_profiler, line 224, 2024-05-23 10:51:11,883, INFO: strobelight profiling stopped strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Total samples: 3 strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/vy1ujxec strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/2xgadviv strobelight_compile_time_profiler, line 120, 2024-05-23 10:51:16,219, INFO: 3 strobelight success runs out of 3 non-recursive compilation events. ``` or pass TORCH_COMPILE_STROBELIGHT=TRUE for any torch compile python program. ex running on XLNetLMHeadModel. ``` TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 time python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp --only XLNetLMHeadModel ``` result: Pull Request resolved: https://github.com/pytorch/pytorch/pull/126904 Approved by: https://github.com/aorenste ghstack dependencies: #126444	2024-05-29 05:06:37 +00:00
Boyuan Feng	2b72e2a596	[Cudagraph] better support for streams (#126809 ) This PR fixes Issue #124391. There are two root causes.
### Root Cause 1 [better support for stream during cudagraph capture]
When recording a new function, CUDA graph tree records memory block states (e.g., address, size, allocated, etc) via `getCheckpointState`. Let's say the record is called `block_state`. Later, CUDA graph tree would like to recover exactly the same memory block states by `apply_checkpoint_execution_state_in_allocator`, which a) frees all memory blocks; b) allocates all recorded block states (regardless of `block_state->allocated`); c) frees blocks with `block_state->allocated == False`; and d) checks that block_state matches the remaining blocks (e.g., `block_state->ptr == block->ptr`). An error may occur when multiple streams exist during recording. [Note](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L2149-L2152) that a block will not be merged with other blocks if it is used by some streams, even if `block->allocated==False`. This may lead to a mismatch between `block_state->ptr` and `block->ptr` in `apply_checkpoint_execution_state_in_allocator`. This PR solves the issue by not inserting events if they come from a stream used during cudagraph capture. The reason is that we know all events or streams used during cudagraph capture must have been completed before cudagraph capture finishes.
### Root Cause 2 [fix a bug in checkpoint state]
When we call getCheckpointState, we create the block state. At that time, we do not record block->device. So block_state->device == 0 no matter the real value of block->device. See [how](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L744-L750) BlockState is created from a block. When using the block state during setSegmentStateToCheckpoint, we use [block_state.device (=0)](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L1526). This leads to errors. 
We fixed this issue by recording block->device into block_state in getCheckpointState. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126809 Approved by: https://github.com/eellison	2024-05-29 04:52:35 +00:00
Shengbao Zheng	a41f828da7	[c10d] fix group_name/group_desc set up in eager initialization (#127053 ) Summary: ProcessGroupNCCL sets up group_name/desc in the c10d log and NCCL when initializing the NCCL communicator. In eager initialization mode, pg_name and pg_desc are set after communicator initialization, so the information won't be available in the pytorch log or the NCCL communicator. This PR fixes this by setting pg_name/desc earlier. Differential Revision: D57759816 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127053 Approved by: https://github.com/wconstab, https://github.com/kwen2501	2024-05-29 04:42:34 +00:00
dshi7	932e04142d	extract calculate_time_spent from print_time_report (#127362 ) Fixes #ISSUE_NUMBER wrap certain steps in a separate function for easier TTFB instrumentation (fb internal use case) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127362 Approved by: https://github.com/yanboliang, https://github.com/mengluy0125	2024-05-29 04:37:15 +00:00
PaliC	a25b28a753	[Split Build] Add option to create libtorch wheel and use it to build pytorch as a separate wheel (#126328 ) Creates an option to just build the libtorch portion of pytorch such that we have the necessary .so files. Then it builds a torch package using the libtorch wheel. These options are enabled using ` BUILD_LIBTORCH_WHL` and `BUILD_PYTHON_ONLY`. We run ``` BUILD_LIBTORCH_WHL=1 python setup.py install python setup.py clean BUILD_PYTHON_ONLY=1 python setup.py install ``` to produce ``` sahanp@devgpu086 ~/pytorch (detached HEAD\|REBASE-i 3/5)> ls /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/torch/lib/ (pytorch-3.10) libshm.so* libtorch_global_deps.so* libtorch_python.so* sahanp@devgpu086 ~/pytorch (detached HEAD\|REBASE-i 3/5)> ldd build/lib/libtorch_python.so (pytorch-3.10) linux-vdso.so.1 (0x00007ffdc2d37000) libtorch.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch.so (0x00007f539fe99000) libshm.so => /home/sahanp/pytorch/build/lib/libshm.so (0x00007f539fe90000) libcudnn.so.8 => /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn.so.8 (0x00007f539e800000) libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007f539e400000) libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f539e000000) libm.so.6 => /lib64/libm.so.6 (0x00007f539fda5000) libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f539ebe5000) libc.so.6 => /lib64/libc.so.6 (0x00007f539dc00000) /lib64/ld-linux-x86-64.so.2 (0x00007f539fea0000) libtorch_cpu.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch_cpu.so (0x00007f5392400000) libtorch_cuda.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch_cuda.so (0x00007f5380000000) librt.so.1 => /lib64/librt.so.1 (0x00007f539fd9e000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f539fd99000) libdl.so.2 => /lib64/libdl.so.2 (0x00007f539fd94000) libc10.so => 
/home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libc10.so (0x00007f539eb07000) libmkl_intel_lp64.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_intel_lp64.so.2 (0x00007f537ec00000) libmkl_gnu_thread.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_gnu_thread.so.2 (0x00007f537ce00000) libmkl_core.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_core.so.2 (0x00007f5378800000) libomp.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/libomp.so (0x00007f539e707000) libcupti.so.12 => /usr/local/cuda/lib64/libcupti.so.12 (0x00007f5377e00000) libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007f5377a00000) libc10_cuda.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libc10_cuda.so (0x00007f539ea6a000) libcusparse.so.12 => /usr/local/cuda/lib64/libcusparse.so.12 (0x00007f5368400000) libcufft.so.11 => /usr/local/cuda/lib64/libcufft.so.11 (0x00007f535ee00000) libcusolver.so.11 => /usr/local/cuda/lib64/libcusolver.so.11 (0x00007f534c800000) libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007f5346200000) libcublas.so.12 => /usr/local/cuda/lib64/libcublas.so.12 (0x00007f533f800000) libcublasLt.so.12 => /usr/local/cuda/lib64/libcublasLt.so.12 (0x00007f531e800000) libutil.so.1 => /lib64/libutil.so.1 (0x00007f539ea63000) libnvJitLink.so.12 => /usr/local/cuda/lib64/libnvJitLink.so.12 (0x00007f531b800000) sahanp@devgpu086 ~/pytorch (detached HEAD\|REBASE-i 3/5)> ldd build/lib/libtorch_global_deps.so (pytorch-3.10) linux-vdso.so.1 (0x00007ffc265df000) libmkl_intel_lp64.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_intel_lp64.so.2 (0x00007fa93fc00000) libmkl_gnu_thread.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_gnu_thread.so.2 (0x00007fa93de00000) libmkl_core.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_core.so.2 (0x00007fa939800000) libm.so.6 => /lib64/libm.so.6 (0x00007fa940f05000) libcudart.so.12 => 
/usr/local/cuda/lib64/libcudart.so.12 (0x00007fa939400000) libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007fa939000000) libgomp.so.1 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libgomp.so.1 (0x00007fa93fb07000) libc.so.6 => /lib64/libc.so.6 (0x00007fa938c00000) libdl.so.2 => /lib64/libdl.so.2 (0x00007fa940efe000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa940ef9000) /lib64/ld-linux-x86-64.so.2 (0x00007fa940ff5000) librt.so.1 => /lib64/librt.so.1 (0x00007fa940ef2000) libstdc++.so.6 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libstdc++.so.6 (0x00007fa93921d000) libgcc_s.so.1 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libgcc_s.so.1 (0x00007fa93faec000) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126328 Approved by: https://github.com/atalman	2024-05-29 04:33:56 +00:00
Ke Wen	8090145936	[pipelining] add back support for multi-use parameters/buffers (#126653 )
## Motivation
Resolves #126626 to support TorchTitan. With this PR, we add back support for cases where a parameter or buffer is used in multiple stages. An example of such usage is in LLaMA (torchtitan), code snippet:
```
for layer in self.layers.values():
    h = layer(h, self.freqs_cis)
```
## Solution
Step 1: Remove the previous guards of `if len(node.users) == 1`.
Step 2: Call `move_param_to_callee` multiple times, one for each stage ("callee").
Step 3: Delay deletion of the `get_attr` node (for getting the param) from root till this param has been sunk into each stage that uses it.
The PR also cleans up the old code around this (dropping the TRANSMIT mode and supporting REPLICATE mode only).
## Test
Changed the `ExampleCode` model to use `mm_param1` in multiple stages.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126653 Approved by: https://github.com/pianpwk	2024-05-29 03:36:47 +00:00
Jon Janzen	781f26240a	Add script to copy distributed commits to stable branch (#126918 ) This will be used as part of a prototype of a stable pytorch with a fast-moving distributed folder Tasks: T189915739 Test plan: I ran the script in a few configurations on my local machine. It worked as expected Pull Request resolved: https://github.com/pytorch/pytorch/pull/126918 Approved by: https://github.com/seemethere, https://github.com/malfet	2024-05-29 03:33:44 +00:00
Jiashen Cao	10d2373abd	Add a registry for GraphModuleSerializer (#126550 ) This PR adds a registration function and a global registry for GraphModuleSerializer. After this PR, custom serialization methods can be done through registration instead of subclassing for ease of maintenance. ## Changes - Add a test case where it injects custom op to test serialization. - Add custom op handler - Change allowed op for verifier Co-authored-by: Zhengxu Chen <zhxchen17@outlook.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126550 Approved by: https://github.com/zhxchen17	2024-05-29 03:12:48 +00:00
PyTorch MergeBot	cdbb2c9acc	Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051 )" This reverts commit 4fdbaa794f9d5af2f171f772a51cb710c51c925f. Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2136428735))	2024-05-29 03:02:35 +00:00
PyTorch MergeBot	7a506dd005	Revert "[Caffe2]Remove Caffe2 proto files (#126134 )" This reverts commit a40658481ada9ecfd5716513a8537818c79cb3ef. Reverted https://github.com/pytorch/pytorch/pull/126134 on behalf of https://github.com/malfet due to Broke bazel builds, see https://github.com/pytorch/pytorch/actions/runs/9278148147/job/25528691981 ([comment](https://github.com/pytorch/pytorch/pull/126134#issuecomment-2136373096))	2024-05-29 01:53:45 +00:00
cyy	669560d51a	Use hidden visibility in OBJECTCXX files (#127265 ) Since it can eliminate some linker warnings on MacOS Pull Request resolved: https://github.com/pytorch/pytorch/pull/127265 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-05-29 01:40:23 +00:00
PyTorch MergeBot	52e448a7f9	Revert "Enable Wunused-variable on tests (#127161 )" This reverts commit 6436a6407d9d65c42efb8e55beeb8b391b67fd64. Reverted https://github.com/pytorch/pytorch/pull/127161 on behalf of https://github.com/malfet due to Broke ReduceTests on Windows (by testing more), see https://github.com/pytorch/pytorch/actions/runs/9274944325/job/25519484937 ([comment](https://github.com/pytorch/pytorch/pull/127161#issuecomment-2136339435))	2024-05-29 01:09:45 +00:00
Michael Hsu	85172fbe84	Back out "Prevent partitioner from ever saving views (#126446 )" (#127316 ) Summary: Revert "Prevent partitioner from ever saving views (#126446)" due to a torchinductor failure on CU Training Framework tests. Reviewed By: Chillee Differential Revision: D57868343 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127316 Approved by: https://github.com/Chillee	2024-05-29 00:29:44 +00:00
cyy	a40658481a	[Caffe2]Remove Caffe2 proto files (#126134 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/126134 Approved by: https://github.com/r-barnes	2024-05-29 00:22:14 +00:00
Janani Sriram	f4cbcff8ef	[TorchScript] Expand TorchScript __init__ annotation warning (#127045 ) Summary: Expand TorchScript `__init__` annotation warning to `list` and `dict` with reference to GSD task T187638414 and annotation warning reproduction D56834720. Currently, the TorchScript compiler ignores and throws `UserWarning`s for the following annotation types for empty values within the `__init__` function: `List`, `Dict`, `Optional`. However, the compiler should additionally cover warnings for `list` and `dict`. This diff adds support for `list` and `dict`. Test Plan: Added 4 new unit tests: `test_annotated_empty_list_lowercase` and `test_annotated_empty_dict_lowercase` verify that TorchScript throws UserWarnings for the list and dict type annotations on empty values. ``` (base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_empty_list_lowercase ... Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` (base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_empty_dict_lowercase ... Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` `test_annotated_with_jit_empty_list_lowercase` and `test_annotated_with_jit_empty_dict_lowercase` verify that TorchScript throws UserWarnings for the list and dict type annotations on empty values with the jit annotation. ``` (base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_with_jit_empty_list_lowercase ... Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. 
Build failure 0 ``` ``` (base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_with_jit_empty_dict_lowercase ... Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Differential Revision: D57752002 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127045 Approved by: https://github.com/davidberard98	2024-05-28 23:49:10 +00:00
Richard Barnes	1be7e4086a	Drop caffe2 nomnigraph (#127086 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127086 Approved by: https://github.com/Skylion007	2024-05-28 23:20:46 +00:00
Peter Bell	f6ef832e87	[inductor] Use symbolic_hint when bounding fallback size hint (#127262 ) The previous fallback ignores any known hint values in the expression and only looks at the value ranges. By using the `symbolic_hint` we will use both hints and value ranges. Also removed the recursive use of `size_hint` on the bounds, since these should always be constants. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127262 Approved by: https://github.com/lezcano ghstack dependencies: #127251	2024-05-28 22:51:45 +00:00
Peter Bell	26a8fa3a06	[inductor] Restore ExpandView sanity checks (#127251 ) This restores the assertion removed in #124864 The handling of unbacked symints is incidental, the main purpose of this assert was to catch bugs in lowerings. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127251 Approved by: https://github.com/lezcano	2024-05-28 22:51:45 +00:00
Andrew Gu	db0a0ecb60	[FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024 ) This PR shows that we can use FSDP solely for CPU offloading when composing with N-way TP. Each FSDP mesh is just 1 rank. This was motivated from an ask on Slack :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127024 Approved by: https://github.com/weifengpy, https://github.com/wanchaol	2024-05-28 22:51:36 +00:00
Anshul Sinha	6b24155827	[dtensor][debug] added c10d gather, reduce, scatter tracing to CommDebugMode (#127134 ) Summary Added c10d gather, reduce, and scatter tracing to CommDebugMode and edited test case in test_comm_mode to include added features. Test Plan pytest test/distributed/_tensor/debug/test_comm_mode.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127134 Approved by: https://github.com/XilunWu ghstack dependencies: #127025, #127029, #127040	2024-05-28 22:48:07 +00:00
eqy	a76faff71c	[NCCL][CUDA] Optionally avoid rethrowing CUDA Errors in NCCL Watchdog (#126587 ) Doesn't affect current behavior by default, for #126544 I'm not sure what the exact mechanism is here but CUDA errors appear to already be thrown in the main process, meaning that the watchdog is separately throwing CUDA errors again. However this rethrown error causes the process to be terminated as it cannot be handled from user code (which doesn't have visibility of the watchdog thread). Pull Request resolved: https://github.com/pytorch/pytorch/pull/126587 Approved by: https://github.com/kwen2501	2024-05-28 22:17:15 +00:00
eellison	93bfe57144	cudagraphs: fix backward hooks & fsdp interaction (#126914 ) Fixes the following error, which would occur when composing pt2 fsdp and cudagraphs: > ERROR: expected to be in states [<TrainingState.FORWARD_BACKWARD: 2>] but current state is TrainingState.IDLE Cudagraphs caches output tensor impls in the fast path, so we were inadvertently accumulating multiple hooks on what should have been fresh allocations. From the code comment: ``` # this output represents a fresh allocated tensor. # We return the same TensorImpl from run to run to avoid overhead. # autograd.Function will reset the Autograd meta of output tensors # as part of aot_autograd, but _backward_hooks are stored on tensors separately, # so we need to manually reset hooks. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126914 Approved by: https://github.com/awgu, https://github.com/xmfan	2024-05-28 22:07:41 +00:00
Chirag Pandya	4154c8358a	[BE] Wrap store check in a try/catch (#127030 ) Summary: Global store may already have been destroyed when we do the check. This leads to a Null Pointer Exception. This caused a SEV in Production. Stack trace from crash: ``` [trainer2]:# 5 c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) [trainer2]:# 6 c10d::ProcessGroupNCCL::heartbeatMonitor() ``` Test Plan: Will deploy in small training job and with `NCCL_DUMP_ON_TIMEOUT` set. Job should complete with no exceptions. Reviewers: Subscribers: Tasks: T190163458 Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/127030 Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang	2024-05-28 20:57:36 +00:00
Pian Pawakapan	f206c5c628	[export] handle new roots & root swapping in derived dims suggested fixes (#125543 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125543 This PR addresses 2 issues with derived dim suggested fixes, 1) newly introduced roots, and 2) root swapping. 1 \| Newly introduced roots appear with modulo guards, e.g. Mod(dx, 2) = 0 suggests dx is a derived dim equal to 2 * _dx, introducing a new root _dx. Currently the final suggested fixes handle this correctly, but we can get intermediate results where related derived dims don't rely on a unified root, and are a mixture of min/max range and derived suggestions. For example: ``` "dx": {"eq": 3*_dx-1, "max": 36} "dy": {"eq": dx+1} This should lead to suggested fixes _dx = Dim('_dx', max=12) dx = 3 * _dx - 1 dy = 3 * _dx ``` This PR prettifies the suggested fixes routine by unifying to a single root, and making each intermediate suggestion either a derived dim or min/max range, not both. 2 \| The current suggested fixes for derived dims can lead to root dims/derived dims being swapped, e.g. `dy - 1, dy` -> `dx, dx + 1`. This leads to problematic suggested fixes that look like `dy - 1 = Dim("dy - 1")` since we don't have access to the original variable name. This PR only adds a suggested fix for the root dim, and removes all other derived suggestions. 
For example, with the export test case test_derived_dim_out_of_order_simplified: ``` _dimz = torch.export.Dim("_dimz", min=6, max=8) dimy = _dimz - 1 dimx = dimy - 1 dimz = torch.export.Dim("dimz", min=6, max=8) # doesn't work, should be = _dimz class Foo(torch.nn.Module): def forward(self, x, y, z): return x + y[1:] + z[2:] foo = Foo() u, v, w = torch.randn(5), torch.randn(6), torch.randn(7) export( foo, (u, v, w), dynamic_shapes=({0: dimx}, {0: dimy}, {0: dimz}), ) ``` Before: ``` Suggested fixes: _dimz = Dim('_dimz', min=3, max=9223372036854775807) # 2 <= _dimz - 1 <= 9223372036854775806 _dimz - 2 = Dim('_dimz - 2', min=4, max=6) _dimz = Dim('_dimz', min=2, max=9223372036854775806) # 2 <= _dimz <= 9223372036854775806 _dimz - 1 = _dimz - 1 dimz = _dimz ``` New suggested fixes: ``` Suggested fixes: dimz = _dimz ``` Note: This assumes the specified derived relations between dims are correct. This should be valid because: 1) if the relation is plain wrong (e.g. (dx, dx - 1) provided with inputs (6, 4)), this gets caught in beforehand in produce_guards. 2) if the relation is correct but does not match the emitted guard, for example: ``` def forward(self, x, y): return x.reshape([-1]) + y # guard: s0 * 2 = s1 dx = Dim("dx") export( model, (torch.randn(6, 2), torch.randn(12)), dynamic_shapes={"x": (dx, 2), "y": (dx + 6, )} ) ``` This produces two linear equations, leading to specialization since a) produce_guards is able to solve for a concrete value, and b) the export constraint solver will anyways force specializations due to range constraints. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125543 Approved by: https://github.com/avikchaudhuri	2024-05-28 20:41:43 +00:00
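The root-unification arithmetic from the modulo-guard example in the commit above (a derived dim equal to three times the root minus one, capped at 36, yielding a root maximum of 12) can be checked with a few lines of plain Python. This is only a worked calculation under that example's numbers, not the export constraint solver; `max_root` is a hypothetical helper.

```python
def max_root(coeff: int, offset: int, derived_max: int) -> int:
    """Largest integer r with coeff * r + offset <= derived_max (requires coeff > 0)."""
    return (derived_max - offset) // coeff


# dx = 3 * _dx - 1 with dx <= 36  ->  upper bound for the root _dx
root_max = max_root(3, -1, 36)

# dy = dx + 1, so in terms of the unified root dy = 3 * _dx
dy_matches_root = all((3 * r - 1) + 1 == 3 * r for r in range(1, root_max + 1))
```

Here `root_max` evaluates to 12, matching the `max=12` in the suggested fix, and `dy_matches_root` confirms that `dy` can be written against the same root.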
cyy	0a9d73a814	Remove c10::guts::bool_constant and c10::guts::negation (#127300 ) They are not used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127300 Approved by: https://github.com/r-barnes	2024-05-28 19:55:20 +00:00
lancerts	03005bb655	Improve the clarity of the torch.Tensor.backward doc (#127201 ) Improve the clarity of the torch.Tensor.backward doc, particularly wrt the arg `gradient`. Reference https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html, ``` We need to explicitly pass a gradient argument in Q.backward() because it is a vector. gradient is a tensor of the same shape as Q, and it represents the gradient of Q w.r.t. itself ``` @janeyx99 feel free to assign to the corresponding reviewers, thanks Co-authored-by: Jeffrey Wan <soulitzer@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127201 Approved by: https://github.com/soulitzer	2024-05-28 19:25:51 +00:00
Manuel Candales	f600faf248	[metal] Improve perf of int4pack_mm shader (#127135 ) Using vectorized data types and using SIMD groups to optimize memory access pattern Pull Request resolved: https://github.com/pytorch/pytorch/pull/127135 Approved by: https://github.com/malfet	2024-05-28 18:22:58 +00:00
feifan	c9172d4471	print default value in FunctionSignature (#127059 ) Fixes #[126758](https://github.com/pytorch/pytorch/issues/126758) and #[126759](https://github.com/pytorch/pytorch/issues/126759) The output information in the issues is not accurate because `FunctionSignature::toString()` prints the schema strings without defaults. `cb6ef68caa/torch/csrc/utils/python_arg_parser.cpp (L1282-L1283)` This PR adds a `default_value` that saves the default string, which should be printed. An alternative would be to add a new API that converts `default_bool/default_int` back to a string, which is slightly more complicated. Result: ![image](https://github.com/pytorch/pytorch/assets/37650440/f58a4cbf-b0f4-4c81-9106-59f0d35c54ea) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127059 Approved by: https://github.com/janeyx99	2024-05-28 18:04:31 +00:00
Nikita Shulga	045309aa35	[MPS] Enable torch.mm and friends for complex dtypes (#127241 ) - Add `supportedFloatingOrComplexType` - Change dtype check to those - Extend low-precision fp32 list to complex types - Mark conv2d as supported now, as it was failing due to tighter accuracy constraints than the same op for the float32 dtype Fixes https://github.com/pytorch/pytorch/issues/127178 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127241 Approved by: https://github.com/janeyx99	2024-05-28 17:56:13 +00:00
IvanKobzarev	829f594d7d	[small] guard_size_oblivious, skip check for meta (#127298 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127298 Approved by: https://github.com/ezyang	2024-05-28 17:53:08 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	9521528f71	Log export result of torch.jit.trace to scuba (#126900 ) Summary: We want to track how well torch.jit.trace can be converted to export at large scale. As a first step, for all torch.jit.trace unit tests we log whether we can convert the traced module to an export module or export the model directly. Test Plan: CI Differential Revision: D57629682 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126900 Approved by: https://github.com/SherlockNoMad	2024-05-28 17:49:34 +00:00
PyTorch MergeBot	3f79e09515	Revert "Made some minor improvements to flexattention perf + added more autotune configs (#126811 )" This reverts commit 84e59f052d4342ac9453703be55758de102e20d3. Reverted https://github.com/pytorch/pytorch/pull/126811 on behalf of https://github.com/PaliC due to breaking on V100s / internal tests ([comment](https://github.com/pytorch/pytorch/pull/126811#issuecomment-2135798983))	2024-05-28 17:48:26 +00:00
Jiashen Cao	254783ce80	[Fix]: populate input parameter name when convert TorchScript to ExportedProgram (#126787 ) ## Goal As title ## Design Each TorchScript module has a `code` property that provides the original source code for the `forward` function, so I implemented a function that extracts the `forward` function signature using the AST parser. Alternatives considered: * Directly parsing the source code as a string --> would be very buggy * Directly using the `compile` function in Python to get the function object --> raises a lot of exceptions because of missing packages or undefined variable names Pull Request resolved: https://github.com/pytorch/pytorch/pull/126787 Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan	2024-05-28 17:33:44 +00:00
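The AST-based approach described in the commit above can be sketched with the standard-library `ast` module. This is a minimal illustration, not the code from the PR; `forward_arg_names` is a hypothetical helper, and `example_code` is an assumed stand-in for the string a scripted module's `code` property might return.

```python
import ast
from typing import List


def forward_arg_names(code: str) -> List[str]:
    """Return the positional parameter names of `forward`, excluding `self`."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == "forward":
            names = [arg.arg for arg in node.args.args]
            return [n for n in names if n != "self"]
    raise ValueError("no `forward` function found in source")


# Assumed example source; annotations are only parsed, never evaluated.
example_code = '''
def forward(self, x: Tensor, y: Tensor) -> Tensor:
    return torch.add(x, y)
'''

arg_names = forward_arg_names(example_code)  # ['x', 'y']
```

Because `ast.parse` only builds a syntax tree, names such as `Tensor` or `torch` do not need to be importable, which avoids the missing-package and undefined-name exceptions mentioned for the `compile` alternative.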
Jez Ng	122282111d	[inductor][reland] Various improvements to error handling during autotuning (#126847 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126847 This is a reland of [D56764094](https://www.internalfb.com/diff/D56764094) / https://github.com/pytorch/pytorch/pull/125762. It was originally reverted due to rebase conflicts. Original commit changeset: 45875a1e5de2 Original Phabricator Diff: [D56764094](https://www.internalfb.com/diff/D56764094) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126847 Approved by: https://github.com/chenyang78	2024-05-28 17:22:26 +00:00
Swayam	df360e2add	Update derivatives.yaml (#127193 ) Fixed a typo in docs Pull Request resolved: https://github.com/pytorch/pytorch/pull/127193 Approved by: https://github.com/soulitzer	2024-05-28 16:56:03 +00:00
Angela Yi	cbb79a2baf	[export] Disable backend decomps for capture_pre_autograd (#127120 ) Differential Revision: D57785713 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127120 Approved by: https://github.com/ydwu4	2024-05-28 16:37:13 +00:00
cyy	c40408850a	[1/N] Fix clang-tidy warnings in aten/src/ATen/cuda/ (#127183 ) Fixes clang-tidy warnings Pull Request resolved: https://github.com/pytorch/pytorch/pull/127183 Approved by: https://github.com/soulitzer, https://github.com/Skylion007	2024-05-28 15:35:29 +00:00
cyy	3d88c618d5	Concat namespaces in torch/csrc/profiler and other fixes (#127266 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127266 Approved by: https://github.com/soulitzer	2024-05-28 15:21:32 +00:00
rzou	4d4d2a96f2	Add space in MetaFallbackKernel.cpp error message (#127291 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127291 Approved by: https://github.com/Skylion007	2024-05-28 13:54:38 +00:00
atalman	a6b994ed54	Fix lint after #126845 (#127286 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127286 Approved by: https://github.com/NicolasHug, https://github.com/DanilBaibak	2024-05-28 12:38:27 +00:00
chilli	ec8b254ef4	Refactored template codegen to explicitly set current body when generating code (#127144 ) The main motivation for this refactor is that today, when generating templates, this is what happens. ``` def_kernel() # registers hook for fully generating function definition store_output() # registers hook for generating the output store. also keeps a number of things generated on `self.body`. ``` Later on, when we codegen the template: `f8c4c268da/torch/_inductor/codegen/simd.py (L1402)` ``` epilogue_node.codegen() # Also writes to body! template.finalize() # Calls the above two hooks for def_kernel and store_output, which then reads from the accumulated `self.body` ``` Today, this is fine, as long as `store_output` is the last function called in the template. However, there's a couple things we probably want to do with kernels that makes this annoying. 1. In FlexAttention backwards, we might want a `modification` to be positioned after the `store_output` (just logically from a code organization POV). This doesn't work today because `modification` also needs to codegen a subgraph, but writing to `body` here conflicts with `store_output`'s implicit saved state on `self.body`. 2. If we want to support prologue fusion, we need to go through a bunch of contortions today to call the template hook finalization a couple times (https://github.com/pytorch/pytorch/pull/121211/files#diff-73b89475038a5b4705da805f1217783883fb90398ee1164995db392fc4a342c1R322) 3. The current code also makes it quite difficult to support fusion into multiple output nodes. To resolve this, I do two things: 1. I remove the default `self.body` on `TritonTemplateKernel`. Instead, I have a dict of `self.subgraph_bodies`, which can be enabled in a context with `TritonTemplateKernel.set_subgraph_body`. This allows multiple different template functions to write to their own isolated bodies. 2. I add functions that allow you to finalize specific hooks on `PartialRender`. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127144 Approved by: https://github.com/jansel	2024-05-28 09:49:13 +00:00
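The idea of per-subgraph bodies selected through a context manager, as described in the commit above, can be sketched in plain Python. This is a toy model rather than Inductor's `TritonTemplateKernel`; only the names `subgraph_bodies` and `set_subgraph_body` come from the commit message, while `TemplateKernelSketch`, `writeline`, and the emitted strings are assumptions for illustration.

```python
import contextlib
from typing import Dict, Iterator, List


class TemplateKernelSketch:
    """Toy stand-in for a template kernel that keeps one code buffer per subgraph."""

    def __init__(self) -> None:
        self.subgraph_bodies: Dict[str, List[str]] = {}
        self._current: str = ""  # no body is active until one is selected

    @contextlib.contextmanager
    def set_subgraph_body(self, name: str) -> Iterator[None]:
        """Make `name` the active body for the duration of the `with` block."""
        previous = self._current
        self._current = name
        self.subgraph_bodies.setdefault(name, [])
        try:
            yield
        finally:
            self._current = previous

    def writeline(self, line: str) -> None:
        """Append a line to whichever body is currently active."""
        if not self._current:
            raise RuntimeError("no subgraph body is active")
        self.subgraph_bodies[self._current].append(line)


kernel = TemplateKernelSketch()
with kernel.set_subgraph_body("store_output"):
    kernel.writeline("tl.store(out_ptr, acc)")
with kernel.set_subgraph_body("modification"):
    kernel.writeline("score = score * scale")
```

Since every write must happen inside an explicitly selected body, one template function can no longer clobber the state another one accumulated, which is the conflict the commit describes for a single shared `self.body`.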
Jiang, Yanbing	457b9f7397	Optimize mask memory for flash attention (#126961 ) The PR optimizes the mask memory for flash attention. Instead of directly converting the whole mask to fp32, we do the conversion block-wisely. This can decrease the peak memory usage (we test in https://huggingface.co/microsoft/Phi-3-mini-128k-instruct, peak memory usage reduces ~50%) and have some performance improvements as well. ### Performance result single socket in Intel (R) Xeon (R) CPU Max 9480 batch_size = 12, q_seq_len = 1030, kv_seq_len = 1179, n_head = 3, head_dim = 33, mask_dim = 4, bool_mask = 0 \| Forward speedup \| Backward speedup -- \| -- \| -- float64 \| 0.82% \| 3.76% float32 \| 2.2% \| 3.9% bfloat16 \| 16.15% \| 7.56% segment-anything-fast Follow https://github.com/pytorch-labs/segment-anything-fast/tree/main/experiments Single socket in Intel (R) Xeon (R) CPU Max 9480 Dtype: bfloat16, models: vit_b and vit_h, test in `SDPA` and `Triton` commit https://github.com/pytorch-labs/segment-anything-fast/blob/main/experiments/run_experiments.py#L199-L200, select the time of 20th iteration. \| vit_b \| \| vit_h \| -- \| -- \| -- \| -- \| -- \| attn_mask w/o block-wise \| attn_mask w/ block-wise \| attn_mask w/o block-wise \| attn_mask w/ block-wise SDPA\| 10.95s/it \| 6.59s/it \| 19.93s/it \| 12.33s/it Triton \| 10.66s/it \| 7.12s/it \| 19.87s/it \| 12.26s/it Pull Request resolved: https://github.com/pytorch/pytorch/pull/126961 Approved by: https://github.com/Valentine233, https://github.com/jgong5	2024-05-28 09:12:18 +00:00
Animesh Jain	1507d5205a	[dynamo][fsdp] Skip Dynamo tracing of __getattr__ if its top-level frame (#127263 ) The generated bytecode for the first frame is below. The inline comments explain the LOAD_ATTR that causes Dynamo to trigger again on `__getattr__`. ~~~ [__bytecode] MODIFIED BYTECODE fn /data/users/anijain/pytorch2/test/dynamo/test_activation_checkpointing.py line 1129 [__bytecode] 1129 0 COPY_FREE_VARS 1 [__bytecode] 2 RESUME 0 [__bytecode] 4 PUSH_NULL [__bytecode] 6 LOAD_GLOBAL 10 (__compiled_fn_1) [__bytecode] 18 LOAD_FAST 0 (x) [__bytecode] 20 LOAD_DEREF 1 (mod) [__bytecode] 22 LOAD_ATTR 6 (_checkpoint_wrapped_module) [__bytecode] 32 LOAD_CONST 1 (0) [__bytecode] 34 BINARY_SUBSCR [__bytecode] 44 LOAD_ATTR 7 (weight) [__bytecode] 54 LOAD_DEREF 1 (mod) [__bytecode] 56 LOAD_ATTR 6 (_checkpoint_wrapped_module) [__bytecode] 66 LOAD_CONST 1 (0) [__bytecode] 68 BINARY_SUBSCR [__bytecode] 78 LOAD_ATTR 8 (bias) # When this optimized bytecode is executed, these two lines call the __getattr__ of ActivationWrapper module. # Dynamo gets invoked on __getattr__. # If we had inlined __getattr__ during the tracing, we would have seen the LOAD_ATTR # on more low level data structures like _modules, obviating the need for CPython # to call python overridden __getattr__. But today, UnspecializedNNModuleVariable # calls python getattr at tracing time (instead of inlining it), resulting in LOAD_ATTR # on the module itself. # To keep Dynamo from tracing __getattr__ on the optimized bytecode, # we can check whether it is the top-level frame and just skip it. [__bytecode] 88 LOAD_DEREF 1 (mod) [__bytecode] 90 LOAD_ATTR 0 (a) [__bytecode] 100 PRECALL 4 [__bytecode] 104 CALL 4 [__bytecode] 114 UNPACK_SEQUENCE 1 [__bytecode] 118 RETURN_VALUE ~~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127263 Approved by: https://github.com/yf225	2024-05-28 08:16:53 +00:00
cyy	d6e3e89804	Remove c10::void_t (#127248 ) OSS version doesn't use it anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127248 Approved by: https://github.com/ezyang	2024-05-28 06:59:20 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	246311c944	Unconditionally add asserts after export (#127132 ) Summary: Today AOTAutograd drops some of the assert nodes, so we reapply them after strict export. Test Plan: CI Reviewed By: angelayi Differential Revision: D57786907 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127132 Approved by: https://github.com/zhxchen17	2024-05-28 06:31:39 +00:00
cyy	e4b245292f	Remove caffe2::tensorrt target code from cuda.cmake (#127204 ) Following #126542. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127204 Approved by: https://github.com/ezyang	2024-05-28 04:42:14 +00:00
cyy	c6b36ec2f9	Remove calls of deprecated _aminmax (#127182 ) While #125995 is pending, the calls should be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127182 Approved by: https://github.com/ezyang	2024-05-28 03:51:45 +00:00
youkaichao	d957c2d5de	[Doc] update default magma cuda version in readme (#122125 ) Since we use cuda 12.1 by default now, it would be better to update the doc. Many people (including me), want to directly copy-paste commands in readme 😉 Let's make our life easier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122125 Approved by: https://github.com/malfet	2024-05-28 03:37:23 +00:00
Shaz Qadeer	7c61e7be5c	Address issue #125307 (#126351 ) PyTorch overrides SymPy's Mod and does its own symbolic simplification. Inspired by issue #125307, this PR adds one more simplification tactic. Fixes #125307 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126351 Approved by: https://github.com/ezyang	2024-05-28 02:03:24 +00:00
hippocookie	8979412442	Enable ufmt format on test files (#126845 ) Fixes some files in #123062 Run lintrunner on files: test/test_nnapi.py, test/test_numba_integration.py, test/test_numpy_interop.py, test/test_openmp.py, test/test_optim.py ```bash $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126845 Approved by: https://github.com/ezyang	2024-05-28 01:42:07 +00:00
cyy	57000708fc	Remove c10::invoke_result (#127160 ) Following #124169, it can be safely removed from the OSS version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127160 Approved by: https://github.com/ezyang	2024-05-28 01:39:28 +00:00
cyy	6436a6407d	Enable Wunused-variable on tests (#127161 ) This PR enables unused-variable warnings in tests and fixes some test code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127161 Approved by: https://github.com/ezyang	2024-05-28 01:37:46 +00:00
cyy	70d8bc2da1	Fix various errors in TCPStoreLibUvBackend.cpp (#127230 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127230 Approved by: https://github.com/Skylion007	2024-05-27 19:14:01 +00:00
Feny Patel	0ff2f8b522	update kineto submodule hash (#126780 ) Summary: update kineto submodule hash Test Plan: CIs Differential Revision: D57620964 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126780 Approved by: https://github.com/Skylion007	2024-05-27 18:11:48 +00:00
Oguz Ulgen	25a9262ba4	Add structured logging for fx graph cache hash (#127156 ) Summary: Add structured logging for fx graph cache hash so that we can debug MAST jobs easily. Test Plan: ad hoc testing Differential Revision: D57791537 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127156 Approved by: https://github.com/jamesjwu	2024-05-27 17:18:41 +00:00
Xuehai Pan	26f4f10ac8	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980	2024-05-27 14:49:57 +00:00
PyTorch MergeBot	c7f6fbfa9d	Revert "[FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024 )" This reverts commit 9117779b0a178ec5ca548585a97bcb44be631644. Reverted https://github.com/pytorch/pytorch/pull/127024 on behalf of https://github.com/atalman due to failing in CI ([comment](https://github.com/pytorch/pytorch/pull/127024#issuecomment-2133566325))	2024-05-27 14:12:09 +00:00
PyTorch MergeBot	7121ea6f70	Revert "Add compile time profiler for non fbcode targets (#126904 )" This reverts commit 575cb617db4043dd7a76aaf523dc3ab7ee07e7a5. Reverted https://github.com/pytorch/pytorch/pull/126904 on behalf of https://github.com/atalman due to Broke nightly smoke test ([comment](https://github.com/pytorch/pytorch/pull/126904#issuecomment-2133418687))	2024-05-27 12:52:09 +00:00
PyTorch MergeBot	00fe0a0d79	Revert "Remove more of caffe2 (#126705 )" This reverts commit f95dbc12761cb4466099b0e9a3667057ca39272b. Reverted https://github.com/pytorch/pytorch/pull/126705 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126705#issuecomment-2133325449))	2024-05-27 11:59:14 +00:00
Jeeja	1110edb94b	Fix stream type to generic in comms default hooks (#120069 ) In the comms default_hooks, the decompress stream is hardcoded to the CUDA type. Fix this to use a generic type based on the grad tensor's device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120069 Approved by: https://github.com/jgong5, https://github.com/fegin	2024-05-27 10:27:30 +00:00
PyTorch MergeBot	55c0ab2887	Revert "[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 )" This reverts commit 7763c83af67eebfdd5185dbe6ce15ece2b992a0f. Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))	2024-05-27 09:22:08 +00:00
PyTorch MergeBot	4608971f7a	Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021 )" This reverts commit 0d1e22855022a04a8601a2d94f3079950283ba5d. Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))	2024-05-27 09:01:45 +00:00
PyTorch MergeBot	343a41fba8	Revert "[inductor][cpp] epilogue support for gemm template (#126019 )" This reverts commit 56c412d9063de3dc8163b8e1b0b9b5bf9581ad05. Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))	2024-05-27 09:01:45 +00:00
PyTorch MergeBot	68fddebf84	Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 )" This reverts commit 4aa43d11f332b2d7b8f19b4da5ceba612133889d. Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))	2024-05-27 09:01:45 +00:00
PyTorch MergeBot	ed9951ace7	Revert "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545 )" This reverts commit 43baabe9b94c86bd36ba4a00f501e52d833d7ec8. Reverted https://github.com/pytorch/pytorch/pull/126545 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))	2024-05-27 09:01:45 +00:00
PyTorch MergeBot	4c2e671a3b	Revert "[Inductor][CPP] Add Min/Max with VecMask (#126841 )" This reverts commit 1ef4306ab11410a506e0868543a466e87ea879b5. Reverted https://github.com/pytorch/pytorch/pull/126841 on behalf of https://github.com/DanilBaibak due to Blocks reverting of the broken PR ([comment](https://github.com/pytorch/pytorch/pull/126841#issuecomment-2132995404))	2024-05-27 08:58:01 +00:00
PyTorch MergeBot	5247446396	Revert "[Inductor][CPP] Add ne with VecMask (#126940 )" This reverts commit f8c4c268da67e9684f3287b7468f36a5a27c6a0b. Reverted https://github.com/pytorch/pytorch/pull/126940 on behalf of https://github.com/DanilBaibak due to Blocks reverting of the broken PR ([comment](https://github.com/pytorch/pytorch/pull/126841#issuecomment-2132995404))	2024-05-27 08:58:01 +00:00
PyTorch MergeBot	60523fa674	Revert "Move MKLDNN Specific IR to Separate File (#126504 )" This reverts commit bf2909b871579a78e841b661b9b0c302f311d010. Reverted https://github.com/pytorch/pytorch/pull/126504 on behalf of https://github.com/DanilBaibak due to Blocks reverting of the broken PR ([comment](https://github.com/pytorch/pytorch/pull/126841#issuecomment-2132995404))	2024-05-27 08:58:01 +00:00
chuanqiw	ff63e8bac8	[CI] fix doctest case by adding requires (#126855 ) With the triton update, the new dependency `llnl-hatchet` will be introduced. And `pydot` is a dependency of `llnl-hatchet`. So the doctest case `torch/fx/passes/graph_drawer.py::FxGraphDrawer.get_dot_graph:0` won't be skipped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126855 Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/peterbell10	2024-05-27 07:40:27 +00:00
feifan	22712ba5c5	Radam support the flag for "maximize" (#126765 ) Fixes #[126642](https://github.com/pytorch/pytorch/issues/126642) I reference the maximize in `Adam` and add `Radam's` maximize flag. If this pr is OK, I will add another pr for `Nadam`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126765 Approved by: https://github.com/janeyx99	2024-05-27 06:34:50 +00:00
cyy	5cca904c51	[3/N] Enable clang-tidy in aten/src/ATen/detail/ (#127184 ) Following #127168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127184 Approved by: https://github.com/jansel	2024-05-27 06:28:07 +00:00
Ting Lu	1c2e221e25	CUDA 12.4 ARM wheel integration to CD - nightly build (#126174 ) rebasing https://github.com/pytorch/pytorch/pull/124112. too many conflict files, so starting a new PR. Test https://github.com/pytorch/builder/pull/1775 (merged) for ARM wheel addition Test https://github.com/pytorch/builder/pull/1828 (merged) for setting MAX_JOBS Current issue to follow up: https://github.com/pytorch/pytorch/issues/126980 Co-authored-by: Aidyn-A <aidyn.b.aitzhan@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126174 Approved by: https://github.com/nWEIdia, https://github.com/atalman	2024-05-27 05:50:36 +00:00
Xuehai Pan	7763c83af6	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980 ghstack dependencies: #127122, #127123, #127124, #127125	2024-05-27 04:22:18 +00:00
cyy	4fdbaa794f	[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051 Approved by: https://github.com/cpuhrsch, https://github.com/malfet	2024-05-27 03:54:03 +00:00
Peter Bell	6aa5bb1a76	[inductor] Support persistent reductions for dynamic shapes (#126684 ) Currently persistent reductions are only supported when the reduction dimension is static, however we only really need to know that the rnumel is bounded. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126684 Approved by: https://github.com/lezcano	2024-05-27 02:30:20 +00:00
leslie-fang-intel	bf2909b871	Move MKLDNN Specific IR to Separate File (#126504 ) Summary Following the discussion in https://github.com/pytorch/pytorch/pull/122593#discussion_r1604144782, Move Inductor MKLDNN specific IRs to a separate file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126504 Approved by: https://github.com/desertfire, https://github.com/jgong5 ghstack dependencies: #126841, #126940	2024-05-27 00:48:09 +00:00
Peter Bell	39de62845a	[decomp] Fix default values missing from inplace `rrelu` decomposition (#126978 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126978 Approved by: https://github.com/lezcano	2024-05-26 23:49:40 +00:00
Xiaodong Wang	06934518a2	[AMD] Fix deprecated amdsmi api (#126962 ) Summary: https://github.com/pytorch/pytorch/pull/119182 uses an API that has already been deprecated by `c551c3caed`. So fixing this in a backward compatible way Differential Revision: D57711088 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126962 Approved by: https://github.com/eqy, https://github.com/izaitsevfb	2024-05-26 20:11:23 +00:00
chilli	ee6cb6daa1	Turn the mutation dependency of MutationOutput to weak deps (#127151 ) A writeup of how mutation works in Inductor: https://docs.google.com/document/d/1P0fSq4Nm-3CvdUe9v-mLdEWD3dgIHUf1czQXMmQsuxc/edit Pull Request resolved: https://github.com/pytorch/pytorch/pull/127151 Approved by: https://github.com/oulgen ghstack dependencies: #127148, #127149	2024-05-26 01:21:03 +00:00
leslie-fang-intel	f8c4c268da	[Inductor][CPP] Add ne with VecMask (#126940 ) Summary Fix https://github.com/pytorch/pytorch/issues/126824#issuecomment-2125039161 which is missing the support of `ne` with `VecMask`. Test Plan ``` python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_ne_cpu_bool ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126940 Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10 ghstack dependencies: #126841	2024-05-25 23:54:48 +00:00
leslie-fang-intel	1ef4306ab1	[Inductor][CPP] Add Min/Max with VecMask (#126841 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/126824 which is missing the support of `min/max` with `VecMask`. TestPlan ``` python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_max_cpu_bool python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_min_cpu_bool ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126841 Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10	2024-05-25 23:52:21 +00:00
chilli	b8ee7d0cc1	Change direct uses of MutationOutput to `mark_node_as_mutating` (#127149 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127149 Approved by: https://github.com/oulgen ghstack dependencies: #127148	2024-05-25 23:47:39 +00:00
chilli	3817c4f9fa	Unify add_fake_dep and add_mutation_dep, as they're literally the same thing (#127148 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127148 Approved by: https://github.com/oulgen	2024-05-25 23:47:39 +00:00
cyy	9bead53519	[2/N] Fix clang-tidy warnings in aten/src/ATen/detail/ (#127168 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127168 Approved by: https://github.com/Skylion007	2024-05-25 22:50:02 +00:00
Xuehai Pan	a28bfb5ed5	[4/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort functorch (#127125 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127125 Approved by: https://github.com/Skylion007 ghstack dependencies: #127122, #127123, #127124	2024-05-25 22:45:38 +00:00
Xuehai Pan	35ea5c6b22	[3/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torchgen (#127124 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127124 Approved by: https://github.com/Skylion007 ghstack dependencies: #127122, #127123	2024-05-25 19:20:03 +00:00
Xuehai Pan	0dae2ba5bd	[2/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort caffe2 (#127123 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127123 Approved by: https://github.com/Skylion007 ghstack dependencies: #127122	2024-05-25 18:26:34 +00:00
Zhenbin Lin	da141b096b	Enable UFMT on test/test_hub.py (#127155 ) Partially addresses #123062 Ran lintrunner on: test/test_hub.py Detail: ``` $ lintrunner -a --take UFMT test/test_hub.py ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127155 Approved by: https://github.com/Skylion007	2024-05-25 18:23:24 +00:00
PyTorch MergeBot	12d11fe4e5	Revert "reset dynamo cache before each test (#126586 )" This reverts commit bd24991f461476036d6ba20fed92651c7e46ef7c. Reverted https://github.com/pytorch/pytorch/pull/126586 on behalf of https://github.com/malfet due to Broke tons of tests, see `bd24991f46` ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2131365576))	2024-05-25 17:17:19 +00:00
James Wu	71eafe9e97	Refactor dispatch logic to clarify control flow (#126402 ) As discussed, this cleans up the code so that create_aot_dispatcher literally chooses an aot_dispatch function and runs it. Moves wrapper logic to jit_compile_runtime_wrappers, and adds aot_dispatch_export to handle export cases in one place. This also makes aot_dispatch_* return the same type always: a Callable and the forward metadata, instead of returning different number of arguments in export cases. Callers that don't care about fw_metadata can just ignore it. Added return type hints to enforce the same exact interface among all the aot_dispatch_* functions. It'd be nice to move the checks from the synthetic base and dedup wrappers that have to do with export outside of those wrappers, but it's probably fine for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126402 Approved by: https://github.com/oulgen, https://github.com/bdhirsh ghstack dependencies: #126193	2024-05-25 16:06:34 +00:00
Aaron Orenstein	7642cdef25	Improve fusable_read_and_write() (#127061 ) Related to https://github.com/pytorch/pytorch/issues/98467 The tacotron2 benchmark creates a lot of nodes which fusion then checks. This improves some of the perf of that checking. `can_fuse_vertical` calls `fusable_read_and_write` on O(read deps * write deps) combinations - but only cares about write deps that are MemoryDeps - so do the isinstance check outside the inner loop to save O(read deps) when it won't matter anyway. Also moves `fusable_read_and_write` to an instance method (instead of a closure) since it doesn't actually capture any variables. I also tried pre-splitting the read deps into `StarDep` vs `MemoryDep` but that didn't actually make any perf difference. Testing: ``` time python benchmarks/dynamo/torchbench.py --accuracy --inference --amp --backend inductor --disable-cudagraphs --device cuda --only tacotron2 ``` Before this change: 10m15s After this change: 9m31s Pull Request resolved: https://github.com/pytorch/pytorch/pull/127061 Approved by: https://github.com/peterbell10, https://github.com/jansel ghstack dependencies: #127060	2024-05-25 15:17:25 +00:00
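The loop reordering described in the entry above (run the type check once per write dependency instead of once per read/write pair) can be sketched in plain Python. This is only an illustration of the general technique: the `MemoryDep` and `StarDep` classes here are hypothetical stand-ins that borrow the names from the commit message, and `count_matches_naive` and `count_matches_hoisted` are invented for the example rather than taken from the Inductor scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryDep:
    """Hypothetical stand-in: a dependency on a named buffer."""
    name: str


@dataclass(frozen=True)
class StarDep:
    """Hypothetical stand-in: a dependency kind the check ignores on the write side."""
    name: str


def count_matches_naive(reads, writes):
    # The isinstance check sits in the inner loop, so it runs once per
    # (read, write) pair even when the write can never match.
    count = 0
    for read in reads:
        for write in writes:
            if isinstance(write, MemoryDep) and read.name == write.name:
                count += 1
    return count


def count_matches_hoisted(reads, writes):
    # The isinstance check runs once per write; writes of the wrong kind
    # skip the whole inner loop over reads.
    count = 0
    for write in writes:
        if not isinstance(write, MemoryDep):
            continue
        for read in reads:
            if read.name == write.name:
                count += 1
    return count


reads = [MemoryDep("buf0"), MemoryDep("buf1"), StarDep("buf2")]
writes = [MemoryDep("buf0"), StarDep("buf1"), MemoryDep("buf2")]
```

Both functions return the same count for the same inputs; the hoisted version simply avoids iterating over the reads for writes that are filtered out anyway.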
Aaron Orenstein	6c79299a35	Improve score_fusion_memory() (#127060 ) Related to #98467 The tacotron2 benchmark creates a lot of nodes which fusion then checks. This improves some of the perf of that checking. `score_fusion_memory` is called O(n^2) times - so by moving the set union, `has_unbacked_symbols` check, and `numbytes_hint` out of the loop we call them O(n) times and the O(n^2) call gets cheaper. Testing: ``` time python benchmarks/dynamo/torchbench.py --accuracy --inference --amp --backend inductor --disable-cudagraphs --device cuda --only tacotron2 ``` Before this change: 12m33s After this change: 10m15s Pull Request resolved: https://github.com/pytorch/pytorch/pull/127060 Approved by: https://github.com/peterbell10, https://github.com/jansel	2024-05-25 15:17:25 +00:00
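The hoisting described in the entry above (compute per-node data O(n) times so that the O(n^2) pairwise call gets cheaper) can be illustrated with a minimal pure-Python sketch. The `Node` class and both scoring functions are hypothetical and are not Inductor's actual `score_fusion_memory`; the score used here (number of shared dependency names) is an assumption chosen only to keep the example small.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a scheduler node (not Inductor's class)."""
    name: str
    reads: frozenset
    writes: frozenset


def pairwise_scores_naive(nodes):
    # Rebuilds the read/write union for both nodes on every pair,
    # so the union is computed O(n^2) times.
    scores = {}
    for a, b in combinations(nodes, 2):
        deps_a = a.reads | a.writes
        deps_b = b.reads | b.writes
        scores[(a.name, b.name)] = len(deps_a & deps_b)
    return scores


def pairwise_scores_hoisted(nodes):
    # Builds each union once per node (O(n) unions); the pairwise loop
    # only performs the intersection.
    deps = {n.name: n.reads | n.writes for n in nodes}
    scores = {}
    for a, b in combinations(nodes, 2):
        scores[(a.name, b.name)] = len(deps[a.name] & deps[b.name])
    return scores


nodes = [
    Node("n0", frozenset({"x"}), frozenset({"y"})),
    Node("n1", frozenset({"y"}), frozenset({"z"})),
    Node("n2", frozenset({"x", "z"}), frozenset({"w"})),
]
```

The pairwise loop is still quadratic in both versions; only the per-node work has moved out of it, which matches the shape of the change the commit message describes.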
Xuehai Pan	ba3b05fdf3	[1/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort stdlib (#127122 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122 Approved by: https://github.com/kit1980	2024-05-25 08:25:50 +00:00
Wu, Chunyuan	4a997de8b9	[AOTI] support freezing for MKLDNN (#124350 ) ## Description Fixes https://github.com/pytorch/pytorch/issues/114450. This PR builds upon the work from @imzhuhl done in https://github.com/pytorch/pytorch/pull/114451. This PR requires https://github.com/pytorch/pytorch/pull/122472 to land first. We leverage the serialization and deserialization API from oneDNN v3.4.1 to save the opaque MKLDNN tensor during the compilation and restore the opaque tensor when loading the compiled .so. ideep version is updated so that we won't break any pipeline even if third_party/ideep is not updated at the same time. ### Test plan: ```sh python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_freezing_non_abi_compatible_cpu python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_conv_freezing_non_abi_compatible_cpu python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_deconv_freezing_non_abi_compatible_cpu python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_linear_freezing_non_abi_compatible_cpu ``` ### TODOs in follow-up PRs 1. We found that using `AOTI_TORCH_CHECK` causes a performance drop on several models (`DistillGPT2`, `MBartForConditionalGeneration`, `T5ForConditionalGeneration`, `T5Small`) compared with JIT Inductor, which uses `TORCH_CHECK`. How to address this may need further discussion (`AOTI_TORCH_CHECK` is introduced in https://github.com/pytorch/pytorch/pull/119220). 2. Freezing in non-ABI compatible mode will work with the support in this PR. For ABI compatible mode, we first need to address this issue: `AssertionError: None, i.e. optional output is not supported`. `6c4f43f826/torch/_inductor/codegen/cpp_wrapper_cpu.py (L2023-L2024)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124350 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-05-25 07:15:36 +00:00
Yu, Guangye	e7a42702f9	generalize custom_fwd&custom_bwd to be device-agnostic (#126531 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126531 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD, https://github.com/EikanWang ghstack dependencies: #126527	2024-05-25 06:48:16 +00:00
Yu, Guangye	c09205a057	Deprecate device-specific GradScaler autocast API (#126527 ) # Motivation ## for `torch.amp.GradScaler`, - `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cpu", args...)`. - `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`. So, we intend to deprecate them and strongly recommend that developers use `torch.amp.GradScaler`. ## for `custom_fwd` and `custom_bwd`, this is a good solution to make the custom function run with or without effect even in an autocast-enabled region and can be shared by other backends, like CPU and XPU. So we generalize it to be device-agnostic, put them in `torch/amp/autocast_mode.py`, and re-expose them as `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`. # Additional Context Add UT to cover the deprecation warning. No need for more UTs to cover the functionality of `torch.amp.custom_f/bwd`, the existing UTs that previously covered the functionality of `torch.cuda.amp.custom_f/bwd` can cover them. To facilitate the review, we separate these code changes into two PRs. The first PR covers `torch.amp.GradScaler`. The follow-up covers `custom_fwd` and `custom_bwd`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang	2024-05-25 06:41:34 +00:00
Catherine Lee	ef86a27dba	Mark test_set_per_process_memory_fraction serial (#127087 ) Occasionally OOMs Also should probably give the entire GPU for this anyways Pull Request resolved: https://github.com/pytorch/pytorch/pull/127087 Approved by: https://github.com/huydhn	2024-05-25 06:26:47 +00:00
dshi7	0f67d38f0f	add TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS (#127017 ) tlparse prints failure description like this > dynamic shape operator: aten._unique2.default; to enable, set torch._dynamo.config.capture_dynamic_output_shape_ops = True adding os env var to set it easier for testing Pull Request resolved: https://github.com/pytorch/pytorch/pull/127017 Approved by: https://github.com/jackiexu1992	2024-05-25 05:42:41 +00:00
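The pattern in the entry above (let an environment variable set a config flag so it is easier to toggle during testing) can be sketched as follows. The helper `env_flag` is hypothetical, and the rule that only the string `"1"` enables the flag is an assumption made for this sketch; the commit message does not say how PyTorch parses the variable.

```python
import os


def env_flag(name, default=False, environ=None):
    """Hypothetical helper: read a boolean flag from an environment mapping.

    Assumption for this sketch: only the exact value "1" (ignoring
    surrounding whitespace) turns the flag on.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip() == "1"


# Module-level default that a test run could override from the shell.
capture_dynamic_output_shape_ops = env_flag(
    "TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS"
)
```

Passing the mapping explicitly, for example `env_flag("TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS", environ={...})`, keeps the helper testable without mutating the real process environment.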
chilli	84e59f052d	Made some minor improvements to flexattention perf + added more autotune configs (#126811 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126811 Approved by: https://github.com/drisspg, https://github.com/yanboliang, https://github.com/Neilblaze	2024-05-25 05:03:31 +00:00
cyy	9f11fc666a	[1/N] Fix clang-tidy warnings in aten/src/ATen/detail/ (#127057 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127057 Approved by: https://github.com/Skylion007	2024-05-25 04:55:52 +00:00
Shunting Zhang	bd24991f46	reset dynamo cache before each test (#126586 ) In https://github.com/pytorch/pytorch/issues/125967, we found that test results depend on test order. The root cause is that earlier tests populate the dynamo cache and affect later tests. This PR clears the dynamo cache before each unit test so that unit test results are more deterministic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126586 Approved by: https://github.com/jansel	2024-05-25 04:48:09 +00:00
Ke Wen	8bd26ecf0b	[pipelining] test composability with DDP and FSDP (#127066 ) Added to `multigpu` test config, which is run periodically. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127066 Approved by: https://github.com/H-Huang, https://github.com/wconstab ghstack dependencies: #127136, #126931	2024-05-25 04:30:40 +00:00
Ke Wen	c1d2564acf	[pipelining] Add grad test for interleaved schedules (#126931 ) Added `test_grad_with_manual_interleaved`: - Model: `MultiMLP` - Tested schedules: Interleaved1F1B, LoopedBFS - Two stages per rank ``` Rank 0 stages: [0, 2] Rank 1 stages: [1, 3] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126931 Approved by: https://github.com/wconstab ghstack dependencies: #127136	2024-05-25 04:13:28 +00:00
Ke Wen	eaace67444	[pipelining] do not check inputs for non-0 stages (#127136 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127136 Approved by: https://github.com/wconstab	2024-05-25 04:13:28 +00:00
James Wu	cc9a3412d4	Implement a post_compile step for aot_dispatch_autograd (#126193 ) This PR moves the post compile portion of aot_dispatch_autograd into runtime_wrappers.py. Completing this allows us to run the post compile section on its own when warm starting. I considered leaving this thing in jit_compile_runtime_wrappers, but we're gonna run into circular dependency issues later if we don't move it over Pull Request resolved: https://github.com/pytorch/pytorch/pull/126193 Approved by: https://github.com/bdhirsh ghstack dependencies: #126907	2024-05-25 03:24:20 +00:00
Oguz Ulgen	52bcf120e5	Make inductor config hashing more portable (#127022 ) Summary: masnesral and I noticed that config contains non-portable artifacts. Let's fix that. Test Plan: adhoc testing Differential Revision: D57748025 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127022 Approved by: https://github.com/masnesral	2024-05-25 03:01:33 +00:00
Jane Xu	665637714f	Remove SparseAdam weird allowance of raw Tensor input (#127081 ) This continues the full deprecation after https://github.com/pytorch/pytorch/pull/114425. It's been 6 months! And I'm fairly certain no one is going to yell at me as this patch is not really used. ------ # BC Breaking note As of this PR, SparseAdam will become consistent with the rest of our optimizers in that it will only accept containers of Tensors/Parameters/param groups and fully complete deprecation of this path. Hitherto, the SparseAdam constructor had allowed raw tensors as the params argument to the constructor. Now, if you write the following code, there will be an error similar to every other optim: "params argument given to the optimizer should be an iterable of Tensors or dicts" ``` import torch param = torch.rand(16, 32) optimizer = torch.optim.SparseAdam(param) ``` Instead you should replace the last line with ``` optimizer = torch.optim.SparseAdam([param]) ``` to no longer error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127081 Approved by: https://github.com/soulitzer	2024-05-25 02:58:24 +00:00
cyy	29a1f62f23	Replace c10::invoke_result with std::invoke_result (#124169 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124169 Approved by: https://github.com/swolchok	2024-05-25 02:42:13 +00:00
Huy Do	9ef6f8dfc1	Fix typo in inductor workflow for CUDA 12.4 jobs (#127121 ) Discovered by @clee2000. The change was introduced in https://github.com/pytorch/pytorch/pull/121956 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127121 Approved by: https://github.com/clee2000, https://github.com/Skylion007	2024-05-25 02:36:39 +00:00
Ke Wen	ed838793df	[pipelining] Remove qualname mapping (#127018 ) `QualnameMapMixin` was intended to provide a mapping from new FQN of the piped model to the FQN of the original model. It was there because previous tracers and flattening during tracing would modify the FQNs. Now that we use unflattener, the FQN of the stage modules are the same as the original FQNs. We don't need `QualnameMapMixin` any more. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127018 Approved by: https://github.com/H-Huang	2024-05-25 02:32:40 +00:00
drisspg	5f15110499	Update dispatch stub to make SDPA routing cleaner (#126832 ) # Summary Adds a public method to dispatchstub to check if a fn has been registered for a device. We use this new function to clean up the dispatching logic for SDPA, as well as make the private use dispatching simpler: #126392 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126832 Approved by: https://github.com/ezyang, https://github.com/albanD	2024-05-25 01:40:53 +00:00
Shunting Zhang	db9c6aeec6	Revert "Skip test_memory_format_nn_BatchNorm2d in inductor (#125970 )" (#126594 ) This reverts commit 0a9c6e92f8d1a35f33042c8dab39f23b7f39d6e7. enable the test since it's fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126594 Approved by: https://github.com/huydhn ghstack dependencies: #126593	2024-05-25 01:27:02 +00:00
Shunting Zhang	b03dc3d167	don't check memory format for empty tensors (#126593 ) Fix https://github.com/pytorch/pytorch/issues/125967 . The test actually fails for empty 4D or 5D tensors when checking for memory format. I'm not exactly sure which recent inductor change caused the failure, but it may not be that important to maintain strides for an empty tensor. (?) I just skip the check for empty tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126593 Approved by: https://github.com/ezyang	2024-05-25 01:19:45 +00:00
Animesh Jain	84f8cd22ac	[dynamo][TensorVariable] Support "if param.grad_fn" usecase (#126960 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126960 Approved by: https://github.com/jansel ghstack dependencies: #126922	2024-05-25 01:09:26 +00:00
Sheng Fu	bbeb0906c4	Register creak_node_hook (#126671 ) Differential Revision: D57469157 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126671 Approved by: https://github.com/angelayi	2024-05-24 23:32:15 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	72f0bdcc22	Remove torch._constrain_as_value (#127103 ) Summary: This API doesn't do anything useful and should be subsumed by torch._check. Test Plan: CI Differential Revision: D57786740 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127103 Approved by: https://github.com/angelayi	2024-05-24 22:49:46 +00:00
Jason Ansel	d5bf3a98db	[inductor] Refactor indexing() into triton.py (#127047 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127047 Approved by: https://github.com/shunting314 ghstack dependencies: #126944, #126945	2024-05-24 22:46:20 +00:00
Jason Ansel	92433217cb	[inductor] Misc refactors (#126945 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126945 Approved by: https://github.com/shunting314 ghstack dependencies: #126944	2024-05-24 22:46:20 +00:00
Jason Ansel	1b6e3e3bcb	[inductor] Refactor part of IterationRangesEntry into triton.py (#126944 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126944 Approved by: https://github.com/shunting314	2024-05-24 22:46:20 +00:00
Anshul Sinha	83617017e0	[dtensor][debug] add c10d allreduce_coalesced_ tracing to CommDebugMode (#127040 ) Summary Added c10d all_reduce_coalesced tracing to CommDebugMode and added test case to test_comm_mode.py. Test Plan pytest test/distributed/_tensor/debug/test_comm_mode.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127040 Approved by: https://github.com/XilunWu ghstack dependencies: #127025, #127029	2024-05-24 22:25:44 +00:00
Michael Lazos	59052071b7	Disallow fusions of foreach and reductions (#127048 ) Fixes https://github.com/pytorch/pytorch/issues/120857 This currently isn't supported until we enable foreach reduction kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127048 Approved by: https://github.com/weifengpy	2024-05-24 21:35:06 +00:00
James Wu	023c1baf82	Add global configurations to cache key (#126907 ) This adds a bunch of global configurations to the cache key. There's definitely more I haven't added, but this is just an audit of all of the `torch.*` globals that are used in jit_compile_runtime_wrappers, runtime_wrappers, etc. It also makes the hash details object subclass FXGraphHashDetails, which implements other hashed data like configs inductor depends on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126907 Approved by: https://github.com/aorenste	2024-05-24 21:26:46 +00:00
dan_the_3rd	c133665d4a	[CUDA] Parallelize upsampling OPS across the batch/channel dimension. (#127082 ) This can make this operation 200x+ faster on modern GPUs for small grid sizes, as otherwise this kernel is scheduled with a single block (!) Tested on A100 with: ``` python test/test_nn.py TestNNDeviceTypeCUDA ``` Benchmarks FW Ran on A100 / bf16 ## Forward pass benchmarks \| batch size \| input size \| output size \| before runtime (mem bandwidth) \| after runtime (mem bandwidth) \| speedup \| \|------------\|------------\|-------------\|------------------\|-----------------\|---------\| \| 768 \| 16x16 \| 6x6 \| 5855us (0.07 GB/s) \| 38us (10 GB/s) \| 154x \| \| 768 \| 16x16 \| 7x7 \| 5214us (0.08 GB/s) \| 37us (11 GB/s) \| 138x \| \| 768 \| 16x16 \| 14x14 \| 2314us (0.27 GB/s) \| 36us (17 GB/s) \| 63x \| \| 768 \| 16x16 \| 16x16 \| 1232us (0.59 GB/s) \| 33us (21 GB/s) \| 36x \| \| 768 \| 32x32 \| 6x6 \| 19442us (0.07 GB/s) \| 98us (15 GB/s) \| 197x \| \| 768 \| 32x32 \| 7x7 \| 16918us (0.09 GB/s) \| 89us (17 GB/s) \| 188x \| \| 768 \| 32x32 \| 14x14 \| 6023us (0.28 GB/s) \| 69us (25 GB/s) \| 86x \| \| 768 \| 32x32 \| 16x16 \| 3455us (0.52 GB/s) \| 55us (32 GB/s) \| 62x \| \| 768 \| 48x48 \| 6x6 \| 38597us (0.08 GB/s) \| 179us (18 GB/s) \| 214x \| \| 768 \| 48x48 \| 7x7 \| 34700us (0.09 GB/s) \| 163us (20 GB/s) \| 211x \| \| 768 \| 48x48 \| 14x14 \| 10647us (0.33 GB/s) \| 112us (31 GB/s) \| 94x \| \| 768 \| 48x48 \| 16x16 \| 7388us (0.49 GB/s) \| 100us (36 GB/s) \| 73x \| \| 768 \| 64x64 \| 6x6 \| 76288us (0.07 GB/s) \| 310us (19 GB/s) \| 246x \| \| 768 \| 64x64 \| 7x7 \| 54981us (0.1 GB/s) \| 257us (23 GB/s) \| 213x \| \| 768 \| 64x64 \| 14x14 \| 16565us (0.37 GB/s) \| 169us (36 GB/s) \| 97x \| \| 768 \| 64x64 \| 16x16 \| 12037us (0.51 GB/s) \| 141us (43 GB/s) \| 84x \| \| 1024 \| 16x16 \| 6x6 \| 8123us (0.06 GB/s) \| 44us (12 GB/s) \| 183x \| \| 1024 \| 16x16 \| 7x7 \| 7017us (0.08 GB/s) \| 45us (12 GB/s) \| 155x \| \| 1024 \| 16x16 \| 14x14 \| 3150us (0.27 
GB/s) \| 45us (18 GB/s) \| 69x \| \| 1024 \| 16x16 \| 16x16 \| 1695us (0.57 GB/s) \| 41us (23 GB/s) \| 40x \| \| 1024 \| 32x32 \| 6x6 \| 25918us (0.07 GB/s) \| 120us (16 GB/s) \| 214x \| \| 1024 \| 32x32 \| 7x7 \| 22622us (0.09 GB/s) \| 108us (18 GB/s) \| 208x \| \| 1024 \| 32x32 \| 14x14 \| 8245us (0.28 GB/s) \| 87us (26 GB/s) \| 94x \| \| 1024 \| 32x32 \| 16x16 \| 4599us (0.53 GB/s) \| 68us (35 GB/s) \| 67x \| \| 1024 \| 48x48 \| 6x6 \| 51486us (0.08 GB/s) \| 219us (20 GB/s) \| 234x \| \| 1024 \| 48x48 \| 7x7 \| 46501us (0.09 GB/s) \| 202us (22 GB/s) \| 229x \| \| 1024 \| 48x48 \| 14x14 \| 14280us (0.33 GB/s) \| 145us (32 GB/s) \| 98x \| \| 1024 \| 48x48 \| 16x16 \| 9877us (0.49 GB/s) \| 125us (39 GB/s) \| 79x \| \| 1024 \| 64x64 \| 6x6 \| 101731us (0.07 GB/s) \| 378us (20 GB/s) \| 268x \| \| 1024 \| 64x64 \| 7x7 \| 73465us (0.1 GB/s) \| 320us (24 GB/s) \| 229x \| \| 1024 \| 64x64 \| 14x14 \| 22109us (0.37 GB/s) \| 218us (37 GB/s) \| 101x \| \| 1024 \| 64x64 \| 16x16 \| 16081us (0.51 GB/s) \| 178us (46 GB/s) \| 90x \| \| 1536 \| 16x16 \| 6x6 \| 12546us (0.06 GB/s) \| 61us (13 GB/s) \| 205x \| \| 1536 \| 16x16 \| 7x7 \| 11064us (0.07 GB/s) \| 63us (13 GB/s) \| 175x \| \| 1536 \| 16x16 \| 14x14 \| 4839us (0.26 GB/s) \| 62us (20 GB/s) \| 77x \| \| 1536 \| 16x16 \| 16x16 \| 2630us (0.55 GB/s) \| 59us (24 GB/s) \| 44x \| \| 1536 \| 32x32 \| 6x6 \| 38898us (0.07 GB/s) \| 170us (17 GB/s) \| 227x \| \| 1536 \| 32x32 \| 7x7 \| 34079us (0.09 GB/s) \| 155us (19 GB/s) \| 219x \| \| 1536 \| 32x32 \| 14x14 \| 12632us (0.27 GB/s) \| 124us (28 GB/s) \| 101x \| \| 1536 \| 32x32 \| 16x16 \| 6900us (0.53 GB/s) \| 98us (37 GB/s) \| 70x \| \| 1536 \| 48x48 \| 6x6 \| 77272us (0.08 GB/s) \| 316us (21 GB/s) \| 243x \| \| 1536 \| 48x48 \| 7x7 \| 70153us (0.09 GB/s) \| 291us (23 GB/s) \| 240x \| \| 1536 \| 48x48 \| 14x14 \| 21500us (0.33 GB/s) \| 208us (34 GB/s) \| 103x \| \| 1536 \| 48x48 \| 16x16 \| 14851us (0.49 GB/s) \| 181us (40 GB/s) \| 81x \| \| 1536 \| 64x64 \| 6x6 \| 152669us 
(0.07 GB/s) \| 548us (21 GB/s) \| 278x \| \| 1536 \| 64x64 \| 7x7 \| 110348us (0.1 GB/s) \| 466us (25 GB/s) \| 236x \| \| 1536 \| 64x64 \| 14x14 \| 33350us (0.36 GB/s) \| 316us (38 GB/s) \| 105x \| \| 1536 \| 64x64 \| 16x16 \| 24173us (0.51 GB/s) \| 263us (47 GB/s) \| 91x \| \| 4096 \| 16x16 \| 6x6 \| 34638us (0.06 GB/s) \| 138us (16 GB/s) \| 249x \| \| 4096 \| 16x16 \| 7x7 \| 31590us (0.07 GB/s) \| 144us (16 GB/s) \| 218x \| \| 4096 \| 16x16 \| 14x14 \| 13203us (0.26 GB/s) \| 149us (23 GB/s) \| 88x \| \| 4096 \| 16x16 \| 16x16 \| 7328us (0.53 GB/s) \| 143us (27 GB/s) \| 51x \| \| 4096 \| 32x32 \| 6x6 \| 103802us (0.07 GB/s) \| 405us (19 GB/s) \| 256x \| \| 4096 \| 32x32 \| 7x7 \| 91354us (0.08 GB/s) \| 372us (22 GB/s) \| 245x \| \| 4096 \| 32x32 \| 14x14 \| 34501us (0.26 GB/s) \| 312us (29 GB/s) \| 110x \| \| 4096 \| 32x32 \| 16x16 \| 18465us (0.52 GB/s) \| 247us (39 GB/s) \| 74x \| ## Backward pass benchmarks \| batch size \| input size \| output size \| before runtime (mem bandwidth) \| after runtime (mem bandwidth) \| speedup \| \|------------\|------------\|-------------\|------------------\|-----------------\|---------\| \| 768 \| 16x16 \| 6x6 \| 78656us (0.0 GB/s) \| 323us (1 GB/s) \| 243x \| \| 768 \| 16x16 \| 7x7 \| 67167us (0.0 GB/s) \| 292us (1 GB/s) \| 230x \| \| 768 \| 16x16 \| 14x14 \| 27478us (0.02 GB/s) \| 229us (2 GB/s) \| 119x \| \| 768 \| 16x16 \| 16x16 \| 131us (5.59 GB/s) \| 56us (13 GB/s) \| 2x \| \| 768 \| 32x32 \| 6x6 \| 271752us (0.0 GB/s) \| 888us (1 GB/s) \| 305x \| \| 768 \| 32x32 \| 7x7 \| 224110us (0.0 GB/s) \| 813us (1 GB/s) \| 275x \| \| 768 \| 32x32 \| 14x14 \| 85365us (0.02 GB/s) \| 450us (3 GB/s) \| 189x \| \| 768 \| 32x32 \| 16x16 \| 67700us (0.02 GB/s) \| 360us (5 GB/s) \| 187x \| \| 768 \| 48x48 \| 6x6 \| 593709us (0.0 GB/s) \| 1988us (1 GB/s) \| 298x \| \| 768 \| 48x48 \| 7x7 \| 485566us (0.0 GB/s) \| 1694us (1 GB/s) \| 286x \| \| 768 \| 48x48 \| 14x14 \| 164059us (0.02 GB/s) \| 897us (3 GB/s) \| 182x \| \| 768 \| 48x48 \| 
16x16 \| 134317us (0.02 GB/s) \| 674us (5 GB/s) \| 199x \| \| 768 \| 64x64 \| 6x6 \| 1026651us (0.0 GB/s) \| 3360us (1 GB/s) \| 305x \| \| 768 \| 64x64 \| 7x7 \| 770901us (0.0 GB/s) \| 2584us (2 GB/s) \| 298x \| \| 768 \| 64x64 \| 14x14 \| 277850us (0.02 GB/s) \| 1556us (3 GB/s) \| 178x \| \| 768 \| 64x64 \| 16x16 \| 236245us (0.02 GB/s) \| 1144us (5 GB/s) \| 206x \| \| 1024 \| 16x16 \| 6x6 \| 106638us (0.0 GB/s) \| 341us (1 GB/s) \| 312x \| \| 1024 \| 16x16 \| 7x7 \| 90886us (0.0 GB/s) \| 314us (1 GB/s) \| 288x \| \| 1024 \| 16x16 \| 14x14 \| 36572us (0.02 GB/s) \| 292us (2 GB/s) \| 124x \| \| 1024 \| 16x16 \| 16x16 \| 171us (5.69 GB/s) \| 56us (17 GB/s) \| 3x \| \| 1024 \| 32x32 \| 6x6 \| 356900us (0.0 GB/s) \| 936us (2 GB/s) \| 380x \| \| 1024 \| 32x32 \| 7x7 \| 299139us (0.0 GB/s) \| 870us (2 GB/s) \| 343x \| \| 1024 \| 32x32 \| 14x14 \| 113205us (0.02 GB/s) \| 576us (4 GB/s) \| 196x \| \| 1024 \| 32x32 \| 16x16 \| 90886us (0.02 GB/s) \| 458us (5 GB/s) \| 198x \| \| 1024 \| 48x48 \| 6x6 \| 786896us (0.0 GB/s) \| 2127us (2 GB/s) \| 369x \| \| 1024 \| 48x48 \| 7x7 \| 640515us (0.0 GB/s) \| 1837us (2 GB/s) \| 348x \| \| 1024 \| 48x48 \| 14x14 \| 218720us (0.02 GB/s) \| 1152us (4 GB/s) \| 189x \| \| 1024 \| 48x48 \| 16x16 \| 178827us (0.02 GB/s) \| 863us (5 GB/s) \| 207x \| \| 1024 \| 64x64 \| 6x6 \| 1379991us (0.0 GB/s) \| 3589us (2 GB/s) \| 384x \| \| 1024 \| 64x64 \| 7x7 \| 1047466us (0.0 GB/s) \| 2774us (2 GB/s) \| 377x \| \| 1024 \| 64x64 \| 14x14 \| 370139us (0.02 GB/s) \| 1999us (4 GB/s) \| 185x \| \| 1024 \| 64x64 \| 16x16 \| 316501us (0.02 GB/s) \| 1470us (5 GB/s) \| 215x \| \| 1536 \| 16x16 \| 6x6 \| 159057us (0.0 GB/s) \| 477us (1 GB/s) \| 332x \| \| 1536 \| 16x16 \| 7x7 \| 135578us (0.0 GB/s) \| 441us (1 GB/s) \| 306x \| \| 1536 \| 16x16 \| 14x14 \| 53002us (0.02 GB/s) \| 400us (3 GB/s) \| 132x \| \| 1536 \| 16x16 \| 16x16 \| 252us (5.79 GB/s) \| 55us (26 GB/s) \| 4x \| \| 1536 \| 32x32 \| 6x6 \| 545653us (0.0 GB/s) \| 1323us (2 GB/s) \| 412x \| \| 1536 
\| 32x32 \| 7x7 \| 447491us (0.0 GB/s) \| 1248us (2 GB/s) \| 358x \| \| 1536 \| 32x32 \| 14x14 \| 173491us (0.02 GB/s) \| 787us (4 GB/s) \| 220x \| \| 1536 \| 32x32 \| 16x16 \| 136395us (0.02 GB/s) \| 633us (5 GB/s) \| 215x \| \| 1536 \| 48x48 \| 6x6 \| 1198639us (0.0 GB/s) \| 3057us (2 GB/s) \| 392x \| \| 1536 \| 48x48 \| 7x7 \| 985549us (0.0 GB/s) \| 2645us (2 GB/s) \| 372x \| \| 1536 \| 48x48 \| 14x14 \| 331419us (0.02 GB/s) \| 1581us (4 GB/s) \| 209x \| \| 1536 \| 48x48 \| 16x16 \| 270972us (0.02 GB/s) \| 1186us (6 GB/s) \| 228x \| \| 1536 \| 64x64 \| 6x6 \| 2094282us (0.0 GB/s) \| 5214us (2 GB/s) \| 401x \| \| 1536 \| 64x64 \| 7x7 \| 1593449us (0.0 GB/s) \| 4086us (2 GB/s) \| 389x \| \| 1536 \| 64x64 \| 14x14 \| 559244us (0.02 GB/s) \| 2828us (4 GB/s) \| 197x \| \| 1536 \| 64x64 \| 16x16 \| 469471us (0.02 GB/s) \| 2057us (6 GB/s) \| 228x \| \| 4096 \| 16x16 \| 6x6 \| 430494us (0.0 GB/s) \| 1008us (2 GB/s) \| 427x \| \| 4096 \| 16x16 \| 7x7 \| 360346us (0.0 GB/s) \| 1015us (2 GB/s) \| 354x \| \| 4096 \| 16x16 \| 14x14 \| 142868us (0.02 GB/s) \| 988us (3 GB/s) \| 144x \| \| 4096 \| 16x16 \| 16x16 \| 658us (5.93 GB/s) \| 56us (69 GB/s) \| 11x \| \| 4096 \| 32x32 \| 6x6 \| 1425928us (0.0 GB/s) \| 2796us (2 GB/s) \| 509x \| \| 4096 \| 32x32 \| 7x7 \| 1188862us (0.0 GB/s) \| 2906us (2 GB/s) \| 409x \| \| 4096 \| 32x32 \| 14x14 \| 464286us (0.02 GB/s) \| 1965us (4 GB/s) \| 236x \| \| 4096 \| 32x32 \| 16x16 \| 363903us (0.02 GB/s) \| 1588us (6 GB/s) \| 229x \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/127082 Approved by: https://github.com/fmassa	2024-05-24 21:17:12 +00:00
Chien-Chin Huang	b0871f9b33	[DSD] Add a test to verify FSDP lazy initialization case (#127069 ) Summary: Distributed state_dict should not error out because the `model.state_dict()` will trigger FSDP to initialize. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127069 Approved by: https://github.com/wz337	2024-05-24 21:09:11 +00:00
Bin Bao	7394ec7123	[AOTI][refactor] Update DTYPE_TO_CPP mapping (#126915 ) Summary: Use more consistent cpp int types in DTYPE_TO_CPP Pull Request resolved: https://github.com/pytorch/pytorch/pull/126915 Approved by: https://github.com/chenyang78	2024-05-24 21:03:12 +00:00
Sijia Chen	800f461b2a	[User-Written Triton] Handle the `scf.for` and `scf.while` case (#127065 ) Summary: This is the official fix for the issue reported in https://fb.workplace.com/groups/1075192433118967/permalink/1427865377851669/ The root cause is that the MLIR mutation analysis doesn't find the mutated tensors, which made AOT autograd think there were no users of the Triton kernel and remove it 😔 --- Triton IR: P1369315213 Wrong Analyze Graph: P1364305956 Right Analyze Graph: P1369324977 Test Plan: buck2 run mode/opt scripts/liptds/domain_kernels:triton_dcpp_flash unit tests Differential Revision: D57606053 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127065 Approved by: https://github.com/oulgen, https://github.com/chenyang78	2024-05-24 21:01:13 +00:00
Shuo Ding	dce29a8a87	Replaced same with assertEqual in two files (#126994 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126994 Approved by: https://github.com/masnesral	2024-05-24 20:50:36 +00:00
PyTorch MergeBot	c34f8c7f91	Revert "Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946 )" This reverts commit 5e69e11d098a2cfccc8a59377c431e9c71cab9a8. Reverted https://github.com/pytorch/pytorch/pull/125946 on behalf of https://github.com/clee2000 due to sorry the Dr CI fix hasn't been merged yet and its still failing `5e69e11d09` https://github.com/pytorch/pytorch/actions/runs/9228887299/job/25393895252 ([comment](https://github.com/pytorch/pytorch/pull/125946#issuecomment-2130305958))	2024-05-24 20:26:07 +00:00
Scott Wolchok	fdda9a22c3	Performance parity for 32-bit-precision in FP16 ARM matrix-vector kernel using FMLAL instruction (#127033 ) Summary: I discovered this instruction by checking all the intrinsics on https://arm-software.github.io/acle/neon_intrinsics/advsimd.html . Test Plan: Existing test coverage benchmarked custom sizes with https://github.com/malfet/llm_experiments benchmarks/benchmark/torch_mm.py: ``` m=1024, n=1024, k=1 ==================== trans_b torch.float16 43.93 usec Using FP16 accumulation trans_b torch.float16 43.76 usec m=4100, n=4100, k=1 ==================== trans_b torch.float16 719.35 usec Using FP16 accumulation trans_b torch.float16 719.33 usec m=4104, n=4104, k=1 ==================== trans_b torch.float16 727.79 usec Using FP16 accumulation trans_b torch.float16 702.72 usec m=16384, n=16384, k=1 ==================== trans_b torch.float16 18465.11 usec Using FP16 accumulation trans_b torch.float16 11435.28 usec ``` also checked the default sizes. Relevant output before: ``` mv_nt torch.float16 13.05 usec trans_b torch.float16 13.69 usec Using FP16 accumulation mv_nt torch.float16 8.65 usec trans_b torch.float16 9.24 usec ``` after: ``` mv_nt torch.float16 8.66 usec trans_b torch.float16 8.85 usec Using FP16 accumulation mv_nt torch.float16 8.52 usec trans_b torch.float16 8.60 usec ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127033 Approved by: https://github.com/malfet, https://github.com/Skylion007 ghstack dependencies: #126745, #126746, #126793, #126794, #126877, #127016	2024-05-24 19:47:50 +00:00
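The FMLAL commit above concerns the FP16 kernel path that keeps 32-bit precision while consuming FP16 inputs. As a rough illustration of why accumulating in FP32 rather than FP16 matters, the following sketch emulates both accumulation modes in plain Python. It is not the ARM kernel: the helper names (`to_fp16`, `to_fp32`, `dot_fp16_accumulate`, `dot_fp32_accumulate`) are hypothetical, and rounding is emulated with the standard `struct` module instead of NEON intrinsics.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_fp16_accumulate(a, b):
    """Dot product where the running sum is rounded to FP16 after every step."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc


def dot_fp32_accumulate(a, b):
    """Dot product where FP16 inputs are widened and the sum is kept in FP32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp32(acc + x * y)
    return acc


# 4096 products of 1.0 * 1.0: the FP16 running sum stops growing at 2048,
# because 2049 is not representable in half precision.
ones = [1.0] * 4096
fp16_result = dot_fp16_accumulate(ones, ones)
fp32_result = dot_fp32_accumulate(ones, ones)
```

With these inputs the FP16-accumulated sum stalls at 2048 while the FP32-accumulated sum reaches 4096, which is the kind of error the higher-precision path avoids.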
Scott Wolchok	1d3aa08327	Cleanup: use c10::ForceUnroll and constexpr variables in ARM FP16 matrix-vector fast path (#127016 ) Summary: Just straightforward code cleanup in this path. Test Plan: Existing CI, double-checked benchmark_torch_mm didn't regress as per previous diffs in stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127016 Approved by: https://github.com/peterbell10 ghstack dependencies: #126745, #126746, #126793, #126794, #126877	2024-05-24 19:47:50 +00:00
cyy	67d52d7fcb	[caffe2] Remove import_legacy.cpp (#126149 ) I think they are for Caffe2 and should be deleted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126149 Approved by: https://github.com/r-barnes	2024-05-24 19:47:32 +00:00
Joel Schlosser	5e69e11d09	Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946 ) PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`: * `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()` * `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()` CPU impls for these new ATen ops will be added in a follow-up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946 Approved by: https://github.com/davidberard98	2024-05-24 19:16:29 +00:00
Bin Bao	9d4731f952	[AOTI] Disable stack allocation for OSS (#125732 ) Summary: Stack allocation is for certain small CPU models, but its coverage still needs improvement, so default to OFF for OSS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125732 Approved by: https://github.com/chenyang78 ghstack dependencies: #126720, #126801	2024-05-24 19:10:33 +00:00
Bin Bao	72d30aa026	[AOTI] Fix an int array codegen issue (#126801 ) Summary: fixes https://github.com/pytorch/pytorch/issues/126779. When an int array contains symbol expression, we can't declare it with constexpr. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126801 Approved by: https://github.com/chenyang78 ghstack dependencies: #126720	2024-05-24 19:10:33 +00:00
Bin Bao	71f1aebe1f	[AOTI] Add more fallback ops (#126720 ) Summary: These ops are in either unit tests or TorchBench. Fixes https://github.com/pytorch/pytorch/issues/122050 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126720 Approved by: https://github.com/chenyang78	2024-05-24 19:10:33 +00:00
Svetlana Karslioglu	f508cd6e00	Update assigntome job (#127027 ) Updating for the new docathon Pull Request resolved: https://github.com/pytorch/pytorch/pull/127027 Approved by: https://github.com/kit1980	2024-05-24 19:04:51 +00:00
Aaron Gokaslan	3cb16ebf08	[BE]: Update ruff to 0.4.5 (#126979 ) Update ruff to 0.4.5 and addresses some false negatives that have been found in the newer version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126979 Approved by: https://github.com/ezyang	2024-05-24 18:38:35 +00:00
Yifu Wang	4a09117d16	Introduce ProcessGroupCudaP2P (#122163 ) ## Context This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers. The stack contains several components: - `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It additionally maintains a P2P workspace that enables SM-free, one-sided P2P communication, which is needed for optimal micro-pipelining. - `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops. - Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops. To enable the prototype feature: - Set the distributed backend to `cuda_p2p`. - Set `torch._inductor.config._micro_pipeline_tp` to `True`. NOTE: the prototype sets nothing in stone w.r.t. each component's design. The purpose is to have a performant baseline with reasonable design on which each component can be further improved. ## Benchmark Setup: - 8 x H100 (500W) + 3rd gen NVSwitch. - Llama3 8B training w/ torchtitan. - 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purposes. 
Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0 <img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1"> Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn <img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2"> ## This PR `ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA. `ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it. Usage: ``` # Using ProcessGroupCudaP2P dist.init_process_group(backend="cuda_p2p", ...) # Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options pg_options = ProcessGroupCudaP2P.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) # Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options pg_options = ProcessGroupNCCL.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) 
# Using ProcessGroupCudaP2P while specifying both # ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options pg_options = ProcessGroupCudaP2P.Options() pg_options.nccl_options = ProcessGroupNCCL.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) # Down-casting the backend to access p2p buffers for cuda_p2p specific # optimizations if is_cuda_p2p_group(group): backend = get_cuda_p2p_backend(group) if required_p2p_buffer_size > backend.get_buffer_size(): # fallback p2p_buffer = backend.get_p2p_buffer(...) else: # fallback ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163 Approved by: https://github.com/wanchaol	2024-05-24 18:33:18 +00:00
Yidi Wu	01f04230cf	[cond] support torch built in function as subgraph (#126909 ) Fixes https://github.com/pytorch/pytorch/issues/126818. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126909 Approved by: https://github.com/zou3519 ghstack dependencies: #127026	2024-05-24 18:31:43 +00:00
Yidi Wu	2d6d2dbc0b	[dynamo] make callable(nn_module) return True (#127026 ) Before the pr, we have a graph break for `callable(nn_module)`: ```python class M(nn.Module): def forward(self, x): return x.sin() def f(m): return callable(m) res = torch.compile(f, fullgraph=True)(M()) ``` ``` Traceback (most recent call last): File "/data/users/yidi/pytorch/t.py", line 17, in <module> out = torch.compile(f, backend="eager", fullgraph=True)(M()) File "/data/users/yidi/pytorch/torch/_dynamo/eval_frame.py", line 414, in _fn return fn(args, kwargs) File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 1077, in catch_errors return callback(frame, cache_entry, hooks, frame_state, skip=1) File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 456, in _convert_frame_assert return _compile( File "/data/users/yidi/pytorch/torch/_utils_internal.py", line 74, in wrapper_function return function(args, *kwargs) File "/home/yidi/.conda/envs/pytorch/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 799, in _compile guarded_code = compile_inner(code, one_graph, hooks, transform) File "/data/users/yidi/pytorch/torch/_dynamo/utils.py", line 210, in time_wrapper r = func(args, *kwargs) File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 618, in compile_inner out_code = transform_code_object(code, transform) File "/data/users/yidi/pytorch/torch/_dynamo/bytecode_transformation.py", line 1167, in transform_code_object transformations(instructions, code_options) File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 177, in _fn return fn(args, **kwargs) File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 564, in transform tracer.run() File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 2244, in run super().run() File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 886, in run while 
self.step(): File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 801, in step self.dispatch_table[inst.opcode](self, inst) File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 496, in wrapper return inner_fn(self, inst) File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 1255, in CALL_FUNCTION self.call_function(fn, args, {}) File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 739, in call_function self.push(fn.call_function(self, args, kwargs)) File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 948, in call_function return handler(tx, args, kwargs) File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 711, in <lambda> return lambda tx, args, kwargs: obj.call_function( File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 948, in call_function return handler(tx, args, kwargs) File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 835, in builtin_dipatch unimplemented(error_msg) File "/data/users/yidi/pytorch/torch/_dynamo/exc.py", line 216, in unimplemented raise Unsupported(msg) torch._dynamo.exc.Unsupported: builtin: callable [<class 'torch._dynamo.variables.nn_module.NNModuleVariable'>] False ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127026 Approved by: https://github.com/jansel	2024-05-24 18:31:43 +00:00
cyy	f2c6fddbe1	Remove unnecessary const_cast and other fixes (#127054 ) Removes unnecessary const casts and copies. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127054 Approved by: https://github.com/Skylion007	2024-05-24 18:05:06 +00:00
Andrew Gu	9117779b0a	[FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024 ) This PR shows that we can use FSDP solely for CPU offloading when composing with N-way TP. Each FSDP mesh is just 1 rank. This was motivated from an ask on Slack :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127024 Approved by: https://github.com/weifengpy, https://github.com/wanchaol ghstack dependencies: #127004	2024-05-24 17:09:12 +00:00
Mikayla Gawarecki	87f79af24d	Fix map_location for wrapper subclass and device tensors that go through numpy (#126728 ) Fixes https://github.com/pytorch/pytorch/issues/124418 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126728 Approved by: https://github.com/albanD	2024-05-24 16:39:30 +00:00
Nikita Shulga	4ff9113e3d	[MPS] Add `_weight_int8pack_mm` tests (#127041 ) As well as extend the test to cover MV cases (where A matrix is 1xM) Limit int8 op testing to 32x32 matrix sizes for now Pull Request resolved: https://github.com/pytorch/pytorch/pull/127041 Approved by: https://github.com/larryliu0820, https://github.com/manuelcandales	2024-05-24 16:08:06 +00:00
Nikita Shulga	194950c0ca	Default ThreadPool size to number of physical cores (#125963 ) TODO: Some benchmarks Pull Request resolved: https://github.com/pytorch/pytorch/pull/125963 Approved by: https://github.com/janeyx99, https://github.com/Skylion007, https://github.com/gajjanag, https://github.com/jgong5	2024-05-24 16:06:48 +00:00
PyTorch MergeBot	5ae9daa4a2	Revert "[AOTI] support freezing for MKLDNN (#124350 )" This reverts commit 654afb6f3ae3ddbd926a753f9af95a6f6e22131c. Reverted https://github.com/pytorch/pytorch/pull/124350 on behalf of https://github.com/clee2000 due to Seems to have broken inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_freezing_non_abi_compatible_cpu `654afb6f3a` https://github.com/pytorch/pytorch/actions/runs/9224838183/job/25382780192 ([comment](https://github.com/pytorch/pytorch/pull/124350#issuecomment-2129889809))	2024-05-24 16:03:07 +00:00
Eli Simhayev	2ac739cc80	[DOCS] Fixed KLDiv example (#126857 ) Small import fix to make the example run Pull Request resolved: https://github.com/pytorch/pytorch/pull/126857 Approved by: https://github.com/albanD	2024-05-24 15:39:50 +00:00
Shunting Zhang	4105f91cfc	[inductor] fix an assertion for node debug str (#127021 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127021 Approved by: https://github.com/aorenste	2024-05-24 13:37:05 +00:00
Wu, Chunyuan	654afb6f3a	[AOTI] support freezing for MKLDNN (#124350 ) ## Description Fixes https://github.com/pytorch/pytorch/issues/114450. This PR builds upon the work from @imzhuhl done in https://github.com/pytorch/pytorch/pull/114451. This PR requires https://github.com/pytorch/pytorch/pull/122472 to land first. We leverage the serialization and deserialization API from oneDNN v3.4.1 to save the opaque MKLDNN tensor during the compilation and restore the opaque tensor when loading the compiled .so. ideep version is updated so that we won't break any pipeline even if third_party/ideep is not updated at the same time. ### Test plan: ```sh python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_freezing_non_abi_compatible_cpu python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_conv_freezing_non_abi_compatible_cpu python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_deconv_freezing_non_abi_compatible_cpu python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_linear_freezing_non_abi_compatible_cpu ``` ### TODOs in follow-up PRs 1. We found that using `AOTI_TORCH_CHECK` causes a performance drop on several models (`DistillGPT2`, `MBartForConditionalGeneration`, `T5ForConditionalGeneration`, `T5Small`) compared with JIT Inductor, which uses `TORCH_CHECK`. How to address this may need further discussion (`AOTI_TORCH_CHECK` is introduced in https://github.com/pytorch/pytorch/pull/119220). 2. Freezing in non-ABI compatible mode will work with the support in this PR. For ABI compatible mode, we first need to address this issue: `AssertionError: None, i.e. optional output is not supported`. `6c4f43f826/torch/_inductor/codegen/cpp_wrapper_cpu.py (L2023-L2024)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124350 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-05-24 13:34:04 +00:00
Jiong Gong	43baabe9b9	[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545 ) As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows: 1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling. 2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops. 3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen. 4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545 Approved by: https://github.com/jansel ghstack dependencies: #124021, #126019, #126068	2024-05-24 12:29:06 +00:00
Jiong Gong	4aa43d11f3	[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 ) As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068 Approved by: https://github.com/jansel ghstack dependencies: #124021, #126019	2024-05-24 12:24:35 +00:00
Jiong Gong	56c412d906	[inductor][cpp] epilogue support for gemm template (#126019 ) As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019 Approved by: https://github.com/jansel ghstack dependencies: #124021	2024-05-24 12:14:12 +00:00
rzou	dd64ca2a02	Inductor respects strides for custom ops by default (#126986 ) Previously, the default was that Inductor did not respect strides for all (builtin and custom) ops unless the op has a "needs_fixed_stride_order" tag on it. This PR changes it so that: - inductor doesn't respect strides for builtin ops. To change the behavior, one can add the "needs_fixed_stride_order" tag - inductor does respect strides for custom ops. To change the behavior, one can add the "does_not_need_fixed_stride_order" tag Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/126986 Approved by: https://github.com/ezyang, https://github.com/albanD	2024-05-24 11:11:18 +00:00
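The default described in the commit above (#126986) can be summarized as a small decision rule. The sketch below is only a model of that rule in plain Python: the function `respects_strides` is hypothetical and is not Inductor's actual code. The tag names are the ones quoted in the commit message, and the precedence when both tags are present is an assumption, since the message does not cover that case.

```python
def respects_strides(is_custom_op, tags=frozenset()):
    """Model of the stride default described in the commit message.

    `is_custom_op` and `tags` are hypothetical inputs for illustration;
    Inductor's real lowering code is structured differently.
    """
    if "needs_fixed_stride_order" in tags:
        # Opt in: strides are respected for builtin and custom ops alike.
        # (Assumed to win if both tags are present.)
        return True
    if "does_not_need_fixed_stride_order" in tags:
        # Opt out: lets a custom op drop the new default.
        return False
    # No tag: custom ops respect strides, builtin ops do not.
    return is_custom_op


builtin_default = respects_strides(is_custom_op=False)
custom_default = respects_strides(is_custom_op=True)
builtin_opt_in = respects_strides(False, {"needs_fixed_stride_order"})
custom_opt_out = respects_strides(True, {"does_not_need_fixed_stride_order"})
```

The four calls at the end cover the cases the commit message lists: the two untagged defaults and the two tag overrides.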
Aaron Orenstein	f14cdc570d	Fix to #126656 (#127050 ) Fix failure from fbcode - in the case of a foreach node the fake `group` needs to be hashable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127050 Approved by: https://github.com/DanilBaibak ghstack dependencies: #126656	2024-05-24 10:56:53 +00:00
PyTorch MergeBot	47c976b904	Revert "[AOTI] Add more fallback ops (#126720 )" This reverts commit 19cd4484ec8449b8c5ebf46be1f8f2fcbace8c6c. Reverted https://github.com/pytorch/pytorch/pull/126720 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126720#issuecomment-2129011751))	2024-05-24 09:07:07 +00:00
PyTorch MergeBot	f749c5def8	Revert "[AOTI] Fix an int array codegen issue (#126801 )" This reverts commit ff617ab6c8f6f67ae912fbcd45a913a89e19effb. Reverted https://github.com/pytorch/pytorch/pull/126801 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126720#issuecomment-2129011751))	2024-05-24 09:07:07 +00:00
PyTorch MergeBot	fd9cdeed19	Revert "[AOTI] Disable stack allocation for OSS (#125732 )" This reverts commit 599e684ad6f34dd069eff8611f45e25b7695a339. Reverted https://github.com/pytorch/pytorch/pull/125732 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126720#issuecomment-2129011751))	2024-05-24 09:07:07 +00:00
Richard Barnes	f95dbc1276	Remove more of caffe2 (#126705 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/126705 Approved by: https://github.com/malfet	2024-05-24 06:53:08 +00:00
Jiong Gong	0d1e228550	[inductor][cpp] GEMM template (infra and fp32) (#124021 ) This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info. 1. Cpp template infrastructure Template abstractions similar to the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`. The MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates. 2. Initial FP32 gemm template This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction. 3. Correctness and performance The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static shape and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements with follow-up PRs including the optimizations in kernels as well as fusions. The perf gains are only observed from a select number of models compared to the ATen kernels which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details. 
Static shapes \| Benchmark \| torchbench \| huggingface \| timm_models \| \|------------\|-------------\|--------------\|--------------\| \| Multi-threaded (baseline) \| 1.47x \| 1.36x \| 1.91x \| \| Multi-threaded (max-autotune) \| 1.47x \| 1.36x \| 1.92x \| \| Single-threaded (baseline) \| 1.56x \| 1.19x \| 1.51x \| \| Single-threaded (max-autotune) \| 1.56x \| 1.19x \| 1.52x \| Key models being sped up: drq: 1.14x soft_act: 1.12 cait_m36_384: 1.18x Dynamic shapes \| Benchmark \| torchbench \| huggingface \| timm_models \| \| --- \| --- \| --- \| --- \| \| Multi-threaded (baseline) \| 1.43x \| 1.28x \| 1.85x \| \| Multi-threaded (max-autotune) \| 1.47x \| 1.28x \| 1.85x \| \| Single-threaded (baseline) \| 1.55x \| 1.20x \| 1.51x \| \| Single-threaded (max-autotune) \| 1.56x \| 1.19x \| 1.53x \| Key models being sped up: BERT_pytorch: 1.22x pyhpc_turbulent: 1.13x soft_actor_critic: 1.77x BlenderbotForCausalLM: 1.09x cait_m36_384: 1.17x Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021 Approved by: https://github.com/jansel	2024-05-24 06:26:33 +00:00
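The layering described in the GEMM template commit above (an outer template that splits the problem into cache-sized blocks, and a micro-kernel that computes each block) can be sketched in a few lines. This is a minimal pure-Python illustration, not the generated C++: `blocked_gemm`, `micro_gemm`, and the block sizes `Mc`, `Nc`, `Kc` are hypothetical names, and thread decomposition, weight prepacking, and vectorization are all omitted.

```python
def micro_gemm(A, B, C, m0, m1, n0, n1, k0, k1):
    """Stand-in for a micro-kernel: C[m0:m1, n0:n1] += A[m0:m1, k0:k1] @ B[k0:k1, n0:n1]."""
    for i in range(m0, m1):
        for j in range(n0, n1):
            acc = C[i][j]
            for k in range(k0, k1):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def blocked_gemm(A, B, Mc=2, Nc=2, Kc=2):
    """Cache-blocked matmul: walk (M, N, K) in tiles and call the micro-kernel per tile."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, Mc):
        for n0 in range(0, N, Nc):
            for k0 in range(0, K, Kc):
                micro_gemm(
                    A, B, C,
                    m0, min(m0 + Mc, M),
                    n0, min(n0 + Nc, N),
                    k0, min(k0 + Kc, K),
                )
    return C


def naive_gemm(A, B):
    """Reference triple loop used to check the blocked version."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
B = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
C_blocked = blocked_gemm(A, B)
C_naive = naive_gemm(A, B)
```

Separating the tile loop from the per-tile kernel mirrors the split the commit describes, where blocking decisions live in the template and architecture-specific work lives in the micro-kernel.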
Scott Wolchok	505b8ceaa2	Double registers per iteration in FP32-arithmetic FP16 ARM gemv kernel (#126877 ) Summary: I found that doubling this significantly improved performance, but doubling again did not, so I stopped here. Test Plan: CI Benchmarked with llm_experiments repo as previously in stack; relevant data: before: trans_b torch.float16 1396.11 usec (4100) trans_b torch.float16 1399.54 usec (4104) after: trans_b torch.float16 1096.00 usec (4100) trans_b torch.float16 1093.47 usec (4104) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126877 Approved by: https://github.com/malfet ghstack dependencies: #126745, #126746, #126793, #126794	2024-05-24 05:57:09 +00:00
Scott Wolchok	e8fa0f10c5	Quadruple registers per iteration in ARM64 FP16 kernel (#126794 ) The machine has plenty of registers we weren't using. This looks like it might improve performance a couple percent, though there is noise so I'm not certain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126794 Approved by: https://github.com/malfet ghstack dependencies: #126745, #126746, #126793	2024-05-24 05:57:09 +00:00
daitian1995	f6366454db	Add privateuse1 in FSDP's sharded grad scaler (#126971 ) 1. add privateuse1 in FSDP's sharded grad scaler 2. support found_inf copy for more devices Pull Request resolved: https://github.com/pytorch/pytorch/pull/126971 Approved by: https://github.com/awgu, https://github.com/weifengpy	2024-05-24 05:54:25 +00:00
drisspg	2f6954c7c3	Update the modification api (#127035 ) # Summary Updates the modification jinja template's API to specify the output_name for the fixed buffer. Also updates flex-attention's usage to make the algorithm clearer and more closely aligned with the vmap impl. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127035 Approved by: https://github.com/Chillee	2024-05-24 04:45:34 +00:00
Andrew Gu	894efcd0e9	[DTensor] Supported simple replicate strategy for SVD (#127004 ) This PR adds a simple strategy to always replicate for `torch.linalg.svd()`. This is to help unblock some GaLore exploration. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127004 Approved by: https://github.com/wanchaol	2024-05-24 04:34:43 +00:00
Aaron Orenstein	70dc59c55f	Fix perf regression caused by #122074 (#126996 ) The original change was about 9.5% slower than before #122074. This improves it to be only about 1.4% slower. Also touched up some unrelated nits that the linter complained about. Fixes #126293 Ran torchbench 3 times on each change. Perf values before (stable), after (fix), and with #122074 backed out (backout): ``` ../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_isoneutral_mixing amp first dynamic cpp stable: 43.948x 45.754x 44.906x fix: 47.505x 49.987x 47.493x backout: 48.243x 48.199x 48.192x ../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_equation_of_state amp first static default stable: 15.224x 13.286x 15.354x fix: 16.402x 16.370x 16.183x backout: 16.554x 16.675x 16.787x ../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench lennard_jones float32 first static default stable: 1.712x 1.651x 1.640x fix: 1.804x 1.798x 1.792x backout: 1.864x 1.824x 1.836x ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126996 Approved by: https://github.com/jansel	2024-05-24 04:27:22 +00:00
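The 9.5% and 1.4% figures quoted for #126996 can be roughly reproduced from the listed runs. The sketch below assumes one plausible aggregation (mean of the three runs per benchmark, slowdown measured against the backout numbers, then averaged across the three benchmarks); the commit message does not say how the author actually computed them, and the helper names are hypothetical.

```python
from statistics import mean

# Speedup values copied from the commit message for #126996 (three runs each).
RUNS = {
    "pyhpc_isoneutral_mixing": {
        "stable": [43.948, 45.754, 44.906],
        "fix": [47.505, 49.987, 47.493],
        "backout": [48.243, 48.199, 48.192],
    },
    "pyhpc_equation_of_state": {
        "stable": [15.224, 13.286, 15.354],
        "fix": [16.402, 16.370, 16.183],
        "backout": [16.554, 16.675, 16.787],
    },
    "lennard_jones": {
        "stable": [1.712, 1.651, 1.640],
        "fix": [1.804, 1.798, 1.792],
        "backout": [1.864, 1.824, 1.836],
    },
}


def relative_slowdown(variant, baseline):
    """Fraction by which mean(variant) falls short of mean(baseline)."""
    return 1.0 - mean(variant) / mean(baseline)


def average_slowdown(variant_name):
    """Average the per-benchmark slowdown of a variant against the backout runs."""
    return mean(
        relative_slowdown(runs[variant_name], runs["backout"])
        for runs in RUNS.values()
    )


stable_slowdown = average_slowdown("stable")
fix_slowdown = average_slowdown("fix")
print(f"stable vs backout: {stable_slowdown:.1%} slower")
print(f"fix vs backout:    {fix_slowdown:.1%} slower")
```

Under this assumed aggregation the two values land close to the figures in the commit message, with most of the remaining gap coming from the two pyhpc/lennard_jones static runs.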
Angela Yi	cb6ef68caa	Propagate tokens in aotautograd (#127028 ) Test Plan: `buck run mode/dev-nosan //aimp/experimental/pt2:pt2_export -- --model-entity-id 938593492 --output /tmp/938593492.zip --use-torchrec-eager-mp --use-manifold` Differential Revision: D57750072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127028 Approved by: https://github.com/tugsbayasgalan	2024-05-24 03:23:17 +00:00
PyTorch MergeBot	99a11efc8a	Revert "Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946 )" This reverts commit e2f081837f4276c1a6a37739bd28157f62004a06. Reverted https://github.com/pytorch/pytorch/pull/125946 on behalf of https://github.com/clee2000 due to I think dr ci is wrong and the windows build failure is real `e2f081837f` https://github.com/pytorch/pytorch/actions/runs/9216826622/job/25357819877 ([comment](https://github.com/pytorch/pytorch/pull/125946#issuecomment-2128388126))	2024-05-24 02:37:46 +00:00
drisspg	cfb374dc73	[BE] Create grad check util (#126991 ) # Summary Add small utility func for deciding if we should compute LSE and update to also check for gradMode Pull Request resolved: https://github.com/pytorch/pytorch/pull/126991 Approved by: https://github.com/cpuhrsch	2024-05-24 02:36:00 +00:00
Anshul Sinha	27594be3ed	[dtensor][be] remove repeated test in test_comm_mode.py (#127029 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127029 Approved by: https://github.com/XilunWu ghstack dependencies: #127025	2024-05-24 01:42:13 +00:00
Anshul Sinha	89c638f9a5	[dtensor][debug] add all_reduce_coalesced tracing to CommDebugMode (#127025 ) Summary Added all_reduce_coalesced tracing to CommDebugMode and added test case to test_comm_mode test suite. Test Plan pytest test/distributed/_tensor/debug/test_comm_mode.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127025 Approved by: https://github.com/XilunWu	2024-05-24 01:42:13 +00:00
laithsakka	575cb617db	Add compile time profiler for non fbcode targets (#126904 ) This is a tool that allow profiling compile time using strobelight profiler, its a meta only tool. but works on non-fbcode targets. A follow up diff will unify this with caffe2/fb/strobelight/compile_time_profiler.py. example test: ``` run python tools/strobelight/examples/compile_time_profile_example.py ``` ``` python torch/utils/_strobelight/examples/compile_time_profile_example.py strobelight_compile_time_profiler, line 61, 2024-05-23 10:49:28,101, INFO: compile time strobelight profiling enabled strobelight_compile_time_profiler, line 93, 2024-05-23 10:49:28,102, INFO: Unique sample tag for this run is: 2024-05-23-10:49:282334638devvm4561.ash0.facebook.com strobelight_compile_time_profiler, line 94, 2024-05-23 10:49:28,102, INFO: You can use the following link to access the strobelight profile at the end of the run: https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22purposes%22%3A[]%2C%22end%22%3A%22now%22%2C%22start%22%3A%22-30%20days%22%2C%22filterMode%22%3A%22DEFAULT%22%2C%22modifiers%22%3A[]%2C%22sampleCols%22%3A[]%2C%22cols%22%3A[%22namespace_id%22%2C%22namespace_process_id%22]%2C%22derivedCols%22%3A[]%2C%22mappedCols%22%3A[]%2C%22enumCols%22%3A[]%2C%22return_remainder%22%3Afalse%2C%22should_pivot%22%3Afalse%2C%22is_timeseries%22%3Afalse%2C%22hideEmptyColumns%22%3Afalse%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22compare%22%3A%22none%22%2C%22samplingRatio%22%3A%221%22%2C%22metric%22%3A%22count%22%2C%22aggregation_field%22%3A%22async_stack_complete%22%2C%22top%22%3A10000%2C%22aggregateList%22%3A[]%2C%22param_dimensions%22%3A[%7B%22dim%22%3A%22py_async_stack%22%2C%22op%22%3A%22edge%22%2C%22param%22%3A%220%22%2C%22anchor%22%3A%220%22%7D]%2C%22order%22%3A%22weight%22%2C%22order_desc%22%3Atrue%2C%22constraints%22%3A[[%7B%22column%22%3A%22sample_tags%22%2C%22op%22%3A%22all%22%2C%22value%22%3A[%22[%5C%222024-05-23-10:49:282
334638devvm4561.ash0.facebook.com%5C%22]%22]%7D]]%2C%22c_constraints%22%3A[[]]%2C%22b_constraints%22%3A[[]]%2C%22ignoreGroupByInComparison%22%3Afalse%7D&view=GraphProfilerView&&normalized=1712358002&pool=uber strobelight_function_profiler, line 241, 2024-05-23 10:49:34,943, INFO: strobelight run id is: 3507039740348330 strobelight_function_profiler, line 243, 2024-05-23 10:50:00,907, INFO: strobelight profiling running strobelight_function_profiler, line 224, 2024-05-23 10:50:02,741, INFO: strobelight profiling stopped strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Total samples: 7 strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/75cxdro3 strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qsgydsee strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:06,174, INFO: 1 strobelight success runs out of 1 non-recursive compilation events. strobelight_function_profiler, line 241, 2024-05-23 10:50:08,137, INFO: strobelight run id is: 8721740011604497 strobelight_function_profiler, line 243, 2024-05-23 10:50:34,801, INFO: strobelight profiling running strobelight_function_profiler, line 224, 2024-05-23 10:50:36,803, INFO: strobelight profiling stopped strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Total samples: 3 strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qmi2ucwp strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/7fjkhs9i strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:41,289, INFO: 2 strobelight success runs out of 2 non-recursive compilation events. 
strobelight_function_profiler, line 241, 2024-05-23 10:50:43,597, INFO: strobelight run id is: 1932476082259558 strobelight_function_profiler, line 243, 2024-05-23 10:51:09,791, INFO: strobelight profiling running strobelight_function_profiler, line 224, 2024-05-23 10:51:11,883, INFO: strobelight profiling stopped strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Total samples: 3 strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/vy1ujxec strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/2xgadviv strobelight_compile_time_profiler, line 120, 2024-05-23 10:51:16,219, INFO: 3 strobelight success runs out of 3 non-recursive compilation events. ``` or pass TORCH_COMPILE_STROBELIGHT=TRUE for any torch compile python program. ex running on XLNetLMHeadModel. ``` TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 time python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp --only XLNetLMHeadModel ``` result: Pull Request resolved: https://github.com/pytorch/pytorch/pull/126904 Approved by: https://github.com/aorenste ghstack dependencies: #126693	2024-05-24 01:39:40 +00:00
Joel Schlosser	e2f081837f	Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946 ) PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`: * `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()` * `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()` CPU impls for these new ATen ops will be added in a follow-up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946 Approved by: https://github.com/davidberard98	2024-05-24 00:42:59 +00:00
Richard Barnes	3f5b59eef4	[codemod] c10::optional -> std::optional in caffe2/aten/src/ATen/DeviceGuard.h +117 (#126901 ) Summary: Generated with ``` fbgs -f '.*\.(cpp\|cxx\|cc\|h\|hpp\|cu\|cuh)$' c10::optional -l \| perl -pe 's/^fbsource.fbcode.//' \| grep -v executorch \| xargs -n 50 perl -pi -e 's/c10::optional/std::optional/g' ``` - If you approve of this diff, please use the "Accept & Ship" button :-) (117 files modified.) Test Plan: Sandcastle Reviewed By: palmje Pull Request resolved: https://github.com/pytorch/pytorch/pull/126901 Approved by: https://github.com/Skylion007, https://github.com/eqy	2024-05-24 00:26:15 +00:00
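The perl one-liner in #126901 is a plain textual substitution. As an illustration only, the same rewrite on a single string could look like this in Python; `rewrite_optional` is a hypothetical helper, and the file discovery done by `fbgs` and `xargs` is not reproduced.

```python
import re


def rewrite_optional(source: str) -> str:
    """Replace every literal occurrence of c10::optional with std::optional."""
    return re.sub(r"c10::optional", "std::optional", source)


sample = "c10::optional<int> f(const c10::optional<Tensor>& t);"
rewritten = rewrite_optional(sample)
print(rewritten)
```

Like the original command, this is purely textual: it does not parse C++, so it would also rewrite matches inside comments or string literals.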
cyy	95e5c994f9	[Submodule] Clear USE_QNNPACK build option (#126941 ) Following the removal of QNNPACK third-party module #126657, we can clear more build system code. Also third_party/neon2sse was removed because it is not used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126941 Approved by: https://github.com/ezyang	2024-05-24 00:12:56 +00:00
PyTorch MergeBot	dfabae5b89	Revert "[pipelining] Add grad test for interleaved schedules (#126931 )" This reverts commit abf6d4e6bc1a9a0e08bfc2204560ca7858fa90cd. Reverted https://github.com/pytorch/pytorch/pull/126931 on behalf of https://github.com/clee2000 due to newly added test fails distributed/pipelining/test_schedule.py::ScheduleTest::test_grad_with_manual_interleaved_ScheduleClass0 `abf6d4e6bc` https://github.com/pytorch/pytorch/actions/runs/9214413308/job/25352507591, pull workflow failed on startup on PR, so no distributed tests ran at all ([comment](https://github.com/pytorch/pytorch/pull/126931#issuecomment-2128228496))	2024-05-23 23:51:29 +00:00
Pian Pawakapan	2db13633e7	[export] disable forced specializations, even when solvable with single var (#126925 ) Summary: Previously https://github.com/pytorch/pytorch/pull/124949 added the ability to disable forced specializations on dynamic shapes for export, keeping dynamism for complex guards instead of specializing, allowing unsoundness by having the user fail at runtime. It avoided disabling one case: single-variable equality guards, where a variable is specified as dynamic but can be solvable for a concrete value, suggesting the correct behavior is specialization. For example, guard : Eq(s0 // 4, 400) suggests s0 should specialize to 1600. In debugging, some users (e.g. APS) would like to keep this dynamic, and defer to failing at runtime instead. This PR adds this, so now all forced specializations should be turned off. Mostly this should be used for debugging, since it produces unsoundness, and lets the user proceed with (probably) incorrect dynamism. Test Plan: export tests Differential Revision: D57698601 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126925 Approved by: https://github.com/angelayi	2024-05-23 23:43:30 +00:00
James Wu	6eac3f45c7	Add basic sanity checks for graph ops to cache key (#124745 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124745 Approved by: https://github.com/bdhirsh	2024-05-23 23:37:43 +00:00
Aart Bik	ff82e2e7cf	[traced-graph][sparse] propagate sparsity metadata into traced graph (#117907 ) Propagate sparsity metadata from sparse tensors of torch.sparse into the traced graph representation (which would be useful for a JIT backend that supports a "sparse compiler"). This is a first careful attempt, since the actual "meta" feature still seems incomplete for coo and completely lacking for csr/csc/bsr/bsc. For background see forum postings (with examples): https://discuss.pytorch.org/t/connecting-pytorch-sparse-tensors-with-mlir/195145 https://dev-discuss.pytorch.org/t/connecting-pytorch-sparse-tensors-with-mlir/1803 And feature request: https://github.com/pytorch/pytorch/issues/117188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117907 Approved by: https://github.com/pearu, https://github.com/ezyang	2024-05-23 22:46:46 +00:00
Yueming Hao	93ba5e7291	Fix typo for input (#126981 ) The variable name should be `cloned_inputs` rather than `clone_inputs`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126981 Approved by: https://github.com/xuzhao9	2024-05-23 22:08:14 +00:00
William Wen	d11e44c0d0	Reset grad state across unittests (#126345 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126345 Approved by: https://github.com/ezyang	2024-05-23 21:16:39 +00:00
Catherine Lee	a31a60d85b	Change run_test.py arg parsing to handle additional args better (#126709 ) Do not inherit parser from common_utils * I don't think we use any variables in run_test that depend on those, and I think all tests except doctests run in a subprocess so they will parse the args in common_utils and set the variables. I don't think doctests wants any of those variables? Parse known args, add the extra args as extra, pass the extra ones along to the subprocess Removes the first instance of `--` I think I will miss run_test telling me if an arg is valid or not Pull Request resolved: https://github.com/pytorch/pytorch/pull/126709 Approved by: https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/Flamefire	2024-05-23 21:08:12 +00:00
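The approach described in #126709 (parse known args, keep the rest as extras, drop the first `--`, forward the extras to the subprocess) can be sketched with the standard library. This is not the actual `run_test.py` code: `split_args` and the `--verbose` flag are hypothetical, and only `argparse.ArgumentParser.parse_known_args` is a real API.

```python
import argparse


def split_args(argv):
    """Split argv into options this runner knows and extras to forward."""
    parser = argparse.ArgumentParser(description="hypothetical test runner")
    parser.add_argument("--verbose", action="store_true")
    # parse_known_args returns unrecognized arguments instead of erroring out.
    known, extra = parser.parse_known_args(argv)
    # Drop only the first literal "--" separator, if one is present.
    if "--" in extra:
        extra.remove("--")
    return known, extra


known, extra = split_args(["--verbose", "-k", "test_foo", "--some-flag=1"])
print(known.verbose, extra)
```

The trade-off matches the one noted in the commit message: unknown flags are forwarded silently, so the runner no longer reports whether an argument is valid.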
Catherine Lee	09a73da190	Downgrade requests to 2.31.0 for ios and android (#126989 ) Ex https://github.com/pytorch/pytorch/actions/runs/9211850483/job/25342181353 https://github.com/pytorch/pytorch/actions/runs/9211850483/job/25342182105 2.32.0 isn't on the conda channels yet? Is there a way to add them? If not here's a PR to downgrade Pull Request resolved: https://github.com/pytorch/pytorch/pull/126989 Approved by: https://github.com/atalman, https://github.com/malfet	2024-05-23 21:02:50 +00:00
wz337	0d2ac9782b	[FSDP1] Update docstring to include device_mesh arg (#126589 ) Fixes #126548 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126589 Approved by: https://github.com/wanchaol	2024-05-23 20:40:48 +00:00
Wei Wang	0902929d58	[CUDA] [CI]: Enable CUDA 12.4 CI (#121956 ) Reference PR: https://github.com/pytorch/pytorch/pull/93406 Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121956 Approved by: https://github.com/atalman	2024-05-23 20:37:47 +00:00
Ke Wen	abf6d4e6bc	[pipelining] Add grad test for interleaved schedules (#126931 ) Added `test_grad_with_manual_interleaved`: - Model: `MultiMLP` - Tested schedules: Interleaved1F1B, LoopedBFS - Two stages per rank ``` Rank 0 stages: [0, 2] Rank 1 stages: [1, 3] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126931 Approved by: https://github.com/wconstab ghstack dependencies: #126812, #126721, #126735, #126927	2024-05-23 20:26:08 +00:00
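The stage layout listed in #126931 (rank 0 holds stages 0 and 2, rank 1 holds stages 1 and 3) is consistent with a round-robin assignment. The sketch below assumes that rule; `stages_for_rank` is a hypothetical helper, not a function from the pipelining package.

```python
def stages_for_rank(rank, world_size, stages_per_rank):
    """Return the global stage indices held by `rank` under round-robin placement."""
    return [rank + i * world_size for i in range(stages_per_rank)]


for rank in range(2):
    print(f"Rank {rank} stages: {stages_for_rank(rank, world_size=2, stages_per_rank=2)}")
```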
Ke Wen	c46b38bc75	[pipelining] Generalize definition of MultiMLP for testing interleaved schedules (#126927 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126927 Approved by: https://github.com/wconstab ghstack dependencies: #126812, #126721, #126735	2024-05-23 20:26:08 +00:00
Will Constable	6b39146b3f	[pipelining] Validate stage input/output shape/dtype (#126732 ) Address the classes of user errors stemming from (possibly) unintentional dynamic shapes usage or mismatch of configuration time and run time data shapes/dtypes. The goal is to ensure a clear error is raised rather than relying on some underlying error to bubble up when a tensor shape is not compatible, or worse, having a silent correctness issue. Classes of shape/dtype errors * (a) error is thrown within the stage-module forward code, but may be hard to understand/trace back to an input issue * (b) silent correctness issue happens inside the stage-module forward, but the expected output shape is still produced * (c) the stage-module produces an output that is locally correct, but not matching the expectation of the following stage, leading to a hang or correctness issue down the line How validation helps Input shape validation - improves debuggability of case (a) - guards against case (b) - only needed on first stage, since subsequent stages use pre-allocated recv buffers that can't change shape/size even if they wanted to Output shape validation - guards against case (c) Validation of first stage input and all stages' outputs inductively verifies all shapes Shape/dtype are most critical as they literally affect the number of bytes on the wire. Strides and other tensor properties may also (?) matter, and the validation function can be adjusted accordingly if needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126732 Approved by: https://github.com/kwen2501	2024-05-23 20:16:06 +00:00
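The validation idea in #126732 can be illustrated without tensors by comparing recorded (shape, dtype) metadata against what arrives at run time. This is a minimal sketch under stated assumptions: `TensorMeta`, `StageValidationError`, and `validate_metadata` are hypothetical names, not the API added by the PR.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class TensorMeta:
    """Hypothetical stand-in for the shape/dtype recorded at configuration time."""
    shape: Tuple[int, ...]
    dtype: str


class StageValidationError(ValueError):
    """Raised when run-time metadata does not match the configured metadata."""


def validate_metadata(kind: str, expected: Sequence[TensorMeta],
                      actual: Sequence[TensorMeta]) -> None:
    """Raise a descriptive error on any count, shape, or dtype mismatch."""
    if len(expected) != len(actual):
        raise StageValidationError(
            f"{kind}: expected {len(expected)} tensors, got {len(actual)}")
    for index, (exp, act) in enumerate(zip(expected, actual)):
        if exp.shape != act.shape:
            raise StageValidationError(
                f"{kind} {index}: expected shape {exp.shape}, got {act.shape}")
        if exp.dtype != act.dtype:
            raise StageValidationError(
                f"{kind} {index}: expected dtype {exp.dtype}, got {act.dtype}")


configured = [TensorMeta((4, 8), "float32")]
validate_metadata("input", configured, [TensorMeta((4, 8), "float32")])  # passes
```

Checking the first stage's inputs and every stage's outputs this way is what lets the shapes be verified inductively, as the commit message describes.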
Edward Z. Yang	9b91c91e64	Don't add to replacements when guard is suppressed (#126210 ) Also improve logging when guards are suppressed Partially addresses https://github.com/pytorch/pytorch/issues/125641 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126210 Approved by: https://github.com/jbschlosser	2024-05-23 20:10:29 +00:00
Richard Zou	f8857cef45	[Reland] Verify types in custom op schemas (#126861 ) Summary: co-dev reland of https://github.com/pytorch/pytorch/pull/124520, which requires the removal of some executorch tests. Before this PR, we didn't check that types in a schema were valid. This is because TorchScript treats unknown types as type variables. This PR checks types in a schema for the TORCH_LIBRARY APIs. To do this, we add an `allow_typevars` flag to parseSchema so that TorchScript can use allow_typevars=True. We also add some error messages for common mistakes (e.g. using int64_t or double in schema). Test Plan: Wait for tests Differential Revision: D57666659 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126861 Approved by: https://github.com/albanD	2024-05-23 19:53:52 +00:00
Ke Wen	c921c5cc77	[c10d] Print certain logs only on head rank of each node (#125432 ) Recently we added the following warning, which is printed on every rank and makes the log a bit verbose. This PR dedups certain logs that are identical across ranks and prints them only on head rank of each node. Resolves https://github.com/pytorch/pytorch/issues/126275 ========================================= [rank0]:[W502 14:06:55.821964708 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 [rank1]:[W502 14:06:57.994276972 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 [rank2]:[W502 14:07:00.353013116 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. 
This constraint has always been present, but this warning has only been added since PyTorch 2.4 [rank3]:[W502 14:07:02.515511670 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125432 Approved by: https://github.com/wconstab	2024-05-23 19:16:11 +00:00
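The deduplication in #125432 lives in the c10d C++ logging path; the Python sketch below only illustrates the idea of emitting a message once per node. It assumes the `LOCAL_RANK` environment variable (as set by launchers such as torchrun) identifies the head rank of a node, and `is_node_head_rank` and `log_once_per_node` are hypothetical helpers.

```python
import os


def is_node_head_rank(environ=None):
    """Treat local rank 0 as the head rank of its node (assumption)."""
    env = os.environ if environ is None else environ
    return int(env.get("LOCAL_RANK", "0")) == 0


def log_once_per_node(message, environ=None, emit=print):
    """Emit `message` only on the head rank; return whether it was emitted."""
    if is_node_head_rank(environ):
        emit(message)
        return True
    return False


log_once_per_node(
    "WARNING: process group has NOT been destroyed before it is being destructed.",
    environ={"LOCAL_RANK": "0"},
)
```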
Jason Ansel	0625f92993	[inductor] Run some tests on correct device (#126943 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126943 Approved by: https://github.com/yanboliang	2024-05-23 18:47:44 +00:00
Scott Wolchok	abf40320dd	remove ax/ay arrays in fp16 ARM matmul kernels (#126793 ) These shouldn't do anything as only two elements are live at once, so we can simplify the code. (I checked assembly for the inner loops in instruments and it seems to be the same.) Differential Revision: [D57732738](https://our.internmc.facebook.com/intern/diff/D57732738) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126793 Approved by: https://github.com/malfet ghstack dependencies: #126745, #126746	2024-05-23 18:42:45 +00:00
Scott Wolchok	5dcf3d0f9e	use arith-by-dot-products approach for fp32 accumulation in fp16 matmul (#126746 ) Summary: The faster fp16-native kernel is gated off by default. Let's give people better performance in the default case. Test Plan: CI benchmarked matmul of size 4100x4100x1 and 4104x4104x1 using https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py (4100 % 32 = 4100 % 8 = 4). Relevant timing numbers without FP16 reduction (which then uses this kernel): after: trans_b torch.float16 1396.11 usec (4100) trans_b torch.float16 1399.54 usec (4104) before: trans_b torch.float16 1840.79 usec (4100) trans_b torch.float16 1786.67 usec (4104) Differential Revision: [D57732736](https://our.internmc.facebook.com/intern/diff/D57732736) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126746 Approved by: https://github.com/malfet ghstack dependencies: #126745	2024-05-23 18:42:45 +00:00
Scott Wolchok	fd4fd24080	add tail fixup for fp16 gemv transposed fast path (#126745 ) Summary: We previously had restrictive gating for the fp16 kernel; now it supports arbitrary m & n. Test Plan: 1) ran test coverage added in #126700, passes 2) benchmarked matmul of size 4100x4100x1 and 4104x4104x1 using https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py (4100 % 32 = 4100 % 8 = 4). Relevant timing numbers with FP16 reduction enabled (which gates this kernel): after: trans_b torch.float16 716.42 usec (4100) trans_b torch.float16 711.10 usec (4104) Before: trans_b torch.float16 1808.66 usec (4100) trans_b torch.float16 1083.18 usec (4104) Differential Revision: [D57732737](https://our.internmc.facebook.com/intern/diff/D57732737) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126745 Approved by: https://github.com/malfet	2024-05-23 18:42:35 +00:00
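A "tail fixup" in the sense of #126745 means running the blocked fast path over the part of each row that is a multiple of the block size and handling the leftover elements separately. The sketch below shows that structure in plain Python; the real kernel is vectorized ARM C++ and is not reproduced here, and `gemv_transposed` with its `block=8` default is a hypothetical illustration.

```python
def gemv_transposed(matrix_rows, vector, block=8):
    """Dot each row with `vector`: blocked main loop plus a scalar tail."""
    n = len(vector)
    main_end = n - (n % block)  # largest multiple of `block` that fits in n
    result = []
    for row in matrix_rows:
        acc = 0.0
        # Main loop: whole blocks only (stands in for the vectorized path).
        for start in range(0, main_end, block):
            acc += sum(row[start + k] * vector[start + k] for k in range(block))
        # Tail fixup: the remaining n % block elements.
        for k in range(main_end, n):
            acc += row[k] * vector[k]
        result.append(acc)
    return result


rows = [[1.0] * 12, [float(i) for i in range(12)]]  # 12 % 8 == 4 leftover elements
vec = [2.0] * 12
print(gemv_transposed(rows, vec))
```

The 4100-sized benchmark exercises the tail because 4100 leaves a remainder of 4 for both block sizes mentioned in the commit, while 4104 is a multiple of 8.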
PyTorch MergeBot	b36e390b6c	Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814 )" This reverts commit eb41ed5d90e946e62dd664d7037ebbb021baf33e. Reverted https://github.com/pytorch/pytorch/pull/126814 on behalf of https://github.com/mikaylagawarecki due to broke xla ci ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127719337))	2024-05-23 17:43:06 +00:00
PyTorch MergeBot	6a06d36296	Revert "Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819 )" This reverts commit ab61309ab8f6452975021994a6d4a102d55feba8. Reverted https://github.com/pytorch/pytorch/pull/126819 on behalf of https://github.com/mikaylagawarecki due to broke xla ci ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127719337))	2024-05-23 17:43:06 +00:00
Jiashen Cao	041e8d73fd	Separate non/strict functions in _export (#126718 ) Move non/strict _export to different functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126718 Approved by: https://github.com/angelayi	2024-05-23 17:41:23 +00:00
cyy	e5db6758c8	[BE]: Use make_unique (#126966 ) Adds make_unique in places Pull Request resolved: https://github.com/pytorch/pytorch/pull/126966 Approved by: https://github.com/Skylion007	2024-05-23 17:39:48 +00:00
wz337	264155a8d7	[DCP][AC] Add test for apply AC with FSDP1 (#126935 ) Adding test for this cherry pick. https://github.com/pytorch/pytorch/pull/126559/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/126935 Approved by: https://github.com/fegin	2024-05-23 17:35:54 +00:00
Richard Barnes	bbe68a16b9	[codemod][lowrisk] Remove extra semi colon from caffe2/caffe2/core/observer.h (#126976 ) Summary: `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Test Plan: Sandcastle Reviewed By: palmje Differential Revision: D57632765 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126976 Approved by: https://github.com/Skylion007	2024-05-23 17:31:19 +00:00
Sherlock Huang	a63310eebc	TorchScript 2 ExportedProgram Converter (#126920 ) Summary: Initial commit for TorchScript 2 ExportedProgram Converter. TODO: - Improve TorchScript IR coverage - parameter and buffers should be owned by output ExportedProgram - Experiment on conditional op conversion Test Plan: buck2 run mode/dev-nosan fbcode//caffe2/test:test_export -- -r TestConverter Differential Revision: D57694784 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126920 Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan	2024-05-23 17:00:18 +00:00
PyTorch MergeBot	1b29c16e5e	Revert "Introduce ProcessGroupCudaP2P (#122163 )" This reverts commit 2dd269986027ea25c092f769ef8e9524920aaef6. Reverted https://github.com/pytorch/pytorch/pull/122163 on behalf of https://github.com/jithunnair-amd due to This is breaking ROCm distributed CI on trunk ([comment](https://github.com/pytorch/pytorch/pull/122163#issuecomment-2127518473))	2024-05-23 16:06:14 +00:00
Mikayla Gawarecki	ab61309ab8	Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126819 Approved by: https://github.com/albanD ghstack dependencies: #126814	2024-05-23 15:43:32 +00:00
Mikayla Gawarecki	eb41ed5d90	Default XLA to use swap_tensors path in nn.Module._apply (#126814 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814 Approved by: https://github.com/JackCaoG, https://github.com/albanD	2024-05-23 15:43:32 +00:00
Animesh Jain	f0366de414	[dynamo] Support __contains__ on obj.__dict__ (#126922 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126922 Approved by: https://github.com/jansel, https://github.com/yanboliang	2024-05-23 09:01:29 +00:00
PyTorch MergeBot	25b8dbc3e4	Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021 )" This reverts commit 9da7efa6774777890c8e4a713f6d23ea5cfcf6a4. Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))	2024-05-23 08:50:18 +00:00
PyTorch MergeBot	45784cd229	Revert "[inductor][cpp] epilogue support for gemm template (#126019 )" This reverts commit 08f57b4bffe6edfdb016703219744482b4d03e23. Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))	2024-05-23 08:50:18 +00:00
PyTorch MergeBot	926327e8fc	Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 )" This reverts commit 31412cb2f25bda0fe31dae7b2afc88278794cad6. Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))	2024-05-23 08:50:18 +00:00
PyTorch MergeBot	30c9ca0899	Revert "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545 )" This reverts commit 7b6d036c05bd782f5e59bdb353f9e47865e9db50. Reverted https://github.com/pytorch/pytorch/pull/126545 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))	2024-05-23 08:50:18 +00:00
angelayi	da7bf1d588	[export] Fix unflatten with empty nn_module_stack (#126785 ) Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1433418843962989/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/126785 Approved by: https://github.com/tugsbayasgalan	2024-05-23 08:34:25 +00:00
Oguz Ulgen	a6155d23d1	[easy] Delete dead code global (#126903 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126903 Approved by: https://github.com/aorenste ghstack dependencies: #126083	2024-05-23 08:29:29 +00:00
Oguz Ulgen	cc61d03ac9	Do not trace into triton/backends (#126083 ) Fixes #125807 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126083 Approved by: https://github.com/yanboliang, https://github.com/jansel	2024-05-23 08:29:29 +00:00
laithsakka	558c4413ce	add strobelight cli function profiler (#126693 ) This is a Meta-only tool. It allows users to profile any Python function by annotating it with `strobelight`, which uses the strobelight profiler. Example:
```
def fn(x, y, z):
    return x * y + z

# use decorator with default profiler.
@strobelight()
@torch.compile()
def work():
    for i in range(100):
        for j in range(5):
            fn(torch.rand(j, j), torch.rand(j, j), torch.rand(j, j))

work()
```
Test:
```
python torch/utils/strobelight/examples/cli_function_profiler_example.py
strobelight_cli_function_profiler, line 274, 2024-05-20 11:05:41,513, INFO: strobelight run id is: -6222660165281106
strobelight_cli_function_profiler, line 276, 2024-05-20 11:06:08,318, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:06:11,867, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:16,164, INFO: Total samples: 2470
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:16,164, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/oiqmyltg
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:16,164, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/b10x92x0
strobelight_cli_function_profiler, line 274, 2024-05-20 11:06:18,476, INFO: strobelight run id is: -4112659701221677
strobelight_cli_function_profiler, line 276, 2024-05-20 11:06:45,096, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:06:52,366, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:56,222, INFO: Total samples: 1260
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:56,222, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/0yyx6el5
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:56,223, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/8m2by4ea
(base) [lsakka@devvm4561.ash0 /data/users/lsakka/pytorch/pytorch (strobelight2)]$ python torch/profiler/strobelight_cli_function_profiler_example.py
strobelight_cli_function_profiler, line 274, 2024-05-20 11:07:26,701, INFO: strobelight run id is: -2373009368202256
strobelight_cli_function_profiler, line 276, 2024-05-20 11:07:53,477, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:07:56,827, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:01,138, INFO: Total samples: 2372
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:01,138, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/dk797xg9
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:01,138, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/4w6c8vnm
strobelight_cli_function_profiler, line 274, 2024-05-20 11:08:03,235, INFO: strobelight run id is: -1919086123693716
strobelight_cli_function_profiler, line 276, 2024-05-20 11:08:29,848, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:08:37,233, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:41,138, INFO: Total samples: 1272
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:41,138, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/43r58aew
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:41,138, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/9g52onmw
(base) [lsakka@devvm4561.ash0 /data/users/lsakka/pytorch/pytorch (strobelight2)]$
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126693 Approved by: https://github.com/aorenste	2024-05-23 07:42:25 +00:00
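Since strobelight is only available inside Meta, the decorator-style usage described in the commit above can be approximated with the standard library. The following is a minimal sketch, assuming `cProfile` as a stand-in profiler; the names `profiled`, `sort_by`, `limit`, and `last_report` are hypothetical and are not part of the strobelight tool.

```python
import cProfile
import functools
import io
import pstats


def profiled(sort_by="cumulative", limit=5):
    """Decorator factory (hypothetical): profile each call of the wrapped function."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            try:
                # runcall profiles exactly this one invocation.
                return profiler.runcall(fn, *args, **kwargs)
            finally:
                # Render the statistics into a string and keep it on the wrapper.
                out = io.StringIO()
                stats = pstats.Stats(profiler, stream=out)
                stats.sort_stats(sort_by).print_stats(limit)
                wrapper.last_report = out.getvalue()

        return wrapper

    return decorator


@profiled()
def work():
    return sum(i * i for i in range(1000))
```

Calling `work()` returns the usual result and leaves a textual profile in `work.last_report`, which mirrors the annotate-and-run workflow without any external service.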
Jiong Gong	7b6d036c05	[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545 ) As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows: 1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling. 2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops. 3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen. 4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545 Approved by: https://github.com/jansel ghstack dependencies: #124021, #126019, #126068	2024-05-23 07:39:29 +00:00
Jiong Gong	31412cb2f2	[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 ) As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068 Approved by: https://github.com/jansel ghstack dependencies: #124021, #126019	2024-05-23 07:39:29 +00:00
Jiong Gong	08f57b4bff	[inductor][cpp] epilogue support for gemm template (#126019 ) As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019 Approved by: https://github.com/jansel ghstack dependencies: #124021	2024-05-23 07:39:29 +00:00
Jiong Gong	9da7efa677	[inductor][cpp] GEMM template (infra and fp32) (#124021 ) This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info. 1. Cpp template infrastructure: Template abstractions similar to those of the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates. 2. Initial FP32 gemm template: This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction. 3. Correctness and performance: The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements in follow-up PRs, including optimizations in kernels as well as fusions. The perf gains are only observed on a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details. 
Static shapes

| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up: drq: 1.14x soft_act: 1.12 cait_m36_384: 1.18x

Dynamic shapes

| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up: BERT_pytorch: 1.22x pyhpc_turbulent: 1.13x soft_actor_critic: 1.77x BlenderbotForCausalLM: 1.09x cait_m36_384: 1.17x
Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021 Approved by: https://github.com/jansel	2024-05-23 07:39:29 +00:00
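The layering this commit describes (an outer template that splits the problem into cache-sized blocks, and an inner micro-kernel that computes one tile) can be illustrated in plain Python. This is only a sketch under stated assumptions: the function names and the block sizes `mc`, `nc`, `kc` are hypothetical, thread decomposition and weight prepacking are omitted, and nothing here reflects the C++ code that Inductor actually generates.

```python
def micro_gemm(A, B, C, m0, m1, n0, n1, k0, k1):
    """Innermost kernel: accumulate one tile of C from tiles of A and B."""
    for i in range(m0, m1):
        for j in range(n0, n1):
            acc = C[i][j]
            for k in range(k0, k1):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def blocked_gemm(A, B, mc=4, nc=4, kc=4):
    """Outer loops: walk the problem in (mc x nc x kc) blocks (sizes are hypothetical)."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, mc):
        for n0 in range(0, N, nc):
            for k0 in range(0, K, kc):
                micro_gemm(
                    A, B, C,
                    m0, min(m0 + mc, M),
                    n0, min(n0 + nc, N),
                    k0, min(k0 + kc, K),
                )
    return C


def naive_gemm(A, B):
    """Reference result for comparison."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]
```

The blocked version visits the same multiply-adds as the naive one, only in an order that keeps each tile's operands close together, which is the property a cache-blocking scheme relies on.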
chilli	aa6de76181	Fix silu test for flexattention (#126641 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126641 Approved by: https://github.com/ezyang, https://github.com/drisspg ghstack dependencies: #126615, #126446	2024-05-23 05:45:07 +00:00
youkaichao	36e70572d0	[Dynamo] make bytecode of resume function resemble natural bytecode (#126630 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126630 Approved by: https://github.com/williamwen42	2024-05-23 05:06:33 +00:00
PyTorch MergeBot	2c90b99267	Revert "reset dynamo cache before each test (#126586 )" This reverts commit 43f2f43eb3b6d8cbe8eb7f45acb50376092f1a16. Reverted https://github.com/pytorch/pytorch/pull/126586 on behalf of https://github.com/clee2000 due to broke tests on inductor? test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64 `43f2f43eb3` https://github.com/pytorch/pytorch/actions/runs/9200644034/job/25308511495 ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2126228689))	2024-05-23 04:54:28 +00:00
PyTorch MergeBot	b1e214ceb1	Revert "don't check memory format for empty tensors (#126593 )" This reverts commit 12dee4f2046d07db97cddc7b3c5bdf06fc304ae3. Reverted https://github.com/pytorch/pytorch/pull/126593 on behalf of https://github.com/clee2000 due to broke tests on inductor? test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64 `43f2f43eb3` https://github.com/pytorch/pytorch/actions/runs/9200644034/job/25308511495 ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2126228689))	2024-05-23 04:54:28 +00:00
PyTorch MergeBot	df4b7cb5f7	Reapply "Skip test_memory_format_nn_BatchNorm2d in inductor (#125970 )" (#126594 ) This reverts commit ce6e36bf8b524c3f4b07605c5b3af2b7d5ba8fd9. Reverted https://github.com/pytorch/pytorch/pull/126594 on behalf of https://github.com/clee2000 due to broke tests on inductor? test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64 `43f2f43eb3` https://github.com/pytorch/pytorch/actions/runs/9200644034/job/25308511495 ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2126228689))	2024-05-23 04:54:28 +00:00
PyTorch MergeBot	4f14282e35	Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021 )" This reverts commit 2ac33a9f663269e6060246337c776a20c3b7c858. Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk `2ac33a9f66` ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))	2024-05-23 01:13:29 +00:00
PyTorch MergeBot	657d39e44c	Revert "[inductor][cpp] epilogue support for gemm template (#126019 )" This reverts commit 57108d9a4990f6b2ed3578cee58354ab01505dd3. Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk `2ac33a9f66` ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))	2024-05-23 01:13:29 +00:00
PyTorch MergeBot	205f08140e	Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 )" This reverts commit 57c185b4c765c522a7f2908a773d128c66def190. Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk `2ac33a9f66` ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))	2024-05-23 01:13:29 +00:00
Nikita Shulga	2b57652278	Update requests to 2.32.2 (#126805 ) To address CVE-2024-35195 (though it does not really affect PyTorch, only CI) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126805 Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/seemethere, https://github.com/Skylion007	2024-05-23 00:21:28 +00:00
eqy	ebbd431d9e	[CPU] Bump `test_complex_2d` thresholds for LBFGS on `complex64` (#126358 ) Is this supposed to be bitwise identical? Wasn't sure how to interpret the comment but it seems to be giving mismatches like: ``` Mismatched elements: 1 / 2 (50.0%) Greatest absolute difference: 4.6372413635253906e-05 at index (1,) (up to 1e-05 allowed) Greatest relative difference: 3.4600801882334054e-05 at index (1,) (up to 1.3e-06 allowed) To execute this test, run the following from the base repo dir: python test/test_optim.py -k test_complex_2d_LBFGS_cpu_complex64 ``` on Neoverse-N2 SBSA ARM CPUs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126358 Approved by: https://github.com/lezcano, https://github.com/janeyx99	2024-05-23 00:16:45 +00:00
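The mismatch report in this commit follows the usual closeness rule documented for `torch.testing.assert_close`: a value passes when `|actual - expected| <= atol + rtol * |expected|`. The sketch below reproduces that rule in plain Python. The reported absolute and relative differences are taken from the message above; the magnitude of `expected` is reconstructed from their ratio, and the bumped tolerances are hypothetical, since the commit message does not state the new values.

```python
def is_close(actual, expected, rtol, atol):
    """Closeness rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Differences reported in the failure message above.
abs_diff = 4.6372413635253906e-05
rel_diff = 3.4600801882334054e-05

# Assumption: recover the magnitude of the expected value from the two numbers.
expected = abs_diff / rel_diff
actual = expected + abs_diff

# Tolerances quoted in the message: atol=1e-05, rtol=1.3e-06.
fails_with_defaults = not is_close(actual, expected, rtol=1.3e-6, atol=1e-5)

# Hypothetical looser thresholds, only to show the effect of a bump.
passes_when_bumped = is_close(actual, expected, rtol=1e-4, atol=1e-4)
```

With the quoted tolerances the allowed error is roughly `1.2e-05`, well below the observed `4.6e-05`, which is why the test needed looser thresholds on that hardware.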
Jiong Gong	57c185b4c7	[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 ) As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068 Approved by: https://github.com/jansel ghstack dependencies: #124021, #126019	2024-05-23 00:12:38 +00:00
Jiong Gong	57108d9a49	[inductor][cpp] epilogue support for gemm template (#126019 ) As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019 Approved by: https://github.com/jansel ghstack dependencies: #124021	2024-05-23 00:07:52 +00:00
Jiong Gong	2ac33a9f66	[inductor][cpp] GEMM template (infra and fp32) (#124021 ) This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info. 1. Cpp template infrastructure: Template abstractions similar to those of the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates. 2. Initial FP32 gemm template: This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction. 3. Correctness and performance: The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements in follow-up PRs, including optimizations in kernels as well as fusions. The perf gains are only observed on a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details. 
Static shapes

| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up: drq: 1.14x soft_act: 1.12 cait_m36_384: 1.18x

Dynamic shapes

| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up: BERT_pytorch: 1.22x pyhpc_turbulent: 1.13x soft_actor_critic: 1.77x BlenderbotForCausalLM: 1.09x cait_m36_384: 1.17x
Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021 Approved by: https://github.com/jansel	2024-05-22 23:59:12 +00:00
Andrew Gu	e3db9ba37a	[FSDP2] Added test for manual reshard with `reshard_after_forward=False` (#126892 ) This test shows that we could always set `reshard_after_forward=False` but manually insert calls to `module.reshard()` to implement the resharding after forward. This is useful for advanced PP schedules. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126892 Approved by: https://github.com/wanchaol ghstack dependencies: #126887	2024-05-22 23:35:06 +00:00
Andrew Gu	203f2641e9	[FSDP2] Used `CommDebugMode` for comm. count test (#126887 ) simplify the test :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126887 Approved by: https://github.com/wanchaol	2024-05-22 23:35:06 +00:00
Andrew Gu	69325e4de6	[FSDP] Warned on wrapping `ModuleList`/`ModuleDict` (#124764 ) This partially addresses https://github.com/pytorch/pytorch/issues/113794. To avoid being BC breaking, we just issue a warning when wrapping `ModuleList` or `ModuleDict`. We want to add this warning since this is a common pitfall. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124764 Approved by: https://github.com/wanchaol	2024-05-22 23:34:52 +00:00
laithsakka	b0e849870e	Change error message when nn module inlining is enabled for MiscTests.test_map_side_effects (#126444 ) #fix https://github.com/pytorch/pytorch/issues/126355 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126444 Approved by: https://github.com/anijain2305	2024-05-22 23:24:03 +00:00
Shunting Zhang	17186bd5b6	[inductor] make conv lowering work with dynamic shapes (#126823 ) Fix an issue reported by internal user that conv lowering does not work well with dynamic shapes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126823 Approved by: https://github.com/jansel	2024-05-22 23:15:29 +00:00
Shunting Zhang	14c5c753de	[inductor] use smaller RBLOCK for expensive reduction kernels (#126477 ) Triton sometimes uses fewer registers for a more expensive kernel, which results in worse perf ( https://github.com/pytorch/pytorch/issues/126463 ). This may make inductor end up with a sub-optimal config. Use a smaller max RBLOCK if the reduction potentially needs many registers. Will run perf test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126477 Approved by: https://github.com/jansel	2024-05-22 22:47:10 +00:00
Shunting Zhang	ce6e36bf8b	Revert "Skip test_memory_format_nn_BatchNorm2d in inductor (#125970 )" (#126594 ) This reverts commit 0a9c6e92f8d1a35f33042c8dab39f23b7f39d6e7. enable the test since it's fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126594 Approved by: https://github.com/huydhn ghstack dependencies: #126586, #126593	2024-05-22 22:43:09 +00:00
Shunting Zhang	12dee4f204	don't check memory format for empty tensors (#126593 ) Fix https://github.com/pytorch/pytorch/issues/125967 . The test actually fails for empty 4D or 5D tensors when checking for memory format. I'm not exactly sure which recent inductor change caused the failure, but it may not be that important to maintain strides for an empty tensor. (?) I just skip the check for empty tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126593 Approved by: https://github.com/ezyang ghstack dependencies: #126586	2024-05-22 22:43:09 +00:00
Shunting Zhang	43f2f43eb3	reset dynamo cache before each test (#126586 ) In https://github.com/pytorch/pytorch/issues/125967, we found that test results depend on test order. The root cause is that earlier tests populate the dynamo cache and affect later tests. This PR clears the dynamo cache before each unit test so that unit test results are more deterministic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126586 Approved by: https://github.com/jansel	2024-05-22 22:43:09 +00:00
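The order-dependence problem described in this commit, and the fix of clearing shared state before every test, can be shown with the standard library alone. In this sketch `functools.lru_cache` stands in for the dynamo cache and `compile_fn` is a hypothetical cached function; the actual PR operates on PyTorch's own test base class, which is not reproduced here.

```python
import functools
import unittest


@functools.lru_cache(maxsize=None)
def compile_fn(key):
    """Stand-in for an expensive step whose result is cached across calls."""
    return object()


class CacheIsolatedTestCase(unittest.TestCase):
    def setUp(self):
        # Clear the shared cache (and its hit/miss counters) before each test,
        # so no test observes entries left behind by an earlier one.
        compile_fn.cache_clear()

    def test_first(self):
        compile_fn("a")
        # A hit here would mean another test had already populated the cache.
        self.assertEqual(compile_fn.cache_info().hits, 0)

    def test_second(self):
        compile_fn("a")
        self.assertEqual(compile_fn.cache_info().hits, 0)
```

Without the `setUp` reset, whichever of the two tests runs second would see a cache hit and fail, so the outcome would depend on test order.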
Ke Wen	08c260bc29	[pipelining] Test schedules against manual stage (#126735 ) Added manual stage in test_schedule.py so that we can test various schedules against it. In this file we now have: - test_schedule_with_tracer - test_schedule_with_manual - test_grad_with_tracer - test_grad_with_manual Tested schedules are: - ScheduleGPipe - Schedule1F1B Pull Request resolved: https://github.com/pytorch/pytorch/pull/126735 Approved by: https://github.com/wconstab, https://github.com/H-Huang ghstack dependencies: #126812, #126721	2024-05-22 21:54:27 +00:00
jhavukainen	6a539e80dd	Update descriptor fields to resolve fft precision issue (#125328 ) Fixes #124096 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125328 Approved by: https://github.com/kulinseth, https://github.com/malfet	2024-05-22 21:48:49 +00:00
Catherine Lee	5ccc634603	[CI] Pin uv==0.1.45 for lintrunner (#126908 ) `e4623de4cf/1`
```
2024-05-22T19:10:48.5974515Z + python3 -m pip install uv
2024-05-22T19:10:48.5975198Z Collecting uv
2024-05-22T19:10:48.5976496Z Downloading uv-0.1.45-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (32 kB)
2024-05-22T19:10:48.5977828Z Downloading uv-0.1.45-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB)
2024-05-22T19:10:48.5993443Z Installing collected packages: uv
2024-05-22T19:10:48.5993950Z Successfully installed uv-0.1.45
2024-05-22T19:10:48.5994363Z + CACHE_DIRECTORY=/tmp/.lintbin
2024-05-22T19:10:48.5994772Z + [[ -d /tmp/.lintbin ]]
2024-05-22T19:10:48.5995157Z + cp -r /tmp/.lintbin .
2024-05-22T19:10:48.5995497Z + lintrunner init
2024-05-22T19:10:48.5995839Z + [[ 1 == \1 ]]
```
vs
```
2024-05-22T20:33:53.5563991Z + python3 -m pip install uv
2024-05-22T20:33:53.5564921Z Collecting uv
2024-05-22T20:33:53.5566259Z Downloading uv-0.2.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (32 kB)
2024-05-22T20:33:53.5568142Z Downloading uv-0.2.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.9 MB)
2024-05-22T20:33:53.5578531Z Installing collected packages: uv
2024-05-22T20:33:53.5579316Z Successfully installed uv-0.2.1
2024-05-22T20:33:53.5580033Z + CACHE_DIRECTORY=/tmp/.lintbin
2024-05-22T20:33:53.5580640Z + [[ -d /tmp/.lintbin ]]
2024-05-22T20:33:53.5581229Z + cp -r /tmp/.lintbin .
2024-05-22T20:33:53.5581799Z + lintrunner init
2024-05-22T20:33:53.5603302Z Traceback (most recent call last):
2024-05-22T20:33:53.5604857Z File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/test-infra/.github/scripts/run_with_env_secrets.py", line 101, in <module>
2024-05-22T20:33:53.5605805Z main()
2024-05-22T20:33:53.5606687Z File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/test-infra/.github/scripts/run_with_env_secrets.py", line 97, in main
2024-05-22T20:33:53.5607762Z run_cmd_or_die(f"docker exec -t {container_name} /exec")
2024-05-22T20:33:53.5608949Z File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/test-infra/.github/scripts/run_with_env_secrets.py", line 38, in run_cmd_or_die
2024-05-22T20:33:53.5610107Z raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
2024-05-22T20:33:53.5611328Z RuntimeError: Command docker exec -t e551764bdba0c87c2fc392fba9ea265e8821a552915b36010f18299d8035b304 /exec failed with exit code 1
2024-05-22T20:33:53.5626540Z ##[error]Process completed with exit code 1.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126908 Approved by: https://github.com/huydhn	2024-05-22 21:41:21 +00:00
lezcano	a30baec0c3	[Docs] Fix NumPy + backward example (#126872 ) We were calling backward on a tensor not a scalar... Pull Request resolved: https://github.com/pytorch/pytorch/pull/126872 Approved by: https://github.com/albanD	2024-05-22 21:29:31 +00:00
Aaron Orenstein	e4623de4cf	typing scheduler.py [2/2]: Apply types (#126656 ) Add `# mypy: disallow-untyped-defs` to scheduler.py and then fix the resulting fallout. We probably should eventually add a new node between BaseSchedulerNode and all the non-FusedSchedulerNode types to indicate the split between nodes that have a valid `self.node` and ones that don't. That would cause a lot of the `assert self.node is not None` churn to go away - but was a bigger change because a lot of code makes assumptions about types that aren't reflected in the types themselves. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126656 Approved by: https://github.com/eellison	2024-05-22 20:33:31 +00:00
hippocookie	3591bce6c7	Add usage explanation in torch.dot ducment (#125908 ) Fixes #125842 Add a declaration of unsupported usage to `torch.dot`, to avoid misuse such as:
```python
>>> t1, t2 = torch.tensor([0,1]), torch.tensor([2,3])
>>> torch.dot(input=t1, other=t2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: dot() missing 1 required positional arguments: "tensor"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125908 Approved by: https://github.com/albanD	2024-05-22 20:33:12 +00:00
Masaki Kozuki	0939b68980	Support `dtype` kwarg in `_foreach_norm` (#125665 ) Fixes #125040 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125665 Approved by: https://github.com/janeyx99	2024-05-22 20:27:50 +00:00
Kurman Karabukaev	d62b025efc	[TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743 ) Summary: 1. Define explicit `use_agent_store` on rdzv handlers. Handlers that set it to true can share the store. 2. Instead of the agent coordinating master_addr/master_port values, the logic is now encapsulated by the rdzv_handler: `RendezvousInfo` carries a `RendezvousStoreInfo` object that handlers must return. - Depending on the implementation, handlers can either: - point to an existing store (and are expected to set `use_agent_store` to true, see point 1). Client code will rely on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know if the store is shared. - build args that `torch.distributed.init_process_group` can bootstrap by creating a new store. Additional points: - When the TCPStore is shared, it should be wrapped in a PrefixStore to qualify/scope the namespace for other use cases. - The `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple for extensibility purposes. Why: - Reduce moving parts - easier to swap implementation - improve tractability - addressing perf/debug-ability will benefit all use cases Test Plan: CI Differential Revision: D57055235 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743 Approved by: https://github.com/d4l3k	2024-05-22 18:24:11 +00:00
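The extensibility argument for returning an object rather than a `(store, rank, world_size)` tuple can be sketched with dataclasses. Everything below is a simplified, hypothetical stand-in: `StoreInfoSketch`, `RendezvousInfoSketch`, the `store_info` field name, and `ToyHandler` are illustrative names, not the actual torchelastic classes, and a plain dict plays the role of the store.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class StoreInfoSketch:
    """Hypothetical stand-in for `RendezvousStoreInfo`: where to reach the store."""
    master_addr: str
    master_port: int


@dataclass
class RendezvousInfoSketch:
    """Hypothetical stand-in for `RendezvousInfo`."""
    store: Any
    rank: int
    world_size: int
    store_info: StoreInfoSketch  # field name is an assumption


class ToyHandler:
    use_agent_store = True  # this handler opts in to sharing its store

    def __init__(self):
        self._store = {}  # a dict standing in for a TCPStore

    def next_rendezvous(self) -> RendezvousInfoSketch:
        return RendezvousInfoSketch(
            store=self._store,
            rank=0,
            world_size=1,
            store_info=StoreInfoSketch("localhost", 29500),
        )


def describe(info):
    # Callers read named fields, so adding a field later does not break them
    # the way extending a positional tuple would.
    addr = f"{info.store_info.master_addr}:{info.store_info.master_port}"
    return f"rank {info.rank}/{info.world_size} via {addr}"
```

A caller that previously unpacked three tuple elements would break as soon as a fourth was added, whereas `describe` keeps working when the dataclass grows new fields.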
Wanchao Liang	fde1e8af7a	[dtensor] implement distributed topk operator (#126711 ) as titled. Implemented the topk operator in DTensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126711 Approved by: https://github.com/wz337 ghstack dependencies: #126710	2024-05-22 18:11:56 +00:00
Wanchao Liang	af633e4a7b	[dtensor] remove unused failed_reason (#126710 ) as titled, this field is not actively used, so removing it Pull Request resolved: https://github.com/pytorch/pytorch/pull/126710 Approved by: https://github.com/wz337	2024-05-22 18:11:56 +00:00
William Wen	a8195f257e	[custom_op] use new python custom ops API on prims ops (#124665 ) Also adds a non-decorator version of `custom_op`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124665 Approved by: https://github.com/zou3519	2024-05-22 17:48:33 +00:00
Shiyan Deng	db0b74bbc5	[CUDA Caching Allocator] Allow division of 0 (#126833 ) Summary: Division of 0 means disabling roundup. Test Plan: CI Differential Revision: D57651410 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126833 Approved by: https://github.com/banitag1	2024-05-22 17:40:39 +00:00
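Assuming this commit refers to the allocator's power-of-two rounding divisions setting, the behavior can be sketched as follows: a request size is rounded up to the next of `divisions` equal steps between neighbouring powers of two, and a value of 0 turns rounding off. The function name and the handling of very small sizes are hypothetical simplifications, `divisions` is assumed to be a power of two, and this is not the allocator's C++ code.

```python
def round_up_size(size, divisions):
    """Round `size` up to the next power-of-two division boundary.

    divisions == 0 disables rounding and returns `size` unchanged.
    """
    if divisions == 0 or size <= 0:
        return size
    floor_pow2 = 1 << (size.bit_length() - 1)  # largest power of two <= size
    if floor_pow2 == size:
        return size  # already on a boundary
    # Width of one division between floor_pow2 and 2 * floor_pow2.
    step = max(floor_pow2 // divisions, 1)
    steps_needed = -(-(size - floor_pow2) // step)  # ceiling division
    return floor_pow2 + steps_needed * step
```

For example, with 4 divisions a request of 1200 bytes lies between 1024 and 2048, whose division boundaries are 1024, 1280, 1536, and 1792, so it is rounded up to 1280; with 0 divisions it stays at 1200.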
chilli	d4ec18bdad	Prevent partitioner from ever saving views (#126446 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126446 Approved by: https://github.com/anijain2305 ghstack dependencies: #126615	2024-05-22 17:28:46 +00:00
chilli	51e707650f	Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126615 Approved by: https://github.com/yanboliang, https://github.com/drisspg, https://github.com/xmfan	2024-05-22 17:28:46 +00:00
Ke Wen	3e826c477a	[pipelining] Add pipeline stage test (#126721 ) Test tracer's and manual's stage creation by using a basic schedule (GPipe). (Migrated from https://github.com/pytorch/PiPPy/blob/main/test/test_pipeline_stage.py) Test command: ``` $ python test_stage.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126721 Approved by: https://github.com/wconstab, https://github.com/H-Huang ghstack dependencies: #126812	2024-05-22 16:24:51 +00:00
Ke Wen	403012b50a	[pipelining] expose APIs per pytorch rule (#126812 ) Rule is enforced by #126103. The rule: - If `torch.a.b` defines a public class `C` (i.e. to be exposed in torch API namespace), then `torch.a.b` must be a public path, i.e. no `_`. - `torch.a.b` should ideally have an `__all__` that defines what should be imported from this file when it is imported. - All other definitions in `torch.a.b` that you don't want to expose should have a `_` prefix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126812 Approved by: https://github.com/wconstab	2024-05-22 16:21:13 +00:00
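The naming rule in this commit can be checked mechanically. The sketch below builds a toy module and reports names that look public (no `_` prefix) but are missing from `__all__`. The helper `undeclared_public_names` and the module name `toy_pipelining` are hypothetical; this is not the check that #126103 enforces, only an illustration of the convention.

```python
import types


def undeclared_public_names(module):
    """Names without a `_` prefix that the module does not list in `__all__`."""
    declared = set(getattr(module, "__all__", []))
    return sorted(
        name
        for name in vars(module)
        if not name.startswith("_") and name not in declared
    )


# Build a toy module in memory instead of importing a real one.
toy = types.ModuleType("toy_pipelining")


class Stage:
    """Public class: exposed through `__all__`."""


def _helper():
    """Private helper: hidden by its `_` prefix."""


def leaked():
    """Looks public but is not declared in `__all__`."""


toy.__all__ = ["Stage"]
toy.Stage = Stage
toy._helper = _helper
toy.leaked = leaked
```

Here `undeclared_public_names(toy)` reports only `leaked`, which under the rule should either be added to `__all__` or renamed with a leading underscore.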
Bin Bao	599e684ad6	[AOTI] Disable stack allocation for OSS (#125732 ) Summary: Stack allocation is for certain small CPU models, but its coverage still needs improvement, so default to OFF for OSS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125732 Approved by: https://github.com/chenyang78 ghstack dependencies: #126720, #126801	2024-05-22 15:33:24 +00:00
Bin Bao	ff617ab6c8	[AOTI] Fix an int array codegen issue (#126801 ) Summary: fixes https://github.com/pytorch/pytorch/issues/126779. When an int array contains symbol expression, we can't declare it with constexpr. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126801 Approved by: https://github.com/chenyang78 ghstack dependencies: #126720	2024-05-22 15:33:24 +00:00
Bin Bao	19cd4484ec	[AOTI] Add more fallback ops (#126720 ) Summary: These ops are in either unit tests or TorchBench. Fixes https://github.com/pytorch/pytorch/issues/122050 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126720 Approved by: https://github.com/chenyang78	2024-05-22 15:33:24 +00:00
Edward Z. Yang	0d17aae242	Teach FakeTensor to fill in item_memo when converting scalar CPU tensor (#126245 ) This PR requires a little justification, but let's start with what it does first: 1. When you have a 0d CPU scalar int64/float64 tensor input to a graph, we will preallocate a backed SymInt/SymFloat corresponding to what you would get if you call item() on this tensor. This means you can freely change your input to be a Python int/float or a Tensor with an item() call and end up with exactly the same level of expressivity (specifically, you can guard on the internal SymInt/SymFloat no matter what). By default, the source of the backed SymInt/SymFloat is `L['tensor'].item()`, but if you have promoted a float input into a Tensor, we will cancel out `torch.as_tensor(L['float']).item()` into just `L['float']`. 2. We switch wrap_symfloat to use this, instead of hand crafting the new SymNodeVariable. Everything works out, except that we carefully pass the item() result to tracked fakes (and not the fake Tensor argument). OK, so why do this at all? There is some marginal benefit where now some item() calls on scalar inputs can be guarded on, but IMO this is a pretty marginal benefit, and if it was the only reason, I wouldn't do this. The real reason for this is that I need to be able to propagate fake tensors through the graphs that are produced by Dynamo, and if I am doing the old custom wrap_symfloat logic, there's no way I can do this, because ordinarily an item() call will cause an unbacked SymInt when I reallocate. The other obvious way to solve the problem above is to make a HOP alternative to item() that "bakes in" the backed SymInt it's supposed to return. But this strategy seems more parsimonious, and it does have the marginal benefit I mentioned above. The main downside is that what I have to do next is make it so that when I run tensor computation, I also apply the equivalent operations to the SymInt/SymFloat as well. That's next PR. 
Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126245 Approved by: https://github.com/eellison ghstack dependencies: #126637	2024-05-22 15:25:38 +00:00
Matthew Hoffman	86ad101370	Enable pickling `torch._C.Generator` (#126271 ) Fixes #71398 Add `__reduce__` and `__setstate__` methods for `torch._C.Generator`. `__reduce__` returns a tuple of 3 values: 1. `torch.Generator` itself. 2. A one-element tuple containing the `torch.device` to create the `Generator` with, since this cannot be changed after the object is created. 3. The state, a three-element tuple: the initial seed, the offset (or `None` if a CPU `Generator`), and the RNG state tensor. `__setstate__` calls `manual_seed`, `set_offset` (if not `None`), and `set_state` on each respective element of the state. Added test demonstrating successful reserialization with cpu and cuda `Generator`s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126271 Approved by: https://github.com/ezyang	2024-05-22 14:38:47 +00:00
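The `__reduce__`/`__setstate__` contract described in #126271 can be sketched with a pure-Python stand-in. `ToyGenerator` and its fields are hypothetical and only mirror the shape of the three-element state tuple; this is not the `torch._C.Generator` implementation. The round trip uses `copy.copy`, which goes through the same `__reduce__` protocol that `pickle` relies on.

```python
import copy


class ToyGenerator:
    """Hypothetical stand-in for a generator whose device is fixed at construction."""

    def __init__(self, device="cpu"):
        self.device = device
        self.seed = 0
        self.offset = None  # None models the CPU case from the commit message
        self.state = b""

    def __reduce__(self):
        # (callable, constructor args, state): the object is rebuilt as
        # ToyGenerator(device) and the third element goes to __setstate__.
        return (ToyGenerator, (self.device,), (self.seed, self.offset, self.state))

    def __setstate__(self, state):
        seed, offset, rng_state = state
        self.seed = seed
        if offset is not None:  # only set the offset when there is one
            self.offset = offset
        self.state = rng_state


gen = ToyGenerator("cuda")
gen.seed, gen.offset, gen.state = 1234, 8, b"\x01\x02"
restored = copy.copy(gen)  # drives __reduce__ and __setstate__
```

Passing the device as a constructor argument rather than as part of the state reflects the point made in the message: it cannot be changed after the object is created.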
rzou	ed734178ab	Refresh OpOverloadPacket if a new OpOverload gets added (#126863 ) If a user accesses an OpOverloadPacket, then creates a new OpOverload, then uses the OpOverloadPacket, the new OpOverload never gets hit. This is because OpOverloadPacket caches OpOverloads when it is constructed. This PR fixes the problem by "refreshing" the OpOverloadPacket if a new OpOverload gets constructed and the OpOverloadPacket exists. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/126863 Approved by: https://github.com/albanD	2024-05-22 14:13:27 +00:00
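The stale-cache problem and the "refresh" fix described in #126863 can be illustrated with a toy model. `ToyPacket` and `ToyRegistry` are hypothetical names invented for this sketch; they are not the `OpOverloadPacket` or dispatcher classes and say nothing about how those are implemented.

```python
class ToyPacket:
    """Hypothetical packet that snapshots overload names when constructed."""

    def __init__(self, registry, name):
        self._registry = registry
        self._name = name
        self.refresh()

    def refresh(self):
        # Re-read the registry so overloads added later become visible.
        self._overloads = tuple(self._registry.get(self._name, ()))

    def overloads(self):
        return self._overloads


class ToyRegistry(dict):
    """Hypothetical registry that refreshes an existing packet on registration."""

    def __init__(self):
        super().__init__()
        self._packets = {}

    def packet(self, name):
        if name not in self._packets:
            self._packets[name] = ToyPacket(self, name)
        return self._packets[name]

    def add_overload(self, name, overload):
        self.setdefault(name, []).append(overload)
        if name in self._packets:  # the "refresh if the packet exists" step
            self._packets[name].refresh()


registry = ToyRegistry()
registry.add_overload("my_op", "default")
packet = registry.packet("my_op")         # packet is created and caches ("default",)
registry.add_overload("my_op", "Tensor")  # the later overload triggers a refresh
```

Without the refresh call in `add_overload`, `packet.overloads()` would keep returning only the snapshot taken at construction.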
Tuan Trieu	082251e76b	fix invalid call to aoti_torch_tensor_copy_ (#126668 ) Fixes #123039 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126668 Approved by: https://github.com/desertfire	2024-05-22 13:02:02 +00:00
Yifu Wang	2dd2699860	Introduce ProcessGroupCudaP2P (#122163 ) ## Context This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers. The stack contains several components: - `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining. - `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops. - Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops. To enable the prototype feature: - Set the distributed backend to `cuda_p2p`. - Set `torch._inductor.config._micro_pipeline_tp` to `True`. NOTE: the prototype sets nothing in stone w.r.t. each component's design. The purpose is to have a performant baseline with reasonable design on which each component can be further improved. ## Benchmark Setup: - 8 x H100 (500W) + 3rd gen NVSwitch. - Llama3 8B training w/ torchtitan. - 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purposes. 
Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0 Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn ## This PR `ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA. `ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it. Usage: ``` # Using ProcessGroupCudaP2P dist.init_process_group(backend="cuda_p2p", ...) # Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options pg_options = ProcessGroupCudaP2P.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) # Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options pg_options = ProcessGroupNCCL.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) 
# Using ProcessGroupCudaP2P while specifying both # ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options pg_options = ProcessGroupCudaP2P.Options() pg_options.nccl_options = ProcessGroupNCCL.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) # Down-casting the backend to access p2p buffers for cuda_p2p specific # optimizations if is_cuda_p2p_group(group): backend = get_cuda_p2p_backend(group) if required_p2p_buffer_size > backend.get_buffer_size(): # fallback p2p_buffer = backend.get_p2p_buffer(...) else: # fallback ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163 Approved by: https://github.com/wanchaol	2024-05-22 09:33:05 +00:00
PyTorch MergeBot	8a4597980c	Revert "Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615 )" This reverts commit 831efeeadf5fa8d9e7f973057e634a57e3bcf04b. Reverted https://github.com/pytorch/pytorch/pull/126615 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))	2024-05-22 08:23:40 +00:00
PyTorch MergeBot	0f37fd06d9	Revert "Prevent partitioner from ever saving views (#126446 )" This reverts commit da2292ce6b37028746bf5beeae04442eef1e803d. Reverted https://github.com/pytorch/pytorch/pull/126446 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))	2024-05-22 08:23:40 +00:00
PyTorch MergeBot	d2cbbdee31	Revert "Fix silu test for flexattention (#126641 )" This reverts commit cd3a71f754a2248bcfe500de7c9860bd7d2002bf. Reverted https://github.com/pytorch/pytorch/pull/126641 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))	2024-05-22 08:23:40 +00:00
Xia, Weiwen	4575d3be83	[Quant][onednn] fix performance regression of depth-wise qconv (#126761 ) Fixes #125663 It did not handle groups correctly in the original implementation. Test plan: Functionality is covered by UT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126761 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5	2024-05-22 07:53:11 +00:00
Jez Ng	aede940975	[inductor] Fix cuda compilation under fbcode remote execution (#126408 ) Differential Revision: D57390072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126408 Approved by: https://github.com/desertfire	2024-05-22 07:51:35 +00:00
zjgarvey	edea2b81b5	[ONNX] Adds Support for Some Bitwise Ops in Onnx Exporter (#126229 ) Addresses #126194 Adds support for - "aten::bitwise_right_shift" - "aten::bitwise_left_shift" - "aten::bitwise_and" Pull Request resolved: https://github.com/pytorch/pytorch/pull/126229 Approved by: https://github.com/justinchuby	2024-05-22 07:47:43 +00:00
Jason Ansel	b516de8cac	[halide-backend] Add HalideCodeCache (#126416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126416 Approved by: https://github.com/shunting314 ghstack dependencies: #126631, #126655	2024-05-22 06:52:50 +00:00
Wanchao Liang	d937d0db0f	[SAC] fix ignored ops in eager mode to recompute (#126751 ) As titled. There are some issues in eager-mode SAC where recompute sometimes tries to pop ops that are missing from storage; these ops are detach ops. This PR refactors the two modes so that they always recompute ignored ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126751 Approved by: https://github.com/yf225	2024-05-22 06:47:22 +00:00
Xuehai Pan	3b0f6cce5c	[pytree] freeze attributes of `TreeSpec` (#124011 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124011 Approved by: https://github.com/zou3519	2024-05-22 05:57:00 +00:00
Banit Agrawal	6edf989e2f	[CUDA Caching Allocator] Round to nearest 512 bytes boundary if number of divisions=1 (#126830 ) Summary: This diff fixes an issue when the number of divisions=1, resulting in unaligned memory accesses. Reviewed By: 842974287 Differential Revision: D57648763 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126830 Approved by: https://github.com/842974287	2024-05-22 04:57:24 +00:00
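A minimal sketch of rounding an allocation size to a 512-byte boundary, as the title of #126830 describes. The function name is hypothetical, and rounding *up* is an assumption made for this sketch; the actual allocator code lives in C++ and is not reproduced here.

```python
ALIGNMENT = 512  # bytes; the boundary named in the commit title


def round_up_to_alignment(size, alignment=ALIGNMENT):
    """Round a positive byte count up to the next multiple of `alignment`."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Adding (alignment - 1) before the floor division rounds up, and sizes
    # that are already aligned stay unchanged.
    return ((size + alignment - 1) // alignment) * alignment


sizes = [1, 512, 513, 1000]
rounded = [round_up_to_alignment(s) for s in sizes]
```

Every value in `rounded` is a multiple of 512, which is the alignment property the fix is after.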
Chirag Pandya	ae66c94eaa	Capture dtype in Flight Recorder (#126581 ) Summary: Capture dtype in flight recorder. Mismatched dtypes can lead to hangs. Newly added logs to the job show a mismatching dtype of an op, which affects data size even though the sizes match, and we don't see the dtype in the FR log. We end up capturing the type as follows: ``` {'entries': [{'record_id': 0, 'pg_id': 0, 'process_group': ('0', 'default_pg'), 'collective_seq_id': 1, 'p2p_seq_id': 0, 'op_id': 1, 'profiling_name': 'nccl:all_reduce', 'time_created_ns': 1715989097552775261, 'duration_ms': 6.697696208953857, 'input_sizes': [[3, 4]], 'input_dtypes': [6], 'output_sizes': [[3, 4]], 'output_dtypes': [6], 'state': 'completed', 'time_discovered_started_ns': 1715989097593778240, 'time_discovered_completed_ns': 1715989097593778461, 'retired': True, ``` Notice the new fields: input_dtypes: [6] output_dtypes: [6] Test Plan: unit tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/issues/126554 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126581 Approved by: https://github.com/wconstab	2024-05-22 03:38:09 +00:00
Simon Fan	7530cfe7e4	[dynamo][flaky tests] test_conv_empty_input_* (#126790 ) Run CI, maybe fixes https://github.com/pytorch/pytorch/issues/126178 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126790 Approved by: https://github.com/mikaylagawarecki	2024-05-22 03:14:21 +00:00
Jiashen Cao	ac1f0befcf	Remove redundant serialization code (#126803 ) After https://github.com/pytorch/pytorch/pull/123308, we no longer need a separate serialization path to handle different types that exist in the nn_module metadata. This PR cleans up the redundant code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126803 Approved by: https://github.com/angelayi	2024-05-22 03:14:17 +00:00
Ke Wen	608a11c496	[pipelining] Retire PIPPY_VERBOSITY in favor of TORCH_LOGS=pp (#126828 ) https://github.com/pytorch/pytorch/pull/126499/ established: `TORCH_LOGS=pp` --> info `TORCH_LOGS=-pp` --> warn `TORCH_LOGS=+pp` --> debug Pull Request resolved: https://github.com/pytorch/pytorch/pull/126828 Approved by: https://github.com/wconstab	2024-05-22 02:52:58 +00:00
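The prefix convention listed in #126828 can be written down as a small lookup. `level_for` is a hypothetical helper built on the standard `logging` module purely to illustrate the mapping; it is not the parser PyTorch uses for `TORCH_LOGS`.

```python
import logging


def level_for(setting, name="pp"):
    """Map a TORCH_LOGS-style string to a logging level for `name`.

    Returns None when `name` does not appear in the setting.
    """
    for entry in setting.split(","):
        entry = entry.strip()
        if entry == name:
            return logging.INFO       # TORCH_LOGS=pp  --> info
        if entry == "-" + name:
            return logging.WARNING    # TORCH_LOGS=-pp --> warn
        if entry == "+" + name:
            return logging.DEBUG      # TORCH_LOGS=+pp --> debug
    return None


levels = {s: level_for(s) for s in ("pp", "-pp", "+pp")}
```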
Isuru Fernando	e3c96935c2	Support CUDA_INC_PATH env variable when compiling extensions (#126808 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126808 Approved by: https://github.com/amjames, https://github.com/ezyang	2024-05-22 02:44:32 +00:00
Ke Wen	5fa7aefb49	[pipelining] Do not print loss (#126829 ) `loss` is a tensor, printing it would induce a GPU-CPU sync, which would slow down the program more than regular debug overhead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126829 Approved by: https://github.com/wconstab	2024-05-22 02:32:04 +00:00
Yueming Hao	e6f655697b	[AOTI] Fix unsupported type of output=s1 (#126797 ) Fixes #123036 In unit test `DynamicShapesCudaWrapperCudaTests.test_scaled_dot_product_attention_cuda_dynamic_shapes_cuda_wrapper`, computed buffer buf3 is compiled to a fallback kernel `aoti_torch_cuda__scaled_dot_product_flash_attention`. It has 9 outputs whose types are `[MultiOutput, MultiOutput, None, None, s1, s1, MultiOutput, MultiOutput,MultiOutput]`. The type `s1` here is passed from [generate_output](`acfe237a71/torch/_inductor/ir.py (L5658)`). The type check for Symbol is missing for fallback kernel output generation. This PR fixes this issue by checking `output.is_Symbol`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126797 Approved by: https://github.com/desertfire	2024-05-22 02:15:43 +00:00
Nikita Shulga	a379ed6e98	Fix SobolEngine default dtype handling (#126781 ) - Change default dtype argument to `None` and fetch its value via a `torch.get_default_dtype()` call if not defined - Fix bug in first draw handling logic that would ignore dtype in favor of the default one due to type promotion - Add regression tests Fixes https://github.com/pytorch/pytorch/issues/126478 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126781 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-05-22 01:55:48 +00:00
eellison	28f29e074b	Dont mutate tensor stride in place in cudnn conv (#126786 ) Fix for https://github.com/pytorch/pytorch/issues/126241. Within the cudnn convolution, we were in-place updating the strides of the tensor to disambiguate for size-1 dims and contiguous and channels last tensors. Instead of mutating the tensor's stride, just use a temporary. Inside cudnn it is then copied: `d7ccb5b3c4/include/cudnn_frontend_Tensor.h (L201-L203)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126786 Approved by: https://github.com/ezyang, https://github.com/shunting314, https://github.com/eqy	2024-05-22 01:53:44 +00:00
Yanbo Liang	66c23cb021	Add micro-benchmark framework and multi_layer_norm as an example (#126754 ) ```micro_benchmark.py``` output csv example (all numbers are fake, just for demo) ``` name,metric,target,actual multi_layer_norm,inference_time(s),20,19.87 multi_layer_norm,memory_bandwidth(GB/s),108,108.04 llama2-int8, token_per_sec,155,156 llama2-int8,memory_bandwidth(GB/s),92,92.7 ``` Expected dashboard looks like: ``` \| name \| metric \| target \| actual \| change \| \|------------------\|------------------------\|--------\|--------\|--------\| \| multi_layer_norm \| inference_time(s) \| 20 \| 19.87 \| 99% \| \| \| memory_bandwidth(GB/s) \| 108 \| 108.04 \| 101% \| \| llama2-int8 \| token_per_sec \| 155 \| 156 \| 100% \| \| \| memory_bandwidth(GB/s) \| 92 \| 92.7 \| 101% \| ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126754 Approved by: https://github.com/Chillee	2024-05-22 01:27:37 +00:00
Andrew Gu	636e79991c	[FSDP2] Fixed 2D clip grad norm test (#126497 ) This fixes https://github.com/pytorch/pytorch/issues/126484. We change from transformer to MLP stack since transformer seems to introduce slight numeric differences when using TP. We include a sequence parallel layer norm module in the MLP stack to exercise `(S(0), R)` placement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126497 Approved by: https://github.com/weifengpy, https://github.com/wz337	2024-05-22 00:29:13 +00:00
Klein Shen	25ea32567e	[caffe2][1/n] migrate global Static Initializer (#126688 ) Summary: Caffe2 lib has 200+ usages of global static initializers, which are paper-cut reference to startup perf. Detail in this post https://fb.workplace.com/groups/arglassesperf/permalink/623909116287154. Kick off a stack to migrate all usage of global static initializers in caffe2. Test Plan: TODO: Please advise how I can test this change. Differential Revision: D57531083 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126688 Approved by: https://github.com/ezyang	2024-05-22 00:16:06 +00:00
Masahiro Hiramori	10a5c1b26c	[Dynamo][TVM] Fix tvm backend interface (#126529 ) Fixes #126528 The repro in the above issue works fine with this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126529 Approved by: https://github.com/xmfan	2024-05-21 23:31:15 +00:00
Xu Zhao	1e818db547	[torchbench] Fix torchao benchmarking script (#126736 ) As the title says. Test Plan: ``` python benchmarks/dynamo/torchbench.py --only BERT_pytorch --bfloat16 --quantization int8dynamic --performance --inference --print-memory cuda eval BERT_pytorch [XZ Debug] Torch grad status: False memory: eager: 0.82 GB, dynamo: 0.92 GB, ratio: 0.89 running benchmark: 100% 1.001x ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126736 Approved by: https://github.com/jerryzh168, https://github.com/huydhn	2024-05-21 23:15:12 +00:00
Jason Ansel	9dba1aca0e	[inductor] Relax type annotations for statically_known_* (#126655 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126655 Approved by: https://github.com/Skylion007, https://github.com/shunting314 ghstack dependencies: #126631	2024-05-21 23:12:42 +00:00
Jason Ansel	c08afbb3da	[inductor] Add kernel_code logging artifact (#126631 ) This is useful for some compile errors where we don't finish outputting the full graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126631 Approved by: https://github.com/shunting314	2024-05-21 23:12:42 +00:00
Shuqiang Zhang	4e921593a4	[c10d]skip nan tests for lower versions of CUDA (#126701 ) Summary: We found that the UNIT tests would hang only in one test, linux-focal-cuda11.8-py3.9-gcc9 / test (multigpu, 1, 1, linux.g5.12xlarge.nvidia.gpu), in which DSA would still be raised, but somehow the process would cause errors like: P1369649418 Test Plan: Run CI tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/126701 Approved by: https://github.com/wconstab ghstack dependencies: #126409	2024-05-21 22:25:29 +00:00
Mu-Chu Lee	f6ffe32a9d	[AOTInductor] Automatic detection for buffer mutation and binary linking (#126706 ) Summary: Instead of an explicit config for users to determine buffer mutation, we automatically detect whether there's buffer mutation in the model and determine in which section constants are placed. If constants are too large and don't fit within the section, we error out directly. Test Plan: Existing tests for buffer mutation and large weight linking Differential Revision: D57579800 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126706 Approved by: https://github.com/desertfire	2024-05-21 21:49:13 +00:00
wz337	fed536dbcf	[DTensor][Optim] Add support for fused_adam and fused_adamw when lr is a tensor (#126750 ) Fixes #126670 In this PR, we update the following: 1. lr is a kwarg. Add support to automatically turn on implicit replication for kwargs. We only did this for args previously. 2. add associated tensor_lr ops in pointwises.py 3. add associated unit test in test_optimizers.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/126750 Approved by: https://github.com/wanchaol, https://github.com/msaroufim	2024-05-21 21:38:05 +00:00
hippocookie	7ee74d986a	Enable UFMT format on test/typing files (#126038 ) Fixes some files in #123062 Run lintrunner on files: test/typing/*/ ``` $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126038 Approved by: https://github.com/shink, https://github.com/ezyang	2024-05-21 21:37:07 +00:00
leslie-fang-intel	1cc9354cb0	Unify the dtype to VecMask<float, N> in ops.masked (#126662 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/126449. For `ops.masked` in CPP backend, when input dtype is `bool`, we actually load it as `VecMask<float, N>`. So, we should unify the type of `other` and `mask` to the same as `VecMask<float, N>` to invoke `blendv` method. Test Plan ``` clear && python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_ops_masked_with_bool_input clear && PYTORCH_ALL_SAMPLES=1 python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive__chunk_cat_cpu_bool ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126662 Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10	2024-05-21 20:52:25 +00:00
dependabot[bot]	fd7293db71	Bump rexml from 3.2.5 to 3.2.8 in /ios/TestApp (#126455 ) Bumps [rexml](https://github.com/ruby/rexml) from 3.2.5 to 3.2.8. - [Release notes](https://github.com/ruby/rexml/releases) - [Changelog](https://github.com/ruby/rexml/blob/master/NEWS.md) - [Commits](https://github.com/ruby/rexml/compare/v3.2.5...v3.2.8) --- updated-dependencies: - dependency-name: rexml dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-05-21 13:47:12 -07:00
Sahdev Zala	fe0a36fd7c	Fix a link in the compiler backend doc (#126079 ) The core aten is the core subset of aten and seems to be the correct link to replace the broken link. Fixes #125961 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126079 Approved by: https://github.com/svekars	2024-05-21 20:16:04 +00:00
Tianyu Liu	5325a6de64	[dtensor] remove `output_` prefix from OpStrategy properties (#126359 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126359 Approved by: https://github.com/wanchaol, https://github.com/XilunWu	2024-05-21 19:54:29 +00:00
Edward Z. Yang	c73c9457aa	Add guard_size_oblivious to vector_norm (#126772 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126772 Approved by: https://github.com/lezcano, https://github.com/Skylion007 ghstack dependencies: #126771	2024-05-21 19:53:21 +00:00
Edward Z. Yang	97eef61474	Don't assume compare_arg is fx.Node (#126771 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126771 Approved by: https://github.com/Skylion007	2024-05-21 19:53:21 +00:00
Sergii Dymchenko	fc594ed219	Remove lint from retryable_workflows (#126806 ) Related to https://github.com/pytorch/test-infra/pull/4934 Lint workflow now uses Docker, so there should not be network-related errors for pip installing stuff. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126806 Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/huydhn	2024-05-21 19:47:23 +00:00
Shivam Raikundalia	4e6673e244	Remove MAX_STACK_ENTRY from _build_table (#126583 ) Summary: As reported by this issue: https://github.com/pytorch/pytorch/issues/83584 We already store the entries in evt.stack so there is no need to cap the limit when we output the table to 5 Test Plan: Regression testing should cover this. We have unit tests to check the stack already. Differential Revision: D57513565 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126583 Approved by: https://github.com/nmacchioni	2024-05-21 18:52:04 +00:00
Andrew M. James	0c76018714	[inductor] Don't inherit `__future__` flags from the calling scope when `compile` -ing generated modules (#126454 ) This file includes `from __future__ import annotations` which interacts with `compile` by causing type annotations to be populated as strings. Triton does not parse the string annotation correctly. Avoid this behavior by passing `dont_inherit=True` to `compile`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126454 Approved by: https://github.com/peterbell10	2024-05-21 18:51:13 +00:00
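The interaction described in #126454 can be reproduced with the standard library alone. In this sketch the inherited flag is simulated by passing `__future__.annotations.compiler_flag` explicitly, so the result does not depend on the calling file; `SRC` and `annotations_of` are hypothetical names, and nothing here is Inductor or Triton code.

```python
import __future__

# Source of a tiny generated module with annotated parameters.
SRC = "def f(x: int) -> int:\n    return x\n"


def annotations_of(src, flags=0):
    """Compile and run `src`, then return the annotations of its function f."""
    namespace = {}
    # dont_inherit=True means only the flags passed here apply; future
    # statements active in the calling code are ignored.
    code = compile(src, "<generated>", "exec", flags=flags, dont_inherit=True)
    exec(code, namespace)
    return namespace["f"].__annotations__


# No future flags: annotations are real objects.
plain = annotations_of(SRC)

# With the `annotations` future flag (what an inheriting caller would pass
# along implicitly): annotations are stored as strings.
stringified = annotations_of(SRC, flags=__future__.annotations.compiler_flag)
```

`plain` maps `x` to the `int` type, while `stringified` maps it to the string `'int'`, which is the form the commit message says Triton does not parse correctly.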
cyy	7428fd19fe	Remove outdated options from setup.py (#125988 ) Since the recent removal of Caffe2 files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125988 Approved by: https://github.com/ezyang	2024-05-21 18:48:23 +00:00
Bin Bao	b40fb2de59	[AOTI] Fix a codegen issue when .item() is used for kernel arg (#126575 ) Summary: fixes https://github.com/pytorch/pytorch/issues/126574 . Pass kernel argument type information into generate_args_decl, so it can generate the argument declaration instead of relying on string matching. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126575 Approved by: https://github.com/chenyang78 ghstack dependencies: #126369	2024-05-21 18:20:20 +00:00
Bin Bao	5e2de16a6f	[AOTI] Codegen None as empty tensor (#126369 ) Summary: When None denotes an optional tensor, we codegen NULL to represent it; but when None is for actual tensor type, we need to codegen an empty tensor for it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126369 Approved by: https://github.com/chenyang78	2024-05-21 18:20:20 +00:00
Tristan Rice	ac51920656	Reapply "c10d: add Collectives abstraction (#125978 )" (#126695 ) This reverts commit d9c3485146913324ab4b3e211d2a4517e138f4af. Reapplies #125978. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126695 Approved by: https://github.com/c-p-i-o	2024-05-21 18:00:09 +00:00
eellison	d8f5627a88	prune back configs (#126570 ) We had a previous PR that added configs for an internal model. Running the below script on output from autotuning, we can prune back the added configs with negligible perf loss: P1365917790. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126570 Approved by: https://github.com/nmacchioni	2024-05-21 17:44:32 +00:00
Scott Wolchok	85fd76f76d	Add test coverage for fp16 matrix-vector specialized kernel (#126700 ) Summary: This kernel is special-cased on ARM because it's important for LLMs, so let's have test coverage. Test Plan: Ran locally and it passes. Intentionally broke fp16_gemv_trans and saw it fail, confirming it provides coverage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126700 Approved by: https://github.com/malfet	2024-05-21 17:23:16 +00:00
Tom Ritchford	bae3b17fd9	Tweak a comment and fix spelling (#126681 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126681 Approved by: https://github.com/Skylion007	2024-05-21 17:19:06 +00:00
Yanbo Liang	0756f9f5fd	Remove debug breakpoint (#126756 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126756 Approved by: https://github.com/BowenBao, https://github.com/Skylion007	2024-05-21 17:04:50 +00:00
Aaron Orenstein	140ab89c02	typing scheduler.py [1/2]: Bug fix (#126610 ) Found while getting scheduler.py to typecheck - split off to make reviewing easier. 1. is_template: I'm pretty sure this is a bug. Based on the definition of `is_template` I'm pretty sure we want to return the node's `get_template_node()`, not the node itself. 2. can_free: It seems that this was intended to be a raise, not a return. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126610 Approved by: https://github.com/eellison	2024-05-21 16:59:37 +00:00
Catherine Lee	ac2c547838	[TD] Upload names of failures to s3 for pytest cache (#126315 ) Some tests don't get run through pytest and pytest crashes when a test segfaults, so in both cases, the pytest cache won't have an entry (similar to https://github.com/pytorch/test-infra/pull/5205). Instead, manually upload/download an extra file that lists the failing test files. Technically this would be more general than the pytest cache. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126315 Approved by: https://github.com/ZainRizvi	2024-05-21 16:29:31 +00:00
eellison	4a7b46be3d	small changes to padding (#126716 ) Add cost of writing padding 0s to benchmark, skip dimension that can be squeezed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126716 Approved by: https://github.com/shunting314	2024-05-21 16:09:32 +00:00
PyTorch MergeBot	980f5ac049	Revert "[Quant][PT2E] enable qlinear post op fusion for dynamic quant & qat (#122667 )" This reverts commit 3642e51ea527e23ded10afc266f298b0cb5350c8. Reverted https://github.com/pytorch/pytorch/pull/122667 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122667#issuecomment-2122642317))	2024-05-21 13:45:07 +00:00
William Wen	b36e01801b	[3.12, inductor] re-enable AsyncCompile.warm_pool for 3.12 (#126724 ) Somehow working now? Fixes https://github.com/pytorch/pytorch/issues/124192 and https://github.com/pytorch/pytorch/issues/125979. Still getting the warning ``` /home/williamwen/local/installs/python3.12/debug/install/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=2360707) is multi-threaded, use of fork() may lead to deadlocks in the child. self.pid = os.fork() ``` though Pull Request resolved: https://github.com/pytorch/pytorch/pull/126724 Approved by: https://github.com/masnesral, https://github.com/jansel	2024-05-21 08:50:13 +00:00
cyy	faa72dca41	Remove QNNPACK submodule (#126657 ) QNNPACK has been integrated into ATen for a long time, and removing it from third party causes no build issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126657 Approved by: https://github.com/ezyang	2024-05-21 07:25:24 +00:00
Feng Yuan	7d34cfd28a	Update torch-xpu-ops pin (ATen XPU implementation) (#126744 ) Regular bi-weekly pin update. 85 new ATen operators are implemented in the XPU backend. https://github.com/intel/torch-xpu-ops/blob/release/2.4/yaml/xpu_functions.yaml Pull Request resolved: https://github.com/pytorch/pytorch/pull/126744 Approved by: https://github.com/EikanWang	2024-05-21 07:21:52 +00:00
Will Constable	4b23c4fc5d	[Pipelining] Clean up function names in 1f1b schedule (#126582 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126582 Approved by: https://github.com/kwen2501 ghstack dependencies: #126539	2024-05-21 06:50:02 +00:00
Will Constable	8c9d332953	[c10d] fix excepthook crash on exc after destroy_process_group (#126739 ) fixes #126379 This is the easy fix. An additional fix that I did not do is to deregister the excepthook (or rather, restore the original one) when calling dist.destroy_process_group. This might be a bit complicated in practice, so landing as is for now. Also, couldn't figure out a clean way to test this. assertRaisesRegex wasn't getting a string value, probably because of the stderr redirection done via the excepthook in the first place. Output from the original repro is cleaned up with the fix: ``` [rank0]: Traceback (most recent call last): [rank0]: File "/data/users/whc/pytorch/except.py", line 6, in <module> [rank0]: raise ZeroDivisionError [rank0]: ZeroDivisionError ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126739 Approved by: https://github.com/yf225	2024-05-21 06:39:18 +00:00
PyTorch MergeBot	e363a8a222	Revert "[pipelining] Add pipeline stage test (#126721 )" This reverts commit b948b1ad7a9cf61c9692506c60c295fd40e00f43. Reverted https://github.com/pytorch/pytorch/pull/126721 on behalf of https://github.com/clee2000 due to The test_public_bindings failure is real, you just got unlucky since it was also broken on trunk for a different reason ([comment](https://github.com/pytorch/pytorch/pull/126721#issuecomment-2121725408))	2024-05-21 04:40:05 +00:00
Will Constable	dc2560f073	[Pipelining] Add debug logs for batch p2p ops (#126539 ) logs from torchtitan: <img width="2878" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/4039c85f-0bf1-4924-92fa-2c55e8e4da2a"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126539 Approved by: https://github.com/kwen2501, https://github.com/H-Huang	2024-05-21 03:54:46 +00:00
Will Constable	b96d9090d2	[C10D] make get_node_local_rank() accept fallback_rank (#126737 ) Addresses follow up comments on #123992 and allows the use case of writing code that checks `get_node_local_rank(fallback_rank=0)` and runs correctly whether in the presence of a launcher (e.g. torchrun), or run locally on a single device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126737 Approved by: https://github.com/shuqiangzhang	2024-05-21 03:38:02 +00:00
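The fallback behaviour described in #126737 can be sketched without a process group. This is a minimal stdlib-only illustration, not the c10d implementation: `node_local_rank` is a hypothetical stand-in, and it assumes the launcher exports the local rank through the `LOCAL_RANK` environment variable, as torchrun does.

```python
import os
from typing import Optional


def node_local_rank(fallback_rank: Optional[int] = None) -> int:
    """Hypothetical stand-in for get_node_local_rank(fallback_rank=...)."""
    value = os.environ.get("LOCAL_RANK")
    if value is not None:
        # A launcher (e.g. torchrun) set the variable: trust it.
        return int(value)
    if fallback_rank is not None:
        # No launcher present: use the caller-supplied default.
        return int(fallback_rank)
    raise RuntimeError("LOCAL_RANK is not set and no fallback_rank was given")
```

With a fallback of 0, the same script picks device 0 when run directly and the launcher-assigned device when started under a launcher.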
Yanbo Liang	c1b90a4e8a	[Dynamo] Treat integers stored on nn.Modules as dynamic (#126466 ) Fixes #115711 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126466 Approved by: https://github.com/jansel	2024-05-21 03:31:20 +00:00
Chirag Pandya	a83e745356	[BE] split seq_id to collective_seq_id and p2p_seq_id (#125727 ) Summary: Split out `seq_id` into `collective_seq_id` and `p2p_seq_id`. The main idea here is that collectives that go to all machines should have identical `collective_seq_id` and therefore it makes it easier to spot if one of machines isn't handling a collective operation. Next, we can attempt to match up p2p operations to ensure that the sender(s)/receivers(s) are in sync. Resolves issue: https://github.com/pytorch/pytorch/issues/125173 Test Plan: Unit tests. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/125727 Approved by: https://github.com/zdevito	2024-05-21 03:26:49 +00:00
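The reasoning in #125727 (every rank should report the same `collective_seq_id`) suggests a simple consistency check. The sketch below is only an assumption about how such ids might be compared; `lagging_ranks` and the rank-to-id dictionary are hypothetical and not part of the PR.

```python
from typing import Dict, List


def lagging_ranks(collective_seq_ids: Dict[int, int]) -> List[int]:
    """Return ranks whose last collective_seq_id is behind the newest one seen."""
    if not collective_seq_ids:
        return []
    latest = max(collective_seq_ids.values())
    return sorted(rank for rank, seq in collective_seq_ids.items() if seq < latest)


# Rank 2 has not yet entered collective 7, so it is the one to inspect.
print(lagging_ranks({0: 7, 1: 7, 2: 6, 3: 7}))
```

Point-to-point operations would need a separate comparison keyed on sender/receiver pairs, which is why a distinct `p2p_seq_id` is useful.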
eqy	5f64086d08	[NT][SDPA] Bump tolerances for `test_sdpa_with_packed_in_proj_cuda_bfloat16` (#126356 ) Current tolerances fail on RTX 6000 (Ada) with `Mismatched elements: 2 / 144 (1.4%)` ``` AssertionError: Tensor-likes are not close! Mismatched elements: 2 / 144 (1.4%) Greatest absolute difference: 0.002197265625 at index (5, 0, 0) (up to 0.001 allowed) Greatest relative difference: 0.08203125 at index (3, 0, 0) (up to 0.016 allowed) To execute this test, run the following from the base repo dir: python test/test_nestedtensor.py -k test_sdpa_with_packed_in_proj_cuda_bfloat16 This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ---------------------------------------------------------------------- ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126356 Approved by: https://github.com/drisspg	2024-05-21 03:25:30 +00:00
cdzhan	40cc616909	Fix caching allocator of out-of-tree device is destructed before the … (#126677 ) …destruction of tensors cached by autocast ## Root Cause For out-of-tree device extension it is loaded after torch (different .so), so the global variable `cached_casts` may be constructed before caching allocator and then destructed in reversed order when exit. ## Fix Lazily initialize `cached_casts` to correct the order. ## How to Reproduce && Test Modify the testcase `TestAutocastGPU.test_cast_cache_is_global` in test/test_autocast.py to run on your out-of-tree device. You will see following failure in the end of test. ```bash ---------------------------------------------------------------------- Ran 1 test in 4.812s OK free: 0x30080ff44000400 terminate called after throwing an instance of 'c10::Error' what(): invalid device pointer: 0x30080ff44000400 Exception raised from free at /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/framework/core/caching_allocator.cpp:1609 (most recent call first): frame #0: <unknown function> + 0x118fe1 (0x7ffaef4d3fe1 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #1: <unknown function> + 0x11b1c4 (0x7ffaef4d61c4 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #2: <unknown function> + 0x117677 (0x7ffaef4d2677 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #3: <unknown function> + 0x11a2bf (0x7ffaef4d52bf in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #4: <unknown function> + 0x11a186 (0x7ffaef4d5186 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #5: <unknown function> + 0x119fde (0x7ffaef4d4fde in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #6: <unknown function> + 0x119d2e (0x7ffaef4d4d2e in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #7: <unknown function> + 0x119be0 (0x7ffaef4d4be0 in 
/projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #8: <unknown function> + 0x119977 (0x7ffaef4d4977 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #9: <unknown function> + 0x119313 (0x7ffaef4d4313 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #10: <unknown function> + 0x118b4c (0x7ffaef4d3b4c in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #11: c10::Error::Error(c10::SourceLocation, std::string) + 0x34 (0x7ffaef4d27c4 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #12: c10::detail::torchCheckFail(char const, char const, unsigned int, std::string const&) + 0x7f (0x7ffaef4d04ed in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #13: torch_mlu::MLUCachingAllocator::Native::NativeCachingAllocator::free(void) + 0xe6 (0x7ff9a8eeb112 in /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/lib/libtorch_mlu.so) frame #14: torch_mlu::MLUCachingAllocator::Native::local_raw_delete(void) + 0x3b (0x7ff9a8ed9480 in /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/lib/libtorch_mlu.so) frame #15: std::unique_ptr<void, void ()(void)>::~unique_ptr() + 0x50 (0x7ffb0a5ea322 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so) frame #16: <unknown function> + 0x1269890 (0x7ffb0a5e4890 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so) frame #17: <unknown function> + 0x1269928 (0x7ffb0a5e4928 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so) frame #18: <unknown function> + 0x127572c (0x7ffb0a5f072c in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so) frame #19: <unknown function> + 0x1275758 (0x7ffb0a5f0758 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so) frame #20: <unknown function> + 0xb9bc7 (0x7ffaef474bc7 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame 
#21: <unknown function> + 0xb97bc (0x7ffaef4747bc in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #22: <unknown function> + 0xdbc50 (0x7ffaef496c50 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #23: c10::TensorImpl::~TensorImpl() + 0x82 (0x7ffaef49157e in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #24: c10::TensorImpl::~TensorImpl() + 0x1c (0x7ffaef4915aa in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #25: <unknown function> + 0x2f596d9 (0x7ffaf24fc6d9 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #26: <unknown function> + 0x2f589c2 (0x7ffaf24fb9c2 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #27: <unknown function> + 0x2f57b92 (0x7ffaf24fab92 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #28: <unknown function> + 0x2f5c228 (0x7ffaf24ff228 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #29: <unknown function> + 0x30f3f70 (0x7ffaf2696f70 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #30: <unknown function> + 0x30f3f90 (0x7ffaf2696f90 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #31: <unknown function> + 0x30f5004 (0x7ffaf2698004 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #32: <unknown function> + 0x30f5024 (0x7ffaf2698024 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #33: <unknown function> + 0x31207f0 (0x7ffaf26c37f0 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #34: <unknown function> + 0x3120814 (0x7ffaf26c3814 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #35: <unknown function> + 0x30f51e8 (0x7ffaf26981e8 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #36: <unknown function> + 
0x30f5148 (0x7ffaf2698148 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #37: <unknown function> + 0x316ecea (0x7ffaf2711cea in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #38: <unknown function> + 0x468a7 (0x7ffb0c9ed8a7 in /lib/x86_64-linux-gnu/libc.so.6) frame #39: on_exit + 0 (0x7ffb0c9eda60 in /lib/x86_64-linux-gnu/libc.so.6) <omitting python frames> frame #47: __libc_start_main + 0xf3 (0x7ffb0c9cb083 in /lib/x86_64-linux-gnu/libc.so.6) Aborted (core dumped) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126677 Approved by: https://github.com/ezyang	2024-05-21 03:20:17 +00:00
Peter Bell	51c07f9f69	[dynamo] Allow asserts to fail (#126661 ) Currently if an assertion is statically known to be false, dynamo converts it to `_assert_async` which inductor currently ignores. Instead this graph breaks to raise the original assertion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126661 Approved by: https://github.com/ezyang	2024-05-21 02:42:13 +00:00
eellison	d777685ef9	Script for choosing template configurations (#126560 ) This adds logging that marks every invocation of a matmul for a particular input shape and records every template config's performance on it. We can then feed that into a script that minimizes the total mm execution time given N allowed templates, and, in the future, run other experiments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126560 Approved by: https://github.com/nmacchioni, https://github.com/jansel	2024-05-21 02:28:39 +00:00
Jack Taylor	d30cdc4321	[ROCm] amdsmi library integration (#119182 ) Adds monitoring support for ROCm using amdsmi in place of pynvml. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell	2024-05-21 01:59:26 +00:00
Ke Wen	b948b1ad7a	[pipelining] Add pipeline stage test (#126721 ) Test tracer's and manual's stage creation by using a basic schedule (GPipe). (Migrated from https://github.com/pytorch/PiPPy/blob/main/test/test_pipeline_stage.py) Test command: ``` $ python test_stage.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126721 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2024-05-21 01:22:10 +00:00
Joel Schlosser	31ba6ee49b	Traceable wrapper subclass support for deferred runtime asserts (#126198 ) The padded dense -> jagged conversion op has the signature: ``` _fbgemm_dense_to_jagged_forward(Tensor dense, Tensor[] offsets, SymInt? total_L=None) -> Tensor ``` when `total_L` is not specified, the meta registration has a data-dependent output shape (based on `offsets[0][-1]`). Returning an unbacked SymInt here should work in theory, but traceable wrapper subclass support is missing in later code to handle deferred runtime asserts. This PR fixes this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126198 Approved by: https://github.com/ezyang	2024-05-21 01:21:46 +00:00
youkaichao	82b4528788	[cudagraph] fix verbose graph logging (#126694 ) According to the [doc](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g0907ca7a1e7d0211b71ee49c5403072b): > enum cudaGraphDebugDotFlags > CUDA Graph debug write options > > Values > cudaGraphDebugDotFlagsVerbose = 1<<0 > Output all debug data as if every debug flag is enabled > cudaGraphDebugDotFlagsKernelNodeParams = 1<<2 > Adds cudaKernelNodeParams to output > cudaGraphDebugDotFlagsMemcpyNodeParams = 1<<3 > Adds cudaMemcpy3DParms to output > cudaGraphDebugDotFlagsMemsetNodeParams = 1<<4 > Adds cudaMemsetParams to output > cudaGraphDebugDotFlagsHostNodeParams = 1<<5 > Adds cudaHostNodeParams to output > cudaGraphDebugDotFlagsEventNodeParams = 1<<6 > Adds cudaEvent_t handle from record and wait nodes to output > cudaGraphDebugDotFlagsExtSemasSignalNodeParams = 1<<7 > Adds cudaExternalSemaphoreSignalNodeParams values to output > cudaGraphDebugDotFlagsExtSemasWaitNodeParams = 1<<8 > Adds cudaExternalSemaphoreWaitNodeParams to output > cudaGraphDebugDotFlagsKernelNodeAttributes = 1<<9 > Adds cudaKernelNodeAttrID values to output > cudaGraphDebugDotFlagsHandles = 1<<10 > Adds node handles and every kernel function handle to output > cudaGraphDebugDotFlagsConditionalNodeParams = 1<<15 > Adds cudaConditionalNodeParams to output > `1 << 10` is not the most verbose flag. it is just one flag to add node handles and every kernel function handle to output. `1 << 0` is the most verbose flag, under the name `cudaGraphDebugDotFlagsVerbose`. 
Here is an example of graph, dumped with `1 << 10`: ```dot digraph dot { subgraph cluster_1 { label="graph_1" graph[style="dashed"]; "graph_1_node_0"[style="solid" shape="rectangle" label="0 MEM_ALLOC node handle: 0x000055D2889750F0 "]; "graph_1_node_1"[style="bold" shape="octagon" label="1 _Z3addPhS_S_m node handle: 0x000055D288979A20 func handle: 0x000055D288978D40 "]; "graph_1_node_2"[style="solid" shape="trapezium"label="2 MEMCPY node handle: 0x000055D28897A130 (DtoH,1024) "]; "graph_1_node_3"[style="solid" shape="rectangle" label="3 MEM_FREE node handle: 0x000055D2889890C0 "]; "graph_1_node_0" -> "graph_1_node_1"; "graph_1_node_1" -> "graph_1_node_2"; "graph_1_node_2" -> "graph_1_node_3"; } } ``` The same graph dumped with `1 << 0`: ```dot digraph dot { subgraph cluster_1 { label="graph_1" graph[style="dashed"]; "graph_1_node_0"[style="solid" shape="record" label="{ MEM_ALLOC \| {{ID \| node handle} \| {0 (topoId: 3) \| 0x000055D2889750F0}} \| {{{poolProps \| {allocType \| handleTypes \| {location \| {type \| id}}} \| {PINNED \| NONE \| DEVICE \| 0}}}} \| {{bytesize \| dptr} \| {1024 \| 0x0000000A02000000}} }"]; "graph_1_node_1"[style="bold" shape="record" label="{KERNEL \| {ID \| 1 (topoId: 2) \| _Z3addPhS_S_m\<\<\<4,256,0\>\>\>} \| {{node handle \| func handle} \| {0x000055D288979A20 \| 0x000055D288978D40}} \| {accessPolicyWindow \| {base_ptr \| num_bytes \| hitRatio \| hitProp \| missProp} \| {0x0000000000000000 \| 0 \| 0.000000 \| N \| N}} \| {cooperative \| 0} \| {priority \| 0} }"]; "graph_1_node_2"[style="solid" shape="record" label="{ MEMCPY \| {{ID \| node handle} \| {2 (topoId: 1) \| 0x000055D28897A130}} \| {kind \| DtoH (DEVICE to HOST PAGEABLE)} \| {{srcPtr \| dstPtr} \| {pitch \| ptr \| xsize \| ysize \| pitch \| ptr \| xsize \| ysize} \| {0 \| 0x0000000A02000000 \| 0 \| 0 \| 0 \| 0x000055D287CA6DB0 \| 0 \| 0}} \| {{srcPos \| {{x \| 0} \| {y \| 0} \| {z \| 0}}} \| {dstPos \| {{x \| 0} \| {y \| 0} \| {z \| 0}}} \| {Extent \| {{Width \| 1024} \| 
{Height \| 1} \| {Depth \| 1}}}} }"]; "graph_1_node_3"[style="solid" shape="record" label="{ MEM_FREE \| {{ID \| node handle} \| {3 (topoId: 0) \| 0x000055D2889890C0}} \| {{dptr} \| {0x0000000A02000000}} }"]; "graph_1_node_0" -> "graph_1_node_1" [headlabel=0]; "graph_1_node_1" -> "graph_1_node_2" [headlabel=0]; "graph_1_node_2" -> "graph_1_node_3" [headlabel=0]; } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126694 Approved by: https://github.com/eqy, https://github.com/eellison	2024-05-21 00:55:15 +00:00
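The bit arithmetic behind #126694 can be checked with a small flag enum. The values below are copied from the documentation quoted in the commit message; the Python class and member names are illustrative only and are not a CUDA or PyTorch API.

```python
from enum import IntFlag


class GraphDebugDotFlags(IntFlag):
    """Illustrative mirror of the cudaGraphDebugDotFlags values quoted above."""
    VERBOSE = 1 << 0
    KERNEL_NODE_PARAMS = 1 << 2
    MEMCPY_NODE_PARAMS = 1 << 3
    MEMSET_NODE_PARAMS = 1 << 4
    HOST_NODE_PARAMS = 1 << 5
    EVENT_NODE_PARAMS = 1 << 6
    EXT_SEMAS_SIGNAL_NODE_PARAMS = 1 << 7
    EXT_SEMAS_WAIT_NODE_PARAMS = 1 << 8
    KERNEL_NODE_ATTRIBUTES = 1 << 9
    HANDLES = 1 << 10
    CONDITIONAL_NODE_PARAMS = 1 << 15


# 1 << 10 selects only the handles option; it does not include the verbose bit.
handles_only = GraphDebugDotFlags.HANDLES
includes_verbose = bool(handles_only & GraphDebugDotFlags.VERBOSE)
print(int(handles_only), includes_verbose)
```

Being the numerically larger value does not make `1 << 10` more verbose: each bit enables one option, and only bit 0 requests everything.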
dshi7	4644611b14	[cprofile] log manifold link instead of raw data to trace_structured (#126451 ) Internal D57459752 returns manifold URL and this PR adds to tlparse payload Pull Request resolved: https://github.com/pytorch/pytorch/pull/126451 Approved by: https://github.com/jamesjwu	2024-05-21 00:44:55 +00:00
Edward Z. Yang	b85f9d7fa2	Add symbolic_shape_specialization structured trace (#126450 ) This is typically the information you want when diagnosing why something overspecialized in dynamic shapes. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126450 Approved by: https://github.com/albanD	2024-05-21 00:34:05 +00:00
chilli	cd3a71f754	Fix silu test for flexattention (#126641 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126641 Approved by: https://github.com/ezyang, https://github.com/drisspg ghstack dependencies: #126615, #126446	2024-05-20 23:40:56 +00:00
chilli	da2292ce6b	Prevent partitioner from ever saving views (#126446 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126446 Approved by: https://github.com/anijain2305 ghstack dependencies: #126615	2024-05-20 23:40:56 +00:00
chilli	831efeeadf	Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126615 Approved by: https://github.com/yanboliang, https://github.com/drisspg, https://github.com/xmfan	2024-05-20 23:40:56 +00:00
Oguz Ulgen	14dc8d4f63	Protect codecache against cache failures (#126696 ) When there's a manifold, memcache or filesystem related issues or network outages, we should not completely fail to compile but instead fallback to cold start. Differential Revision: [D57573835](https://our.internmc.facebook.com/intern/diff/D57573835/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126696 Approved by: https://github.com/aorenste	2024-05-20 22:22:41 +00:00
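The "fall back to cold start" behaviour from #126696 follows a common pattern: treat any cache error as a miss. The sketch below assumes a generic lookup callable and a compile callable; `load_or_compile` and both parameters are hypothetical names, not inductor's code cache API.

```python
from typing import Callable, Optional


def load_or_compile(
    key: str,
    lookup: Callable[[str], Optional[str]],
    compile_fn: Callable[[str], str],
) -> str:
    """Return a cached artifact if possible, otherwise compile from scratch."""
    try:
        cached = lookup(key)
    except Exception:
        # Network outage, filesystem error, etc.: behave as if nothing was cached.
        cached = None
    if cached is not None:
        return cached
    return compile_fn(key)


def broken_lookup(key: str) -> Optional[str]:
    raise ConnectionError("cache backend unreachable")


# The lookup fails, yet the caller still gets a compiled result.
print(load_or_compile("graph_0", broken_lookup, lambda k: f"compiled:{k}"))
```

Catching a broad exception is a deliberate choice here, since the cache is an optimization and should never decide whether compilation succeeds.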
Alexander Kurakin	6f1935b0b5	doc: `torch.utils.data.Sampler`: `__len__` is optional (#125938 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/125938 Approved by: https://github.com/andrewkho, https://github.com/xmfan	2024-05-20 22:20:36 +00:00
Wei Han	74b053d7c4	Pass model path to observer (#126503 ) Summary: Passing model path to observer so that they can get additional info if needed. Test Plan: contbuild & OSS CI Differential Revision: D57475129 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126503 Approved by: https://github.com/kirklandsign	2024-05-20 22:17:56 +00:00
Yueming Hao	acfe237a71	Fix C++ compilation error for tensor array in abi_compatible mode (#126412 ) Fixes #122048 There is a compilation error https://github.com/pytorch/pytorch/issues/122048 when the element type in an array is tensor. It is because `val_to_arg_str` does not take arg type as input, and always generates an int array. This PR changes the underlying `codegen_int_array_var` to `codegen_var_array` by adding type checks and corresponding code generations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126412 Approved by: https://github.com/desertfire	2024-05-20 20:57:50 +00:00
angelayi	3d4f1c3083	[export] Make error name private (#126715 ) Fixes CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/126715 Approved by: https://github.com/clee2000	2024-05-20 20:50:11 +00:00
jhavukainen	d28868c7e8	Change skipIfs to xfails in test_mps.py for test_isin (#125412 ) Follow-up to #124896 to move the added test to use expectedFailure instead of skip. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125412 Approved by: https://github.com/kulinseth	2024-05-20 20:23:53 +00:00
PyTorch MergeBot	8bca0847c2	Revert "[TD] Upload names of failures to s3 for pytest cache (#126315 )" This reverts commit 655038687afd19a4a4c9371b77ff046fd6c84be1. Reverted https://github.com/pytorch/pytorch/pull/126315 on behalf of https://github.com/clee2000 due to broke inductor ([comment](https://github.com/pytorch/pytorch/pull/126315#issuecomment-2121133045))	2024-05-20 20:15:08 +00:00
Yueming Hao	2813f0672a	fix huggingface models input issue in torchbench (#126579 ) Fixes https://github.com/pytorch/benchmark/issues/2263. According to https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/common.py#L509, example_inputs are formatted as dictionaries for HuggingFace models. However, this forward_pass function passes all inputs to mod with *, which may only pass the input_ids key in HuggingFace model's example inputs. To reproduce, run the following command. ```bash python pytorch/benchmarks/dynamo/torchbench.py --performance --inference -dcuda --only=hf_Bert --output=torchbench_inference.csv ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126579 Approved by: https://github.com/xuzhao9	2024-05-20 19:10:46 +00:00
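The input problem described in #126579 comes from how Python unpacks a dictionary: `*` yields the keys, while `**` yields key/value pairs. The sketch below uses a plain function in place of a model; `forward` and `call_model` are hypothetical helpers, and the dispatch on `dict` is an assumption about one reasonable fix rather than the exact change in the PR.

```python
def forward(input_ids, labels=None):
    """Stand-in for a model's forward; simply echoes what it received."""
    return {"input_ids": input_ids, "labels": labels}


example_inputs = {"input_ids": [101, 2023, 102], "labels": [1]}

# `*` iterates over the dict, so the model receives the key strings.
positional = forward(*example_inputs)
# `**` passes each value under its own keyword.
keyword = forward(**example_inputs)


def call_model(mod, inputs):
    """Unpack dict inputs by keyword and sequence inputs positionally."""
    if isinstance(inputs, dict):
        return mod(**inputs)
    return mod(*inputs)


print(positional["input_ids"], keyword["input_ids"])
```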
Mu-Chu Lee	11c2d127ec	[AOTInductor] Add config to allow buffer mutation (#126584 ) Summary: Add an additional config to allow buffer mutation. For data that's greater than 2GB, we would need to set it as read-only, otherwise overflow would occur. This is a temporary solution since it won't handle cases that require mutable data greater than 2GB. Test Plan: Included in commit. Differential Revision: D57514729 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126584 Approved by: https://github.com/chenyang78	2024-05-20 18:16:00 +00:00
Xu Zhao	2068dadbe8	[torchbench] Add torchao to PT2 Benchmark Runner (#126469 ) Summary: X-link: https://github.com/pytorch/benchmark/pull/2268 Support torchao performance and accuracy tests in PT2 Benchmark Runner, using the inductor backend as the baseline. Test Plan: ``` $ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --only BERT_pytorch --bfloat16 --quantization int8dynamic --performance --inference --print-memory loading model: 0it [00:50, ?it/s] cuda eval BERT_pytorch memory: eager: 0.75 GB, dynamo: 0.75 GB, ratio: 1.00 running benchmark: 100% 1.003x ``` Reviewed By: jerryzh168 Differential Revision: D57463273 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126469 Approved by: https://github.com/huydhn	2024-05-20 17:53:44 +00:00
Edward Z. Yang	022adf8c5e	Fix bug for comptime.get_local for cells/closures (#126637 ) I wasn't paying enough attention and didn't notice that LOAD_DEREF is defined differently for InliningInstructionTranslator. Match it up with the code there. This also fixes comptime.print(), which was broken, because closing over an argument turned it into a cell rather than a regular local. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126637 Approved by: https://github.com/yanboliang	2024-05-20 17:51:28 +00:00
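The remark in #126637 that closing over an argument turns it into a cell can be observed directly on CPython code objects. This is a plain-Python illustration and does not touch Dynamo; the function names are made up for the example.

```python
def plain(x):
    # `x` is an ordinary local: no inner function refers to it.
    return x + 1


def closing(x):
    # `x` is captured by `inner`, so CPython stores it in a cell.
    def inner():
        return x
    return inner


inner_fn = closing(5)
print(plain.__code__.co_cellvars)       # no cell variables
print(closing.__code__.co_cellvars)     # `x` is listed as a cell variable
print(inner_fn.__code__.co_freevars)    # `inner` sees `x` as a free variable
print(inner_fn.__closure__[0].cell_contents)
```

Reads of such a variable compile to `LOAD_DEREF` rather than `LOAD_FAST`, which is why a tool that only inspects regular locals misses it.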
Jason Ansel	f9de510121	[dynamo] Graph break on set_num_threads (#126623 ) Fixes #125364 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126623 Approved by: https://github.com/yanboliang	2024-05-20 17:44:32 +00:00
angelayi	89c1cfe144	[export] Allow modules to be created in the forward (#125725 ) Fixes the error in non-strict export when we're tracing a module that initializes another module in its forward function. This appears in [many huggingface models](https://github.com/search?q=repo%3Ahuggingface%2Ftransformers+CrossEntropyLoss%28%29&type=code&fbclid=IwAR285uKvSevJM6SDbXmb4-monj4iH7wf8opkvnec-li7sKpn4lUMjIvbGKc). It's probably not good practice to do this, but since it appears in so many places, and strict-export supports this, we will also support this. The approach we'll take for these cases is that we will inline the call to the module. Parameters and buffers initialized as constants (with `torch.tensor`) will be represented as constant tensors, and those initialized with tensor factory functions (`torch.ones`) will show up as an operator in the graph. The module stack for the ops in the inlined module will reflect the toplevel's module stack. One issue is that strict-export seems to segfault when there is an `nn.Parameter` call in the constructor (https://github.com/pytorch/pytorch/issues/126109). Non-strict export will succeed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125725 Approved by: https://github.com/ydwu4	2024-05-20 17:42:20 +00:00
Catherine Lee	655038687a	[TD] Upload names of failures to s3 for pytest cache (#126315 ) Some tests don't get run through pytest, and pytest crashes when a test segfaults, so in both cases the pytest cache won't have an entry (similar to https://github.com/pytorch/test-infra/pull/5205). Instead, manually upload/download an extra file that lists the failing test files. Technically this would be more general than the pytest cache. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126315 Approved by: https://github.com/ZainRizvi	2024-05-20 17:36:30 +00:00
Colin Peppler	8c38d0cd64	[inductor] Fix edge case in JIT vs. AOT fusion after finalizing MultiTemplateBuffer (#126622 ) # Context Here's a peripheral scenario causing the JIT-pass and AOT-pass to pick different fusions. ```py # JIT -- buf3 is a MultiTemplateBuffer V.graph.buffers = [buf0, buf1, buf2, buf3, buf4] ^ ^ # JIT pass calls finalize_multi_template_buffers() V.graph.buffers = [buf0, buf1, buf2, buf4, buf3] # AOT, note proximity_score(buf2, buf4) is "better" for fusion than JIT V.graph.buffers = [buf0, buf1, buf2, buf4, buf3] ^ ^ ``` It happens like this: * JIT starts with the original set nodes using V.graph.buffers * In JIT, finalize_multi_template_buffers() is called which can change the order of the buffers. * This makes the order of buffers/scheduler nodes different. * Now, each node's min/max-order is different than before. * As a result, the proximity between two nodes is different. `ad67553c5c/torch/_inductor/scheduler.py (L2316-L2335)` # Error ``` $ TORCH_LOGS="+fusion" python test/inductor/test_max_autotune.py -k test_jit_fusion_matches_aot_fusion ====================================================================== FAIL: test_jit_fusion_matches_aot_fusion (__main__.TestMaxAutotune) ---------------------------------------------------------------------- Traceback (most recent call last): ... 
File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1718, in compile_to_fn code, linemap = self.codegen_with_cpp_wrapper() File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1618, in codegen_with_cpp_wrapper return self.codegen() File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1636, in codegen self.scheduler.codegen() File "/data/users/colinpeppler/pytorch/torch/_dynamo/utils.py", line 210, in time_wrapper r = func(args, *kwargs) File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2602, in codegen self.get_backend(device).codegen_node(node) # type: ignore[possibly-undefined] File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cuda_combined_scheduling.py", line 66, in codegen_node return self._triton_scheduling.codegen_node(node) File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3377, in codegen_node return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel) File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3602, in codegen_node_schedule final_kernel.call_kernel(final_kernel.kernel_name) File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3055, in call_kernel grid = wrapper.generate_default_grid(name, grid) File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cpp_wrapper_cuda.py", line 174, in generate_default_grid params is not None AssertionError: cuda kernel parameters for triton_poi_fused_add_0 should already exist at this moment, only found dict_keys(['Placeholder.DESCRIPTIVE_NAME', 'triton_poi_fused_add_mul_0', 'triton_poi_fused_pow_1']) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126622 Approved by: https://github.com/chenyang78 ghstack dependencies: #125982	2024-05-20 16:58:08 +00:00
Catherine Lee	7aa853a54e	[CI] Install sccache on XLA build job (#126117 ) XLA build job uses a docker image from XLA, which doesn't have sccache installed. The XLA build job just builds pytorch, XLA gets built during the test job. The pytorch build was taking 1+hrs, with a warm cache it takes <30min Pull Request resolved: https://github.com/pytorch/pytorch/pull/126117 Approved by: https://github.com/malfet	2024-05-20 16:39:14 +00:00
Xia, Weiwen	3642e51ea5	[Quant][PT2E] enable qlinear post op fusion for dynamic quant & qat (#122667 ) Description Add fusion path for dynamic quant and for QAT. The following patterns can be matched for static quant with QAT cases: `qx -> qlinear -> add -> optional relu -> optional type convert -> optional quant` The following patterns can be matched for dynamic quant cases: `qx -> qlinear -> add -> optional relu` Test plan python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear python test/test_quantization.py -k test_linear_unary python test/test_quantization.py -k test_linear_binary Pull Request resolved: https://github.com/pytorch/pytorch/pull/122667 Approved by: https://github.com/jgong5	2024-05-20 15:55:18 +00:00
Nikita Shulga	2f53747ec6	Speedup bf16 gemm fallback on ARM (#126592 ) By dispatching it to multiple threads and using a vectorized dot operation (with bf16 to fp32 upcasts via left shift). This bumps stories110M eval from 22 to 55 tokens/sec using bfloat16. TODO: - Refactor tinygemm template and use it here Pull Request resolved: https://github.com/pytorch/pytorch/pull/126592 Approved by: https://github.com/mikekgfb	2024-05-20 12:39:51 +00:00
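The "upcast via left shift" mentioned in #126592 works because a bfloat16 value is the upper 16 bits of the corresponding float32. The sketch below shows the bit manipulation with the standard library only; the helper names are hypothetical, and the down-conversion truncates instead of rounding to keep the example short.

```python
import struct


def float_to_bf16_bits(value: float) -> int:
    """Keep the upper 16 bits of the float32 encoding (truncation, no rounding)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits >> 16


def bf16_bits_to_float(bits: int) -> float:
    """Upcast bf16 to fp32 by shifting the 16 bits into the high half."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))
    return value


def bf16_dot(a_bits, b_bits) -> float:
    """Dot product of two bf16 vectors, accumulating in a wider type."""
    return sum(bf16_bits_to_float(x) * bf16_bits_to_float(y)
               for x, y in zip(a_bits, b_bits))


print(hex(float_to_bf16_bits(1.0)), bf16_bits_to_float(0x4049))
```

A vectorized kernel applies the same shift to a whole register of 16-bit lanes at once, so the upcast needs no lookup or conversion instruction.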
PyTorch MergeBot	cb69c51b6f	Revert " Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127 )" This reverts commit cf35a591b95220aa1bfcc04ff8a943efd1d6d6eb. Reverted https://github.com/pytorch/pytorch/pull/125127 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/125127#issuecomment-2120337584))	2024-05-20 12:14:22 +00:00
Peter Bell	7100a72950	[inductor] Fix ops.scan for non-commutative operators (#126633 ) `tl.associative_scan` supports non-commutative combine functions but `tl.reduce` doesn't. This affects non-persistent scans, where we use the reduction from the previous loop iterations as the base for future iterations. Here I work around this by taking the last element of the scan output and using that as the reduced value. This is done using a trick where we create a mask that is 1 at the desired element and 0 elsewhere, then sum over it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126633 Approved by: https://github.com/Chillee, https://github.com/lezcano	2024-05-20 10:27:17 +00:00
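The workaround in #126633 can be mimicked in plain Python. This is a sketch of the idea, not the Triton code inductor emits: the combine function is composition of affine maps `x -> a*x + b` (associative but not commutative), the scan runs chunk by chunk, and the carry for the next chunk is read from the last scan element with a one-hot mask and a sum. All names are made up for the example.

```python
from typing import List, Optional, Tuple

Affine = Tuple[int, int]  # (a, b) represents x -> a * x + b


def combine(left: Affine, right: Affine) -> Affine:
    """Apply `left` first, then `right`; the order matters."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def select_last(values: List[Affine]) -> Affine:
    """Pick the last element via a mask that is 1 there and 0 elsewhere."""
    mask = [1 if i == len(values) - 1 else 0 for i in range(len(values))]
    return (sum(m * a for m, (a, _) in zip(mask, values)),
            sum(m * b for m, (_, b) in zip(mask, values)))


def chunked_scan(items: List[Affine], chunk_size: int) -> List[Affine]:
    """Inclusive scan processed in chunks, carrying the last scan value forward."""
    out: List[Affine] = []
    carry: Optional[Affine] = None
    for start in range(0, len(items), chunk_size):
        acc = carry
        scanned = []
        for item in items[start:start + chunk_size]:
            acc = item if acc is None else combine(acc, item)
            scanned.append(acc)
        carry = select_last(scanned)  # no order-sensitive reduction needed
        out.extend(scanned)
    return out


items = [(2, 1), (3, 0), (1, 5), (2, 2), (4, 1)]
print(chunked_scan(items, 2) == chunked_scan(items, len(items)))
```

The masked sum is itself commutative, so it is safe to compute with a reduction that ignores element order, while the order-sensitive work stays inside the scan.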
PyTorch MergeBot	d9c3485146	Revert "c10d: add Collectives abstraction (#125978 )" This reverts commit 4b2ae2ac338f3a0de340c9711b03989b8ce66fc6. Reverted https://github.com/pytorch/pytorch/pull/125978 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/125978#issuecomment-2119858015))	2024-05-20 07:40:41 +00:00
PyTorch MergeBot	53f73cdeb6	Revert "Add symbolic_shape_specialization structured trace (#126450 )" This reverts commit da1fc85d60fcf0bd1e8638d643a7c0c6560c3a5f. Reverted https://github.com/pytorch/pytorch/pull/126450 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126450#issuecomment-2119798075))	2024-05-20 06:59:58 +00:00
PyTorch MergeBot	5ad2f10034	Revert "[inductor] Load python modules using importlib (#126454 )" This reverts commit faa26df72e2a3ff08f9dd564bb50756916826854. Reverted https://github.com/pytorch/pytorch/pull/126454 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126454#issuecomment-2119771267))	2024-05-20 06:41:11 +00:00
jayanth domalapalli	cf35a591b9	Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127 ) This PR is meant to address issue #123451, more specifically, the ```test_graph_optims``` and ```test_graph_scaling_fused_optimizers``` functions in ```test_cuda.py``` have been updated so that they now use the new OptimizerInfo infrastructure. Lintrunner passed: ``` $ lintrunner test/test_cuda.py ok No lint issues. ``` Tests passed: ``` >python test_cuda.py -k test_graph_optims Ran 19 tests in 7.463s OK (skipped=9) >python test_cuda.py -k test_graph_scaling_fused_optimizers Ran 6 tests in 2.800s OK (skipped=3) ``` Both the functions have been moved to the newly created TestCase class ```TestCudaOptims```. The test is mostly the same except the ```@optims``` decorator is used at the top of the function to implicitly call the function using each of the optimizers mentioned in the decorator instead of explicitly using a for loop to iterate through each of the optimizers. 
I was unable to use the ```_get_optim_inputs_including_global_cliquey_kwargs``` to get all kwargs for each of the optimizers since some of the kwargs that are used in the original ```test_graph_optims``` function are not being returned by the new OptimizerInfo infrastructure, more specifically, for the ```torch.optim.rmsprop.RMSprop``` optimizer, the following kwargs are not returned whenever ```_get_optim_inputs_including_global_cliquey_kwargs``` is called: ``` {'foreach': False, 'maximize': True, 'weight_decay': 0} { 'foreach': True, 'maximize': True, 'weight_decay': 0} ``` I ran into the same issue for ```test_graph_scaling_fused_optimizers```, for the ```torch.optim.adamw.AdamW``` optimizer, whenever ```optim_info.optim_inputs_func(device=device)``` was called, the following kwarg was not returned: ``` {'amsgrad': True} ``` Due to this issue, I resorted to using a dictionary to store the kwargs for each of the optimizers, I am aware that this is less than ideal. I was wondering whether I should use the OptimizerInfo infrastructure to get all the kwargs regardless of the fact that it lacks some kwargs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125127 Approved by: https://github.com/janeyx99	2024-05-20 06:20:45 +00:00
Simon Fan	5fb11cda4f	[compiled autograd] Better cache miss logging (#126602 ) - log only first node key cache miss - log existing node key sizes - log which node's collected sizes became dynamic e.g. ``` DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[] ... DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to new autograd node: torch::autograd::AccumulateGrad (NodeCall 5) with key size 32, previous key sizes=[21] ... DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 0 of torch::autograd::GraphRoot (NodeCall 0) DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of SumBackward0 (NodeCall 1) DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 4 of SumBackward0 (NodeCall 1) DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of ReluBackward0 (NodeCall 2) DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 9 of AddmmBackward0 (NodeCall 3) DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of torch::autograd::AccumulateGrad (NodeCall 5) DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of ReluBackward0 (NodeCall 6) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126602 Approved by: https://github.com/jansel ghstack dependencies: #126144, #126146, #126148, #126483	2024-05-19 23:49:52 +00:00
Simon Fan	be67985bd7	[compiled autograd] log in cpp using python logger (#126483 ) Internal infra may not preserve python and c++ log ordering e.g. MAST logs: https://fburl.com/mlhub/38576cxn, all the `[python_compiled_autograd.cpp] Creating cache entry [...]` logs of the entire run are at the beginning of the file Pull Request resolved: https://github.com/pytorch/pytorch/pull/126483 Approved by: https://github.com/jansel ghstack dependencies: #126144, #126146, #126148	2024-05-19 23:49:52 +00:00
cyy	574ae9afb8	[Submodule] Remove third-party onnx-tensorrt (#126542 ) It seems that tensorrt is not used by the C++ code, possibly due to the removal of Caffe2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126542 Approved by: https://github.com/ezyang	2024-05-19 22:34:24 +00:00
cyy	853081a8e7	Replace torch.library.impl_abstract with torch.library.register_fake (#126606 ) To remove the disrupting warning ``` warnings.warn("torch.library.impl_abstract was renamed to " "torch.library.register_fake. Please use that instead; " "we will remove torch.library.impl_abstract in a future " "version of PyTorch.", DeprecationWarning, stacklevel=2) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126606 Approved by: https://github.com/ezyang	2024-05-19 13:21:39 +00:00
Simon Fan	5ea956a61f	Update hf_BirdBird periodic-dynamo-benchmarks results (#126414 ) can't repro this regression. also nothing in the faulty PR range would cause it only for 1 model. the job is still causing noise, so we should mute it. I think just updating the graph break count is better than skipping the model here since it's still passing Pull Request resolved: https://github.com/pytorch/pytorch/pull/126414 Approved by: https://github.com/ezyang	2024-05-19 10:58:07 +00:00
Edward Z. Yang	c4dfd783f4	UFMT torch.utils._sympy.functions (#126553 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126553 Approved by: https://github.com/lezcano, https://github.com/Skylion007 ghstack dependencies: #126511	2024-05-19 10:35:48 +00:00
Edward Z. Yang	7dae7d3ca5	Remove unnecessary implementations from MockHandler (#126511 ) Dead implementations are confusing and can cause bugs when people accidentally hit them. Better for it to be missing. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126511 Approved by: https://github.com/peterbell10, https://github.com/lezcano	2024-05-19 04:43:54 +00:00
PyTorch MergeBot	71b6459edc	Revert "[Dynamo] Treat integers stored on nn.Modules as dynamic (#126466 )" This reverts commit 6bb9d6080d33c817fcbf9e5ae8a59b76812a53d2. Reverted https://github.com/pytorch/pytorch/pull/126466 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the ONNX test failure looks legit, not flaky, as it starts failing in trunk `6bb9d6080d` ([comment](https://github.com/pytorch/pytorch/pull/126466#issuecomment-2119078245))	2024-05-19 02:52:11 +00:00
chilli	e3230f87aa	Cached required_fw_nodes creation (#126613 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126613 Approved by: https://github.com/anijain2305	2024-05-19 01:48:52 +00:00
Huy Do	abc4b66124	Forward fix the failed new test from D57474327 (#126596 ) Summary: TSIA. The two look the same to me, but buck was failing with the following error when `with torch._inductor.utils.fresh_inductor_cache()` is used: ``` _________________________ ReproTests.test_issue126128 __________________________ self = <caffe2.test.dynamo.test_repros.ReproTests testMethod=test_issue126128> def test_issue126128(self): def fn(): x = torch.randn(1, 10) y = torch.randn(10, 1) return torch.mm(x, y).sum() def fn2(): x = torch.randn(10, 100) y = torch.randn(100, 10) return torch.mm(x, y).sum() > with torch._inductor.utils.fresh_inductor_cache(): E AttributeError: module 'torch._inductor' has no attribute 'utils' ``` Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_repros.py::ReproTests::test_issue126128'` Differential Revision: D57516676 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126596 Approved by: https://github.com/xmfan	2024-05-18 23:56:03 +00:00
Tarunbir Gambhir	ad67553c5c	Updated test_torch.py to use new OptimizerInfo infrastructure (#125538 ) Fixes #123451 (only addresses test_torch.py cases) This PR solves the specific task to update `test_grad_scaling_autocast` and `test_params_invalidated_with_grads_invalidated_between_unscale_and_step` in `test/test_torch.py` to use the new OptimizerInfo infrastructure. I have combined tests that call `_grad_scaling_autocast_test` into one called `test_grad_scaling_autocast` and used `_get_optim_inputs_including_global_cliquey_kwargs` to avoid hard-coded configurations. ``` $ lintrunner test/test_cuda.py ok No lint issues. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125538 Approved by: https://github.com/janeyx99	2024-05-18 15:42:45 +00:00
Jiashen Cao	99af1b3ab0	Refactor variables / function names related to non-strict export (#126458 ) Improve variable and function naming for better clarity: `non strict` --> `aten`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126458 Approved by: https://github.com/angelayi	2024-05-18 06:05:14 +00:00
Yanbo Liang	6bb9d6080d	[Dynamo] Treat integers stored on nn.Modules as dynamic (#126466 ) Fixes #115711 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126466 Approved by: https://github.com/jansel	2024-05-18 05:02:16 +00:00
Will Feng	a44d0cf227	[Traceable FSDP2] Change from register_multi_grad_hook to per-tensor backward hook (#126350 ) As discussed with Andrew before, under compile we will register per-tensor backward hook instead of multi-grad hook, because it's difficult for Dynamo to support `register_multi_grad_hook` (or anything `.grad_fn` related). We expect both to have the same underlying behavior, ~~and we will add integration test (in subsequent PR) to show that compile and eager has same numerics.~~ As discussed below, we will change eager path to use per-tensor backward hook as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126350 Approved by: https://github.com/awgu	2024-05-18 04:44:29 +00:00
drisspg	d4704dcacc	Map float8 types to uint8 for allgather (#126556 ) # Summary Different take on this one: https://github.com/pytorch/pytorch/issues/126338 We should probably not allow this mapping for 'compute' ops e.g. reductions ### Corresponding fp8 PR https://github.com/pytorch-labs/float8_experimental/pull/263 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126556 Approved by: https://github.com/wanchaol	2024-05-18 03:19:16 +00:00
Wang, Eikan	bf099a08f0	[2/N] Non-Tensor: Scalar Support: Add scalar to the cache for eager-through-torch.compile (#124070 ) Add scalar information to the kernel configuration. #### Additional Context Currently, the input parameters are orchestrated by input order in the kernel configuration and loaded/mapped to the kernel at runtime. For example, the cache order of the input parameters of `torch.add(a, b, alpha=2.0)` is `a` first, followed by `b` and then `alpha`. The same order is used for cache loading. However, the orchestration mechanism does not support kwargs because the order of kwargs is not meaningful. For example, the `out` of `aten::gelu.out(Tensor self, *, str approximate='none', Tensor(a!) out) -> Tensor(a!)` may come before `approximate`. We will support it with subsequent PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124070 Approved by: https://github.com/jansel, https://github.com/jgong5	2024-05-18 03:08:37 +00:00
Scott Wolchok	c1767d8626	Faster(?) FP16 gemv kernel (#126297 ) Differential Revision: [D57369266](https://our.internmc.facebook.com/intern/diff/D57369266/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D57369266/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/126297 Approved by: https://github.com/malfet	2024-05-18 03:03:03 +00:00
Jason Ansel	b98decfc38	[halide-backend] Refactor codegen/triton.py into codegen/simd.py (#126415 ) This PR is primarily just moving stuff around. It creates a new common baseclass for TritonCodegen and the (upcoming) HalideCodegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126415 Approved by: https://github.com/shunting314	2024-05-18 02:43:42 +00:00
cyy	74b99438f2	[Submodule] Remove third-party CUB (#126540 ) Because it was updated 4 years ago, and now all supported CUDA versions provide CUB. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126540 Approved by: https://github.com/Skylion007	2024-05-18 02:28:17 +00:00
Ke Wen	1191168c45	[pipelining] Follow improvements in export.unflatten (#126217 ) Previously, we make a copy of `torch.export.unflatten` in pippy/_unflatten.py. But it turns out to be too hard to track bug fixes and improvements in upstream version. For example, `torch.export.unflatten` recently added support for tied parameters, which is something pipelining needs. Now that we moved into pytorch, we make a reference to `torch.export.unflatten` instead of maintaining a copy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126217 Approved by: https://github.com/H-Huang	2024-05-18 02:24:01 +00:00
Tristan Rice	661ecedbd0	gitmodules: switch cpp-httplib to https (#126580 ) Fixes issue introduced in https://github.com/pytorch/pytorch/pull/126470#issuecomment-2118374811 Test plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/126580 Approved by: https://github.com/PaliC, https://github.com/jeffdaily	2024-05-18 01:31:28 +00:00
Will Constable	224f2bef9f	[C10D] Add __repr__ to P2POp class (#126538 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126538 Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/c-p-i-o ghstack dependencies: #126419	2024-05-18 00:58:57 +00:00
Will Constable	bcee6f708a	[Pipelining] Fix 1f1b schedule (#126419 ) This schedule was running fine locally but failing (hanging) on CI. After analysis (https://fburl.com/gdoc/xt80h1gd), it seems like the schedule was not correct previously but may still work depending on the runtime. The fix bundles together fwd-recv(s->s+1) and bwd-send(s+1->s) into one coalesced group so they would not block each other. Design drawing <img width="803" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/906a9a66-39ae-4a6a-bc1a-18b77eaaa784"> Flight recorder traces show the same coalescing pattern as designed <img width="1013" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/ab10646e-eaef-4191-83dd-73f448876c27"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126419 Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501	2024-05-18 00:58:57 +00:00
Manuel Candales	41fb4bcc73	[AOTI] Flag to include aoti sources when building lite interpreter (#126572 ) Summary: Added USE_LITE_AOTI cmake flag, which is turned OFF by default. When it is turned on, the AOTI sources (inductor_core_resources) are included when building lite interpreter Test Plan: ``` ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DUSE_LITE_AOTI=ON ``` Differential Revision: D57394078 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126572 Approved by: https://github.com/malfet	2024-05-18 00:39:42 +00:00
Kostas Tsiampouris	2863c76b1f	[torch-distributed] Make log directory creation idempotent (#126496 ) Summary: https://docs.python.org/3/library/os.html#os.makedirs > If exist_ok is False (the default), a FileExistsError is raised if the target directory already exists. Test Plan: Existing tests Differential Revision: D57471577 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126496 Approved by: https://github.com/d4l3k	2024-05-18 00:17:13 +00:00
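The idempotent directory creation this commit relies on can be shown with the standard library alone. This is a minimal sketch; `ensure_log_dir` is a hypothetical helper name, not the function changed in the PR.

```python
import os
import tempfile


def ensure_log_dir(path: str) -> str:
    """Create path (and any missing parents); do nothing if it already exists."""
    # exist_ok=True suppresses the FileExistsError that os.makedirs raises by
    # default, so several workers can call this for the same directory.
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as root:
    log_dir = os.path.join(root, "logs", "rank0")
    ensure_log_dir(log_dir)
    ensure_log_dir(log_dir)  # second call is a no-op
    created = os.path.isdir(log_dir)
```

Note that `exist_ok=True` still raises if the path exists but is not a directory.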
Sherlock Huang	0d5ba547ec	Tool for scouting exportability in one shot (#126471 ) Summary: Tool for scouting exportability issues in one shot. - Collect sample inputs for all submodules by running eager inference with forward_pre_hook. - Start from root module, recursively try exporting child modules, if current module export fails. Limitations: - only works for nn.module that contains tree-like submodules structure. this doesn't work for flatten GraphModule. TODO: support dynamic_dims Sample output: https://docs.google.com/spreadsheets/d/1jnixrqBTYbWO_y6AaKA13XqOZmeB1MQAMuWL30dGoOg/edit?usp=sharing ``` exportability_report = { '': UnsupportedOperatorException(func=<OpOverload(op='testlib.op_missing_meta', overload='default')>), 'submod_1': UnsupportedOperatorException(func=<OpOverload(op='testlib.op_missing_meta', overload='default')>), 'submod_2': None } ``` Test Plan: buck2 run mode/dev-nosan fbcode//caffe2/test:test_export -- -r TestExportTools Differential Revision: D57466486 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126471 Approved by: https://github.com/zhxchen17	2024-05-18 00:10:46 +00:00
Will Constable	54bc55c515	Remove dist_ prefix from TORCH_LOGS shortcuts (#126499 ) e.g. dist_ddp -> ddp. The 'distributed' shortcut remains unchanged. Feedback has been that it is not appealing to have the dist_ prefix, and the main reason for it was to keep the distributed shortcuts grouped together in the help menu. It's nice to have shorter shortcuts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126499 Approved by: https://github.com/XilunWu, https://github.com/kwen2501 ghstack dependencies: #126322	2024-05-18 00:07:30 +00:00
Nikita Shulga	93844a31b3	Fix aarch64 debug build with GCC (#126290 ) By working around GCCs quirks in instantiating templates that require immediate values. Provide alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`) Test plan (after change was reverted): ssh into aarch64 runner and rebuild given file with `-O0` Fixes https://github.com/pytorch/pytorch/issues/126283 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126290 Approved by: https://github.com/atalman, https://github.com/seemethere	2024-05-17 23:47:08 +00:00
Martim Mendes	d54c28e7fc	Added error checks for invalid inputs on thnn_conv2d (#121906 ) Fixes #121188 Prevent Segmentation Fault in 'torch._C._nn.thnn_conv2d' Previously, calling 'torch._C._nn.thnn_conv2d' with invalid arguments for padding, stride, and kernel_size would result in a segmentation fault. This issue has been resolved by implementing argument validation (using Torch Check). Now, when invalid arguments are detected, a runtime error is raised with a debug message detailing the correct format. Additionally, this commit includes tests to cover the three referenced cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121906 Approved by: https://github.com/janeyx99	2024-05-17 23:41:48 +00:00
Animesh Jain	173b1d811d	[dynamo] Sourceless builder - ordered dict and re.pattern (#126468 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126468 Approved by: https://github.com/Skylion007	2024-05-17 23:24:55 +00:00
Andrew M. James	faa26df72e	[inductor] Load python modules using importlib (#126454 ) The `compile` + `exec` workflow is susceptible to behavior drifting from a "normal" import; use importlib instead to avoid this. In particular, annotations were being stored as strings here due to `from __future__ import annotations` in the scope calling `compile`. Triton cares about annotations on global variables and this makes it much easier to reliably code-gen them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126454 Approved by: https://github.com/peterbell10	2024-05-17 23:13:07 +00:00
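Loading a generated source file through importlib, rather than `compile` + `exec`, can be sketched as follows. This is a generic standard-library illustration, not the code from the PR; `load_module_from_path` and the module name `generated_kernel` are hypothetical.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_module_from_path(name: str, path: str):
    """Import the file at path as module name via the regular import machinery."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {name!r} from {path}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing, as a normal import would.
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "generated_kernel.py"
    src.write_text("LIMIT: int = 3\n\ndef double(x):\n    return 2 * x\n")
    mod = load_module_from_path("generated_kernel", str(src))
    result = mod.double(4)
```

Because the file is compiled by its own loader, `from __future__` statements in the calling scope are not inherited by the loaded module, which is the drift the commit message describes for `compile`.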
Yihan He	d7de4c9d80	Fix issue of lowering nn.linear ops with kwargs (#126331 ) Summary: Support kwarg bias for nn.linear quantization Differential Revision: D57403190 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126331 Approved by: https://github.com/ZhengkaiZ, https://github.com/huydhn	2024-05-17 21:50:55 +00:00
Manuel Candales	c26f6548f9	[AOTI] config target platform (#126306 ) Test Plan: AOTI compile stories15M for Android Differential Revision: D57392830 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126306 Approved by: https://github.com/desertfire	2024-05-17 21:42:19 +00:00
Catherine Lee	09fd771485	Disable vulkan test batch_norm_invalid_inputs (#126571 ) Fails flakily ex https://github.com/pytorch/pytorch/actions/runs/9130802617/job/25109131748 https://github.com/pytorch/pytorch/actions/runs/9125548571/job/25092535707 First bad I can find is `538877d204` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126571 Approved by: https://github.com/SS-JIA	2024-05-17 21:11:07 +00:00
Tugsbayasgalan Manlaibaatar	bed1c600bb	Experimental prototype for converting torch.jit.trace modules to export (#124449 ) Differential Revision: [D56440613](https://our.internmc.facebook.com/intern/diff/D56440613) We want to do this for the following reasons: 1. There is a current limitation in export tracing for torch.jit.traced modules that cannot be easily upstreamed 2. We need to run internal CI regularly to understand feature gaps and continuously track them 3. Multiple people will be working on this prototype so it is better to have a checked in version so we don't always run into merge conflicts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124449 Approved by: https://github.com/angelayi, https://github.com/avikchaudhuri	2024-05-17 20:42:42 +00:00
Peter Y Yeh	30b70b1a63	[ROCm] enable faster_load_save for Fused_SGD (#125456 ) Reopen due to rebase error. Fixes https://github.com/pytorch/pytorch/issues/117599 The reported hang test : `test_cuda.py::TestCuda::test_grad_scaling_autocast_fused_optimizers` is passing with this PR HSA Async copy / host wait on completion signal is resolved in MultiTensorApply.cuh ``` :4:command.cpp :347 : 8881368803196 us: [pid:1268211 tid:0x7f5af80d7180] Command (InternalMarker) enqueued: 0xc4e2070 :4:rocvirtual.cpp :556 : 8881368803201 us: [pid:1268211 tid:0x7f5af80d7180] Host wait on completion_signal=0x7f5967df3e00 :3:rocvirtual.hpp :66 : 8881368803209 us: [pid:1268211 tid:0x7f5af80d7180] Host active wait for Signal = (0x7f5967df3e00) for -1 ns ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125456 Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/janeyx99	2024-05-17 20:36:47 +00:00
PyTorch MergeBot	d782e43464	Revert "[FSDP2] Fixed 2D clip grad norm test (#126497 )" This reverts commit 3f289063117673650db868c978bf3cb8125a22dc. Reverted https://github.com/pytorch/pytorch/pull/126497 on behalf of https://github.com/jeanschmidt due to reverting to check if might have introduced inductor cuda 12 issues ([comment](https://github.com/pytorch/pytorch/pull/126497#issuecomment-2118338716))	2024-05-17 20:29:20 +00:00
Aaron Gokaslan	95b2766864	[BE][Ez]: Use NotADirectoryError in tensorboard writer (#126534 ) Slightly improve exception typing for tensorboard writer Pull Request resolved: https://github.com/pytorch/pytorch/pull/126534 Approved by: https://github.com/ezyang	2024-05-17 19:52:13 +00:00
PaliC	90a5aeea79	[distributed] Add cpp-httplib to pytorch (#126470 ) Adds https://github.com/yhirose/cpp-httplib such that we are able to use https for host to host communication in distributed (specifically torchrun) Todo: We likely need to add cpp-httplib somewhere in the build (cmake/bazel) but first we should write the code for it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126470 Approved by: https://github.com/d4l3k, https://github.com/Skylion007	2024-05-17 19:45:08 +00:00
Kwanghoon An	eb0b16db92	Initial implementation of AdaRound (#126153 ) Summary: This is an implementation of AdaRound from the paper https://arxiv.org/abs/2004.10568 This algorithm is going to be used by multiple people, hence we need to make it an official implementation. Differential Revision: D57227565 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126153 Approved by: https://github.com/jerryzh168, https://github.com/huydhn	2024-05-17 19:44:50 +00:00
PyTorch MergeBot	875221dedf	Revert "Fix aarch64 debug build with GCC (#126290 )" This reverts commit 91bf952d10e9524a9b078900d9807efa5d252f5c. Reverted https://github.com/pytorch/pytorch/pull/126290 on behalf of https://github.com/huydhn due to There seems to be a mis-match closing curly bracket here and it breaks some internal build in D57474505 ([comment](https://github.com/pytorch/pytorch/pull/126290#issuecomment-2118246756))	2024-05-17 19:30:02 +00:00
PyTorch MergeBot	f89500030b	Revert "Remove redundant serialization code (#126249 )" This reverts commit aab448e381366d4cf499145adffe9fcb1ac2b28d. Reverted https://github.com/pytorch/pytorch/pull/126249 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing sigmoid/frontend:serialization_test internally ([comment](https://github.com/pytorch/pytorch/pull/126249#issuecomment-2118233656))	2024-05-17 19:19:02 +00:00
briancoutinho	de42af4b00	Add coms metadata to execution trace (ET) (#126317 ) Add Execution Trace communication collective meta data. For specification see https://github.com/pytorch/pytorch/issues/124674 New fields look like ``` { "id": 80, "name": "record_param_comms", "ctrl_deps": 79, "inputs": {"values": [[[78,74,0,100,4,"cuda:0"]],21,["0","default_pg"],0,"allreduce",[],[],0,1,2], "shapes": [[[100]],[],[[],[]],[],[],[],[],[],[],[]], "types": ["GenericList[Tensor(float)]","Int","Tuple[String,String]","Int","String","GenericList[]","GenericList[]","Int","Int","Int"]}, "outputs": {"values": [[[78,74,0,100,4,"cuda:0"]]], "shapes": [[[100]]], "types": ["GenericList[Tensor(float)]"]}, "attrs": [{"name": "rf_id", "type": "uint64", "value": 53},{"name": "fw_parent", "type": "uint64", "value": 0},{"name": "seq_id", "type": "int64", "value": -1},{"name": "scope", "type": "uint64", "value": 0},{"name": "tid", "type": "uint64", "value": 2},{"name": "fw_tid", "type": "uint64", "value": 0},{"name": "op_schema", "type": "string", "value": ""},{"name": "kernel_backend", "type": "string", "value": ""},{"name": "kernel_file", "type": "string", "value": ""}, {"name": "collective_name", "type": "string", "value": "allreduce"}, {"name": "dtype", "type": "string", "value": "Float"}, {"name": "in_msg_nelems", "type": "uint64", "value": 100}, {"name": "out_msg_nelems", "type": "uint64", "value": 100}, {"name": "in_split_size", "type": "string", "value": "[]"}, {"name": "out_split_size", "type": "string", "value": "[]"}, {"name": "global_rank_start", "type": "uint64", "value": 0}, {"name": "global_rank_stride", "type": "uint64", "value": 1}, {"name": "pg_name", "type": "string", "value": "0"}, {"name": "pg_desc", "type": "string", "value": "default_pg"}, {"name": "pg_size", "type": "uint64", "value": 2}] } ``` ## Unit Test Added a new unit test to check the execution trace collected has right attributes `touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python 
test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_execution_trace` ``` STAGE:2024-05-08 17:39:10 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up STAGE:2024-05-08 17:39:10 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing [rank1]:[W508 17:39:12.329544411 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) [rank0]:[W508 17:39:12.329626774 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. 
(function operator()) [rank0]:[W508 17:39:12.339239982 execution_trace_observer.cpp:825] Enabling Execution Trace Observer [rank1]:[W508 17:39:12.339364516 execution_trace_observer.cpp:825] Enabling Execution Trace Observer STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up [rank1]:[W508 17:39:12.352452400 execution_trace_observer.cpp:837] Disabling Execution Trace Observer STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection [rank0]:[W508 17:39:12.354019014 execution_trace_observer.cpp:837] Disabling Execution Trace Observer STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing Execution trace saved at /tmp/tmpy01ngc3w.et.json Execution trace saved at /tmp/tmptf8543k4.et.json ok ---------------------------------------------------------------------- ``` Also run profilerunit test `touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler` ``` STAGE:2024-05-08 18:24:22 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up STAGE:2024-05-08 18:24:22 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing STAGE:2024-05-08 18:24:24 
1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing [rank1]:[W508 18:24:24.508622236 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) [rank0]:[W508 18:24:24.508622241 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. 
(function operator()) STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing Trace saved to /tmp/tmpdrw_cmcu.json Trace saved to /tmp/tmpnio7ec9j.json ok ---------------------------------------------------------------------- Ran 1 test in 19.772s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126317 Approved by: https://github.com/yoyoyocmu, https://github.com/sanrise	2024-05-17 19:08:55 +00:00
andrewor14	6931f781c2	[quant][pt2e] Allow multi users without output observers (#126487 ) Summary: The PT2E quantization flow does not support unquantized outputs yet. To work around this, users may wish to remove the output observer from their graphs. However, this fails currently in some cases because the `PortNodeMetaForQDQ` pass is too restrictive, for example: ``` conv -> obs -------> output0 \\-> add -> output1 ``` Previously we expected conv to always have exactly 1 user, which is the observer. When the observer is removed, however, conv now has 2 users, and this fails the check. ``` conv -------> output0 \\-> add -> output1 ``` This commit relaxes the error into a warning to enable this workaround. Test Plan: python test/test_quantization.py TestQuantizePT2E.test_multi_users_without_output_observer Reviewers: jerryzh168 Subscribers: jerryzh168, supriyar Differential Revision: [D57472601](https://our.internmc.facebook.com/intern/diff/D57472601) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126487 Approved by: https://github.com/tarun292	2024-05-17 18:48:21 +00:00
Sam Larsen	ecd9a4e5c3	Enable FX graph cache for huggingface and timm benchmarks (#126205 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126205 Approved by: https://github.com/eellison	2024-05-17 18:36:05 +00:00
Mikayla Gawarecki	66dc8fb7ff	Allow tensor subclasses and add `torch.serialization.add_safe_globals` that allows users to allowlist classes for `weights_only` load (#124331 ) #### Conditions for allowlisting tensor subclasses We allow tensor subclasses types that (1) Do not override `__setstate__`, `__getattr__`, `__setattr__`, `__get__`, `__set__` or `__getattribute__` of `torch.Tensor` (`torch.Tensor` does not have a definition of `__getattr__`, `__get__` or `__set__` so we check that these are `None`) (2) Use the generic `tp_alloc` (3) Are in a module that has been imported by the user to be pushed onto the stack as strings by `GLOBAL` instructions, while storing the type in a dict The strings will be converted to the classes as appropriate when executing `REBUILD` with `_rebuild_from_type_v2` Note that we use `inspect.getattr_static(sys.modules[module], name)` to get the class/function as this method claims to have no code execution. The rationale for the 3 conditions above is as follows: The rebuild func provided by `Tensor.__reduce_ex__` is `torch._tensor._rebuild_from_type_v2`, which is defined as such (note the call to `getattr`, `Tensor.__setstate__` and the call to `as_subclass` as well as the call to `_set_obj_state` which calls `setattr`) `4e66aaa010/torch/_tensor.py (L57-L71)` `as_subclass` is implemented with a call to `THPVariable_NewWithVar` that will eventually call `tp_alloc` here `4e66aaa010/torch/csrc/autograd/python_variable.cpp (L2053)` The `func` arg to `_rebuild_from_type_v2` for wrapper subclasses is `Tensor.rebuild_wrapper_subclass`, which will similarly call into `THPVariable_NewWithVar` and hit the above `tp_alloc` Note that we do not call `tp_init` or `tp_new` (i.e. 
`cls.__init__` or `cls.__new__`) when unpickling* ### How do we check something is a tensor subclass/constraints around imports In order to check whether `bla` is a tensor subclass in the bytecode `GLOBAL module.name`, we need to do an `issubclass` check, which entails converting the global string to the appropriate type. We do not arbitrarily import modules but will perform this check as long as the given subclass (given by `module.name`) has already been imported by the user (i.e. `module in sys.modules` and `issubclass(getattr(sys.modules[module], name), torch.Tensor)`). This PR also allowlisted `torch._utils._rebuild_wrapper_subclass` and `torch.device` (used by `_rebuild_wrapper_subclass`). ### API for allowlisting This PR also added `torch.serialization.{add/get/clear}_safe_globals`, which enable users to allowlist globals they have deemed safe and to manipulate this list (for example, they could allowlist a tensor subclass with a custom `__setstate__` if they have checked that this is safe). Next steps: - Add testing and allowlist required classes for all in-core tensor subclasses (e.g. `DTensor`, `FakeTensor` etc.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124331 Approved by: https://github.com/albanD	2024-05-17 17:56:57 +00:00
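The "already imported" check described in the commit above can be illustrated with the standard library alone. This is a minimal sketch, not the code from the PR: the helper name `lookup_imported_subclass` is hypothetical, and `dict` stands in for `torch.Tensor` so the snippet runs without PyTorch. It only shows the idea of resolving `module.name` through `sys.modules` and `inspect.getattr_static` without importing anything new.

```python
import collections  # ensures "collections" is present in sys.modules
import inspect
import sys


def lookup_imported_subclass(module_name, attr_name, base):
    """Return the named class if its module is already imported and it subclasses `base`.

    Hypothetical helper for illustration; returns None instead of importing anything.
    """
    module = sys.modules.get(module_name)
    if module is None:
        # The module was never imported by the user, so refuse rather than import it.
        return None
    # getattr_static looks the attribute up without running __getattr__ or descriptors.
    candidate = inspect.getattr_static(module, attr_name, None)
    if isinstance(candidate, type) and issubclass(candidate, base):
        return candidate
    return None


print(lookup_imported_subclass("collections", "OrderedDict", dict))
print(lookup_imported_subclass("collections", "deque", dict))
print(lookup_imported_subclass("module_that_was_never_imported", "Anything", dict))
```

Only the first lookup returns a class; the other two return `None`, either because the class is not a subclass of the base or because the module is absent from `sys.modules`.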
Catherine Lee	31ea8290e7	Workflow for uploading additional test stats on workflow dispatch (#126080 ) This is kind of an experiment for uploading test stats during the run, and also for test dashboard stuff so it can recalculate the info. Adds a workflow that is callable via workflow dispatch for uploading additional test stats. Adds a script that only calculates the additional info. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126080 Approved by: https://github.com/ZainRizvi	2024-05-17 17:29:44 +00:00
Colin Peppler	6bcf15669e	[inductor] fix unbacked case in pointwise + reduction vertical fusion (#125982 )
```
$ INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 python test/inductor/test_unbacked_symints.py -k test_vertical_pointwise_reduction_fusion
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 1953, in fuse_nodes_once
    for node1, node2 in self.get_possible_fusions():
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2010, in get_possible_fusions
    check_all_pairs(node_grouping)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 1997, in check_all_pairs
    if self.can_fuse(node1, node2):
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2252, in can_fuse
    return self.get_backend(device).can_fuse_vertical(node1, node2)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cuda_combined_scheduling.py", line 39, in can_fuse_vertical
    return self._triton_scheduling.can_fuse_vertical(node1, node2)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3237, in can_fuse
    if not all(
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3238, in <genexpr>
    TritonKernel.is_compatible((numel2, rnumel2), n.get_ranges())
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 1543, in is_compatible
    cls._split_iteration_ranges(groups, lengths)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 1507, in _split_iteration_ranges
    while current_group < len(remaining) and sv.size_hint(remaining[current_group]) == 1:
  File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 442, in size_hint
    return int(out)
  File "/home/colinpeppler/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/core/expr.py", line 320, in __int__
    raise TypeError("Cannot convert symbols to int")
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: Cannot convert symbols to int
```
Where the unbacked symints show up:
```
> /data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py(1506)_split_iteration_ranges()
(Pdb) print(groups)
(1, 512*u0)
(Pdb) print(lengths)
([u0, 32, 16], [])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125982 Approved by: https://github.com/jansel	2024-05-17 17:06:24 +00:00
Nikita Shulga	7e9a037b47	[Perf] Vectorize more dtype for int4mm (#126512 ) It used to be vectorized only for f16, but no reason not to do the same for bf16 or f32 Spiritual followup of https://github.com/pytorch/pytorch/pull/125290 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126512 Approved by: https://github.com/Skylion007	2024-05-17 16:34:19 +00:00
Matthew Hoffman	81277baa0c	Remove removed ruff rule TRY200 (#126256 ) My TOML linter is complaining that "TRY200" is not acceptable for the `tool.ruff.lint` schema. From the ruff docs: https://docs.astral.sh/ruff/rules/reraise-no-cause/ > This rule has been removed and its documentation is only available for historical reasons. > > This rule is identical to [B904](https://docs.astral.sh/ruff/rules/raise-without-from-inside-except/) which should be used instead. and we are currently explicitly ignoring B904. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126256 Approved by: https://github.com/Skylion007	2024-05-17 16:31:05 +00:00
Guilherme Leobas	402170b22f	Early return in _recursive_build if obj is a Tensor (#125639 ) Fix issue #125551 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125639 Approved by: https://github.com/ezyang	2024-05-17 15:53:37 +00:00
David Chiu	7e166e8057	[optim] Fix: wrong ASGD implementation (#126375 ) This PR is based on #125440, additionally merging the latest main branch and fixing the lint failures from #126361. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126375 Approved by: https://github.com/janeyx99	2024-05-17 15:46:39 +00:00
James Wu	078e530446	Delete refactored function, move changes over (#126407 ) Oops, in https://github.com/pytorch/pytorch/pull/125610 I moved this function to runtime_wrappers.py, but forgot to delete the old one. https://github.com/pytorch/pytorch/pull/126234 then modified it which would do nothing, so I'm applying the change correctly now and deleting the function as I intended. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126407 Approved by: https://github.com/eellison	2024-05-17 15:28:18 +00:00
eellison	ab307a8992	Default to env variable instead of config value for precompile parallelism (#126333 ) Previously, we would default to the config `compile_threads`. That controls the number of forks we use for async compile. It defaults to 1 in fbcode because fork() has known issues with safety. In precompilation, we are using threads, which have no safety issues and should strictly improve compile time. There isn't really any reason to reduce the thread count except for testing, and it doesn't make sense to share the value used for determining forks. This change makes precompilation default to as many threads as needed unless the env variable is set. Differential Revision: [D57473023](https://our.internmc.facebook.com/intern/diff/D57473023) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126333 Approved by: https://github.com/nmacchioni	2024-05-17 14:58:55 +00:00
Andrew Gu	3f28906311	[FSDP2] Fixed 2D clip grad norm test (#126497 ) This fixes https://github.com/pytorch/pytorch/issues/126484. We change from transformer to MLP stack since transformer seems to introduce slight numeric differences when using TP. We include a sequence parallel layer norm module in the MLP stack to exercise `(S(0), R)` placement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126497 Approved by: https://github.com/weifengpy, https://github.com/wz337	2024-05-17 13:38:31 +00:00
Edward Z. Yang	55033ab43a	Update ops handler documentation some more (#126480 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126480 Approved by: https://github.com/peterbell10 ghstack dependencies: #126292, #126299	2024-05-17 13:31:44 +00:00
cyy	4ed93d6e0c	[Submodule] Remove zstd dependency (#126485 ) After searching in the codebase, it seems that zstd is not in use now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126485 Approved by: https://github.com/ezyang	2024-05-17 12:49:23 +00:00
CaoE	6c503f1dbb	save the reciprocal of weights for welford_reduce (#125148 ) Save the reciprocal of weights for welford_reduce to avoid redundant divisions for improving performance, and `weight_recps` will be inserted into the generated vec kernel.
Generated code:
- Before:
```
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
    tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
}
```
- After:
```
static WeightRecp<at::vec::Vectorized<float>> weight_recps(64);
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
    tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &weight_recps);
}
```
Performance:
- Single core:

| Op | shape | eager/ms | inductor/ms | optimized inductor/ms |
| -- | -- | -- | -- | -- |
| layernorm | (56, 384, 1024) | 16.825 | 22.338 | 15.208 |
| var | (56, 384, 1024) | 21.752 | 13.258 | 13.102 |

- 4 cores:

| Op | shape | eager/ms | inductor/ms | optimized inductor/ms |
| -- | -- | -- | -- | -- |
| layernorm | (56, 384, 1024) | 4.249 | 5.899 | 4.223 |
| var | (56, 384, 1024) | 5.3152 | 3.278 | 2.163 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125148 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-05-17 08:20:12 +00:00
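The idea of caching reciprocals to avoid a division per update can be sketched in plain Python. This is an illustration under stated assumptions, not Inductor's `WeightRecp` or `welford_combine`: the helper names `make_weight_recips` and `welford_mean_m2` are hypothetical, and the scalar loop stands in for the vectorized kernel.

```python
import statistics


def make_weight_recips(n):
    """Precompute 1/k for k = 1..n (index 0 is unused padding)."""
    return [0.0] + [1.0 / k for k in range(1, n + 1)]


def welford_mean_m2(values, recips):
    """Welford's running mean and M2, multiplying by a cached reciprocal."""
    mean, m2 = 0.0, 0.0
    for count, x in enumerate(values, start=1):
        delta = x - mean
        mean += delta * recips[count]  # replaces the division delta / count
        m2 += delta * (x - mean)
    return mean, m2


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
recips = make_weight_recips(len(data))
mean, m2 = welford_mean_m2(data, recips)
variance = m2 / len(data)  # population variance
print(mean, variance)
print(statistics.fmean(data), statistics.pvariance(data))
```

The two printed pairs should agree up to floating-point rounding, since multiplying by a stored `1/k` is numerically close to, but not bit-identical with, dividing by `k`.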
Nicolas Macchioni	8619fe6214	variable search spaces for gemm autotuning (#126220 ) Add a switch to change the gemm autotuning search space between the default (the current set of hardcoded configs) and an exhaustive search space that enumerates all block sizes in [16, 32, 64, 128, 256], stages in [1, 2, 3, 4, 5], and warps in [2, 4, 6]. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126220 Approved by: https://github.com/eellison	2024-05-17 08:09:53 +00:00
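To give a sense of the size of the exhaustive space listed in the commit above, the enumeration can be sketched with `itertools.product`. The value lists come from the commit message; the function name `exhaustive_gemm_configs` and the assumption that a block size is chosen independently for each of three tiled dimensions (M, N, K) are illustrative and not taken from the PR.

```python
from itertools import product

# Value lists quoted from the commit message.
BLOCK_SIZES = [16, 32, 64, 128, 256]
NUM_STAGES = [1, 2, 3, 4, 5]
NUM_WARPS = [2, 4, 6]


def exhaustive_gemm_configs():
    """Yield (block_m, block_n, block_k, num_stages, num_warps) tuples.

    Assumption: three tiled dimensions (M, N, K), each drawn from BLOCK_SIZES.
    """
    yield from product(BLOCK_SIZES, BLOCK_SIZES, BLOCK_SIZES, NUM_STAGES, NUM_WARPS)


configs = list(exhaustive_gemm_configs())
print(len(configs))
print(configs[0], configs[-1])
```

Under that assumption the space has 5 × 5 × 5 × 5 × 3 = 1875 candidates, which suggests why the exhaustive mode is an opt-in switch rather than the default.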
Xia, Weiwen	45f2d09452	[Quant][Inductor] Enable lowering of qlinear-binary(-unary) fusion for X86Inductor (#122593 )
Description
Lower the qlinear binary post op pattern to Inductor. Use post op sum (in-place) if the extra input has the same dtype as output. Otherwise, it uses binary add.
Supported linear-binary(-unary) patterns
```
    linear(X)   extra input
           \   /
            Add
             |
        Optional(relu)
             |
             Y

1. int8-mixed-fp32
+---+---------------+-----------+------------------------------+---------+
| # | Add type      | Quant out | Pattern                      | Post op |
+---+---------------+-----------+------------------------------+---------+
| 1 | In-/out-place | Yes       | linear + fp32 -> (relu) -> q | add     |
+---+---------------+-----------+------------------------------+---------+
| 2 | In-/out-place | No        | linear + fp32 -> (relu)      | sum     |
+---+---------------+-----------+------------------------------+---------+

2. int8-mixed-bf16
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| # | X2 dtype | Add type      | Quant out | Pattern                                          | Post op |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 1 | BF16     | In-/out-place | Yes       | linear + bf16 -> (relu) -> to_fp32 -> q          | add     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 2 | BF16     | In-/out-place | No        | linear + bf16 -> (relu)                          | sum     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 3 | FP32     | Out-place     | Yes       | linear + fp32 -> (relu) -> q                     | add     |
|   |          | In-place right|           |                                                  |         |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 4 | FP32     | Out-place     | No        | linear + fp32 -> (relu)                          | sum     |
|   |          | In-place right|           |                                                  |         |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 5 | FP32     | In-place left | Yes       | linear + fp32 -> to_bf16 -> relu -> to_fp32 -> q | add     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 6 | FP32     | In-place left | No        | linear + fp32 -> to_bf16 -> (relu)               | add     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
```
Note
(1) The positions of linear and the extra input can be swapped.
(2) we don't insert q-dq before the extra input of linear-add by recipe. But if q-dq is found at the extra input, we don't match that pattern because we cannot match all these patterns in 3 passes.
Test plan
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_add
python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear_add
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122593 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/eellison	2024-05-17 07:46:48 +00:00
Isuru Fernando	2edaae436a	Fix cummax and cummin lowering for empty case (#126461 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126461 Approved by: https://github.com/peterbell10	2024-05-17 07:08:32 +00:00
wz337	15ca562f86	[DTensor] Turn on foreach implementation for clip_grad_norm_ for DTensor by default (#126423 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126423 Approved by: https://github.com/awgu	2024-05-17 06:57:52 +00:00
chilli	f9a7033194	Refactor partitioner and clean it up (#126318 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126318 Approved by: https://github.com/anijain2305	2024-05-17 06:15:00 +00:00
Stonepia	5756b53dd8	[XPU] call empty_cache for dynamo tests (#126377 ) When running a batch of models, lacking `empty_cache()` would result in OOM for subsequent models. This PR unifies the `empty_cache` call for both CUDA and XPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126377 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire	2024-05-17 06:05:51 +00:00
Tianyu Liu	9edf54df4d	[dtensor] refactor view ops to use OpStrategy (#126011 ) As titled. Some ops require adjustment of output shape argument. In rule-based sharding prop, global output shape was inferred in the rule (in `view_ops.py`). In strategy-based sharding prop, it is now obtained from propagated out_tensor_meta (in `sharding_prop.py`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/126011 Approved by: https://github.com/wanchaol, https://github.com/XilunWu	2024-05-17 05:39:21 +00:00
Will Constable	a0df40f195	Add dist_pp shortcut to TORCH_LOGS (#126322 ) The distributed log category already includes pipelining since it's under the torch.distributed umbrella. So both TORCH_LOGS=distributed and TORCH_LOGS=dist_pp will enable PP logs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126322 Approved by: https://github.com/kwen2501	2024-05-17 05:32:15 +00:00
Tristan Rice	4b2ae2ac33	c10d: add Collectives abstraction (#125978 ) This adds a new `Collectives` API for doing distributed collective operations. This is intended to replace the [current Elastic store abstraction](https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/utils/store.py) with more performant and debuggable primitives. Design doc: https://docs.google.com/document/d/147KcKJXEHvk1Q6tISLbJVvLejHg_1kIhBQeu-8RQxhY/edit The standard implementation is using `StoreCollectives` but other more performant backends will be added in a follow-up PR. Test plan: ``` python test/distributed/test_collectives.py -v ``` This tests both functionality using multiple threads as well as timeout behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125978 Approved by: https://github.com/shuqiangzhang	2024-05-17 05:09:11 +00:00
eellison	a8c41e0678	dont pad 0 dim mm inputs (#126475 ) Otherwise you get an error in constant_pad_nd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126475 Approved by: https://github.com/huydhn ghstack dependencies: #125772, #125773, #125780	2024-05-17 05:03:27 +00:00
wz337	88582195fd	[FSDP2][Test] Fix _test_clip_grad_norm (#126457 ) We need to compare ref_total_norm to total_norm.full_tensor(). Example: ``` iter_idx:0, rank:0,\ ref_total_norm=tensor(1052.5934, device='cuda:0'),\ total_norm=DTensor(local_tensor=482.0861511230469, device_mesh=DeviceMesh([0, 1]), placements=(_NormPartial(reduce_op='sum', norm_type=2.0),)),\ total_norm.full_tensor()=tensor(1052.5934, device='cuda:0') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126457 Approved by: https://github.com/awgu	2024-05-17 04:29:21 +00:00
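Why the local value (482.09) differs from the full tensor (1052.59) in the log above can be illustrated with a small sketch. It assumes that a `_NormPartial(reduce_op='sum', norm_type=2.0)` placement means each rank holds the norm of its own shard and that the global norm is recovered by summing the p-th powers across ranks and taking the p-th root. The function `combine_partial_norms` is a hypothetical plain-Python stand-in, not DTensor code.

```python
import math


def combine_partial_norms(local_norms, norm_type=2.0):
    """Combine per-rank p-norms into the global p-norm (finite p only).

    Assumption: partial norms reduce by summing p-th powers, then taking the p-th root.
    """
    total = sum(norm ** norm_type for norm in local_norms)
    return total ** (1.0 / norm_type)


# Two ranks holding shards whose L2 norms are 3 and 4.
full_norm = combine_partial_norms([3.0, 4.0])
print(full_norm)

# Any single rank's local norm is at most the combined norm,
# so comparing a reference total norm against a local value would fail.
print(full_norm >= 4.0)
```

This is why the test has to compare against `total_norm.full_tensor()` rather than the local tensor of the partial DTensor.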
Alex Denisov	1a27e24ff5	Make inductor scheduler graph extension configurable (#125578 ) This patch makes the inductor scheduler graph extension configurable. It enables ease of debugging by changing the graph format (dot, png, etc.). Particularly, it's very convenient to work with the graph interactively using tools like https://github.com/tintinweb/vscode-interactive-graphviz Pull Request resolved: https://github.com/pytorch/pytorch/pull/125578 Approved by: https://github.com/Chillee	2024-05-17 04:19:23 +00:00
Edward Z. Yang	da1fc85d60	Add symbolic_shape_specialization structured trace (#126450 ) This is typically the information you want when diagnosing why something overspecialized in dynamic shapes. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126450 Approved by: https://github.com/albanD	2024-05-17 02:01:21 +00:00
Yu, Guangye	d2f5a8ac99	[doc] expose torch.Tensor.xpu API to doc (#126383 ) # Motivation The docstring for `torch.Tensor.xpu` has been added [here](`d61a81a9e7/torch/_tensor_docs.py (L1434)`) but is not exposed in the public docs, unlike [torch.Tensor.cuda](https://pytorch.org/docs/stable/generated/torch.Tensor.cuda.html#torch.Tensor.cuda). This PR exposes the documentation of `torch.Tensor.xpu` in the public docs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126383 Approved by: https://github.com/albanD	2024-05-17 01:19:03 +00:00
Mikayla Gawarecki	776b878917	[easy] Fix typing for `map_location` docs in torch.load (#125473 ) Currently the docs incorrectly list `Callable[[Tensor, str], Tensor]` as a possible type signature; this should be `Callable[[Storage, str], Storage]`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125473 Approved by: https://github.com/albanD	2024-05-17 01:15:25 +00:00
Andrew Gu	697ed6f5b3	[DeviceMesh] Supported N groups in `from_group` (#126258 ) Overview This PR supports constructing an ND mesh with `from_group()` by passing in `group: List[ProcessGroup]` and `mesh: Union[torch.Tensor, "ArrayLike"]` together. The `ndim` of the device mesh returned from `from_group()` is equal to the number of `ProcessGroup`s passed. If the `ndim` is greater than 1, then the `mesh` argument is required (since there is no simple way to recover the `mesh` tensor from the process groups otherwise). This PR also adds `mesh_dim_names` as an argument to forward to the device mesh for convenience. <details> <summary> Old Approach </summary> Overview - This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.) - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general. - This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh. </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126258 Approved by: https://github.com/wanchaol	2024-05-17 01:03:21 +00:00
angelayi	1018a68e31	[export] Delete predispatch tests (#126459 ) Deleting predispatch tests as we moved export to predispatch already Pull Request resolved: https://github.com/pytorch/pytorch/pull/126459 Approved by: https://github.com/tugsbayasgalan	2024-05-17 00:48:32 +00:00
Yidi Wu	8bb7a2f46d	Fix documentation for register_fake_class (#126422 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126422 Approved by: https://github.com/angelayi	2024-05-17 00:45:21 +00:00
drisspg	762ce6f062	Add Lowering for FlexAttention Backwards (#125515 ) # Summary #### What does this PR do? It enables Inductor to actually generate the fused flex attention kernel for the backward pass. I did some other things along the way: - Abstract out the 'build_subgraph_buffer' subroutine and make it reusable between flex attention and flex_attention backwards. In total we need to build 3 subgraphs for fwd + bwd: 1 for the fwd graph and then 2 in the bwd. The FAv2 algorithm recomputes parts of the forward (more efficiently, since we already have the row_max via logsumexp), therefore we need to inline both the fwd graph and the joint graph in the bwds kernel. - The version of the backwards kernel is from a somewhat older version of the triton tutorial implementation. I think that we should update to a newer version in a follow-up. Notably the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization. - I didn't correctly register the decomp table + IndexMode when I landed https://github.com/pytorch/pytorch/pull/123902; this remedies that. - The rel_bias helper func was reversed in terms of causality. I updated it and added a test specific to "future causal" attention. - The main point in this PR that I think still needs to be worked out is the store_output call. 
I have it hacked up to be 'fake' but I dont think we want to land that and likely want to just have a mutated 'dq' and a stored_output 'dk' - I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications) - I updated the benchmark to also profile bwds performance ### Benchmark Numbers: _The current implementation is not parallelizing over ctx length in the bwd_ FWD Speedups \| Type \| Speedup \| shape \| score_mod \| dtype \| \|---------\|-----------\|--------------------\|-------------\|----------------\| \| Average \| 0.991 \| \| \| \| \| Max \| 1.182 \| (16, 16, 4096, 64) \| noop \| torch.bfloat16 \| \| Min \| 0.796 \| (2, 16, 512, 256) \| head_bias \| torch.bfloat16 \| BWD Speedups \| Type \| Speedup \| shape \| score_mod \| dtype \| \|---------\|-----------\|--------------------\|-------------\|----------------\| \| Average \| 0.291 \| \| \| \| \| Max \| 0.652 \| (8, 16, 512, 64) \| head_bias \| torch.bfloat16 \| \| Min \| 0.073 \| (2, 16, 4096, 128) \| head_bias \| torch.bfloat16 \| <details> <summary>Full Data</summary> \| shape \| score_mod \| dtype \| fwd_eager_time \| fwd_compiled_time \| bwd_eager_time \| bwd_compiled_time \| fwd_speedup \| bwd_speedup \| \|---------------------\|---------------\|----------------\|------------------\|---------------------\|------------------\|---------------------\|---------------\|---------------\| \| (2, 16, 512, 64) \| noop \| torch.bfloat16 \| 19.936 \| 19.092 \| 57.851 \| 193.564 \| 1.044 \| 0.299 \| \| (2, 16, 512, 64) \| causal_mask \| torch.bfloat16 \| 19.955 \| 19.497 \| 57.662 \| 206.278 \| 1.024 \| 0.280 \| \| (2, 16, 512, 64) \| relative_bias \| torch.bfloat16 \| 19.455 \| 21.297 \| 57.674 \| 195.219 \| 0.913 \| 0.295 \| \| (2, 16, 512, 64) \| head_bias \| torch.bfloat16 \| 19.958 \| 21.289 \| 57.674 \| 193.859 \| 0.938 \| 0.298 \| \| (2, 16, 512, 128) \| noop \| torch.bfloat16 \| 28.157 \| 28.615 \| 82.831 \| 454.211 \| 0.984 \| 0.182 \| \| (2, 16, 512, 128) \| 
causal_mask \| torch.bfloat16 \| 28.154 \| 28.444 \| 83.091 \| 432.083 \| 0.990 \| 0.192 \| \| (2, 16, 512, 128) \| relative_bias \| torch.bfloat16 \| 28.722 \| 27.897 \| 83.175 \| 446.789 \| 1.030 \| 0.186 \| \| (2, 16, 512, 128) \| head_bias \| torch.bfloat16 \| 28.299 \| 27.673 \| 83.052 \| 459.179 \| 1.023 \| 0.181 \| \| (2, 16, 512, 256) \| noop \| torch.bfloat16 \| 41.167 \| 50.504 \| 175.019 \| 1083.545 \| 0.815 \| 0.162 \| \| (2, 16, 512, 256) \| causal_mask \| torch.bfloat16 \| 41.656 \| 51.933 \| 175.078 \| 1171.176 \| 0.802 \| 0.149 \| \| (2, 16, 512, 256) \| relative_bias \| torch.bfloat16 \| 41.697 \| 50.722 \| 175.159 \| 1097.312 \| 0.822 \| 0.160 \| \| (2, 16, 512, 256) \| head_bias \| torch.bfloat16 \| 41.690 \| 52.387 \| 175.184 \| 1097.336 \| 0.796 \| 0.160 \| \| (2, 16, 1024, 64) \| noop \| torch.bfloat16 \| 39.232 \| 37.454 \| 127.847 \| 612.430 \| 1.047 \| 0.209 \| \| (2, 16, 1024, 64) \| causal_mask \| torch.bfloat16 \| 39.930 \| 39.599 \| 127.755 \| 665.359 \| 1.008 \| 0.192 \| \| (2, 16, 1024, 64) \| relative_bias \| torch.bfloat16 \| 39.417 \| 41.304 \| 127.902 \| 614.990 \| 0.954 \| 0.208 \| \| (2, 16, 1024, 64) \| head_bias \| torch.bfloat16 \| 39.965 \| 42.034 \| 127.953 \| 613.273 \| 0.951 \| 0.209 \| \| (2, 16, 1024, 128) \| noop \| torch.bfloat16 \| 63.964 \| 71.024 \| 226.510 \| 1637.669 \| 0.901 \| 0.138 \| \| (2, 16, 1024, 128) \| causal_mask \| torch.bfloat16 \| 63.843 \| 72.451 \| 226.750 \| 1558.949 \| 0.881 \| 0.145 \| \| (2, 16, 1024, 128) \| relative_bias \| torch.bfloat16 \| 64.301 \| 70.487 \| 226.651 \| 1610.063 \| 0.912 \| 0.141 \| \| (2, 16, 1024, 128) \| head_bias \| torch.bfloat16 \| 64.033 \| 71.394 \| 226.676 \| 1668.511 \| 0.897 \| 0.136 \| \| (2, 16, 1024, 256) \| noop \| torch.bfloat16 \| 129.348 \| 141.390 \| 507.337 \| 4405.175 \| 0.915 \| 0.115 \| \| (2, 16, 1024, 256) \| causal_mask \| torch.bfloat16 \| 129.538 \| 145.680 \| 507.178 \| 4768.874 \| 0.889 \| 0.106 \| \| (2, 16, 1024, 256) \| relative_bias \| 
torch.bfloat16 \| 129.438 \| 142.782 \| 507.004 \| 4401.002 \| 0.907 \| 0.115 \| \| (2, 16, 1024, 256) \| head_bias \| torch.bfloat16 \| 129.058 \| 146.242 \| 507.547 \| 4434.251 \| 0.883 \| 0.114 \| \| (2, 16, 4096, 64) \| noop \| torch.bfloat16 \| 481.606 \| 409.120 \| 1440.890 \| 14147.269 \| 1.177 \| 0.102 \| \| (2, 16, 4096, 64) \| causal_mask \| torch.bfloat16 \| 480.227 \| 438.847 \| 1434.419 \| 14973.386 \| 1.094 \| 0.096 \| \| (2, 16, 4096, 64) \| relative_bias \| torch.bfloat16 \| 480.831 \| 458.104 \| 1432.935 \| 14193.253 \| 1.050 \| 0.101 \| \| (2, 16, 4096, 64) \| head_bias \| torch.bfloat16 \| 480.749 \| 452.497 \| 1437.040 \| 14084.869 \| 1.062 \| 0.102 \| \| (2, 16, 4096, 128) \| noop \| torch.bfloat16 \| 872.534 \| 848.275 \| 2600.895 \| 35156.849 \| 1.029 \| 0.074 \| \| (2, 16, 4096, 128) \| causal_mask \| torch.bfloat16 \| 872.647 \| 868.279 \| 2587.581 \| 31919.531 \| 1.005 \| 0.081 \| \| (2, 16, 4096, 128) \| relative_bias \| torch.bfloat16 \| 871.484 \| 827.644 \| 2593.989 \| 34805.634 \| 1.053 \| 0.075 \| \| (2, 16, 4096, 128) \| head_bias \| torch.bfloat16 \| 871.422 \| 856.437 \| 2602.482 \| 35708.591 \| 1.017 \| 0.073 \| \| (2, 16, 4096, 256) \| noop \| torch.bfloat16 \| 1904.497 \| 1758.183 \| 6122.416 \| 66754.593 \| 1.083 \| 0.092 \| \| (2, 16, 4096, 256) \| causal_mask \| torch.bfloat16 \| 1911.174 \| 1762.821 \| 6113.207 \| 72759.392 \| 1.084 \| 0.084 \| \| (2, 16, 4096, 256) \| relative_bias \| torch.bfloat16 \| 1911.254 \| 1727.108 \| 6123.530 \| 66577.988 \| 1.107 \| 0.092 \| \| (2, 16, 4096, 256) \| head_bias \| torch.bfloat16 \| 1916.977 \| 1801.804 \| 6118.158 \| 67359.680 \| 1.064 \| 0.091 \| \| (8, 16, 512, 64) \| noop \| torch.bfloat16 \| 44.984 \| 43.974 \| 170.276 \| 262.259 \| 1.023 \| 0.649 \| \| (8, 16, 512, 64) \| causal_mask \| torch.bfloat16 \| 45.001 \| 46.265 \| 170.509 \| 274.893 \| 0.973 \| 0.620 \| \| (8, 16, 512, 64) \| relative_bias \| torch.bfloat16 \| 45.466 \| 48.211 \| 170.606 \| 262.759 \| 0.943 \| 0.649 
\| \| (8, 16, 512, 64) \| head_bias \| torch.bfloat16 \| 45.481 \| 48.435 \| 170.267 \| 261.265 \| 0.939 \| 0.652 \| \| (8, 16, 512, 128) \| noop \| torch.bfloat16 \| 72.565 \| 74.736 \| 313.220 \| 773.126 \| 0.971 \| 0.405 \| \| (8, 16, 512, 128) \| causal_mask \| torch.bfloat16 \| 72.015 \| 75.755 \| 313.311 \| 775.513 \| 0.951 \| 0.404 \| \| (8, 16, 512, 128) \| relative_bias \| torch.bfloat16 \| 72.105 \| 74.189 \| 313.806 \| 769.238 \| 0.972 \| 0.408 \| \| (8, 16, 512, 128) \| head_bias \| torch.bfloat16 \| 72.005 \| 74.364 \| 313.509 \| 775.237 \| 0.968 \| 0.404 \| \| (8, 16, 512, 256) \| noop \| torch.bfloat16 \| 138.656 \| 165.453 \| 663.707 \| 2672.067 \| 0.838 \| 0.248 \| \| (8, 16, 512, 256) \| causal_mask \| torch.bfloat16 \| 139.096 \| 172.613 \| 663.593 \| 2926.538 \| 0.806 \| 0.227 \| \| (8, 16, 512, 256) \| relative_bias \| torch.bfloat16 \| 139.500 \| 168.417 \| 663.938 \| 2658.629 \| 0.828 \| 0.250 \| \| (8, 16, 512, 256) \| head_bias \| torch.bfloat16 \| 139.776 \| 173.549 \| 662.920 \| 2667.266 \| 0.805 \| 0.249 \| \| (8, 16, 1024, 64) \| noop \| torch.bfloat16 \| 134.883 \| 125.004 \| 484.706 \| 1195.254 \| 1.079 \| 0.406 \| \| (8, 16, 1024, 64) \| causal_mask \| torch.bfloat16 \| 134.297 \| 132.875 \| 485.420 \| 1234.953 \| 1.011 \| 0.393 \| \| (8, 16, 1024, 64) \| relative_bias \| torch.bfloat16 \| 134.839 \| 139.231 \| 485.470 \| 1198.556 \| 0.968 \| 0.405 \| \| (8, 16, 1024, 64) \| head_bias \| torch.bfloat16 \| 133.822 \| 136.449 \| 485.608 \| 1189.198 \| 0.981 \| 0.408 \| \| (8, 16, 1024, 128) \| noop \| torch.bfloat16 \| 235.470 \| 234.765 \| 886.094 \| 2662.944 \| 1.003 \| 0.333 \| \| (8, 16, 1024, 128) \| causal_mask \| torch.bfloat16 \| 236.305 \| 241.382 \| 886.293 \| 2646.984 \| 0.979 \| 0.335 \| \| (8, 16, 1024, 128) \| relative_bias \| torch.bfloat16 \| 236.414 \| 233.980 \| 885.250 \| 2642.178 \| 1.010 \| 0.335 \| \| (8, 16, 1024, 128) \| head_bias \| torch.bfloat16 \| 237.176 \| 239.040 \| 885.754 \| 2665.242 \| 0.992 \| 0.332 
\| \| (8, 16, 1024, 256) \| noop \| torch.bfloat16 \| 504.445 \| 517.855 \| 1978.956 \| 9592.906 \| 0.974 \| 0.206 \| \| (8, 16, 1024, 256) \| causal_mask \| torch.bfloat16 \| 502.428 \| 536.002 \| 1978.611 \| 10607.342 \| 0.937 \| 0.187 \| \| (8, 16, 1024, 256) \| relative_bias \| torch.bfloat16 \| 503.396 \| 523.960 \| 1977.993 \| 9539.284 \| 0.961 \| 0.207 \| \| (8, 16, 1024, 256) \| head_bias \| torch.bfloat16 \| 503.818 \| 536.014 \| 1980.131 \| 9576.262 \| 0.940 \| 0.207 \| \| (8, 16, 4096, 64) \| noop \| torch.bfloat16 \| 1970.139 \| 1674.930 \| 5750.940 \| 16724.134 \| 1.176 \| 0.344 \| \| (8, 16, 4096, 64) \| causal_mask \| torch.bfloat16 \| 1959.036 \| 1775.056 \| 5780.512 \| 17390.350 \| 1.104 \| 0.332 \| \| (8, 16, 4096, 64) \| relative_bias \| torch.bfloat16 \| 1947.198 \| 1773.869 \| 5780.643 \| 16779.699 \| 1.098 \| 0.345 \| \| (8, 16, 4096, 64) \| head_bias \| torch.bfloat16 \| 1963.935 \| 1829.502 \| 5780.018 \| 16703.259 \| 1.073 \| 0.346 \| \| (8, 16, 4096, 128) \| noop \| torch.bfloat16 \| 3582.711 \| 3362.623 \| 10436.069 \| 36415.565 \| 1.065 \| 0.287 \| \| (8, 16, 4096, 128) \| causal_mask \| torch.bfloat16 \| 3581.504 \| 3499.472 \| 10346.869 \| 36164.959 \| 1.023 \| 0.286 \| \| (8, 16, 4096, 128) \| relative_bias \| torch.bfloat16 \| 3589.779 \| 3337.849 \| 10529.621 \| 36261.696 \| 1.075 \| 0.290 \| \| (8, 16, 4096, 128) \| head_bias \| torch.bfloat16 \| 3602.265 \| 3436.444 \| 10458.660 \| 36507.790 \| 1.048 \| 0.286 \| \| (8, 16, 4096, 256) \| noop \| torch.bfloat16 \| 7695.923 \| 7126.275 \| 24643.009 \| 140949.081 \| 1.080 \| 0.175 \| \| (8, 16, 4096, 256) \| causal_mask \| torch.bfloat16 \| 7679.939 \| 7186.252 \| 24538.105 \| 157156.067 \| 1.069 \| 0.156 \| \| (8, 16, 4096, 256) \| relative_bias \| torch.bfloat16 \| 7681.374 \| 6994.832 \| 24549.713 \| 140077.179 \| 1.098 \| 0.175 \| \| (8, 16, 4096, 256) \| head_bias \| torch.bfloat16 \| 7679.822 \| 7212.278 \| 24627.823 \| 140675.003 \| 1.065 \| 0.175 \| \| (16, 16, 512, 64) \| 
noop \| torch.bfloat16 \| 80.126 \| 78.291 \| 333.719 \| 541.165 \| 1.023 \| 0.617 \| \| (16, 16, 512, 64) \| causal_mask \| torch.bfloat16 \| 80.065 \| 81.696 \| 333.779 \| 551.113 \| 0.980 \| 0.606 \| \| (16, 16, 512, 64) \| relative_bias \| torch.bfloat16 \| 80.138 \| 86.715 \| 333.364 \| 542.118 \| 0.924 \| 0.615 \| \| (16, 16, 512, 64) \| head_bias \| torch.bfloat16 \| 80.415 \| 85.204 \| 333.294 \| 536.840 \| 0.944 \| 0.621 \| \| (16, 16, 512, 128) \| noop \| torch.bfloat16 \| 134.964 \| 138.025 \| 607.093 \| 1333.102 \| 0.978 \| 0.455 \| \| (16, 16, 512, 128) \| causal_mask \| torch.bfloat16 \| 134.192 \| 141.523 \| 606.269 \| 1424.318 \| 0.948 \| 0.426 \| \| (16, 16, 512, 128) \| relative_bias \| torch.bfloat16 \| 135.711 \| 138.639 \| 606.283 \| 1327.974 \| 0.979 \| 0.457 \| \| (16, 16, 512, 128) \| head_bias \| torch.bfloat16 \| 135.552 \| 140.555 \| 607.107 \| 1347.370 \| 0.964 \| 0.451 \| \| (16, 16, 512, 256) \| noop \| torch.bfloat16 \| 275.113 \| 315.144 \| 1301.583 \| 5268.153 \| 0.873 \| 0.247 \| \| (16, 16, 512, 256) \| causal_mask \| torch.bfloat16 \| 274.867 \| 328.106 \| 1302.513 \| 5770.594 \| 0.838 \| 0.226 \| \| (16, 16, 512, 256) \| relative_bias \| torch.bfloat16 \| 276.052 \| 321.770 \| 1302.904 \| 5241.920 \| 0.858 \| 0.249 \| \| (16, 16, 512, 256) \| head_bias \| torch.bfloat16 \| 271.409 \| 328.839 \| 1302.142 \| 5266.037 \| 0.825 \| 0.247 \| \| (16, 16, 1024, 64) \| noop \| torch.bfloat16 \| 260.489 \| 237.463 \| 955.884 \| 1817.558 \| 1.097 \| 0.526 \| \| (16, 16, 1024, 64) \| causal_mask \| torch.bfloat16 \| 262.378 \| 254.350 \| 955.280 \| 1843.807 \| 1.032 \| 0.518 \| \| (16, 16, 1024, 64) \| relative_bias \| torch.bfloat16 \| 261.338 \| 268.253 \| 956.038 \| 1820.036 \| 0.974 \| 0.525 \| \| (16, 16, 1024, 64) \| head_bias \| torch.bfloat16 \| 262.153 \| 264.156 \| 956.023 \| 1810.076 \| 0.992 \| 0.528 \| \| (16, 16, 1024, 128) \| noop \| torch.bfloat16 \| 476.475 \| 461.413 \| 1760.578 \| 4306.521 \| 1.033 \| 0.409 \| \| (16, 16, 
1024, 128) \| causal_mask \| torch.bfloat16 \| 473.794 \| 479.178 \| 1761.277 \| 4619.439 \| 0.989 \| 0.381 \| \| (16, 16, 1024, 128) \| relative_bias \| torch.bfloat16 \| 473.839 \| 463.282 \| 1758.692 \| 4290.562 \| 1.023 \| 0.410 \| \| (16, 16, 1024, 128) \| head_bias \| torch.bfloat16 \| 472.979 \| 472.896 \| 1763.086 \| 4367.931 \| 1.000 \| 0.404 \| \| (16, 16, 1024, 256) \| noop \| torch.bfloat16 \| 1014.184 \| 1026.764 \| 3922.997 \| 19104.147 \| 0.988 \| 0.205 \| \| (16, 16, 1024, 256) \| causal_mask \| torch.bfloat16 \| 1013.217 \| 1039.046 \| 3928.382 \| 21086.281 \| 0.975 \| 0.186 \| \| (16, 16, 1024, 256) \| relative_bias \| torch.bfloat16 \| 1008.519 \| 1015.278 \| 3922.133 \| 18980.652 \| 0.993 \| 0.207 \| \| (16, 16, 1024, 256) \| head_bias \| torch.bfloat16 \| 1011.360 \| 1047.542 \| 3931.245 \| 19069.172 \| 0.965 \| 0.206 \| \| (16, 16, 4096, 64) \| noop \| torch.bfloat16 \| 3929.850 \| 3325.667 \| 11411.704 \| 23344.280 \| 1.182 \| 0.489 \| \| (16, 16, 4096, 64) \| causal_mask \| torch.bfloat16 \| 3885.262 \| 3581.544 \| 11390.515 \| 23725.639 \| 1.085 \| 0.480 \| \| (16, 16, 4096, 64) \| relative_bias \| torch.bfloat16 \| 3865.737 \| 3537.308 \| 11489.901 \| 23406.330 \| 1.093 \| 0.491 \| \| (16, 16, 4096, 64) \| head_bias \| torch.bfloat16 \| 3880.530 \| 3665.249 \| 11484.411 \| 23299.496 \| 1.059 \| 0.493 \| \| (16, 16, 4096, 128) \| noop \| torch.bfloat16 \| 7030.306 \| 6745.715 \| 20621.264 \| 57464.096 \| 1.042 \| 0.359 \| \| (16, 16, 4096, 128) \| causal_mask \| torch.bfloat16 \| 7095.414 \| 7034.385 \| 20410.656 \| 61660.511 \| 1.009 \| 0.331 \| \| (16, 16, 4096, 128) \| relative_bias \| torch.bfloat16 \| 7084.779 \| 6686.497 \| 20315.161 \| 57243.969 \| 1.060 \| 0.355 \| \| (16, 16, 4096, 128) \| head_bias \| torch.bfloat16 \| 7075.367 \| 6863.305 \| 20494.385 \| 58481.953 \| 1.031 \| 0.350 \| \| (16, 16, 4096, 256) \| noop \| torch.bfloat16 \| 15612.741 \| 14297.482 \| 55306.847 \| 281161.865 \| 1.092 \| 0.197 \| \| (16, 16, 4096, 256) 
\| causal_mask \| torch.bfloat16 \| 15326.592 \| 14263.878 \| 55227.806 \| 313063.232 \| 1.075 \| 0.176 \| \| (16, 16, 4096, 256) \| relative_bias \| torch.bfloat16 \| 15297.963 \| 14007.379 \| 54558.029 \| 279529.175 \| 1.092 \| 0.195 \| \| (16, 16, 4096, 256) \| head_bias \| torch.bfloat16 \| 15216.160 \| 14276.027 \| 55081.581 \| 280996.826 \| 1.066 \| 0.196 \| </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125515 Approved by: https://github.com/Chillee	2024-05-17 00:41:55 +00:00
PyTorch MergeBot	337830f657	Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021 )" This reverts commit f060b0c6e608436997a1dc229c82ce26c1e6676f. Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Unfortunately, the new tests are still failing internally ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2116415398))	2024-05-17 00:22:40 +00:00
PyTorch MergeBot	4a5ef0b793	Revert "[inductor][cpp] epilogue support for gemm template (#126019 )" This reverts commit 7844c202b2076ec3efa23264226f3eaef11a6fcb. Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR https://github.com/pytorch/pytorch/pull/124021 is going to be reverted ([comment](https://github.com/pytorch/pytorch/pull/126019#issuecomment-2116408137))	2024-05-17 00:15:00 +00:00
PyTorch MergeBot	59ca0d8c14	Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 )" This reverts commit 927e631dc2356c0cb600dbdf9e8f84ce792a8ba1. Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR https://github.com/pytorch/pytorch/pull/124021 is going to be reverted ([comment](https://github.com/pytorch/pytorch/pull/126019#issuecomment-2116408137))	2024-05-17 00:15:00 +00:00
David Berard	cb3b8cd0d3	Use object identity for deepcopy memo (#126126 ) Copy of #126089, with some additional fixes & tests Partial fix for #125635: previously, the deepcopy implementation would group together any tensors with any aliasing relationship and assign them to the same tensor. This was sort of good if you have two tensors `b = a.detach()`, because then if you deepcopy `list = [a, b]` to `list2 = list.deepcopy()`, then writes to `list2[0]` will also modify `list2[1]`. But for the most part, it's bad; (1) if you have `b = a.as_strided((4, 4), (16, 1), 16)`, then it'll make `b == a` in the deepcopied implementation, which is completely wrong; and (2) even if you have `b = a.detach()`, these are still initially two different tensors which become the same tensor after the old deepcopy implementation. The new implementation only groups together tensors that have the same identity. This is a partial fix, but it's more reasonable. What changes: * (becomes more correct): different views of the same base tensor will no longer all become equal after deepcopying * (still kind of wrong): views won't actually alias each other after deepcopying. * (arguably a minor regression): equivalent views of the same tensor will no longer be copied to the same tensor - so they won't alias. BC breaking: C++ deepcopy interface changes from accepting `IValue::HashAliasedIValueMap memo` to accepting `IValue::HashIdentityIValueMap memo`. If there are objections, we can keep the old API. However, it seems likely that users generally won't try to deepcopy from C++. Differential Revision: [D57406306](https://our.internmc.facebook.com/intern/diff/D57406306) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126126 Approved by: https://github.com/ezyang	2024-05-17 00:06:26 +00:00
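The difference between the two memo strategies described above can be illustrated without PyTorch. This is a minimal sketch, assuming a hypothetical `View` class as a stand-in for a tensor view (its `storage` list plays the role of shared memory); `copy_by_alias` and `copy_by_identity` are illustrative names, not PyTorch APIs.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class View:
    """Hypothetical stand-in for a tensor: a window into shared storage."""
    storage: list
    offset: int
    length: int

    def values(self):
        return self.storage[self.offset:self.offset + self.length]


def copy_by_alias(views):
    """Old behaviour (sketch): every view sharing storage maps to ONE copy."""
    memo = {}  # keyed by id(storage), so all aliases collapse together
    out = []
    for v in views:
        key = id(v.storage)
        if key not in memo:
            memo[key] = View(list(v.storage), v.offset, v.length)
        out.append(memo[key])
    return out


def copy_by_identity(views):
    """New behaviour (sketch): only the very same object is reused."""
    memo = {}  # keyed by id(view)
    out = []
    for v in views:
        key = id(v)
        if key not in memo:
            memo[key] = View(list(v.storage), v.offset, v.length)
        out.append(memo[key])
    return out


base = list(range(8))
a = View(base, 0, 8)
b = View(base, 4, 4)  # a different view of the same storage

old = copy_by_alias([a, b])         # b's copy is silently replaced by a's copy
new = copy_by_identity([a, b, a])   # a and b stay distinct; repeated a is shared
```

In the sketch, `old[1]` is the same object as `old[0]`, which mirrors the "makes `b == a`" problem. With identity keys the copies keep their own offsets, but each copy owns separate storage, which corresponds to the "views won't actually alias each other" limitation listed in the commit message.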
Shuqiang Zhang	55628624b8	[c10d] add pg_name and pg_desc to logger (#126409 ) Summary: This should further improve our debuggability Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/126409 Approved by: https://github.com/XilunWu	2024-05-16 23:56:19 +00:00
Matthias Braun	796dff7147	Import MKL via //third-party/mkl targets (#126371 ) Summary: This is a step towards upgrading the MKL library and using buckified targets rather than importing from TP2. - Add new `//third-party/mkl:mkl_xxx` targets that are currently aliases to `third-party//IntelComposerXE:mkl_xxx`. - Switch usage of `external_deps = [("IntelComposerXE", None, "mkl_xxx")]` to `deps = ["fbsource//third-party/mkl:mkl_xxx"]` Note that this only changes `mkl_xxx` references in `IntelComposerXE`, not references to "svml" or "ipp*". Test Plan: sandcastle Differential Revision: D57360438 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126371 Approved by: https://github.com/bertmaher	2024-05-16 22:51:26 +00:00
Hongyang Zhao	62403b57b9	Add prefix option to CapabilityBasedPartitioner (#126382 ) Summary: Add prefix arg so that users can provide the submodule name to partitioner. Test Plan: https://fburl.com/anp/2kue4qp9 Differential Revision: D57416926 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126382 Approved by: https://github.com/SherlockNoMad	2024-05-16 22:38:07 +00:00
Richard Barnes	c226839f5c	Eliminate some C++11 checks (#126308 ) Test Plan: Sandcastle Reviewed By: palmje Differential Revision: D57246912 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126308 Approved by: https://github.com/Skylion007	2024-05-16 22:37:45 +00:00
William Wen	f17572fcf6	add 3.12 inductor CI tests (#126218 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126218 Approved by: https://github.com/huydhn, https://github.com/desertfire	2024-05-16 22:29:24 +00:00
Simon Fan	93524cf5ff	[compiled autograd] clear compiled_autograd_verbose once test is done (#126148 ) The verbose flag leaks into tests that run afterwards. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126148 Approved by: https://github.com/jansel ghstack dependencies: #126144, #126146	2024-05-16 22:23:02 +00:00
Simon Fan	cef7756c9c	[inductor] Clear cache on ctx manager exit (#126146 ) FIXES https://github.com/pytorch/pytorch/issues/126128. Right now, we only clear the cache on ctx manager enter, so leftover state persists unless we call fresh_inductor_cache again. That is usually fine in tests, but not for the compiled autograd tests when going from TestCompiledAutograd -> TestAutogradWithCompiledAutograd: TestCompiledAutograd uses the ctx manager, but TestAutogradWithCompiledAutograd doesn't. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126146 Approved by: https://github.com/jgong5, https://github.com/oulgen ghstack dependencies: #126144	2024-05-16 22:23:02 +00:00
Simon Fan	4cd4463c1c	[compiled autograd] Fix LoggingTensor flaky test (#126144 ) LoggingTensor fails consistently when the root logger level is INFO or lower. By default, the root logger level should be WARNING, but triton driver initialization overwrites it to INFO, which causes flakiness: https://github.com/pytorch/pytorch/issues/126143 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126144 Approved by: https://github.com/jansel	2024-05-16 22:23:02 +00:00
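One generic way to guard a test against a library that changes the root logger level is to save and restore it around the affected code. The sketch below uses only the standard `logging` module; `preserve_root_level` is a hypothetical helper, and the log does not show how the PR itself fixes the test.

```python
import logging
from contextlib import contextmanager


@contextmanager
def preserve_root_level():
    """Restore the root logger's level after the block, whatever it did."""
    root = logging.getLogger()
    saved = root.level
    try:
        yield root
    finally:
        root.setLevel(saved)


root = logging.getLogger()
before = root.level
with preserve_root_level():
    # Stand-in for a library that reconfigures the root logger as a side effect.
    logging.getLogger().setLevel(logging.INFO)
    during = root.level
restored = root.level
```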
Tarun Karuturi	4b7eee3450	Print export warning only once in capture_pre_autograd (#126403 ) Summary: Missed this in D57163341 Test Plan: CI Differential Revision: D57442088 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126403 Approved by: https://github.com/zhxchen17	2024-05-16 21:55:11 +00:00
Yuanhao Ji	e9719aec30	Fix strict default value in StateDictOptions (#125998 ) Fixes #125992 The default value of the parameter `strict` should be `True`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125998 Approved by: https://github.com/fegin	2024-05-16 21:42:53 +00:00
Will Feng	f5abf28e41	[Traceable FSDP2] Use DTensor.from_local() in _from_local_no_grad when compile (#126346 ) As discussed before, for now Dynamo is not able to support DTensor constructor, and instead we have to use `DTensor.from_local()`. This won't affect eager and it's a compile-only change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126346 Approved by: https://github.com/awgu	2024-05-16 21:37:00 +00:00
Tobias Ringwald	4f1a56cd42	Switched from parameter in can_cast to from_. (#126030 ) Fixes #126012. `from` is a reserved keyword in Python, thus we can't make the C++ impl available with `from` as function parameter. This PR changes the name to `from_` and also adjusts the docs. If we want to preserve backwards compatibility, we can leave the C++ name as-is and only fix the docs. However, `torch.can_cast(from_=torch.int, to=torch.int)` won't work then. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126030 Approved by: https://github.com/albanD	2024-05-16 20:58:24 +00:00
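The keyword clash can be checked with the standard library alone. In this sketch, `can_cast` is a hypothetical stand-in that only mirrors the renamed parameter; it is not `torch.can_cast`.

```python
import keyword


def can_cast(from_, to):
    """Hypothetical stand-in with the renamed parameter; not torch.can_cast."""
    return (from_, to)


# `from` is a reserved keyword, so it can never be used as a keyword argument.
is_reserved = keyword.iskeyword("from")

try:
    compile("can_cast(from=1, to=2)", "<example>", "eval")
    call_compiles = True
except SyntaxError:
    call_compiles = False

# The trailing-underscore spelling is an ordinary identifier.
result = can_cast(from_="int32", to="int64")
```

The call with `from=` is rejected by the parser before any function is looked up, which is why a Python binding cannot expose the parameter under that name.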
Edward Z. Yang	82c66bc41a	Make 'pytest test/inductor/test_memory_planning.py' work (#126397 ) There's still another naughty direct test_* import, I'm out of patience right now though. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126397 Approved by: https://github.com/peterbell10, https://github.com/int3	2024-05-16 20:28:20 +00:00
Edward Z. Yang	866ca4630c	Don't install inplace_methods on MockHandler, not needed (#126398 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126398 Approved by: https://github.com/jansel, https://github.com/peterbell10	2024-05-16 20:28:05 +00:00
Dmitry Rogozhkin	8f0c207e18	xpu: implement xpu serialization (#125530 ) Fixes: #125529 BC-breaking note: The deprecated "async" argument to the Storage.cuda and Storage.hpu has been removed. Use non_blocking instead. CC: @jbschlosser, @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/125530 Approved by: https://github.com/guangyey, https://github.com/albanD	2024-05-16 20:22:17 +00:00
Yanbo Liang	da9bf77f0a	[Dynamo] Support SET_UPDATE (#126243 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/126243 Approved by: https://github.com/anijain2305, https://github.com/Skylion007, https://github.com/jansel	2024-05-16 20:05:34 +00:00
Jiashen Cao	aab448e381	Remove redundant serialization code (#126249 ) After https://github.com/pytorch/pytorch/pull/123308, we no longer need separate serialization path to handle different types that exist in the `nn_module` metadata. This PR cleans up the redundant code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126249 Approved by: https://github.com/angelayi	2024-05-16 19:22:20 +00:00
Gustav Larsson	5862521ad1	[onnx.export] Cache SetGraphInputTypeReliable (#124912 ) This PR is part of an effort to speed up torch.onnx.export (https://github.com/pytorch/pytorch/issues/121422). - For each node that is processed in onnx.export, a check is run to see if all inputs are "reliable" (static shape, etc.). This value does not change, so it is much faster to cache it on the first computation. The caching is added to the ConstantMap state. - Resolves (6) in #121422. - Also see #123028 with a similar addition of a cache state. (partial fix of #121545) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124912 Approved by: https://github.com/justinchuby	2024-05-16 18:48:56 +00:00
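The caching idea can be sketched in a few lines of Python. Everything here is hypothetical: `ReliabilityCache`, `has_static_shape`, and the use of `None` for an unknown dimension are assumptions for illustration, while the real change lives in the C++ `ConstantMap` state.

```python
class ReliabilityCache:
    """Hypothetical sketch: memoise a per-node check whose answer never changes."""

    def __init__(self, check):
        self._check = check   # the expensive per-input predicate
        self._results = {}    # node name -> cached bool
        self.misses = 0       # how many times the check actually ran

    def all_inputs_reliable(self, node_name, inputs):
        if node_name not in self._results:
            self.misses += 1
            self._results[node_name] = all(self._check(i) for i in inputs)
        return self._results[node_name]


def has_static_shape(shape):
    # Assumption for illustration: "reliable" means no unknown (None) dims.
    return all(dim is not None for dim in shape)


cache = ReliabilityCache(has_static_shape)
first = cache.all_inputs_reliable("n0", [(2, 3), (3, 4)])
second = cache.all_inputs_reliable("n0", [(2, 3), (3, 4)])  # served from cache
dynamic = cache.all_inputs_reliable("n1", [(None, 3)])
```

This kind of cache is only safe because, as the commit message notes, the value does not change once computed. A check whose result could change would need invalidation.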
Chien-Chin Huang	a0429c01ad	[BE][FSDP] Remove unnecessary warnings (#126365 ) As title Differential Revision: [D57419704](https://our.internmc.facebook.com/intern/diff/D57419704/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126365 Approved by: https://github.com/awgu, https://github.com/Skylion007 ghstack dependencies: #126362	2024-05-16 17:34:01 +00:00
Chien-Chin Huang	0dd53650dd	[BE][FSDP] Change the logging level to info (#126362 ) As title Differential Revision: [D57419445](https://our.internmc.facebook.com/intern/diff/D57419445/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126362 Approved by: https://github.com/awgu, https://github.com/Skylion007	2024-05-16 17:31:06 +00:00
Bin Bao	9fbf2696d7	[AOTI][refactor] Add aoti_torch_item as a util function (#126352 ) Summary: The logic has been repeated several times in the code, so it's worth writing a common util function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126352 Approved by: https://github.com/chenyang78 ghstack dependencies: #126181, #126182, #126183	2024-05-16 17:07:06 +00:00
Bin Bao	0332b5812e	[AOTI] Support InplaceBernoulliFallback in the ABI-compatible codegen (#126183 ) Summary: Update the torchgen rule for inplace ops like bernoulli_, and update InplaceBernoulliFallback to codegen in the ABI-compatible mode. Fixes https://github.com/pytorch/pytorch/issues/121809 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126183 Approved by: https://github.com/angelayi ghstack dependencies: #126181, #126182	2024-05-16 17:07:06 +00:00
Bin Bao	5792bc3c3e	[AOTI] Refactor some fallback op util functions (#126182 ) Summary: Move some util functions for cpp kernel naming and missing arg filling from FallbackKernel to ExternKernel, since they are useful for ExternKernel in general. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126182 Approved by: https://github.com/chenyang78 ghstack dependencies: #126181	2024-05-16 17:07:00 +00:00
Bin Bao	c5f926ab87	[AOTI][torchgen] Support at::Generator via C shim (#126181 ) Summary: Support at::Generator which is used by many random number generator ops Pull Request resolved: https://github.com/pytorch/pytorch/pull/126181 Approved by: https://github.com/chenyang78	2024-05-16 17:06:53 +00:00
Jithun Nair	a55d63659a	Add 2nd shard to ROCm trunk workflow for core distributed UTs (#121716 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121716 Approved by: https://github.com/ezyang, https://github.com/huydhn	2024-05-16 16:50:02 +00:00
Andres Lugo-Reyes	f155ed6bf2	[ROCm] amax hipblaslt integration (#125921 ) AMAX is coming as part of rocm6.2. This code adds that functionality Pull Request resolved: https://github.com/pytorch/pytorch/pull/125921 Approved by: https://github.com/eqy, https://github.com/lezcano	2024-05-16 16:40:31 +00:00
Jithun Nair	14d8e3aec0	Add distributed/_tensor/test_attention to ROCM_BLOCKLIST (#126336 ) Fixes #125504 Fixes #126252 Fixes #126296 Fixes #126330 This PR doesn't really fix the RingAttentionTest tests for ROCm, but explicitly adds the whole test file to ROCM_BLOCKLIST to get a clean signal on ROCm distributed CI. We will enable these tests in a follow-up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126336 Approved by: https://github.com/huydhn, https://github.com/pruthvistony	2024-05-16 16:38:09 +00:00
Nikita Shulga	91bf952d10	Fix aarch64 debug build with GCC (#126290 ) By working around GCC's quirks in instantiating templates that require immediate values. Provide an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`). Fixes https://github.com/pytorch/pytorch/issues/126283 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126290 Approved by: https://github.com/atalman, https://github.com/seemethere	2024-05-16 13:41:45 +00:00
Andrew Gu	ab07867084	[FSDP2] Supported `set_all_reduce_gradients=False` for HSDP (#126166 ) Context For FSDP, gradient accumulation across microbatches has two flavors: (1) reduce-scatter or (2) no reduce-scatter. (1) incurs the collective per microbatch backward but saves gradient memory (storing the sharded gradients), while (2) avoids the communication but uses more gradient memory (storing the unsharded gradients). - FSDP2 offers (1) without any intervention. The user should simply make sure to run the optimizer step after `K` microbatches for `K > 1`. - FSDP2 offers (2) via `module.set_requires_gradient_sync()` (e.g. `module.set_requires_gradient_sync(is_last_microbatch)`). For HSDP, since we reduce-scatter and then all-reduce, we have additional flexibility and get three flavors: (1) reduce-scatter and all-reduce, (2) reduce-scatter but no all-reduce, and (3) no reduce-scatter and no all-reduce. This PR adds support for (2). - FSDP2 offers (1) without any intervention like mentioned above. - FSDP2 offers (3) via `module.set_requires_gradient_sync()` like mentioned above. - FSDP2 offers (2) via `module.set_requires_all_reduce()` similar to `set_requires_gradient_sync()`. Overview For HSDP, to reduce-scatter but not all-reduce during gradient accumulation, the user can do something like:
```
for microbatch_idx, microbatch in enumerate(microbatches):
    is_last_microbatch = microbatch_idx == len(microbatches) - 1
    model.set_requires_all_reduce(is_last_microbatch)
    # Run forward/backward
```
This PR also makes the minor change of making the `recurse: bool` argument in these setter methods keyword-only. Developer Notes We choose to implement this by saving the partial reduce output to the `FSDPParamGroup` for simplicity, where we assume that the set of parameters that receive gradients does not change across microbatches. An alternative would be to view into the partial reduce output per parameter and save the view to each parameter. 
We prefer to avoid this alternative for now because it introduces more complexity to do extra viewing when saving the partial reduce output to each parameter, accumulating into them, and accumulating back to the last microbatch's reduce output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126166 Approved by: https://github.com/weifengpy, https://github.com/wanchaol ghstack dependencies: #126067, #126070, #126161	2024-05-16 12:29:22 +00:00
Xia, Weiwen	c2f8c75129	[Reopen] Upgrade submodule oneDNN to v3.4.2 (#126137 ) Reopen of https://github.com/pytorch/pytorch/pull/122472 ## Improvements This upgrade fixes the following issues: - https://github.com/pytorch/pytorch/issues/120982 This upgrade brings the following new features: - Introduced memory descriptor serialization API. This API is needed to support freezing on CPU in AOTInductor (https://github.com/pytorch/pytorch/issues/114450) ## Validation results on CPU Original results with oneDNN v3.4.1 are here: https://github.com/pytorch/pytorch/pull/122472#issue-2201602846 Need to rerun validation and update results. Co-authored-by: Sunita Nadampalli <nadampal@amazon.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126137 Approved by: https://github.com/jgong5, https://github.com/snadampal, https://github.com/atalman	2024-05-16 12:00:16 +00:00
yuanx749	691af57fbc	Fix broken link of scikit-learn (#120972 ) The link is broken in https://pytorch.org/docs/main/community/design.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/120972 Approved by: https://github.com/Skylion007	2024-05-16 11:46:34 +00:00
Will Feng	4333e122d4	[Traceable FSDP2] Add all_gather_into_tensor out variant (#126334 ) This PR adds `torch.ops._c10d_functional.all_gather_into_tensor_out`. It's important for tracing FSDP2, because FSDP2 pre-allocates the output buffer of AllGather, and makes input buffer an alias of the output buffer, and expects both of them to be used to achieve lower memory usage. If we don't preserve this behavior and instead functionalize the AllGather op, AllGather op will then create a brand-new output buffer (instead of reusing), thus significantly increasing the memory usage. The expectation is that we will "re-inplace" the AllGather op by switching to the out variant in Inductor post-grad stage via an FX pass, so this API is not expected to be directly used by users. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126334 Approved by: https://github.com/yifuwang, https://github.com/wanchaol	2024-05-16 10:27:06 +00:00
Huy Do	d61a81a9e7	Fix lint failures coming from #126035 (#126378 ) MYPY somehow shows lots of local failures for me. The issue is tracked in https://github.com/pytorch/pytorch/issues/126361. This is only to keep trunk sane. These two lines were added by #126035 as an attempt to fix lint there, but didn't seem to help. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126378 Approved by: https://github.com/kit1980	2024-05-16 06:05:47 +00:00
PyTorch MergeBot	0716f75cfb	Revert "Add Lowering for FlexAttention Backwards (#125515 )" This reverts commit 95b9e981c3ab68fc17f78b8a6bbfd9569745ae4c. Reverted https://github.com/pytorch/pytorch/pull/125515 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the newly added test runs out of memory `95b9e981c3` ([comment](https://github.com/pytorch/pytorch/pull/125515#issuecomment-2114084869))	2024-05-16 05:52:13 +00:00
PyTorch MergeBot	cdcba4dee5	Revert "Fix lint failures coming from #126035 (#126378 )" This reverts commit 5fa1f4c6e46d92482d99614c06b6e288cc8d6c8d. Reverted https://github.com/pytorch/pytorch/pull/126378 on behalf of https://github.com/huydhn due to Trying to add yet another lint fix from https://hud.pytorch.org/pr/pytorch/pytorch/126357 and will reland this ([comment](https://github.com/pytorch/pytorch/pull/126378#issuecomment-2114060547))	2024-05-16 05:32:19 +00:00
Yu, Guangye	58378f1224	[Doc] Add deprecated autocast comments for doc (#126062 ) # Motivation We generalized `torch.amp.autocast` into a device-agnostic API in [#125103](https://github.com/pytorch/pytorch/pull/125103). After that, - `torch.cpu.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cpu', args...)`, and - `torch.cuda.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cuda', args...)` in both eager mode and JIT mode. Based on this, we would like to deprecate `torch.cpu.amp.autocast` and `torch.cuda.amp.autocast` and strongly recommend that developers use the device-agnostic `torch.amp.autocast` instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126062 Approved by: https://github.com/eqy, https://github.com/albanD	2024-05-16 05:26:43 +00:00
Wang, Eikan	08aa704d0c	[1/N] Non-Tensor: Scalar Support: Enable aot compile to support aten operations with scalar input like alpha (#124177 ) Some operations have a scalar input parameter, like `torch.add(a, b, alpha=2.0)`. Currently, the aot compile does not support such a case because it requires the signature of the captured graph to align with the operation's signature. This means that some inputs in the captured graph may be scalars (float, int, bool, etc.). That breaks the assumption of `compile_fx_aot`, which assumes all the example inputs are tensors - `0f6ce45bcb/torch/_inductor/compile_fx.py (L1048)` This PR intends to support such cases by allowing non-aligned signatures and filtering out the non-Tensor parameters. Captured graph for `torch.add(a, b, alpha=2.0)`
```
opcode         name      target           args              kwargs
-------------  --------  ---------------  ----------------  --------------
placeholder    arg0_1    arg0_1           ()                {}
placeholder    arg1_1    arg1_1           ()                {}
call_function  add       aten.add.Tensor  (arg0_1, arg1_1)  {'alpha': 2.0}
output         output_1  output           ((add,),)         {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124177 Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/jgong5	2024-05-16 05:15:55 +00:00
Huy Do	5fa1f4c6e4	Fix lint failures coming from #126035 (#126378 ) MYPY somehow shows lots of local failures for me. The issue is tracked in https://github.com/pytorch/pytorch/issues/126361. This is only to keep trunk sane. These two lines were added by #126035 as an attempt to fix lint there, but didn't seem to help. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126378 Approved by: https://github.com/kit1980	2024-05-16 05:12:27 +00:00
Valeriu	e661a42428	[Add sliding window attention bias] (#126061 ) Summary: This PR implements sliding window and updates "aten._flash_attention_forward/_flash_attention_backward" to expose the window_size_left and window_size_right arguments. With these kwargs added we can dispatch to the FAv2 impl if the necessary constraints are met. These arguments will eventually be provided to "aten.sdpa_flash" but for now they are needed when called by xformers in their effort to directly use the PyTorch FAv2 impl instead of building their own. Test Plan: Use the default aten.sdpa_flash tests since we've added optional arguments set to the previous default value: -1, /window_size_left/ Using buck2 build --flagfile fbcode//mode/dev-nosan fbcode//caffe2/caffe2/fb/predictor/tests:inference_context_test Differential Revision: D56938087 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126061 Approved by: https://github.com/drisspg, https://github.com/desertfire	2024-05-16 04:50:47 +00:00
Ivan Zaitsev	8dc6f455bd	[ez] fix exported diff mismatch (#126357 ) Fixes the following issue: D55803461 differs from the exported PR: #123658 ⚠️ this PR needs to be skipped on diff train! Pull Request resolved: https://github.com/pytorch/pytorch/pull/126357 Approved by: https://github.com/huydhn, https://github.com/fegin	2024-05-16 04:49:48 +00:00
Edward Z. Yang	6e6e44bdcc	Generate runtime asserts when propagate real tensor is used (#126287 ) This means that propagate real tensor is no longer unsound: if the route we took at compile time diverges with runtime, you will get a runtime assert. Also add structured trace logs for these. Also fix bug where xreplace with int range is not guaranteed to return a sympy expression. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126287 Approved by: https://github.com/Skylion007	2024-05-16 04:45:57 +00:00
Shuqiang Zhang	c860df5a9d	[c10d] Add an option for NAN check on every collective (#125726 ) Summary: The NAN CHECK is done through device side assert without copying needed from GPU to CPU Test Plan: Unit test for collectives that should experience run time error (sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (38f5143e)]$ python test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_nan_assert /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)` failed. [rank0]:[E507 17:31:56.885473996 Utils.cu:30] CUDA error during checkForNan: device-side assert triggered /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)` failed. 
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)` failed. [rank1]:[E507 17:31:56.128961534 Utils.cu:30] CUDA error during checkForNan: device-side assert triggered . ---------------------------------------------------------------------- Ran 1 test in 7.723s OK Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/125726 Approved by: https://github.com/kwen2501	2024-05-16 04:35:15 +00:00
Isuru Fernando	0214711f05	Add mode to MemoryDep to track atomic accumulates (#123223 ) And allow fusion of buffers where writes are only atomic accumulates. This allows fusing of ops like _unsafe_index_put(_unsafe_index_put(a, ...), ...) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123223 Approved by: https://github.com/peterbell10	2024-05-16 04:34:09 +00:00
Wanchao Liang	d0dfcd2c34	fix the device type for with_comms decorator (#125798 ) found by @yifuwang, it looks like we are wrongly using self.device_type="cuda" for gloo backend, which is triggering some flakiness. i.e. https://github.com/pytorch/pytorch/issues/125366 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125798 Approved by: https://github.com/yifuwang	2024-05-16 03:40:19 +00:00
Animesh Jain	bcc8d25e47	[dynamo] Delete extra testing of cpp guard manager (#126343 ) CPP guard manager has been on for a few weeks now. This separate testing was part of phasing when the cpp guard manager was not enabled. Now this is not needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126343 Approved by: https://github.com/williamwen42 ghstack dependencies: #126303, #126316, #126314, #126327	2024-05-16 03:30:38 +00:00
drisspg	95b9e981c3	Add Lowering for FlexAttention Backwards (#125515 ) # Summary #### What does this PR do? It enables Inductor to actually generate the fused flex attention kernel for the backwards. I did some other things along the way: - Abstract out the 'build_subgraph_buffer' subroutine and make it reusable between flex attention and flex_attention backwards. In total we need to build 3 subgraphs for fwd + bwd. 1 for the fwd graph and then 2 in the bwd. The FAv2 algorithm recomputes the parts of the forward (more efficiently since we already have the row_max via logsumexp), therefore we need to inline both the fwd graph and the joint graph in the bwds kernel. - The version of the backwards kernel is from a somewhat older version of the triton tutorial implementation. I think that we should update in a follow up to a newer version. Notably the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization. - I didn't correctly register the decomp table + IndexMode when I landed: https://github.com/pytorch/pytorch/pull/123902, this remedies that. - The rel_bias helper func was reversed in terms of causality. I updated it and then added a test specific for "future causal" attention. - The main point that I think still needs to be worked out is the store_output call. 
I have it hacked up to be 'fake' but I dont think we want to land that and likely want to just have a mutated 'dq' and a stored_output 'dk' - I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications) - I updated the benchmark to also profile bwds performance ### Benchmark Numbers: _The current implementation is not parallelizing over ctx length in the bwd_ FWD Speedups \| Type \| Speedup \| shape \| score_mod \| dtype \| \|---------\|-----------\|--------------------\|-------------\|----------------\| \| Average \| 0.991 \| \| \| \| \| Max \| 1.182 \| (16, 16, 4096, 64) \| noop \| torch.bfloat16 \| \| Min \| 0.796 \| (2, 16, 512, 256) \| head_bias \| torch.bfloat16 \| BWD Speedups \| Type \| Speedup \| shape \| score_mod \| dtype \| \|---------\|-----------\|--------------------\|-------------\|----------------\| \| Average \| 0.291 \| \| \| \| \| Max \| 0.652 \| (8, 16, 512, 64) \| head_bias \| torch.bfloat16 \| \| Min \| 0.073 \| (2, 16, 4096, 128) \| head_bias \| torch.bfloat16 \| <details> <summary>Full Data</summary> \| shape \| score_mod \| dtype \| fwd_eager_time \| fwd_compiled_time \| bwd_eager_time \| bwd_compiled_time \| fwd_speedup \| bwd_speedup \| \|---------------------\|---------------\|----------------\|------------------\|---------------------\|------------------\|---------------------\|---------------\|---------------\| \| (2, 16, 512, 64) \| noop \| torch.bfloat16 \| 19.936 \| 19.092 \| 57.851 \| 193.564 \| 1.044 \| 0.299 \| \| (2, 16, 512, 64) \| causal_mask \| torch.bfloat16 \| 19.955 \| 19.497 \| 57.662 \| 206.278 \| 1.024 \| 0.280 \| \| (2, 16, 512, 64) \| relative_bias \| torch.bfloat16 \| 19.455 \| 21.297 \| 57.674 \| 195.219 \| 0.913 \| 0.295 \| \| (2, 16, 512, 64) \| head_bias \| torch.bfloat16 \| 19.958 \| 21.289 \| 57.674 \| 193.859 \| 0.938 \| 0.298 \| \| (2, 16, 512, 128) \| noop \| torch.bfloat16 \| 28.157 \| 28.615 \| 82.831 \| 454.211 \| 0.984 \| 0.182 \| \| (2, 16, 512, 128) \| 
causal_mask \| torch.bfloat16 \| 28.154 \| 28.444 \| 83.091 \| 432.083 \| 0.990 \| 0.192 \| \| (2, 16, 512, 128) \| relative_bias \| torch.bfloat16 \| 28.722 \| 27.897 \| 83.175 \| 446.789 \| 1.030 \| 0.186 \| \| (2, 16, 512, 128) \| head_bias \| torch.bfloat16 \| 28.299 \| 27.673 \| 83.052 \| 459.179 \| 1.023 \| 0.181 \| \| (2, 16, 512, 256) \| noop \| torch.bfloat16 \| 41.167 \| 50.504 \| 175.019 \| 1083.545 \| 0.815 \| 0.162 \| \| (2, 16, 512, 256) \| causal_mask \| torch.bfloat16 \| 41.656 \| 51.933 \| 175.078 \| 1171.176 \| 0.802 \| 0.149 \| \| (2, 16, 512, 256) \| relative_bias \| torch.bfloat16 \| 41.697 \| 50.722 \| 175.159 \| 1097.312 \| 0.822 \| 0.160 \| \| (2, 16, 512, 256) \| head_bias \| torch.bfloat16 \| 41.690 \| 52.387 \| 175.184 \| 1097.336 \| 0.796 \| 0.160 \| \| (2, 16, 1024, 64) \| noop \| torch.bfloat16 \| 39.232 \| 37.454 \| 127.847 \| 612.430 \| 1.047 \| 0.209 \| \| (2, 16, 1024, 64) \| causal_mask \| torch.bfloat16 \| 39.930 \| 39.599 \| 127.755 \| 665.359 \| 1.008 \| 0.192 \| \| (2, 16, 1024, 64) \| relative_bias \| torch.bfloat16 \| 39.417 \| 41.304 \| 127.902 \| 614.990 \| 0.954 \| 0.208 \| \| (2, 16, 1024, 64) \| head_bias \| torch.bfloat16 \| 39.965 \| 42.034 \| 127.953 \| 613.273 \| 0.951 \| 0.209 \| \| (2, 16, 1024, 128) \| noop \| torch.bfloat16 \| 63.964 \| 71.024 \| 226.510 \| 1637.669 \| 0.901 \| 0.138 \| \| (2, 16, 1024, 128) \| causal_mask \| torch.bfloat16 \| 63.843 \| 72.451 \| 226.750 \| 1558.949 \| 0.881 \| 0.145 \| \| (2, 16, 1024, 128) \| relative_bias \| torch.bfloat16 \| 64.301 \| 70.487 \| 226.651 \| 1610.063 \| 0.912 \| 0.141 \| \| (2, 16, 1024, 128) \| head_bias \| torch.bfloat16 \| 64.033 \| 71.394 \| 226.676 \| 1668.511 \| 0.897 \| 0.136 \| \| (2, 16, 1024, 256) \| noop \| torch.bfloat16 \| 129.348 \| 141.390 \| 507.337 \| 4405.175 \| 0.915 \| 0.115 \| \| (2, 16, 1024, 256) \| causal_mask \| torch.bfloat16 \| 129.538 \| 145.680 \| 507.178 \| 4768.874 \| 0.889 \| 0.106 \| \| (2, 16, 1024, 256) \| relative_bias \| 
torch.bfloat16 \| 129.438 \| 142.782 \| 507.004 \| 4401.002 \| 0.907 \| 0.115 \| \| (2, 16, 1024, 256) \| head_bias \| torch.bfloat16 \| 129.058 \| 146.242 \| 507.547 \| 4434.251 \| 0.883 \| 0.114 \| \| (2, 16, 4096, 64) \| noop \| torch.bfloat16 \| 481.606 \| 409.120 \| 1440.890 \| 14147.269 \| 1.177 \| 0.102 \| \| (2, 16, 4096, 64) \| causal_mask \| torch.bfloat16 \| 480.227 \| 438.847 \| 1434.419 \| 14973.386 \| 1.094 \| 0.096 \| \| (2, 16, 4096, 64) \| relative_bias \| torch.bfloat16 \| 480.831 \| 458.104 \| 1432.935 \| 14193.253 \| 1.050 \| 0.101 \| \| (2, 16, 4096, 64) \| head_bias \| torch.bfloat16 \| 480.749 \| 452.497 \| 1437.040 \| 14084.869 \| 1.062 \| 0.102 \| \| (2, 16, 4096, 128) \| noop \| torch.bfloat16 \| 872.534 \| 848.275 \| 2600.895 \| 35156.849 \| 1.029 \| 0.074 \| \| (2, 16, 4096, 128) \| causal_mask \| torch.bfloat16 \| 872.647 \| 868.279 \| 2587.581 \| 31919.531 \| 1.005 \| 0.081 \| \| (2, 16, 4096, 128) \| relative_bias \| torch.bfloat16 \| 871.484 \| 827.644 \| 2593.989 \| 34805.634 \| 1.053 \| 0.075 \| \| (2, 16, 4096, 128) \| head_bias \| torch.bfloat16 \| 871.422 \| 856.437 \| 2602.482 \| 35708.591 \| 1.017 \| 0.073 \| \| (2, 16, 4096, 256) \| noop \| torch.bfloat16 \| 1904.497 \| 1758.183 \| 6122.416 \| 66754.593 \| 1.083 \| 0.092 \| \| (2, 16, 4096, 256) \| causal_mask \| torch.bfloat16 \| 1911.174 \| 1762.821 \| 6113.207 \| 72759.392 \| 1.084 \| 0.084 \| \| (2, 16, 4096, 256) \| relative_bias \| torch.bfloat16 \| 1911.254 \| 1727.108 \| 6123.530 \| 66577.988 \| 1.107 \| 0.092 \| \| (2, 16, 4096, 256) \| head_bias \| torch.bfloat16 \| 1916.977 \| 1801.804 \| 6118.158 \| 67359.680 \| 1.064 \| 0.091 \| \| (8, 16, 512, 64) \| noop \| torch.bfloat16 \| 44.984 \| 43.974 \| 170.276 \| 262.259 \| 1.023 \| 0.649 \| \| (8, 16, 512, 64) \| causal_mask \| torch.bfloat16 \| 45.001 \| 46.265 \| 170.509 \| 274.893 \| 0.973 \| 0.620 \| \| (8, 16, 512, 64) \| relative_bias \| torch.bfloat16 \| 45.466 \| 48.211 \| 170.606 \| 262.759 \| 0.943 \| 0.649 
\| \| (8, 16, 512, 64) \| head_bias \| torch.bfloat16 \| 45.481 \| 48.435 \| 170.267 \| 261.265 \| 0.939 \| 0.652 \| \| (8, 16, 512, 128) \| noop \| torch.bfloat16 \| 72.565 \| 74.736 \| 313.220 \| 773.126 \| 0.971 \| 0.405 \| \| (8, 16, 512, 128) \| causal_mask \| torch.bfloat16 \| 72.015 \| 75.755 \| 313.311 \| 775.513 \| 0.951 \| 0.404 \| \| (8, 16, 512, 128) \| relative_bias \| torch.bfloat16 \| 72.105 \| 74.189 \| 313.806 \| 769.238 \| 0.972 \| 0.408 \| \| (8, 16, 512, 128) \| head_bias \| torch.bfloat16 \| 72.005 \| 74.364 \| 313.509 \| 775.237 \| 0.968 \| 0.404 \| \| (8, 16, 512, 256) \| noop \| torch.bfloat16 \| 138.656 \| 165.453 \| 663.707 \| 2672.067 \| 0.838 \| 0.248 \| \| (8, 16, 512, 256) \| causal_mask \| torch.bfloat16 \| 139.096 \| 172.613 \| 663.593 \| 2926.538 \| 0.806 \| 0.227 \| \| (8, 16, 512, 256) \| relative_bias \| torch.bfloat16 \| 139.500 \| 168.417 \| 663.938 \| 2658.629 \| 0.828 \| 0.250 \| \| (8, 16, 512, 256) \| head_bias \| torch.bfloat16 \| 139.776 \| 173.549 \| 662.920 \| 2667.266 \| 0.805 \| 0.249 \| \| (8, 16, 1024, 64) \| noop \| torch.bfloat16 \| 134.883 \| 125.004 \| 484.706 \| 1195.254 \| 1.079 \| 0.406 \| \| (8, 16, 1024, 64) \| causal_mask \| torch.bfloat16 \| 134.297 \| 132.875 \| 485.420 \| 1234.953 \| 1.011 \| 0.393 \| \| (8, 16, 1024, 64) \| relative_bias \| torch.bfloat16 \| 134.839 \| 139.231 \| 485.470 \| 1198.556 \| 0.968 \| 0.405 \| \| (8, 16, 1024, 64) \| head_bias \| torch.bfloat16 \| 133.822 \| 136.449 \| 485.608 \| 1189.198 \| 0.981 \| 0.408 \| \| (8, 16, 1024, 128) \| noop \| torch.bfloat16 \| 235.470 \| 234.765 \| 886.094 \| 2662.944 \| 1.003 \| 0.333 \| \| (8, 16, 1024, 128) \| causal_mask \| torch.bfloat16 \| 236.305 \| 241.382 \| 886.293 \| 2646.984 \| 0.979 \| 0.335 \| \| (8, 16, 1024, 128) \| relative_bias \| torch.bfloat16 \| 236.414 \| 233.980 \| 885.250 \| 2642.178 \| 1.010 \| 0.335 \| \| (8, 16, 1024, 128) \| head_bias \| torch.bfloat16 \| 237.176 \| 239.040 \| 885.754 \| 2665.242 \| 0.992 \| 0.332 
\| \| (8, 16, 1024, 256) \| noop \| torch.bfloat16 \| 504.445 \| 517.855 \| 1978.956 \| 9592.906 \| 0.974 \| 0.206 \| \| (8, 16, 1024, 256) \| causal_mask \| torch.bfloat16 \| 502.428 \| 536.002 \| 1978.611 \| 10607.342 \| 0.937 \| 0.187 \| \| (8, 16, 1024, 256) \| relative_bias \| torch.bfloat16 \| 503.396 \| 523.960 \| 1977.993 \| 9539.284 \| 0.961 \| 0.207 \| \| (8, 16, 1024, 256) \| head_bias \| torch.bfloat16 \| 503.818 \| 536.014 \| 1980.131 \| 9576.262 \| 0.940 \| 0.207 \| \| (8, 16, 4096, 64) \| noop \| torch.bfloat16 \| 1970.139 \| 1674.930 \| 5750.940 \| 16724.134 \| 1.176 \| 0.344 \| \| (8, 16, 4096, 64) \| causal_mask \| torch.bfloat16 \| 1959.036 \| 1775.056 \| 5780.512 \| 17390.350 \| 1.104 \| 0.332 \| \| (8, 16, 4096, 64) \| relative_bias \| torch.bfloat16 \| 1947.198 \| 1773.869 \| 5780.643 \| 16779.699 \| 1.098 \| 0.345 \| \| (8, 16, 4096, 64) \| head_bias \| torch.bfloat16 \| 1963.935 \| 1829.502 \| 5780.018 \| 16703.259 \| 1.073 \| 0.346 \| \| (8, 16, 4096, 128) \| noop \| torch.bfloat16 \| 3582.711 \| 3362.623 \| 10436.069 \| 36415.565 \| 1.065 \| 0.287 \| \| (8, 16, 4096, 128) \| causal_mask \| torch.bfloat16 \| 3581.504 \| 3499.472 \| 10346.869 \| 36164.959 \| 1.023 \| 0.286 \| \| (8, 16, 4096, 128) \| relative_bias \| torch.bfloat16 \| 3589.779 \| 3337.849 \| 10529.621 \| 36261.696 \| 1.075 \| 0.290 \| \| (8, 16, 4096, 128) \| head_bias \| torch.bfloat16 \| 3602.265 \| 3436.444 \| 10458.660 \| 36507.790 \| 1.048 \| 0.286 \| \| (8, 16, 4096, 256) \| noop \| torch.bfloat16 \| 7695.923 \| 7126.275 \| 24643.009 \| 140949.081 \| 1.080 \| 0.175 \| \| (8, 16, 4096, 256) \| causal_mask \| torch.bfloat16 \| 7679.939 \| 7186.252 \| 24538.105 \| 157156.067 \| 1.069 \| 0.156 \| \| (8, 16, 4096, 256) \| relative_bias \| torch.bfloat16 \| 7681.374 \| 6994.832 \| 24549.713 \| 140077.179 \| 1.098 \| 0.175 \| \| (8, 16, 4096, 256) \| head_bias \| torch.bfloat16 \| 7679.822 \| 7212.278 \| 24627.823 \| 140675.003 \| 1.065 \| 0.175 \| \| (16, 16, 512, 64) \| 
noop \| torch.bfloat16 \| 80.126 \| 78.291 \| 333.719 \| 541.165 \| 1.023 \| 0.617 \| \| (16, 16, 512, 64) \| causal_mask \| torch.bfloat16 \| 80.065 \| 81.696 \| 333.779 \| 551.113 \| 0.980 \| 0.606 \| \| (16, 16, 512, 64) \| relative_bias \| torch.bfloat16 \| 80.138 \| 86.715 \| 333.364 \| 542.118 \| 0.924 \| 0.615 \| \| (16, 16, 512, 64) \| head_bias \| torch.bfloat16 \| 80.415 \| 85.204 \| 333.294 \| 536.840 \| 0.944 \| 0.621 \| \| (16, 16, 512, 128) \| noop \| torch.bfloat16 \| 134.964 \| 138.025 \| 607.093 \| 1333.102 \| 0.978 \| 0.455 \| \| (16, 16, 512, 128) \| causal_mask \| torch.bfloat16 \| 134.192 \| 141.523 \| 606.269 \| 1424.318 \| 0.948 \| 0.426 \| \| (16, 16, 512, 128) \| relative_bias \| torch.bfloat16 \| 135.711 \| 138.639 \| 606.283 \| 1327.974 \| 0.979 \| 0.457 \| \| (16, 16, 512, 128) \| head_bias \| torch.bfloat16 \| 135.552 \| 140.555 \| 607.107 \| 1347.370 \| 0.964 \| 0.451 \| \| (16, 16, 512, 256) \| noop \| torch.bfloat16 \| 275.113 \| 315.144 \| 1301.583 \| 5268.153 \| 0.873 \| 0.247 \| \| (16, 16, 512, 256) \| causal_mask \| torch.bfloat16 \| 274.867 \| 328.106 \| 1302.513 \| 5770.594 \| 0.838 \| 0.226 \| \| (16, 16, 512, 256) \| relative_bias \| torch.bfloat16 \| 276.052 \| 321.770 \| 1302.904 \| 5241.920 \| 0.858 \| 0.249 \| \| (16, 16, 512, 256) \| head_bias \| torch.bfloat16 \| 271.409 \| 328.839 \| 1302.142 \| 5266.037 \| 0.825 \| 0.247 \| \| (16, 16, 1024, 64) \| noop \| torch.bfloat16 \| 260.489 \| 237.463 \| 955.884 \| 1817.558 \| 1.097 \| 0.526 \| \| (16, 16, 1024, 64) \| causal_mask \| torch.bfloat16 \| 262.378 \| 254.350 \| 955.280 \| 1843.807 \| 1.032 \| 0.518 \| \| (16, 16, 1024, 64) \| relative_bias \| torch.bfloat16 \| 261.338 \| 268.253 \| 956.038 \| 1820.036 \| 0.974 \| 0.525 \| \| (16, 16, 1024, 64) \| head_bias \| torch.bfloat16 \| 262.153 \| 264.156 \| 956.023 \| 1810.076 \| 0.992 \| 0.528 \| \| (16, 16, 1024, 128) \| noop \| torch.bfloat16 \| 476.475 \| 461.413 \| 1760.578 \| 4306.521 \| 1.033 \| 0.409 \| \| (16, 16, 
1024, 128) \| causal_mask \| torch.bfloat16 \| 473.794 \| 479.178 \| 1761.277 \| 4619.439 \| 0.989 \| 0.381 \| \| (16, 16, 1024, 128) \| relative_bias \| torch.bfloat16 \| 473.839 \| 463.282 \| 1758.692 \| 4290.562 \| 1.023 \| 0.410 \| \| (16, 16, 1024, 128) \| head_bias \| torch.bfloat16 \| 472.979 \| 472.896 \| 1763.086 \| 4367.931 \| 1.000 \| 0.404 \| \| (16, 16, 1024, 256) \| noop \| torch.bfloat16 \| 1014.184 \| 1026.764 \| 3922.997 \| 19104.147 \| 0.988 \| 0.205 \| \| (16, 16, 1024, 256) \| causal_mask \| torch.bfloat16 \| 1013.217 \| 1039.046 \| 3928.382 \| 21086.281 \| 0.975 \| 0.186 \| \| (16, 16, 1024, 256) \| relative_bias \| torch.bfloat16 \| 1008.519 \| 1015.278 \| 3922.133 \| 18980.652 \| 0.993 \| 0.207 \| \| (16, 16, 1024, 256) \| head_bias \| torch.bfloat16 \| 1011.360 \| 1047.542 \| 3931.245 \| 19069.172 \| 0.965 \| 0.206 \| \| (16, 16, 4096, 64) \| noop \| torch.bfloat16 \| 3929.850 \| 3325.667 \| 11411.704 \| 23344.280 \| 1.182 \| 0.489 \| \| (16, 16, 4096, 64) \| causal_mask \| torch.bfloat16 \| 3885.262 \| 3581.544 \| 11390.515 \| 23725.639 \| 1.085 \| 0.480 \| \| (16, 16, 4096, 64) \| relative_bias \| torch.bfloat16 \| 3865.737 \| 3537.308 \| 11489.901 \| 23406.330 \| 1.093 \| 0.491 \| \| (16, 16, 4096, 64) \| head_bias \| torch.bfloat16 \| 3880.530 \| 3665.249 \| 11484.411 \| 23299.496 \| 1.059 \| 0.493 \| \| (16, 16, 4096, 128) \| noop \| torch.bfloat16 \| 7030.306 \| 6745.715 \| 20621.264 \| 57464.096 \| 1.042 \| 0.359 \| \| (16, 16, 4096, 128) \| causal_mask \| torch.bfloat16 \| 7095.414 \| 7034.385 \| 20410.656 \| 61660.511 \| 1.009 \| 0.331 \| \| (16, 16, 4096, 128) \| relative_bias \| torch.bfloat16 \| 7084.779 \| 6686.497 \| 20315.161 \| 57243.969 \| 1.060 \| 0.355 \| \| (16, 16, 4096, 128) \| head_bias \| torch.bfloat16 \| 7075.367 \| 6863.305 \| 20494.385 \| 58481.953 \| 1.031 \| 0.350 \| \| (16, 16, 4096, 256) \| noop \| torch.bfloat16 \| 15612.741 \| 14297.482 \| 55306.847 \| 281161.865 \| 1.092 \| 0.197 \| \| (16, 16, 4096, 256) 
\| causal_mask \| torch.bfloat16 \| 15326.592 \| 14263.878 \| 55227.806 \| 313063.232 \| 1.075 \| 0.176 \| \| (16, 16, 4096, 256) \| relative_bias \| torch.bfloat16 \| 15297.963 \| 14007.379 \| 54558.029 \| 279529.175 \| 1.092 \| 0.195 \| \| (16, 16, 4096, 256) \| head_bias \| torch.bfloat16 \| 15216.160 \| 14276.027 \| 55081.581 \| 280996.826 \| 1.066 \| 0.196 \| </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125515 Approved by: https://github.com/Chillee	2024-05-16 03:14:27 +00:00
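The speedup columns in the benchmark table above appear to be eager time divided by compiled time, so values below 1.0 mean the compiled kernel is slower. The short sketch below checks that reading against the first data row, (2, 16, 512, 64) with noop; the helper name `speedup` is hypothetical and the formula is inferred from the numbers, not stated in the commit message.

```python
def speedup(eager_time, compiled_time):
    """Ratio of eager time to compiled time; > 1.0 means compiled is faster."""
    return eager_time / compiled_time


# Timings copied from the (2, 16, 512, 64) / noop row of the table.
fwd = speedup(19.936, 19.092)
bwd = speedup(57.851, 193.564)
print(f"fwd_speedup={fwd:.3f} bwd_speedup={bwd:.3f}")
```

Both values round to the figures reported in that row (1.044 and 0.299).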
PyTorch MergeBot	ae6fdfa539	Revert "Initial implementation of AdaRound (#126153 )" This reverts commit 175c18af818804ba8ef433c3eb8488d1a3d1dd9d. Reverted https://github.com/pytorch/pytorch/pull/126153 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the lint failure is legit because there are more than one lint issues, torch/optim/asgd.py is just the last one ([comment](https://github.com/pytorch/pytorch/pull/126153#issuecomment-2113902522))	2024-05-16 02:34:49 +00:00
PyTorch MergeBot	e3c5d1b7d7	Revert "[optim] Fix: wrong ASGD implementation (#125440 )" This reverts commit 2c5ad9a3d7ea79ca897aec153a401f4b9175a717. Reverted https://github.com/pytorch/pytorch/pull/125440 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it looks like there is a linter failure coming from this change ([comment](https://github.com/pytorch/pytorch/pull/125440#issuecomment-2113833108))	2024-05-16 02:12:29 +00:00
Kwanghoon An	175c18af81	Initial implementation of AdaRound (#126153 ) Summary: This is an implementation of AdaRound from the paper https://arxiv.org/abs/2004.10568 This algorithm is going to be used by multiple people, hence we need to make it an official implementation. Differential Revision: D57227565 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126153 Approved by: https://github.com/jerryzh168	2024-05-16 02:09:18 +00:00
Jiong Gong	927e631dc2	[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 ) As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068 Approved by: https://github.com/jansel ghstack dependencies: #126019	2024-05-16 02:05:49 +00:00
wz337	059b68fbdf	[DeviceMesh] Fix hash and eq not match (#123572 ) Fixes #121799 We fix DeviceMesh hash such that two meshes are considered equal if they have the same mesh and same parent_mesh. Examples can be found here: https://github.com/pytorch/pytorch/issues/121799 Also need this to unblock https://github.com/pytorch/pytorch/pull/123394 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123572 Approved by: https://github.com/xunnanxu, https://github.com/wanchaol, https://github.com/yoyoyocmu	2024-05-16 02:00:45 +00:00
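The general rule behind the entry above is that `__hash__` and `__eq__` must be derived from the same fields, otherwise equal objects can land in different hash buckets. The sketch below illustrates that rule with a hypothetical `Mesh` class; it is not the real `DeviceMesh` implementation, and comparing the parent by identity is an assumption made only for this example.

```python
from typing import Optional, Tuple


class Mesh:
    """Hypothetical stand-in for a device mesh, used only for illustration."""

    def __init__(self, ranks: Tuple[int, ...], parent: Optional["Mesh"] = None):
        self.ranks = ranks
        self.parent = parent

    def _key(self):
        # Single source of truth for both equality and hashing.
        # Assumption: the parent is compared by identity.
        return (self.ranks, id(self.parent))

    def __eq__(self, other):
        if not isinstance(other, Mesh):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


parent = Mesh((0, 1, 2, 3))
a = Mesh((0, 1), parent)
b = Mesh((0, 1), parent)
c = Mesh((0, 1))  # same ranks, but no parent

print(a == b, hash(a) == hash(b))  # True True
print(a == c)                      # False
print(len({a, b, c}))              # 2
```

Routing both methods through one `_key` helper makes it hard for the two definitions to drift apart later.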
Animesh Jain	1876f0fec1	[dynamo][nn module guards] Use TENSOR_MATCH, and not ID_MATCH, for numpy tensors (#126246 ) Fixes speech_transformer regression here - https://hud.pytorch.org/benchmark/torchbench/inductor_no_cudagraphs?startTime=Tue%2C%2007%20May%202024%2019%3A22%3A54%20GMT&stopTime=Tue%2C%2014%20May%202024%2019%3A22%3A54%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=main&lCommit=02093b6c6ae1046368e2500881d0bb5880873386&rBranch=main&rCommit=b24ad7eab55eaf660893dddae949fc714e434338 Thanks to @eellison and @bdhirsh for isolating the regression to nn module guards. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126246 Approved by: https://github.com/jansel ghstack dependencies: #126203	2024-05-16 01:57:59 +00:00
PyTorch MergeBot	315389bfed	Revert "Remove deprecated _aminmax operator (#125995 )" This reverts commit 0116ffae7f94f35a2c712e186a0b371959b68c64. Reverted https://github.com/pytorch/pytorch/pull/125995 on behalf of https://github.com/huydhn due to Sorry for reverting your change but we need to reland this after I get rid of all usage of _aminmax internally in Meta ([comment](https://github.com/pytorch/pytorch/pull/125995#issuecomment-2113769497))	2024-05-16 01:45:37 +00:00
Aidyn-A	6dca1e639b	[TEST][Dynamo] fix test_deviceguard.py (#126240 ) The `test_device_guard.py` was improperly set up, so there were failures on multi-GPU machines. By design the `DeviceGuard` should keep `idx` the same even after it was applied. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126240 Approved by: https://github.com/jansel	2024-05-16 01:44:42 +00:00
Jiong Gong	7844c202b2	[inductor][cpp] epilogue support for gemm template (#126019 ) As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019 Approved by: https://github.com/jansel	2024-05-16 01:42:29 +00:00
PyTorch MergeBot	6065a4d46e	Revert "Switched from parameter in can_cast to from_. (#126030 )" This reverts commit 06d6bb4ebabc64433224970024ada1781508197d. Reverted https://github.com/pytorch/pytorch/pull/126030 on behalf of https://github.com/huydhn due to Sorry for reverting your change but i need to revert it to avoid a diff train conflict with https://github.com/pytorch/pytorch/pull/125995. Please help rebase and I will reland the change ([comment](https://github.com/pytorch/pytorch/pull/126030#issuecomment-2113757469))	2024-05-16 01:42:23 +00:00
Sam Larsen	5efad4ebc1	[inductor] [FX graph cache] Ignore unbacked symints in guards expression (#126251 ) Summary: Found a unit test that was causing an assertion failure during an attempt to use unbacked symints in the guards expression, but it turns out unbacked symints can't affect guards anyway, so we can just filter them out. Also in this diff: test_torchinductor_dynamic_shapes.py was not configured to exercise the codecache because the TestCase setUp method was inadvertently skipping the setUp of the immediate parent class. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126251 Approved by: https://github.com/peterbell10	2024-05-16 01:35:41 +00:00
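The second half of the entry above describes a `setUp` that skipped its immediate parent class. One way this can happen in Python is an explicit `super(Parent, self)` call, which starts method lookup after `Parent`. The sketch below reproduces that pattern with hypothetical class names; whether the original test file had exactly this form is an assumption.

```python
import unittest


class Base(unittest.TestCase):
    def setUp(self):
        super().setUp()
        self.events = ["Base.setUp"]


class CacheEnabled(Base):
    """Hypothetical parent whose setUp turns on some feature under test."""

    def setUp(self):
        super().setUp()
        self.events.append("CacheEnabled.setUp")


class Buggy(CacheEnabled):
    def setUp(self):
        # Bug: lookup starts *after* CacheEnabled, so its setUp never runs.
        super(CacheEnabled, self).setUp()

    def runTest(self):
        pass


class Fixed(CacheEnabled):
    def setUp(self):
        # Zero-argument super() starts after Fixed, so CacheEnabled.setUp runs.
        super().setUp()

    def runTest(self):
        pass


def setup_events(case_cls):
    """Instantiate the test case, run only setUp, and report what ran."""
    case = case_cls()
    case.setUp()
    return case.events


print(setup_events(Buggy))  # ['Base.setUp']
print(setup_events(Fixed))  # ['Base.setUp', 'CacheEnabled.setUp']
```

The buggy variant still passes its tests, which is why this kind of mistake tends to surface only when someone notices the feature was never exercised.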
Animesh Jain	bd63300bae	[dynamo][inline-inbuilt-nn-modules] Add and update test_modules.py for inlining work (#126327 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126327 Approved by: https://github.com/williamwen42 ghstack dependencies: #126303, #126316, #126314	2024-05-16 01:35:09 +00:00
Animesh Jain	7aa068f350	[dynamo][inline-inbuilt-nn-modules] Change test to not depend on id of mod instance (#126314 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126314 Approved by: https://github.com/williamwen42 ghstack dependencies: #126303, #126316	2024-05-16 01:35:09 +00:00
Yanbo Liang	0f8380dd65	[Inductor][Flex-attention] Make num_head support dynamic (#126342 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/126342 Approved by: https://github.com/drisspg	2024-05-16 01:33:53 +00:00
haozhe.zhu	f9d107af66	[optim] add fused_adagrad support for CPU device (#124905 ) Add fused_adagrad kernel support for CPU. ## Bench result: 32 core/sockets ICX Test Scripts: https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969 ``` Tensor Size: 262144, Num Tensor 4, Num Threads: 1 _single_tensor_adagrad time: 0.2500 seconds _fused_adagrad time: 0.0933 seconds Tensor Size: 4194304, Num Tensor 32, Num Threads: 32 _single_tensor_adagrad time: 2.8819 seconds _fused_adagrad time: 1.7591 seconds ``` ## Test Plan: ``` python test_optim.py -k test_fused_matches_forloop python test_optim.py -k test_fused_large_tensor python test_optim.py -k test_can_load_older_state_dict python test_optim.py -k test_grad_scaling_autocast_fused_optimizers python test_torch.py -k test_grad_scaling_autocast_fused python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step ``` Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-05-16 01:11:51 +00:00
Ze Sheng	51e9bb8783	[Export] Allow ExportedProgram to take empty decomp table (#126142 ) As title. Still, `ep.run_decompositions()` will use `core_aten_decompositions()` by default. Cases like `ep.run_decompositions(get_decompositions([]))` will use empty table, and go with [`aot_autograd_decompositions`](`04877dc430/torch/_functorch/aot_autograd.py (L456-459)`) only. Motivation We didn't have a clean way to pass in an empty decomp table. Since we've made `pre_dispatch` export as default and `ep.run_decompositions` remains with `aot_export_module(..., pre_dispatch=False)`, allowing empty table would help make blank control easier. Testing CI Also looked through all the references in fbcode. The only concern I have is whether we should update [this example](`04877dc430/torch/onnx/_internal/exporter.py (L817)`) or not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126142 Approved by: https://github.com/angelayi	2024-05-16 00:31:23 +00:00
Animesh Jain	b3f1882d17	[easy][dynamo][inline-inbuilt-nn-modules] Change test to check for params (#126316 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126316 Approved by: https://github.com/williamwen42 ghstack dependencies: #126303	2024-05-16 00:20:58 +00:00
Tobias Ringwald	06d6bb4eba	Switched from parameter in can_cast to from_. (#126030 ) Fixes #126012. `from` is a reserved keyword in Python, thus we can't make the C++ impl available with `from` as function parameter. This PR changes the name to `from_` and also adjusts the docs. If we want to preserve backwards compatibility, we can leave the C++ name as-is and only fix the docs. However, `torch.can_cast(from_=torch.int, to=torch.int)` won't work then. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126030 Approved by: https://github.com/albanD	2024-05-16 00:09:54 +00:00
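The entry above rests on the fact that `from` is a reserved keyword in Python, so it cannot be used as a keyword argument name at a call site. The sketch below demonstrates this using only the standard library; `is_valid_call` is a hypothetical helper, and the call strings are only parsed, never executed, so no `can_cast` function is needed.

```python
import keyword


def is_valid_call(source: str) -> bool:
    """Return True if `source` parses as a Python expression, False on SyntaxError."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


print(keyword.iskeyword("from"))    # True
print(keyword.iskeyword("from_"))   # False

# Parsing only: the call target does not have to exist.
print(is_valid_call("can_cast(from=1, to=2)"))   # False
print(is_valid_call("can_cast(from_=1, to=2)"))  # True
```

The trailing underscore follows the usual Python convention for avoiding a clash with a keyword.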
Edward Z. Yang	3ae118204e	Make propagate_real_tensor more safe (#126281 ) Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7228787720582401/ There are a few improvements here, which luckily fix some xfails: * In general, it can be unsafe to call operations on Tensors under a `no_dispatch()` mode that is purely trying to disable ambient modes, because this ALSO disables tensor subclass handling. So we test to see if there is a tensor subclass and don't propagate real tensors if that's the case. Another acceptable outcome might be to try to only disable the ambient fake tensor mode, this would help us propagate real tensors through more exotic tensor types, but I'm not going to do it until someone asks for it. * We're graph breaking for wrapped tensors too late. Pull it up earlier so we do it before we try to muck around with the real tensor. * I noticed that occasionally when I do `storage.copy_(real_storage)`, the sizes mismatch. Careful code reading suggests that I should just copy in the real data when the tensor was initially allocated, so that's what I do now, eliminating the need for a storage copy. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126281 Approved by: https://github.com/Skylion007	2024-05-15 23:57:02 +00:00
Edward Z. Yang	b2d9b80fba	Also remove compile_time_strobelight_meta frame when generating stack (#126289 ) I think I also need to fix this in fbcode, leaving that for future work. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126289 Approved by: https://github.com/yanboliang	2024-05-15 23:55:37 +00:00
Edward Z. Yang	9c9d0c2fab	Add VariableTracker.debug_repr (#126299 ) Now you can print arbitrary values at compile time with comptime.print() Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126299 Approved by: https://github.com/jansel ghstack dependencies: #126292	2024-05-15 23:55:29 +00:00
willfengg	a7af53cec1	[FSDP2] support fully_shard(model_on_meta, cpu_offload) (#126305 ) support fully_shard(model_on_meta, cpu_offload) when fully_shard is placed outside of `torch.device("meta")` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126305 Approved by: https://github.com/awgu ghstack dependencies: #126267	2024-05-15 23:29:23 +00:00
Animesh Jain	bcdd0b11ca	[dynamo][inline-inbuilt-nn-modules] Bug fix - Only unspecialized nn modules (#126303 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126303 Approved by: https://github.com/mlazos, https://github.com/laithsakka	2024-05-15 23:23:12 +00:00
William Wen	5cab7a7662	[dynamo] fix https://github.com/pytorch/pytorch/issues/93624 (#125945 ) Fixes https://github.com/pytorch/pytorch/issues/93624 but also requires https://github.com/jcmgray/autoray/issues/20 to be fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125945 Approved by: https://github.com/jansel ghstack dependencies: #125882, #125943	2024-05-15 23:22:06 +00:00
William Wen	56a89fcc08	[dynamo] graph break on issubclass call with non-const args (#125943 ) Fixes https://github.com/pytorch/pytorch/issues/125942 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125943 Approved by: https://github.com/jansel ghstack dependencies: #125882	2024-05-15 23:22:06 +00:00
William Wen	100e3c1205	[dynamo] graph break on const dict KeyError (#125882 ) Fixes https://github.com/pytorch/pytorch/issues/125866 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125882 Approved by: https://github.com/jansel	2024-05-15 23:22:06 +00:00
Adele Sun	b5432ad5ab	Fix triton codegen main do_bench_gpu import error (#126213 ) Summary: Encountered module import error when running triton kernel file. The cause seems to be D57215950 which changed "do_bench" to "do_bench_gpu" for torch._inductor.runtime.runtime_utils However, in the codegen, instead we have "from triton.testing import do_bench", so the line below should be reverted back to "do_bench". Test Plan: LOGLEVEL=DEBUG TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 CUDA_VISIBLE_DEVICES=5 TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT='/home/adelesun/mts_profiling/outputs/profile_output.txt' TORCH_LOGS='+inductor,+schedule,output_code' TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_CACHE_DIR='/home/adelesun/mts_profiling/code' TORCHINDUCTOR_ENABLED_METRIC_TABLES=kernel_metadata buck2 run mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.nvcc_arch=v100,a100,h100 -c fbcode.split-dwarf=true caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --local-model /home/adelesun/mts_profiling/inputs/offsite_cvr_model_526372970_793.input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR 2>&1 \| tee /home/adelesun/mts_profiling/outputs/benchmark_output.txt bento console --kernel=aetk --file=/home/adelesun/mts_profiling/code/op/copmbxfunzmywemwmg66lnlcx4apvn2f2vsi3glgisausgfvit4g.py file ran successfully Differential Revision: D57345619 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126213 Approved by: https://github.com/shunting314	2024-05-15 22:56:15 +00:00
David Chiu	2c5ad9a3d7	[optim] Fix: wrong ASGD implementation (#125440 ) > previous: Originally, the variables `new_eta` and `new_mu` would be constructed `len(grouped_mus)` times, but each of their values is the same and won't be changed. Therefore, it can be simplified using Python list multiplication, which only constructs one tensor. - [X] Incorrect assumption that every param will have the same step. - [x] Different implementation between `foreach=True` and `foreach=False`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125440 Approved by: https://github.com/janeyx99	2024-05-15 22:52:15 +00:00
eqy	5af4b49285	Remove expected failure in `test_eager_transforms.py` (#125883 ) Seems to be supported now CC @tinglvv @nWEIdia @Aidyn-A Pull Request resolved: https://github.com/pytorch/pytorch/pull/125883 Approved by: https://github.com/Chillee, https://github.com/Aidyn-A	2024-05-15 22:12:07 +00:00
Yuanhao Ji	0ca8bf4b41	Enable UFMT on `test/test_datapipe.py` (#124994 ) Part of: #123062 Ran lintrunner on: - `test/test_datapipe.py` Detail: ```bash $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Co-authored-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124994 Approved by: https://github.com/mikaylagawarecki	2024-05-15 21:58:35 +00:00
cyy	18cbaf6dbf	Remove Caffe2 python code (#126035 ) Follows the recent changes of Caffe2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126035 Approved by: https://github.com/r-barnes, https://github.com/Skylion007	2024-05-15 21:51:11 +00:00
zengxian	ad7316b4c2	[CI] Add AMP models in inductor cpu smoketest for performance (#125830 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125830 Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/huydhn, https://github.com/desertfire, https://github.com/atalman	2024-05-15 21:46:58 +00:00
Edward Z. Yang	f0d34941dd	Improve Storage copy_ size mismatch error message (#126280 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126280 Approved by: https://github.com/mikaylagawarecki	2024-05-15 21:14:59 +00:00
Joel Schlosser	d15920a7d0	Warn SDPA users about dropout behavior (#126294 ) Fixes #124464 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126294 Approved by: https://github.com/mikaylagawarecki, https://github.com/drisspg	2024-05-15 20:58:23 +00:00
Gustav Larsson	31d22858e9	[onnx.export] Avoid unnecessary copy of debug_names (#123026 ) This PR is part of an effort to speed up torch.onnx.export (#121422). - The `auto debug_names = ` infers a copy, whereas `const auto& debug_names` does not. - However, this one requires us to be careful, since calls to `setDebugName` change `debug_names` and invalidate the `exist_name` iterator. So if we simply change `auto` to `const auto&`, then between that line and `find` we have corrupted the iterator by calling `output[i]->setDebugName`. This change aims to be functionally equivalent to the original, which is why we first get the Value pointer, then call `output[i]->setDebugName`, and finally call `setDebugName` on the found value. It is possible that functionally it is OK to simply call `output[i]->setDebugName` first and then do the find and the second `setDebugName`, but this would not be identical to current behavior. - Resolves (2) in #121422. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123026 Approved by: https://github.com/justinchuby	2024-05-15 20:58:18 +00:00
Animesh Jain	90461d4986	[dynamo] Detect monkeypatching on nn module forward method (#126203 ) An alternative was https://github.com/pytorch/pytorch/pull/124975. Though it was safer because it was adding guards for every inlined function, it was causing guard overhead for a few models of > 20%. The overhead of this PR is minimal for the common unpatched case. Fixes an internal issue - [fb.workplace.com/groups/1075192433118967/permalink/1411067766198097](https://fb.workplace.com/groups/1075192433118967/permalink/1411067766198097/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126203 Approved by: https://github.com/ezyang	2024-05-15 20:41:13 +00:00
willfengg	c8130dfe84	[FSDP2] allow meta tensors during loading state dict and cpu offloading (#126267 ) unit test: ``pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py`` With meta init and cpu offloading, we have meta tensors after `model.load_state_dict(assign=True, strict=False)`. This PR avoids calling `.cpu` on meta tensors, which would otherwise be a runtime error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126267 Approved by: https://github.com/awgu	2024-05-15 20:35:36 +00:00
Catherine Lee	d74c89fb10	2 rocm shards on trunk.yml (#125933 ) after test removal for windows cpu + avx related configs, it's going to be the long pole for trunk Just checked: without rocm, avg tts for trunk is 2.5 hrs last week, with rocm its about 3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125933 Approved by: https://github.com/ZainRizvi	2024-05-15 20:22:14 +00:00
albanD	d2b2727d66	Fix public api allowlist logical merge conflict (#126321 ) Skip the newly added bad API from https://github.com/pytorch/pytorch/pull/126212 to keep CI green. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126321 Approved by: https://github.com/ezyang	2024-05-15 20:21:39 +00:00
Lucas Pasqualin	e2d18228fe	[DCP] overwrites existing checkpoint by default (#125877 ) Checks for existing checkpoints and overwrites, based on an `overwrite` flag Differential Revision: [D57186174](https://our.internmc.facebook.com/intern/diff/D57186174/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125877 Approved by: https://github.com/fegin	2024-05-15 20:12:52 +00:00
Edward Z. Yang	b659506d82	Parametrize test_dim_reduction (#126292 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126292 Approved by: https://github.com/Skylion007	2024-05-15 19:55:37 +00:00
PyTorch MergeBot	2086f91c4c	Revert "Fix aarch64 debug build with GCC (#126290 )" This reverts commit a961e1ac83bf8831768c5a04eb7c4c18df8988d5. Reverted https://github.com/pytorch/pytorch/pull/126290 on behalf of https://github.com/malfet due to Indeed lint is broken :/ ([comment](https://github.com/pytorch/pytorch/pull/126290#issuecomment-2113332757))	2024-05-15 19:45:57 +00:00
Andrew Gu	2978f07d0e	[FSDP] Fixed docs for inter/intra node PG helpers (#126288 ) 1. This fixes an issue where we had 9 ranks in one node and 7 in the other. 2. This makes the notation more explicit that `[0, 7]` is `[0, 1, ..., 7]`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126288 Approved by: https://github.com/weifengpy	2024-05-15 19:45:10 +00:00
albanD	af9acc4168	Fix public binding to actually traverse modules (#126103 ) The current call passes in `['/actual/path']` to os.walk which is a string pointing to no path and thus silently leads to an empty traversal. There is an unused function just above that handles that, so I guess this is what was supposed to be called. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126103 Approved by: https://github.com/suo	2024-05-15 19:36:03 +00:00
Nikita Shulga	a961e1ac83	Fix aarch64 debug build with GCC (#126290 ) By working around GCCs quirks in instantiating templates that require immediate values. Provide alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define __OPTIMIZE__ if invoked with anything but -O0) Fixes https://github.com/pytorch/pytorch/issues/126283 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126290 Approved by: https://github.com/atalman, https://github.com/seemethere	2024-05-15 19:02:21 +00:00
zong	196661255f	Enable UFMT format on test/test_utils.py (#125996 ) Fixes some files in #123062 Run lintrunner on files: test/test_utils.py ```bash $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125996 Approved by: https://github.com/ezyang	2024-05-15 18:22:57 +00:00
Edward Z. Yang	44efeac24e	Beef up error message for pending assert failure (#126212 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126212 Approved by: https://github.com/Skylion007	2024-05-15 18:22:53 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	26f6f98364	Forward fix failures for torch.export switch to predispatch (#126081 ) Summary: Fixes: - executorch test - torchrec test Test Plan: CI Differential Revision: D57282304 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126081 Approved by: https://github.com/angelayi	2024-05-15 18:13:06 +00:00
eellison	0d49c5cb06	Skip padding cost of fusible/planable inputs (#125780 ) For mm inputs which are not inputs of the graph, assume that we can memory plan them in the aten.cat and exclude the padding cost in the benchmarking comparison. Technically we also have to do a small amount of 0s writing, but that should be relatively small and encompassed in the weighting of the padding time by `1.1` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125780 Approved by: https://github.com/shunting314 ghstack dependencies: #125772, #125773	2024-05-15 18:05:53 +00:00
eellison	4fb5d69b3b	Reland '[Inductor] GEMM shape padding improvements (#118522 )' (#125773 ) Relanding just the pad in a single pass portion of [the pr](https://github.com/pytorch/pytorch/pull/118522). Not including the transpose logic: This was previously accepted and reviewed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125773 Approved by: https://github.com/shunting314 ghstack dependencies: #125772	2024-05-15 17:34:41 +00:00
James Wu	a91311e7c2	[easy] Remove aot_config from pre_compile returns, rename fw_metadata in post_compile (#125854 ) This field never changes so pre_compile doesn't need to return it again: remove it just for a cleaner refactor. As @aorenste points out, the fw_metadata passed to post_compile is actually the fw_metadata after all wrapper's pre_compile's have run. I want to make this clear in the code, so I renamed the arg in post_compile. Wrappers that need the exact metadata that they were passed in pre_compile need to save that fw_metadata properly themselves. Currently, wrappers come in two categories: 1. Wrappers that modify fw_metadata, but then never use fw_metadata in post compile 2. Wrappers that never modify fw_metadata, and only consume the "final" fw_metadata. So none of the behaviors will change for the existing wrappers. That said, it might be useful to define a "SimpleCompilerWrapper" subclass which guarantees it does not modify fw_metadata. I'll do that in a separate PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125854 Approved by: https://github.com/aorenste, https://github.com/bdhirsh	2024-05-15 17:23:47 +00:00
Gustav Larsson	44e47d5bd0	[onnx.export] Avoid linear loop over symbol_dim_map (#123029 ) This PR is part of an effort to speed up torch.onnx.export (#121422). - Doing a reverse look-up in `symbol_dim_map` incurs a linear cost in number of symbols. This happens for each node, so incurs a quadratic cost to the whole export. - Add a reverse look-up `dim_symbol_map` that is kept in parallel of `symbol_dim_map`. This avoids a linear time look-up, which creates a quadratic export time complexity. - This is a highly pragmatic solution. If someone more familiar with the code base has a better solution, I'm interested to hear about it. - Resolves (9) in #121422. (partial fix of #121422) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123029 Approved by: https://github.com/justinchuby	2024-05-15 17:22:30 +00:00
Alexander Grund	490d72e4e6	CMake: Improve check and report of Magma (#117858 ) - Only search for magma if it is used (GPU builds) - Don't report it was not found when it isn't searched for - Don't report if magma is disabled (currently: "MAGMA not found. Compiling without MAGMA support" is reported) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117858 Approved by: https://github.com/malfet	2024-05-15 17:18:22 +00:00
Yanbo Liang	f91cae461d	[Dynamo] SizeVariable supports hasattr (#126222 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/126222 Approved by: https://github.com/williamwen42, https://github.com/anijain2305	2024-05-15 17:16:36 +00:00
wz337	c1dc8bb858	[DTensor] Turn on foreach implementation of optimizer for DTensor by default (#123394 ) Append DTensor to the optimizer `_foreach_supported_types` and turn on foreach implementation of optimizer for DTensor if not specified by the users. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123394 Approved by: https://github.com/wanchaol	2024-05-15 16:45:42 +00:00
Nikita Shulga	4ab2c399be	Faster int8 quantized (#125704 ) Or my journey to learn how to write fast Metal kernels (more details would be posted [here](https://github.com/malfet/llm_experiments/tree/main/metal-perf) ) Using gpt-fast as a benchmark (by running `python generate.py --checkpoint_path checkpoints/stories110M/model_int8.pth --device mps`) Before the change, on M2 Pro I get 50 tokens per sec After adding a very naive ```metal template<typename T> kernel void int8pack_mm( constant T * A [[buffer(0)]], constant char * B [[buffer(1)]], constant T * scales [[buffer(2)]], device T * outputData [[buffer(3)]], constant uint3 & sizes [[buffer(4)]], uint thread_index [[thread_position_in_grid]]) { const uint lda = sizes.y; const uint ldc = sizes.z; const uint m = thread_index / sizes.z; // 0..sizes.x-1 const uint n = thread_index % sizes.z; // 0..sizes.z-1 constant T * A_ptr = A + m * lda; constant char * B_ptr = B + n * lda; float rc = 0.0; for(uint k = 0; k < sizes.y; k++) { const auto a_val = float(A_ptr[k]); const auto b_val = float(B_ptr[k]); rc += a_val * b_val; } outputData[thread_index] = T(rc * float(scales[n])); } ``` Perf dropped down to a sad 15 tokens per second. Replacing inner loop with vectorized operations ```metal float rc = 0.0; for(uint k = 0; k < sizes.y/4; k++) { const auto a_val = float4(A_ptr[k]); const auto b_val = float4(B_ptr[k]); rc += dot(a_val, b_val); } ``` Perf jumps back up to 53 tokens per second, but it's a bit of a lie when it comes to llama2-7B perf. Next step in unlocking the performance was to replace a 1D grid with a 2D one, but limit the thread group size to a single row, which results in a much better data locality (which unfortunately is not observable with `stories110M` anymore as its small model size and Python runtime overhead hide the perf gain) There were several unsuccessful attempts at caching inputs in thread local memory or using `float4x4` to speed up computation. 
But the key to unlocking the perf was a comment in `631dfbe673/mlx/backend/metal/kernels/gemv.metal (L184)` which hinted at exploiting both SIMD groups and thread local caches, which resulted in 5x jump in performance compared to initial vectorization approach and 3x perf jump in end-to-end llama7b test Pull Request resolved: https://github.com/pytorch/pytorch/pull/125704 Approved by: https://github.com/mikekgfb	2024-05-15 16:39:24 +00:00
Catherine Lee	719a8f42bf	Forward fix lint after #125747 (#126295 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/126295 Approved by: https://github.com/atalman	2024-05-15 16:37:48 +00:00
Catherine Lee	9689532106	[CI] 3 procs non cuda (#125932 ) Too lazy to figure out actual time reduction here, I'll figure it out later. Also I'd rather get an average of a couple of runs on trunk rather than just this one PR Things got faster. Source? Trust me bro * rel to https://github.com/pytorch/pytorch/pull/125598 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125932 Approved by: https://github.com/ZainRizvi	2024-05-15 16:18:36 +00:00
PyTorch MergeBot	718bb9016f	Revert "[Memory Snapshot] Add recordAnnotations to capture record_function annotations (#124179 )" This reverts commit 187aeaeabf612824c2d0e9be72f80ce6612760d4. Reverted https://github.com/pytorch/pytorch/pull/124179 on behalf of https://github.com/clee2000 due to test_tensorexpr.py::TestTensorExprFuser::test_simple_add is causing a segfault https://github.com/pytorch/pytorch/actions/runs/9097383783/job/25007155440 `187aeaeabf`, test was skipped due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/124179#issuecomment-2112948246))	2024-05-15 16:11:47 +00:00
Zhengxu Chen	f9dda37a74	[export] Cover more cases to copy tensor conversions. (#125628 ) Summary: Previously we tried to convert all .to() calls to to_copy in the graph; now some users report that other methods like .float() are not covered: https://github.com/pytorch/PiPPy/issues/1104#issuecomment-2093352734 I think fundamentally .float() should look similar to .to() in export and this diff tries to expand the coverage of the tensor conversion methods here. Test Plan: buck run mode/opt caffe2/test:test_export -- -r float_conversion Differential Revision: D56951634 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125628 Approved by: https://github.com/tugsbayasgalan	2024-05-15 15:50:21 +00:00
xinan.lin	c53e0ac7ba	[Inductor] Generalize new introduced device-bias code. (#126261 ) We found some Inductor test case failures when enabling Inductor UT for Intel GPU. The root cause is newly introduced device-biased Inductor code from recent community PRs, which causes different behaviors between Intel GPU and CUDA. This PR generalizes this code to align their behaviors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126261 Approved by: https://github.com/EikanWang, https://github.com/peterbell10	2024-05-15 15:05:07 +00:00
Yuanhao Ji	ba3cd6e463	Enable UFMT on `test/test_fake_tensor.py`, `test/test_flop_counter.py` and some files (#125747 ) Part of: #123062 Ran lintrunner on: - test/test_fake_tensor.py - test/test_flop_counter.py - test/test_function_schema.py - test/test_functional_autograd_benchmark.py - test/test_functional_optim.py - test/test_functionalization_of_rng_ops.py Detail: ```bash $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125747 Approved by: https://github.com/malfet	2024-05-15 14:50:14 +00:00
Aaron Enye Shi	187aeaeabf	[Memory Snapshot] Add recordAnnotations to capture record_function annotations (#124179 ) Summary: Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation as well as user annotations. Test Plan: CI New Snapshot Generated: devvm2184.cco0.facebook.com.Apr_19_13_27_14.3072800.snapshot.pickle Snippet of Snapshot device_traces show `ProfilerStep#0`, and `## forward ##` annotations: ``` [[{'action': 'user_defined', 'addr': 0, 'size': 0, 'stream': 0, 'time_us': 1713558427168556, 'frames': [{'name': 'START', 'filename': 'ProfilerStep#0', 'line': 0}]}, {'action': 'user_defined', 'addr': 0, 'size': 0, 'stream': 0, 'time_us': 1713558427168738, 'frames': [{'name': 'END', 'filename': 'ProfilerStep#0', 'line': 0}]}, {'action': 'user_defined', 'addr': 0, 'size': 0, 'stream': 0, 'time_us': 1713558427168865, 'frames': [{'name': 'START', 'filename': 'ProfilerStep#1', 'line': 0}]}, {'action': 'user_defined', 'addr': 0, 'size': 0, 'stream': 0, 'time_us': 1713558427168920, 'frames': [{'name': 'START', 'filename': '## forward ##', 'line': 0}]}, {'action': 'alloc', 'addr': 140166073581568, 'size': 3211264, 'stream': 0, 'time_us': 1713558427172978, 'frames': [{'name': '_conv_forward', 'filename': '/mnt/xarfuse/uid-416185/235d4caf-seed-nspid4026531836_cgpid32884718-ns-4026531840/torch/nn/modules/conv ``` Differential Revision: D55941362 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/124179 Approved by: https://github.com/zdevito	2024-05-15 14:19:40 +00:00
Bin Bao	ee8c1550d6	[AOTI][torchgen] Add a few more fallback ops (#126013 ) Summary: They appear in some unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126013 Approved by: https://github.com/chenyang78 ghstack dependencies: #125962	2024-05-15 12:56:07 +00:00
Bin Bao	563aa3e035	[AOTI][torchgen] Update NativeFunctionsGroup mapping (#125962 ) Summary: When looking up for what backend call to use for a fallback op (see get_backend_index_for_aoti), sometimes we need to search for a NativeFunction's structured delegate. Previous str:NativeFunctionsGroup dict missed some cases, such as aten.index.Tensor, and that's why aten.index.Tensor was specified in the fallback_ops list but no C shim entry was generated for it. This PR uses a more robust OperatorName:NativeFunctionsGroup mapping. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125962 Approved by: https://github.com/chenyang78	2024-05-15 12:56:07 +00:00
Edward Z. Yang	a0aaf56114	Don't assert about pending when we are peeking (#126239 ) Internal xref https://fb.workplace.com/groups/6829516587176185/posts/7211398545654652/ In particular, when we're collecting forward metadata, we aren't going to discharge any of the pending, so we'll be continuously collecting more and more pending symbols that we may not be able to resolve. This is fine. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126239 Approved by: https://github.com/lezcano	2024-05-15 12:18:34 +00:00
Wei Wang	8f30f367d0	[CUDA] [CI] Add cu124 docker images (#125944 ) Fixes issues encountered in https://github.com/pytorch/pytorch/pull/121956 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125944 Approved by: https://github.com/atalman	2024-05-15 09:52:38 +00:00
Jiong Gong	f060b0c6e6	[inductor][cpp] GEMM template (infra and fp32) (#124021 ) This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info. 1. Cpp template infrastructure Similar template abstractions as the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`. The MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates. 2. Initial FP32 gemm template This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction. 3. Correctness and performance The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static shape and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements in follow-up PRs, including optimizations in kernels as well as fusions. The perf gains are only observed in a select number of models compared to the ATen kernels which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details. 
Static shapes \| Benchmark \| torchbench \| huggingface \| timm_models \| \|------------\|-------------\|--------------\|--------------\| \| Multi-threaded (baseline) \| 1.47x \| 1.36x \| 1.91x \| \| Multi-threaded (max-autotune) \| 1.47x \| 1.36x \| 1.92x \| \| Single-threaded (baseline) \| 1.56x \| 1.19x \| 1.51x \| \| Single-threaded (max-autotune) \| 1.56x \| 1.19x \| 1.52x \| Key models being sped up: drq: 1.14x soft_act: 1.12 cait_m36_384: 1.18x Dynamic shapes \| Benchmark \| torchbench \| huggingface \| timm_models \| \| --- \| --- \| --- \| --- \| \| Multi-threaded (baseline) \| 1.43x \| 1.28x \| 1.85x \| \| Multi-threaded (max-autotune) \| 1.47x \| 1.28x \| 1.85x \| \| Single-threaded (baseline) \| 1.55x \| 1.20x \| 1.51x \| \| Single-threaded (max-autotune) \| 1.56x \| 1.19x \| 1.53x \| Key models being sped up: BERT_pytorch: 1.22x pyhpc_turbulent: 1.13x soft_actor_critic: 1.77x BlenderbotForCausalLM: 1.09x cait_m36_384: 1.17x Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021 Approved by: https://github.com/jansel	2024-05-15 08:14:51 +00:00
Oguz Ulgen	79655a1321	Add force_disable_caches to the docs (#126184 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126184 Approved by: https://github.com/msaroufim	2024-05-15 07:16:08 +00:00
PyTorch UpdateBot	2d35b4564a	[audio hash update] update the pinned audio hash (#126248 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126248 Approved by: https://github.com/pytorchbot	2024-05-15 05:45:16 +00:00
Sam Larsen	03467b3fed	Add a few "warm start" smoketest runs to CI (#125955 ) Summary: Not sure which to choose, so my criteria were: 1) We care about huggingface as part of internal milestones 2) This handful of models seems to particularly benefit from caching Pull Request resolved: https://github.com/pytorch/pytorch/pull/125955 Approved by: https://github.com/desertfire ghstack dependencies: #125917, #125953	2024-05-15 05:32:06 +00:00
Sam Larsen	c87c39d935	[benchmarking] Suppress csv creation on cold-start phase of --warm-start-latency (#125953 ) Summary: It seems that most (all?) of our utilities for examining benchmark output expect single-line entries per benchmark. The way the --warm-start-latency flag is currently implemented, it means that we'll see two entries for every benchmark run (one for the warm-up run and one for the actual run). This PR adds a --disable-output flag that we can use for the first run to suppress populating the csv. This way, the existing utilities like `benchmarks/dynamo/check_accuracy.py` will function without any changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125953 Approved by: https://github.com/desertfire ghstack dependencies: #125917	2024-05-15 05:32:06 +00:00
Sam Larsen	9f0d3f71c9	Adjust number of repeats when using --warm-start-latency benchmark flag (#125917 ) Summary: In --warm-start-latency mode, we can just perform the cache-warmup run once instead of whatever was provided with --repeat Pull Request resolved: https://github.com/pytorch/pytorch/pull/125917 Approved by: https://github.com/desertfire	2024-05-15 05:32:06 +00:00
Isuru Fernando	0dedc1aff2	Update CUDA out of memory message with private pool info (#124673 ) Fixes https://github.com/pytorch/pytorch/issues/121932 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124673 Approved by: https://github.com/eellison, https://github.com/eqy	2024-05-15 05:30:47 +00:00
eellison	5178baefa9	use statically known instead of suppress guard for ddp stride propagation (#126234 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126234 Approved by: https://github.com/ezyang	2024-05-15 05:21:55 +00:00
xinan.lin	e74a6f487a	[Inductor] Skip test_nll_loss_backward for intel GPU. (#126157 ) Skip this test case due to unaligned behavior to CUDA for Triton `mask_load`. We submitted issue #126173 to elaborate on the root cause. We intend to skip this case for XPU first as we need to take some time to fix the issue and have full validation to update the Triton commit pin for Intel GPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126157 Approved by: https://github.com/EikanWang, https://github.com/peterbell10, https://github.com/desertfire	2024-05-15 05:16:07 +00:00
FEI	b950217f19	Support third-party devices emit a range for each autograd operator (#125822 ) Fixes #125752 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125822 Approved by: https://github.com/aaronenyeshi	2024-05-15 05:06:24 +00:00
cyy	bdea4904c1	Add some type annotations to python stream and event classes (#126171 ) For recent device agnostic code changes, we need type hinting on the parent classes for better tooling support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126171 Approved by: https://github.com/ezyang	2024-05-15 04:58:07 +00:00
iefgnoix	7dfd2949d7	Add missing type uint16, uint32, and uint64 to TensorHash in LTC. (#125972 ) If I do: ``` xla_device = xm.xla_device() xla_tensor_0 = torch.tensor(42, dtype=torch.uint32).to(xla_device) ``` I got the error: ``` RuntimeError: false INTERNAL ASSERT FAILED at "/ansible/pytorch/torch/csrc/lazy/core/hash.h":139, please report a bug to PyTorch. Unsupported scalar type:UInt16 ``` This PR intends to fix this issue. The data type can be found in pytorch/c10/core/ScalarType.h. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125972 Approved by: https://github.com/JackCaoG	2024-05-15 04:57:08 +00:00
Yanbo Liang	dfab69fdf1	[Inductor] Flex attention supports dynamic shape (#125994 ) ## static shapes perf ``` \| Type \| Speedup \| batch_size \| num_heads \| q_seq_len \| k_seq_len \| head_dim \| score_mod \| dtype \| \|---------\|-----------\|--------------\|-------------\|-------------\|-------------\|------------\|-------------\|----------------\| \| Average \| 0.692 \| \| \| \| \| \| \| \| \| Max \| 0.855 \| 16 \| 16 \| 4096 \| 4096 \| 64 \| head_bias \| torch.bfloat16 \| \| Min \| 0.419 \| 8 \| 16 \| 512 \| 512 \| 256 \| noop \| torch.bfloat16 \| ``` ## dynamic shapes perf ``` \| Type \| Speedup \| batch_size \| num_heads \| q_seq_len \| k_seq_len \| head_dim \| score_mod \| dtype \| \|---------\|-----------\|--------------\|-------------\|-------------\|-------------\|------------\|---------------\|----------------\| \| Average \| 0.670 \| \| \| \| \| \| \| \| \| Max \| 0.864 \| 16 \| 16 \| 4096 \| 4096 \| 64 \| relative_bias \| torch.bfloat16 \| \| Min \| 0.376 \| 8 \| 16 \| 512 \| 512 \| 256 \| relative_bias \| torch.bfloat16 \| ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125994 Approved by: https://github.com/Chillee	2024-05-15 04:43:24 +00:00
Chirag Pandya	1485621ccb	[BE] Abstract out strings to top of file (#125640 ) Summary: Move const strings to top of file. This is in preparation of tooling to make use of shared constants (e.g. version string). A non-functional change. Ideally we want these const strings to be available from both C++ and Python - but I haven't figured out how to correctly share things in PyTorch. I'll do this in a subsequent change. Test Plan: python test/distributed/test_c10d_nccl.py NCCLTraceTest Pull Request resolved: https://github.com/pytorch/pytorch/pull/125640 Approved by: https://github.com/wconstab	2024-05-15 03:38:30 +00:00
Huy Do	24c30096e3	Set dtype when copying empty tensor (#126124 ) Summary: Forward fix D57251348 Test Plan: `buck2 test 'fbcode//mode/dev' fbcode//executorch/kernels/test:aten_op_copy_test` Differential Revision: D57304360 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126124 Approved by: https://github.com/bdhirsh	2024-05-15 03:25:07 +00:00
Yanbo Liang	51ed4c46cf	[Dynamo] Supports torch._C._is_any_autocast_enabled (#126196 ) Fixes #126026 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126196 Approved by: https://github.com/anijain2305	2024-05-15 03:16:13 +00:00
Yidi Wu	314ba13f01	Support trace_subgraph in _MakefxTracer (#125363 ) Adds trace_subgraph to _MakefxTracer; the motivation is in https://github.com/pytorch/pytorch/pull/122972. Also migrates all existing usage of reenter_make_fx to the new sub-tracer. Previously, the torch function mode for creating torch_fn metadata wouldn't be re-entered when we're in ProxyTensorMode (since it's inside of __torch_function__). This PR reconstructs the torch function mode based on the parent tracer's config and re-enters the torch function mode so the metadata is shown in the graph. Test Plan: Existing tests. We have a bunch of make_fx tests for cond, map and while_loop. Also removes the expected failure for torch_fn, since reenter_make_fx is able to reconstruct torch function modes. Also fixes https://github.com/pytorch/pytorch/issues/124643 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125363 Approved by: https://github.com/Chillee ghstack dependencies: #125267	2024-05-15 03:12:24 +00:00
Yidi Wu	73d8c10f13	Refactor make_fx to better support hop subgraph tracing (#125267 ) Code movement + minor rewrites. We extract the states of make_fx out and encapsulate them into a _MakefxTracer class. This allows us to create a new make_fx_tracer when tracing subgraphs, the actual logic for tracing subgraph is in the next diff. Test Plan: Existing tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125267 Approved by: https://github.com/Chillee	2024-05-15 03:12:24 +00:00
Howard Huang	470723faea	[pipelining] Add manual pipeline stage (#126123 ) Add `ManualPipelineStage` under `_PipelineStage.py` Fix some type hints since `args_recv_info` can contain more than one RecvInfo. Previously the hint was `Tuple[InputInfo]` which meant it is a tuple of size 1. This is different from `List[InputInfo]` which can contain any number of items. I needed to update to `Tuple[InputInfo, ...]` to make the number of items flexible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126123 Approved by: https://github.com/kwen2501	2024-05-15 00:55:15 +00:00
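The type-hint distinction described in the commit above can be checked with the standard `typing` module alone. This is a minimal sketch: `InputInfo` here is an empty hypothetical stand-in, not the class from the pipelining code, and `describe` is an illustrative helper.

```python
from typing import Tuple, get_args


class InputInfo:
    """Hypothetical stand-in for the pipelining InputInfo class."""


# Tuple[X] means a tuple of exactly one X.
OneItem = Tuple[InputInfo]
# Tuple[X, ...] means a tuple of any number of X values.
AnyLength = Tuple[InputInfo, ...]


def describe(hint) -> str:
    """Report whether a Tuple hint is fixed-length or variable-length."""
    args = get_args(hint)
    if len(args) == 2 and args[1] is Ellipsis:
        return "variable-length"
    return f"fixed-length ({len(args)})"


print(describe(OneItem))
print(describe(AnyLength))
```

A static type checker would reject a two-element tuple passed where `Tuple[InputInfo]` is expected, while `Tuple[InputInfo, ...]` accepts it.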
drisspg	dccb5cf7ca	Allow for trailing 'a' in sm_arch (#126185 ) # Summary I was getting ``` Shell File "/home/drisspg/meta/pytorch/torch/cuda/__init__.py", line 312, in _lazy_init raise DeferredCudaCallError(msg) from e torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: invalid literal for int() with base 10: '90a' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126185 Approved by: https://github.com/Skylion007	2024-05-15 00:16:42 +00:00
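The error quoted in the commit above comes from calling `int()` on a capability string with a trailing letter. The sketch below shows one way to tolerate the suffix. `parse_sm_arch` is a hypothetical helper written for illustration, and the `sm_NN` input format is an assumption; this is not the code from the PR.

```python
import re


def parse_sm_arch(arch: str) -> int:
    """Hypothetical helper: return the numeric part of 'sm_90' or 'sm_90a'."""
    # Accept digits followed by at most one lowercase letter suffix.
    match = re.fullmatch(r"sm_(\d+)[a-z]?", arch)
    if match is None:
        raise ValueError(f"unrecognized arch string: {arch!r}")
    return int(match.group(1))


# The naive conversion fails the same way as in the traceback above.
try:
    int("90a")
except ValueError as err:
    print("naive int() failed:", err)

print(parse_sm_arch("sm_90a"))
print(parse_sm_arch("sm_80"))
```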
Kiuk Chung	92eb1731d4	[torch/distributed] Bugfix: wait for all child procs to exit before c… (#125969 ) Observed Problem --------------------- When `torchrun` has finished running the main trainer function (aka entrypoint/user function) successfully, I noticed that sometimes it SIGTERMS the child processes. Then `torchrun` exits successfully. This results in misleading warning log messages towards the end of the job like the one below: ``` W0510 14:52:48.185934 672413 api.py:513] Closing process 675171 via signal SIGTERM W0510 14:52:48.185984 672413 api.py:513] Closing process 675172 via signal SIGTERM W0510 14:52:48.186013 672413 api.py:513] Closing process 675174 via signal SIGTERM # <---- ^^^ ??? everything runs successfully but child still SIGTERM'ed? ^^^ ---> I0510 14:52:48.229119 672413 api.py:877] [main] worker group successfully finished. Waiting 300 seconds for other agents to finish. I0510 14:52:48.229161 672413 api.py:922] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish I0510 14:52:48.229395 672413 api.py:936] Done waiting for other agents. Elapsed: 0.0001709461212158203 seconds I0510 14:52:48.257544 672413 dynamic_rendezvous.py:1131] The node 'localhost_672413_0' has closed the rendezvous 'torchrun_qpfd'. I0510 14:52:48.568198 672413 distributed.py:200] Deleting temp log directory: /tmp/torchrun_udgp8zoq I0510 14:52:48.568989 672413 distributed.py:202] Finished running `main` ``` Root Cause ------------------ I noticed that this was due to the incorrect usage of `torch.multiprocessing.ProcessContext.join()` in `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext`. `torch.multiprocessing.ProcessContext.join()` does not actually wait for ALL child procs to exit, but rather waits for at-least-one child proc to exit. If only a subset of the child procs have exited, it returns `False` and if all child procs have exited it returns `True`. 
`torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` was assuming that `torch.multiprocessing.ProcessContext.join()` blocks indefinitely until all child procs have exited. Fix --------- The fix is simple, just loop, while continuing to call `pc.join()` until it returns `True` > NOTE: that the indefinite blocking is NOT an issue since by the time `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` calls `pc.join()` it already did all the checking to validate that the entrypoint functions either return successfully or that one of them has failed. So we are really just waiting for the unix process to exit after running the entrypoint function. > NOTE: since `pc.join()` already blocks until at-least-one child proc exits, there is no need to add a polling interval in the body of the loop and the debug logging will show at most `nproc_per_node` times so no log spamming is observed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125969 Approved by: https://github.com/d4l3k	2024-05-15 00:13:08 +00:00
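The fix described above (keep calling `join()` until it returns `True`) can be sketched without PyTorch. `FakeProcessContext` is a hypothetical stand-in that only mimics the contract stated in the commit message: each `join()` call returns once at least one more child has exited, and returns `True` only when all children are gone. `wait_for_all` is an illustrative name, not the function in `torch.distributed.elastic`.

```python
class FakeProcessContext:
    """Hypothetical stand-in mimicking the join() contract described above."""

    def __init__(self, num_procs: int) -> None:
        self.remaining = num_procs

    def join(self) -> bool:
        # Each call "waits" for one more child process to exit.
        if self.remaining > 0:
            self.remaining -= 1
        # True only once every child has exited.
        return self.remaining == 0


def wait_for_all(pc) -> int:
    """Loop on join() until it reports that all children exited.

    Returns the number of join() calls made, which is at most the
    number of child processes (or 1 if there are none).
    """
    calls = 1
    while not pc.join():
        calls += 1
    return calls


print(wait_for_all(FakeProcessContext(3)))
```

Because each `join()` call blocks until at least one child exits, the loop needs no sleep or polling interval, which matches the second note in the commit message.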
briancoutinho	e5cce35c21	Remove use of USE_C10D (#126120 ) As per https://github.com/pytorch/pytorch/blob/main/torch/CMakeLists.txt#L271 the USE_DISTRIBUTED and USE_C10D are equivalent. In another PR I was cleaning this usage up so also cleaning it up here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126120 Approved by: https://github.com/aaronenyeshi	2024-05-15 00:00:26 +00:00
PyTorch MergeBot	fd48fb9930	Revert "[CUDA] [CI] Add cu124 docker images (#125944 )" This reverts commit 5fb4a766b88bcf633a23610bd66de0f3020f7c66. Reverted https://github.com/pytorch/pytorch/pull/125944 on behalf of https://github.com/nWEIdia due to test failure seems related `5fb4a766b8` https://github.com/pytorch/pytorch/actions/runs/9085206167/job/24972040039 ([comment](https://github.com/pytorch/pytorch/pull/125944#issuecomment-2111321724))	2024-05-14 23:29:26 +00:00
PyTorch MergeBot	b6d8b256e6	Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021 )" This reverts commit 037615b989b37b1bf5eff0c031055fc8d1fbe5ae. Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor.test_unbacked_symints.TestUnbackedSymintsCPU::test_autotuning_cpu ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2111318883))	2024-05-14 23:26:15 +00:00
Animesh Jain	c1aa05f80c	[easy][dynamo] Use disable_dynamo for torch.manual_seed (#126192 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126192 Approved by: https://github.com/yanboliang ghstack dependencies: #126191	2024-05-14 23:20:32 +00:00
Animesh Jain	c6f3f1d239	[reland][dynamo][disable] Move disable impl to its own __call__ method (#126191 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126191 Approved by: https://github.com/yoyoyocmu, https://github.com/yanboliang, https://github.com/fegin	2024-05-14 23:20:32 +00:00
Edward Z. Yang	41fabbd93f	Fanatically correct real tensor cloning for propagate_real_tensors (#126175 ) Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7211398545654652/ Previously I did it in a crappy way using clone_input in the callback, but this results in tensors that don't have quite the same size/stride/storage offset and there was an internal test case where not having completely accurate information was causing a downstream problem in propagation. So now I make real tensors as similar to their fake equivalents as much as possible. Though... I don't bother with autograd lol. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126175 Approved by: https://github.com/albanD	2024-05-14 23:14:17 +00:00
eellison	328b75d1a0	Enable epilogue fusion benchmarking internally (#125455 ) Differential Revision: [D56920738](https://our.internmc.facebook.com/intern/diff/D56920738) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125455 Approved by: https://github.com/Chillee	2024-05-14 23:06:29 +00:00
Pian Pawakapan	e046c59e5b	[export] handle aliased/unused params for unflattening (#125758 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125758 Aliased and unused params are currently an issue for strict-mode export. For a model like this: ``` def __init__(self): # ... self.alpha = nn.Parameter(torch.randn(4)) self.beta = self.alpha self.gamma = self.alpha def forward(self, x): return x + self.beta ``` Dynamo will trace only 1 parameter (beta) and assign a dynamo name (e.g. `L__self___beta`) which can be difficult to match to the correct FQN in the original eager module. This leads to export graph signature potentially having the incorrect target FQN for the parameter, leading to downstream issues unflattening (the parameter may be assigned to the wrong target attribute, mismatching the relevant placeholder node in the unflattened module). This handles aliasing issues by assigning all tensors present in the state dict as module attributes, even if they're unused. Still, only the used tensors will appear in the graph's forward pass. Another issue that exists is weight-sharing is not maintained in unflattening (all params/buffers are re-cloned) - handle this by checking tensor ids too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125758 Approved by: https://github.com/zhxchen17	2024-05-14 23:00:46 +00:00
Nikita Shulga	4d063c8e8a	Do not print escape characters in xdoctest logs (#126219 ) By invoking make with `vt100` terminal settings Test Plan: [Before](https://github.com/pytorch/pytorch/actions/runs/9086391859/job/24972547633) ``` 2024-05-14T21:50:09.0459741Z [01mreading sources... [39;49;00m[ 57%] [35mgenerated/torch.func.stack_module_state .. generated/torch.gradient[39;49;00m 2024-05-14T21:50:09.2204992Z [01mreading sources... [39;49;00m[ 59%] [35mgenerated/torch.greater .. generated/torch.jit.ignore[39;49;00m 2024-05-14T21:50:09.9598581Z [01mreading sources... [39;49;00m[ 61%] [35mgenerated/torch.jit.interface .. generated/torch.linalg.multi_dot[39;49;00m 2024-05-14T21:50:10.5383853Z [01mreading sources... [39;49;00m[ 64%] [35mgenerated/torch.linalg.norm .. generated/torch.moveaxis[39;49;00m ``` [After](https://github.com/pytorch/pytorch/actions/runs/9086780396/job/24973727737?pr=126219) ``` 2024-05-14T22:27:22.9388802Z reading sources... [ 57%] generated/torch.func.stack_module_state .. generated/torch.gradient 2024-05-14T22:27:23.5874407Z reading sources... [ 59%] generated/torch.greater .. generated/torch.jit.ignore 2024-05-14T22:27:23.7649947Z reading sources... [ 61%] generated/torch.jit.interface .. generated/torch.linalg.multi_dot 2024-05-14T22:27:24.3492981Z reading sources... [ 64%] generated/torch.linalg.norm .. generated/torch.moveaxis 2024-05-14T22:27:24.9723946Z reading sources... [ 66%] generated/torch.movedim .. generated/torch.nn.AdaptiveLogSoftmaxWithLoss ``` Fixes https://github.com/pytorch/pytorch/issues/123166 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126219 Approved by: https://github.com/clee2000	2024-05-14 22:45:55 +00:00
Daniil Kutz	b522e65056	Check pointer for null before deref in Aten/native/sparse (#126163 ) Fixes #126162 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126163 Approved by: https://github.com/ezyang	2024-05-14 21:55:41 +00:00
Mikayla Gawarecki	bbdbfe3661	Reland add `write_record_metadata` to PyTorchFileWriter (#126087 ) Reland of https://github.com/pytorch/pytorch/pull/125184 with compiler warning fixed by extending `m_pWrite` rather than adding `m_pSeek` to miniz API Differential Revision: [D57287327](https://our.internmc.facebook.com/intern/diff/D57287327) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126087 Approved by: https://github.com/albanD	2024-05-14 21:48:44 +00:00
cdzhan	1ba852c1dc	Fix torch elastic test SimpleElasticAgentTest.test_restart_workers br… (#126002 ) Failure Info: ```bash (pt) betterman@bjys1009:/projs/framework/betterman/code/pytorch_new/test/distributed/elastic/agent/server/test$ pytest api_test.py -k test_restart_workers =============================================================================================================================================== test session starts ================================================================================================================================================ platform linux -- Python 3.10.8, pytest-8.1.1, pluggy-1.4.0 rootdir: /projs/framework/betterman/code/pytorch_new configfile: pytest.ini plugins: hypothesis-6.15.0, rerunfailures-14.0, flakefinder-1.1.0, xdist-3.3.1 collecting 1 item / projs/framework/betterman/code/pytorch_new/test/distributed/elastic/agent/server/test/api_test.py:123: PytestCollectionWarning: cannot collect test class 'TestAgent' because it has a __init__ constructor (from: test/distributed/elastic/agent/server/test/api_test.py) class TestAgent(SimpleElasticAgent): collected 29 items / 28 deselected / 1 selected Running 1 items in this shard api_test.py F [100%] ===================================================================================================================================================== FAILURES ===================================================================================================================================================== ___________________________________________________________________________________________________________________________________ SimpleElasticAgentTest.test_restart_workers ____________________________________________________________________________________________________________________________________ Traceback (most recent call last): File "/usr/local/python3.10/lib/python3.10/unittest/case.py", line 59, in testPartExecutor yield File 
"/usr/local/python3.10/lib/python3.10/unittest/case.py", line 591, in run self._callTestMethod(testMethod) File "/usr/local/python3.10/lib/python3.10/unittest/case.py", line 549, in _callTestMethod method() File "/projs/framework/betterman/code/pytorch_new/test/distributed/elastic/agent/server/test/api_test.py", line 368, in test_restart_workers agent._restart_workers(worker_group) File "/projs/framework/betterman/code/pytorch_new/torch/distributed/elastic/metrics/api.py", line 123, in wrapper result = f(args, *kwargs) File "/projs/framework/betterman/code/pytorch_new/torch/distributed/elastic/agent/server/api.py", line 728, in _restart_workers self._stop_workers(worker_group, is_restart=True) TypeError: TestAgent._stop_workers() got an unexpected keyword argument 'is_restart' ============================================================================================================================================= short test summary info ============================================================================================================================================== FAILED [0.0054s] api_test.py::SimpleElasticAgentTest::test_restart_workers - TypeError: TestAgent._stop_workers() got an unexpected keyword argument 'is_restart' ========================================================================================================================================= 1 failed, 28 deselected in 7.37s ========================================================================================================================================= ``` Caused by #124819 . Pull Request resolved: https://github.com/pytorch/pytorch/pull/126002 Approved by: https://github.com/ezyang	2024-05-14 21:36:24 +00:00
Aaron Enye Shi	3a58d40b93	[Profiler] Clean up deprecated use_cuda by default (#126180 ) Summary: Should not be setting use_cuda by default anymore, since it is deprecated. Instead it will be set via use_device="cuda". Test Plan: CI and ran locally: Before: ``` [INFO: pytorch_resnet_integration_test.py: 196]: step: 80, peak allocated GPU mem: 3.17GB, peak active GPU mem: 3.17GB, peak reserved GPU mem: 3.39GB. /data/users/aaronshi/fbsource/buck-out/v2/gen/fbcode/277373c3e83d278c/kineto/libkineto/fb/integration_tests/__pytorch_resnet_integration_test__/pytorch_resnet_integration_test#link-tree/torch/autograd/profiler.py:215: UserWarning: The attribute `use_cuda` will be deprecated soon, please use ``use_device = 'cuda'`` instead. Log file: /tmp/libkineto_activities_812639.json Trace start time: 2024-05-14 08:44:50 Trace duration: 500ms Warmup duration: 5s Max GPU buffer size: 128MB Enabled activities: cpu_op,user_annotation,gpu_user_annotation,gpu_memcpy,gpu_memset,kernel,external_correlation,cuda_runtime,cuda_driver,cpu_instant_event,python_function,xpu_runtime,privateuse1_runtime,privateuse1_driver Manifold bucket: gpu_traces Manifold object: tree/traces/clientAPI/0/1715701483/devvm2184.cco0/libkineto_activities_812639.json Trace compression enabled: 1 TTL in seconds: 31536000 (365 days) INFO:2024-05-14 08:44:43 812639:812639 CuptiActivityProfiler.cpp:971] Enabling GPU tracing ``` After: ``` [INFO: pytorch_resnet_integration_test.py: 196]: step: 80, peak allocated GPU mem: 3.17GB, peak active GPU mem: 3.17GB, peak reserved GPU mem: 3.39GB. 
Log file: /tmp/libkineto_activities_903554.json Trace start time: 2024-05-14 09:05:47 Trace duration: 500ms Warmup duration: 5s Max GPU buffer size: 128MB Enabled activities: cpu_op,user_annotation,gpu_user_annotation,gpu_memcpy,gpu_memset,kernel,external_correlation,cuda_runtime,cuda_driver,cpu_instant_event,python_function,xpu_runtime,privateuse1_runtime,privateuse1_driver Manifold bucket: gpu_traces Manifold object: tree/traces/clientAPI/0/1715702740/devvm2184.cco0/libkineto_activities_903554.json Trace compression enabled: 1 TTL in seconds: 31536000 (365 days) INFO:2024-05-14 09:05:40 903554:903554 CuptiActivityProfiler.cpp:971] Enabling GPU tracing ``` Differential Revision: D57337445 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/126180 Approved by: https://github.com/davidberard98	2024-05-14 21:23:31 +00:00
Kevin Yin	534c34b320	Fix copy-pasted docs, reversing the load and save description (#125993 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125993 Approved by: https://github.com/kwen2501, https://github.com/fegin	2024-05-14 21:14:16 +00:00
Pian Pawakapan	2973c9bb88	[export] add SchemaCheckMode testing for pre-dispatch export, OpInfo (#125481 ) This adds a new dispatch mode, PreDispatchSchemaCheckMode, built on top of SchemaCheckMode, used for verifying op schemas for functionalization for PreDispatch IR. More specifically, the mode runs in eager mode on concrete inputs, checking if op schemas incorrectly claim to be functional, but are aliasing or mutating. This mode is pushed to the pre-dispatch mode stack, and run before decompositions. Current testing is hooked up to OpInfo, containing 1103 tests on 600 unique ops. Below is a list of ops that fail testing. One caveat is we only raise errors on ops that claim to be functional - if an op schema admits aliasing or mutating but fails testing for the other, it still may decompose further and become functional. List of failed ops: ``` aten.atleast_1d.default aten.atleast_2d.default aten.atleast_3d.default aten.cartesian_prod.default aten.conj_physical.default aten.alpha_dropout.default aten.feature_dropout.default aten.feature_alpha_dropout.default aten.unsafe_chunk.default ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125481 Approved by: https://github.com/tugsbayasgalan	2024-05-14 21:07:21 +00:00
Edward Z. Yang	534ddfa619	Move compute unbacked bindings call to track_tensor_tree (#126168 ) This ensures we hit it in all the HOP proxy tensor implementations Fixes https://github.com/pytorch/pytorch/issues/125869 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126168 Approved by: https://github.com/ydwu4	2024-05-14 21:05:05 +00:00
Yuanhao Ji	54131ecb25	Remove redundant spaces in CMakeLists.txt (#126042 ) Fixes #126023 ```diff diff --git a/CMakeLists.txt b/CMakeLists.txt index 79db67e735..924721d2e6 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -281,8 +281,8 @@ if(NOT DEFINED USE_VULKAN) endif() option(USE_SLEEF_FOR_ARM_VEC256 "Use sleef for arm" OFF) -option(USE_SOURCE_DEBUG_ON_MOBILE "Enable " ON) -option(USE_LITE_INTERPRETER_PROFILER "Enable " ON) +option(USE_SOURCE_DEBUG_ON_MOBILE "Enable" ON) +option(USE_LITE_INTERPRETER_PROFILER "Enable" ON) option(USE_VULKAN_FP16_INFERENCE "Vulkan - Use fp16 inference" OFF) option(USE_VULKAN_RELAXED_PRECISION "Vulkan - Use relaxed precision math in the kernels (mediump)" OFF) # option USE_XNNPACK: try to enable xnnpack by default. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126042 Approved by: https://github.com/r-barnes	2024-05-14 21:04:49 +00:00
Michael Lazos	7ed67cdbcc	Add compile time smoketest for foreach (#126136 ) Fixes [T175425693](https://www.internalfb.com/intern/tasks/?t=175425693) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126136 Approved by: https://github.com/yanboliang	2024-05-14 21:00:55 +00:00
Bas Zalmstra	a8eac0efa8	fix: unknown CMake command "check_function_exists" (#126165 ) When building pytorch with OpenBLAS on windows I ran into this CMake issue: ``` CMake Error at cmake/Modules/FindLAPACK.cmake:137 (check_function_exists): Unknown CMake command "check_function_exists". Call Stack (most recent call first): cmake/Dependencies.cmake:1745 (find_package) CMakeLists.txt:708 (include) ``` Similarly described here: https://discuss.pytorch.org/t/cmake-with-error-by-compiling-on-windows-with-mingw32-make/159140 This PR fixes this issue by adding: ``` include(CheckFunctionExists) ``` To the offending CMake file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126165 Approved by: https://github.com/ezyang	2024-05-14 20:54:06 +00:00
William Wen	4a8db9d45b	[dynamo] reset grad state in aotdispatch test, add failing trace functional tensor test to dynamo (#126113 ) Workaround for https://github.com/pytorch/pytorch/issues/125568. We could add additional global state to reset (e.g. autocast?) or move this setup/teardown to a more general place. Also added a minimal repro for the linked issue - will investigate in a followup PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126113 Approved by: https://github.com/ezyang, https://github.com/bdhirsh	2024-05-14 20:42:49 +00:00
Peter Bell	f6a00a8032	[inductor] Add abs to index_propagation (#124616 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124616 Approved by: https://github.com/lezcano ghstack dependencies: #124119	2024-05-14 20:14:53 +00:00
Peter Bell	c30ea3387b	[inductor] Improve stability of scaled softmax (#124119 ) This adds a pattern which replaces: ```python scale(x) - scale(x).amax(dim, keepdim=True) ``` with ```python scale(x - x.amax(dim, keepdim=True)) ``` where `scale` can be either multiplication or division by a scalar, or a tensor that is broadcast in the `dim` dimension. We can find this pattern inside of the decomposed graph of: ```python F.softmax(scale(x), dim=dim) ``` This has the effect of both reducing the chance of hitting the `fma` issue and also means we avoid recomputing `scale(x)` inside and outside the reduction which may be significant if we can remove an extra division. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124119 Approved by: https://github.com/lezcano	2024-05-14 20:14:53 +00:00
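The rewrite in the commit above relies on the identity that scaling commutes with subtracting the maximum. The sketch below checks it in plain Python for a positive scalar multiplier only. That restriction is an assumption of this example (a negative scale would turn the maximum into a minimum), and the helper names are illustrative rather than taken from Inductor.

```python
import math


def shift_after_scaling(x, s):
    """scale(x) - scale(x).amax(): scale first, then subtract the max."""
    scaled = [s * v for v in x]
    m = max(scaled)
    return [v - m for v in scaled]


def scale_after_shifting(x, s):
    """scale(x - x.amax()): subtract the max first, then scale once."""
    m = max(x)
    return [s * (v - m) for v in x]


def softmax_from_shifted(shifted):
    """Softmax over values that have already had their max subtracted."""
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


x = [1.5, -2.0, 30.0]
s = 0.125  # assumed positive scalar scale

a = softmax_from_shifted(shift_after_scaling(x, s))
b = softmax_from_shifted(scale_after_shifting(x, s))
print(all(math.isclose(p, q, rel_tol=1e-12) for p, q in zip(a, b)))
```

In the second form the scale is applied once, after the reduction, which is the saving the commit message points out.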
zdevito	352a893b0c	Fast standalone symbolize for unwinding (#123966 ) We've had issues using addr2line. On certain versions of CentOS it is on a version that has a performance regression making it very slow, and even normally it is not that fast, taking several seconds even when parallelized for a typical memory trace dump. Folly Symbolize or LLVMSymbolize are fast, but they require PyTorch to take a dependency on those libraries, and given the number of environments we run stuff in, we end up hitting cases where we fall back to slow addr2line behavior. This adds a standalone symbolizer to PyTorch, similar to the unwinder, which has no external dependencies and is ~20x faster than addr2line for unwinding PyTorch frames. I've tested this on some memory profiling runs using all combinations of {gcc, clang} x {dwarf4, dwarf5} and it seems to do a good job at getting line numbers and function names right. It is also careful to route all reads of library data through the `CheckedLexer` object, which ensures it is not reading out of bounds of the section. Errors are routed through UnwindError so that those exceptions get caught and we produce a ?? frame rather than crash. I also added a fuzz test which gives all our symbolizer options random addresses in the process to make sure they do not crash. Differential Revision: [D56828968](https://our.internmc.facebook.com/intern/diff/D56828968) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123966 Approved by: https://github.com/ezyang, https://github.com/aaronenyeshi	2024-05-14 19:39:17 +00:00
Wei Wang	5fb4a766b8	[CUDA] [CI] Add cu124 docker images (#125944 ) Fixes issues encountered in https://github.com/pytorch/pytorch/pull/121956 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125944 Approved by: https://github.com/atalman	2024-05-14 19:38:10 +00:00
Richard Barnes	ed327876f5	[codemod] `c10::optional` -> `std::optional` (#126135 ) Generated by running the following from PyTorch root: ``` find . -regex ".*\.\(cpp\|h\|cu\|hpp\|cc\|cxx\)$" | grep -v "build/" | xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/' ``` `c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi	2024-05-14 19:35:51 +00:00
Richard Barnes	b55f57b7af	[codemod][lowrisk] Remove extra semi colon from caffe2/c10/core/SymNodeImpl.h (#123055 ) Summary: `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123055 Approved by: https://github.com/Skylion007	2024-05-14 19:35:29 +00:00
laithsakka	023f05cfe6	Allow symbols to reach conv_layout stride argument #125829 (#126116 ) https://github.com/pytorch/pytorch/pull/125829 was reverted. I rebased it; the original failure may have been a merge error, because it is not reproducible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126116 Approved by: https://github.com/anijain2305	2024-05-14 19:22:16 +00:00
Ke Wen	0e6462f69a	[pipelining] Consolidate test models into a registry (#126114 ) Resolves https://github.com/pytorch/PiPPy/issues/1062. Also added a gradient equivalence test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126114 Approved by: https://github.com/H-Huang ghstack dependencies: #125729, #125975	2024-05-14 19:11:54 +00:00
Andres Lugo-Reyes	38b8b614a2	[ROCm] Implement forward AD for miopen_batch_norm (#125069 ) Implements forward automatic differentiation support for miopen_batch_norm as well as unskips the associated unit tests. Also fixes a class of functorch related unit tests that fail due to failing a contiguous tensor assertion in BatchNorm_miopen.cpp. Solution was to just limit tensors to miopen_batch_norm that have at least 3 dimensions. The exact restriction already existed in the cudnn path and is why the tests in question only failed on ROCm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125069 Approved by: https://github.com/jeffdaily, https://github.com/andrewor14	2024-05-14 19:09:50 +00:00
David Chiu	1a28f731dc	[optim] Merge the pyi files into py files of optimizer (#125452 ) Continue the work of pytorch/pytorch#125153 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125452 Approved by: https://github.com/janeyx99	2024-05-14 18:24:50 +00:00
David Berard	a00a99e801	[profiler] Report strides in json trace (#125851 ) We already collect strides, we just don't report them anywhere. Note: this depends on concrete input collection being enabled, which I think is currently not the case internally. Differential Revision: [D57165421](https://our.internmc.facebook.com/intern/diff/D57165421) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125851 Approved by: https://github.com/Chillee, https://github.com/aaronenyeshi	2024-05-14 18:24:24 +00:00
Gustav Larsson	50c3d58734	[onnx.export] Cache AllGraphInputsStatic (#123028 ) This PR is part of an effort to speed up torch.onnx.export (#121422). - The inputs (dynamic inputs and constants) do not change as nodes are added, and the value is expensive to re-compute for every node, so we cache it to avoid recomputing it for every node. Open to an entirely different solution as well. - Resolves (5) in #121422. (partial fix of #121545) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123028 Approved by: https://github.com/justinchuby	2024-05-14 18:19:04 +00:00
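The caching idea in #123028 can be sketched in plain Python (the real change lives in the exporter's C++ code). This is only an illustration under stated assumptions: the `Graph` class, its `all_inputs_static` method and the `scan_count` counter are hypothetical names, not part of torch.onnx. The point is simply that a value which cannot change while nodes are added is computed once and then reused.

```python
from typing import List, Optional


class Graph:
    """Toy stand-in for an export graph; every name here is hypothetical."""

    def __init__(self, input_shapes: List[List[Optional[int]]]) -> None:
        self.input_shapes = input_shapes
        self.scan_count = 0  # how many times the expensive scan actually ran
        self._all_inputs_static: Optional[bool] = None  # cached answer

    def _scan_inputs(self) -> bool:
        # The "expensive" part: walk every dimension of every input.
        self.scan_count += 1
        return all(dim is not None for shape in self.input_shapes for dim in shape)

    def all_inputs_static(self) -> bool:
        # Compute on first use, then serve the cached value.
        if self._all_inputs_static is None:
            self._all_inputs_static = self._scan_inputs()
        return self._all_inputs_static


graph = Graph([[1, 3, 224, 224], [1000]])
# Simulate asking the same question once per node while building the graph.
flags = [graph.all_inputs_static() for _node in range(1000)]
print(all(flags), graph.scan_count)
```

The cache is only valid because the inputs are fixed for the lifetime of the graph; if they could change, the cached flag would need to be invalidated.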
andrewor14	3cba50e478	[quant] Make per_group and per_token quant match torch.fake_quantize (#125781 ) Summary: Follow-up to https://github.com/pytorch/ao/pull/229. This resolves the difference between `input.div(scales)` and `input.mul(1.0 / scales)`, which results in small numerical discrepancies on some inputs. Test Plan: python test/test_quantization.py TestQuantizedTensor.test_decomposed_quantize_per_channel_group python test/test_quantization.py TestQuantizedTensor.test_decomposed_quantize_per_token Reviewers: jerryzh168 Subscribers: jerryzh168, supriyar Pull Request resolved: https://github.com/pytorch/pytorch/pull/125781 Approved by: https://github.com/jerryzh168	2024-05-14 18:18:54 +00:00
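The discrepancy between dividing by a scale and multiplying by its reciprocal, mentioned in #125781, is a general floating-point effect and can be reproduced without PyTorch. The sketch below uses Python's 64-bit floats with illustrative values (`x = 3.0`, `scale = 10.0`) that are not taken from the PR; in float32 the affected inputs differ, but the mechanism is the same: the reciprocal is rounded once and the product is rounded again.

```python
x = 3.0
scale = 10.0

by_division = x / scale            # one correctly rounded operation
by_reciprocal = x * (1.0 / scale)  # two roundings: reciprocal, then product

print(by_division)                  # 0.3
print(by_reciprocal)                # 0.30000000000000004
print(by_division == by_reciprocal) # False
```

When such values are later rounded to an integer grid, a difference of one unit in the last place can occasionally flip the quantized result, which is why the two formulations have to match exactly to compare implementations bit for bit.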
Andrew Gu	3892e86c94	[FSDP2] Changed grad acc test to use data parallel ref model (#126161 ) This simplifies the test a bit. Context Option 1: Ref model is data parallel. Each rank's ref model receives local batch. We manually all-reduce gradients and divide them by world size to match DDP/FSDP semantics. Option 2: Ref model is not data parallel. Each rank's ref model receives the same global batch. We manually divide the ref model's gradients by world size to match DDP/FSDP semantics. (Note that all ranks have the same ref model and same global batch.) All of our other unit tests are written following Option 1, which is simpler and a more direct comparison to what our claimed semantics are. This PR switches the gradient accumulation test from being written as following Option 2 to as following Option 1. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126161 Approved by: https://github.com/wanchaol ghstack dependencies: #126067, #126070	2024-05-14 18:15:38 +00:00
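The two reference-model options described in #126161 can be compared with a small plain-Python sketch. It assumes a loss that is summed (not averaged) over samples and a single scalar weight; the helper `grad_sum_loss` and the sample data are made up for illustration and do not come from the test file.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def grad_sum_loss(w: float, batch: List[Sample]) -> float:
    """Gradient w.r.t. w of sum((w * x - y) ** 2) over the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)


world_size = 2
w = 0.5
global_batch = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 0.0)]
# Each rank sees a disjoint slice of the global batch.
local_batches = [global_batch[rank::world_size] for rank in range(world_size)]

# Option 1: data-parallel ref model. Each rank uses its local batch, then the
# gradients are all-reduced (summed here) and divided by the world size.
local_grads = [grad_sum_loss(w, batch) for batch in local_batches]
option1 = sum(local_grads) / world_size

# Option 2: non-data-parallel ref model. Every rank uses the same global batch
# and divides its gradient by the world size.
option2 = grad_sum_loss(w, global_batch) / world_size

print(option1, option2)  # -1.0 -1.0
```

Under the summed-loss assumption both options yield the same gradient; Option 1 mirrors what DDP/FSDP actually do per rank, which is the reason the commit gives for preferring it.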
Andrew Gu	4ded666535	[FSDP2] Factored out `MLPStack` to de-dup code (#126070 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126070 Approved by: https://github.com/wanchaol ghstack dependencies: #126067	2024-05-14 18:13:51 +00:00
Catherine Lee	48f98bcdfc	[TD] Enable test removal on most default configs + distributed CUDA for everyone (#125931 ) yolo Add the longest jobs in pull: * default cpu configs * non sm86 cuda * distributed cuda for everyone Still excluding * slow, inductor, rocm, onnx, mac, dynamo * distributed cpu * windows cuda Pull Request resolved: https://github.com/pytorch/pytorch/pull/125931 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi	2024-05-14 17:35:12 +00:00
Edward Z. Yang	db3b38202b	Improve dead code elimination of unnecessary int arguments (#126074 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126074 Approved by: https://github.com/lezcano ghstack dependencies: #125325, #125915	2024-05-14 17:22:30 +00:00
dshi7	9df2f8687f	cprofile every compile id [x/y] to keep consistent with tlparse (#125659 ) This PR moves cprofile decorator to keep consistent with `torch_inductor_stats` logging and is needed by fbcode diffs of profiling enablement in internal e2e jobs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125659 Approved by: https://github.com/ezyang	2024-05-14 17:09:28 +00:00
Andrew Gu	2e4d011195	[FSDP2] Used `CommDebugMode` in grad acc test (#126067 ) +9/-27 lines -- very nice :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126067 Approved by: https://github.com/wanchaol	2024-05-14 16:43:37 +00:00
PyTorch MergeBot	20aa7cc678	Revert "[c10d] Add an option for NAN check on every collective (#125726 )" This reverts commit 6db32710074f0944305b2d1e4571bb4ce571bf6a. Reverted https://github.com/pytorch/pytorch/pull/125726 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the new test is failing on both multigpu and rocm distributed, i.e. `c712b0f8a3` ([comment](https://github.com/pytorch/pytorch/pull/125726#issuecomment-2110646075))	2024-05-14 16:26:34 +00:00
angelayi	aac215a824	SymInt-ify unsqueeze_copy (#125976 ) Fixes https://github.com/pytorch/pytorch/issues/125853 I only half-know how to code c++ so please lmk if I did templating incorrectly 🙈 The reason I used a template is because the `InferUnsqueezeGeometryResult` struct gets used in a couple of other places, like for unsqueeze_quantized, but I wasn't sure if I should symint-ify those too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125976 Approved by: https://github.com/larryliu0820, https://github.com/ezyang	2024-05-14 15:58:52 +00:00
PyTorch MergeBot	ed76079af3	Revert "Remove Caffe2 python code (#126035 )" This reverts commit 9a1bf39c6629e27cad281393059244791b82a166. Reverted https://github.com/pytorch/pytorch/pull/126035 on behalf of https://github.com/jeanschmidt due to Seems to have introduced lint error: Error: Module 'onnx' has no attribute 'numpy_helper' ([comment](https://github.com/pytorch/pytorch/pull/126035#issuecomment-2110570863))	2024-05-14 15:47:33 +00:00
Wang, Eikan	d1f254dce8	Add a cache mechanism to accelerate torch.compile-for-eager (#116368 ) This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545. In this PR, we are trying to enable a cache mechanism to accelerate eager-through-torch.compile. When eager-through-torch.compile is enabled, we will store a persistent config to cache the kernel information for the aten operation. The persistent config consists of two parts - meta_info and kernel_path. - meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag. - kernel_path: The path of the kernel produced by Inductor. When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value. Currently, this PR only supports static shape to guard the kernel. Take a `mul` as an example. ```python class MulKernel: def __init__(self) -> None: pass def __call__(self, args: Any, kwargs: Any) -> Any: with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False): opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={ "aot_inductor.eager_mode": True, "aot_inductor.eager_op_name": "mul_Tensor" } ) return opt_fn(args, **kwargs) torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL") _, overload_names = torch._C._jit_get_operation("aten::mul") schema = torch._C._get_schema("aten::mul", overload_name) reg_name = schema.name if schema.overload_name: reg_name = f"{reg_name}.{schema.overload_name}" torch_compile_op_lib_impl.impl( reg_name, MulKernel(), "CUDA", compile_mode=True) a = torch.randn(1024, 1024, device=device) b = torch.randn(1024, 1024, device=device) warm_up_iter = 1000 iter = 10000 fn = torch.mul # Warm up for _ in range(warm_up_iter): fn(a, b) # Collect performance beg = time.time() for _ in range(iter): fn(a, b) end = time.time() print(f"E2E run: {end - beg}") ``` It will produce the config as follows. 
```json [ { "meta_info": [ { "is_symbolic": false, "device_type": "cuda", "dtype": "torch.float32", "sizes": [1024, 1024], "strides": [1024, 1] }, { "is_symbolic": false, "device_type": "cuda", "dtype": "torch.float32", "sizes": [1024, 1024], "strides": [1024, 1] } ], "kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so" } ] ``` Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape. - Eager: ~266.11ms - W/O Cache: ~3455.54ms - W/ Cache and Cache Miss: ~3555.3ms - W/ Cache and Cache Hit: ~267.12ms Hardware: - CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz - GPU: CUDA A10 Software: - PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf - GPU Driver Version: 525.147.05 - CUDA Version: 12.0 Differential Revision: [D57216427](https://our.internmc.facebook.com/intern/diff/D57216427) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116368 Approved by: https://github.com/jansel, https://github.com/atalman	2024-05-14 15:43:48 +00:00
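The "meta_info is the key, kernel library is the value" lookup described in #116368 can be sketched in plain Python. This is not the PR's `kernel_holder` (which is C++); `meta_key`, `build_cache` and the kernel path `/tmp/example_kernel.so` are hypothetical names chosen for illustration, and only the JSON field names follow the config shown in the commit message.

```python
import json
from typing import Any, Dict, List, Tuple

# A config in the shape shown above; the kernel path is a made-up placeholder.
CONFIG_TEXT = """
[
  {
    "meta_info": [
      {"is_symbolic": false, "device_type": "cuda", "dtype": "torch.float32",
       "sizes": [1024, 1024], "strides": [1024, 1]},
      {"is_symbolic": false, "device_type": "cuda", "dtype": "torch.float32",
       "sizes": [1024, 1024], "strides": [1024, 1]}
    ],
    "kernel_path": "/tmp/example_kernel.so"
  }
]
"""


def meta_key(meta_info: List[Dict[str, Any]]) -> Tuple:
    """Turn the per-tensor metadata into a hashable key (static shapes only)."""
    return tuple(
        (m["is_symbolic"], m["device_type"], m["dtype"],
         tuple(m["sizes"]), tuple(m["strides"]))
        for m in meta_info
    )


def build_cache(config_text: str) -> Dict[Tuple, str]:
    """Parse the persistent config into a meta_info -> kernel_path map."""
    return {
        meta_key(entry["meta_info"]): entry["kernel_path"]
        for entry in json.loads(config_text)
    }


cache = build_cache(CONFIG_TEXT)

# Cache hit: same shapes, strides, dtype and device as the stored entry.
hit_query = json.loads(CONFIG_TEXT)[0]["meta_info"]
print(cache.get(meta_key(hit_query)))   # /tmp/example_kernel.so

# Cache miss: a different shape guards against reusing the wrong kernel.
miss_query = [dict(m, sizes=[512, 512], strides=[512, 1]) for m in hit_query]
print(cache.get(meta_key(miss_query)))  # None
```

On a miss the operation would have to be compiled again, which matches the "Cache Miss" row in the timings reported above.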
Alexandre Ghelfi	b3a8a3cbab	Fix typos in `torch._dynamo.config.py` (#126150 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126150 Approved by: https://github.com/Skylion007	2024-05-14 14:27:35 +00:00
Yuanhao Ji	680a568721	Fix typo in HistogramKernel.cpp (#126156 ) Fix typo in HistogramKernel.cpp ```diff diff --git a/aten/src/ATen/native/cpu/HistogramKernel.cpp b/aten/src/ATen/native/cpu/HistogramKernel.cpp index 196bfd5647..0505271f6a 100644 --- a/aten/src/ATen/native/cpu/HistogramKernel.cpp +++ b/aten/src/ATen/native/cpu/HistogramKernel.cpp @@ -100,7 +100,7 @@ void histogramdd_cpu_contiguous(Tensor& hist, const TensorList& bin_edges, TensorAccessor<const input_t, 2> accessor_in = input.accessor<const input_t, 2>(); - /* Constructs a c10::optional<TensorAccessor> containing an accessor iff + /* Constructs a c10::optional<TensorAccessor> containing an accessor if * the optional weight tensor has a value. */ const auto accessor_wt = weight.has_value() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126156 Approved by: https://github.com/r-barnes	2024-05-14 14:26:35 +00:00
cyy	9a1bf39c66	Remove Caffe2 python code (#126035 ) Follows the recent changes of Caffe2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126035 Approved by: https://github.com/r-barnes, https://github.com/Skylion007	2024-05-14 14:23:46 +00:00
David Chiu	9641a8db25	[optim] deprecate `LRScheduler.print_lr` (#126105 ) Fixes #99270 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126105 Approved by: https://github.com/janeyx99	2024-05-14 14:13:03 +00:00
lezcano	37596769d8	Autocast `vdot` (#125697 ) Fixes https://github.com/pytorch/pytorch/issues/125544 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125697 Approved by: https://github.com/jbschlosser	2024-05-14 12:05:02 +00:00
Jeeja	556e4ec6c9	[FSDP] Add device in pin_memory argument (#119878 ) Add device to pin_memory argument to support other backends like HPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/119878 Approved by: https://github.com/awgu	2024-05-14 10:30:00 +00:00
CaoE	9dec41b684	add avx512 specialization for vec_shuffle_down (#125147 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125147 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/peterbell10	2024-05-14 08:26:13 +00:00
Valentin Andrei	8bf9e99cea	[pytorch][cuda] Some speedup on depth wise convolution 2D forward (#125362 ) This PR does a few things: - Adds a generic implementation for `conv_depthwise2d` when the filter size is non standard. This implementation works faster because it doesn't do edge condition checks inside the innermost loops. We avoid the checks by calculating the boundaries ahead of the loop. - Hints to nvcc to minimize the register usage so that we squeeze more memory bandwidth - Adds filter size 5 as a common size where we can use the template implementation to improve unrolling and generate more efficient code The implementation doesn't completely fix the issue described in https://github.com/pytorch/pytorch/issues/18631. For that we need to rewrite the kernel using the suggestions described in the issue chat. This PR uses the same order of accessing the tensor as before but just removes overhead instructions in the inner loops to get the speedup. Before: ``` conv2d-performance: B C iH iW kH kW native (cpu) conv2d (cuda) conv2d-fp16 (cuda) 0 8.0 64.0 1024.0 1008.0 5.0 5.0 149.052643 24.982176 3.236192 1 8.0 64.0 1008.0 1008.0 5.0 5.0 150.810333 24.643536 3.237760 2 4.0 48.0 720.0 539.0 6.0 1.0 15.747776 2.636320 1.788672 3 4.0 120.0 379.0 283.0 6.0 1.0 12.234080 1.791712 1.231360 4 4.0 32.0 713.0 532.0 6.0 1.0 10.362272 1.731584 1.170544 5 4.0 3.0 712.0 542.0 31.0 31.0 24.965248 3.406304 4.165440 6 4.0 120.0 379.0 288.0 1.0 6.0 10.772512 1.215616 0.939936 7 1024.0 384.0 1.0 928.0 1.0 3.0 60.051582 7.594256 2.861344 8 4.0 24.0 687.0 512.0 6.0 1.0 10.231536 1.196704 0.818432 9 96.0 96.0 112.0 112.0 5.0 5.0 21.025631 5.110096 0.715520 10 96.0 80.0 56.0 56.0 5.0 5.0 9.730064 1.016080 0.207424 11 64.0 128.0 64.0 84.0 3.0 3.0 18.759552 0.616736 0.200832 12 16.0 960.0 7.0 7.0 5.0 5.0 0.274880 0.020288 0.014688 13 16.0 64.0 112.0 112.0 3.0 3.0 6.425696 0.189088 0.053728 ``` After ``` B C iH iW kH kW native (cpu) conv2d (cuda) conv2d-fp16 (cuda) 0 8.0 64.0 1024.0 
1008.0 5.0 5.0 122.534370 12.915648 3.269936 1 8.0 64.0 1008.0 1008.0 5.0 5.0 126.026978 12.826848 3.236608 2 4.0 48.0 720.0 539.0 6.0 1.0 14.488160 1.803424 1.794368 3 4.0 120.0 379.0 283.0 6.0 1.0 11.556304 1.251200 1.240736 4 4.0 32.0 713.0 532.0 6.0 1.0 9.737841 1.186240 1.174128 5 4.0 3.0 712.0 542.0 31.0 31.0 19.394785 2.017056 2.310368 6 4.0 120.0 379.0 288.0 1.0 6.0 9.586752 0.828736 0.843712 7 1024.0 384.0 1.0 928.0 1.0 3.0 48.939903 5.529312 2.860768 8 4.0 24.0 687.0 512.0 6.0 1.0 13.474000 0.831920 0.825280 9 96.0 96.0 112.0 112.0 5.0 5.0 15.439168 2.611616 0.724864 10 96.0 80.0 56.0 56.0 5.0 5.0 5.991968 0.520352 0.207456 11 64.0 128.0 64.0 84.0 3.0 3.0 9.381472 0.609680 0.202832 12 16.0 960.0 7.0 7.0 5.0 5.0 0.265504 0.015680 0.014496 13 16.0 64.0 112.0 112.0 3.0 3.0 2.384832 0.187168 0.053280 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125362 Approved by: https://github.com/ezyang	2024-05-14 07:27:02 +00:00
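The "calculate the boundaries ahead of the loop" idea from #125362 can be illustrated with a 1D, pure-Python sketch. It is not the CUDA kernel from the PR: `conv1d_checked` and `conv1d_prebounded` are hypothetical helpers that compute a zero-padded cross-correlation, first with an edge check inside the innermost loop and then with the valid tap range computed once per output element.

```python
from typing import List


def conv1d_checked(x: List[float], w: List[float], pad: int) -> List[float]:
    """Zero-padded cross-correlation with a bounds check per tap."""
    n, k = len(x), len(w)
    out = []
    for o in range(n + 2 * pad - k + 1):
        acc = 0.0
        for j in range(k):
            i = o - pad + j
            if 0 <= i < n:  # edge check executed in the innermost loop
                acc += x[i] * w[j]
        out.append(acc)
    return out


def conv1d_prebounded(x: List[float], w: List[float], pad: int) -> List[float]:
    """Same result, but the valid tap range is computed before the inner loop."""
    n, k = len(x), len(w)
    out = []
    for o in range(n + 2 * pad - k + 1):
        start = o - pad
        j_lo = max(0, -start)     # first tap that lands inside the input
        j_hi = min(k, n - start)  # one past the last tap inside the input
        acc = 0.0
        for j in range(j_lo, j_hi):  # no checks needed in here
            acc += x[start + j] * w[j]
        out.append(acc)
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
taps = [1.0, 0.0, -1.0, 2.0, 1.0]
print(conv1d_checked(signal, taps, pad=2) == conv1d_prebounded(signal, taps, pad=2))
```

Both versions visit the same taps in the same order, so the results are identical; the second simply moves the branch out of the hot loop.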
Shunting Zhang	1370f3a00d	[inductor] make mm template work with non-contiguous input (#126106 ) Fix https://github.com/pytorch/pytorch/issues/125437 . The Triton matmul template does not work well with non-contiguous inputs and causes misaligned memory access. This happens for both the inductor matmul template and the triton.ops.matmul op. This PR avoids adding `tl.multiple_of` and `tl.max_contiguous` if the input tensors are not contiguous, which works around the issue. We'll follow up and try to figure out the root cause in the GH issue. The if/else added to the template should be resolved at compile time and by itself should not cause any perf hit. Test: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --accuracy --only BertForMaskedLM --training ``` This previously failed with misaligned memory access and now passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126106 Approved by: https://github.com/htyu	2024-05-14 07:21:53 +00:00
mengfeil	60b00b4b4d	[CI] Upgrade intel support packages for XPU (#125655 ) upgrade intel basekit package to 0.5 for XPU Works for #114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125655 Approved by: https://github.com/EikanWang, https://github.com/chuanqi129, https://github.com/atalman	2024-05-14 06:50:23 +00:00
briancoutinho	c312cd8890	add simple test for nccl metadata (#125317 ) Add a few test cases to verify newly added NCCL metadata in profiler events The test looks at the following blocks record_param_comms ``` { "ph": "X", "cat": "cpu_op", "name": "record_param_comms", "pid": 2840966, "tid": 2844581, "ts": 2424859.045, "dur": 203.866, "args": { "Collective name": "allreduce", "Process Group Description": "default_pg", "dtype": "Float", "In msg nelems": 100, "Global rank start": 0, "Group size": 2, "Process Group Ranks": "[0, 1]", "Record function id": 0, "Out msg nelems": 100, "Global rank stride": 1, "Process Group Name": "0", } } ``` ## Unit test ``` >$ touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler test_ddp_profiling_torch_profiler (__main__.TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler) ... NCCL version 2.20.5+cuda12.0 STAGE:2024-05-01 16:41:15 2840966:2840966 ActivityProfilerController.cpp:316] Completed Stage: Warm Up STAGE:2024-05-01 16:41:15 2840965:2840965 ActivityProfilerController.cpp:316] Completed Stage: Warm Up STAGE:2024-05-01 16:41:17 2840965:2840965 ActivityProfilerController.cpp:322] Completed Stage: Collection STAGE:2024-05-01 16:41:17 2840966:2840966 ActivityProfilerController.cpp:322] Completed Stage: Collection STAGE:2024-05-01 16:41:17 2840966:2840966 ActivityProfilerController.cpp:326] Completed Stage: Post Processing STAGE:2024-05-01 16:41:17 2840965:2840965 ActivityProfilerController.cpp:326] Completed Stage: Post Processing STAGE:2024-05-01 16:41:18 2840966:2840966 ActivityProfilerController.cpp:316] Completed Stage: Warm STAGE:2024-05-01 16:41:18 2840965:2840965 ActivityProfilerController.cpp:316] Completed Stage: Warm Up STAGE:2024-05-01 16:41:18 2840965:2840965 ActivityProfilerController.cpp:322] Completed Stage: Collection STAGE:2024-05-01 16:41:18 2840966:2840966 
ActivityProfilerController.cpp:322] Completed Stage: Collection STAGE:2024-05-01 16:41:18 2840965:2840965 ActivityProfilerController.cpp:326] Completed Stage: Post Processing STAGE:2024-05-01 16:41:18 2840966:2840966 ActivityProfilerController.cpp:326] Completed Stage: Post Processing Trace saved to /tmp/tmpvwivp7mo.json Trace saved to /tmp/tmpvwvsc1fy.json ok ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125317 Approved by: https://github.com/LucasLLC, https://github.com/kwen2501	2024-05-14 06:20:50 +00:00
daitian1995	b805d3cbcb	Modify device check in capturable optimizer to support more devices (#124919 ) Fixes #124830 Modify device check in capturable optimizer to support more devices Pull Request resolved: https://github.com/pytorch/pytorch/pull/124919 Approved by: https://github.com/janeyx99	2024-05-14 05:56:00 +00:00
Wanchao Liang	e0e9d3ed79	make sure device mesh can be imported from torch.distributed (#126119 ) as titled Pull Request resolved: https://github.com/pytorch/pytorch/pull/126119 Approved by: https://github.com/kwen2501, https://github.com/anijain2305	2024-05-14 05:00:48 +00:00
Wanchao Liang	2ae65b72ff	[dtensor] early return for _split_tensor (#125810 ) As titled: if _split_tensor does not require padding, or the tensor is evenly sharded on the dim, there is no need to calculate padding and we can simply return. This avoids some unnecessary CPU operations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125810 Approved by: https://github.com/wz337	2024-05-14 04:59:27 +00:00
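The early-return pattern from #125810 can be sketched on plain Python lists. This is not DTensor's `_split_tensor`; `split_with_padding` is a hypothetical helper that only shows the shape of the optimization: when the length divides evenly, the chunks are returned immediately and no padding sizes are computed.

```python
from typing import List, Tuple


def split_with_padding(
    values: List[float], num_chunks: int
) -> Tuple[List[List[float]], List[int]]:
    """Split into num_chunks equal-length chunks, zero-padding where needed."""
    n = len(values)

    # Early return: evenly divisible, so no padding bookkeeping is required.
    if n % num_chunks == 0:
        size = n // num_chunks
        chunks = [values[i * size:(i + 1) * size] for i in range(num_chunks)]
        return chunks, [0] * num_chunks

    # Uneven case: every chunk is padded up to the ceiling chunk size.
    full = -(-n // num_chunks)  # ceil(n / num_chunks)
    chunks, pads = [], []
    for i in range(num_chunks):
        chunk = values[i * full:(i + 1) * full]
        pad = full - len(chunk)
        chunks.append(chunk + [0.0] * pad)
        pads.append(pad)
    return chunks, pads


print(split_with_padding([1.0, 2.0, 3.0, 4.0], 2))
print(split_with_padding([1.0, 2.0, 3.0, 4.0, 5.0], 2))
```

The first call takes the fast path; the second pads the last chunk with one zero and reports `[0, 1]` as the pad sizes.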
Yanbo Liang	bdaa9b2981	[Dynamo] Wrap set as SetVariable and support isdisjoint by polyfill (#126046 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/126046 Approved by: https://github.com/anijain2305, https://github.com/jansel	2024-05-14 04:56:06 +00:00
eellison	bc9587778c	update pointwise cat heuristics (#125772 ) Fix for https://github.com/pytorch/pytorch/issues/122871. There are two cases where we emit pointwise cat: - fusing into a pointwise use - horizontally fusing copy_ kernels The regression I looked into previously was due to being overly aggressive in the latter case. I've updated the logic there so that we only emit the horizontal fusion in the case where there are not reductions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125772 Approved by: https://github.com/Chillee	2024-05-14 04:46:27 +00:00
Yu, Guangye	d0f3ae8e67	[Doc] Update Intel GPU Support on README (#126001 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126001 Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/gujinghui, https://github.com/EikanWang	2024-05-14 04:42:58 +00:00
Michael Lazos	812534d27e	Skip two LR schedulers with eager memory leaks in compiled optim tests (#126133 ) SequentialLR and ChainedLR leak memory, so disable these two schedulers until https://github.com/pytorch/pytorch/issues/126131 is fixed. Re-enables https://github.com/pytorch/pytorch/issues/125925 https://github.com/pytorch/pytorch/issues/125925 https://github.com/pytorch/pytorch/issues/125924 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126133 Approved by: https://github.com/yanboliang, https://github.com/aorenste	2024-05-14 04:42:34 +00:00
Edward Z. Yang	9a2beb862d	Permit trivial solves for floating point equality in ShapeEnv (#125915 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125915 Approved by: https://github.com/lezcano ghstack dependencies: #125325	2024-05-14 04:10:01 +00:00
Edward Z. Yang	2ba102f689	Implement native support for float inputs in Dynamo and ShapeEnv (#125325 ) The big idea is that floats are treated as Tensors on input/output to the FX graph, but on the inside, we immediately call item() on the synthetic Tensor and record regular float operations on it. Canonicalization to Tensor operations will happen in a standalone FX pass. This behavior is controlled by `specialize_float` config variable when set to False. The generated graph looks like this for the test `test_unspec_float_output`: ``` def forward(self, L_x_: "f32[3]", L_y_: "f32[]"): l_x_ = L_x_ l_y_ = L_y_ # File: /data/users/ezyang/a/pytorch/test/dynamo/test_unspec.py:511 in f, code: return x + 1, y * 2 add: "f32[3]" = l_x_ + 1; l_x_ = None item: "Sym(zf0)" = l_y_.item(); l_y_ = None mul: "Sym(2*zf0)" = item * 2; item = None scalar_tensor: "f32[]" = torch.scalar_tensor(mul); mul = None return (add, scalar_tensor) ``` The ingredients: * torch/_dynamo/variables/builder.py When `specialize_float` is False, we wrap float literals with `wrap_symfloat`. This is an unholy mashup of `wrap_symint` and `wrap_unspecialized_primitive`. The overall strategy is that we first generate a tensor argument (because that's what we want to show up into the FX graph), but then immediately call item() on the tensor argument to get a SymNodeVariable, which we will do the rest of the tracing with. Importantly, this SymNodeVariable is backed with the source of the original float: this means we can guard on the resulting value (something we could NOT do with UnspecializedPythonVariable). This has to be done manually, because if you literally call item() on the tensor, you will end up with an unbacked float. There is a bit of copy paste from wrap_symint and wrap_unspecialized_primitive which we can try to factor out, but this really is its own thing and you should review every line of code in the function. 
* torch/fx/experimental/symbolic_shapes.py We now can generate guards on float inputs, and these guards are handled inside of ShapeEnv. So we need to be able to allocate (backed!) float symbols, and produce guards for them. Fairly straightforward generalization. * torch/_dynamo/codegen.py I also need to maintain the invariant that there are no float outputs to the FX graph. I chose to do this at codegen time. When we detect a SymNodeVariable on the return stack for a float, we on the fly convert it (via `as_tensor`) to a TensorVariable, which is the true output. We then special case the output bytecode to call item() on it again. The tensor conversion is memoized on SymNodeVariable since we typically run the code generation process twice. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125325 Approved by: https://github.com/lezcano, https://github.com/jansel	2024-05-14 04:10:01 +00:00
drisspg	04877dc430	Update context manager for cudnn (#126122 ) # Summay Updates the context manager to support cudnn backend This results were done using cuda toolkit 12-3 and cudnn 9.0.0. ## H100 Numbers _power limited_ ``` Markdown +--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+ \| Batch Size \| Sequence Length \| Heads \| Head Dim \| Flash Time (µs) \| CUDNN Time (µs) \| Speedup (CUDNN/Flash) \| +==============+===================+=========+============+===================+===================+=========================+ \| 1 \| 4096 \| 32 \| 64 \| 665.053 \| 498.59 \| 1.33387 \| +--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+ \| 1 \| 4096 \| 16 \| 128 \| 591.225 \| 323.828 \| 1.82574 \| +--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+ \| 1 \| 8192 \| 32 \| 64 \| 2579.77 \| 1933.34 \| 1.33436 \| +--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+ \| 1 \| 8192 \| 16 \| 128 \| 2297.4 \| 1211.33 \| 1.89659 \| +--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+ \| 1 \| 16384 \| 32 \| 64 \| 10178.2 \| 7619.18 \| 1.33587 \| +--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+ \| 1 \| 16384 \| 16 \| 128 \| 9093.51 \| 4725.03 \| 1.92454 \| +--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+ \| 1 \| 32768 \| 32 \| 64 \| 39893.1 \| 29850.6 \| 1.33643 \| +--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+ \| 1 \| 32768 \| 16 \| 128 \| 36160.9 \| 18615.9 \| 1.94247 \| 
+--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+ \| 1 \| 65536 \| 32 \| 64 \| 157965 \| 116794 \| 1.35251 \| +--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+ \| 1 \| 65536 \| 16 \| 128 \| 142039 \| 73102.1 \| 1.94303 \| +--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+ \| 1 \| 131072 \| 32 \| 64 \| 621100 \| 465143 \| 1.33529 \| +--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+ \| 1 \| 131072 \| 16 \| 128 \| 556142 \| 289776 \| 1.91922 \| +--------------+-------------------+---------+------------+-------------------+-------------------+-------------------------+ ``` ## A100 Numbers ```Markdown +--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+ \| Batch Size \| Sequence Length \| Heads \| Head Dim \| Flash Time (µs) \| CUDNN Time (µs) \| Flex Time (µs) \| Speedup (CUDNN/Flash) \| Speedup (Flex/Flash) \| +==============+===================+=========+============+===================+===================+==================+=========================+========================+ \| 1 \| 4096 \| 32 \| 64 \| 799.391 \| 836.327 \| 981.234 \| 0.955836 \| 0.814679 \| +--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+ \| 1 \| 4096 \| 16 \| 128 \| 750.131 \| 806.964 \| 944.766 \| 0.929572 \| 0.793986 \| +--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+ \| 1 \| 8192 \| 32 \| 64 \| 3211.84 \| 3234.41 \| 3803.09 \| 0.993022 \| 
0.844534 \| +--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+ \| 1 \| 8192 \| 16 \| 128 \| 2984.2 \| 3164.66 \| 3626.79 \| 0.942979 \| 0.822821 \| +--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+ \| 1 \| 16384 \| 32 \| 64 \| 12630.6 \| 12673.1 \| 14900.6 \| 0.996643 \| 0.847653 \| +--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+ \| 1 \| 16384 \| 16 \| 128 \| 11722.7 \| 12499.4 \| 13763.5 \| 0.937862 \| 0.851725 \| +--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+ \| 1 \| 32768 \| 32 \| 64 \| 50068.3 \| 51061.2 \| 60094 \| 0.980556 \| 0.833167 \| +--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+ \| 1 \| 32768 \| 16 \| 128 \| 46283.6 \| 49708.7 \| 55336.7 \| 0.931096 \| 0.836399 \| +--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+ \| 1 \| 65536 \| 32 \| 64 \| 203124 \| 203083 \| 239618 \| 1.0002 \| 0.847701 \| +--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+ \| 1 \| 65536 \| 16 \| 128 \| 187326 \| 198364 \| 221912 \| 0.944355 \| 0.844145 \| +--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+ \| 1 \| 131072 \| 32 \| 64 \| 
816813 \| 827419 \| 978836 \| 0.987182 \| 0.834473 \| +--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+ \| 1 \| 131072 \| 16 \| 128 \| 749693 \| 845463 \| 905696 \| 0.886725 \| 0.827754 \| +--------------+-------------------+---------+------------+-------------------+-------------------+------------------+-------------------------+------------------------+ ``` ## Script ``` Python import os import torch from typing import Callable from torch.nn.attention import SDPBackend, sdpa_kernel from itertools import product from tqdm import tqdm from tabulate import tabulate os.environ["TORCH_CUDNN_SDPA_ENABLED"] = "1" causal = False from triton.testing import do_bench from torch.nn.functional import scaled_dot_product_attention as sdpa def benchmark_torch_function_in_microseconds(func: Callable, *args, **kwargs) -> float: # warmup for _ in range(5): func(*args, **kwargs) return do_bench(lambda: func(*args, **kwargs)) * 1e3 def run_attention_test(backend_name, backend_enum): results = [] batch_sizes = [1] seq_lengths = [4096, 8192, 16384, 32768, 65536, 131072] torch.cuda.empty_cache() for b, s in tqdm(product(batch_sizes, seq_lengths), total=len(batch_sizes) * len(seq_lengths), desc=backend_name): for h, d in zip((32, 16), (64, 128)): q, k, v = torch.randn( b, s, h * d * 3, dtype=torch.bfloat16, device="cuda", requires_grad=False ).chunk(3, dim=-1) q = q.view(b, -1, h, d).transpose(1, 2) k = k.view(b, -1, h, d).transpose(1, 2) v = v.view(b, -1, h, d).transpose(1, 2) with torch.no_grad(), sdpa_kernel(backend_enum): time = benchmark_torch_function_in_microseconds(sdpa, q, k, v, is_causal=False) results.append((backend_name, b, s, h, d, time)) return results flash_results = run_attention_test("Flash Attention", SDPBackend.FLASH_ATTENTION) cudnn_results = run_attention_test("CUDNN Attention", SDPBackend.CUDNN_ATTENTION) # Combine results for comparison combined_results 
= [] for flash, cudnn in zip(flash_results, cudnn_results): speedup = flash[5] / cudnn[5] combined_results.append( (flash[1], flash[2], flash[3], flash[4], flash[5], cudnn[5], speedup) ) # Tabulate the results headers = [ "Batch Size", "Sequence Length", "Heads", "Head Dim", "Flash Time (us)", "CUDNN Time (us)", "Speedup (CUDNN/Flash)", ] table = tabulate(combined_results, headers, tablefmt="grid") print(table) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126122 Approved by: https://github.com/cpuhrsch	2024-05-14 03:34:19 +00:00
Bin Bao	aeb9934bda	[AOTI] Fix a problem in https://github.com/pytorch/pytorch/pull/125730 (#126110 ) Summary: `generate_c_shim_extern_kernel_call` needs to handle tensor args wrapped with wrap_with_raii_handle_if_needed, to fix some internal test failures Differential Revision: D57293873 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126110 Approved by: https://github.com/huydhn	2024-05-14 02:16:04 +00:00
albanD	71467abc44	Changes to compile with 3.13 (#126033 ) This is mainly: - Fix refcount access macro - Hide all the Dynamo code that needs update as usual - Add _PyWeakref_ClearRef as an extern provided by CPython. Including the pycore header that defines it would require raw c include shenanigans that I don't think are worth it. This allows building with both the regular and nogil versions of cpython. Note that this requires the 3.13 branch at least past d3094744d40de2deefbda9b1996d5029c9ebf0b0, which we need for the mimalloc include and the weakref function being exposed. Debug-only issues in pybind11 with PyMem_MALLOC vs PyObject_MALLOC should be synced either by updating pybind or cpython. @colesbury I can send a PR to ifdef the proper use in pybind if you think that this is the best solution here? Pull Request resolved: https://github.com/pytorch/pytorch/pull/126033 Approved by: https://github.com/colesbury	2024-05-14 02:14:57 +00:00
Oguz Ulgen	ef7d8ad6af	Use source code hash instead of torch version (#126092 ) Differential Revision: [D57289808](https://our.internmc.facebook.com/intern/diff/D57289808/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126092 Approved by: https://github.com/masnesral, https://github.com/jansel	2024-05-14 01:53:31 +00:00
Oguz Ulgen	3c4058cf18	Add master cache disable switch for inductor (#126084 ) Fixes #125699 Differential Revision: [D57284558](https://our.internmc.facebook.com/intern/diff/D57284558/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126084 Approved by: https://github.com/jansel	2024-05-14 01:19:28 +00:00
angelayi	c712b0f8a3	[export] Fix runtime assertions to add call_function (#125878 ) Fixes [internal issue](https://www.internalfb.com/intern/everpaste/?handle=GJCK9xUNpYXovnEBAHfuJ7vQLxZnbsIXAAAB) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125878 Approved by: https://github.com/ezyang	2024-05-14 00:57:50 +00:00
Sun, Jiayi	6a5acd91c3	add shape check for rrelu_with_noise (#122870 ) Fix https://github.com/pytorch/pytorch/issues/121094. Add shape check for rrelu_with_noise, check whether the shape of input tensor and noise tensor are the same. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122870 Approved by: https://github.com/mingfeima, https://github.com/ezyang	2024-05-14 00:12:00 +00:00
Shuqiang Zhang	6db3271007	[c10d] Add an option for NAN check on every collective (#125726 ) Summary: The NAN CHECK is done through device side assert without copying needed from GPU to CPU Test Plan: Unit test for collectives that should experience run time error (sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (38f5143e)]$ python test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_nan_assert /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)` failed. [rank0]:[E507 17:31:56.885473996 Utils.cu:30] CUDA error during checkForNan: device-side assert triggered /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)` failed. 
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)` failed. /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)` failed. [rank1]:[E507 17:31:56.128961534 Utils.cu:30] CUDA error during checkForNan: device-side assert triggered . ---------------------------------------------------------------------- Ran 1 test in 7.723s OK Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/125726 Approved by: https://github.com/kwen2501	2024-05-14 00:05:41 +00:00
Nicolas Macchioni	1e47c7b11b	[inductor] enable software pipelining on AMD devices (#125858 ) Summary: per-AMD, software pipelining is enabled by setting `num_stages=0`, and should provide a nice perf boost for GEMMs. caveat is that `num_stages=1` is preferred for instances of back-to-back GEMMs, but take `num_stages=0` as the better default. wait to land until triton upstream lands in OSS, pipelining does not work well on the fork Test Plan: n/a Reviewed By: xw285cornell, yoyoyocmu Differential Revision: D56221447 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125858 Approved by: https://github.com/pragupta, https://github.com/yoyoyocmu	2024-05-13 22:36:23 +00:00
Lucas Pasqualin	ec7f2b2626	[DCP] adds type safety to str filtering in EmptyStateDict (#126082 ) Differential Revision: [D57281133](https://our.internmc.facebook.com/intern/diff/D57281133/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126082 Approved by: https://github.com/fegin, https://github.com/wz337, https://github.com/Skylion007	2024-05-13 22:13:05 +00:00
PyTorch MergeBot	bd3cbdba2f	Revert "[optim] add fused_adagrad support for CPU device (#124905 )" This reverts commit 1c3fe8403365db3cc9b75524ae742e3027b745e2. Reverted https://github.com/pytorch/pytorch/pull/124905 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing distributed multigpu test in trunk `1c3fe84033` ([comment](https://github.com/pytorch/pytorch/pull/124905#issuecomment-2108777063))	2024-05-13 20:53:22 +00:00
Giuseppe Ottaviano	36e6f3b339	[caffe2] Make all get_backtrace() implementations lazy (#125750 ) (#126064 ) Summary: #125682 (D56586844) added support for lazy symbolization to `Error` and adopted it for internal use cases; this commit adopts it for `get_backtrace()` as well. Test Plan: Sandcastle and GH CI. NOTE: This is a resubmit of D56881683, a spurious copypasted line in the Android implementation broke the build, but this was not surfaced by diff tests. Reproed the breakage with ``` $ fbpython scripts/build_android_app/build_android_app.py --buck-config-files='@//fbandroid/mode/have_libgflags @//fbandroid/mode/static_linking @//xplat/langtech/mobile/android_opt_buck_config_with_et_boltnn' --build-target='fbsource//xplat/langtech/mobile:transcribe_binAndroid-android-arm64' ``` Verified that the fixed diff builds successfully. Differential Revision: D57275456 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126064 Approved by: https://github.com/ezyang	2024-05-13 20:17:41 +00:00
Richard Barnes	c098cd0cbb	Eliminate a C++11 code pattern in pimpl.h (#126069 ) Test Plan: Sandcastle Differential Revision: D57224687 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126069 Approved by: https://github.com/Skylion007	2024-05-13 19:01:13 +00:00
Richard Barnes	b9e7b35912	Remove caffe2 from more build files (#125898 ) Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125898 Approved by: https://github.com/Skylion007	2024-05-13 18:37:59 +00:00
albanD	b620231378	Fix nested fqn discovery (#125957 ) I think I missed some fix! Pull Request resolved: https://github.com/pytorch/pytorch/pull/125957 Approved by: https://github.com/sanketpurandare, https://github.com/janeyx99	2024-05-13 18:24:56 +00:00
angelayi	9e1826deff	[torchbind] Add inductor support (#123709 ) Example inductor generated python code: [P1245776497](https://www.internalfb.com/phabricator/paste/view/P1245776497) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123709 Approved by: https://github.com/eellison	2024-05-13 18:18:17 +00:00
Tom Ritchford	4d8fa7df40	Fix four misspellings of "its" in documentation (#125681 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125681 Approved by: https://github.com/Skylion007, https://github.com/svekars	2024-05-13 18:14:09 +00:00
Jeeja	7f1d5aba93	[FSDP] Use generic device handle instead of cuda (#121620 ) In FSDP `_optim_utils.py`, use a generic device handle instead of cuda to support other backends. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121620 Approved by: https://github.com/awgu, https://github.com/wz337	2024-05-13 18:07:08 +00:00
Daniele Trifirò	3183d65ac0	use shutil.which in _find_cuda_home (#126060 ) Replace `subprocess.check_output` call with `shutil.which`, similarly to how this is done in `_find_rocm_home` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126060 Approved by: https://github.com/r-barnes	2024-05-13 17:38:17 +00:00
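The lookup pattern this commit describes can be sketched as follows. This is a minimal illustration, not the code in `_find_cuda_home`: the helper name `find_toolkit_home`, its `path` parameter, and the assumption that the compiler lives in `<root>/bin/` are all hypothetical.

```python
import os
import shutil
from typing import Optional


def find_toolkit_home(compiler: str = "nvcc", path: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: infer a toolkit root from a compiler found on PATH."""
    # shutil.which returns the full path of the executable, or None if it
    # cannot be found; no subprocess is spawned.
    compiler_path = shutil.which(compiler, path=path)
    if compiler_path is None:
        return None
    # Assumed layout: <root>/bin/<compiler>, so strip two path components.
    return os.path.dirname(os.path.dirname(compiler_path))
```

Compared with shelling out through `subprocess.check_output`, this avoids starting a process and turns the "not found" case into a plain `None` check instead of an exception.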
Jason Ansel	637074983e	[inductor] Make load_mask() codegen deterministic (#126017 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126017 Approved by: https://github.com/shunting314	2024-05-13 17:36:52 +00:00
David Berard	82edc8b5d5	[NT] Make NestedTensor register as having symbolic sizes/strides (#124687 ) Fixes #123698 This PR makes TensorImpl::has_symbolic_sizes_strides return true for NestedTensors. 1. It passes in the actual sizes when we call `_make_wrapper_subclass` - this is the change that makes the subclass register as `has_symbolic_sizes_strides() == True` 2. It adds a field to `_make_wrapper_subclass` where an explicit `numel` can be provided. This allows us to skip the numel computation for the storage, which previously failed due to arithmetic on NestedInts. 3. Implements `aten::numel` for NJT - this is separate from the overridden numel in `make_wrapper_subclass` for now. Note also that this means that we leave `dispatch_sizes_strides_policy="sizes"`, so that we call into the custom `numel` implementation (as well as `sizes` and `strides`), because `numel` cannot currently be computed from `sizes` for NJT. Note also that this depends on #121361, because calling TensorImpl::set_sizes_and_strides() tries to clone the sizes into the tensor, which means that we need `clone` to be implemented on NestedInt. Differential Revision: [D57225736](https://our.internmc.facebook.com/intern/diff/D57225736) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124687 Approved by: https://github.com/albanD	2024-05-13 16:50:25 +00:00
Masaki Kozuki	96bdb7a0fb	in `test_foreach.py` patch `KINETO_LOG_LEVEL` to silence profiler log (#126048 ) As per title, `patch.dict` the env var in favor of cleaner logs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126048 Approved by: https://github.com/janeyx99	2024-05-13 15:31:56 +00:00
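A minimal sketch of temporarily overriding an environment variable with `unittest.mock.patch.dict`, the technique named in this commit. The value `"5"` and the helper `current_level` are assumptions made for illustration; the commit message does not say which value the test sets.

```python
import os
from typing import Optional
from unittest.mock import patch


def current_level() -> Optional[str]:
    """Hypothetical helper: read the log-level variable, if set."""
    return os.environ.get("KINETO_LOG_LEVEL")


before = current_level()

# patch.dict applies the override only inside the block and restores the
# original mapping (including an originally missing key) on exit.
with patch.dict(os.environ, {"KINETO_LOG_LEVEL": "5"}):  # "5" is an assumed value
    inside = current_level()

after = current_level()
```

Because the override is scoped to the `with` block, other tests in the same process see the environment unchanged.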
Richard Howell	7899034282	[fbcode] remove xcode_public_headers_symlinks (#125966 ) Summary: These attributes do nothing in Buck 2, we can remove them. Test Plan: ``` $ buck2 uquery //... > /dev/null ``` Differential Revision: D57169445 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125966 Approved by: https://github.com/malfet	2024-05-13 15:06:35 +00:00
Richard Barnes	56b271fd7a	`STRONG_CONSTEXPR` -> `constexpr` (#125872 ) Test Plan: Sandcastle Differential Revision: D57158890 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125872 Approved by: https://github.com/Skylion007	2024-05-13 14:07:26 +00:00
Edward Z. Yang	f0c8b93487	Make wrapIndexOnce check async, avoid DtoH sync on index_put_ (#125952 ) Internal xref: https://fb.workplace.com/groups/1075192433118967/posts/1427156211255919/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125952 Approved by: https://github.com/lezcano	2024-05-13 13:28:45 +00:00
PyTorch UpdateBot	c0b7b56cf4	[xla hash update] update the pinned xla hash (#126052 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126052 Approved by: https://github.com/pytorchbot	2024-05-13 12:36:51 +00:00
Shaz Qadeer	afda6685ae	fixed typo in documentation (#125974 ) Summary: Fixed typo in documentation. Trying to get familiar with the PR workflow for contributing to PyTorch. Test Plan: None Pull Request resolved: https://github.com/pytorch/pytorch/pull/125974 Approved by: https://github.com/ezyang	2024-05-13 04:37:51 +00:00
haozhe.zhu	1c3fe84033	[optim] add fused_adagrad support for CPU device (#124905 ) Add fused_adagrad kernel support for CPU. ## Bench result: 32 core/sockets ICX Test Scripts: https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969 ``` Tensor Size: 262144, Num Tensor 4, Num Threads: 1 _single_tensor_adagrad time: 0.2500 seconds _fused_adagrad time: 0.0933 seconds Tensor Size: 4194304, Num Tensor 32, Num Threads: 32 _single_tensor_adagrad time: 2.8819 seconds _fused_adagrad time: 1.7591 seconds ``` ## Test Plan: ``` python test_optim.py -k test_fused_matches_forloop python test_optim.py -k test_fused_large_tensor python test_optim.py -k test_can_load_older_state_dict python test_optim.py -k test_grad_scaling_autocast_fused_optimizers python test_torch.py -k test_grad_scaling_autocast_fused python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step ``` Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-05-13 01:16:20 +00:00
cyy	4b88a5bd0b	Remove AnalyzeTemporaryDtors from clang-tidy config (#125985 ) Remove AnalyzeTemporaryDtors from clang-tidy config which is not used in newer releases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125985 Approved by: https://github.com/Skylion007	2024-05-12 21:44:33 +00:00
Aaron Gokaslan	34910f87f0	[BE]: Update ruff to v0.4.4 (#125031 ) Update ruff version to 0.4.4. This version mostly has bugfixes for the new parser and also updates the f-string rule to be able to apply more fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125031 Approved by: https://github.com/albanD, https://github.com/malfet	2024-05-12 20:02:37 +00:00
Jeff Daily	ae9a4fa63c	[ROCm] enforce ROCM_VERSION >= 6.0 (#125646 ) Remove any code relying on ROCM_VERSION < 6.0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125646 Approved by: https://github.com/albanD, https://github.com/eqy	2024-05-12 18:01:28 +00:00
cyy	0116ffae7f	Remove deprecated _aminmax operator (#125995 ) It has been deprecated for a long time. Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125995 Approved by: https://github.com/ezyang	2024-05-12 17:50:17 +00:00
Jiong Gong	037615b989	[inductor][cpp] GEMM template (infra and fp32) (#124021 ) This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info. 1. Cpp template infrastructure Similar template abstractions as the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`. The MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates. 2. Initial FP32 gemm template This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`) requiring `N` to be a multiple of register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction. 3. Correctness and performance The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static shape and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements with follow-up PRs including the optimizations in kernels as well as fusions. The perf gains are only observed from a select number of models compared to the ATen kernels which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details. 
Static shapes \| Benchmark \| torchbench \| huggingface \| timm_models \| \|------------\|-------------\|--------------\|--------------\| \| Multi-threaded (baseline) \| 1.47x \| 1.36x \| 1.91x \| \| Multi-threaded (max-autotune) \| 1.47x \| 1.36x \| 1.92x \| \| Single-threaded (baseline) \| 1.56x \| 1.19x \| 1.51x \| \| Single-threaded (max-autotune) \| 1.56x \| 1.19x \| 1.52x \| Key models being sped up: drq: 1.14x soft_act: 1.12 cait_m36_384: 1.18x Dynamic shapes \| Benchmark \| torchbench \| huggingface \| timm_models \| \| --- \| --- \| --- \| --- \| \| Multi-threaded (baseline) \| 1.43x \| 1.28x \| 1.85x \| \| Multi-threaded (max-autotune) \| 1.47x \| 1.28x \| 1.85x \| \| Single-threaded (baseline) \| 1.55x \| 1.20x \| 1.51x \| \| Single-threaded (max-autotune) \| 1.56x \| 1.19x \| 1.53x \| Key models being sped up: BERT_pytorch: 1.22x pyhpc_turbulent: 1.13x soft_actor_critic: 1.77x BlenderbotForCausalLM: 1.09x cait_m36_384: 1.17x Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021 Approved by: https://github.com/jansel	2024-05-12 07:46:44 +00:00
Yukio Siraichi	02093b6c6a	Keep track of `ViewMeta` with symbolic inputs. (#125876 ) Fix: #125387 This PR helps keep track of whether an instantiated `ViewMeta` has symbolic values as input or not. This is used for checking whether we use the AOTAutograd `ViewMeta`-replay execution path, e.g. it doesn't support tensors that have `ViewMeta` with symbolic inputs. In summary, the changes are: - Add the field `ViewMeta::has_symbolic_inputs` and make it a required constructor parameter - Add the field `FunctionalTensorWrapper::is_symbolic_` and the method `FunctionalTensorWrapper::maybe_mark_symbolic` - Marks a `FunctionalTensorWrapper` as symbolic iff any of its `ViewMeta` have symbolic inputs - Add the plumbing of `FunctionalTensorWrapper::is_symbolic` to the Python API - Codegen the computation of `ViewMeta::has_symbolic_inputs` for each view operation - Use the AOTAutograd `ViewMeta`-replay path if: - `target_functional_tensor` is not `None`; and - `target_functional_tensor` is not symbolic (instead of using a functorch config) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125876 Approved by: https://github.com/ezyang	2024-05-12 01:41:06 +00:00
albanD	6ffc94fa62	Fix cpp node instance check (#125875 ) Mostly visible when calling multi_grad_hook and thus using this to test it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125875 Approved by: https://github.com/jackiexu1992, https://github.com/ezyang	2024-05-11 21:31:12 +00:00
Ke Wen	07d6ab5aa2	[pipelining] Add pipeline schedules (#125975 ) 1. Add pipeline schedules: - GPipe - 1F1B - Interleaved 1F1B - LoopedBFS 2. Add basic forward and backward tests: test_schedule.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/125975 Approved by: https://github.com/wconstab ghstack dependencies: #125729	2024-05-11 21:17:53 +00:00
Edward Z. Yang	f19e07b056	Memoize local_scalar_dense calls, refactor all memos (#125623 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125623 Approved by: https://github.com/eellison	2024-05-11 21:12:35 +00:00
Animesh Jain	0935b3d794	[dynamo] Turn on guard_nn_modules (#125202 ) Turning on guard_nn_modules adds a large number of guards, so we are bound to take a perf hit. But the perf hit is small. These are the numbers ![image](https://github.com/pytorch/pytorch/assets/13822661/c8793906-c8c7-432b-9af4-4594713067be) First we observe that compared to Python guards, C++ guards give around 6x speedup. This reduces the total time spent in guards. This is shown in the last column (cpp_guards/inductor_optimized_latency). The worst model is around 1.61%, with most of the models below 1%. I think this is a good enough signal to turn the config on. One might also wonder how much guard slowdown occurs with `guard_nn_modules=True`. This is the table ![image](https://github.com/pytorch/pytorch/assets/13822661/932a885b-1c03-424b-8405-5bc8fd35dd39) For most models, the guard overhead with nn module guards is under 2x. There are a few outliers, where the slowdown is really high and for those models we spend 1%-2% time in C++ guards as shown in the first table. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125202 Approved by: https://github.com/ezyang	2024-05-11 19:28:24 +00:00
Bin Bao	0dda3389e5	[AOTI][torchgen] Minor improvements to C shim torchgen (#125928 ) Summary: Make some improvements to https://github.com/pytorch/pytorch/pull/125589 * Add a .default suffix to default ops in fallback_ops.py, to make it clear that those are OpOverload. * Update warnings and comments based on feedback on https://github.com/pytorch/pytorch/pull/125589 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125928 Approved by: https://github.com/angelayi ghstack dependencies: #125291, #125730, #125731	2024-05-11 18:12:46 +00:00
Bin Bao	2df114e6be	[AOTI] Fix 'int' object is not subscriptable (#125731 ) Summary: for https://github.com/pytorch/pytorch/issues/117369 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125731 Approved by: https://github.com/chenyang78 ghstack dependencies: #125291, #125730	2024-05-11 18:12:46 +00:00
cyy	3f11958d39	Remove FFMPEG from CI scripts (#125546 ) Because FFMPEG was solely used by Caffe2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125546 Approved by: https://github.com/r-barnes, https://github.com/kit1980, https://github.com/albanD, https://github.com/malfet, https://github.com/seemethere	2024-05-11 16:46:13 +00:00
PyTorch MergeBot	d49abf039a	Revert "update pointwise cat heuristics (#125772 )" This reverts commit d19d932183f265f5108e6cc30f514d88060a67d7. Reverted https://github.com/pytorch/pytorch/pull/125772 on behalf of https://github.com/izaitsevfb due to Fails numerical stability test for aps model, see D57215900 ([comment](https://github.com/pytorch/pytorch/pull/125772#issuecomment-2105932504))	2024-05-11 15:27:44 +00:00
PyTorch MergeBot	d5470749bc	Revert "[dynamo][disable] Move disable impl to its own __call__ method (#125486 )" This reverts commit d474d79420dbb0c0ba7e203d63d953afcbb595a4. Reverted https://github.com/pytorch/pytorch/pull/125486 on behalf of https://github.com/izaitsevfb due to Fails internal tests, see D57216402 ([comment](https://github.com/pytorch/pytorch/pull/125486#issuecomment-2105925702))	2024-05-11 15:01:58 +00:00
Yanbo Liang	a174c536f8	GPT-fast benchmark: adding memory bandwidth and use A100-40GB as target (#125881 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125881 Approved by: https://github.com/Chillee	2024-05-11 10:46:54 +00:00
Michael Lazos	b24ad7eab5	Enable dynamo traced test_param_group_with_lrscheduler_goes_right_direction (#124544 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124544 Approved by: https://github.com/janeyx99 ghstack dependencies: #125825, #125826	2024-05-11 06:29:59 +00:00
Michael Lazos	e72ef4f22a	Fix capturable enablement conditions (#125826 ) Only enable capturable if state hasn't been initialized and all parameters are on CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125826 Approved by: https://github.com/anijain2305 ghstack dependencies: #125825	2024-05-11 06:29:59 +00:00
Michael Lazos	b833fc0ecb	Tighten fallback conditions for compiled optim (#125825 ) Since we now will support `capturable=False` when it's valid, narrow the eager fallback conditions to the cases where `compile` will fail. The lone case here is when the user deletes the capturable flag; `state_steps` are on cuda and `capturable` is `False`. Because a cuda tensor is not supported in the `value` kwarg for foreach ops this results in an error. The fallback wrapper is changed to check the device of `state_steps` if `capturable=False`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125825 Approved by: https://github.com/janeyx99	2024-05-11 06:29:51 +00:00
Zhengxu Chen	1115a25c36	Add obc counter for TS migration. (#125986 ) Summary: Since table caffe2_pytorch_usage_stats only has 1 day retention which renders it useless for TS migration purposes, we want to build a lightweight counter mechanism to collect usage data about torch jit APIs which can monitor the usage decline in the long term. Test Plan: CI Reviewed By: SherlockNoMad Differential Revision: D57216847 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125986 Approved by: https://github.com/gmagogsfm	2024-05-11 05:14:02 +00:00
PyTorch MergeBot	7e92a2c1c9	Revert "Allow symbols to reach conv_layout stride argument (#125829 )" This reverts commit 013722bcb89b9f450d03ce2bd3ed81db6a89d97d. Reverted https://github.com/pytorch/pytorch/pull/125829 on behalf of https://github.com/malfet due to Broke inductor tests, see https://github.com/pytorch/pytorch/actions/runs/9028121462/job/24809113861 ([comment](https://github.com/pytorch/pytorch/pull/125829#issuecomment-2105545503))	2024-05-11 04:43:36 +00:00
Nikita Shulga	e9c5f1cb80	[MPS] Improve _int4pack_mm (#125983 ) By dispatching it as a 2D kernel, which improves data locality and bumps perf from 5.9 to 6.6 tokens per sec on M2 Pro. Also includes other minor cleanups. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125983 Approved by: https://github.com/mikekgfb	2024-05-11 04:40:40 +00:00
hippocookie	9f4bb4d6bc	Enable UFMT format on test/test_throughput_benchmark.py test/test_type_hints.py test/test_type_info.py (#125906 ) Fixes some files in https://github.com/pytorch/pytorch/issues/123062 Run lintrunner on files: test/test_throughput_benchmark.py test/test_type_hints.py test/test_type_info.py ```bash $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125906 Approved by: https://github.com/shink, https://github.com/soulitzer, https://github.com/malfet	2024-05-11 04:32:01 +00:00
Huy Do	9dee3ef919	Ingest gpt-fast benchmark results from S3 to Rockset (#125891 ) A follow-up of https://github.com/pytorch/pytorch/pull/125450, this extends the `tools/stats/upload_dynamo_perf_stats.py` script to upload arbitrary benchmark results in CSV format. * Upload gpt-fast benchmarks to a new Rockset collection `benchmarks/oss_ci_benchmark`. The file is in the following format: ``` $ cat test/test-reports/gpt_fast_benchmark.csv name,mode,target,actual,percentage Llama-2-7b-chat-hf,bfloat16,104,104.754128,100.73% ``` * The CSV output needs to be kept in `test/test-reports` directory. * Re-use the existing `.github/workflows/upload-test-stats.yml` workflow ### Testing Run the commands manually ``` (py3.11) huydo@huydo-mbp pytorch % python3 -m tools.stats.upload_artifacts --workflow-run-id 9026179545 --workflow-run-attempt 1 --repo "pytorch/pytorch" Using temporary directory: /var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmp6eug3cdz Downloading test-jsons-runattempt1-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmp6eug3cdz/test-jsons-runattempt1-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip to s3://gha-artifacts/pytorch/pytorch/9026179545/1/artifact/test-jsons-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip Downloading test-reports-runattempt1-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmp6eug3cdz/test-reports-runattempt1-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip to s3://gha-artifacts/pytorch/pytorch/9026179545/1/artifact/test-reports-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip (py3.11) huydo@huydo-mbp pytorch % python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 9026179545 --workflow-run-attempt 1 --repo "pytorch/pytorch" --head-branch "ciflow/inductor-micro-benchmark/125891" --rockset-collection 
oss_ci_benchmark --rockset-workspace benchmarks --match-filename "^gpt_fast_benchmark" Using temporary directory: /var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmp8xr4sdxk Downloading test-reports-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip Extracting test-reports-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip to unzipped-test-reports-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212 Processing gpt_fast_benchmark from test-reports-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip Writing 3 documents to Rockset Done! ``` Also run a sanity check on ingesting inductor benchmark results: ``` (py3.11) huydo@huydo-mbp pytorch % python -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 8997654356 --workflow-run-attempt 1 --repo pytorch/pytorch --head-branch main --rockset-collection torch_dynamo_perf_stats --rockset-workspace inductor --match-filename "^inductor_" ... Writing 4904 documents to Rockset Done! ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125891 Approved by: https://github.com/yanboliang	2024-05-11 04:16:36 +00:00
Bin	c1690a3e12	Fix the link to torch.compiler_custom_backends. (#125865 ) Tiny fix. Fixes #119272 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125865 Approved by: https://github.com/soulitzer	2024-05-11 04:13:44 +00:00
Huy Do	0a9c6e92f8	Skip test_memory_format_nn_BatchNorm2d in inductor (#125970 ) Skipping the test in the context of https://github.com/pytorch/pytorch/issues/125967 until the issue is root caused and fixed properly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125970 Approved by: https://github.com/clee2000	2024-05-11 04:11:18 +00:00
Aleksei Nikiforov	da7ced6e8c	S390x binaries (#120398 ) Allow building nightly, rc and release binaries for s390x. This PR implements building binaries, but the publishing part is currently missing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120398 Approved by: https://github.com/huydhn	2024-05-11 02:32:25 +00:00
Ke Wen	d8708a35f6	[pipelining] Add _PipelineStage runtime (#125729 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125729 Approved by: https://github.com/wconstab	2024-05-11 01:59:18 +00:00
PyTorch MergeBot	c6e5d0d2e6	Revert "Memoize local_scalar_dense calls, refactor all memos (#125623 )" This reverts commit fcbf2b61e6f40048ef0e6d77360c86771956cc9c. Reverted https://github.com/pytorch/pytorch/pull/125623 on behalf of https://github.com/malfet due to Broke ROCM, see https://github.com/pytorch/pytorch/actions/runs/9026074378/job/24804583041 ([comment](https://github.com/pytorch/pytorch/pull/125623#issuecomment-2105444091))	2024-05-11 01:58:39 +00:00
hippocookie	01fb9676b8	Enable UFMT format on test/license.py test/logging.py (#125737 ) Fixes some files in #123062 Run lintrunner on files: test/license.py test/logging.py ```bash $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125737 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-05-11 01:52:35 +00:00
Aaron Orenstein	a5c93a6899	Speed up _extract_graph_with_inputs_outputs (#125937 ) _extract_graph_with_inputs_outputs() does membership testing on the input nodes but often that collection is a list so the test is O(n). Ensure it's a set before looping over all the nodes. This change speeds up the internal repro (D57090987) by about 18%: before: ``` 708.88user 15.86system 12:16.19elapsed 98%CPU (0avgtext+0avgdata 12898628maxresident)k 0inputs+91968outputs (3major+3532970minor)pagefaults 0swaps ``` after: ``` 583.39user 15.98system 10:10.11elapsed 98%CPU (0avgtext+0avgdata 12895108maxresident)k 0inputs+87488outputs (4major+3374582minor)pagefaults 0swaps ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125937 Approved by: https://github.com/oulgen, https://github.com/anijain2305	2024-05-11 00:20:39 +00:00
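The membership-testing point in #125937 can be sketched in isolation. This is only an illustration of the list-versus-set idea, not the actual `_extract_graph_with_inputs_outputs` code; the function and variable names below are hypothetical.

```python
def keep_non_inputs(all_nodes, inputs):
    """Return the nodes that are not graph inputs (hypothetical helper)."""
    # Convert once up front: `x in some_list` is O(n) per check,
    # while `x in some_set` is O(1) on average.
    input_set = set(inputs)
    return [node for node in all_nodes if node not in input_set]


nodes = ["a", "b", "c", "d"]
print(keep_non_inputs(nodes, ["b", "d"]))
```

With many nodes and a large input collection, the one-time `set(...)` conversion replaces a linear scan per node with a hash lookup per node.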
cyy	4457cd9a30	[Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987 ) This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following #124701. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124987 Approved by: https://github.com/malfet	2024-05-11 00:03:52 +00:00
David Chiu	31946c10d0	Add missing parameter doc of Adagrad (#125886 ) Add the missing documentation for `initial_accumulator_value` parameter in Adagrad, and update the algorithm description in the documentation (adjusted to reflect the implementation). Pull Request resolved: https://github.com/pytorch/pytorch/pull/125886 Approved by: https://github.com/janeyx99	2024-05-10 22:55:22 +00:00
PyTorch MergeBot	ee804d256b	Revert "[caffe2] Make all get_backtrace() implementations lazy (#125750 )" This reverts commit cc4da72b47ef63b7c448f0de4cdbdd792e9195ea. Reverted https://github.com/pytorch/pytorch/pull/125750 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125750#issuecomment-2105285301))	2024-05-10 21:23:10 +00:00
cyy	45628e3b66	Remove Caffe2 python (#125143 ) This PR tries to decompose https://github.com/pytorch/pytorch/pull/122527 into a smaller one. Caffe2 python build scripts were removed and some tensorboard code using Caffe2 was removed too. To be noted, this was inspired and is co-dev with @r-barnes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125143 Approved by: https://github.com/r-barnes, https://github.com/albanD	2024-05-10 21:15:43 +00:00
Catherine Lee	b08072f645	[CI] Relax per proc memory by a little bit, mark a test as serial (#125960 ) test failure is here https://github.com/pytorch/pytorch/actions/runs/9036789873/job/24836020415 * OOMs etc rel to https://github.com/pytorch/pytorch/pull/125598 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125960 Approved by: https://github.com/huydhn	2024-05-10 21:11:39 +00:00
Irene Huang	c61bfd24c1	[PT2] Register fake impl for quantized embedding bag ops (#125884 ) Summary: Register fake impl for quantized embedding bag ops (e.g. quantized::embedding_bag_4bit_rowwise_offsets) and bypass registration if it has been registered. Test Plan: Before: ``` NotImplementedError: quantized::embedding_bag_4bit_rowwise_offsets: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered ``` See context here - https://fb.workplace.com/groups/1075192433118967/permalink/1423106614994212/ After: Snapshot was published successfully with PT2Archive. ``` AIMP_DISABLE_PRUNING=false fdb buck2 run mode/opt-split-dwarf -c python.package_style=inplace -c fbcode.enable_gpu_sections=true lego/scripts:lego_cli -- debug-locally --model_entity_id 545861329 --config_version 14 --publish_context OFFLINE_PUBLISH --lego_pipeline aiplatform.modelstore.model_generation.lego.lego_pipeline_builder.gmpp_lego_pipeline --gmpp_config '{"gmpp_pipeline_descriptor": "aiplatform.modelstore.model_generation.v1.ads_pipelines.aimp_pyper_pipeline.model_generation_pipeline", "worker_process_number":24, "worker_thread_per_process_number": 12, "use_work_assignment": true}' --publish_config_overrides '{"gpu_inference_options": "{\"submodules_to_lower\": []}"}' 2>&1 \| tee ./gmpp_lc_aimp.txt ``` Reviewed By: ydwu4 Differential Revision: D57172944 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125884 Approved by: https://github.com/ydwu4	2024-05-10 21:11:22 +00:00
Bin Bao	538877d204	[AOTI] Fix convolution_backward (#125730 ) Summary: for https://github.com/pytorch/pytorch/issues/125922 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125730 Approved by: https://github.com/chenyang78 ghstack dependencies: #125291	2024-05-10 20:13:34 +00:00
Bin Bao	aca0807101	[AOTI] Use random inputs to autotune the backward pass (#125291 ) Summary: This is for JIT Inductor with cpp wrapper, fixing https://github.com/pytorch/pytorch/issues/117367. In the backward pass, we don't have real inputs to execute the backward pass to autotune kernels. We have 3 options here, 1) use random tensor inputs; 2) store the forward outputs and feed them to backward (non-trivial because of parameter re-ordering); 3) autotune each kernel with random inputs in a subprocess (similar to select_algorithm). This PR uses the easiest option 1. Option 3 is where we are going as the next step, which will simplify the cpp wrapper codegen for the CUDA backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125291 Approved by: https://github.com/chenyang78, https://github.com/angelayi	2024-05-10 20:13:34 +00:00
David Berard	9e85d3d830	Add "accurate" FlopCounter implementations for NestedTensor SDPA kernels (#125776 ) This adds implementations for: * _flash_attention_forward * _efficient_attention_forward * _flash_attention_backward * _efficient_attention_backward These flop counts are implemented as follows: * Unbind the batch elements * Calculate flops individually for each element in the batch * Sum the final result This means that we are accessing the concrete sequence lengths (which could be slow, and may trigger a GPU/CPU sync); but, the FLOP numbers will vary with the sparsity of the NestedTensor - more accurate than if we just assumed we padded everything. Differential Revision: [D57120139](https://our.internmc.facebook.com/intern/diff/D57120139) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125776 Approved by: https://github.com/Chillee	2024-05-10 19:49:37 +00:00
PyTorch MergeBot	4dad988822	Revert "Remove vision packages from CI scripts (#125546 )" This reverts commit f42ea14c3f795082138421fcef90d24f64c6fd35. Reverted https://github.com/pytorch/pytorch/pull/125546 on behalf of https://github.com/huydhn due to I think we are using vision in inductor tests with their various models there ([comment](https://github.com/pytorch/pytorch/pull/125546#issuecomment-2105174723))	2024-05-10 19:43:23 +00:00
James Wu	0e853327cb	Implement wrappers for aot_dedup and aot_synthetic_base (#125764 ) it's kind of gross that aot_synthetic base requires storing the old fw_metadata's InputInfo, but it is what it is. After this change, aot_dispatch_base's runtime wrappers should all be implemented. After this, I'll start working on aot_dispatch_autograd's remaining runtime wrapping changes Pull Request resolved: https://github.com/pytorch/pytorch/pull/125764 Approved by: https://github.com/bdhirsh ghstack dependencies: #125610	2024-05-10 19:33:35 +00:00
David Chiu	c520929c83	add typing in torch.optim.lr_scheduler (#125556 ) Merge torch/optim/lr_scheduler.pyi into torch/optim/lr_scheduler.py Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125556 Approved by: https://github.com/janeyx99	2024-05-10 19:28:00 +00:00
Masaki Kozuki	59f2e716cc	Test foreach functions with all dtypes except qints (#125527 ) Set `dtypes` and the others to all dtypes except qints, with some required xfails Related to #124726. Co-authored-by: janeyx99 <janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125527 Approved by: https://github.com/eqy, https://github.com/janeyx99	2024-05-10 18:56:37 +00:00
chengzeyi	10c17b13d7	fix cudnn attention check (#122391 ) For CUDNN attention, besides the packed QKV layout with limited support of sequence length (seq_len <= 512) and head dim requirements, also support the more generic "arbitrary sequence length with flash attention" as stated in `Transformer Engine`: `8e672ff075/transformer_engine/common/fused_attn/fused_attn.cpp (L126)` More about "fused flash attention" in CUDNN graph api: https://docs.nvidia.com/deeplearning/cudnn/developer/graph-api.html#fused-flash-attention-fprop Pull Request resolved: https://github.com/pytorch/pytorch/pull/122391 Approved by: https://github.com/eqy, https://github.com/drisspg	2024-05-10 18:52:38 +00:00
Catherine Lee	bef7d650c4	[CI] 3 procs on sm86 (#125598 ) yolo iirc the a10g/sm86 runners have ~21 GB of space, so we can increase parallelism on it to 3. This results in about 6GB CUDA mem per proc. The previous calculation + 2 procs resulted in about 8 GB Also fixes the calc for per proc memory, assuming that CUDA context + anything else take a little under 1GB of space (previous calc was .11 on about 7.5 - 8 GB <= .9GB) Times on main are about 1.9-2.5hr per shard This commit is around 1.6-2hr per shard Risks: increase in flaky tests due to OOM Pull Request resolved: https://github.com/pytorch/pytorch/pull/125598 Approved by: https://github.com/huydhn	2024-05-10 18:48:43 +00:00
Nikita Shulga	ff98731803	Speedup `convert<float>(Vectorized<half>::loadu(ptr, 8))` on ARM (#125889 ) By replacing `vdupq_n_f16(0)` with simple `std::memset`. Otherwise Apple's clang fails to dead-code eliminate that instruction, which results in a slower codepath. I.e. following [snippet](https://godbolt.org/z/c757TaM1Y) (that mimics vec library parts) ```cpp #include <arm_neon.h> #include <tuple> #include <cstring> struct Foo { Foo() = default; Foo(float16x8x2_t v) : values(v) {} operator float16x8x2_t() const { return values; } float16x8x2_t values; }; struct Bar { Bar() = default; Bar(float32x4x2_t v) : values(v) {} Bar(float32x4_t val0, float32x4_t val1) : values{val0, val1} {} inline void store(float* ptr) { vst1q_f32(ptr, values.val[0]); vst1q_f32(ptr + 4, values.val[1]); } float32x4x2_t values; }; inline Foo loadu(const void* ptr, int64_t count) { if (count == 16) { return vld1q_f16_x2(reinterpret_cast<const float16_t*>(ptr)); } else if (count == 8) { Foo res; res.values.val[0] = vld1q_f16(reinterpret_cast<const float16_t*>(ptr)); //res.values.val[1] = vdupq_n_f16(0); std::memset(&res.values.val[1], 0, sizeof(res.values.val[1])); return res; } float16_t tmp_values[16]; for (auto i = 0; i < 16; ++i) { tmp_values[i] = 0; } std::memcpy( tmp_values, reinterpret_cast<const float16_t*>(ptr), count * sizeof(float16_t)); return vld1q_f16_x2(reinterpret_cast<const float16_t*>(tmp_values)); } inline std::tuple<Bar, Bar> convert_half_float(const Foo& a) { float16x8x2_t arr = a; float16x8_t x = arr.val[0]; float16x8_t y = arr.val[1]; float32x4_t x1 = vcvt_f32_f16(vget_low_f16(x)); float32x4_t x2 = vcvt_f32_f16(vget_high_f16(x)); float32x4_t y1 = vcvt_f32_f16(vget_low_f16(y)); float32x4_t y2 = vcvt_f32_f16(vget_high_f16(y)); return { Bar(x1, x2), Bar(y1, y2) }; } inline Bar cvt(const Foo& x) { auto rc = convert_half_float(x); return std::get<0>(rc); } void convert(const float16_t* inp, float* outp) { for(auto idx = 0; idx < 1024; idx += 8) { auto tmp0 = loadu(inp + idx, 8); auto 
tmp1 = cvt(tmp0); tmp1.store(outp + idx); } } Foo load8(const float16_t* inp) { return loadu(inp, 8); } ``` if compiled with `-O3 -fno-unsafe-math-optimizations` produces ```asm convert(half const, float): 0000000000000000 add x8, x1, #0x10 0000000000000004 mov x9, #-0x8 0000000000000008 ldr q0, [x0], #0x10 ; Latency: 4 000000000000000c fcvtl v1.4s, v0.4h ; Latency: 2 0000000000000010 fcvtl2 v0.4s, v0.8h ; Latency: 2 0000000000000014 stp q1, q0, [x8, #-0x10] ; Latency: 4 0000000000000018 add x9, x9, #0x8 000000000000001c add x8, x8, #0x20 0000000000000020 cmp x9, #0x3f8 0000000000000024 b.lo 0x8 0000000000000028 ret load8(half const): 000000000000002c ldr q0, [x0] ; Latency: 4 0000000000000030 movi.2d v1, #0000000000000000 ; Latency: 2 0000000000000034 ret ``` but with `vdupq_n_f16` same yielded ```asm convert(half const, float): 0000000000000000 add x8, x1, #0x10 0000000000000004 mov x9, #-0x8 0000000000000008 ldr q0, [x0], #0x10 ; Latency: 4 000000000000000c scvtf s1, wzr ; Latency: 10 0000000000000010 fcvt h1, s1 ; Latency: 4 0000000000000014 fcvtl v1.4s, v0.4h ; Latency: 2 0000000000000018 fcvtl2 v0.4s, v0.8h ; Latency: 2 000000000000001c stp q1, q0, [x8, #-0x10] ; Latency: 4 0000000000000020 add x9, x9, #0x8 0000000000000024 add x8, x8, #0x20 0000000000000028 cmp x9, #0x3f8 000000000000002c b.lo 0x8 0000000000000030 ret load8(half const): 0000000000000034 scvtf s1, wzr ; Latency: 10 0000000000000038 ldr q0, [x0] ; Latency: 4 000000000000003c fcvt h1, s1 ; Latency: 4 0000000000000040 dup.8h v1, v1[0] ; Latency: 7 0000000000000044 ret ``` (see `scvtf` completely eliminated from `convert` code and replaced with faster `movi.2d` in `load8`) Fixes https://github.com/pytorch/pytorch/issues/125735 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125889 Approved by: https://github.com/desertfire	2024-05-10 18:18:30 +00:00
Brian Hirsh	f25c7c9699	functionalize storage resizing, minimal ppFSDP traceable forward (#122434 ) More details further down, but first a more high-level description of "how do we functionalize storage resizing" Today, dynamo converts `param.untyped_storage().resize_(x)` calls that it sees from fsdp into a custom op, `ops.inductor.resize_storage_bytes_(x)` So given this setup, there are 3 main cases that I think we want to handle: (1) graph input starts with a real storage size, gets resized down to zero in the graph (2) graph input starts with 0 storage size, gets resized up in the graph (3) graph input starts with 0 storage size, gets resized up and used in some compute, then resized back down to 0 For case (1) we need to emit a `resize_storage_bytes_` at the end of the graph, similar to how we emit `copy_()` for data mutations. For case (2), we need to emit a `resize_storage_bytes_` in the graph, and we also need to emit a `copy_()` (the input had its storage resized up, and filled in with data, which we need to reflect as an input mutation) For case (3), the net effect is that the input had no data on entry and exit of the function, so we don't need to emit any mutable ops in the end of the graph. The main thing to call out is that: we need to write a functionalization rule for `resize_storage_bytes_`, (`FunctionalTensorWrapper::storage_resize_()`) and this rule actually does very little. We would like to not emit any new ops in the graph (like say, a functional resize op). Instead, we should expect / rely on the fact that any resize up will be immediately followed by a `copy_()`/`foreach_copy_`/`out=` op, that will fill in the data of the tensor. So `FunctionalTensor` can temporarily live in a state where its data is invalid, until the `x.copy_(y)` "updates" its data with the new tensor. So effectively, all that this rule does is: (1) it stores metadata on the storage, indicating that the tensor was resized, as well as the updated storage size. 
We need this info in AOTAutograd, so it knows whether to emit a mutable resize_() op in the graph epilogue (2) There is also a corner case: if we are resizing down to zero, but our tensor had previously had a zero size storage, then we update `value_` to point to the original value of the tensor. The reason this seems safe is because if we have a zero storage sized tensor `x`, and we resize it up, use it in some compute, resize it back down to zero, and use it somewhere, we would want the functional version of this code to use the original `x` after the second resize. For FSDP, this is important because we end up saving parameters (graph inputs) for backward, and we want to make sure that the thing we save (and the output to the forward graph) is the original, zero-storage-sized parameter, and not the "version 2" of the parameter after the first resize_() I think a good order to look at changes in this PR would be: (1) `test_aotdispatch.py` shows the 3 main cases I focused on as well as the expected functionalized graphs (2) In `FunctionalStorageImpl.h/cpp`, I had to add a notion of "original base", and "original/curr_size". The first is so I can re-use the zero-size tensor after multiple resizes, and the second is so I can tell in AOTAutograd whether any resizes canceled each other out into a no-op (3) FunctionalTensorWrapper.h/cpp has the new resize functionalization rule + some extra utils (4) `_functorch/_autograd`: the main changes in this folder were around adding the logic at trace-time to detect when we need to put a resize_() in the graph. I also have some assertions to check that any inputs that experience storage resizing will always be in the graph and not the opaque epilogue, and I also limited the resize_() mutation case so that you can only ever start with zero storage, or end with zero storage (you can't do e.g. 
`torch.ones(2).storage().resize_(3)`), and banned it on tensor subclasses (5) `fake_tensor.py`/`meta_utils.py`: we now need to be able to fakeify tensors with zero storage, so I added a quick version of it in meta_utils.py. This also.. has ramifications for fake tensor caching that I need to fix (include the storage size on the cache key, maybe?) ------------------ This PR subsumes https://github.com/pytorch/pytorch/pull/120971. This PR is enough to almost get a simple ppFSDP forward pass tracing with a functionalized resize_() properly. It also attempts to do the updated version from @jansel, where we don't have any notion of `resize_()` in the graph at all, post functionalization. It would probably be good to test it with @yf225 's FSDP changes, and see how many of the FX passes it allows us to remove. I think that in theory, it should allow us to remove all FX passes that affect the forward graph / partitioner, except the one that forces views to be recomputed in the backward (more details below). There are a few things worth calling out: (1) failed attempt at functionalizing `aten.copy_()`. I originally wanted to get a version that takes these operations: ``` param.storage().resize_(all_gather_size) param.copy_(all_gather_buffer) out = aten.matmul(param, param) ``` and functionalizes them into: ``` out = aten.matmul(all_gather_buffer, all_gather_buffer) ``` This would involve getting functionalization to turn `x.copy_(y)` into a giant no-op that just returns `y`. Unfortunately, we can't actually do this in a reasonable way within functionalization (instead, there's a functional `aten.copy` in the graph - see the test case graph expecttest for details). Why? In order for that transformation to be safe, `x` and `y` need to have the same metadata. However, it's possible for `x` and `y` to be subclasses of different types. This is not something we can easily tell from within functionalization, and would be a layering violation. 
So for now I'm leaving it to downstream code to optimize away the `aten.copy` (this is already the case today, so I think inductor can handle this) (2) The forward doesn't actually run successfully in this PR (see the `assertRaisesRegex` in the test). Why? The final forward graph looks like this: ``` def forward(self, primals_1, primals_2): _foreach_copy = torch.ops.aten._foreach_copy.default([primals_1], [primals_2]); primals_2 = None getitem = _foreach_copy[0]; _foreach_copy = None mm = torch.ops.aten.mm.default(getitem, getitem); getitem = None t_1 = torch.ops.aten.t.default(primals_1); primals_1 = None return [mm, t_1] ``` Where `primals_1` starts out as a secretly-zero-storage-size parameter, and gets resized up and back down within the forward (these are functionalized away). Importantly, the matmul happens on the result of the `foreach_copy`, but the activation that we save for backward (`t_1`) is the result of transposing the original parameter (the zero-storage-size param). This is exactly the optimization in fsdp that allows us to have good peak memory usage. The problem is that the min-cut partitioner decides to save `t_1` for backward. Running this code in eager breaks, because the kernel for `aten.permute(x)` is not happy when `x` has secretly-zero-sized-storage. The real problem here is that in eager mode the `permute` kernel runs during the backward, after backward hooks have properly resized the saved activation. Here, we are running the transpose in the forward. One option would be to turn off the checks in our view kernels and allow them to work on zero-storage-sized tensors, which feels pretty bad. Another option is to tweak the partitioner (or use one of Will's FX passes) to force the partitioner to not save views for backward, and allow the views to be recomputed in the backward. This seems kind of silly, but is also probably harmless. (3) The backward is still broken. 
To be fair, this issue is pretty separable from "functionalizing storage resize calls", and can be fixed later (either by a real fix to our tracing infra, or via another hacky FX pass). This problem is described in more detail at issue (8) of my PR description in https://github.com/pytorch/pytorch/pull/120971 (4) I only added support for "full graph" resizing: basically, the limited case where a param starts with zero storage size, and gets resized up and back down. I think we can add support for the graph break case, but I think we can keep that add-on separate from this PR unless we need it immediately. I also added asserts so we should fail loudly when we hit this case (5) I have a change to FakeTensor creation when inputs have zero storage size that.. is probably ok. But I also removed FakeTensor caching on view ops, which I probably need to fix before I can land this PR (6) I added a notion of "original_base" to `FunctionalStorageImpl`. More details are in the comments, but my rationale for this was that we basically need it to ensure that autograd saves the original, zero-storage-sized param for backward, after resizing up and back down (7) I had to update our eager kernels for `aten.copy` and `aten._foreach_copy`, to handle the case where the `self` argument has secretly-zero-storage. Inductor can probably generate correct code for this case, but we need these ops to work properly in this situation for the `aot_eager` backend to do the right thing Pull Request resolved: https://github.com/pytorch/pytorch/pull/122434 Approved by: https://github.com/jansel	2024-05-10 18:09:10 +00:00
cyy	f42ea14c3f	Remove vision packages from CI scripts (#125546 ) Because they were solely used by Caffe2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125546 Approved by: https://github.com/r-barnes, https://github.com/kit1980, https://github.com/albanD	2024-05-10 17:53:48 +00:00
Tugsbayasgalan Manlaibaatar	d7fe3c4123	[RELAND] Switch default behavior of export IR to be predispatch (#125860 ) This PR switches export IR from aot-dispatch to pre-dispatch IR. What is pre-dispatch IR and why should you care? Currently the default IR returned by torch.export can contain only functional ATen operators after ALL pytorch dispatcher decompositions (for example, CompositeImplicitAutograd) run. In contrast, pre-dispatch IR refers to an IR that can contain all functional ATen operators (i.e., not just from the core subset), before any decomposition happens, as well as operators that manipulate autograd state. Pre-dispatch IR closely resembles eager PyTorch computation, but is still functional and serializable by torch.export. As a result: You can train the pre-dispatch IR in eager mode as the IR contains necessary information for the autograd engine to automatically generate a backward graph. You can write sound graph transformations more easily as the IR is functional. Since it is an ATen IR, it is still normalized. For example, torch.add has multiple overloads, but aten.add.Tensor is unique in this IR. If you want to get the core aten IR out of torch.export, you will need to: ``` ep = torch.export.export(M(), inputs) ep_for_core_aten = ep.run_decompositions() ``` Differential Revision: [D57172986](https://our.internmc.facebook.com/intern/diff/D57172986) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125860 Approved by: https://github.com/zhxchen17	2024-05-10 17:36:53 +00:00
Xuehai Pan	4996a3fda3	[BE][Easy] Remove usage of deprecated `ast.Str`, `ast.Ellipsis` and `ast.NameConstant` (#125912 ) `ast.Str`, `ast.Ellipsis`, and `ast.NameConstant` are deprecated in Python 3.8 and will be removed in Python 3.14. Replace them with `ast.Constant`. Ref: https://docs.python.org/3/library/ast.html#node-classes > Changed in version 3.8: Class [ast.Constant](https://docs.python.org/3/library/ast.html#ast.Constant) is now used for all constants. > > Deprecated since version 3.8: Old classes ast.Num, ast.Str, ast.Bytes, ast.NameConstant and ast.Ellipsis are still available, but they will be removed in future Python releases. In the meantime, instantiating them will return an instance of a different class. CI log: https://github.com/metaopt/torchopt/actions/runs/9031146681/job/24816802280?pr=216#step:11:6706 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125912 Approved by: https://github.com/soulitzer	2024-05-10 17:35:35 +00:00
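The replacement described in #125912 can be sketched with the standard library alone. The helper names below are hypothetical and are not taken from the PyTorch change; they only show how an `ast.Constant` check stands in for the deprecated `ast.Str`, `ast.Ellipsis` and `ast.NameConstant` node classes.

```python
import ast


def is_str_literal(node):
    # Previously: isinstance(node, ast.Str)
    return isinstance(node, ast.Constant) and isinstance(node.value, str)


def is_ellipsis_literal(node):
    # Previously: isinstance(node, ast.Ellipsis)
    return isinstance(node, ast.Constant) and node.value is Ellipsis


def is_name_constant(node):
    # Previously: isinstance(node, ast.NameConstant), i.e. True, False or None
    return isinstance(node, ast.Constant) and (
        node.value is None or isinstance(node.value, bool)
    )


def string_literals(source):
    """Collect the values of all string literals in `source`."""
    return [n.value for n in ast.walk(ast.parse(source)) if is_str_literal(n)]


print(string_literals("x = 'a'\ny = ...\nz = None\nw = 'b'"))
```

Because `ast.Constant` covers every literal kind, the distinguishing check moves from the node class to the type or identity of `node.value`.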
Richard Barnes	53a64e446f	`STRONG_NODISCARD` -> `[[nodiscard]]` (#125873 ) Test Plan: Sandcastle Differential Revision: D57158864 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125873 Approved by: https://github.com/Skylion007	2024-05-10 17:10:53 +00:00
James Wu	5f58cf65d1	Refactor other post compile wrappers in forward functions (#125610 ) This continues the refactor by creating CompilerWrappers for FakifiedOut, RngFunctionalization, and aot_dispatch_subclass_wrapper. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125610 Approved by: https://github.com/bdhirsh	2024-05-10 17:01:40 +00:00
Giuseppe Ottaviano	cc4da72b47	[caffe2] Make all get_backtrace() implementations lazy (#125750 ) Summary: #125682 (D56586844) added support for lazy symbolization to `Error` and adopted it for internal use cases; this commit adopts it for `get_backtrace()` as well. Test Plan: Sandcastle and GH CI. Differential Revision: D56881683 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125750 Approved by: https://github.com/ezyang	2024-05-10 16:02:40 +00:00
Yu, Guangye	31372fa842	Support generic stream/event on CUDA/HIP backend (#125757 ) # Motivation According to [#123611](https://github.com/pytorch/pytorch/pull/123611), we support generic stream/event on CUDA backend. # Additional Context new method/attribute on `torch.Event` for cuda - torch.Event.event_id - torch.Event.elapsed_time - torch.Event.synchronize new method on `c10::Event` on cuda backend - c10.Event.event_id - c10.Event.elapsed_time - c10.Event.synchronize Pull Request resolved: https://github.com/pytorch/pytorch/pull/125757 Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/EikanWang	2024-05-10 13:34:09 +00:00
Bin Bao	946b96fd54	[AOTI] Add a failing test case (#123235 ) Summary: Reported in https://github.com/pytorch/pytorch/issues/123210, and can be reproduced in AOTI Pull Request resolved: https://github.com/pytorch/pytorch/pull/123235 Approved by: https://github.com/chenyang78	2024-05-10 11:22:06 +00:00
Yanbo Liang	f87fbfdb01	GPT-fast benchmark: remove Embedding layer from model size (#125901 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/125901 Approved by: https://github.com/Chillee	2024-05-10 08:18:13 +00:00
DanilBaibak	d81db9c1df	GitHub workflows / Dynamic rollout (#125680 ) This PR introduces a tool to dynamically switch between ARC runners and old runners without having to update the PR to the latest version. There is also a third option - use both runners at the same time (aka shadow deployment). In this case, failed workflows using ARC launchers will not block merge process. The GitHub issue is used to control access to ARC launchers - [Access Rules Issue](https://github.com/pytorch/test-infra/issues/5132): * In the FIRST comment you can specify who will use the ARC runners: * Add a GitHub username to use ARC runners. * Add "" at the beginning to switch ALL users to ARC runners. Add "!" at the beginning to switch ALL users to old runners. * In the SECOND comment you can specify do we need to run ARC runners and old runners at the same time. * To use both runners, add a second comment with the word "both". * If we want to use only one type of runners, just remove the second comment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125680 Approved by: https://github.com/ZainRizvi	2024-05-10 07:32:10 +00:00
cyy	2ed17e0b1e	Remove binaries using caffe2 functionality (#125885 ) This PR removed some binaries using deleted or to be deleted Caffe2 functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125885 Approved by: https://github.com/r-barnes, https://github.com/Chillee	2024-05-10 06:21:10 +00:00
laithsakka	013722bcb8	Allow symbols to reach conv_layout stride argument (#125829 ) #Fix https://github.com/pytorch/pytorch/issues/125638 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125829 Approved by: https://github.com/anijain2305	2024-05-10 06:13:36 +00:00
Edward Z. Yang	fcbf2b61e6	Memoize local_scalar_dense calls, refactor all memos (#125623 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125623 Approved by: https://github.com/eellison	2024-05-10 01:52:55 +00:00
atalman	8be4104cf3	Update conda to latest version for Docker release builds (#125887 ) Fixes https://github.com/pytorch/pytorch/issues/125879 Issue is somewhat similar to this issue: https://github.com/pytorch/pytorch/issues/106470 doing: ``` conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia ``` pulls cpu version of pytorch,vision and audio here: https://github.com/pytorch/pytorch/actions/runs/9014006158/job/24795924934#step:11:6849 ``` 16 37.21 mpmath-1.2.1 \| py311_0 1.2 MB pytorch-nightly #16 37.21 nettle-3.7.3 \| hbbd107a_1 809 KB #16 37.21 networkx-3.1 \| py311h06a4308_0 3.3 MB #16 37.21 openh264-2.1.1 \| h4ff587b_0 711 KB #16 37.21 pillow-9.3.0 \| py311h3fd9d12_2 874 KB pytorch-nightly #16 37.21 pytorch-2.4.0.dev20240509 \| py3.11_cpu_0 87.1 MB pytorch-nightly #16 37.21 pytorch-cuda-12.1 \| ha16c6d3_6 7 KB pytorch-nightly #16 37.21 pytorch-mutex-1.0 \| cpu 3 KB pytorch-nightly #16 37.21 sympy-1.12 \| py311h06a4308_0 14.4 MB #16 37.21 torchaudio-2.2.0.dev20240509\| py311_cpu 5.1 MB pytorch-nightly #16 37.21 torchvision-0.19.0.dev20240509\| py311_cpu 7.3 MB pytorch-nightly ``` Updating conda to latest and rebuilding solved this issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125887 Approved by: https://github.com/huydhn	2024-05-10 01:43:59 +00:00
Nikita Shulga	d14d6127f6	[BE] Rename `macos-12` to `macos-13`/`macos-` jobs (#125859 ) As CI does not have any MacOS 12 runners anymore Cleanup any misleading references about cross-compilation as M1 builds are done natively for quite some time Pull Request resolved: https://github.com/pytorch/pytorch/pull/125859 Approved by: https://github.com/ZainRizvi	2024-05-10 01:30:29 +00:00
Yu, Guangye	2ad794550a	Support generic stream/event on XPU backend (#125751 ) # Motivation According to [#123611](https://github.com/pytorch/pytorch/pull/123611), we support generic stream/event on XPU backend. # Additional Context new method/attribute on `torch.Event` for xpu - torch.Event.event_id - torch.Event.elapsed_time - torch.Event.synchronize new method on `c10::Event` on xpu backend - c10.Event.event_id - c10.Event.elapsed_time - c10.Event.synchronize Pull Request resolved: https://github.com/pytorch/pytorch/pull/125751 Approved by: https://github.com/jgong5, https://github.com/albanD	2024-05-10 01:27:30 +00:00
eellison	d19d932183	update pointwise cat heuristics (#125772 ) Fix for https://github.com/pytorch/pytorch/issues/122871. There are two cases where we emit pointwise cat: - fusing into a pointwise use - horizontally fusing copy_ kernels The regression I looked into previously was due to being overly aggressive in the latter case. I've updated the logic there so that we only emit the horizontal fusion in the case that we would have to emit separate copy_ kernels anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125772 Approved by: https://github.com/Chillee	2024-05-10 01:07:39 +00:00
Wang, Eikan	978b572652	Add registration API for torch.compile-eager (#121387 ) This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545. In this PR, we intend to provide a registration API dedicated to eager-through-torch.compile. The major workflow of this API will be as follows. - Load cache - Check cache according to the input tensors - Cache Hit: Run the cached kernel directly - Cache Miss: Run the AOTI to produce kernel and run the produced kernel. If AOTI fails to produce the kernel, invoke the python fallback function. Currently, this PR always falls back to the python kernel; the cache mechanism will be implemented in another PR - https://github.com/pytorch/pytorch/pull/116368 Differential Revision: [D57164385](https://our.internmc.facebook.com/intern/diff/D57164385) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121387 Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/zou3519, https://github.com/jgong5	2024-05-10 00:30:27 +00:00
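The workflow listed in the entry above (load cache, check it against the inputs, run the cached kernel on a hit, compile on a miss, fall back to Python if compilation fails) can be sketched in plain Python. This is only an illustration of the control flow, not the registration API itself: `make_eager_op`, `compile_kernel`, and `python_fallback` are hypothetical names, and the cache key (input type names) is an assumption standing in for whatever tensor metadata the real cache uses.

```python
def make_eager_op(compile_kernel, python_fallback, cache=None):
    """Hypothetical sketch of the load / check / hit / miss / fallback flow."""
    kernels = {} if cache is None else cache  # "load cache"

    def run(*inputs):
        # "Check cache according to the input": keyed on input type names here.
        key = tuple(type(x).__name__ for x in inputs)
        kernel = kernels.get(key)
        if kernel is None:
            # Cache miss: try to produce a kernel, else use the Python fallback.
            try:
                kernel = compile_kernel(*inputs)
            except Exception:
                return python_fallback(*inputs)
            kernels[key] = kernel
        # Cache hit (or freshly compiled): run the kernel directly.
        return kernel(*inputs)

    return run


compiles = []

def compile_double(x):
    # Stand-in for an expensive compilation step; records each invocation.
    compiles.append(type(x).__name__)
    return lambda v: v * 2

def broken_compile(x):
    raise RuntimeError("no kernel produced")

double = make_eager_op(compile_double, lambda v: v + v)
safe = make_eager_op(broken_compile, lambda v: v + v)
```

Calling `double` repeatedly with integers compiles only once, while `safe` always takes the fallback path because its compile step fails.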
Pian Pawakapan	c9a258e474	[export] handle constant aliasing for export (#125509 ) Summary: Currently export will [error out](`2b5ae2611e/torch/export/_trace.py (L477)`) if a constant is aliased. This PR supports this by modifying ConstantAttrMap to map constants to a list of FQNs instead of a single FQN, populating the ExportedProgram constants dict to contain multiple entries to the same constant. Test Plan: added test case in test_export.py Differential Revision: D56955654 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125509 Approved by: https://github.com/angelayi, https://github.com/ydwu4	2024-05-10 00:14:37 +00:00
chilli	fd816bf630	Add script for removing Inductor dependencies from Inductor generated code (#125811 ) Usage: ```python TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 python foo.py TORCHINDUCTOR_DUMP_LAUNCH_PARAMS=1 python /tmp/torchinductor_chilli/js/cjsbczkf6fj36nhaxxypll6cy4fmwmkoauklrgrvuody2mn7oeef.py python remove_inductor_deps.py /tmp/torchinductor_chilli/js/cjsbczkf6fj36nhaxxypll6cy4fmwmkoauklrgrvuody2mn7oeef.py ``` Example generated code: https://pastebin.com/m6Ae8heB Pull Request resolved: https://github.com/pytorch/pytorch/pull/125811 Approved by: https://github.com/chenyang78	2024-05-10 00:00:25 +00:00
Jiong Gong	3267814d53	[inductor] refactor: device dispatch inside do_bench (#125736 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125736 Approved by: https://github.com/shunting314	2024-05-09 23:50:02 +00:00
angelayi	13545fe68a	[export] Don't create a new fake mode if dynamo tracing (#125185 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125185 Approved by: https://github.com/mikekgfb	2024-05-09 23:43:08 +00:00
Richard Barnes	23e71ffd82	Remove unused caffe2 subdirs (#125818 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125818 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-05-09 22:57:55 +00:00
Sergii Dymchenko	350a3ed82f	Fix unused variable 'kEps' (#125870 ) Summary: > fbcode/caffe2/caffe2/utils/math_gpu_test.cc:227:17: error: unused variable 'kEps' [-Werror,-Wunused-const-variable] See https://www.internalfb.com/intern/test/844425000398735?ref_report_id=0 Created from CodeHub with https://fburl.com/edit-in-codehub Test Plan: Sandcastle run Reviewed By: r-barnes Differential Revision: D56731004 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125870 Approved by: https://github.com/seemethere, https://github.com/atalman	2024-05-09 22:57:37 +00:00
Animesh Jain	477612c0f6	[dynamo] Clear GenerationTracker on dynamo reset (#125855 ) Fixes https://github.com/pytorch/pytorch/issues/125567 Not doing this causes modules to be unspecialized when tests run in sequence, and specialized when run alone. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125855 Approved by: https://github.com/jansel	2024-05-09 22:47:54 +00:00
Gustav Larsson	52fad83335	[onnx.export] Avoid linear look up in env for exist_in_env (#124909 ) This PR is part of a series of PRs to significantly speed up torch.onnx.export for models with many nodes (e.g. LLM). See #121422 for more analysis. - As part of torch.onnx.export, a reverse look-up is made in env. This is done for each node, and this look-up costs time proportional to the graph size, which incurs an overall O(N^2) time complexity. - A pragmatic solution is simply to keep a separate data structure to make this de facto constant time. So, this introduces a set containing all the values of env. Open to other ideas. Ideally `exist_in_env` wouldn't be needed at all, but to preserve current behavior exactly I'm not sure how that can be done. - Resolves (4) in #121422. - This code change and the choice of py::set looks a bit more natural on top of #123063, where the env is changed from a std::unordered_map to a py::dict. Partially fixes #121422 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124909 Approved by: https://github.com/srikris-sridhar, https://github.com/justinchuby	2024-05-09 22:38:00 +00:00
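The idea in the entry above, keeping a set of values next to the mapping so that a reverse look-up no longer scans every entry, can be sketched as follows. This is an illustration in Python rather than the C++/pybind11 code the PR touches; the `Env` class and its method names are hypothetical, and the sketch assumes keys are only ever added, never overwritten or removed.

```python
class Env:
    """Hypothetical stand-in for the exporter's env (key -> value mapping)."""

    def __init__(self):
        self._map = {}
        self._values = set()  # mirrors the values of _map for O(1) membership

    def add(self, key, value):
        # Assumption: keys are never overwritten, so the set never goes stale.
        self._map[key] = value
        self._values.add(value)

    def exist_in_env_linear(self, value):
        # Old behaviour: scan all values, O(N) per query, O(N^2) over N nodes.
        return any(v == value for v in self._map.values())

    def exist_in_env(self, value):
        # New behaviour: average constant-time set membership.
        return value in self._values


env = Env()
for i in range(5):
    env.add(f"node{i}", f"value{i}")
```

Both look-ups return the same answers; only the cost per query differs.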
Zhengxu Chen	37d2ecd123	Only log toplevel torchscript calls. (#125714 ) Summary: as title. Test Plan: CI Reviewed By: gmagogsfm Differential Revision: D57069719 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125714 Approved by: https://github.com/SherlockNoMad	2024-05-09 22:29:53 +00:00
Aaron Orenstein	e43d656921	FakeTensor speedup: minor cleanups (#124224 ) A few cleanup tasks that didn't really fit into the other diffs in this stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124224 Approved by: https://github.com/oulgen ghstack dependencies: #122911, #124223	2024-05-09 22:11:51 +00:00
Aaron Orenstein	a08be4b705	FakeTensor speedup: Split cache_key so we only validate once (#124223 ) When dispatching a fake tensor op we cache the result with `(op, args)` as the key. There are some args (such as one with a dynamic output shape) where the output can't be cached. Instead of validating the args every time we compute the cache only validate the args when we first see a new cache key. 18.3% FakeTensor perf win on the microbenchmark (21.7% cumulative) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124223 Approved by: https://github.com/oulgen, https://github.com/masnesral ghstack dependencies: #122911	2024-05-09 22:11:51 +00:00
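A minimal sketch of the "validate only when a cache key is first seen" idea from the entry above. It does not reproduce FakeTensor internals: `OpCache`, the `validate` callback, and the bypass sentinel are hypothetical, and plain Python callables stand in for ops.

```python
_BYPASS = object()  # marks keys whose results must not be cached


class OpCache:
    """Hypothetical cache keyed on (op, args) that validates each key once."""

    def __init__(self, validate):
        self._validate = validate
        self._entries = {}
        self.validate_calls = 0

    def dispatch(self, op, args):
        key = (op, args)
        entry = self._entries.get(key)
        if entry is None:
            # First time this key is seen: validate the args exactly once.
            self.validate_calls += 1
            if not self._validate(args):
                self._entries[key] = _BYPASS
                return op(*args)
            entry = self._entries[key] = (op(*args),)
            return entry[0]
        if entry is _BYPASS:
            # Known-uncacheable key: recompute without re-validating.
            return op(*args)
        return entry[0]  # cached result, no validation needed


def add(a, b):
    return a + b


# Assumption for the demo: negative args play the role of "uncacheable" inputs.
cache = OpCache(validate=lambda args: all(a >= 0 for a in args))
```

Repeated dispatches with the same key, cacheable or not, trigger the validation callback only on the first call.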
Aaron Orenstein	6a8b1da18d	FakeTensor speedup: Delay formatting stack trace until it's actually asked for. (#122911 ) When constructing a `FakeTensorMode`, instead of immediately formatting a full stack trace, grab the traceback and only format it on demand. 4.2% FakeTensor perf win on the microbenchmark.
```
import time

import numpy as np
import torch
import torch._dynamo as dynamo
from torch._subclasses.fake_tensor import FakeTensorMode


def toy_example(a, b):
    x = a / (torch.abs(a) + 1)
    b = b * -1
    return x * b


def run_test1():
    # Baseline: real tensors.
    dynamo.reset()
    j = [1, 2, 3]
    toy_example(torch.randn(j), torch.randn(j))


def run_test2():
    # Same workload under FakeTensorMode.
    dynamo.reset()
    j = [1, 2, 3]
    with FakeTensorMode():
        toy_example(torch.randn(j), torch.randn(j))


ITERATIONS = 500000
FORMAT_STRING = "{name:12}: TOT: {tot:10.3f}, AVG: {avg:10.3f}, MIN: {min:10.3f}, P50: {p50:10.3f}, P90: {p90:10.3f}, P99: {p99:10.3f}"


def run_tests(name, step):
    step()  # warm-up call, not timed
    timings = []
    start = time.time()
    for i in range(ITERATIONS):
        a = time.perf_counter_ns()
        step()
        b = time.perf_counter_ns()
        timings.append(b - a)
    end = time.time()
    fmt = {
        "best": min(timings),
        "tot": end - start,
        "avg": np.average(timings),
        "min": min(timings),
        "p50": np.percentile(timings, 50),
        "p90": np.percentile(timings, 90),
        "p99": np.percentile(timings, 99),
    }
    # Unpack the stats dict into keyword arguments for the format string.
    print(FORMAT_STRING.format(name=name, **fmt))
    return fmt


ts = run_tests("tensor", run_test1)
fs = run_tests("fake tensor", run_test2)
# Per-statistic ratio of fake-tensor time to real-tensor time.
ratio = {k: a / b for ((k, a), (_, b)) in zip(fs.items(), ts.items())}
print(FORMAT_STRING.format(name="ratio", **ratio))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122911 Approved by: https://github.com/oulgen, https://github.com/eellison	2024-05-09 22:11:51 +00:00
Tarun Karuturi	eaaf0f3299	Print capture_pre_autograd_graph warning only once (#125848 ) Summary: Print this warning only once to avoid flooding the logs of workflows where this is called frequently. Test Plan: CI Differential Revision: D57163341 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125848 Approved by: https://github.com/zhxchen17	2024-05-09 22:04:05 +00:00
Richard Barnes	20271f0a3b	Drop caffe2-linux-jammy-py3_8-gcc11-build (#125857 ) Removes more caffe2 testing Pull Request resolved: https://github.com/pytorch/pytorch/pull/125857 Approved by: https://github.com/albanD, https://github.com/Skylion007	2024-05-09 21:52:27 +00:00
Animesh Jain	ae5e2ab92e	[dynamo][fsdp] Use Tensor match for FSDP modules (#125827 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125827 Approved by: https://github.com/yf225, https://github.com/jansel ghstack dependencies: #125828, #125805	2024-05-09 21:26:15 +00:00
PyTorch MergeBot	0d4fdb0bb7	Revert "[ROCm] amdsmi library integration (#119182 )" This reverts commit 85447c41e32b1e43a025ea19ac812a0c7f88ff57. Reverted https://github.com/pytorch/pytorch/pull/119182 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the ROCm failed test is legit `85447c41e3` ([comment](https://github.com/pytorch/pytorch/pull/119182#issuecomment-2103433197))	2024-05-09 21:18:21 +00:00
Sam Larsen	966ebd2e24	Add --warm-start-latency to benchmark harness (#125353 ) Summary: This change introduces a new flag to perform a "warm start" test from the benchmark harness. The idea is to test a model twice: first with a fresh inductor cache (i.e., a "cold start"), and then a second run in a fresh process with the cache available (i.e. a "warm start"). We can later add this mode to CI runs to collect compile times for warm start. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125353 Approved by: https://github.com/eellison, https://github.com/desertfire	2024-05-09 21:12:15 +00:00
Animesh Jain	ee00349780	[dynamo][logs] move recompilation reason within compile_id scope (#125805 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125805 Approved by: https://github.com/ezyang ghstack dependencies: #125828	2024-05-09 20:37:23 +00:00
Animesh Jain	a7575e8bd5	[dynamo] Use correct source for custom getattr (#125828 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125828 Approved by: https://github.com/williamwen42	2024-05-09 20:37:23 +00:00
Catherine Lee	7c00635125	[CI] Move gha artifact download before xml parsing for test stat uploads (#125609 ) Move gha artifact download to before any xml parsing is done for upload-test-stats. Do not download gha artifacts during xml parsing, since they were already uploaded to s3 in the step above and will be downloaded when all the artifacts are downloaded from s3. The previous method resulted in dups if you ran the script again. TODO: write a deduper so we don't have to worry at all Pull Request resolved: https://github.com/pytorch/pytorch/pull/125609 Approved by: https://github.com/huydhn	2024-05-09 20:35:09 +00:00
David Berard	1ecea513b6	Fix common_methods_invocations example inputs to _efficient_attention_forward (#125788 ) Fixes #120693 this tries to fix the sample input in common_methods_invocations.py: * I think the arange was intended to be skipping every other integer in the range. Previously, we'd have one length that was -1. * k, v tensors were too small - updated the sizes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125788 Approved by: https://github.com/drisspg, https://github.com/Aidyn-A	2024-05-09 20:08:49 +00:00
PyTorch MergeBot	6fd745255e	Revert "add uuid in cudaDeviceProperties (#125083 )" This reverts commit 3f36145db298f7305b3b4df6c82c9101025a049a. Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/izaitsevfb due to Fails internal builds with: no member named 'uuid' in 'hipDeviceProp_t' ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2103315320))	2024-05-09 19:52:45 +00:00
hippocookie	74a0ef8f8c	Enable UFMT format on test/test_package.py test/test_per_overload_api.py (#125834 ) Fixes some files in https://github.com/pytorch/pytorch/issues/123062 Run lintrunner on files: test/test_package.py test/test_per_overload_api.py ```bash $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125834 Approved by: https://github.com/malfet	2024-05-09 19:48:22 +00:00
Andrey Talman	ed8a560845	Update Release Calendar for 2.3.1 and 2.4 releases (#125794 ) As per: - https://dev-discuss.pytorch.org/t/pytorch-release-2-4-0-call-for-features/2051 - https://dev-discuss.pytorch.org/t/pytorch-release-2-3-1-planning/2052 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125794 Approved by: https://github.com/malfet	2024-05-09 18:31:52 +00:00
Jack Taylor	85447c41e3	[ROCm] amdsmi library integration (#119182 ) Adds monitoring support for ROCm using amdsmi in place of pynvml. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell	2024-05-09 18:21:38 +00:00
Tugsbayasgalan Manlaibaatar	0e419b9146	Fix graph partitioner and make runtime assertion work with submodules in export (#125793 ) Summary: This fix does three things: 1. When we add inputs from the partitioner to the top level graph module, we insert them in the order of the partitioner, which is not guaranteed to be the same as the original graph inputs. This PR fixes that. 2. When we replace autograd ops with HOP, we create new submodules and access their outputs via getitem calls. As a result, previous node names associated with getitem get updated, resulting in the graph being different from the produced graph signature. So I just update the graph signature accordingly. 3. We run runtime_assertion pass before autograd HOP pass because the constraints won't be populated correctly. Differential Revision: [D57130314](https://our.internmc.facebook.com/intern/diff/D57130314) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125793 Approved by: https://github.com/zhxchen17	2024-05-09 18:13:46 +00:00
Catherine Lee	98821b3d92	Disable various flaky tests in test_foreach (#125783 ) * Similar to #125046 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125783 Approved by: https://github.com/huydhn	2024-05-09 18:08:39 +00:00
William Wen	ae20f15941	[dynamo] trace through nn parametrize (#125771 ) Fix https://github.com/pytorch/pytorch/issues/120914 Example dynamo output graph (from test_nn_parametrize): ``` V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] TRACED GRAPH V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] ===== __compiled_fn_1 ===== V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] /data/users/williamwen/pytorch2/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module): V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] def forward(self, L_x_: "f32[10, 10]"): V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] l_x_ = L_x_ V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] # File: /data/users/williamwen/pytorch2/torch/nn/utils/parametrize.py:275 in forward, code: x = self[0](self.original) V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] l__self___parametrizations__param___original: "f32[10, 10]" = self.L__self___parametrizations__param___original V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] # File: /data/users/williamwen/pytorch2/test/dynamo/test_repros.py:4759 in forward, code: return torch.sin(x) V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] x: "f32[10, 10]" = torch.sin(l__self___parametrizations__param___original); l__self___parametrizations__param___original = None V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] V0508 
11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] # File: /data/users/williamwen/pytorch2/test/dynamo/test_repros.py:4755 in forward, code: return self.param @ x V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] matmul: "f32[10, 10]" = x @ l_x_; x = l_x_ = None V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] return (matmul,) V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125771 Approved by: https://github.com/jbschlosser ghstack dependencies: #125710, #125724	2024-05-09 17:43:48 +00:00
Can Balioglu	6ea226b99c	Fix DDP no_sync when find_unused_parameters is True (#124193 ) Fixes #69031, #42793 This PR fixes the bug introduced in #54981 where parameters used within a `no_sync` scope are not respected when `find_unused_parameters` is set to `True`. The `local_used_map_` and `numGradHooksTriggeredMap_` variables should be updated regardless of the `no_sync` state. Tested and verified with fairseq2 and wav2vec2 ASR finetuning recipe. All gradients are correctly synced across workers as expected after applying this fix. Co-authored-by: Kaushik Ram Sadagopan <kaushikram2811@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124193 Approved by: https://github.com/rohan-varma	2024-05-09 17:33:33 +00:00
PyTorch MergeBot	8fb3ff2a4e	Revert "[profiler] enable CUPTI range profiler in build (#125685 )" This reverts commit 2deea9e6e9faf5eacebefa2336861d129c598c99. Reverted https://github.com/pytorch/pytorch/pull/125685 on behalf of https://github.com/atalman due to Broke nightly ([comment](https://github.com/pytorch/pytorch/pull/125685#issuecomment-2103093237))	2024-05-09 17:28:02 +00:00
Will Constable	26b942c4fc	[C10D] Document destroy_process_group usage (#122358 ) This API was not documented. It has already been a source of confusion, but recently has become more urgent as improper destruction can lead to hangs due to ncclCommAbort's requirement of being called collectively. <img width="888" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/9e16342d-1108-4d7d-95c8-b8753661b8e9"> Fixes #48203 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122358 Approved by: https://github.com/shuqiangzhang	2024-05-09 16:51:31 +00:00
atalman	257d40ba2e	Docker release - push nightly tags only for amd64 builds (#125845 ) Fixes failure: https://github.com/pytorch/pytorch/actions/runs/9014006158/job/24765880791#step:12:43 ``` Unable to find image 'ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240509-runtime' locally 2.4.0.dev20240509-runtime: Pulling from pytorch/pytorch-nightly docker: no matching manifest for linux/amd64 in the manifest list entries. ``` This cpu image does not exist for amd64 and is not uploaded to dockerhub. Hence don't tag it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125845 Approved by: https://github.com/malfet, https://github.com/huydhn	2024-05-09 16:42:15 +00:00
Zhengxu Chen	3ccf107f01	[export] remove upgrader. (#125625 ) Summary: talked to executorch team, seems we can remove this now. Test Plan: CI Differential Revision: D57013451 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125625 Approved by: https://github.com/larryliu0820	2024-05-09 16:30:12 +00:00
Pearu Peterson	0241ed9331	Fix sparse fake tensors detach (#125679 ) As in the title. Fixes a bug reported in https://github.com/pytorch/pytorch/pull/117907#discussion_r1589581536 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125679 Approved by: https://github.com/amjames, https://github.com/lezcano	2024-05-09 15:40:57 +00:00
Nikita Shulga	7e86a7c015	Lint: Update older-python test to 3.6 (#125843 ) As python-3.5 can no longer connect to pypi after today's cert update Fixes https://github.com/pytorch/pytorch/issues/125841	2024-05-09 07:23:59 -07:00
Nikita Shulga	b8a706a321	[EZ][BE] Use `untyped_storage` in tests (#125838 ) Gets rid of the following warning: ``` /Users/shenke/workspace/pytorch/test/test_mps.py:9229: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() if base.storage().data_ptr() != other.storage().data_ptr(): ``` (noticed while looking at https://github.com/pytorch/pytorch/issues/96153#issuecomment-2101876484 ) Respective change to view ops was landed back in 2022, see https://github.com/pytorch/pytorch/pull/91414 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125838 Approved by: https://github.com/albanD	2024-05-09 14:04:21 +00:00
Nikita Shulga	4e29e80bf0	Run MPS tests on MacOS Sonoma (#125801 ) Those ones are running 14.4.1, so I wonder if they actually pass CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/125801 Approved by: https://github.com/kit1980, https://github.com/huydhn	2024-05-09 13:43:12 +00:00
leslie-fang-intel	b9588101c4	[Inductor][Quant] Fix PT2E Dynamic Quant regression (#125207 ) Summary Fix 2 regression issues caused by previous refactor: - Fix the issue in dequant promotion pass with dynamic quant when the dequant node is with `tensor` overload. - Fix numerical issue in dynamic quant, since input will convert to scales' dtype (which is `double`) to do quant operation with previous implementation. TestPlan ``` clear && python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_input_dim_exceeds_2 clear && python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_dequant_promotion_dynamic_cpu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125207 Approved by: https://github.com/peterbell10, https://github.com/jgong5 ghstack dependencies: #124041, #124246	2024-05-09 08:47:24 +00:00
leslie-fang-intel	c337395cdb	[Inductor][Quant] Change the QConv output scale name (#124246 ) Summary Change the name of QConv output scale from `inv_output_scale` to `output_scale` after we move the optimization of quant/dequant from decomposition to lowering phase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124246 Approved by: https://github.com/jgong5, https://github.com/peterbell10 ghstack dependencies: #124041	2024-05-09 08:44:00 +00:00
leslie-fang-intel	d83ab88f81	[Inductor] [Quant] Enable lowering of quant per tensor and refactor quant pattern (#124041 ) Summary Per the discussion in https://github.com/pytorch/pytorch/pull/123444, the `decomposed quant/dequant` patterns changed after https://github.com/pytorch/pytorch/pull/123445, we can move the optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase to avoid the changes. In this way, we can: - Avoid the pattern matcher failure introduced in https://github.com/pytorch/pytorch/pull/123445 - Make the quantization pattern clearer in the pattern matcher phase, since the `quant/dequant` nodes have not been decomposed. Changes in this PR - Move optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase. - Corresponding changes in the quantization pattern matcher to ensure no bc-breaking. TestPlan ``` python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_q ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124041 Approved by: https://github.com/peterbell10, https://github.com/jgong5	2024-05-09 08:40:44 +00:00
laithsakka	96c8447001	change error message to avoid failing when nn modules inlined (#125612 ) #address https://github.com/pytorch/pytorch/issues/125605 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125612 Approved by: https://github.com/mlazos, https://github.com/anijain2305	2024-05-09 08:34:31 +00:00
Ma-Jian1	da2f4bbc33	remove empty partition (#124920 ) In some rare scenarios, the partitioner will produce an empty partition. It's a waste of time to compile an empty graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124920 Approved by: https://github.com/ezyang	2024-05-09 07:39:47 +00:00
Gustav Larsson	e5766f02d0	[onnx.export] Avoid dict <-> unordered_map implicit copies (#123063 ) This PR is part of an effort to speed up torch.onnx.export (#121422). - Avoid [implicit copy](https://pybind11.readthedocs.io/en/stable/advanced/cast/stl.html#automatic-conversion) between `pybind11::dict` and `std::unordered_map` that happens for every node that gets processed. The copy scales with N (number of nodes), so this creates a quadratic time complexity. Solution is to always use `pybind11::dict`. - This alone speeds up exports by x2 for large models. - Resolves (1) in #121422. (partial fix of #121422) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123063 Approved by: https://github.com/justinchuby	2024-05-09 07:34:47 +00:00
Wanchao Liang	c59a2369be	[fsdp2] Accomodate FSDP2 to accept parent mesh > 2 (#125778 ) as titled, to support higher dimension parallelism Pull Request resolved: https://github.com/pytorch/pytorch/pull/125778 Approved by: https://github.com/weifengpy	2024-05-09 05:02:21 +00:00
Edward Z. Yang	aaa2f93a4f	Add meta for _embedding_bag_dense_backward and _embedding_bag_per_sample_weights_backward (#125785 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125785 Approved by: https://github.com/albanD	2024-05-09 04:28:16 +00:00
Bin Bao	ed48ea9997	[AOTI] Refine the C shim autogen mechanism (#125589 ) Summary: Based on the discussions in https://github.com/pytorch/pytorch/pull/120513. Instead of auto-generating C shim fallback ops for thousands of ops, we maintain a list of fallback ops based on torch/_inductor/lowering.py, and only generate C shim functions for those ops. At torchgen time, we will re-generate C shim files and compare the header file contents against the existing C shim headers. If there is any change, the compilation will fail with a prompt on how to proceed. This makes sure the ABI-compatible C shim layer is small enough to maintain in the long run. Differential Revision: [D57004046](https://our.internmc.facebook.com/intern/diff/D57004046) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125589 Approved by: https://github.com/frank-wei, https://github.com/chenyang78, https://github.com/albanD, https://github.com/ezyang	2024-05-09 02:48:16 +00:00
Sathyanarayanan Saravanamuthu	0bde9c08ef	Prevent rendezvous shutdown on worker restarts (#124819 ) Fixes #123678 #### Summary When the rank leaves and joins back, the workers are restarted and while restarting the rendezvous is shut down. This change prevents rendezvous shutdown during worker restarts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124819 Approved by: https://github.com/malfet, https://github.com/kurman, https://github.com/eqy	2024-05-09 02:40:31 +00:00
cyy	6c4f43f826	Decouple most Caffe2 components from the build systems (r-barnes) (#125711 ) Copying #125392 here so I can edit it more easily. Co-authored-by: cyy <cyyever@outlook.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125711 Approved by: https://github.com/malfet	2024-05-09 02:19:59 +00:00
Michael Ranieri	fdff9920f6	[pytorch] fix blasLt on windows (#125792 ) Summary: It seems like required functions are not available due to `_MSC_VER` guard. Does anyone have more context why this functionality has been disabled for windows? I'm also unsure how this currently compiles in OSS land on windows, as there doesn't seem to be any preprocessor protection around `scaled_gemm` getting pulled in. Test Plan: Fix compilation errors like this ``` C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\tunable\TunableGemm.h(74): error C2039: 'scaled_gemm': is not a member of 'at::cuda::blas' C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\CUDABlas.h(19): note: see declaration of 'at::cuda::blas' C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\tunable\TunableGemm.h(74): note: the template instantiation context (the oldest one first) is C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\tunable\TunableGemm.h(71): note: while compiling class template 'at::cuda::tunable::DefaultScaledGemmOp' Action failed: fbsource//xplat/caffe2:ATen_cuda_lib_ovrsource (cxx_compile aten/src/ATen/native/cuda/Blas.cpp) ``` Differential Revision: D57087985 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125792 Approved by: https://github.com/malfet, https://github.com/eqy	2024-05-09 01:54:25 +00:00
Giuseppe Ottaviano	902a74c1d6	[caffe2] Lazily symbolize backtrace in c10::Error (#125787 ) Summary: The macros that build `c10::Error` compute the stack trace at the point of throwing, which is then returned as part of the `what()`. If `what()` is never called, which is the case for most exceptions (since logging is throttled), the cost of computing the stack trace was wasted. By far, the most expensive part of computing the stack trace is its symbolization; just unwinding the stack and collecting the instruction addresses is comparatively cheap. We can thus defer the symbolization to first invocation of `what()`. Test Plan: Added unit tests exercising the lazy nature of `what()`. Ran an adfinder canary: https://www.internalfb.com/intern/ads/canary/460118801509424346 We can see that the cost of symbolization is obliterated (meaning that `what()` is virtually never called, as expected): {F1496627896} Differential Revision: D57128632 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125787 Approved by: https://github.com/huydhn	2024-05-09 01:46:57 +00:00
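The deferral described in the entry above (collect raw frames cheaply when the error is created, pay for formatting only on the first `what()` call) can be illustrated with a Python analogue. This is not the `c10::Error` implementation: `LazyError` and its `symbolize_calls` counter are hypothetical, and Python's `traceback` formatting stands in for native symbolization.

```python
import sys
import traceback


class LazyError(Exception):
    """Hypothetical analogue of an error that formats its backtrace lazily."""

    def __init__(self, msg):
        super().__init__(msg)
        self.msg = msg
        # Cheap step: walk the stack and record (frame, lineno) pairs only.
        self._raw_frames = []
        frame = sys._getframe(1)
        while frame is not None:
            self._raw_frames.append((frame, frame.f_lineno))
            frame = frame.f_back
        self._what = None  # formatted text, filled in on first what() call
        self.symbolize_calls = 0  # counts how often the expensive step ran

    def what(self):
        if self._what is None:
            # Expensive step: resolve source lines and format the stack.
            self.symbolize_calls += 1
            summary = traceback.StackSummary.extract(iter(self._raw_frames))
            self._what = self.msg + "\n" + "".join(summary.format())
        return self._what


err = LazyError("boom")
```

An error whose `what()` is never called never pays the formatting cost, and repeated calls reuse the cached text. One trade-off of this Python version is that holding frame objects keeps their locals alive until the error is released.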
PyTorch MergeBot	ea3f625e32	Revert "[Inductor] [Quant] Enable lowering of quant per tensor and refactor quant pattern (#124041 )" This reverts commit 33e6791645b5950b0f39301f55b8a4a79c0ca847. Reverted https://github.com/pytorch/pytorch/pull/124041 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think there is a land race with the change `33e6791645` ([comment](https://github.com/pytorch/pytorch/pull/124041#issuecomment-2101766558))	2024-05-09 01:34:19 +00:00
PyTorch MergeBot	ca579c177b	Revert "[Inductor][Quant] Change the QConv output scale name (#124246 )" This reverts commit 9ba9f7fa821af062ef3d1580b75e70f74ba05063. Reverted https://github.com/pytorch/pytorch/pull/124246 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think there is a land race with the change `33e6791645` ([comment](https://github.com/pytorch/pytorch/pull/124041#issuecomment-2101766558))	2024-05-09 01:34:19 +00:00
PyTorch MergeBot	97509c8eb2	Revert "[Inductor][Quant] Fix PT2E Dynamic Quant regression (#125207 )" This reverts commit 3da949b0fbe91e802d30e00165141d1390621d71. Reverted https://github.com/pytorch/pytorch/pull/125207 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think there is a land race with the change `33e6791645` ([comment](https://github.com/pytorch/pytorch/pull/124041#issuecomment-2101766558))	2024-05-09 01:34:19 +00:00
Valentine233	19bab45e67	[Inductor] Add SDPA pattern for OOB GPT2 models (#125562 ) Add SDPA pattern for 2 OOB models: - token-classification+gpt2 - text-generation+gpt2 Note that these models have two masks: attention mask with float type and causal mask with bool type. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125562 Approved by: https://github.com/jgong5, https://github.com/kadeng, https://github.com/jansel	2024-05-09 01:21:09 +00:00
leslie-fang-intel	3da949b0fb	[Inductor][Quant] Fix PT2E Dynamic Quant regression (#125207 ) Summary Fix two regressions caused by the previous refactor: - Fix the dequant promotion pass with dynamic quant when the dequant node uses the `tensor` overload. - Fix a numerical issue in dynamic quant: with the previous implementation, the input was converted to the scales' dtype (which is `double`) for the quant operation. TestPlan ``` clear && python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_input_dim_exceeds_2 clear && python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_dequant_promotion_dynamic_cpu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125207 Approved by: https://github.com/peterbell10, https://github.com/jgong5 ghstack dependencies: #124041, #124246	2024-05-09 01:05:00 +00:00
Animesh Jain	d474d79420	[dynamo][disable] Move disable impl to its own __call__ method (#125486 ) There were internal cases where calling disable in distributed causes trace_rules to be generated, which imports distributed and causes circular import errors. The code has also grown bulky, so I think it is time for the disable code to live separately. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125486 Approved by: https://github.com/yanboliang, https://github.com/williamwen42, https://github.com/jansel	2024-05-09 01:03:12 +00:00
leslie-fang-intel	9ba9f7fa82	[Inductor][Quant] Change the QConv output scale name (#124246 ) Summary Change the name of QConv output scale from `inv_output_scale` to `output_scale` after we move the optimization of quant/dequant from decomposition to lowering phase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124246 Approved by: https://github.com/jgong5, https://github.com/peterbell10 ghstack dependencies: #124041	2024-05-09 00:57:10 +00:00
leslie-fang-intel	33e6791645	[Inductor] [Quant] Enable lowering of quant per tensor and refactor quant pattern (#124041 ) Summary Per the discussion in https://github.com/pytorch/pytorch/pull/123444, the `decomposed quant/dequant` patterns changed after https://github.com/pytorch/pytorch/pull/123445, we can move the optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase to avoid the changes. In this way, we can: - Avoid the pattern matcher failure introduced in https://github.com/pytorch/pytorch/pull/123445 - Make the quantization pattern clearer in the pattern matcher phase, since the `quant/dequant` nodes have not been decomposed. Changes in this PR - Move optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase. - Corresponding changes in the quantization pattern matcher to ensure no bc-breaking. TestPlan ``` python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_q ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124041 Approved by: https://github.com/peterbell10, https://github.com/jgong5	2024-05-09 00:54:22 +00:00
Michael Lazos	1b1b18a7a4	Add LRScheduler Composability E2E Tests (#125653 ) adds tests to verify the LRSchedulers correctly update the compiled optimizers without recompiles. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125653 Approved by: https://github.com/yanboliang ghstack dependencies: #123751, #123752, #123753, #125383	2024-05-09 00:52:43 +00:00
Michael Lazos	8c9c169b48	LRScheduler composability kernel tests (#125383 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125383 Approved by: https://github.com/eellison ghstack dependencies: #123751, #123752, #123753	2024-05-09 00:52:43 +00:00
Michael Lazos	69eeef0727	Update LRScheduler to handle tensor LR (#123753 ) Enables LRScheduler to handle tensor LRs. Note on test changes: For the test modifications I just removed itertools.product and created two loops. This allows us to create a new set of optim_inputs on each iteration to prevent mutations on the tensor LR carrying over across iterations. Nothing else in those tests was modified. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123753 Approved by: https://github.com/janeyx99 ghstack dependencies: #123751, #123752	2024-05-09 00:52:43 +00:00
Michael Lazos	7b36b4a765	Fix user warning for tensor LR (#123752 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123752 Approved by: https://github.com/janeyx99 ghstack dependencies: #123751	2024-05-09 00:52:43 +00:00
Michael Lazos	0ea6ffc613	Swap warning counter to flag in LRScheduler (#123751 ) This was previously a counter; it should be a flag indicating whether the optimizer step has been called. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123751 Approved by: https://github.com/janeyx99	2024-05-09 00:52:43 +00:00
xinan.lin	78a1693266	[Inductor Intel GPU backend Upstream] Reuse inductor test for Intel GPU (PART 1) (#122866 ) Reuse Inductor test suite for Intel GPU including: test_torchinductor.py test_triton_wrapper.py test_metrics.py test_codecache.py test_codegen_triton.py test_kernel_benchmark.py test_triton_heuristics.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/122866 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-05-09 00:51:35 +00:00
Guokai Ma	4dd33a1c2b	Better core binding in torch.backends.xeon.run_cpu when launched from torchrun with --nproc-per-node (#123711 ) This PR fixes the behavior of `torch.backends.xeon.run_cpu` when it is launched from `torchrun` with the `--nproc-per-node` parameter. As a CPU launcher, `run_cpu` binds cores to each instance it launches using `numactl` and assigns cores to the instances evenly. However, if we use `torchrun` to start `run_cpu` and use `--nproc-per-node` to create multiple `run_cpu` processes, each `run_cpu` process assumes it can use all the CPU cores, so the processes compete for CPU cores. This results in poor performance. This PR recognizes the environment variables `LOCAL_WORLD_SIZE` and `LOCAL_RANK` set by `torchrun`, then uses this information to further shard the cores bound to each instance. With this PR, when launched by `torchrun --nproc-per-node ...`, different CPU cores are bound to different workers, which maximizes CPU utilization and application performance. The specific use case this PR enables is TorchServe with DeepSpeed tensor parallel. In this case, TorchServe runs `torchrun --nproc-per-node <tp_size>` to start the tensor parallel workers it needs. When running TorchServe on a multisocket CPU server with DeepSpeed tensor parallel, we need this PR to achieve the best performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123711 Approved by: https://github.com/jingxu10, https://github.com/ezyang	2024-05-09 00:32:11 +00:00
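The sharding idea in the commit above can be sketched as follows. This is only an illustration under stated assumptions, not the `run_cpu` code: the helper names `shard_cores` and `cores_for_this_process` are hypothetical, cores are split into equal contiguous slices, and any remainder cores are simply left unused. The only names taken from the commit message are the `LOCAL_WORLD_SIZE` and `LOCAL_RANK` environment variables.

```python
import os


def shard_cores(cores, local_world_size, local_rank):
    """Return the contiguous slice of `cores` owned by one local worker.

    Hypothetical helper: equal-sized slices, remainder cores are dropped.
    """
    per_rank = len(cores) // local_world_size
    start = local_rank * per_rank
    return cores[start:start + per_rank]


def cores_for_this_process(cores, env=os.environ):
    """Pick this process's cores from the variables torchrun sets.

    Falls back to a single worker owning every core when the
    variables are absent (i.e. not launched through torchrun).
    """
    world = int(env.get("LOCAL_WORLD_SIZE", "1"))
    rank = int(env.get("LOCAL_RANK", "0"))
    return shard_cores(cores, world, rank)
```

With eight cores and two local workers, rank 0 would get cores 0 to 3 and rank 1 would get cores 4 to 7, so the workers no longer compete for the same cores.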
Jiong Gong	8def2e92f2	[inductor] autotune benchmark support for cpu (#125159 ) This PR adds the autotune Infrastructure for CPU. It generalizes and extends `BenchmarkRequest` with CPU support and C++ module loader. A `do_bench_cpu` util function is added for benchmarking functions on CPU with warmups and returns the median number from multiple trials. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125159 Approved by: https://github.com/jansel	2024-05-09 00:28:49 +00:00
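A benchmark loop of the kind the commit above describes (warmup runs, then the median over several timed trials) might look like the sketch below. The function name `bench_median` and its parameters are assumptions made for illustration; this is not the signature or body of `do_bench_cpu`.

```python
import statistics
import time


def bench_median(fn, warmup=1, times=5):
    """Run `fn` a few times untimed, then return the median time in ms.

    Hypothetical helper, not the Inductor utility itself.
    """
    for _ in range(warmup):
        fn()  # warmup runs are not measured
    samples = []
    for _ in range(times):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    # The median is less sensitive to a single slow outlier than the mean.
    return statistics.median(samples)
```

Calling `bench_median(lambda: sum(range(1000)), warmup=2, times=7)` would invoke the function nine times in total and report the median of the last seven.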
Shivam Raikundalia	96a5698408	Fix torch.profiler Schedule Function (Function Event only) (#125510 ) Summary: github issue: https://github.com/pytorch/pytorch/issues/73828 Whenever we transition from RECORD_AND_SAVE to WARMUP in the profiler schedule, we instantiate a new backend profiler which wipes out the last cycle's information. This makes using the `repeat` parameter less useful in the schedule as you only get contents of the last cycle/repeat. In this diff, we save the accumulated Function Events before setting the new ones and then merge the two EventLists after post processing/cleaning is done. This diff only fixes Function Events so that we can get statistics over each cycle within a schedule. A follow up should be made to accumulate the chrome tracings as well if it is requested. Test Plan: Added functional python tests in test_profiler.py that test different schedules and their FunctionEvent counts Differential Revision: D56956245 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125510 Approved by: https://github.com/aaronenyeshi	2024-05-08 23:32:50 +00:00
William Wen	ff090c6937	[dynamo] support tracing nn.Module @property that accesses closure cells (#125724 ) Fix https://github.com/pytorch/pytorch/issues/125702 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125724 Approved by: https://github.com/jansel, https://github.com/jbschlosser ghstack dependencies: #125710	2024-05-08 23:25:39 +00:00
William Wen	93f3d561f9	[dynamo] don't make nn parametrized Modules unspecialized (#125710 ) Workaround for https://github.com/pytorch/pytorch/issues/125314 and https://github.com/pytorch/pytorch/issues/125478. We no longer make parametrized nn.Modules unspecialized. Instead, when we are about to call a function from the `torch.nn.utils.parametrize` module, we skip the frame. The script from https://github.com/pytorch/pytorch/issues/125314 now outputs ``` parametrize=True: 6587ms parametrize=False: 1729ms parametrize=True: 4497ms parametrize=False: 1539ms ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125710 Approved by: https://github.com/jansel, https://github.com/jbschlosser	2024-05-08 23:25:39 +00:00
Aaron Orenstein	e71207b729	Fix infinite recursion in API BC test (#125706 ) ``` python test/test_fx.py -k test_public_api_surface ``` was failing with a complaint about infinite recursion. Fixed that and then marked the two API changes from #123681 as private (for `get_example_value`) and backward compatible (for `insert_deferred_runtime_asserts`). Fixes #104012 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125706 Approved by: https://github.com/BoyuanFeng	2024-05-08 23:07:16 +00:00
Ke Wen	04bf7713e8	[c10d] Reduce test time by reusing ProcessGroup (#125648 ) ## Problem this PR resolves Today, most of distributed tests are arranged like this: ``` def test_allreduce(self): pg = self._create_process_group_nccl(store, self.opts()) pg.allreduce(tensor) ... ``` Thus, we are paying PG creation time per test. That's bad. But why were we doing that? Is there a constraint? If we look deeper, we would find that most of our test cases inherit from `torch.testing._internal.common_distributed.MultiProcessTestCase`. From the name, nothing seems wrong, and probably fits distributed well. But a "problem" exists in its `setUp()` and `tearDown()` methods, which basically do the following: ``` def setUp(self): self._spawn_processes() def tearDown(self): for p in self.processes: p.terminate() ``` Since `setUp` and `tearDown` are "test-scope fixtures", meaning, they are called per test, each test will have brand new processes. Of course we'd have to recreate ProcessGroup every time. ## How we are fixing it First, obviously, we need to put a PG's lifetime into a longer scope. Python `unittest` provides such a helper, called "class-scope fixtures." It is embodied by a `setUpClass` method and a `tearDownClass` method (note the name difference), which are called only once for all tests in the same test class. Therefore, we would do: ``` @classmethod def setUpClass(self): dist.init_process_group(...) @classmethod def tearDownClass(self): dist.destroy_process_group() ``` In this PR, we create a new test template for distributed: `MultiProcContinousTest`, to hold this class-scope fixture. Second, we'd need to avoid per-test process spawn and terminate. That's easy, we can either: 1. launch the whole test file with `torchrun --nproc-per-node=...` or 2. use `mp.spawn()` under `if __name__ == "__main__":`. Point is, launch the processes only once. ## Result We moved the "positive tests" from test_c10d_nccl.py to test_c10d_ops_nccl.py. 
Before this PR: ``` $ python test_c10d_nccl.py -k ProcessGroupNCCLTest Ran 24 tests in 174.457s ``` After this PR: ``` $ torchrun --nproc-per-node 2 test_c10d_ops_nccl.py or $ python test_c10d_ops_nccl.py Ran 24 tests in 16.247s ``` 10X speedup. ## Limitation For tests intended to test destroy or abort of PGs, we'd need to go back to the old style. So it would make sense to divide our tests into two classes: one for positive tests where we would reuse the PGs, and the other one for abort/destroy and negative tests like watchdog timeout. ## Next step Migrate the tests of distributed that would fit with this test style! Pull Request resolved: https://github.com/pytorch/pytorch/pull/125648 Approved by: https://github.com/wconstab	2024-05-08 22:33:40 +00:00
Daniel Richard G	8f27c7f181	[sparse] Fix type-dispatch errors (#124777 ) I am building PyTorch with the Intel oneAPI 2024.0 compiler and without cuSparseLt, and encountered various type errors of the following forms: ``` [ 63%] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu.o /tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(87): error: no operator "=" matches these operands operand types are: cutlass::uint2b_t = int detected during: instantiation of "at::native::Indices4x4 at::native::LargestValuesGreedy<Op>::operator()(Tile4x4Accessor) [with Op=at::native::IdentityOp, Tile4x4Accessor=at::native::KernelTypes<cutlass::half_t>::Tile4x4Accessor]" /tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(349): here instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]" /tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]" (177): here instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]" /tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]" 
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here /tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(88): error: no operator "=" matches these operands operand types are: cutlass::uint2b_t = int detected during: instantiation of "at::native::Indices4x4 at::native::LargestValuesGreedy<Op>::operator()(Tile4x4Accessor) [with Op=at::native::IdentityOp, Tile4x4Accessor=at::native::KernelTypes<cutlass::half_t>::Tile4x4Accessor]" /tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(349): here instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]" /tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]" (177): here instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]" /tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]" /tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here /tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(238): error: function "lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void" cannot be called with the given argument list 
argument types are: (int, int) object type is: lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void detected during: instantiation of "at::native::KernelTypes<Element_>::Tile4x4Packed at::native::KernelTypes<Element_>::pack_4x4(at::native::Indices4x4, at::native::KernelTypes<Element_>::Tile4x4Accessor, uint32_t &, int, __nv_bool) [with Element_=cutlass::half_t]" (354): here instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]" /tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]" /tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(177): here instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]" /tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]" /tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here /tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(241): error: function "lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void" cannot be called with the given argument list argument types are: (int, int) object type is: lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void detected during: 
instantiation of "at::native::KernelTypes<Element_>::Tile4x4Packed at::native::KernelTypes<Element_>::pack_4x4(at::native::Indices4x4, at::native::KernelTypes<Element_>::Tile4x4Accessor, uint32_t &, int, __nv_bool) [with Element_=cutlass::half_t]" (354): here instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]" /tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]" /tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(177): here instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]" /tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]" /tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here ``` The casts added by this PR get the build working again for me. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124777 Approved by: https://github.com/jcaip	2024-05-08 21:49:33 +00:00
Avik Chaudhuri	1b8891a31d	make torch._check understand Eq commutativity (#125629 ) Summary: Given `torch._check(a == b)` we can still get a data-dependent error needing `b == a`. Simple fix. ``` def forward(self, x1, x2, x3, y): z1 = x1.item() z2 = x2.item() z3 = x3.item() torch._check((z2 + z3) == z1) # torch._check(z1 == (z2 + z3)) didn't work, now does if z2 + z3 == z1: return y * 2 else: return y + 3 ``` Differential Revision: D57014730 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125629 Approved by: https://github.com/ezyang	2024-05-08 21:39:21 +00:00
Will Constable	346343e6b5	[DeviceMesh] Make _validate_tp_mesh_dim support 3D (#125763 ) Currently a 3D mesh with a submesh sliced out for TP is going to fail this check. According to @wanchaol in [this comment](https://github.com/pytorch/pytorch/pull/125250#discussion_r1586653669) it should be OK to remove these checks. Though I would appreciate a more careful review here, since I'm not too sure if there are other edge cases where these checks are important. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125763 Approved by: https://github.com/wz337, https://github.com/wanchaol	2024-05-08 21:22:11 +00:00
PyTorch MergeBot	e457fdcd81	Revert "[caffe2] Lazily symbolize backtrace in c10::Error (#125682 )" This reverts commit 08f6ef0e1ccadf4626c0d7ecb15db96c01b8f418. Reverted https://github.com/pytorch/pytorch/pull/125682 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125682#issuecomment-2101477132))	2024-05-08 21:11:27 +00:00
Simon Fan	7e0edafe86	[compiled autograd][dynamo] improve lifted autograd.Function.backward handling and fallback to pseudo-eager (#125661 ) - `FakeContext` hides all fields other than ctx.saved_tensors, so dynamo errors when the autograd.Function.backward uses other attrs on ctx, and it also doesn't allow fallback to eager. - If we remove it, we still can't fall back to eager: node variables are already freed (ctx.saved_tensors throws). - However, we can fall back to "pseudo-eager" by using a duck-typed ctx and routing the ctx.saved_tensors to lifted tensors. - Dynamo tries to inline external_utils.call_backward and treats BackwardCFunction as an AutogradFunctionContextVariable (only used up until we create the fake context: FakeBackwardCFunction). - We call_function backward from the forward class AutogradFunctionVariable, and we still pass in the fake context as a UserDefinedObjectVariable (can later use AutogradFunctionContextVariable + HOO graph speculate). Fixes #125489 #124827 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125661 Approved by: https://github.com/jansel	2024-05-08 21:00:37 +00:00
Catherine Lee	de8ce3be20	[TD] Heuristic based on file path (#125477 ) Get the folders of each changed file and attempt to map the folders to some tests. The intention is to push up things like dynamo tests if someone changes a file in the dynamo folder. Please see the tests for examples of what should be matched together. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125477 Approved by: https://github.com/ZainRizvi, https://github.com/huydhn	2024-05-08 20:56:53 +00:00
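A rough sketch of a folder-based ranking like the one the commit above describes is given below. It is an illustration only: the functions `folders_of` and `rank_tests`, the underscore stripping, and the overlap-count score are all assumptions, and the actual target-determination heuristic may match folders to tests quite differently.

```python
from pathlib import PurePosixPath


def folders_of(changed_files):
    """Collect the folder names on the path of every changed file.

    Leading underscores are stripped so `_dynamo` matches `dynamo`
    (an assumption made for this sketch).
    """
    folders = set()
    for f in changed_files:
        for parent in PurePosixPath(f).parents:
            if parent.name:
                folders.add(parent.name.lstrip("_"))
    return folders


def rank_tests(changed_files, tests):
    """Order tests so those sharing folder names with the change come first."""
    folders = folders_of(changed_files)

    def score(test):
        test_folders = set(PurePosixPath(test).parts[:-1])
        return len(test_folders & folders)

    # sorted() is stable, so tests with equal scores keep their input order.
    return sorted(tests, key=score, reverse=True)
```

For a change to `torch/_dynamo/utils.py`, a test under `test/dynamo/` would be ranked ahead of one under `test/inductor/`.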
Aaron Enye Shi	17ab7f77c2	[Kineto] Update Kineto Submodule Hash (#125621 ) Summary: Update the Kineto submodule in PyTorch. The following diffs are included: - Removed CUPTI overhead track in AMD traces - Delay logging for CUDA stream wait event until the end - Changed chrome trace unit will be in milliseconds, and data will be in ns - Refactored roctracer to include metadata and improved names. - Lowered Kineto Stage log level, reducing noisy output - Changed relative time of ts to quarterly interval for distributed trace alignment - Fixed Non-risky deprecated use of 0/NULL - Removed hardcoding of /opt/rocm - Handling cuLaunchKernelEx better - Fixed Non-risky missing field initializers and unused variables. Test Plan: CI and this is running internally. Differential Revision: D57011897 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/125621 Approved by: https://github.com/sraikund16	2024-05-08 20:49:07 +00:00
William Wen	255a3afbf1	[dynamo] don't LOAD_FAST local context variables in modified bytecode (#125719 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125719 Approved by: https://github.com/jansel	2024-05-08 20:39:06 +00:00
atalman	0feca5e341	Increase Python version for Docker builds (#125782 ) Fixes https://github.com/pytorch/pytorch/issues/73714 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125782 Approved by: https://github.com/huydhn	2024-05-08 20:32:32 +00:00
albanD	19a9de114a	Forbid subclassing _TensorBase directly (#125558 ) As per title. This ensures that the methods defined in _tensor.py exist everywhere we assume they do. BC-Breaking: This is bc-breaking as the user cannot subclass this private class anymore. You should replace any use of _TensorBase with Tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125558 Approved by: https://github.com/ezyang	2024-05-08 20:29:29 +00:00
Zejun Huang	afea237935	[minimizer] Create block traverse mode in minimizer for graph aware debugging (#125613 ) Summary: block traverse mode: Assumption: the culprit block is formed by (start_idx, end_idx) in the topologically sorted graph, and the error goes away if the graph pattern is broken. Reviewed By: junhanh Differential Revision: D56799587 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125613 Approved by: https://github.com/jfix71	2024-05-08 20:21:21 +00:00
wz337	603d1e6049	[DTensor] allow numel 1 tensor operand to be implicitly replicate DTensor (#125073 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/125073 Approved by: https://github.com/wanchaol	2024-05-08 19:47:47 +00:00
Andrew M. James	445a0c01da	Retry: Low mem max_pool2d_with_indices (#122832 ) Based on #105687. The low memory path does not need to strictly return the int8 offsets; instead, the offset-to-index computation can be separated from the inner function of the max pool lowering. The partitioner can then choose to move the offset-to-index computation into the backward pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122832 Approved by: https://github.com/peterbell10, https://github.com/eellison	2024-05-08 19:37:08 +00:00
Piotr Bielak	005a12722d	Remove duplicated nodes in dfs_iter_find_cycle (#125585 ) In case the `dfs_iter_find_cycle` function receives duplicated node entries in the `all_user_nodes` argument, it will still process each one of them. This commit changes the `all_user_nodes` list into a set, so each element is unique, resulting in a shorter execution time of the `propose_partitions` function. Fixes #125584 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125585 Approved by: https://github.com/Skylion007	2024-05-08 19:21:15 +00:00
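The effect of turning the list of start nodes into a set, as the commit above describes, can be seen in a small standalone sketch. The functions below are hypothetical and are not the `torch.fx` partitioner code: `has_cycle_from` is a generic iterative DFS, and `find_cycle` counts how many searches are started so the saving from deduplication is visible.

```python
def has_cycle_from(start, users):
    """Iterative DFS: True if `start` can reach itself through `users` edges."""
    stack = list(users.get(start, ()))
    seen = set()
    while stack:
        node = stack.pop()
        if node == start:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(users.get(node, ()))
    return False


def find_cycle(all_user_nodes, users, dedupe=True):
    """Return (cycle_found, number_of_searches_started).

    With dedupe=True the start nodes are collapsed into a set first,
    so duplicated entries trigger only one search each.
    """
    starts = set(all_user_nodes) if dedupe else all_user_nodes
    searches = 0
    found = False
    for node in starts:
        searches += 1
        if has_cycle_from(node, users):
            found = True
    return found, searches
```

For start nodes `["a", "a", "b", "a"]`, the deduplicated version starts two searches instead of four while returning the same answer.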
Jeff Daily	3f36145db2	add uuid in cudaDeviceProperties (#125083 ) Replaces #99967. Fixes #99903. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083 Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy	2024-05-08 19:15:55 +00:00
Pian Pawakapan	f4b2d50fd7	[export] disable_forced_specializations (#124949 ) Summary: By default, some inferred dynamic shapes guards/constraints that are not expressible with the current dynamic shapes language will lead to specialization to the concrete input values provided. If disable_forced_specializations is set to True, we will not specialize, and will not perform runtime checks on such produced guards. Instead, we allow the user to specify arbitrary shapes, and fail during runtime if the inputs are invalid. Constraints expressible with the language (e.g. ranges, linear derived dims) will still be enforced, and behavior for all other guards remains the same. Cases where we typically specialize are reshapes: ``` x: [4, 6] # [s0, s1] x = x.reshape([x.shape[0] - 1, -1]) # this emits a guard Mod(s0*s1, s0-1) = 0, we specialize on s0=4, s1=6 x: [4, 6], y: [24] # [s0, s1], [s2] x = x.reshape([-1]) + y # this emits a guard s0*s1 = s2, we specialize on s0=4, s1=6, s2=24 ``` For now only applicable for non-strict mode (need to figure out how to pass this flag into dynamo's call of produce_guards). Test Plan: Added test case that checks compilation, runtime, and suggested fixes behavior. Differential Revision: D56361177 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124949 Approved by: https://github.com/avikchaudhuri	2024-05-08 18:42:39 +00:00
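The two reshape guards in the commit above come down to simple arithmetic on the element count (the product of `s0` and `s1`), which the sketch below checks directly. The function names are hypothetical and nothing here calls into `torch.export`; it only evaluates the divisibility and equality conditions for concrete sizes.

```python
def reshape_guard_holds(s0, s1):
    """x.reshape([s0 - 1, -1]) needs the element count s0*s1
    to be divisible by the new leading dimension s0 - 1."""
    return s0 > 1 and (s0 * s1) % (s0 - 1) == 0


def flatten_add_guard_holds(s0, s1, s2):
    """x.reshape([-1]) + y needs the flattened length s0*s1 to equal s2."""
    return s0 * s1 == s2
```

For the sizes in the commit message, 4 * 6 = 24 is divisible by 3 and equals the length of `y`, so both guards hold; an input of shape [5, 6] would fail the first guard because 30 is not divisible by 4.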
atalman	74b1674860	Use nvidia/cuda:CUDA_VERSION-devel-ubuntu22.04 as base for official Docker release (#125770 ) We don't need to install cudnn system wide. We already install it with pytorch install during this step: ``` [conda-installs 2/3] RUN case linux/amd64 in "linux/arm64") pip install --extra-index-url https://download.pytorch.org/whl/cpu/ torch torchvision torchaudio ;; *) /opt/conda/bin/conda install -c "pytorch-nightly" -c "nvidia" -y "python=3.10" pytorch torchvision torchaudio "pytorch-cuda=$(echo 12.1.1 | cut -d'.' -f 1-2)" ;; esac && /opt/conda/bin/conda clean -ya ``` Ref: https://github.com/pytorch/pytorch/actions/runs/8998055687/job/24717424912 Validate via: https://github.com/pytorch/builder/actions/workflows/validate_docker_images.yml Pull Request resolved: https://github.com/pytorch/pytorch/pull/125770 Approved by: https://github.com/nWEIdia, https://github.com/seemethere	2024-05-08 18:41:30 +00:00
Tianyu Liu	faf0015052	[dtensor] run transformer sdpa in dtensor (#122997 ) Now that efficient attention is supported in dtensor, we can modify the transformer test to use dtensor in SDPA and get rid of the manual num_head adjustments. Caveat: Efficient attention is supported only with bf16/fp32 (not fp64) and has other constraints. If any of the constraints are not satisfied, the SDPA would fall back to the math decomposed attention, which will break as it does not fully work with dtensor (it creates a `torch.Tensor` mask in the middle). I considered adding some checks like in P1202254918 but that needs to be added everywhere this Transformer is used. Is it necessary if the current CI machines can run efficient attention? Test files containing this Transformer: - `test/distributed/tensor/parallel/test_tp_examples.py` - `test/distributed/_composable/fsdp/test_fully_shard_training.py` - `test/distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122997 Approved by: https://github.com/XilunWu ghstack dependencies: #122995, #122996	2024-05-08 17:08:47 +00:00
Tianyu Liu	efece3f142	[dtensor] add op support for memory efficient attention (#122996 ) This is a followup to flash attention. On cuda, flash attention is supported only for fp16/bf16, whereas memory efficient attention is supported for fp32 (but not fp64). With this PR, one can run SDPA and in general Transformer completely in dtensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122996 Approved by: https://github.com/XilunWu, https://github.com/wanchaol ghstack dependencies: #122995	2024-05-08 17:08:27 +00:00
Tianyu Liu	08be8ec8a9	[dtensor] improve new factory strategy (#122995 ) Previously, the new tensors produced by the "new factory" ops were all replicated. With this PR, if the new tensor has the same shape as the old tensor and the shape can be evenly sharded, then the old spec is inherited and preferred. To accommodate this when the old tensor has sharded placements, the input args for local computation (size, stride) need to be adjusted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122995 Approved by: https://github.com/wanchaol	2024-05-08 17:05:07 +00:00
Jez Ng	affd7a9789	Get PT2 Cutlass backend working under fbcode [take 2] (#125688 ) Differential Revision: D57051232 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125688 Approved by: https://github.com/chenyang78	2024-05-08 16:44:49 +00:00
eellison	87f86fd586	Fix multi template debug trace (#125703 ) Fix for https://github.com/pytorch/pytorch/issues/125642 We were trying to render the template of multi template kernel before it had been finalized. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125703 Approved by: https://github.com/shunting314	2024-05-08 16:31:18 +00:00
Brian Hirsh	e28d9947a1	AsyncCollectiveTensor: prevent wait_tensor() calls on graph inputs from getting DCEd (#125677 ) @wanchaol was seeing the loss eventually become NaN when compiling individual transformer blocks in torchtitan - with this patch I no longer see the NaN loss. The problem is the following: (1) It is possible to have graph inputs to a compiled region that are AsyncCollectiveTensors. In particular: when we compile individual transformer blocks in the llama model, the first layer (embedding layer) is run in eager mode, and it outputs an AsyncCollectiveTensor that is fed to the first transformer block (2) ideally, we would like that AsyncCollectiveTensor graph input to desugar into a `wait_tensor()` op that shows up at the beginning of the graph. (3) the way this is supposed to happen is: AOTAutograd traces through the __torch_dispatch__ of AsyncCollectiveTensor, tracing out a `wait_tensor()` call before dispatching to any of the other ops in the function we are tracing (4) however: `trigger_wait()` was getting called in a way where we would ignore its output (and return `self.elem` directly), which would cause the `wait_tensor` ops to get DCE'd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125677 Approved by: https://github.com/wanchaol, https://github.com/yifuwang ghstack dependencies: #125676	2024-05-08 15:54:01 +00:00
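A toy illustration of the dead-code-elimination problem described in the commit above, assuming a drastically simplified graph representation. `Node`, `dce`, and the op labels are hypothetical stand-ins written as plain Python, not AOTAutograd or FX APIs, and a real pass also has to reason about side effects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str            # name of the value this node produces
    op: str              # operation label (a plain string in this toy model)
    inputs: tuple = ()   # names of the values this node consumes


def dce(nodes, outputs):
    """Keep only nodes whose results are (transitively) needed by `outputs`."""
    live = set(outputs)
    kept = []
    for node in reversed(nodes):
        if node.name in live:
            kept.append(node)
            live.update(node.inputs)
    return list(reversed(kept))


# Buggy trace: the result of wait_tensor is ignored and the raw input is used.
buggy = [
    Node("waited", "wait_tensor", ("x",)),
    Node("y", "mul", ("x", "w")),
]

# Fixed trace: downstream ops consume the output of wait_tensor.
fixed = [
    Node("waited", "wait_tensor", ("x",)),
    Node("y", "mul", ("waited", "w")),
]

buggy_ops = [n.op for n in dce(buggy, ["y"])]
fixed_ops = [n.op for n in dce(fixed, ["y"])]
print(buggy_ops)  # the wait_tensor node has been eliminated
print(fixed_ops)  # the wait_tensor node survives
```

In the buggy trace nothing depends on the `wait_tensor` result, so the pass is free to drop it; routing later ops through that result keeps it alive.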
Brian Hirsh	5d97c22845	AOTAutograd: use info not debug logging for ViewAndMutationMeta (#125676 ) Before, the AOTAutograd metadata would not get logged when running with `TORCH_LOGS="aot"` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125676 Approved by: https://github.com/albanD	2024-05-08 15:54:01 +00:00
Catherine Lee	6f619cc727	[ez] functorch/test_vmap and test_dataloader to run in parallel (#125597 ) Also mark test_svd serial in linalg to see if it helps with the flakiness Pull Request resolved: https://github.com/pytorch/pytorch/pull/125597 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi	2024-05-08 15:37:29 +00:00
PyTorch UpdateBot	bd2635578b	[vision hash update] update the pinned vision hash (#125521 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125521 Approved by: https://github.com/pytorchbot	2024-05-08 15:34:36 +00:00
PyTorch MergeBot	2e237fcd70	Revert "[inductor] add cpp builder code. (#124045 )" This reverts commit 469383755fe416eb1c41fa724762ad3eaecdff07. Reverted https://github.com/pytorch/pytorch/pull/124045 on behalf of https://github.com/clee2000 due to broke inductor/test_codecache and inductor/test_max_autotune `469383755f` https://github.com/pytorch/pytorch/actions/runs/8996772350/job/24724775182 ([comment](https://github.com/pytorch/pytorch/pull/124045#issuecomment-2100851419))	2024-05-08 15:33:20 +00:00
James Wu	c5b6c696c1	Start refactoring runtime wrappers (#125595 ) This is the first PR in a series where I try to organize our runtime wrappers a bit: specifically, I'd like to separate wrappers into objects that have (up to) 2 methods: A pre-compile function, which takes in flat_fn and flat_args (inputs to the compiler) and wraps/modifies them A post-compile function, which takes in a compiled_fn and runtime args and wraps the compiled_function. Extra metadata necessary to run the compile functions can be stored on the attributes of the class. This way, when we think about caching, the set of attributes on the class should be the exact set of metadata that we need to serialize and save in the cache (along with common data, like fw_metadata) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125595 Approved by: https://github.com/bdhirsh	2024-05-08 15:20:36 +00:00
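A minimal sketch of the wrapper shape described in the commit above: an object with a pre-compile method, a post-compile method, and attributes that hold the metadata a cache would need. The class name `DedupeArgsWrapper`, its attribute, and the identity "compiler" are illustrative assumptions, not the classes introduced by the PR.

```python
from dataclasses import dataclass, field


@dataclass
class DedupeArgsWrapper:
    # Metadata computed at compile time. In this sketch it is the only state
    # needed to rebuild the runtime wrapper, so it is what a cache would store.
    source_index: list = field(default_factory=list)

    def pre_compile(self, flat_fn, flat_args):
        """Wrap flat_fn so the compiler only sees each distinct argument once."""
        unique, first_seen = [], {}
        self.source_index = []
        for arg in flat_args:
            key = id(arg)
            if key not in first_seen:
                first_seen[key] = len(unique)
                unique.append(arg)
            self.source_index.append(first_seen[key])
        index = list(self.source_index)

        def deduped_fn(*unique_args):
            # Re-expand the unique arguments into the original calling convention.
            return flat_fn(*[unique_args[i] for i in index])

        return deduped_fn, unique

    def post_compile(self, compiled_fn):
        """Wrap compiled_fn so callers can keep passing the original arguments."""
        index = list(self.source_index)
        n_unique = max(index) + 1 if index else 0
        keep_positions = [index.index(i) for i in range(n_unique)]

        def runtime_fn(*args):
            return compiled_fn(*[args[p] for p in keep_positions])

        return runtime_fn


def flat_fn(a, b, c):
    return a + b + c


shared, other = [1, 2], [3]
wrapper = DedupeArgsWrapper()
fn, compile_args = wrapper.pre_compile(flat_fn, [shared, other, shared])
compiled = fn  # stand-in for a real compiler: returns the function unchanged
runtime = wrapper.post_compile(compiled)
result = runtime(shared, other, shared)
print(len(compile_args), result)
```

Keeping all compile-time state on the instance means that serializing the instance is enough to reconstruct `runtime_fn` later, which is the caching property the commit message points at.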
soulitzer	13462ecd27	Update preserve_node_meta to reset torch.fx.traceback.current_meta (#125500 ) Fixes https://github.com/pytorch/pytorch/issues/122766 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125500 Approved by: https://github.com/xmfan, https://github.com/ezyang	2024-05-08 14:30:34 +00:00
Aaron Gokaslan	8cad88e1f3	[BE]: Improve exception typing. Remove NOQAs (#125535 ) Improve some exception typing Pull Request resolved: https://github.com/pytorch/pytorch/pull/125535 Approved by: https://github.com/albanD	2024-05-08 14:07:13 +00:00
Sun, Jiayi	82b7b59d2a	[inductor] Check if n is the input tensor of conv_pointwise (#125119 ) Fix https://github.com/pytorch/pytorch/issues/124837. Check whether n is the input tensor of convolution_pointwise or qconv2d_pointwise, if so freeze the layout to channels_last. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125119 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2024-05-08 13:25:49 +00:00
Yu, Guangye	d17be10df1	make torch.amp.autocast more generic (#125103 ) # Motivation As discussed in [#124479](https://github.com/pytorch/pytorch/pull/124479), `torch.amp.autocast` cannot be completely equivalent to `torch.cuda.amp.autocast` and `torch.cpu.amp.autocast`, since `torch.amp.autocast` does not have the default `dtype` for CPU (`torch.bfloat16` by default) and CUDA (`torch.float16` by default) respectively. We would like `torch.amp.autocast` to be more generic to help developers and customers write device-agnostic code, because there is not enough reason to add a device-specific autocast `torch.xxx.amp.autocast` for each device backend. # Solution When `None` is passed to `dtype`, we should use `torch.get_autocast_dtype` to get the related dtype for each backend. Meanwhile, `torch.get_autocast_dtype` needs to be supported in the JIT path for BC. # Additional Context With this PR, `torch.amp.autocast(device_type='cuda')` is equivalent to `torch.cuda.amp.autocast`. Two new UTs are added to cover this change in the eager and JIT paths respectively. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125103 Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui	2024-05-08 12:13:26 +00:00
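The dtype-resolution rule in the commit above can be modelled in a few lines of plain Python. This is only a sketch: strings stand in for torch dtypes, and the lookup table and `resolve_autocast_dtype` are hypothetical stand-ins for the per-backend query that the commit attributes to `torch.get_autocast_dtype`.

```python
# Per-device defaults as stated in the commit message:
# bfloat16 on CPU, float16 on CUDA. Strings stand in for torch dtypes.
DEFAULT_AUTOCAST_DTYPE = {"cpu": "bfloat16", "cuda": "float16"}


def resolve_autocast_dtype(device_type, dtype=None):
    """Return the explicit dtype if one is given, else the device's default."""
    if dtype is not None:
        return dtype
    try:
        return DEFAULT_AUTOCAST_DTYPE[device_type]
    except KeyError:
        raise ValueError(
            f"no default autocast dtype registered for {device_type!r}"
        ) from None


cpu_default = resolve_autocast_dtype("cpu")
cuda_default = resolve_autocast_dtype("cuda")
cuda_explicit = resolve_autocast_dtype("cuda", dtype="bfloat16")
print(cpu_default, cuda_default, cuda_explicit)
```

With the default resolved per device, a single device-agnostic entry point can behave like the device-specific variants when no `dtype` is passed.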
lezcano	320af5eaa6	Compute bounds for the variables created during codegen (#123100 ) Before we would just bail out on these bounds for all variables that did not come from the FX graph. Now we propagate the bounds whenever we have a rule for that op. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123100 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-05-08 08:14:06 +00:00
Chien-Chin Huang	15a9770225	[DSD] Implement broadcast_from_rank0 option for optim state_dict (#125339 ) Summary: This is useful if users would like to avoid CPU memory OOM when loading from a full state_dict. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125339 Approved by: https://github.com/weifengpy ghstack dependencies: #125708, #125338	2024-05-08 07:22:20 +00:00
Chien-Chin Huang	0542fd485f	[DSD] Implement broadcast_from_rank0 option for model state_dict (#125338 ) Summary: This is useful if users would like to avoid CPU memory OOM when loading from a full state_dict. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125338 Approved by: https://github.com/weifengpy ghstack dependencies: #125708	2024-05-08 07:11:18 +00:00
Chien-Chin Huang	88fbe79550	[DSD] Fix set_optimizer_state_dict() changes the parameters with some optimizers (#125708 ) Summary: Some optimizers, like AdamW, change the parameters even if gradients are zero. So `set_optimizer_state_dict()` may affect the parameters values with these optimizers. This PR fixes the issue. This PR also fixes https://github.com/pytorch/pytorch/issues/121186. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125708 Approved by: https://github.com/wz337	2024-05-08 06:57:20 +00:00
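To see why an optimizer such as AdamW can move parameters even when every gradient is zero, here is a scalar sketch of the standard AdamW update with decoupled weight decay. It is not the code touched by the PR; `adamw_step` and its default hyperparameters are assumptions chosen for illustration.

```python
import math


def adamw_step(param, grad, exp_avg, exp_avg_sq, step,
               lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
    """One scalar AdamW update; returns (param, exp_avg, exp_avg_sq)."""
    beta1, beta2 = betas
    # Decoupled weight decay: applied to the parameter regardless of the gradient.
    param = param * (1.0 - lr * weight_decay)
    # Moment estimates stay at zero when the gradient is zero.
    exp_avg = beta1 * exp_avg + (1.0 - beta1) * grad
    exp_avg_sq = beta2 * exp_avg_sq + (1.0 - beta2) * grad * grad
    bias1 = 1.0 - beta1 ** step
    bias2 = 1.0 - beta2 ** step
    denom = math.sqrt(exp_avg_sq / bias2) + eps
    param = param - lr * (exp_avg / bias1) / denom
    return param, exp_avg, exp_avg_sq


new_param, new_avg, new_avg_sq = adamw_step(
    param=1.0, grad=0.0, exp_avg=0.0, exp_avg_sq=0.0, step=1
)
print(new_param)  # slightly below 1.0 because of weight decay alone
```

The gradient-driven term is exactly zero here, so the whole change comes from the weight-decay factor `1 - lr * weight_decay`. Any code path that runs a step with zero gradients therefore needs to guard against this.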
Xu Han	469383755f	[inductor] add cpp builder code. (#124045 ) The previous full PR https://github.com/pytorch/pytorch/pull/115248 failed to merge because fb_code is hard to debug. I also tried to submit it as two pieces, https://github.com/pytorch/pytorch/pull/118514 https://github.com/pytorch/pytorch/pull/118515, and they passed PreCI at that time. Now I am splitting https://github.com/pytorch/pytorch/pull/115248 into smaller pieces, and this is the first step of RFC https://github.com/pytorch/pytorch/issues/124245. Changes: 1. Add cpp builder code; the new cpp_builder supports Windows. 2. Add a CPU ISA checker which is cross-OS and exported from the backend cpuinfo. 3. Switch the compiler ISA checker to the new cpp builder. 4. CppCodeCache uses the new ISA checker. 5. Add a temporary `test_new_cpp_build_logical` UT to help with the transfer to the new code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124045 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-05-08 05:27:15 +00:00
Giuseppe Ottaviano	08f6ef0e1c	[caffe2] Lazily symbolize backtrace in c10::Error (#125682 ) Summary: The macros that build `c10::Error` compute the stack trace at the point of throwing, which is then returned as part of the `what()`. If `what()` is never called, which is the case for most exceptions (since logging is throttled), the cost of computing the stack trace was wasted. By far, the most expensive part of computing the stack trace is its symbolization; just unwinding the stack and collecting the instruction addresses is comparatively cheap. We can thus defer the symbolization to first invocation of `what()`. Test Plan: Added unit tests exercising the lazy nature of `what()`. Ran an adfinder canary: https://www.internalfb.com/intern/ads/canary/460118801509424346 We can see that the cost of symbolization is obliterated (meaning that `what()` is virtually never called, as expected): {F1496627896} Reviewed By: ezyang Differential Revision: D56586844 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125682 Approved by: https://github.com/ezyang	2024-05-08 04:57:59 +00:00
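The deferral described in the commit above can be sketched by analogy in Python: record cheap raw frame data when the error is created, and do the expensive formatting only on the first call to `what()`. `LazyTraceError` and its counter are hypothetical names for this illustration, `sys._getframe` is a CPython-specific helper, and string formatting stands in for real symbolization.

```python
import sys


class LazyTraceError(Exception):
    """Capture cheap frame data at construction; format it only in what()."""

    symbolize_calls = 0  # counts how often the expensive step actually ran

    def __init__(self, message):
        super().__init__(message)
        self._message = message
        self._what = None
        # Cheap part: walk the stack and record raw locations only.
        self._raw_frames = []
        frame = sys._getframe(1)
        while frame is not None:
            code = frame.f_code
            self._raw_frames.append((code.co_filename, frame.f_lineno, code.co_name))
            frame = frame.f_back

    def what(self):
        # Expensive part (stand-in for symbolization), done at most once.
        if self._what is None:
            type(self).symbolize_calls += 1
            lines = [
                f"  {name} at {filename}:{lineno}"
                for filename, lineno, name in self._raw_frames
            ]
            self._what = self._message + "\n" + "\n".join(lines)
        return self._what


err = LazyTraceError("boom")
calls_before = LazyTraceError.symbolize_calls
first = err.what()
second = err.what()
calls_after = LazyTraceError.symbolize_calls
print(calls_before, calls_after)
```

Errors that are created but never inspected pay only for the stack walk, and repeated calls to `what()` reuse the cached string.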
Pruthvi Madugundu	a1a22a22d5	[ROCm] Parameterize the triton build dir (#125420 ) - Removes hard coding and helps in internal builds Pull Request resolved: https://github.com/pytorch/pytorch/pull/125420 Approved by: https://github.com/malfet	2024-05-08 04:46:52 +00:00
Wanchao Liang	50073127b5	[tp] add some test for shard output layouts for rowwise parallel (#125713 ) as titled. This is to make sure everything works as expected if we configure RowwiseParallel output layouts to be sharded Pull Request resolved: https://github.com/pytorch/pytorch/pull/125713 Approved by: https://github.com/XilunWu ghstack dependencies: #125693, #125695	2024-05-08 03:45:34 +00:00
Wanchao Liang	9a2375b6b7	[dtensor] improve some pretty print in op schema (#125695 ) as titled. When I debugged https://github.com/pytorch/pytorch/pull/125369 I found these would be quality-of-life improvements Pull Request resolved: https://github.com/pytorch/pytorch/pull/125695 Approved by: https://github.com/yifuwang, https://github.com/XilunWu ghstack dependencies: #125693	2024-05-08 03:45:34 +00:00
Wanchao Liang	65fec7bbbf	[dtensor] make sure meta tensor random op does not alternate rng state (#125693 ) as titled, for meta tensor ops, we should avoid calling the RNGTracker, which could potentially alter the current RNG state. Meta tensor ops should be no-ops; only the initialization after `to_empty` should actually alter the RNG state Pull Request resolved: https://github.com/pytorch/pytorch/pull/125693 Approved by: https://github.com/XilunWu	2024-05-08 03:45:29 +00:00
Angela Yi	38baa02a40	Meta kernel for _pack_padded_sequence (#124794 ) Summary: Op implementation: `8cf54929e3/aten/src/ATen/native/PackedSequence.cpp (L34)` Fixes https://fb.workplace.com/groups/pytorch.edge.users/permalink/1499571650913123/ I'm not entirely sure how to test this meta kernel. Differential Revision: D56478332 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124794 Approved by: https://github.com/ezyang	2024-05-08 03:11:22 +00:00
briancoutinho	2deea9e6e9	[profiler] enable CUPTI range profiler in build (#125685 ) Fixes #125272 ## About (This is a re-spin of PR #106617) Kineto introduced a new profiler to read performance counters from NVIDIA GPUs (CUPTI Range Profiler API) added in PR [75616](https://github.com/pytorch/pytorch/pull/75616). Support for the range profiler mode was disabled as we had to link with a NV PerfWorks library (`libnvperf_host.so`). This PR adds that link. The change includes: * Updates cmake build files to find `libnvperf_host.so` and set `CUDA_nvperf_host_LIBRARY` * WIP use the above cmake variable in kineto, will update this PR after kineto PR has landed See https://github.com/pytorch/kineto/pull/724 ## Example usage of CUPTI profiler The code snippet below shows how to configure pytorch profiler in CUPTI Profiler mode. Any code included in the profiling window will be profiled by CUPTI/Kineto. Note how the `_ExperimentalConfig` struct is used to configure profiler metrics
```
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True,
    on_trace_ready=trace_handler,
    experimental_config=torch.profiler._ExperimentalConfig(
        profiler_metrics=[
            "kineto__tensor_core_insts",
            "dram__bytes_read.sum",
            "dram__bytes_write.sum"],
        profiler_measure_per_kernel=False),
) as prof:
    res = train_batch(modeldef)
    prof.step()
```
For a full example see this [xor.py](https://gist.github.com/briancoutinho/b1ec7919d8ea2bf1f019b4f4cd50ea80) gist. ### Details of how to configure CUPTI profiler The `_ExperimentalConfig` structure can be used to pass metrics to profiler ``` profiler_metrics : a list of CUPTI profiler metrics used to measure GPU performance events. Any metric supported by CUPTI can be used, see https://docs.nvidia.com/cupti/r_main.html#r_profiler There are two special alias metrics `kineto__tensor_core_insts` and `kineto__cuda_core_flops` for FLOPS counting. 
profiler_measure_per_kernel (bool) : whether to profile metrics per kernel or for the entire measurement duration. ``` ## Testing Built from source with kineto [PR](https://github.com/pytorch/kineto/pull/724) ``` $> USE_CUDA=1 python setup.py install -- CUDA_cupti_LIBRARY = /public/apps/cuda/11.6/extras/CUPTI/lib64/libcupti.so -- CUDA_nvperf_host_LIBRARY = /public/apps/cuda/11.6/extras/CUPTI/lib64/libnvperf_host.so ``` Then run example [xor.py](https://gist.github.com/briancoutinho/b1ec7919d8ea2bf1f019b4f4cd50ea80). This works only on V100+ GPUs. Adding logs for debugging etc. ``` >$ export KINETO_LOG_LEVEL=1 >$ python xor.py INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:167] CUDA versions. CUPTI: 16; Runtime: 11060; Driver: 11040 Log file: /tmp/libkineto_activities_1683060.json Trace start time: 2023-02-11 19:11:47 Trace duration: 500ms Warmup duration: 0s Max GPU buffer size: 128MB Enabled activities: cuda_profiler_range Cupti Profiler metrics : kineto__tensor_core_insts, dram__bytes_read.sum, dram__bytes_write.sum Cupti Profiler measure per kernel : 0 Cupti Profiler max ranges : 10 INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:638] Enabling GPU tracing INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:567] Running child profiler CuptiRangeProfiler for 500 ms INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:104] Configuring 3 CUPTI metrics INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109] sm__inst_executed_pipe_tensor.sum INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109] dram__bytes_read.sum INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109] dram__bytes_write.sum INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:575] Running child profiler CuptiRangeProfiler for 500 ms INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:672] Tracing starting in 9s INFO:2023-02-11 19:11:37 1683060:1683060 
CuptiActivityProfiler.cpp:677] Tracing will end in 10s STAGE:2023-02-11 19:11:37 1683060:1683060 ActivityProfilerController.cpp:310] Completed Stage: Warm Up INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:693] Starting child profiler session ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125685 Approved by: https://github.com/sraikund16	2024-05-08 02:34:31 +00:00
SandishKumarHN	9fedf41b60	Dockerfile should set the syntax directive to v1 (#125632 ) Fixes #125526 [#1811](https://github.com/pytorch/builder/issues/1811) Adopt syntax=docker/dockerfile:1, which has been stable since 2018 and is still best practice to declare in 2024. - Syntax features dependent upon the [syntax directive version are documented here](https://hub.docker.com/r/docker/dockerfile). - While you can set a fixed minor version, [Docker officially advises to only pin the major version](https://docs.docker.com/build/dockerfile/frontend/#stable-channel): ``` We recommend using docker/dockerfile:1, which always points to the latest stable release of the version 1 syntax, and receives both "minor" and "patch" updates for the version 1 release cycle. BuildKit automatically checks for updates of the syntax when performing a build, making sure you are using the most current version. ``` Support for building with Docker prior to v23 (released on Feb 2023) NOTE: 18.06 may not be the accurate minimum version for using docker/dockerfile:1, according to the [DockerHub tag history](https://hub.docker.com/layers/docker/dockerfile/1.0/images/sha256-92f5351b2fca8f7e2f452aa9aec1c34213cdd2702ca92414eee6466fab21814a?context=explore) 1.0 of the syntax seems to be from Dec 2018, which is probably why docker/dockerfile:experimental was paired with it in this file. Personally, I'd favor only supporting builds with Docker v23. This is only relevant for someone building this Dockerfile locally, the user could still extend the already built and published image from a registry on older versions of Docker without any concern for this directive which only applies to building this Dockerfile, not images that extend it. 
However, if you're reluctant, you may want to refer others to [this Docker docs page](https://docs.docker.com/build/buildkit/#getting-started) where they should only need the ENV DOCKER_BUILDKIT=1, presumably the requirement for experimental was dropped with syntax=docker/dockerfile:1 with releases of Docker since Dec 2018. Affected users can often quite easily install a newer version of Docker on their OS, as per Docker's official guidance (usually via including an additional repo to the package manager). Reference links Since one of these was already included in the inline note (now a broken link), I've included relevant links mentioned above. You could alternatively rely on git blame with a commit message referencing the links or this PR for more information. Feel free to remove any of the reference links, they're mostly only relevant to maintainers to be aware of (which this PR itself has detailed adequately above). Pull Request resolved: https://github.com/pytorch/pytorch/pull/125632 Approved by: https://github.com/malfet	2024-05-08 01:52:56 +00:00
Denis Vieriu	58e045d03c	[MPS] Fix strided ELU op (#125692 ) Fixes https://github.com/pytorch/pytorch/issues/124834 Summary of changes: In case of non-contiguous input, the output would be non-contiguous too. At the moment it's not supported to save the result to a non-contiguous buffer, thus we need two steps, one to allocate a contiguous buffer and the second one to scatter the result back to the original output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125692 Approved by: https://github.com/kulinseth	2024-05-08 01:34:40 +00:00
Michael Suo	21aaac47e7	[torchelastic] add timing events to different stages of rendezvous (#125636 ) Summary: as title Test Plan: unit tests. Launched a test job and observed scuba results: {F1506543300} Differential Revision: D57018103 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125636 Approved by: https://github.com/d4l3k	2024-05-08 01:14:23 +00:00
BowenBao	a3d97f6ce4	[ONNX] Benchmark onnx export w/ ort fusions (#125700 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125700 Approved by: https://github.com/thiagocrepaldi	2024-05-08 01:10:05 +00:00
eellison	baf36f6d11	Pad bandwidth bound split k kernels on a100 (#125650 ) For
```
import torch
import triton

dtype = torch.bfloat16
t1 = torch.empty([2, 20569856], dtype=dtype, device="cuda")
t2 = torch.empty([20569856, 13], dtype=dtype, device="cuda")


@torch.compile()
def benchmark(t1, t2):
    return torch.ops.aten.mm(t1, t2)


print(triton.testing.do_bench(lambda: benchmark(t1, t2)))
```
Improves perf from 449ms -> 1.2578779458999634ms Pull Request resolved: https://github.com/pytorch/pytorch/pull/125650 Approved by: https://github.com/bertmaher	2024-05-08 01:04:35 +00:00
Denis Vieriu	ba27548679	[MPS] Remove in place views (causes too many crashes) (#124895 ) Fixes https://github.com/pytorch/pytorch/issues/96153 Remove in place views as they are a general cause for many crashes. Proper fix to handle views without copies will come in a different PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124895 Approved by: https://github.com/kulinseth	2024-05-08 01:00:37 +00:00
Denis Vieriu	3fb53bb6a7	[MPS] Fix strided mse_loss (#125696 ) Fixes https://github.com/pytorch/pytorch/issues/124621 Summary of changes: - In case of non-contiguous input, the output would be non-contiguous too. At the moment it's not supported to save the result to a non-contiguous buffer, thus we need two steps, one to allocate a contiguous buffer and the second one to scatter the result back to the original output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125696 Approved by: https://github.com/kulinseth	2024-05-08 00:52:26 +00:00
Joel Schlosser	939b701d3a	SymInt-ify mem-efficient attention forward op signature (#125418 ) Need this for dynamic shapes! Before this PR, guards on constant min / max seq len values are introduced when SDPA calls mem-efficient attention. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125418 Approved by: https://github.com/soulitzer	2024-05-07 23:59:28 +00:00
Lucas Pasqualin	bb6ba31250	[DCP] Adds storage metadata, and passes it during the save path (#124772 ) This PR seeks to increase observability of save/load requests. This is accomplished with two main changes: 1. The creation of save_id and load_id: - a save_id and a load_id are added to the filesystem writer. `save_id` is re-generated on every save call, and `load_id` is also re-generated on every load call. - both of these IDs are stored in a new `StorageMeta` class, and saved as part of Metadata. (`load_id` is None when we save, and only set during load) 2. A new mechanism is implemented in the save path which gives the SavePlanner a chance to inspect the `storage_meta` object. The mechanism mirrors the same metadata exchange in the load path. In the load path, `storage_meta` is added to `metadata` such that the LoadPlanner can also access `storage_meta` before we begin loading. If users now wish to access the checkpoint_id in the SavePlanner, they simply need to access the value in `storage_meta` from the `set_up_planner` call. Additionally, users now have a generic way of passing data to the SavePlanner from the StorageWriter at the start of the save path, similar to the load path. This PR has been tested for backwards compatibility, meaning any checkpoints saved before this PR can continue being loaded after this PR. One major consideration is that there is limited forwards compatibility. If a checkpoint is generated _past_ this PR, there is no support for loading it using older torch versions. This brings up a fairly important point: since we expect the metadata object (which is saved to the disk) to continue evolving, and we want to support forwards compatibility, we explore patching `pickle` so we can at least add new members to `metadata` and maintain fwd compat. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124772 Approved by: https://github.com/fegin	2024-05-07 23:53:53 +00:00
Richard Howell	244d93039d	Remove fbobjc_configs from xplat (#125586 ) Summary: Pull out the configs attributes in xplat targets, these no longer do anything. Test Plan: ``` $ buck2 uquery //xplat/... > /dev/null ``` Reviewed By: d16r Differential Revision: D56855974 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125586 Approved by: https://github.com/malfet	2024-05-07 23:48:20 +00:00
Nikita Shulga	8b4d62009d	[EZ] Update jinja2 to 3.1.4 (#125698 ) To address https://cwe.mitre.org/data/definitions/79.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/125698 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi	2024-05-07 23:40:54 +00:00
angelayi	8be4c1bc2f	[export] Add metadata for nodes insert_deferred_runtime_asserts (#125414 ) Fixes [internal error](https://fb.workplace.com/groups/1075192433118967/permalink/1416709435633930/). The issue is that the asserting nodes added in the `insert_deferred_runtime_assertion` pass do not contain metadata that the ExportedProgram requires the graph to have. One solution to fix this is to retrace the entire module, or another solution is to manually add back this metadata. This diff implements the latter solution (manually add back the metadata) through hooking into fx.graph's `create_node` function, and adding export-specific metadata for every node that is created. The reason I did this is so that the `insert_deferred_runtime_assertion` does not have to know about what metadata export wants. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125414 Approved by: https://github.com/zhxchen17, https://github.com/BoyuanFeng	2024-05-07 23:15:21 +00:00
Zhengxu Chen	8024e72326	[export] Warn on capture_pre_autograd_graph. (#125602 ) Summary: capture_pre_autograd_graph is deprecated and torch.export won't be able to provide timely fixes for this API. To reduce confusion around this, we should explicitly give users clear warnings. Test Plan: eyes Reviewed By: tarun292 Differential Revision: D56955202 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125602 Approved by: https://github.com/angelayi	2024-05-07 22:51:17 +00:00
Nikita Shulga	021ff7fd77	[BE] Explicitly handle all types `c10::isSigned` (#125637 ) By defining a `CASE_ISSIGNED` macro that just returns `std::numeric_limits<dtype>::is_signed` for the types where it makes sense, and explicitly coding some types where it does not. Remove the `default:` case from the switch to avoid regressions like the one reported in https://github.com/pytorch/pytorch/issues/125124 , as [`-Wswitch-enum`](https://clang.llvm.org/docs/DiagnosticsReference.html#wswitch-enum) in combination with `-Werror` will raise an error in case of a missing entry, for example: ``` /Users/nshulga/git/pytorch/pytorch/c10/core/ScalarType.h:518:11: warning: enumeration value 'QInt32' not handled in switch [-Wswitch] switch (t) { ^ /Users/nshulga/git/pytorch/pytorch/c10/core/ScalarType.h:518:11: note: add missing switch cases switch (t) { ^ 1 warning generated. ``` Fixes https://github.com/pytorch/pytorch/issues/125124 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125637 Approved by: https://github.com/albanD	2024-05-07 22:51:03 +00:00
Edward Z. Yang	51f25c08f4	Fix 'Could not infer dtype of SymBool' on torch.tensor call (#125656 ) Internal xref: https://fb.workplace.com/groups/469587837192818/posts/1638909336927323/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125656 Approved by: https://github.com/albanD	2024-05-07 22:41:49 +00:00
Michael Lazos	e3d5afc60a	Enable dynamo'd test for 116499 (#123469 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123469 Approved by: https://github.com/janeyx99 ghstack dependencies: #123619	2024-05-07 22:17:01 +00:00
Michael Lazos	0f02e0aa39	Disable dynamo on functional optims if capturable=False (#123619 ) This resolves a bug in eager where if an old state dict is loaded (without the capturable flag) but the original dict had the capturable flag, then state_steps would be on cuda but we would take the non-capturable path. We now fall back to eager if capturable=False. Current design doc and discussion: https://docs.google.com/document/d/1DmmbiaSp16CDZtGw1qzXKHFTY_0gqc0xpnBdviXq0vk/edit#heading=h.871u7bvwz7ze Note on the actual fallback logic: there was an issue with torchscript originally not handling `*args, **kwargs` properly. After rectifying that by using `functools.wraps`, there was an additional bug with scoping which required the single tensor implementation to be in the global scope at the time of the fallback closure being created. I pass in the single tensor function to the `_disable_dynamo_if_unsupported` decorator to work around this bug. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123619 Approved by: https://github.com/janeyx99	2024-05-07 22:17:01 +00:00
Nikita Shulga	0fd1fc17c3	[MPS] Fix `abs` for complex types (#125662 ) By calling `realPartOfTensor:` if the input type is complex on Sonoma and falling back to the `at::view_as_real` trick on Ventura. Split the `unary_op` template into `unary_op` and `unary_op_noresize`, which skips resize and empty checks. Marked `abs`, `isclose` and `nn.functional.softsign` OpInfo tests as supported by complex types. Fixes https://github.com/pytorch/pytorch/issues/125135 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125662 Approved by: https://github.com/kulinseth	2024-05-07 22:15:20 +00:00
Shuao Xiong	2163956208	[TGIF][HHC][Sharding] add device_ordinal to Subgraph (#125616 ) Summary: Add a new field device_ordinal to Subgraph class so we can store device ordinal during splitting Test Plan: See test plan for D56535827 Differential Revision: D57010103 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125616 Approved by: https://github.com/jiayisuse	2024-05-07 21:48:59 +00:00
chilli	b356a0de86	Add support for multiple flexattention calls in a single compile (#125516 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125516 Approved by: https://github.com/yanboliang, https://github.com/drisspg	2024-05-07 21:37:37 +00:00
Angela Yi	d4225c55d9	[fx] Prioritize runtime assertions ops (#124213 ) Summary: We want to prioritize operators involved in data-dependent runtime assertions when legalizing the graph. For example, in the following piece of code, the `assert_scalar` and `assert_async` calls need to occur before the `slice_copy` for the program to run correctly with fake tensors. Otherwise we will run into a data-dependent error. ``` _local_scalar_dense: "Sym(u113)" = torch.ops.aten._local_scalar_dense.default(aten_minimum_default); aten_minimum_default = None ge_1: "Sym(u113 >= 2)" = _local_scalar_dense >= 2 aten_scalar_tensor_default_3: "f32[]" = executorch_exir_dialects_edge__ops_aten_scalar_tensor_default(ge_1); ge_1 = None aten__assert_async_msg_2 = executorch_exir_dialects_edge__ops_aten__assert_async_msg(aten_scalar_tensor_default_3, '_local_scalar_dense is outside of inline constraint [2, 1000].'); aten_scalar_tensor_default_3 = None le_1: "Sym(u113 <= 1000)" = _local_scalar_dense <= 1000 aten_scalar_tensor_default_4: "f32[]" = executorch_exir_dialects_edge__ops_aten_scalar_tensor_default(le_1); le_1 = None aten__assert_async_msg_3 = executorch_exir_dialects_edge__ops_aten__assert_async_msg(aten_scalar_tensor_default_4, '_local_scalar_dense is outside of inline constraint [2, 1000].'); aten_scalar_tensor_default_4 = None mul: "Sym(-u112)" = -1 * sym_size; sym_size = None add: "Sym(-u112 + u113)" = _local_scalar_dense + mul; mul = None lt: "Sym(-u112 + u113 < 0)" = add < 0; add = None aten__assert_scalar_default = executorch_exir_dialects_edge__ops_aten__assert_scalar_default(lt, 'Deferred runtime assertion failed -u0 + u1 < 0'); lt = None aten_slice_copy_tensor_3: "f32[u113]" = executorch_exir_dialects_edge__ops_aten_slice_copy_Tensor(getitem, 0, 0, _local_scalar_dense); getitem = None ``` Test Plan: test case Differential Revision: D56201450 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124213 Approved by: https://github.com/SherlockNoMad	2024-05-07 21:31:10 +00:00
PyTorch MergeBot	2f79a18324	Revert "[inductor] add cpp builder code. (#124045 )" This reverts commit 7864d287a1e56685aa754285cc2d3c31ff055f62. Reverted https://github.com/pytorch/pytorch/pull/124045 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing trunk jobs `7864d287a1` including lint ([comment](https://github.com/pytorch/pytorch/pull/124045#issuecomment-2099306071))	2024-05-07 21:04:49 +00:00
albanD	c5e04a4479	More accurate is_bw and prompt parents cleanup for ModuleTracker utils (#125634 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125634 Approved by: https://github.com/soulitzer, https://github.com/Chillee	2024-05-07 20:57:36 +00:00
atalman	fdfef759a6	Add userbase library dir to windows dll search path (#125684 ) Fixes https://github.com/pytorch/pytorch/issues/125109, which is a regression introduced by https://github.com/pytorch/builder/pull/1467 that adds a dynamic dependency on mkl, which, if installed in the user dir, is placed into `sysconfig.sysconfig.get_config_var("userbase") / "Library" / "bin"`. Fix this by adding the `userbase` folder to the DLL search path. Testing before this fix: ``` Python 3.12.3 (tags/v3.12.3:f6650f9, Apr 9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> import torch Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Users\Administrator\AppData\Roaming\Python\Python312\site-packages\torch\__init__.py", line 141, in <module> raise err OSError: [WinError 126] The specified module could not be found. Error loading "C:\Users\Administrator\AppData\Roaming\Python\Python312\site-packages\torch\lib\shm.dll" or one of its dependencies. >>> exit() ``` After: ``` c:\Program Files\Python312>python Python 3.12.3 (tags/v3.12.3:f6650f9, Apr 9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> import torch >>> exit() ``` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125684 Approved by: https://github.com/malfet	2024-05-07 20:43:40 +00:00
Xu Han	7864d287a1	[inductor] add cpp builder code. (#124045 ) The previous full PR https://github.com/pytorch/pytorch/pull/115248 failed to merge because fb_code is hard to debug. I also tried to submit it as two pieces, https://github.com/pytorch/pytorch/pull/118514 and https://github.com/pytorch/pytorch/pull/118515, which passed PreCI at that time. Now I am splitting https://github.com/pytorch/pytorch/pull/115248 into smaller pieces; this is the first step of RFC https://github.com/pytorch/pytorch/issues/124245. Changes: 1. Add cpp builder code; the new cpp_builder supports Windows. 2. Add a CPU ISA checker which is cross-OS and exported from backend cpuinfo. 3. Switch the compiler ISA checker to the new cpp builder. 4. CppCodeCache uses the new ISA checker. 5. Add a temporary `test_new_cpp_build_logical` UT to help the transition to the new code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124045 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-05-07 20:07:41 +00:00
Aaron Orenstein	b23b6e7108	Ensure that vmap is restored properly if an exception is thrown during frame eval (#122074 ) We save and restore the DynamicLayerStack during frame eval but since fx graph has no way to express a try/finally we just assume it will happen. If we throw an exception between the push and pop to the stack then we're left in a state that affects following operations poorly. Make sure that if it's in a bad state we restore it after frame eval. Repro: before: ``` $ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8 $ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8' ============= 1 passed, 8588 deselected in 9.75s ============= $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8' ================== short test summary info =================== FAILED [0.0632s] test/test_sparse.py::TestSparseCPU::test_log1p_cpu_uint8 - AssertionError: "only Tensors of floating point dtype can require gradients" does not match "You are attempting to call Tensor.requires_grad_() (or perhaps using torch.autograd.functional.* APIs) inside of a function ... 
======= 1 failed, 1 skipped, 8587 deselected in 10.99s ======= ``` (Note that adding test_vmap_free_tensor_dynamic_shapes causes test_vmap_free_tensor_dynamic_shapes to fail) after: ``` $ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8 $ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8' ============= 1 passed, 8588 deselected in 9.89s ============= $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8' ======= 1 passed, 1 skipped, 8587 deselected in 11.34s ======= ``` (test_vmap_free_tensor_dynamic_shapes passes either way) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122074 Approved by: https://github.com/oulgen	2024-05-07 19:36:52 +00:00
Yanbo Liang	196a0b1722	Add Inductor micro benchmark workflow (#125450 ) Co-authored-by: Huy Do <huydhn@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125450 Approved by: https://github.com/huydhn	2024-05-07 18:56:01 +00:00
PyTorch MergeBot	5fd0b6e5f7	Revert "add uuid in cudaDeviceProperties (#125083 )" This reverts commit f35fe4eaf1e9fa2e631f6bf1a3eb6e5fbf14183b. Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/clee2000 due to test_uuid is flaky. ex https://github.com/pytorch/pytorch/actions/runs/8988855916/job/24692369523 https://hud.pytorch.org/flakytest?name=test_uuid&suite=TestCuda&file=%25&limit=300 ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2099029993))	2024-05-07 18:16:27 +00:00
Chien-Chin Huang	f7d48302b6	[DSD] Fix to remove non_persistent buffer in distributed state dict (#125337 ) Summary: Fixes #122792 state_dict includes only persistent buffers, while named_buffers() would include non_persistent buffers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125337 Approved by: https://github.com/awgu ghstack dependencies: #125333, #125501, #125334, #125335, #125336	2024-05-07 17:57:34 +00:00
Chien-Chin Huang	a89177936c	[DSD] Correctly handle _extra_state (#125336 ) Summary: distributed_state_dict should not try to use `getattr` to get `_extra_state` as this is not well-defined. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125336 Approved by: https://github.com/LucasLLC ghstack dependencies: #125333, #125501, #125334, #125335	2024-05-07 17:31:33 +00:00
Thiago Crepaldi	9f1d3eebf5	Update PyTorch ONNX Exporter maintainers (#125630 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125630 Approved by: https://github.com/BowenBao, https://github.com/kit1980	2024-05-07 17:29:05 +00:00
Chien-Chin Huang	6f1e3a6bf7	[DCP] Always flatten mapping even if no tensors present (#125335 ) Summary: Right now DCP only flattens a mapping (e.g., dict) if that mapping has tensor objects. This behavior is odd, as users may save different non-tensor objects on different ranks. Without flattening the mappings, we may lose these non-tensor objects. One use case is the dataloader state_dict. We may also want to do so for a list/tuple, but this would cause extra pickles, so we don't do this for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125335 Approved by: https://github.com/LucasLLC, https://github.com/wz337 ghstack dependencies: #125333, #125501, #125334	2024-05-07 17:08:49 +00:00
Huy Do	790f43c315	Run test_inductor_distributed with run_test (#125647 ) CI features like retrying and disabling flaky tests won't be available otherwise. ### Testing https://github.com/pytorch/pytorch/actions/runs/8977431927/job/24659532123#step:15:1688 looks correct now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125647 Approved by: https://github.com/clee2000	2024-05-07 17:07:36 +00:00
Chien-Chin Huang	22767e4791	[DCP] Always create requests for non-tensor objects (#125334 ) Summary: If an object only exists on certain non-coordinator ranks, we still need to save them. Otherwise, we lose these objects. If they are duplicated, DCP will deduplicate them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125334 Approved by: https://github.com/wz337, https://github.com/LucasLLC ghstack dependencies: #125333, #125501	2024-05-07 17:04:36 +00:00
Nikita Shulga	9782439277	[Profiler] Do not emit a warning when using CPU profiler (#125654 ) This fixes a logic regression introduced by https://github.com/pytorch/pytorch/pull/123247 where ```python if self.use_device and self.use_device != _get_privateuse1_backend_name(): ``` was replaced with ```python VALID_DEVICE_OPTIONS = ["cuda", "xpu", "privateuseone"] if self.use_device not in VALID_DEVICE_OPTIONS: ``` That triggers a warning every time the code is invoked with `self.use_device` set to None. This change also skips all the checks, which are useless if `use_device` is None to begin with. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125654 Approved by: https://github.com/aaronenyeshi	2024-05-07 16:56:17 +00:00
Jez Ng	7863e04615	Back out "Get cutlass_library import working under fbcode" (#125606 ) Summary: Original commit changeset: de79f6bfe348 Differential Revision: D57002294 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125606 Approved by: https://github.com/chenyang78	2024-05-07 16:55:11 +00:00
Chien-Chin Huang	71dc15742c	[DSD] Improve the performance of distributed state_dict (#125501 ) Summary: 1. Remove gc.collect(), which is not necessary. 2. Use lru_cache to cache _get_fqns Pull Request resolved: https://github.com/pytorch/pytorch/pull/125501 Approved by: https://github.com/wz337, https://github.com/LucasLLC ghstack dependencies: #125333	2024-05-07 16:55:05 +00:00
Huy Do	0e57bbb6d7	Set timeout for C++ tests (#125517 ) Looking at the unrelated Windows timeout failure on https://github.com/pytorch/pytorch/pull/125199, it looks like we don't have a timeout value set for C++ tests atm. In this case, a C++ test on Windows timed out after 2+ hours. ``` 2024-05-02T23:35:34.0639067Z Running cpp/c10_TypeList_test 1/1 ... [2024-05-02 23:35:34.059021] 2024-05-02T23:35:34.0641108Z Executing ['pytest', 'C:\\actions-runner\\_work\\pytorch\\pytorch\\build\\win_tmp\\build\\torch\\test\\c10_TypeList_test.exe', '-m', 'not serial', '-v', '-vv', '-rfEX', '-n', '2', '--junit-xml-reruns', 'test-reports\\python-pytest\\test\\run_test\\test\\run_test-c898ddeff8f33cbf.xml', '-x', '--reruns=2'] ... [2024-05-02 23:35:34.062137] 2024-05-03T02:45:33.7862004Z Process SpawnPoolWorker-2: 2024-05-03T02:45:33.7927201Z Traceback (most recent call last): 2024-05-03T02:45:33.7928032Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 315, in _bootstrap 2024-05-03T02:45:33.7928722Z self.run() 2024-05-03T02:45:33.7929722Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 108, in run 2024-05-03T02:45:33.7931639Z self._target(self._args, self._kwargs) 2024-05-03T02:45:33.7932435Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\pool.py", line 114, in worker 2024-05-03T02:45:33.7933338Z task = get() 2024-05-03T02:45:33.7933946Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\queues.py", line 365, in get 2024-05-03T02:45:33.7935219Z res = self._reader.recv_bytes() 2024-05-03T02:45:33.7935897Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 221, in recv_bytes 2024-05-03T02:45:33.7936609Z buf = self._recv_bytes(maxlength) 2024-05-03T02:45:33.7937302Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 310, in _recv_bytes 2024-05-03T02:45:33.7938316Z waitres = _winapi.WaitForMultipleObjects( 2024-05-03T02:45:33.7938766Z KeyboardInterrupt ``` Retrying was working, but it was already too late to finish 
the job. I'm setting the same default `THRESHOLD 3` timeout value here for C++ tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125517 Approved by: https://github.com/clee2000	2024-05-07 16:41:38 +00:00
PyTorch MergeBot	1b396d69cb	Revert "[CUDNN] Remove defunct cuDNN V8 API build flag (#120006 )" This reverts commit ee4cafa098ede2d9546016223cbc1a522ea3630a. Reverted https://github.com/pytorch/pytorch/pull/120006 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm jobs in trunk `ee4cafa098` ([comment](https://github.com/pytorch/pytorch/pull/120006#issuecomment-2098849813))	2024-05-07 16:28:04 +00:00
Catherine Lee	848fce35b5	[CI][ez] Don't retry when it says don't retry (#125643 ) default arg for retry_shell is retries=1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125643 Approved by: https://github.com/huydhn	2024-05-07 16:20:00 +00:00
angelayi	0de9ce9bb3	[export] Fix serialization of empty torch artifact (#125542 ) A previous PR added support for serializing/deserializing example inputs, but this fails when `example_inputs` is none. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125542 Approved by: https://github.com/pianpwk, https://github.com/BoyuanFeng, https://github.com/ydwu4	2024-05-07 15:54:45 +00:00
Oguz Ulgen	b37bef9b13	Use triton_key instead of triton.__version__ for hash (#125624 ) Using `triton.__version__` is not correct as the version is not always updated with each code change, so we should use the proper hash function provided by triton library. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125624 Approved by: https://github.com/eellison, https://github.com/masnesral, https://github.com/jansel	2024-05-07 15:43:50 +00:00
Joel Schlosser	8573d9551a	Fix to preserve tensor wrapper subclass dtype through multiprocessing serialization (#125615 ) Fixes #125583 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125615 Approved by: https://github.com/albanD	2024-05-07 14:35:48 +00:00
atalman	b29d77b54f	Separate arm64 and amd64 docker builds (#125617 ) Fixes https://github.com/pytorch/pytorch/issues/125094 Please note: the Docker CUDA 12.4 failure is an existing issue, related to the docker image not being available on gitlab: ``` docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: not found ``` https://github.com/pytorch/pytorch/actions/runs/8974959068/job/24648540236?pr=125617 Here is the reference issue: https://gitlab.com/nvidia/container-images/cuda/-/issues/225 Tracked on our side: https://github.com/pytorch/builder/issues/1811 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125617 Approved by: https://github.com/huydhn, https://github.com/malfet	2024-05-07 11:50:54 +00:00
FFFrog	5dee46266a	Fix & optimize open device registration test. (#125572 ) 1. Fix the wrong tests about lazy init for PrivateUse1 named foo 2. Refactor the tests and make it more flexible 3. Disable the two tests temporarily - test_open_device_faketensor - test_open_device_scalar_type_fallback Pull Request resolved: https://github.com/pytorch/pytorch/pull/125572 Approved by: https://github.com/albanD	2024-05-07 08:30:01 +00:00
Michael Lazos	f0c6d6100b	Enable dynamo-traced optimizer peak memory tests (#124543 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124543 Approved by: https://github.com/yf225, https://github.com/janeyx99	2024-05-07 08:21:50 +00:00
Oguz Ulgen	5033d3ba6d	Disable fb_memcache for MTIA (#125658 ) Differential Revision: [D57035819](https://our.internmc.facebook.com/intern/diff/D57035819/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125658 Approved by: https://github.com/jamesjwu	2024-05-07 07:00:26 +00:00
Chien-Chin Huang	e72936c27c	[PT2D] Fix the circular import issue (#125618 ) As title Differential Revision: [D57011394](https://our.internmc.facebook.com/intern/diff/D57011394/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125618 Approved by: https://github.com/wz337	2024-05-07 05:10:18 +00:00
lezcano	acafabaa29	Rename TorchDynamo -> Dynamo in the dynamo tutorial doc (#123431 ) Less verbose and it aligns it with the dynamo deepdive Pull Request resolved: https://github.com/pytorch/pytorch/pull/123431 Approved by: https://github.com/peterbell10	2024-05-07 05:07:00 +00:00
Jiong Gong	058e28108f	[inductor][cpp] support int64 vertical vec reduction (fix #124821 ) (#125563 ) Fix #124821 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125563 Approved by: https://github.com/desertfire	2024-05-07 03:56:22 +00:00
David Chiu	a60fa960e5	refactor: extract `get_lr` warning (#125545 ) Extract the `_get_lr_called_within_step` check from the `get_lr()` of every LRScheduler. Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125545 Approved by: https://github.com/janeyx99	2024-05-07 03:15:58 +00:00
ydwu4	461ffaaaf3	[dynamo] support torchbind object input (#124978 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124978 Approved by: https://github.com/jansel	2024-05-07 03:02:00 +00:00
Yuanhao Ji	c165a8e71d	Enable UFMT on `test_decomp.py`, `test_expanded_weights.py` and some files (#125117 ) Part of: #123062 Ran lintrunner on: - test/test_decomp.py - test/test_deploy.py - test/test_determination.py - test/test_dlpack.py - test/test_dynamic_shapes.py - test/test_expanded_weights.py Detail: ```bash $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125117 Approved by: https://github.com/jansel	2024-05-07 02:36:40 +00:00
Shunting Zhang	48b6c8dbc3	[Inductor] log fusion failure due to index mismatch (#124986 ) The scheduler searches for fusion opportunities by looking for common memory accesses. Two memory accesses are considered common not only when the buffer names match; it also requires that - the index formula matches - var_ranges matches In this PR, I want to log all the fusion failures due to a mismatched index formula or var_ranges. I also want to further categorize the failures. Right now I found the following failure categories - rand_seed: the index for rand seed access is an integer and different accesses use different integer offsets - different numel: this happens for the cat operation - broadcast: e.g. kernel A writes a buffer which is broadcast and read by kernel B - different loop orders: the major category we want inductor to be able to fuse - different offset: happens when using a concatenated linear layer to project Q/K/V and then splitting the result. Each split will point to the same buffer with a different offset. - unknown My hope is to make sure that, for the models I tested, no fusion failure falls in the unknown category, so all the failures are well understood and categorized. Right now that is true for BertForMaskedLM ( https://gist.github.com/shunting314/6dc2c903629d342fa63ba731a171adc2 ), DistillGPT2 ( https://gist.github.com/shunting314/145176f2e850103c7fad4ad72f0e200e ) and llm.c ( https://gist.github.com/shunting314/cfc64a326312a889ba55f79bd47b2082 ) For BertForMaskedLM, we found 82 instances of fusion failures and the majority of them are due to different loop orders! Studying the log a bit more can help us figure out where all these loop order mismatches come from in real models. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124986 Approved by: https://github.com/eellison, https://github.com/jansel	2024-05-07 02:29:00 +00:00
Jeff Daily	f35fe4eaf1	add uuid in cudaDeviceProperties (#125083 ) Replaces #99967. Fixes #99903. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083 Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy	2024-05-07 01:26:01 +00:00
Angela Yi	4332fc4095	[export] Allow constant attr mutation (#125424 ) Test Plan: CI Differential Revision: D56893728 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125424 Approved by: https://github.com/pianpwk	2024-05-07 00:34:57 +00:00
Apurva	c0c2f6156a	Updated docs to add the error case for torch.multinomial Issue#125388 (#125495 ) Summary: Updated docs to add the error condition for torch.multinomial Test Plan: No change in code Reviewers: @drisspg Subscribers: @drisspg Tasks: Tags: Fixes #125388 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125495 Approved by: https://github.com/drisspg	2024-05-07 00:26:27 +00:00
Mark Saroufim	3407899ba1	DTensor Fused ADAM (#125369 ) Fixes https://github.com/pytorch/pytorch/issues/124633 https://github.com/pytorch/ao/issues/205 ``` (pt) [marksaroufim@devvm17057.vll0 ~/pytorch (dfusedadam)]$ pytest test/distributed/_tensor/test_optimizers.py -s -k adamw_1d_sharding ===================================================================================== test session starts ====================================================================================== platform linux -- Python 3.9.19, pytest-7.4.0, pluggy-1.5.0 rootdir: /home/marksaroufim/pytorch configfile: pytest.ini plugins: hypothesis-6.100.2 collected 10 items / 9 deselected / 1 selected Running 1 items in this shard test/distributed/_tensor/test_optimizers.py . =============================================================================== 1 passed, 9 deselected in 5.95s ================================================================================ (pt) [marksaroufim@devvm17057.vll0 ~/pytorch (dfusedadam)]$ pytest test/distributed/_tensor/test_optimizers.py -s -k adam_1d_sharding ===================================================================================== test session starts ====================================================================================== platform linux -- Python 3.9.19, pytest-7.4.0, pluggy-1.5.0 rootdir: /home/marksaroufim/pytorch configfile: pytest.ini plugins: hypothesis-6.100.2 collected 10 items / 7 deselected / 3 selected Running 3 items in this shard test/distributed/_tensor/test_optimizers.py ... =============================================================================== 3 passed, 7 deselected in 10.79s =============================================================================== (pt) [marksaroufim@devvm17057.vll0 ~/pytorch (dfusedadam)]$ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125369 Approved by: https://github.com/wanchaol	2024-05-07 00:08:09 +00:00
Nikita Shulga	65fc3c31bc	[BE] Delete unused `AT_FORALL_SCALAR_TYPES_AND[456]` (#125607 ) Check that they are not used by running the following ``` % grep -h "AT_FORALL_SCALAR_TYPES_AND" . -R\|grep -v #define\|cut -d\( -f1\|sort\|uniq AT_FORALL_SCALAR_TYPES_AND3 AT_FORALL_SCALAR_TYPES_AND3 AT_FORALL_SCALAR_TYPES_AND AT_FORALL_SCALAR_TYPES_AND2 AT_FORALL_SCALAR_TYPES_AND3 AT_FORALL_SCALAR_TYPES_AND7 AT_FORALL_SCALAR_TYPES_AND2 AT_FORALL_SCALAR_TYPES_AND3 AT_FORALL_SCALAR_TYPES_AND7 AT_FORALL_SCALAR_TYPES_AND2 AT_FORALL_SCALAR_TYPES_AND3 AT_FORALL_SCALAR_TYPES_AND7 // AT_FORALL_SCALAR_TYPES / AT_FORALL_SCALAR_TYPES_AND macros below, which are AT_FORALL_SCALAR_TYPES_AND AT_FORALL_SCALAR_TYPES_AND2 AT_FORALL_SCALAR_TYPES_AND3 AT_FORALL_SCALAR_TYPES_AND7 using at::Half; // for AT_FORALL_SCALAR_TYPES_AND3 ``` or by checking online using https://github.com/search?type=code&q=AT_FORALL_SCALAR_TYPES_AND4+repo%3Apytorch%2Fpytorch Pull Request resolved: https://github.com/pytorch/pytorch/pull/125607 Approved by: https://github.com/albanD	2024-05-07 00:01:35 +00:00
Zhirui Dai	3411d54811	fix loading optimizer options from archive (#125215 ) This PR makes libtorch behave the same as PyTorch when loading optimizer state from archive. With PyTorch, options of parameter groups are loaded from the archive, which is missing currently in libtorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125215 Approved by: https://github.com/janeyx99	2024-05-06 23:58:40 +00:00
eqy	ee4cafa098	[CUDNN] Remove defunct cuDNN V8 API build flag (#120006 ) The flag basically does nothing following #95722 Let's see if the quantization tests break CC @malfet @atalmanagement Pull Request resolved: https://github.com/pytorch/pytorch/pull/120006 Approved by: https://github.com/malfet	2024-05-06 23:13:58 +00:00
Joel Schlosser	b98c689261	Better repro command: include test class + fix paths for py3.8 (#125498 ) Fixes #117850 This PR: * Adds the class name in the repro command * Fixes the path to the test file for python 3.8 jobs (apparently `inspect.getfile(class_type)` returns a relative path in this older python version) Before (in python 3.8): ```sh PYTORCH_TEST_WITH_DYNAMO=1 python test_autograd.py -k test_foo ``` After: ```sh PYTORCH_TEST_WITH_DYNAMO=1 python test/test_autograd.py -k TestAutograd.test_foo ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125498 Approved by: https://github.com/huydhn, https://github.com/janeyx99	2024-05-06 22:19:12 +00:00
Oguz Ulgen	22bcfc25ef	Initial implementation of Inductor FX Graph Remote Cache (#124669 ) This diff implements a remote caching strategy (memcache for internal and redis for external) for caching the mapping from Inductor FX Graph to the Inductor-generated wrapper file. It uses the same idea as the autotuning result cache that is currently live. This will land turned off; before turning it on by default, I will do more testing, including looking at the dynamic shape guards added by inductor. Differential Revision: [D56441624](https://our.internmc.facebook.com/intern/diff/D56441624/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124669 Approved by: https://github.com/jansel, https://github.com/eellison	2024-05-06 22:10:27 +00:00
David Berard	05bd7fe3eb	Nested Tensor + AOTI test (#125513 ) Since we expect AOTI to be important for serving NJT in the future, I'm adding a test demonstrating that AOTI currently works with NJT when NJT is entirely in the graph (no NJTs going in or out), and to prevent regressions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125513 Approved by: https://github.com/angelayi, https://github.com/desertfire	2024-05-06 22:07:22 +00:00
Catherine Lee	1b3fd83ab2	[TD] Enable TD on AVX related configs (#125482 ) On test configs `nogpu_AVX512` and `nogpu_NO_AVX2`, which are the next longest jobs on trunk after windows Pull Request resolved: https://github.com/pytorch/pytorch/pull/125482 Approved by: https://github.com/huydhn	2024-05-06 22:02:16 +00:00
Yanbo Liang	8c74162074	Reduce the number of layers for mixtral moe model to adapt CI memory limitation (#125608 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125608 Approved by: https://github.com/Chillee, https://github.com/huydhn	2024-05-06 21:52:25 +00:00
lezcano	7ddf57e9f5	xfail codegen dynamic if the test is xfailed (#125573 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125573 Approved by: https://github.com/peterbell10	2024-05-06 20:55:33 +00:00
William Wen	373a00df9a	[dynamo] better file open method in funcname_cache (#125435 ) Fix https://github.com/pytorch/pytorch/issues/124960? Pull Request resolved: https://github.com/pytorch/pytorch/pull/125435 Approved by: https://github.com/ezyang	2024-05-06 20:55:15 +00:00
Ke Wen	cbb3791891	[pipelining] Add tests for tracing frontend (#125449 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125449 Approved by: https://github.com/wconstab ghstack dependencies: #125273, #125448	2024-05-06 20:44:56 +00:00
William Wen	bdaa7bbd7d	[dynamo] fix potentially missing _torchdynamo_inline from ScriptFunction (#125447 ) Fix https://github.com/pytorch/pytorch/issues/119747 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125447 Approved by: https://github.com/jansel	2024-05-06 20:36:56 +00:00
PHLens	ad9a27f3e5	Move autocast op list to autocast_mode.h to make sure other backends can reuse it. (#125114 ) This PR refactors the op list added in #124051 so that other backends can reuse it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125114 Approved by: https://github.com/albanD	2024-05-06 20:31:15 +00:00
PyTorch MergeBot	2a42c40791	Revert "Compute bounds for the variables created during codegen (#123100 )" This reverts commit bb668c6468dd4adf7737a069e7af4c3f612cfc81. Reverted https://github.com/pytorch/pytorch/pull/123100 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing inductor tests `bb668c6468` ([comment](https://github.com/pytorch/pytorch/pull/123100#issuecomment-2096837821))	2024-05-06 20:23:39 +00:00
Yuxin Wu	9cd4bcb2c4	[FSDP] mark pre_backward_hook unserializable (#125464 ) Saw a warning like this: ``` /opt/conda/lib/python3.10/site-packages/torch/utils/hooks.py:86: UserWarning: backward hook functools.partial(<function _pre_backward_hook at 0x7f9a3940fac0>, FullyShardedDataParallel( .... ), <torch.distributed.fsdp.flat_param.FlatParamHandle object at 0x7f25202a9720>) on tensor will not be serialized. If this is expected, you can decorate the function with @torch.utils.hooks.unserializable_hook to suppress this warning ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125464 Approved by: https://github.com/ezyang	2024-05-06 20:20:31 +00:00
drisspg	7d10b06e1a	Allow building for sm90a (#125523 ) # Summary Fixes: #125413 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125523 Approved by: https://github.com/Skylion007	2024-05-06 20:03:12 +00:00
PyTorch MergeBot	ee0c47349c	Revert "Upgrade submodule oneDNN to v3.4 (#122472 )" This reverts commit dbcf123105a3f11d02f04067ca0cb377ed09e88c. Reverted https://github.com/pytorch/pytorch/pull/122472 on behalf of https://github.com/atalman due to broke aarch64 builds and tests ([comment](https://github.com/pytorch/pytorch/pull/122472#issuecomment-2096750000))	2024-05-06 19:28:20 +00:00
Richard Barnes	af144139df	Remove some pre-c++17 cruft (#125590 ) Summary: C++20 has [eliminated](https://en.cppreference.com/w/cpp/types/result_of) `result_of` in favour of `invoke_result`. It's mysterious that this code even still works, but, nevertheless, I'm fixing it. Test Plan: Sandcastle Differential Revision: D56987418 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125590 Approved by: https://github.com/Skylion007	2024-05-06 19:19:28 +00:00
Wanchao Liang	daf1eb44bc	try to fix the warning in distribute_tensor (#125476 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125476 Approved by: https://github.com/albanD, https://github.com/awgu ghstack dependencies: #125475	2024-05-06 18:59:47 +00:00
PyTorch MergeBot	7ffa5558ee	Revert "[FX] Update type hints in `torch.fx._compatibility.py` (#125469 )" This reverts commit 235b4d6ec22ddac35b2e47b7e871ef10538d4aee. Reverted https://github.com/pytorch/pytorch/pull/125469 on behalf of https://github.com/izaitsevfb due to breaks pyre in dependent projects (internal: see D56986361) ([comment](https://github.com/pytorch/pytorch/pull/125469#issuecomment-2096665396))	2024-05-06 18:36:43 +00:00
lezcano	bb668c6468	Compute bounds for the variables created during codegen (#123100 ) Before we would just bail out on these bounds for all variables that did not come from the FX graph. Now we propagate the bounds whenever we have a rule for that op. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123100 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-05-06 18:12:15 +00:00
Pian Pawakapan	3827810453	[export] suggest constant dim values in dynamic shapes fixes (#125458 ) [https://www.internalfb.com/diff/D54924742](https://github.com/pytorch/pytorch/pull/121860) allowed specifying integer values for static dims in dynamic shapes. This changes suggested fixes to suggest the actual value instead of the current "None". Test Plan: existing export tests cover this Differential Revision: D56921142 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125458 Approved by: https://github.com/avikchaudhuri	2024-05-06 17:44:19 +00:00
atalman	6ebec38453	Add ciflow/linux-aarch64 to auto labeler on mkldnn PR's (#125599 ) Trigger Aarch64 CI on oneDNN changes, to detect issues like this: https://github.com/pytorch/pytorch/issues/125548 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125599 Approved by: https://github.com/malfet, https://github.com/snadampal	2024-05-06 17:26:17 +00:00
Nikita Shulga	e30e6d321f	[MPS][BE] Introduce MetalShaderLibrary class (#125550 ) That factors out a repeated pattern of creating a library/fetching a func from source. Typical use case: ```cpp static MetalShaderLibrary lib(SHADER_SOURCE); ... id<MTLComputePipelineState> cplState = lib.getPipelineStateForFunc("kernel_name"); ``` - Make it possible to use with templated sources - Add `scalarToMetalTypeString(const Tensor&)` variant to avoid repeated `scalarToMetalTypeString(t.scalar_type())` calls in the code. I.e., it makes no functional changes, but reduces MPS codebase size by 365 lines. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125550 Approved by: https://github.com/kulinseth	2024-05-06 17:15:47 +00:00
Max Ouellet	7bf6ed01ac	[inductor] Remove symbol exports in C shim for Windows (#125472 ) Summary: This shim exports symbols on Windows, which can lead to symbol clashes at link time in the following scenario: 1. A DLL imports libtorch 2. A binary imports libtorch, and also depends on the DLL in (1) Under that scenario, the symbols exported from `shim.h` can clash at link time. Given that AOTInductor only works for PyTorch2, and PyTorch2 doesn't currently work for Windows, we can work around this problem by simply removing the symbols export on Windows. In the long term, this will need to be figured out when Windows support is added & tested for PyTorch2. Differential Revision: D56936696 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125472 Approved by: https://github.com/desertfire	2024-05-06 14:43:11 +00:00
Edward Z. Yang	b6bcd09173	Get rid of tabular and sizes, beef up verbosity of output graph (#125507 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125507 Approved by: https://github.com/Chillee, https://github.com/jansel ghstack dependencies: #125505	2024-05-06 13:41:58 +00:00
PyTorch UpdateBot	71bec453b1	[xla hash update] update the pinned xla hash (#124599 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124599 Approved by: https://github.com/pytorchbot	2024-05-06 12:19:42 +00:00
lezcano	60efb1060a	Make codegen dynamic test faster (#125569 ) Let's early exit + avoid an unnecessary split. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125569 Approved by: https://github.com/kadeng	2024-05-06 12:09:19 +00:00
Peter Bell	24b64fc482	[HOP][inductor] Support pytrees as associative_scan input (#122137 ) This allows `associative_scan` to take an arbitrary pytree of tensors, which is flattened to their leaves before calling the `associative_scan` higher order operator. I also add support in inductor to generate code for scanning over sequences of tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122137 Approved by: https://github.com/lezcano, https://github.com/Chillee ghstack dependencies: #119430	2024-05-06 11:29:28 +00:00
Jiong Gong	68a1f787c8	[inductor][cpp] move some common cpp utils to cpp_utils.py (#125152 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125152 Approved by: https://github.com/desertfire, https://github.com/jansel	2024-05-06 04:30:30 +00:00
Alex Baden	fc183f0bde	[Inductor] Properly package target info for triton.compile (#125553 ) Triton updated the interface for `triton.compile` `5162346487` The `target` argument to compile needs to be wrapped in a `GPUTarget` object. Without proper wrapping, we hit an assert in `compile`. If that assert is removed, Triton attempts to read device info from Torch while inside a torch thread, which hits an in bad fork assert. This change is required for compatibility with latest commits in Triton. The implementation is backwards compatible, so existing versions of Triton that work now continue to work. Re-submitting this after https://github.com/pytorch/pytorch/pull/125241 was reverted due to an unrelated CI issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125553 Approved by: https://github.com/huydhn	2024-05-06 01:36:36 +00:00
Aaron Gokaslan	1dd42e42c4	[BE]: Try TCH autofixes on torch/ (#125536 ) Tries TCH autofixes to see what breaks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125536 Approved by: https://github.com/ezyang	2024-05-05 23:13:59 +00:00
PyTorch MergeBot	ccbac091d2	Revert "Add `write_record_metadata` to PyTorchFileWriter (#125184 )" This reverts commit dd92637f445d2787f83829079276f71b1ad1fc7c. Reverted https://github.com/pytorch/pytorch/pull/125184 on behalf of https://github.com/izaitsevfb due to breaks internal builds, see D56962076 ([comment](https://github.com/pytorch/pytorch/pull/125184#issuecomment-2094976897))	2024-05-05 22:40:00 +00:00
Edward Z. Yang	1b1d593c8c	Don't call item() into torch.scalar_tensor uselessly (#125373 ) Fixes https://github.com/pytorch/pytorch/issues/125368 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125373 Approved by: https://github.com/Skylion007	2024-05-05 22:38:16 +00:00
Edward Z. Yang	ecd62746e3	Also pull size/stride info from example_value (#125505 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125505 Approved by: https://github.com/jansel	2024-05-05 22:27:46 +00:00
Catherine Lee	d1a3271a55	[ez]2->3 shards for asan slow (#125499 ) One of the shards has been timing out recently Pull Request resolved: https://github.com/pytorch/pytorch/pull/125499 Approved by: https://github.com/huydhn	2024-05-05 21:02:44 +00:00
Kai Londenberg	94c4855e75	[Inductor max autotune] Make autotune_select_algorithm more robust (#124928 ) This diff makes sure that a custom exception is thrown when no valid choices remain during autotuning. This allows a graceful fallback to a default choice, even if that default choice has not been passed to autotune_select_algorithm. Additionally, this diff handles RuntimeErrors during autotuning gracefully: the corresponding choice is ignored, and a problematic choice encountered during autotuning no longer causes compilation of the entire model to fail. (An error is logged, though.) Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/124928 Approved by: https://github.com/int3 ghstack dependencies: #125406	2024-05-05 20:10:10 +00:00
Yifu Wang	58d8388ed3	Remove Inductor IRs for legacy functional collectives (#124992 ) This PR completely removes the Inductor IR for legacy functional collectives: - Removed the `CollectiveKernel` hierarchy and `Wait`, as well as the corresponding lowerings. These IRs are target (i.e. Python) specific and don't model node dependencies properly (e.g. they rely on `never_reuse_buffers` for correct behavior). They've been superseded by `ir._CollectiveKernel`. - Removed `InPlaceHint` and the scheduler logic for handling it. `InPlaceHint` is a codegen-time buffer reuse mechanism controlled by the IR's codegen. It's a bit hacky and overlaps with the default buffer reuse mechanism. Removing it since it is only used by legacy functional collectives. - Removed `OutputBuffer` and `MultiOutputNoSizeAssert` which are designed for and only used by legacy functional collectives. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124992 Approved by: https://github.com/Chillee, https://github.com/wanchaol	2024-05-05 19:49:58 +00:00
Xuehai Pan	235b4d6ec2	[FX] Update type hints in `torch.fx._compatibility.py` (#125469 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125469 Approved by: https://github.com/Skylion007 ghstack dependencies: #125468	2024-05-05 19:30:22 +00:00
Xuehai Pan	30c9fd96f6	[FX] Add missing forbidden mutation methods in immutable collections (#125468 ) Add `list.sort`, `list.reverse`, `dict.__ior__`, and `dict.setdefault`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125468 Approved by: https://github.com/Skylion007	2024-05-05 19:30:22 +00:00
Xiaodong Wang	7c59720ba7	[comm] Ensure ncclComm is not aborted before checking exception (#124466 ) Differential Revision: D56347560 More details in this pytorch issue: https://github.com/pytorch/pytorch/issues/124468 It seems there is a race in the ProcessGroupNCCL shutdown logic. The code is quite simple: ``` for i in range(100): dist.all_to_all_single(tensor_out, tensor_in) dist.destroy_process_group() ``` What can happen is this: 1. dist.destroy_process_group() calls into shutdown() and then calls into abort: `b2f6cfd9c0/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L1095)` 2. It'll call ncclCommAbort (not graceful afaict), and also set ncclAsyncErr_ = ncclSystemError; `b2f6cfd9c0/torch/csrc/distributed/c10d/NCCLUtils.hpp (L388)`. 3. The ncclWatchdog thread may not have woken up while this shutdown process happens, and in shutdown we're not waiting for the watchdog thread. 4. The ProcessGroupNCCL dtor is called. It'll wait for the watchdog thread to join. 5. The watchdog will check the work's isCompleted() -> then call checkAndSetException(). Because ncclAsyncErr_ was set to ncclSystemError, it'll error out and make you think it's a nccl error. So we can mitigate this issue by checking if the comm was aborted during work.isCompleted/isStarted. Some longer-term discussion is in the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124466 Approved by: https://github.com/shuqiangzhang, https://github.com/yoyoyocmu, https://github.com/kwen2501	2024-05-05 18:55:48 +00:00
Mu-Chu Lee	99e4909677	Remove assertion for cat target_func (#125540 ) Summary: We remove the assertion for target_func being cat. The reason is that we have multiple flavors of concat, such as cat/cat.default/cat_slice/cat_slice_cat/... The assertion here repeatedly causes false positives. Test Plan: Removing assertion code only. Differential Revision: D56971387 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125540 Approved by: https://github.com/hl475	2024-05-05 18:17:47 +00:00
Edward Z. Yang	650a248d3e	Rename is_unspecialized to pass_arg_as_tensor, add comment (#125496 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125496 Approved by: https://github.com/lezcano ghstack dependencies: #125395, #125419, #125483, #125494	2024-05-05 16:57:50 +00:00
Edward Z. Yang	12da7ee58f	Don't use wrap_fx_proxy_cls for wrap_symint (#125494 ) We use very little of the code in wrap_fx_proxy_cls, so dupe it out. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125494 Approved by: https://github.com/lezcano ghstack dependencies: #125395, #125419, #125483	2024-05-05 16:57:50 +00:00
Edward Z. Yang	617e473da5	Split wrap_symint out of wrap_unspecialized_primitive (#125483 ) While there are some similarities, they are also quite different (one handles NumPy numbers while the other handles ints). I am also going to add a wrap_symfloat soon, which will behave even more differently. So split these out for clarity. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125483 Approved by: https://github.com/lezcano ghstack dependencies: #125395, #125419	2024-05-05 16:57:50 +00:00
Kai Londenberg	10f673541e	[Inductor cutlass backend] Enabled nonzero workspace and Cutlass StreamK (#125406 ) Enable nonzero workspace and Cutlass StreamK for Inductor Cutlass GEMM ops. This is a simpler rewrite of my original version of #119005 using @peterbell10 's workspace allocation mechanism from #117992 Test Plan: - Additional unit test in test_cutlass_backend.py which specifically tests StreamK GEMM with workspace requirement - CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/125406 Approved by: https://github.com/jansel	2024-05-05 15:28:45 +00:00
Andrew Gu	f70bd71a48	[FSDP2] Computed grad divide factors at runtime (#125484 ) Context We are interested in supporting the case where HSDP reduce-scatters but does not all-reduce in a microbatch backward. This saves communication while still saving memory. Only on the last microbatch do we need to both reduce-scatter and all-reduce. This is not implemented yet and will hopefully come in a future PR. There is one notable part of doing this. On the last microbatch, we need to perform an accumulation step after reduce-scatter and before all-reduce. If not, then the preceding microbatch's gradients will not be contributed across the replica group. (In other words, we cannot simply accumulate _after_ all-reduce.) Consider 32 GPUs with 4-way replication and 8-way sharding and 2 microbatches, and focus on global rank 0. - After the first microbatch, rank 0 will have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)}$, where we define $S(0) = \{0, 1, \dots, 7\}$ to be the ranks in its shard group and we define the $(1)$ superscript to denote the first microbatch. - Upon the second microbatch, rank 0 after its reduce-scatter will additionally have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(2)}$. If we only all-reduce this, then this second microbatch's gradients become $\frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, so in total, rank 0 has $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)} + \frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, which is wrong. - Importantly, we must accumulate $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)} + \frac{1}{8} \sum_{i \in S(0)} g_i^{(2)} = \frac{1}{8}\sum_{i \in S(0)} (g_i^{(1)} + g_i^{(2)})$ first before all-reducing to get $\frac{1}{32} \sum_{i=0, 1, \dots, 31} (g_i^{(1)} + g_i^{(2)})$. Now, note how under this approach, we want a factor of $\frac{1}{8}$ only (i.e. reciprocal of the shard group size), not $\frac{1}{32}$, for the first microbatch's gradients. 
- For bf16/fp32, since we use `ReduceOp.AVG` and we only reduce-scatter on the first microbatch, we correctly have a factor of $\frac{1}{8}$ on the first microbatch. - For fp16, since we precompute the gradient divide factors at init time assuming always reducing over both shard and replica groups, we incorrectly have a factor of $\frac{1}{32}$ on the first microbatch, deviating from the bf16/fp32 case. We can address this issue by matching the bf16/fp32 vs. fp16 semantics by computing the divide factors at runtime based on which process groups were passed into the reduction function (`foreach_reduce`). Additional Notes How to implement the HSDP reduce-scatter but no all-reduce is not entirely clear yet. (What is the cleanest way to do this?) We need to store the partial reduce-scatter output and check for it upon the next backward. We should also be sure to error if the set of parameters receiving gradients changes, in which case we cannot support this easily. Anyway, we will implement this in a follow-up. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125484 Approved by: https://github.com/wanchaol ghstack dependencies: #125431, #125479	2024-05-05 14:11:33 +00:00
PyTorch MergeBot	dba689bbfd	Revert "[FSDP2] Computed grad divide factors at runtime (#125484 )" This reverts commit 9aa7699185e4ec39077e3046dfd63244dffa9ddb. Reverted https://github.com/pytorch/pytorch/pull/125484 on behalf of https://github.com/huydhn due to Sorry for reverting your change, I am trying to restore ROCm distributed failures in trunk `9aa7699185` ([comment](https://github.com/pytorch/pytorch/pull/125484#issuecomment-2094646996))	2024-05-05 06:12:01 +00:00
cyy	8a0529e986	[2/2] Remove Caffe2 db and distributed code (#125533 ) This PR follows #125092 to remove caffe2/db/* and caffe2/distributed/* . Pull Request resolved: https://github.com/pytorch/pytorch/pull/125533 Approved by: https://github.com/kit1980	2024-05-05 05:10:17 +00:00
chilli	7f0c5eb023	Added some more flex attention tests (#125487 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125487 Approved by: https://github.com/yanboliang	2024-05-04 23:42:40 +00:00
PyTorch MergeBot	6d30803d64	Revert "[Inductor] Properly package target info for triton.compile (#125241 )" This reverts commit 8a1af95b0979d85c4fe32a75e797323ad81f298d. Reverted https://github.com/pytorch/pytorch/pull/125241 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing inductor tests on ROCm `8a1af95b09` ([comment](https://github.com/pytorch/pytorch/pull/125241#issuecomment-2094472886))	2024-05-04 22:28:16 +00:00
PyTorch MergeBot	084d818e71	Revert "try to fix the warning in distribute_tensor (#125476 )" This reverts commit 2b41e1d6fc05428008875e3cfe8be17184e57491. Reverted https://github.com/pytorch/pytorch/pull/125476 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but there are real failures on the PR that sneak in during the log classifier outage ([comment](https://github.com/pytorch/pytorch/pull/125476#issuecomment-2094468740))	2024-05-04 22:25:32 +00:00
PyTorch MergeBot	a32ad828dc	Revert "Don't call item() into torch.scalar_tensor uselessly (#125373 )" This reverts commit 2b4fe183db00db88749f8524f3b4a69ca80da0ec. Reverted https://github.com/pytorch/pytorch/pull/125373 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but there are real failures on the PR that sneak in during the log classifier outage ([comment](https://github.com/pytorch/pytorch/pull/125373#issuecomment-2094464241))	2024-05-04 22:22:36 +00:00
Animesh Jain	f04c8471a4	[dynamo][prepare for nn module guards] Guard nn modules for a few benchmarks (#125324 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125324 Approved by: https://github.com/jansel ghstack dependencies: #125439, #125421, #124522	2024-05-04 22:08:56 +00:00
Animesh Jain	5ba777f46e	[guards][cpp-guards] Optimize NN module getattr guards (#124522 ) Improves the guard overhead of MobileBert model with nn module guards from 92000 units to 20000 units. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124522 Approved by: https://github.com/jansel ghstack dependencies: #125439, #125421	2024-05-04 22:08:56 +00:00
albanD	76a26a885d	Add module tracker (#125352 ) This does a few things that were originally a few PRs but I am on a new machine and don't have ghstack. If it is too problematic to review, I can re-split, just let me know. This does: - Cleanup context manager use in test_flop_counter - Remove need for mod argument in FlopCounterMode, warning about it - Re-implement a Module tracker from scratch using global forward Module use and multi_grad_hook (we cannot use global backward Module hook because they don't look for nested Tensor and they're custom Function based instead of multi_grad_hook). - Update FlopCounterMode to use the new ModuleTracker. All the existing test suite passes as-is (only changes there are new tests and refactoring mentioned above). Pull Request resolved: https://github.com/pytorch/pytorch/pull/125352 Approved by: https://github.com/mikaylagawarecki	2024-05-04 18:33:35 +00:00
William Wen	1a20b4ef3f	[dynamo] handle inactive nullcontexts across graph breaks (#125518 ) whoops Pull Request resolved: https://github.com/pytorch/pytorch/pull/125518 Approved by: https://github.com/yanboliang	2024-05-04 12:52:20 +00:00
Edward Z. Yang	6f70d22277	Extend torch.utils._sympy.symbol for more Inductor symbols (#125419 ) I'm still missing a few, cdzq at least Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125419 Approved by: https://github.com/lezcano ghstack dependencies: #125395	2024-05-04 09:05:00 +00:00
Ke Wen	5cd7c75bd9	[pipelining] Add tracing frontend (#125448 ) This PR allows users to transform a model into a pipeline representation with split stages, according to a split spec. ``` def pipeline( module: torch.nn.Module, num_chunks: int, example_args: Tuple[Any, ...], example_kwargs: Optional[Dict[str, Any]] = None, split_spec: Optional[Dict[str, SplitPoint]] = None, split_policy: Optional[Callable[[fx.GraphModule], fx.GraphModule]] = None, ) -> Pipe: ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125448 Approved by: https://github.com/H-Huang ghstack dependencies: #125273	2024-05-04 09:00:25 +00:00
Edward Z. Yang	2b4fe183db	Don't call item() into torch.scalar_tensor uselessly (#125373 ) Fixes https://github.com/pytorch/pytorch/issues/125368 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125373 Approved by: https://github.com/Skylion007	2024-05-04 08:07:13 +00:00
Edward Z. Yang	5ef50d75f8	Don't short circuit if shape is same (#125188 ) This is more unbacked SymInt friendly. If this does not work, my back up plan is to short circuit only if it is statically known equal. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125188 Approved by: https://github.com/albanD	2024-05-04 07:11:08 +00:00
cyy	83845a7c78	[1/2] Remove caffe2 db and distributed from build system (#125092 ) This PR tries to decompose https://github.com/pytorch/pytorch/pull/122527 into a smaller one. Caffe2 db, distributed and some binaries have been removed. To be noted, this was inspired and is co-dev with @r-barnes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125092 Approved by: https://github.com/malfet	2024-05-04 06:48:46 +00:00
Wanchao Liang	2b41e1d6fc	try to fix the warning in distribute_tensor (#125476 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125476 Approved by: https://github.com/albanD, https://github.com/awgu ghstack dependencies: #125475	2024-05-04 05:25:13 +00:00
Animesh Jain	b62e89c1b8	[dynamo] Do not turn on record replay with TORCH_COMPILE_DEBUG (#125488 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125488 Approved by: https://github.com/yanboliang, https://github.com/mlazos	2024-05-04 05:10:31 +00:00
Wanchao Liang	ff061baa94	[comm_mode] adding some initial c10d ops to CommDebugMode (#125475 ) looks like we can make it work :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125475 Approved by: https://github.com/awgu	2024-05-04 04:20:46 +00:00
Catherine Lee	d4727fd4eb	[TD][ez] Better check for is pr or not (#125485 ) You can trigger ciflow tags on main branch commits, so we should be more conservative when checking to see if a workflow is a PR/on the main branch. get_pr_number checks for the PR number based on the PR_NUMBER env var or a tag of the form `ciflow/workflow/pr number`. If we fail to find something like this, then assume it is on the main branch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125485 Approved by: https://github.com/huydhn	2024-05-04 03:08:44 +00:00
ydwu4	0302dc68bf	[Reland] Fakify script object inputs and attributes for non-strict ex… (#125490 ) A re-land of #124239. This PR fakifies ScriptObject inputs and attributes in export non-strict mode by default. The basic idea is to only fakify the script object during tracing (i.e. aot_export). After we get the traced graph module, eagerly executing, serializing, or running more passes will use the real script objects. This essentially treats the script object as a constant tensor. Concretely, we fakify all the script object inputs and module attributes (gathered by constant_attrs), and patch the module's attributes with the fakified script objects. Right after aot_export, we remove the patching (to avoid changing the original module), then set the exported graph module's attributes to the real script objects. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125490 Approved by: https://github.com/angelayi	2024-05-04 02:39:42 +00:00
Shuqiang Zhang	bfd5bb0c44	[c10d] only PG0 should dump when monitoring thread timed out (#125356 ) Summary: We found that some dumps are missing when the monitoring thread times out. This is likely because multiple PGs could still dump the same records at the same time, so we should allow only PG0 to actually dump. Test Plan: unit test python test/run_test.py --cpp --verbose -i cpp/ProcessGroupNCCLErrorsTest Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/125356 Approved by: https://github.com/c-p-i-o	2024-05-04 00:43:20 +00:00
eqy	d325c55896	Add CUDA paths to `CODEOWNERS` (#125409 ) CC @ptrblck @albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/125409 Approved by: https://github.com/albanD	2024-05-04 00:29:39 +00:00
Alex Baden	8a1af95b09	[Inductor] Properly package target info for triton.compile (#125241 ) Triton updated the interface for `triton.compile` `5162346487` The `target` argument to compile needs to be wrapped in a `GPUTarget` object. Without proper wrapping, we hit an assert in `compile`. If that assert is removed, Triton attempts to read device info from Torch while inside a torch thread, which hits an in bad fork assert. This change is required for compatibility with latest commits in Triton. The implementation is backwards compatible, so existing versions of Triton that work now continue to work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125241 Approved by: https://github.com/jansel	2024-05-04 00:10:53 +00:00
Andrew Gu	9aa7699185	[FSDP2] Computed grad divide factors at runtime (#125484 ) Context We are interested in supporting the case where HSDP reduce-scatters but does not all-reduce in a microbatch backward. This saves communication while still saving memory. Only on the last microbatch do we need to both reduce-scatter and all-reduce. This is not implemented yet and will hopefully come in a future PR. There is one notable part of doing this. On the last microbatch, we need to perform an accumulation step after reduce-scatter and before all-reduce. If not, then the preceding microbatch's gradients will not be contributed across the replica group. (In other words, we cannot simply accumulate _after_ all-reduce.) Consider 32 GPUs with 4-way replication and 8-way sharding and 2 microbatches, and focus on global rank 0. - After the first microbatch, rank 0 will have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)}$, where we define $S(0) = \{0, 1, \dots, 7\}$ to be the ranks in its shard group and we define the $(1)$ superscript to denote the first microbatch. - Upon the second microbatch, rank 0 after its reduce-scatter will additionally have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(2)}$. If we only all-reduce this, then this second microbatch's gradients become $\frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, so in total, rank 0 has $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)} + \frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, which is wrong. - Importantly, we must accumulate $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)} + \frac{1}{8} \sum_{i \in S(0)} g_i^{(2)} = \frac{1}{8}\sum_{i \in S(0)} (g_i^{(1)} + g_i^{(2)})$ first before all-reducing to get $\frac{1}{32} \sum_{i=0, 1, \dots, 31} (g_i^{(1)} + g_i^{(2)})$. Now, note how under this approach, we want a factor of $\frac{1}{8}$ only (i.e. reciprocal of the shard group size), not $\frac{1}{32}$, for the first microbatch's gradients. 
- For bf16/fp32, since we use `ReduceOp.AVG` and we only reduce-scatter on the first microbatch, we correctly have a factor of $\frac{1}{8}$ on the first microbatch. - For fp16, since we precompute the gradient divide factors at init time assuming always reducing over both shard and replica groups, we incorrectly have a factor of $\frac{1}{32}$ on the first microbatch, deviating from the bf16/fp32 case. We can address this issue by matching the bf16/fp32 vs. fp16 semantics by computing the divide factors at runtime based on which process groups were passed into the reduction function (`foreach_reduce`). Additional Notes How to implement the HSDP reduce-scatter but no all-reduce is not entirely clear yet. (What is the cleanest way to do this?) We need to store the partial reduce-scatter output and check for it upon the next backward. We should also be sure to error if the set of parameters receiving gradients changes, in which case we cannot support this easily. Anyway, we will implement this in a follow-up. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125484 Approved by: https://github.com/wanchaol ghstack dependencies: #125431, #125479	2024-05-03 23:44:05 +00:00
Andrew Gu	996bb74077	[FSDP2] Added HSDP grad acc tests and some minor changes (#125479 ) This adds HSDP to the existing gradient accumulation tests and includes some minor changes to simplify things a tiny bit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125479 Approved by: https://github.com/wanchaol ghstack dependencies: #125431	2024-05-03 23:44:05 +00:00
Muralidhar Andoorveedu	b96b1e8cff	[Distributed] Add P2P versions of *object_list operations (#124379 ) This PR adds `send_object_list` and `recv_object_list` to `distributed_c10d.py`. This is extending functionality already present in PyTorch with `broadcast_object_list` that I noticed was missing and decided to upstream. With this change, sending and receiving arbitrary picklable python objects is possible. Relevant issue: https://github.com/pytorch/pytorch/issues/3473 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124379 Approved by: https://github.com/kwen2501, https://github.com/wconstab	2024-05-03 23:22:58 +00:00
William Wen	f2ab96a57e	[dynamo] fix crash when context manager is passed to a function (#125321 ) Fix https://github.com/pytorch/pytorch/issues/125274. Main change was to reconstruct `ContextWrappingVariables` as objects in general, but we can replace them with the class on the caller side when generating the resume function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125321 Approved by: https://github.com/jansel	2024-05-03 23:01:30 +00:00
Sergii Dymchenko	59abd1dccb	Fix lint after PR 122611 (#125512 ) Fix lint after https://github.com/pytorch/pytorch/pull/122611 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125512 Approved by: https://github.com/clee2000	2024-05-03 22:58:20 +00:00
Iosif Spulber	4abcf36dde	Make c10::Error empty backtrace as an optional argument (#122611 ) Summary: Split from the main diff in the stack. Test Plan: Build validation should be enough. Reviewed By: ezyang Differential Revision: D55313410 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122611 Approved by: https://github.com/ezyang	2024-05-03 22:50:00 +00:00
Bin Bao	a783fef990	[AOTI] Add a missing mypy ignore (#125508 ) Summary: Caused by https://github.com/pytorch/pytorch/pull/125397, but somehow was not caught by CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125508 Approved by: https://github.com/izaitsevfb	2024-05-03 22:32:31 +00:00
Aleksei Nikiforov	2b5ae2611e	s390x: use runtime detection for vectorization support (#123936 ) s390x: use runtime detection for vectorization support Pull Request resolved: https://github.com/pytorch/pytorch/pull/123936 Approved by: https://github.com/malfet, https://github.com/jansel, https://github.com/xuhancn	2024-05-03 21:34:37 +00:00
Edward Z. Yang	5503c29357	Introduce torch.utils._sympy.symbol (#125395 ) This provides utilities for creating and querying properties on sympy.Symbol. I want to use this refactor to get a better handle on how the 's' prefix is being used in Inductor. To start, I only do symbolic_shapes code because that's what I'm familiar with. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125395 Approved by: https://github.com/Skylion007	2024-05-03 21:24:23 +00:00
Andrew Gu	1a578df57c	[FSDP2] Added test to show rank 0 broadcast for HSDP replicas (#125431 ) This PR shows a simple utility to broadcast the parameters across replicas for HSDP:
```
replicate_group = mesh.get_group("replicate")
for param in model.parameters():
    # E.g. for mesh [[0, 1, 2, 3], [4, 5, 6, 7]] sharding on dim-1 and
    # replicating on dim-0, broadcast with sources 0, 1, 2, 3
    src_rank = dist.get_process_group_ranks(replicate_group)[0]
    torch.distributed.broadcast(
        param.to_local(), src=src_rank, group=replicate_group
    )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125431 Approved by: https://github.com/weifengpy, https://github.com/wanchaol	2024-05-03 21:17:35 +00:00
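To see why the broadcast sources for the example mesh are 0, 1, 2, 3, the replicate groups can be worked out in plain Python. This is a sketch with no `torch.distributed` calls; `replicate_groups` and `broadcast_sources` are hypothetical helper names used only for illustration.

```python
from typing import List


def replicate_groups(mesh: List[List[int]]) -> List[List[int]]:
    """Ranks holding the same shard: one group per column of a 2D mesh
    that shards along dim-1 and replicates along dim-0."""
    return [list(column) for column in zip(*mesh)]


def broadcast_sources(mesh: List[List[int]]) -> List[int]:
    """First rank of each replicate group, mirroring `ranks[0]` in the message."""
    return [group[0] for group in replicate_groups(mesh)]


mesh = [[0, 1, 2, 3], [4, 5, 6, 7]]
groups = replicate_groups(mesh)
sources = broadcast_sources(mesh)
print(groups, sources)
```

Each rank belongs to exactly one replicate group, so every rank in the second mesh row receives its parameters from the rank directly above it in the first row.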
dlyakhov	c941fee7ea	[CPP extention] Baton lock is called regardless the code version (#125404 ) Greetings! Fixes #125403 Please assist me with the testing as it is possible for my reproducer to miss the error in the code. Several (at least two) threads should enter the same part of the code at the same time to check file lock is actually working Pull Request resolved: https://github.com/pytorch/pytorch/pull/125404 Approved by: https://github.com/ezyang	2024-05-03 21:10:39 +00:00
Aleksei Nikiforov	645baef05d	s390x: remove workaround for sleef issue (#124730 ) This workaround is no longer needed since sleef was updated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124730 Approved by: https://github.com/soulitzer	2024-05-03 20:52:05 +00:00
Kai Londenberg	b1a7455b99	[Inductor cutlass backend] Fix cutlass_utils.get_max_alignment() for strided layouts. (#124930 ) Fixes cutlass_utils.get_max_alignment() which was so far not checking the alignment properly. Basically the method so far assumed that the passed layout is contiguous and row-major, which does not have to be true. Test Plan: CI - test_cutlass_backend.py to prevent regressions Added unit test Pull Request resolved: https://github.com/pytorch/pytorch/pull/124930 Approved by: https://github.com/int3 ghstack dependencies: #124929	2024-05-03 20:50:26 +00:00
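The kind of check involved can be sketched as follows. This is not the Inductor implementation: `max_alignment`, its candidate list, and its rules are assumptions meant only to show why a row-major, contiguous assumption gives wrong answers for strided layouts.

```python
from typing import Sequence


def max_alignment(
    size: Sequence[int],
    stride: Sequence[int],
    offset: int = 0,
    candidates: Sequence[int] = (8, 4, 2, 1),
) -> int:
    """Largest candidate alignment (in elements) that a strided layout supports.

    Hypothetical rule set: the unit-stride dimension's size, every other
    stride, and the storage offset must all be multiples of the alignment.
    """
    if 1 not in stride:
        return 1  # no unit-stride dimension, so no vectorized access
    contiguous_dim = list(stride).index(1)
    for alignment in candidates:
        if size[contiguous_dim] % alignment or offset % alignment:
            continue
        others = (s for d, s in enumerate(stride) if d != contiguous_dim)
        if all(s % alignment == 0 for s in others):
            return alignment
    return 1


row_major = max_alignment((64, 128), (128, 1))
transposed = max_alignment((128, 6), (1, 128))  # unit stride is on dim 0
padded = max_alignment((64, 30), (36, 1))
print(row_major, transposed, padded)
```

For the transposed case, a check that only inspects the last dimension would see a size of 6 and settle for an alignment of 2, even though the unit-stride dimension supports 8.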
Bin Bao	a988b4ed76	[AOTI] Generate mul_Scalar instead of mul_Tensor (#125397 ) Summary: Fix https://github.com/pytorch/pytorch/issues/117365. When the second argument to aten.mul.Tensor is a scalar (e.g. scale factor), the cpp wrapper expects to generate a call to mul_Scalar when fallback happens (e.g. Complex dtype). Pull Request resolved: https://github.com/pytorch/pytorch/pull/125397 Approved by: https://github.com/chenyang78 ghstack dependencies: #125329	2024-05-03 18:35:42 +00:00
Bin Bao	e84a5b6cc0	[AOTI] Add missing std::move for constant args (#125329 ) Summary: fix https://github.com/pytorch/pytorch/issues/123187 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125329 Approved by: https://github.com/angelayi, https://github.com/chenyang78	2024-05-03 18:35:42 +00:00
Andrew Gu	d6052a35d4	[RFC][FSDP2] Added `register_fsdp_forward_method` for user fwd methods (#125394 ) FSDP only runs its pre/post-forward hooks on `nn.Module.forward`. This means that if the user runs a custom method meant as a forward pass, then FSDP will not all-gather the parameters. Examples include HuggingFace models' `generate()` (https://github.com/pytorch/pytorch/issues/123962, https://github.com/pytorch/pytorch/issues/100069) or others (https://github.com/pytorch/pytorch/issues/109385). This PR adds a monkey patching API `register_fsdp_forward_method(module: nn.Module, method_name: str)` to allow FSDP pre/post-forward hooks to run on the method. The function is a no-op if the passed-in `module` is not an FSDP module so that the register function can be called even if the FSDP wrapping changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125394 Approved by: https://github.com/weifengpy, https://github.com/wanchaol	2024-05-03 18:31:28 +00:00
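The monkey-patching idea can be sketched without FSDP. `ToyWrapped` and `register_forward_method` below are hypothetical stand-ins, not the PyTorch API; they only show a method being wrapped so that pre/post hooks run around it, with a no-op when the object has no such hooks.

```python
import functools


class ToyWrapped:
    """Stand-in for a wrapped module whose hooks normally run only on forward()."""

    def __init__(self):
        self.events = []

    def pre_forward(self):
        self.events.append("pre")  # e.g. where parameters would be all-gathered

    def post_forward(self):
        self.events.append("post")  # e.g. where parameters would be freed

    def generate(self, x):
        self.events.append("generate")
        return x + 1


class Plain:
    def generate(self, x):
        return x


def register_forward_method(module, method_name):
    """Hypothetical: run the hooks around `module.<method_name>`; no-op without hooks."""
    if not (hasattr(module, "pre_forward") and hasattr(module, "post_forward")):
        return
    original = getattr(module, method_name)

    @functools.wraps(original)
    def wrapped(*args, **kwargs):
        module.pre_forward()
        try:
            return original(*args, **kwargs)
        finally:
            module.post_forward()

    setattr(module, method_name, wrapped)


module = ToyWrapped()
register_forward_method(module, "generate")
result = module.generate(1)

plain = Plain()
register_forward_method(plain, "generate")  # silently does nothing
print(result, module.events, plain.generate(3))
```

Making registration a no-op for unwrapped objects is what lets the call stay in user code even when the wrapping configuration changes.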
Xiaodong Wang	52f9128a0d	[AMD] Fix cutlass path in inductor (#125463 ) Summary: Trunk is broken because fbcode triton-amd doesn't have cutlass path Test Plan: It now runs. Differential Revision: D56923833 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125463 Approved by: https://github.com/Skylion007	2024-05-03 18:02:58 +00:00
Catherine Lee	e10b2ba357	Script for compiling count + time of test at file granularity (#125322 ) Adds script for compiling # of tests + total time taken at the file granularity Pull Request resolved: https://github.com/pytorch/pytorch/pull/125322 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi	2024-05-03 17:35:44 +00:00
Zhengxu Chen	12a69afa6d	[export] Fix deserializer node meta handling. (#125454 ) Summary: The code seems not needed because serializer shouldn't make any meaningful decision about what goes to node metadata. Test Plan: CI Differential Revision: D56918543 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125454 Approved by: https://github.com/angelayi	2024-05-03 16:51:08 +00:00
Nikita Shulga	30610251ec	[MPS] And naive quantized intmm and `.gputrace` capture hooks (#125163 )
- Implement a very straightforward Metal copy of CPU int4mm kernel
- Implement int8mm kernel by constructing a graph consisting of upcast, transpose and mm
- Add `isCapturing`, `isCaptureEnabled`, `startCapture` and `stopCapture` methods to `MPSProfiler`, which can be used to help debug/profile Metal kernels by wrapping the calls with the following
```cpp
if (getMPSProfiler().profiler.isCaptureEnabled()) {
  getMPSProfiler().startCapture(__func__, mpsStream);
}
...
if (getMPSProfiler().isCapturing()) {
  getMPSProfiler().stopCapture(mpsStream);
}
```
that, if invoked with the `MTL_CAPTURE_ENABLED` environment variable set to one, will produce .gputrace files in the current working directory, which can later be loaded and used to debug or profile the kernel
- Added `test_int4mm` to TestLinalgMPS, which is mostly a copy-n-paste of the test from `test_linalg`
TODOs:
- Add weight pack
- Perf-tune both kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125163 Approved by: https://github.com/mikekgfb	2024-05-03 15:20:39 +00:00
Masaki Kozuki	a99ada5b27	call `super().__post_init__` in `ForeachFuncinfo.__post_init__` (#125457 ) obviously the current main branch's `ForeachFuncInfo`'s dunder post init doesn't `super().__post_init__()` which does some setup including setting `dtypesIfCUDA` and `dtypesIfROCM`. Fixes #125295 related: #125001 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125457 Approved by: https://github.com/janeyx99	2024-05-03 14:54:15 +00:00
Andrew Gu	79af814369	[FSDP] Added private `_unshard` API (#124304 ) Some toy example: [screenshot] We define `FullyShardedDataParallel._unshard(async_op: bool = False)` that can be used to prefetch all-gathers. The user should make sure to:
1. Run lazy init before the first `_unshard` call of training. For example, this can hackily be done via `root_module.check_is_root()` on the root FSDP module `root_module`.
2. Call `root_module._wait_unshard_streams_on_current_stream()` before the first `_unshard` call of the current iteration (just need to call it once after last optimizer step and before first `_unshard` of this iteration).
Differential Revision: [D56262876](https://our.internmc.facebook.com/intern/diff/D56262876) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124304 Approved by: https://github.com/wanchaol	2024-05-03 13:14:15 +00:00
Jiang, Yanbing	ca98c2a932	inductor: Add Conv3d support (#124361 ) This PR adds Conv3d support in Inductor, basically reusing and expanding the Conv2d logic and unit tests to Conv3d. Conv3d inductor support will improve the performance of C2D_R50, I3D_R50, I3D_R101, Slow and SlowFast-R50 from OOB models.
| | C2D_R50 | I3D_R50 | I3D_R101 | Slow | SlowFast-R50 |
| -- | -- | -- | -- | -- | -- |
| eager | 15.805 | 13.909 | 11.639 | 12.101 | 6.606 |
| Compile w/o conv3d | 17.244 | 14.893 | 12.109 | 13.015 | 6.603 |
| Compile w/ conv3d | 21.212 | 17.707 | 14.974 | 16.130 | 8.537 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124361 Approved by: https://github.com/leslie-fang-intel, https://github.com/CaoE, https://github.com/jgong5, https://github.com/jansel	2024-05-03 10:24:14 +00:00
haozhe.zhu	489b4586e9	[optim]fix ut and sgd kernel (#124904 )
- The original `test_grad_scaling_autocast_fused_optimizers` does not work since there is no "fused" in `optim_inputs`.
- We should use different `grad_scaler`s; they should not share one `scale`. No issue was exposed here because the default `_growth_interval` is 2000, so the scale does not grow, and no inf is found, so it is not reduced either. The test in `test_cuda.py` should have the same issue.
- I set a manual seed so that any numerical failure can be reproduced.
- I use Tensor tracker here because this UT failed in the dynamo case: the cpp generated code is not exactly the same as the fused/non-fused kernel.
- I make it check both `cuda` and `cpu`.
- I found an SGD numerical issue with `clang` and fixed it by using `fmadd` instead of `add/mul` in the fused SGD vec kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124904 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-05-03 09:13:24 +00:00
Stefan-Alin Pahontu	bebefcf845	Driver folder check (#117548 ) Added extra check for driver folders for Libtorch, as stat struct does not recognize driver folders, so torch.save should work for them as well. (e.g. save model.pt directly under C: ) Fixes [#111121](https://github.com/pytorch/pytorch/issues/111121) and #105488 Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/117548 Approved by: https://github.com/malfet	2024-05-03 09:10:11 +00:00
eellison	e5cc7ada67	skip triton template precompilation in 311.0-3.11.7 to workaround 311 cpython bug (#125446 ) Fix for https://github.com/pytorch/pytorch/issues/125374. We don't have CI for these specific versions, but I verified locally. There is a CPython bug in 3.11.0 through 3.11.7 where the AST parsing state is global, which errors with multiple threads. When the dust settles a little around the new process-based compilation, we can look into migrating. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125446 Approved by: https://github.com/Chillee ghstack dependencies: #125289	2024-05-03 08:28:32 +00:00
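A version gate of this kind can be sketched as below. The function name `should_skip_precompilation` is hypothetical, and treating both ends of the 3.11.0 to 3.11.7 range as inclusive is an assumption based on the wording of the message.

```python
import sys
from typing import Optional, Tuple


def should_skip_precompilation(version: Optional[Tuple[int, ...]] = None) -> bool:
    """True on CPython 3.11.0 through 3.11.7 (assumed inclusive), where
    threaded AST parsing is reported to be unsafe."""
    major, minor, micro = tuple(version or sys.version_info)[:3]
    return (3, 11, 0) <= (major, minor, micro) <= (3, 11, 7)


print(should_skip_precompilation())           # depends on the running interpreter
print(should_skip_precompilation((3, 11, 4)))
```

Comparing full `(major, minor, micro)` tuples keeps the gate narrow, so 3.10 and 3.12 interpreters are unaffected.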
Mikayla Gawarecki	dd92637f44	Add `write_record_metadata` to PyTorchFileWriter (#125184 ) Add `PyTorchFileWriter.write_record_metadata(record_name, num_bytes)` that - writes the zipfile header/end of central directory metadata for an entry* - reserves `num_bytes` in the zipfile for the payload. *Since the payload is not provided, the CRC32 computation is skipped and 0s are written in the corresponding entry of the zipfile header Pull Request resolved: https://github.com/pytorch/pytorch/pull/125184 Approved by: https://github.com/albanD	2024-05-03 07:29:52 +00:00
PyTorch UpdateBot	4c84789743	[vision hash update] update the pinned vision hash (#123227 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123227 Approved by: https://github.com/pytorchbot	2024-05-03 05:55:29 +00:00
Animesh Jain	071ee40793	[dynamo][nn module] Check for duplicate tensors in register_attr_or_module (#125421 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125421 Approved by: https://github.com/mlazos ghstack dependencies: #125439	2024-05-03 05:08:09 +00:00
Pian Pawakapan	ef757a5c00	[export] use tree_map for _flatten_dynamic_shapes (#125415 ) Summary: Fixing the implementation of `_flatten_dynamic_shapes()`, to follow how `_process_dynamic_shapes()` does it. The previous implementation would misinterpret some nested dynamic shape specs, causing it to miss some shape specs, for example with nested inputs/constant input tuples:
```
inputs = (
    (2, 1),
    (
        torch.randn(2, 1),
        torch.randn(2, 2),
        torch.randn(2, 3),
    )
)
dynamic_shapes = (
    (None, None),
    (
        None,
        None,
        None,
    )
)
```
This would get interpreted as 2 shape specs for 2d and 3d tensors. Fixing so this doesn't happen. Test Plan: Existing export tests Differential Revision: D56894923 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125415 Approved by: https://github.com/angelayi	2024-05-03 04:59:17 +00:00
Shivam Raikundalia	394ec2da30	Remove GPU Check from Basic Chrome Trace test (#125430 ) Summary: Remove the check to make sure all GPU labels are enumerated when CUDA is available. There are some systems where CUDA is available but we do not print any GPU labels (because GPU is not available). Test Plan: Test in regression with ciflow/periodic label Differential Revision: D56906893 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125430 Approved by: https://github.com/izaitsevfb	2024-05-03 04:51:10 +00:00
Animesh Jain	8706da2bad	[dynamo][cpp-guards] Improve recompilation reason logic for NO_TENSOR_ALIASING guard (#125439 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125439 Approved by: https://github.com/williamwen42	2024-05-03 04:49:41 +00:00
Tuan Trieu	d156cb2e12	Fix mem size mismatch from split/chunk in const folding (#125199 ) Summary: The chunk/split ops on the weights/constants are folded in an fx pass, and each output tensor has the same storage size as the original tensor (which is 3x its actual size for chunk(3)). However, the backend calculates the mem size on device from the tensor shape/stride/dtype. This causes a mismatch when copying weights/constants to the device, since the mem allocated on device is always smaller than the size of the weights/constants, and results in a runtime error when loading the weight/constant (T172125529). This diff fixes the issue by cloning the tensors after const folding so that the tensors have the correct storage size. Test Plan: Before this change: (18432 = 48 * 64 * 2 * 3)
```
RuntimeError: Failed to load constant getitem_idx0 split (remaining=18432) at fbcode/caffe2/torch/fb/acc_runtime/afg/afg_bindings.cpp:3422: Request failed because an invalid parameter
```
```
buck2 run mode/opt //caffe2/torch/fb/acc_runtime/afg/tests:test_operators-artemis -- -r test_mem_size_mismatch
```
```
Ran 1 test in 7.048s

OK
```
Reviewed By: jfix71 Differential Revision: D56663931 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125199 Approved by: https://github.com/jfix71	2024-05-03 04:42:38 +00:00
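The size mismatch is plain arithmetic and can be checked without PyTorch. The reading of `18432 = 48 * 64 * 2 * 3` used here (three chunks of shape (48, 64) with 2-byte elements) is an assumption inferred from the message, and `nbytes` is a hypothetical helper.

```python
from typing import Sequence


def nbytes(shape: Sequence[int], itemsize: int) -> int:
    """Bytes needed for a dense tensor of the given shape."""
    total = itemsize
    for dim in shape:
        total *= dim
    return total


# Assumed interpretation of the numbers quoted in the test plan.
ITEMSIZE = 2           # bytes per element, e.g. a 16-bit float
CHUNK_SHAPE = (48, 64)
NUM_CHUNKS = 3

# Each folded chunk is a view that still points at the original storage.
shared_storage_bytes = nbytes(CHUNK_SHAPE, ITEMSIZE) * NUM_CHUNKS
# A backend sizing its allocation from shape and dtype reserves only this much.
device_alloc_bytes = nbytes(CHUNK_SHAPE, ITEMSIZE)
# After cloning, the chunk owns storage of exactly its own size.
cloned_storage_bytes = nbytes(CHUNK_SHAPE, ITEMSIZE)

print(shared_storage_bytes, device_alloc_bytes, cloned_storage_bytes)
```

Under these assumptions the copy tries to move 18432 bytes into a 6144-byte allocation, which is the 3x gap the summary describes.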
Denis Vieriu	a40d6df448	[MPS] Native nonzero implementation (#125355 ) Fixes https://github.com/pytorch/pytorch/issues/124850 Replace previous MPSGraph nonzero construction with native nonzero op. For older OSes, fallback to CPU (previous implementation was not reliable and was comparable to CPU in speed). Pull Request resolved: https://github.com/pytorch/pytorch/pull/125355 Approved by: https://github.com/kulinseth	2024-05-03 03:50:58 +00:00
Roy Hvaara	e15da7856c	[MPS] Fix overflow in cumsum when dtype is bool (#125318 ) `cumsum` and `cumprod` were (are?) buggy for MPS: `c8d2a55273/aten/src/ATen/native/mps/operations/UnaryOps.mm (L435-L436)` A workaround casts the input to int32 prior to performing the op to prevent overflow for certain numeric types. It turns out this issue also affects boolean types:
```python
import torch
print(torch.ones(128, dtype=torch.bool, device="mps").cumsum(0)[-1])
# tensor(-128, device='mps:0')
```
In this PR I'm adding logic to also cast bool dtypes to int32 prior to `cumsum` and `cumprod`, although output is guaranteed not to overflow for the latter with bools. I'm also adding a test to prevent regressions. Fixes #96614 #106112 #109166 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125318 Approved by: https://github.com/malfet	2024-05-03 01:19:24 +00:00
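The `-128` result is what a signed 8-bit accumulator produces after summing 128 ones. The sketch below reproduces that wraparound in plain Python with `ctypes`; that the MPS op accumulated in 8 bits is an assumption consistent with the output shown, not something the message states.

```python
import ctypes
from typing import Iterable, List


def cumsum_int8(values: Iterable[bool]) -> List[int]:
    """Running sum kept in a signed 8-bit accumulator, wrapping on overflow."""
    total, out = 0, []
    for value in values:
        total = ctypes.c_int8(total + int(value)).value
        out.append(total)
    return out


def cumsum_int32(values: Iterable[bool]) -> List[int]:
    """Same running sum after widening to 32 bits, as the workaround does."""
    total, out = 0, []
    for value in values:
        total = ctypes.c_int32(total + int(value)).value
        out.append(total)
    return out


ones = [True] * 128
narrow_last = cumsum_int8(ones)[-1]
wide_last = cumsum_int32(ones)[-1]
print(narrow_last, wide_last)
```

`cumprod` over booleans only ever yields 0 or 1, which is why the message notes that it cannot overflow.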
Nikita Shulga	acac7aa70f	[CI] Unskip Linalg tests on ARM (#125377 ) Removes obscure "Issue with numpy version on arm" added by https://github.com/pytorch/pytorch/pull/82213 And replaces it with 4 targeted skips: - test_addmv for `float16` - test_vector_norm for `float16`, `bfloat16` and `float32` Followups to fix them are tracked in https://github.com/pytorch/pytorch/issues/125438 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125377 Approved by: https://github.com/kit1980	2024-05-03 01:18:52 +00:00
Alexandre Ghelfi, PhD	d18a6f46d0	Adding Compare in torch.utils.benchmark documentation (#125009 ) `torch.utils.benchmark.Compare` is not directly exposed in torch.utils.benchmark documentation. I think this is a valuable resource to add since it can help people embracing the torch benchmark way of doing things, and help people building documentation towards it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125009 Approved by: https://github.com/mikaylagawarecki	2024-05-03 00:50:54 +00:00
soulitzer	4440d0755a	Support custom layout call under torch dispatch mode (#125379 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125379 Approved by: https://github.com/jbschlosser	2024-05-02 23:44:12 +00:00
drisspg	7551755cec	Update tolerance for flex fp32 (#125444 ) # Summary Updates the tolerances to account for internal failure Pull Request resolved: https://github.com/pytorch/pytorch/pull/125444 Approved by: https://github.com/kit1980	2024-05-02 23:34:18 +00:00
Yanbo Liang	3b5f6b10ad	[Inductor] default block size for head_dim = 256 for flex attention (#125380 )
## H100
### torch.bfloat16
No major change, as expected.
```
| Type    | Speedup | batch_size | num_heads | q_seq_len | k_seq_len | head_dim | score_mod | dtype          |
|---------|---------|------------|-----------|-----------|-----------|----------|-----------|----------------|
| Average | 1.122   |            |           |           |           |          |           |                |
| Max     | 1.437   | 1          | 16        | 512       | 512       | 128      | head_bias | torch.bfloat16 |
| Min     | 0.895   | 1          | 16        | 1024      | 1024      | 64       | head_bias | torch.bfloat16 |
```
### torch.float32
Before: OOM when `head_dim` = 256
After:
```
| Type    | Speedup | batch_size | num_heads | q_seq_len | k_seq_len | head_dim | score_mod   | dtype         |
|---------|---------|------------|-----------|-----------|-----------|----------|-------------|---------------|
| Average | 2.231   |            |           |           |           |          |             |               |
| Max     | 3.760   | 16         | 16        | 4096      | 4096      | 64       | noop        | torch.float32 |
| Min     | 1.532   | 1          | 16        | 512       | 512       | 256      | causal_mask | torch.float32 |
```
## A100
### torch.bfloat16
Before:
```
| Type    | Speedup | batch_size | num_heads | q_seq_len | k_seq_len | head_dim | score_mod     | dtype          |
|---------|---------|------------|-----------|-----------|-----------|----------|---------------|----------------|
| Average | 0.587   |            |           |           |           |          |               |                |
| Max     | 0.960   | 1          | 16        | 512       | 512       | 64       | noop          | torch.bfloat16 |
| Min     | 0.017   | 8          | 16        | 4096      | 4096      | 256      | relative_bias | torch.bfloat16 |
```
After:
```
| Type    | Speedup | batch_size | num_heads | q_seq_len | k_seq_len | head_dim | score_mod | dtype          |
|---------|---------|------------|-----------|-----------|-----------|----------|-----------|----------------|
| Average | 0.756   |            |           |           |           |          |           |                |
| Max     | 0.931   | 1          | 16        | 512       | 512       | 64       | noop      | torch.bfloat16 |
| Min     | 0.467   | 16         | 16        | 1024      | 1024      | 256      | noop      | torch.bfloat16 |
```
### torch.float32
Before: OOM when `head_dim` = 256
After:
```
| Type    | Speedup | batch_size | num_heads | q_seq_len | k_seq_len | head_dim | score_mod   | dtype         |
|---------|---------|------------|-----------|-----------|-----------|----------|-------------|---------------|
| Average | 2.386   |            |           |           |           |          |             |               |
| Max     | 7.584   | 16         | 16        | 512       | 512       | 64       | noop        | torch.float32 |
| Min     | 0.948   | 1          | 16        | 512       | 512       | 256      | causal_mask | torch.float32 |
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125380 Approved by: https://github.com/drisspg	2024-05-02 22:51:07 +00:00
Lucas Pasqualin	5c7b71dccf	[DCP] Adds strict option to DefaultPlanner (#123869 ) ~Users may have custom use cases for the `strict` parameter in load. In my mind, if we automatically call `state_dict` and `load_state_dict` in save/load, we need to support the same functionality in `nn.Modules`.~ It turns out this is actually not related to nn.Module's strict param. Since `state_dict` is called inside `dcp.load`, it's actually impossible to create a model such that the following would raise an error:
```
state_dict = module.state_dict()
module.load_state_dict(state_dict, strict=True)
```
The issue is actually just when there are elements in `state_dict` which do not exist in the checkpoint. This PR adds the ability to configure this behavior through the DefaultLoadPlanner (see tests). Concretely, if the module has extra attributes not present in the checkpoint, we will only raise an error if `DefaultLoadPlanner.allow_partial_load==False` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123869 Approved by: https://github.com/fegin	2024-05-02 22:50:32 +00:00
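The behavior described here amounts to a key-set comparison. The sketch below is not the DCP planner code: `check_load_keys` is a hypothetical function, and only the `allow_partial_load` name is taken from the message.

```python
from typing import Iterable, List


def check_load_keys(
    state_dict_keys: Iterable[str],
    checkpoint_keys: Iterable[str],
    allow_partial_load: bool = False,
) -> List[str]:
    """Return keys requested by the state_dict but absent from the checkpoint.

    Raises if any are missing and partial loads are not allowed.
    """
    missing = sorted(set(state_dict_keys) - set(checkpoint_keys))
    if missing and not allow_partial_load:
        raise RuntimeError(f"Missing keys in checkpoint: {missing}")
    return missing


checkpoint = ["layer.weight", "layer.bias"]
model_keys = ["layer.weight", "layer.bias", "extra.buffer"]

tolerated = check_load_keys(model_keys, checkpoint, allow_partial_load=True)
try:
    check_load_keys(model_keys, checkpoint)
    strict_raised = False
except RuntimeError:
    strict_raised = True
print(tolerated, strict_raised)
```

Extra keys in the checkpoint are ignored in this sketch; only keys the module asks for and cannot find trigger the error.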
Jorge Pineda	2c8237c6aa	[ATen-VK] Resolve compiler_flags to allow Mac build (#125361 ) Summary:
## `-Wmissing-prototypes`
In ATen-Vulkan, we often define functions in `.cpp` files without declaring them in `.h` files or hiding them in an anonymous namespace. Example: [`Packing.cpp`'s channel_image_repacking()](`f1f142c44f/aten/src/ATen/native/vulkan/impl/Packing.cpp (L299-L348)`) On Mac, this results in a `-Wmissing-prototypes` warning, which is disabled in this change.
## `-Wshadow`
In `Adapter.cpp`, we shadow a variable called `properties`, which we fix in this change as opposed to disabling the warning.
Test Plan: CI Differential Revision: D56850324 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125361 Approved by: https://github.com/SS-JIA	2024-05-02 22:26:39 +00:00
William Wen	55c705b602	[dynamo] add trace_bytecode logging artifact (#125360 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125360 Approved by: https://github.com/ezyang	2024-05-02 22:01:00 +00:00
PyTorch MergeBot	a0e2f62edd	Revert "Include support for the scatter gather cuda kernels to allow for comp… (#124809 )" This reverts commit 9e24c263f998819f849bb8293323213101e9aefc. Reverted https://github.com/pytorch/pytorch/pull/124809 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/124809#issuecomment-2091751002))	2024-05-02 21:36:18 +00:00
David Chiu	b1b03992d0	Merge the pyi files into py files of optimizer (#125153 ) Merge the interfaces in pyi files into py files in `torch/optim`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125153 Approved by: https://github.com/janeyx99	2024-05-02 21:29:31 +00:00
drisspg	edad82fc90	Add private helper for determining which version of FA2 closest matches kernel version (#123653 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123653 Approved by: https://github.com/mikaylagawarecki	2024-05-02 21:28:23 +00:00
Ke Wen	0199ce8d6c	[pipelining] Add microbatch split and merge utils (#125273 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125273 Approved by: https://github.com/H-Huang ghstack dependencies: #124776, #124875, #124958	2024-05-02 21:09:47 +00:00
Sait Cakmak	1657f7e262	[Doc] Update docstrings for torch/random.py (#125265 ) Updates the docstrings for torch/random.py to clarify what device / RNG each function operates on. While trying to understand the difference between
```
state = torch.random.get_rng_state()
some_code
torch.random.set_rng_state(state)
```
and
```
with torch.random.fork_rng():
    some_code
```
I found out that there was a note about this in the docstring that wasn't being rendered on the website. I fixed that note and added additional clarifications on other functions in this file. Test Plan: Built the docs and verified that everything renders correctly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125265 Approved by: https://github.com/Balandat, https://github.com/lezcano	2024-05-02 20:55:23 +00:00
Sheng Fu	fc76764a56	Always pass down kernel_file and grid as string (#125384 ) From my test with an Ads production workload, I found that sometimes kernel_file is None and grid is a tuple. This crashes because ExecutionTraceObserver expects a string for both kernel_file and grid. This PR makes sure kernel_file and grid are always passed down as strings. We still need to find the root cause of why kernel_file is None. Unit test: buck test @mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125384 Approved by: https://github.com/davidberard98, https://github.com/sraikund16	2024-05-02 20:43:20 +00:00
Edward Z. Yang	dae574c713	Don't make replacements for i variables (#125398 ) This was introduced in https://github.com/pytorch/pytorch/pull/110262 but actually it looks like they were trying to hit unbacked SymInt. Now that unbacked SymInt is renamed to u, this code is no longer necessary Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125398 Approved by: https://github.com/lezcano, https://github.com/Skylion007	2024-05-02 20:38:09 +00:00
Lucas Pasqualin	4f62494bf9	[DCP] Move async logic into filesystem for better encapsulation (#124944 ) This logic is specific to FilesystemWriter, and now has a better place to live due to the new AsyncStager class Differential Revision: [D56578436](https://our.internmc.facebook.com/intern/diff/D56578436/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124944 Approved by: https://github.com/fegin ghstack dependencies: #122965, #124939	2024-05-02 20:31:33 +00:00
Manish Rajpal	2bbfb70831	ignore unsupported module from flop counter (#125346 ) Summary: Torchscript modules do not support forward hooks and thus can't work with flop_counter context manager for hierarchical output by passing a module to FlopCounterMode on construction. Currently any module that includes a script module causes an exception to be thrown so adding a try/catch to ignore any script modules for forward hooks. Test Plan: CI Signals Differential Revision: D56850661 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125346 Approved by: https://github.com/842974287	2024-05-02 20:30:52 +00:00
Lucas Pasqualin	799f1460af	[DCP] Provides default AsyncStager (#124939 ) Differential Revision: [D56575987](https://our.internmc.facebook.com/intern/diff/D56575987/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124939 Approved by: https://github.com/fegin ghstack dependencies: #122965	2024-05-02 19:48:54 +00:00
Lucas Pasqualin	3741fb3680	[DCP] Introduce async staging extension points (#122965 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #124944 * #124939 * __->__ #122965 Differential Revision: [D55493240](https://our.internmc.facebook.com/intern/diff/D55493240/) This PR is now ready for merge and is not an RFC. Major choices are:
-- the introduction of the AsyncStager protocol
-- removed `executor` from param.
-- leave async as a separate method (for now)
This proposal seeks to add extension points to dcp.async_save, allowing users to:
- Specify a specific staging method when calling async_save
- Allow a vehicle for also making the staging method async, to allow for cases where we may want to overlap with the training loop (e.g., overlap d2h with training and only synchronize at the optim.step)
- Potentially specify the execution method for doing async_save in parallel. For example some users may prefer a subprocess over a thread to avoid GIL issues.
A totally reasonable alternative to this entire proposal is to expect users who want this level of customization to write their own custom async save methods. Here's an example which addresses the issues mentioned in PR comments.
```
def custom_async_save(...):
    # this step accomplishes staging and includes the usual 'planning' calls (issue 1)
    buffered_writer = CpuBufferedWriter()  # this is stateful, contains a copy of state_dict
    dcp.save(state_dict, storage_writer=buffered_writer)

    final_storage_writer = FileSystemWriter()
    mp.spawn(  # issue2 is gone, do whatever you want here
        dcp.save,  # or some custom sub-process method which calls dcp.save under the hood
        buffered_writer.state_dict,  # lots of ways to do this, not really the most important part
        checkpoint_id=checkpoint_id,
        storage_writer=storage_writer,
        planner=planner,
        process_group=process_group,  # this actually wouldn't work, but again not the pt.
    )
    # leaving out the rest of the details for managing your extra special subprocess.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122965 Approved by: https://github.com/daulet-askarov	2024-05-02 19:01:55 +00:00
Jeff Daily	da991fac22	[ROCm][CI] upgrade CI to ROCm 6.1 (#124300 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124300 Approved by: https://github.com/malfet	2024-05-02 17:16:02 +00:00
Chien-Chin Huang	1eb7b8eb60	[PT2D] Ensure the trace rules are correct with distributed (#125333 ) Summary: 1. Avoid using `torch._dynamo.disable`. 2. Clear the LRU cache of the trace rules. This won't do anything if rules are not evaluated before PG initialization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125333 Approved by: https://github.com/yanboliang	2024-05-02 16:28:38 +00:00
Edward Z. Yang	e93b57a570	Add propagate_real_tensors mode for unbacked (#125115 ) A common complaint when working with data-dependent code in PyTorch is that it's hard to tell how far you are from the finish line: every time a GuardOnDataDependentSymNode error is hit, you have to somehow fix or workaround it to see the next one. This PR adds a new mode `torch._functorch.config.fake_tensor_propagate_real_tensors` which modifies fake tensors to also propagate real tensors. This means that when we try to guard on a data-dependent SymNode, we can actually produce a real result. We also produce a warning which you should consult to figure out what the crux points are. I ran this on vision_maskrcnn. In the baseline (without this mode), the model has 27 graph breaks, resulting in 40 graphs. With this mode on, the model has only 11 graph breaks, resulting in 15 graphs (the remaining graph breaks are due to missing functionality for item() on float tensor and some other Dynamo missing features.) You get a list of things that would have errored like this: ``` WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False 
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> False WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> False WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> False WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> False ``` Potential later follow ups: * Improve the warning messages (in particular, should provide user frames) * GC real tensors when they are no longer needed by tracing. Right now, this will use A LOT of memory, equal to as if your GC was broken and every intermediate tensor was kept live Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125115 Approved by: https://github.com/IvanKobzarev	2024-05-02 15:28:26 +00:00
Jez Ng	fb1bfe1156	Get cutlass_library import working under fbcode (#125257 ) Differential Revision: D56764089 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125257 Approved by: https://github.com/chenyang78	2024-05-02 15:17:10 +00:00
Kai Londenberg	8046de3512	[Inductor cutlass backend] Remove epilogue nodes from Kernel call (#124929 ) Minor refactoring: Remove unused "fused epilogue node" arguments from some method Kernel call signatures. Test Plan: Covered by current tests in test_cutlass_backend.py - no functional change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124929 Approved by: https://github.com/eellison	2024-05-02 13:02:31 +00:00
Animesh Jain	a13a0a2479	[dynamo][easy] Simple fixes to prepare for nn module guards (#125316 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125316 Approved by: https://github.com/williamwen42 ghstack dependencies: #125275	2024-05-02 12:08:11 +00:00
laithsakka	0b70026d3b	Do not pass none to has_pending_mutation (#125359 ) Fixes https://github.com/pytorch/pytorch/issues/125315 Several failures when inlining nn modules is enabled are due to passing None to has_pending_mutation. From the previous code, it sounds like it's expected for the variable to be None when not found. In that case we should skip it and not call has_pending_mutation. This is tested in https://github.com/pytorch/pytorch/pull/125354 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125359 Approved by: https://github.com/mlazos	2024-05-02 09:08:22 +00:00
Masaki Kozuki	aa7be72cc5	Convert `ForeachFuncInfo` to `dataclass` (#125001 ) - `ForeachFuncInfo` to `dataclass` for smaller diff from `OpInfo` - `skips` to `decorators` and `skip` to `xfail` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125001 Approved by: https://github.com/janeyx99, https://github.com/jeffdaily	2024-05-02 04:19:09 +00:00
Edward Z. Yang	da5d2d9b3e	Hotfix: restore CPP guard string in structured trace (#125303 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125303 Approved by: https://github.com/albanD	2024-05-02 03:57:19 +00:00
Wanchao Liang	fff7a31800	fix torchdeploy issue on sharddim_alltoall op (#125344 ) Summary: fix torchdeploy issues when registering the distributed op, similar to what functional collective did Differential Revision: D56850434 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125344 Approved by: https://github.com/XilunWu, https://github.com/fegin	2024-05-02 03:38:34 +00:00
Jeff Daily	f59ce798f9	[ROCm] TunableOp for scaled_mm (#123987 ) Adds a new ScaledGemmTunableOp implementation using hipblaslt. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123987 Approved by: https://github.com/jianyuh	2024-05-02 03:06:31 +00:00
Edward Z. Yang	5ea54839c9	Make min(stride, strides[idx]) in collapse_view_helper size oblivious (#125301 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125301 Approved by: https://github.com/albanD	2024-05-02 02:39:58 +00:00
albanD	b119e1bcc2	Fix refcount handling for dtype, layout and memory format (#125271 ) Finish fixing https://github.com/pytorch/pytorch/issues/124868 re-use our wrap() utils as much as possible and NewRef in other places. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125271 Approved by: https://github.com/colesbury	2024-05-02 02:34:34 +00:00
Edward Z. Yang	4731130ea8	Add a code comment about torch._check_is_size in tensor_split (#125292 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125292 Approved by: https://github.com/albanD	2024-05-02 02:25:38 +00:00
PyTorch MergeBot	a9309502af	Revert "Refactoring to remove unused variable (#125252 )" This reverts commit b094622bc954e179ddb8649652b87d2a81d7d500. Reverted https://github.com/pytorch/pytorch/pull/125252 on behalf of https://github.com/drisspg due to going to land codev ([comment](https://github.com/pytorch/pytorch/pull/125252#issuecomment-2089394606))	2024-05-02 01:49:57 +00:00
PyTorch MergeBot	b03fb49ed8	Revert "[dynamo] use lazy disable dynamo for manual seed (#125196 )" This reverts commit 8320b770fd9dc4671bc9eb0d535e14173e95cf45. Reverted https://github.com/pytorch/pytorch/pull/125196 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/125196#issuecomment-2089355842))	2024-05-02 00:57:39 +00:00
Danial Javady	9e24c263f9	Include support for the scatter gather cuda kernels to allow for comp… (#124809 ) Fixes #121965 This PR hopes to add support for complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now, as `complex<double>`, for example, will be more complicated. C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing. Please keep the following in mind: 1) I think this is my first time using PyTorch. 2) This is my first contribution to PyTorch. Environment: 3080 & WSL 2. `nvcc` is at 12.4. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809 Approved by: https://github.com/mikaylagawarecki	2024-05-01 23:58:35 +00:00
PyTorch MergeBot	f1f142c44f	Revert "Fakify script object inputs and attributes for non-strict export (#124239 )" This reverts commit ecc2e034f7e55bf9ff7f4e5df4e9086a5c92caaa. Reverted https://github.com/pytorch/pytorch/pull/124239 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/124239#issuecomment-2089305447))	2024-05-01 23:56:00 +00:00
David Berard	9022f131b5	[inductor] switch assume_aligned_inputs to False (#124336 ) In #123319, we guard some behavior behind the `assume_aligned_inputs` config option. If we set this to `False`, then the behavior added in #123319 becomes the default behavior. See the referenced PR for more details about the behavior affected. Side effects: * It's possible that this will hurt performance in some scenarios. For example, if an unaligned input is used in a matmul, it might be better to perform the clone to align it first. * This will occasionally cause recompiles. Specifically: the check we perform (`(storage_offset * get_dtype_size(dtype)) % ALIGNMENT == 0`) can be guarded on if the storage_offset becomes dynamic. storage_offset becomes dynamic during automatic_dynamic_shapes after a shape or stride changes. Previously, this was increasing graph breaks in cpu inductor torchbench tests (but is fixed by more carefully guarding checks on alignment, so that we don't run them and generate guards unless actually needed). Pull Request resolved: https://github.com/pytorch/pytorch/pull/124336 Approved by: https://github.com/eellison	2024-05-01 23:49:27 +00:00
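The alignment check quoted in #124336 can be tried out in plain Python. This is a minimal sketch under stated assumptions: the `ALIGNMENT` value of 16 bytes and the `DTYPE_SIZE` table are illustrative stand-ins, and `is_aligned` is a hypothetical helper, not Inductor's own function.

```python
ALIGNMENT = 16  # assumed byte alignment, for illustration only

# Assumed element sizes in bytes for a few dtypes.
DTYPE_SIZE = {"float32": 4, "float16": 2, "int8": 1}


def get_dtype_size(dtype: str) -> int:
    return DTYPE_SIZE[dtype]


def is_aligned(storage_offset: int, dtype: str) -> bool:
    """Same arithmetic as the check quoted in the commit message."""
    return (storage_offset * get_dtype_size(dtype)) % ALIGNMENT == 0


aligned = is_aligned(4, "float32")    # 4 elements * 4 bytes = 16 bytes
unaligned = is_aligned(1, "float32")  # 1 element * 4 bytes = 4 bytes
```

The result depends on `storage_offset`, which is why the commit notes that the check can turn into a guard once the offset becomes dynamic.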
Jianping Wu	c281d3a0cb	Enable UFMT on test_indexing&test_view_ops (#125112 ) Part of https://github.com/pytorch/pytorch/issues/123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125112 Approved by: https://github.com/ezyang	2024-05-01 23:44:53 +00:00
Pearu Peterson	9043ccafdf	Require nnz==0 in sparse meta tensors (#125221 ) As in the title and per discussion starting at https://github.com/pytorch/pytorch/pull/117907#issuecomment-2082426468 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125221 Approved by: https://github.com/amjames, https://github.com/ezyang	2024-05-01 23:41:49 +00:00
eellison	46f326eff5	explicitly reset stderr/stdout in precompilation (#125289 ) I was seeing a weird bug where, after running max-autotune, my stdout would be misdirected. Other people have not been able to repro this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125289 Approved by: https://github.com/shunting314, https://github.com/mlazos	2024-05-01 23:41:36 +00:00
Wes Bland	6f5f405b05	[ncclx] Rename NCCL-EXP to NCCLX (#125238 ) Reviewed By: kryanchun Differential Revision: D56534548 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125238 Approved by: https://github.com/kwen2501	2024-05-01 23:29:55 +00:00
yan-yhy	6cfb55dd5d	Add a variable for some testcases. (#124708 ) Some test cases can use 'TEST_PRIVATEUSE1_DEVICE_TYPE' to make adapting them to other devices more convenient. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/124708 Approved by: https://github.com/albanD	2024-05-01 23:19:12 +00:00
Joona Havukainen	c451d108da	Implemented isin_Tensor_Tensor_out for MPS backend (#124896 ) Addresses issue #124518, adds isin_Tensor_Tensor_out. Tests added to test_mps.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124896 Approved by: https://github.com/malfet, https://github.com/kulinseth	2024-05-01 23:14:05 +00:00
Catherine Lee	506eda538b	Fix windows build error not propagating (#125306 ) * Fixes https://github.com/pytorch/pytorch/issues/124886 * Kind of similar to https://github.com/pytorch/pytorch/pull/109393 I think what happens is `exit` and `exit /b` propagate the errorlevel correctly, but `exit /b` only exits the currently running batch script and not the entire cmd.exe (or whatever program is running the batch script), so `exit /b` exits with errorlevel 1, but the parent cmd exits with 0, and bash sees cmd's 0. I think `goto fail` and `exit` are the same thing when the batch script is run from a bash script, so either would work in this case? But the `goto fail` method might be better if someone happens to run the script on the command line. I assumed that anywhere anyone was exiting after checking the error code, they did want to exit completely, and I'm pretty sure that being inside a parenthesis counts as being a different script, so I changed everything to goto fail just in case; this might be too aggressive? Logs after this change for a build failure on cuda: https://github.com/pytorch/pytorch/actions/runs/8912185834/job/24475087535?pr=125306 ``` 2 errors detected in the compilation of "C:/actions-runner/_work/pytorch/pytorch/aten/src/ATen/native/cuda/AdaptiveMaxPooling3d.cu". AdaptiveMaxPooling3d.cu [7599/8420] Linking CXX shared library bin\torch_cpu.dll ninja: build stopped: subcommand failed. 
-- Building version 2.4.0a0+git3171c11 cmake -GNinja -DBUILD_ENVIRONMENT=win-vs2019-cuda11.8-py3 -DBUILD_PYTHON=True -DBUILD_TEST=True -DBUILD_TYPE=release -DBUILD_WHEEL=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER=C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin/nvcc.exe -DCMAKE_CUDA_COMPILER_LAUNCHER=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/randomtemp.exe;C:/actions-runner/_work/pytorch/pytorch/build/win_tmp\bin\sccache.exe -DCMAKE_CXX_COMPILER_LAUNCHER=sccache -DCMAKE_C_COMPILER_LAUNCHER=sccache -DCMAKE_GENERATOR=Ninja -DCMAKE_INSTALL_PREFIX=C:\actions-runner\_work\pytorch\pytorch\torch -DCMAKE_PREFIX_PATH=C:\Jenkins\Miniconda3\Lib\site-packages -DCUDA_NVCC_EXECUTABLE=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/nvcc.bat -DCUDNN_LIBRARY=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\lib\x64 -DNUMPY_INCLUDE_DIR=C:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -DPYTHON_EXECUTABLE=C:\Jenkins\Miniconda3\python.exe -DPYTHON_INCLUDE_DIR=C:\Jenkins\Miniconda3\Include -DPYTHON_LIBRARY=C:\Jenkins\Miniconda3/libs/python39.lib -DTORCH_BUILD_VERSION=2.4.0a0+git3171c11 -DTORCH_CUDA_ARCH_LIST=8.6 -DUSE_CUDA=1 -DUSE_NUMPY=True C:\actions-runner\_work\pytorch\pytorch cmake --build . --target install --config Release -- -j 8 (base) C:\actions-runner\_work\pytorch\pytorch>if errorlevel 1 goto fail (base) C:\actions-runner\_work\pytorch\pytorch>exit /b 1 Error: Process completed with exit code 1. ``` vs original https://github.com/pytorch/pytorch/actions/runs/8910674030/job/24470387612 ``` 2 errors detected in the compilation of "C:/actions-runner/_work/pytorch/pytorch/aten/src/ATen/native/cuda/AdaptiveMaxPooling3d.cu". AdaptiveMaxPooling3d.cu [7604/8420] Linking CXX shared library bin\torch_cpu.dll ninja: build stopped: subcommand failed. 
-- Building version 2.4.0a0+gite09f98c cmake -GNinja -DBUILD_ENVIRONMENT=win-vs2019-cuda11.8-py3 -DBUILD_PYTHON=True -DBUILD_TEST=True -DBUILD_TYPE=release -DBUILD_WHEEL=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER=C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin/nvcc.exe -DCMAKE_CUDA_COMPILER_LAUNCHER=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/randomtemp.exe;C:/actions-runner/_work/pytorch/pytorch/build/win_tmp\bin\sccache.exe -DCMAKE_CXX_COMPILER_LAUNCHER=sccache -DCMAKE_C_COMPILER_LAUNCHER=sccache -DCMAKE_GENERATOR=Ninja -DCMAKE_INSTALL_PREFIX=C:\actions-runner\_work\pytorch\pytorch\torch -DCMAKE_PREFIX_PATH=C:\Jenkins\Miniconda3\Lib\site-packages -DCUDA_NVCC_EXECUTABLE=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/nvcc.bat -DCUDNN_LIBRARY=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\lib\x64 -DNUMPY_INCLUDE_DIR=C:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -DPYTHON_EXECUTABLE=C:\Jenkins\Miniconda3\python.exe -DPYTHON_INCLUDE_DIR=C:\Jenkins\Miniconda3\Include -DPYTHON_LIBRARY=C:\Jenkins\Miniconda3/libs/python39.lib -DTORCH_BUILD_VERSION=2.4.0a0+gite09f98c -DTORCH_CUDA_ARCH_LIST=8.6 -DUSE_CUDA=1 -DUSE_NUMPY=True C:\actions-runner\_work\pytorch\pytorch cmake --build . --target install --config Release -- -j 8 (base) C:\actions-runner\_work\pytorch\pytorch>if errorlevel 1 exit /b + assert_git_not_dirty + [[ win-vs2019-cuda11.8-py3 != rocm ]] + [[ win-vs2019-cuda11.8-py3 != xla ]] ++ git status --porcelain ++ grep -v '?? third_party' ++ true + git_status= + [[ -n '' ]] + echo 'BUILD PASSED' BUILD PASSED ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125306 Approved by: https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/atalman	2024-05-01 22:06:47 +00:00
Brian Hirsh	599a2e25f1	Reland "make sure dynamo doesn't inline DTensor __new__ or __torch_dispatch__ (#123347 )" (#125288 ) Re-land of https://github.com/pytorch/pytorch/pull/123347. The original PR broke internal because of a circular import due to importing dynamo in the DTensor code. The new version uses `torch._dynamo_disable` to work around it. This reverts commit 9d88339b535f57cd0e2926c9ac4c2542e4490aac. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125288 Approved by: https://github.com/ezyang, https://github.com/yanboliang, https://github.com/yoyoyocmu, https://github.com/anijain2305, https://github.com/fegin ghstack dependencies: #124398, #124399, #124400	2024-05-01 21:56:01 +00:00
Brian Hirsh	9e9ba61fde	AOTAutograd: force tangents to be contiguous when subclass inner tensor is noncontiguous (#124400 ) Fixes https://github.com/pytorch/pytorch/issues/124397 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124400 Approved by: https://github.com/ezyang, https://github.com/yoyoyocmu ghstack dependencies: #124398, #124399	2024-05-01 21:56:01 +00:00
Brian Hirsh	5173cbe260	fix FakeTensor creation on noncontiguous subclasses (#124399 ) Fixes https://github.com/pytorch/pytorch/issues/125287 Fixes https://github.com/pytorch/pytorch/issues/124090, context on the issue Pull Request resolved: https://github.com/pytorch/pytorch/pull/124399 Approved by: https://github.com/soulitzer ghstack dependencies: #124398	2024-05-01 21:56:01 +00:00
Brian Hirsh	7058563078	support as_python_constant on PlacementClassVariable (#124398 ) Fixes an error for torchtitan + internal Pull Request resolved: https://github.com/pytorch/pytorch/pull/124398 Approved by: https://github.com/ezyang, https://github.com/wanchaol, https://github.com/yoyoyocmu	2024-05-01 21:56:01 +00:00
Edward Z. Yang	2d794bcb8a	Delete NegateSource handling, I think it's dead (#125311 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125311 Approved by: https://github.com/Skylion007	2024-05-01 21:36:50 +00:00
Avik Chaudhuri	746da8755c	switch tests from constrain_as* to torch._check* (#125253 ) To fix data-dependent errors we want to recommend that people use `torch._check` APIs. The `constrain_as` APIs should be fully subsumed by them, and in the future we should kill them entirely. Differential Revision: D56774333 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125253 Approved by: https://github.com/ezyang	2024-05-01 21:01:27 +00:00
Xia, Weiwen	dbcf123105	Upgrade submodule oneDNN to v3.4 (#122472 ) ## Improvements This upgrade fixes the following issues: - https://github.com/pytorch/pytorch/issues/120982 This upgrade brings the following new features: - Introduced memory descriptor serialization API. This API is needed to support freezing on CPU in AOTInductor (https://github.com/pytorch/pytorch/issues/114450) ## Validation results on CPU No regression was found. 1. NLP models accuracy/inference/training Model Name \| Mode\| Precision \| New \| Baseline \| New/Baseline -- \| -- \| -- \| -- \| -- \| -- bert-large \| accuracy \| fp32 \| 93.15325 \| 93.15325 \| 100.00% bert-large \| accuracy \| bf16 \| 93.20125 \| 93.20125 \| 100.00% bert-large \| accuracy \| int8 \| 92.66641 \| 92.66641 \| 100.00% LCM \| accuracy \| fp32 \| 44.11152 \| 44.11154 \| 100.00% LCM \| accuracy \| bf16 \| 43.57667 \| 43.65096 \| 100.17% ViT \| accuracy \| fp32 \| 0.8033 \| 0.8033 \| 100.00% ViT \| accuracy \| bf16 \| 0.8031 \| 0.8031 \| 100.00% ViT \| accuracy \| int8 \| 0.7985 \| 0.7985 \| 100.00% yolov7 \| accuracy \| fp32 \| 0.512 \| 0.512 \| 100.00% yolov7 \| accuracy \| bf16 \| 0.504 \| 0.504 \| 100.00% yolov7 \| accuracy \| int8 \| 0.507 \| 0.507 \| 100.00% bert-large \| realtime \| fp32 \| 37.433 \| 39.136 \| 95.65% bert-large \| realtime \| bf16 \| 166.592 \| 160.134 \| 104.03% bert-large \| realtime \| int8 \| 230.876 \| 222.594 \| 103.72% ViT \| realtime \| fp32 \| 288.19 \| 282.05 \| 102.18% ViT \| realtime \| bf16 \| 755.42 \| 741.1 \| 101.93% ViT \| realtime \| int8 \| 1060.94 \| 1092.47 \| 97.11% yolov7 \| realtime \| fp32 \| 17.06927 \| 16.47995 \| 103.58% yolov7 \| realtime \| bf16 \| 54.68561 \| 54.00723 \| 101.26% yolov7 \| realtime \| int8 \| 78.38271 \| 77.63214 \| 100.97% bert-large \| throughput \| fp32 \| 47.142 \| 47.341 \| 99.58% bert-large \| throughput \| bf16 \| 200.365 \| 200.806 \| 99.78% bert-large \| throughput \| int8 \| 144.999 \| 145.295 \| 99.80% LCM \| throughput \| fp32 \| 0.54913 
\| 0.54897 \| 100.03% LCM \| throughput \| bf16 \| 1.062417 \| 1.07772 \| 98.58% stable-diffusion \| throughput \| fp32 \| 0.03301 \| 0.0331 \| 99.73% stable-diffusion \| throughput \| bf16 \| 0.08773 \| 0.08849 \| 99.14% stable-diffusion \| throughput \| int8 \| 0.0491 \| 0.05024 \| 97.73% ViT \| throughput \| fp32 \| 342.55 \| 346.47 \| 98.87% ViT \| throughput \| bf16 \| 1263.4 \| 1268.32 \| 99.61% ViT \| throughput \| int8 \| 1331.3 \| 1345.32 \| 98.96% yolov7 \| throughput \| fp32 \| 115.313 \| 115.612 \| 99.74% yolov7 \| throughput \| bf16 \| 323.364 \| 323.747 \| 99.88% yolov7 \| throughput \| int8 \| 388.137 \| 384.236 \| 101.02% bert-large \| train_phase1 \| fp32 \| 34.223 \| 34.309 \| 99.75% bert-large \| train_phase1 \| bf16 \| 90.372 \| 88.453 \| 102.17% bert-large \| train_phase2 \| fp32 \| 7.307 \| 7.318 \| 99.85% Data Type \| Geomean -- \| -- fp32 \| 99.88% bf16 \| 100.70% int8 \| 99.88% all \| 100.16% 2. Torchbench cpu userbenchmark inference & training Test suite \| Geomean Ratio (New/baseline) -- \| -- eager_throughtput_bf16_infer \| 1.00x eager_throughtput_fp32_infer \| 1.00x jit_llga_throughtput_amp_bf16 \| 0.99x jit_llga_throughtput_fp32 \| 1.01x eager_throughtput_fx_int8 \| 1.00x eager_throughtput_bf16_train \| 1.00x eager_throughtput_fp32_train \| 1.00x 3. Inductor quantization (static & dynamic) accuracy & performance Config \| Performance geomean ratio (New/baseline) \| Accuracy ratio (New/baseline) -- \| -- \| -- Static quant PTQ \| 0.99x \| 1.00x Static quant PTQ_CPP_WRAPPER \| 0.98x \| 1.00x Static quant QAT \| 0.99x \| 1.00x Dynamic quant PTQ \| 1.00x \| 1.00x 4. 
Dynamo benchmarks Precision \| Shape \| Wrapper \| Thread \| Ratio old/new GEOMEAN \| Ratio old/new GEOMEAN -- \| -- \| -- \| -- \| -- \| -- \| \| \| \| Eager \| Inductor Float32 \| Static \| Default \| Multiple \| 0.998776 \| 1.002091 \| \| \| Single \| 1.014086 \| 1.01054 Float32 \| Dynamic \| Default \| Multiple \| 1.00386 \| 1.005975 \| \| \| Single \| 1.011036 \| 1.008317 AMP \| Static \| Default \| Multiple \| 0.996965 \| 1.005117 \| \| \| Single \| 1.00092 \| 0.995666 AMP \| Dynamic \| Default \| Multiple \| 0.9959 \| 0.995048 \| \| \| Single \| 1.002569 \| 0.994085 --- Pull Request resolved: https://github.com/pytorch/pytorch/pull/122472 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/atalman	2024-05-01 20:59:17 +00:00
albanD	c99617706e	Add lintrunner as dev dependency (#125304 ) As per title. We expect people to use it before pushing any PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125304 Approved by: https://github.com/Skylion007	2024-05-01 20:08:03 +00:00
feifan	197612c84c	ProcessGroupWrapper support custom backend (#124447 ) Fixes #ISSUE_NUMBER In the current code, ProcessGroupWrapper works only for `GLOO, NCCL, UCC` when `TORCH_DISTRIBUTED_DEBUG=DETAIL`. I read the ProcessGroupWrapper code and found that communication_op in ProcessGroupWrapper is just communication_op in origin_backend + runCollectiveChecks in gloo, like allreduce: `82e0153487/torch/csrc/distributed/c10d/ProcessGroupWrapper.cpp (L406-L411)` `runCollectiveChecks` is used to run the `collective finger print` check for tensors and run gloo's `monitoredBarrier`. `82e0153487/torch/csrc/distributed/c10d/ProcessGroupWrapper.cpp (L586-L590)` I don't know why ProcessGroupWrapper doesn't work for all backends, but I think custom backends can support it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124447 Approved by: https://github.com/kwen2501	2024-05-01 19:59:55 +00:00
Edward Z. Yang	b4ccc615cd	Do exact type match on int so we don't pick up bool here too (#125305 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125305 Approved by: https://github.com/Skylion007	2024-05-01 19:46:36 +00:00
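The motivation for an exact type match in #125305 comes from a Python language fact: `bool` is a subclass of `int`, so an `isinstance` check accepts `True` and `False`. A minimal sketch (the helper name `is_exact_int` is hypothetical and not taken from the PR):

```python
def is_exact_int(value: object) -> bool:
    """True only for plain int, not for bool or other int subclasses."""
    return type(value) is int


loose = isinstance(True, int)  # bool is a subclass of int
strict = is_exact_int(True)
```

Here `loose` is `True` while `strict` is `False`, which is the distinction the commit title describes.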
angelayi	a216d87c6b	[export] Fix for unflattening modules with duplicate tensors (#125192 ) In the given test case, we have a ModuleList of 3 modules (`norm.0`, `norm.1`, `norm.2`) which share the same `weight` and `bias` tensors. However when we trace, they all end up pointing to one state dict name, (ex. `norm.2`). ``` graph(): %p_norms_0_weight : [num_users=0] = placeholder[target=p_norms_0_weight] %p_norms_0_bias : [num_users=0] = placeholder[target=p_norms_0_bias] %p_norms_1_weight : [num_users=0] = placeholder[target=p_norms_1_weight] %p_norms_1_bias : [num_users=0] = placeholder[target=p_norms_1_bias] %p_norms_2_weight : [num_users=3] = placeholder[target=p_norms_2_weight] %p_norms_2_bias : [num_users=3] = placeholder[target=p_norms_2_bias] %input_ : [num_users=1] = placeholder[target=input_] %native_layer_norm : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%input_, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {}) %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {}) %native_layer_norm_1 : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%getitem, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {}) %getitem_3 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm_1, 0), kwargs = {}) %native_layer_norm_2 : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%getitem_3, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {}) %getitem_6 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm_2, 0), kwargs = {}) return (getitem_6,) ``` This causes an error in the unflattener where after constructing the submodules for `norm.0`, it will have the graph pointing to `norm.2.weight` and `norm.2.bias`: ``` graph(): %p_norms_2_bias : [num_users=1] = placeholder[target=p_norms_2_bias] 
%p_norms_2_weight : [num_users=1] = placeholder[target=p_norms_2_weight] %input_ : [num_users=1] = placeholder[target=input_] %native_layer_norm : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%input_, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {}) %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {}) return getitem ``` Since the attributes are not within the same scope of the graph, (`norm.0` vs. `norm.2`), they will not be added to the subgraph, causing an error. So this PR handles the duplicate state dict attributes by modifying the `inputs_to_state` dict to map from node names to a list of possible state dict target names. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125192 Approved by: https://github.com/zhxchen17	2024-05-01 19:12:50 +00:00
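The fix described in #125192 (mapping each graph input to a list of possible state dict names) can be illustrated with plain dictionaries. This is a sketch under assumptions: the data, the `resolve` helper, and the scope-prefix matching rule are hypothetical and do not reproduce the unflattener's actual code.

```python
from collections import defaultdict

# Hypothetical data: three module paths share one weight tensor, but the
# traced graph names only the last one (as in the commit's example).
shared = {"norms.0.weight": "w", "norms.1.weight": "w", "norms.2.weight": "w"}
graph_input_to_state = {"p_norms_2_weight": "norms.2.weight"}

# Group state dict names by the tensor they refer to.
names_by_tensor = defaultdict(list)
for name, tensor_id in shared.items():
    names_by_tensor[tensor_id].append(name)

# Map each graph input to every state dict name that aliases its tensor.
inputs_to_state = {
    node: names_by_tensor[shared[target]]
    for node, target in graph_input_to_state.items()
}


def resolve(node: str, scope: str) -> str:
    """Pick the alias that lives inside the submodule being built."""
    for candidate in inputs_to_state[node]:
        if candidate.startswith(scope + "."):
            return candidate
    raise KeyError(f"{node} has no alias under {scope}")


resolved = resolve("p_norms_2_weight", "norms.0")
```

With a single target name per input, the lookup for scope `norms.0` would find nothing; with a list of aliases, the matching name is available.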
mashaobin	af67704dcc	[privateuse1] _refs.masked_fill support privateuse1 when value.device.type is cpu (#124835 ) _refs.masked_fill support privateuse1 when value.device.type is cpu. 1. maybe I should consider whether this modification meets the expectations of other privateuse1 devices, 2. add TestCase Fixes #124693 Co-authored-by: albanD <desmaison.alban@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124835 Approved by: https://github.com/albanD	2024-05-01 18:57:14 +00:00
Mihai	07422fd0b9	add missing space to first cmake append (#125294 ) the first append not having a space incorrectly merges it to any previous arguments, like `-allow-unsupported-compiler` in my case which results in a silly error: `unrecognized command-line option '-allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS'` full log: ``` python setup.py develop Building wheel torch-2.4.0a0+git75fa54a -- Building version 2.4.0a0+git75fa54a cmake3 -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/code/pytorch/torch -DCMAKE_PREFIX_PATH=/code/pytorch/.venv/lib/python3.12/site-packages;/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gcc-13.2.0-noa2f4oqalxzqvsebhuntndewgt4gq4h:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/zstd-1.5.6-z3guwm4l5rmmsv4g4wvkej3ri3bppeja:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/zlib-ng-2.1.6-kwi4ljobodjgv5eetnga4bow6crdlacl:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/mpc-1.3.1-nuwa2snyzm265lsupa2dkmxxyhiqcv7e:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/mpfr-4.2.1-wepuwobwttxbtz3nguimxa2mlljjozsi:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gmp-6.2.1-ashy6kiitonxv2f365f4q3beggzf3646:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gcc-runtime-14.0.1-wmogkqrzn7t57dogaake2hmhjbod27gs -DNUMPY_INCLUDE_DIR=/code/pytorch/.venv/lib64/python3.12/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/code/pytorch/.venv/bin/python -DPYTHON_INCLUDE_DIR=/usr/include/python3.12 -DPYTHON_LIBRARY=/usr/lib64/libpython3.12.so.1.0 -DTORCH_BUILD_VERSION=2.4.0a0+git75fa54a -DUSE_NUMPY=True /code/pytorch -- /usr/lib64/ccache/c++ /code/pytorch/torch/abi-check.cpp -o /code/pytorch/build/abi-check -- Determined _GLIBCXX_USE_CXX11_ABI=1 -- Current compiler supports avx2 extension. Will build perfkernels. -- Current compiler supports avx512f extension. Will build fbgemm. 
-- The CUDA compiler identification is NVIDIA 12.4.131 -- Detecting CUDA compiler ABI info -- Detecting CUDA compiler ABI info - failed -- Check for working CUDA compiler: /usr/local/cuda-12/bin/nvcc -- Check for working CUDA compiler: /usr/local/cuda-12/bin/nvcc - broken CMake Error at /usr/share/cmake/Modules/CMakeTestCUDACompiler.cmake:59 (message): The CUDA compiler "/usr/local/cuda-12/bin/nvcc" is not able to compile a simple test program. It fails with the following output: Change Dir: '/code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl' Run Build Command(s): /code/pytorch/.venv/bin/ninja -v cmTC_ee207 [1/2] /usr/local/cuda-12/bin/nvcc -forward-unknown-to-host-compiler -allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all "--generate-code=arch=compute_52,code=[compute_52,sm_52]" -MD -MT CMakeFiles/cmTC_ee207.dir/main.cu.o -MF CMakeFiles/cmTC_ee207.dir/main.cu.o.d -x cu -c /code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl/main.cu -o CMakeFiles/cmTC_ee207.dir/main.cu.o FAILED: CMakeFiles/cmTC_ee207.dir/main.cu.o /usr/local/cuda-12/bin/nvcc -forward-unknown-to-host-compiler -allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all "--generate-code=arch=compute_52,code=[compute_52,sm_52]" -MD -MT CMakeFiles/cmTC_ee207.dir/main.cu.o -MF CMakeFiles/cmTC_ee207.dir/main.cu.o.d -x cu -c /code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl/main.cu -o CMakeFiles/cmTC_ee207.dir/main.cu.o gcc: error: unrecognized command-line option '-allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS' ninja: build stopped: subcommand failed. CMake will not be able to correctly generate this project. Call Stack (most recent call first): cmake/public/cuda.cmake:47 (enable_language) cmake/Dependencies.cmake:44 (include) CMakeLists.txt:758 (include) -- Configuring incomplete, errors occurred! 
``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125294 Approved by: https://github.com/albanD	2024-05-01 18:35:54 +00:00
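The failure in the log above comes from plain string concatenation of compiler flags. A minimal sketch of the effect, written in Python rather than CMake; the two helper functions are hypothetical and only the flag names come from the log:

```python
# Hypothetical helpers: they imitate appending to a flags string and are
# not taken from PyTorch's CMake files.
def append_without_space(existing: str, new_flag: str) -> str:
    # Mimics an append that omits the separating space.
    return existing + new_flag


def append_with_space(existing: str, new_flag: str) -> str:
    # Mimics the fixed append; strip() covers the empty-prefix case.
    return f"{existing} {new_flag}".strip()


user_flags = "-allow-unsupported-compiler"
new_flag = "-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS"

merged = append_without_space(user_flags, new_flag)
fixed = append_with_space(user_flags, new_flag)

# The compiler driver splits its command line on whitespace, so the merged
# string is seen as one unknown option while the fixed string yields two.
print(merged.split())
print(fixed.split())
```

The merged form is the single token that `gcc` rejects in the log; the fixed form keeps the user-supplied flag and the define separate.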
amoskvic	bf6acf9add	[ROCm] Add extra cuda_to_hip_mappings.py (#125108 ) Adding extra mappings discovered when hipifying the backward CUDA kernel of the Mamba model (https://github.com/state-spaces/mamba/). Co-authored-by: Jeff Daily <jeff.daily@amd.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125108 Approved by: https://github.com/Skylion007, https://github.com/jeffdaily	2024-05-01 18:31:02 +00:00
Deng Weishi	c8d2a55273	Intel GPU: specify the tolerance for torchbench models (#125213 ) We encountered some model accuracy failures as the tolerance is critical. In general, we align with CUDA practice. This PR intends to adjust the tolerance for Torchbench models for training mode on Intel GPU devices and aligns with CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125213 Approved by: https://github.com/desertfire	2024-05-01 17:45:15 +00:00
aaitzhan	e3627d05e7	[CMake] Add NVPL BLAS/LAPACK option (#125268 ) This PR adds an [NVPL](https://docs.nvidia.com/nvpl/introduction.html) BLAS/LAPACK option to CMake for `aarch64` (ARM) machines. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125268 Approved by: https://github.com/albanD	2024-05-01 17:26:28 +00:00
Shivam Raikundalia	39eb5d4fa4	Add Sanity Testing to Pytorch Profiler (#124773 ) Summary: In recent weeks, we have encountered bugs in both the normal synchronous trace and on-demand tracing. This diff on its own does sanity checking to make sure the profiler does not have spans that extend past the boundaries that we expect. It also checks some basic properties of the traces we expect to see. Right now the sanity tests check some basic properties to make sure that the traces are not completely broken. Requests/suggestions for other properties are welcome. Test Plan: Run the tests in OSS and Buck Reviewed By: aaronenyeshi Differential Revision: D56374298 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124773 Approved by: https://github.com/aaronenyeshi	2024-05-01 16:59:35 +00:00
PyTorch MergeBot	4d410155b2	Revert "Include support for the scatter gather cuda kernels to allow for comp… (#124809 )" This reverts commit e09f98c705e4851414cd8ddf21949177af2b13aa. Reverted https://github.com/pytorch/pytorch/pull/124809 on behalf of https://github.com/clee2000 due to windows build failure is real, https://github.com/pytorch/pytorch/actions/runs/8910674030/job/24470387612#step:11:11236 is the correct failure line, ignore the statement saying build passed, batch is errorcodes arent propagating again ([comment](https://github.com/pytorch/pytorch/pull/124809#issuecomment-2088680371))	2024-05-01 16:02:02 +00:00
Catherine Lee	e16f1ee4cc	[ez][CI] Move test_modules and test_schema_check off CI_SERIAL_LIST (#125193 ) * Related https://github.com/pytorch/pytorch/pull/124085 As in title, move test_modules and test_schema_check off CI_SERIAL_LIST If things fail, they can get the serialTest decorator instead Pull Request resolved: https://github.com/pytorch/pytorch/pull/125193 Approved by: https://github.com/huydhn	2024-05-01 15:48:48 +00:00
Sunita Nadampalli	8fde9a988c	CI: Extending unit test coverage for aarch64 linux (#125255 ) Adding core, dynamo and inductor unit tests for aarch64 linux CI runs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125255 Approved by: https://github.com/malfet, https://github.com/atalman	2024-05-01 15:37:52 +00:00
Apurva Jain	b094622bc9	Refactoring to remove unused variable (#125252 ) Summary: Removed unused variable for running encoder Test Plan: buck test //caffe2/test:transformers Differential Revision: D56771972 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125252 Approved by: https://github.com/drisspg	2024-05-01 15:17:45 +00:00
Danial Javady	e09f98c705	Include support for the scatter gather cuda kernels to allow for comp… (#124809 ) Fixes #121965 This PR hopes to add support for complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now as `complex<double>`, for example, will be more complicated. C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing. Please keep the following in mind: 1) I think this is my first time using PyTorch. 2) This is my first contribution to PyTorch. Environment: 3080 & WSL 2. `nvcc` is at 12.4. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809 Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki	2024-05-01 14:31:31 +00:00
Alexander Kurakin	e421f1b4a8	docs: `torch.nn.utils.rnn`: docs improve (#123559 ) docs: `torch.nn.utils.rnn`: docs improve Pull Request resolved: https://github.com/pytorch/pytorch/pull/123559 Approved by: https://github.com/mikaylagawarecki	2024-05-01 14:27:37 +00:00
Nikita Shulga	a2715144c3	Add NEON-accelerated int8mm for bfloat16 (#125290 ) As apparently `vshlq_u32` is faster than `vcvt_f32_f16`. Refactor NEON `tinygemm_kernel` to rely on `load_as_float32x4` and `load_as_float32x4x2` and implement them for float16 (using vcvt), bfloat16 (using left shift) and plain float32 (not using anything). As a result, stories110M runs at 60 tokens/sec with f16, but at 66 tokens/sec with bf16 and 75 tokens/sec with f32, though more bandwidth demand starts to favor reduced floating types as model size gets bigger. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125290 Approved by: https://github.com/mikekgfb	2024-05-01 14:04:49 +00:00
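The "left shift" mentioned for bfloat16 works because a bfloat16 value is the upper 16 bits of an IEEE 754 float32, so widening only needs a 16-bit shift, whereas float16 has a different exponent width and needs a real conversion. A scalar sketch in Python; the function names are hypothetical and this is not the NEON kernel from the PR:

```python
import struct


def bf16_bits_to_float32(bits: int) -> float:
    """Widen a bfloat16 bit pattern by moving it into the high half of a float32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def float32_to_bf16_bits(value: float) -> int:
    """Truncate a float32 to bfloat16 bits (no rounding; for illustration only)."""
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16


one = bf16_bits_to_float32(0x3F80)        # bit pattern of 1.0
minus_two = bf16_bits_to_float32(0xC000)  # bit pattern of -2.0
round_trip = bf16_bits_to_float32(float32_to_bf16_bits(1.5))
print(one, minus_two, round_trip)
```

A vectorized kernel would apply the same shift to several lanes at once; the scalar version only shows why no arithmetic conversion is required.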
cdzhan	9fbb4dfc12	Fix AttributeError when doing mock patch for FileTimerServerTest.test_expired_timers (#125144 ) Fix the patch failure, and we should patch the function where it is used, not where it is defined. Failure info: ```bash root@cambricon-PowerEdge-C4140:/workspace# python file_based_timer_test.py -k test_expired_timers /opt/conda/lib/python3.10/site-packages/torch/_custom_ops.py:253: DeprecationWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch. return torch.library.impl_abstract(qualname, func, _stacklevel=2) E ====================================================================== ERROR: test_expired_timers (__main__.FileTimerServerTest) tests that a single expired timer on a process should terminate ---------------------------------------------------------------------- Traceback (most recent call last): File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2757, in wrapper method(args, *kwargs) File "/opt/conda/lib/python3.10/unittest/mock.py", line 1376, in patched with self.decoration_helper(patched, File "/opt/conda/lib/python3.10/contextlib.py", line 135, in __enter__ return next(self.gen) File "/opt/conda/lib/python3.10/unittest/mock.py", line 1358, in decoration_helper arg = exit_stack.enter_context(patching) File "/opt/conda/lib/python3.10/contextlib.py", line 492, in enter_context result = _cm_type.__enter__(cm) File "/opt/conda/lib/python3.10/unittest/mock.py", line 1447, in __enter__ original, local = self.get_original() File "/opt/conda/lib/python3.10/unittest/mock.py", line 1420, in get_original raise AttributeError( AttributeError: <module 'torch.distributed.elastic.timer' from '/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/timer/__init__.py'> does not have the attribute 'log_debug_info_for_expired_timers' To execute this test, run the following from the 
base repo dir: python file_based_timer_test.py -k test_expired_timers This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ---------------------------------------------------------------------- Ran 1 test in 0.792s FAILED (errors=1) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125144 Approved by: https://github.com/gag1jain	2024-05-01 12:08:04 +00:00
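The rule stated in the commit above, patch the function where it is used and not where it is defined, can be reproduced without PyTorch. A minimal sketch with two throwaway modules; `timer_defs`, `timer_user`, `log_expired` and `run` are hypothetical names, not the ones in `torch.distributed.elastic.timer`:

```python
import sys
import types
from unittest import mock

# Hypothetical module that defines the helper.
defs = types.ModuleType("timer_defs")
exec("def log_expired():\n    return 'real'\n", defs.__dict__)
sys.modules["timer_defs"] = defs

# Hypothetical module that imports the helper by name and calls it.
user = types.ModuleType("timer_user")
exec(
    "from timer_defs import log_expired\n"
    "\n"
    "def run():\n"
    "    return log_expired()\n",
    user.__dict__,
)
sys.modules["timer_user"] = user

# Patching the defining module does not affect the name already bound in timer_user.
with mock.patch("timer_defs.log_expired", return_value="mocked"):
    patched_at_definition = user.run()

# Patching the name in the module that uses it takes effect.
with mock.patch("timer_user.log_expired", return_value="mocked"):
    patched_at_use = user.run()

# Patching an attribute the target module does not have raises AttributeError,
# which is the kind of failure shown in the traceback.
try:
    with mock.patch("timer_user.no_such_function"):
        pass
    missing_attribute_raised = False
except AttributeError:
    missing_attribute_raised = True

print(patched_at_definition, patched_at_use, missing_attribute_raised)
```

`from module import name` copies the binding into the importing module, which is why the patch target has to be the module whose namespace the test subject reads at call time.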
aaitzhan	47ba7a76e2	[ATen][CUDA][AMP] Fix dtype mismatch in linalg_vector_norm (#125175 ) Fixes #125174 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125175 Approved by: https://github.com/eqy, https://github.com/lezcano	2024-05-01 10:57:12 +00:00
Nikita Shulga	c59cce38a9	[MacOS][CPUInductor] Fix includes to system Python (#125285 ) On MacOS 14.4, system Python is configured to point to a non-existing include dir ``` % /usr/bin/python3 -c "import sysconfig;print(sysconfig.get_path('include'))" /Library/Python/3.9/include ``` Work around the issue by composing path to include folder from `stdlib` config, which points to ``` % /usr/bin/python3 -c "import sysconfig;print(sysconfig.get_path('stdlib'))" /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125285 Approved by: https://github.com/kit1980	2024-05-01 10:39:13 +00:00
Ke Wen	52142192d4	[pipelining] Add stage backward function (#124958 ) This is a helper function which: 1. computes the gradients for the stage inputs, and 2. accumulates gradients for the stage module's parameters. A unit test for this function is also added. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124958 Approved by: https://github.com/wconstab ghstack dependencies: #124776, #124875	2024-05-01 07:56:58 +00:00
Yanbo Liang	aead440c62	[Inductor] Further tune block size for templated attention on H100 (#125286 ) Run a script to enumerate and get the best default block size for templated attention. A100 -> no change, check numbers at #125139 H100 ## torch.bfloat16 Before: ``` \| Type \| Speedup \| batch_size \| num_heads \| q_seq_len \| k_seq_len \| head_dim \| score_mod \| dtype \| \|---------\|-----------\|--------------\|-------------\|-------------\|-------------\|------------\|---------------\|----------------\| \| Average \| 1.103 \| \| \| \| \| \| \| \| \| Max \| 1.322 \| 8 \| 16 \| 512 \| 512 \| 64 \| noop \| torch.bfloat16 \| \| Min \| 0.829 \| 1 \| 16 \| 1024 \| 1024 \| 128 \| relative_bias \| torch.bfloat16 \| ``` After: ``` \| Type \| Speedup \| batch_size \| num_heads \| q_seq_len \| k_seq_len \| head_dim \| score_mod \| dtype \| \|---------\|-----------\|--------------\|-------------\|-------------\|-------------\|------------\|---------------\|----------------\| \| Average \| 1.137 \| \| \| \| \| \| \| \| \| Max \| 1.442 \| 1 \| 16 \| 512 \| 512 \| 128 \| relative_bias \| torch.bfloat16 \| \| Min \| 0.913 \| 1 \| 16 \| 1024 \| 1024 \| 64 \| head_bias \| torch.bfloat16 \| ``` ## torch.float32 Before: ``` \| Type \| Speedup \| batch_size \| num_heads \| q_seq_len \| k_seq_len \| head_dim \| score_mod \| dtype \| \|---------\|-----------\|--------------\|-------------\|-------------\|-------------\|------------\|---------------\|---------------\| \| Average \| 2.269 \| \| \| \| \| \| \| \| \| Max \| 3.740 \| 16 \| 16 \| 1024 \| 1024 \| 64 \| noop \| torch.float32 \| \| Min \| 0.761 \| 1 \| 16 \| 512 \| 512 \| 128 \| relative_bias \| torch.float32 \| ``` After: ``` \| Type \| Speedup \| batch_size \| num_heads \| q_seq_len \| k_seq_len \| head_dim \| score_mod \| dtype \| \|---------\|-----------\|--------------\|-------------\|-------------\|-------------\|------------\|-------------\|---------------\| \| Average \| 2.489 \| \| \| \| \| \| \| \| \| Max \| 3.755 
\| 16 \| 16 \| 4096 \| 4096 \| 64 \| noop \| torch.float32 \| \| Min \| 1.609 \| 1 \| 16 \| 512 \| 512 \| 64 \| head_bias \| torch.float32 \| ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125286 Approved by: https://github.com/Chillee	2024-05-01 07:34:08 +00:00
Edward Z. Yang	c511aed27f	[Meta Tensor] fix meta inplace set storage (#123880 ) Fixes #123879 Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123880 Approved by: https://github.com/ezyang	2024-05-01 06:53:49 +00:00
Edward Z. Yang	c3c4465f50	Add has_guarded_code to CompilationMetrics (#125279 ) While studying some tlparse, I noticed that CompilationMetrics was reporting that there was no error for frames that have no nodes. I'm pretty sure we don't actually install a frame in this situation. has_guarded_code will tell us if that's the case, because it says if the GuardedCode object is None or not. Actually, while working on this, I was wondering if we can ever trigger the "skip this frame entirely, do not trace it ever again" codepath, as best as I could tell, it's impossible for this to happen by the time we get to compilation metrics block. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125279 Approved by: https://github.com/yanboliang	2024-05-01 06:12:05 +00:00
cyy	081f41a920	Use BFloat16 in distributed quantization when supported by NCCL (#125113 ) This PR enables BFloat16 in torch/csrc/distributed/c10d/quantization/quantization_gpu.cu . Pull Request resolved: https://github.com/pytorch/pytorch/pull/125113 Approved by: https://github.com/kwen2501	2024-05-01 05:43:35 +00:00
Sharvil Nanavati	14857e71c2	Export `torch.jit.interface` from `torch.jit` package (#125209 ) Seems like this symbol was overlooked when other symbols were exported from `torch.jit`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125209 Approved by: https://github.com/ezyang	2024-05-01 05:38:05 +00:00
Sam Larsen	75a8e9ee77	[inductor] better cache clearing in fx graph cache tests (#125280 ) Summary: There's a shortcoming in the FX graph cache tests in that they don't fully clear all inductor in-memory caches when testing the cache-hit path: We were previously accessing the FX graph cache correctly, but when loading the source object using the PyCodeCache.load_by_key_path() method, _that_ path was serving entries out of memory. To better mimic what happens during warm start (i.e., a new process), we should clear all in-memory caches. Test Plan: updated the unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/125280 Approved by: https://github.com/eellison	2024-05-01 04:47:46 +00:00
Michael Lazos	787afc5180	Add LR as tensor tests (#123750 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123750 Approved by: https://github.com/janeyx99	2024-05-01 04:46:49 +00:00
Nikita Shulga	1c905f1be3	[EZ][BE] Don't import pathlib twice (#125260 ) It was imported once as `import pathlib` and second time as `from pathlib import Path` Stick to the 2nd flavor Pull Request resolved: https://github.com/pytorch/pytorch/pull/125260 Approved by: https://github.com/kit1980	2024-05-01 04:08:16 +00:00
Andrew Gu	abaa717350	[FSDP2] Removed logic to save and remove pre-backward hook handles (#125269 ) 1. This PR removes the logic for saving and removing the pre-backward hook handles (which is registered via `register_multi_grad_hook(mode="any")`). 2. This PR removes the logic for _trying_ to guard against mistargeted prefetches that relies on querying if the engine will execute the module output tensors' `grad_fn`s. (See https://github.com/pytorch/pytorch/pull/118118 for original motivation.) For 1, the logic was error prone since it relied on `set_is_last_backward(False)` being set correctly or else pre-backward hooks could be de-registered too early. We would prefer to match the hook lifetimes with that of the autograd graph. This solves a bug with a 1f1b interleaved schedule. If we directly remove the manual saving/removing hook handle logic, then we have a ref cycle where the tensors' `grad_fn`s are passed to the hook function. We decide to simply remove this `grad_fn` logic since (1) it cannot perfectly prevent mistargeted prefetches and (2) it introduces undesired complexity. In the future, we may prefer a different mechanism to override the prefetching for more complex/dynamic use cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125269 Approved by: https://github.com/weifengpy ghstack dependencies: #125190, #125191	2024-05-01 03:51:30 +00:00
Animesh Jain	37c993546d	[dynamo][guards] Bug fix for set_export_info (#125275 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125275 Approved by: https://github.com/yanboliang	2024-05-01 03:46:26 +00:00
Isuru Fernando	4d5f8070c4	add a decomposition for select_scatter (#124426 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124426 Approved by: https://github.com/peterbell10	2024-05-01 03:23:18 +00:00
David Berard	e9ce23985f	[TorchScript] attach target function to OSError when source can't be found (#125248 ) Before, it would be hard to figure out which function/module in particular was causing the OSError. Now we'll try to print the function/module string. Differential Revision: [D56768365](https://our.internmc.facebook.com/intern/diff/D56768365) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125248 Approved by: https://github.com/eellison	2024-05-01 03:18:55 +00:00
Will Constable	8f31988088	[C10D] Document 'tag' limitation for nccl send/recv (#125278 ) Existing documentation on isend/irecv also applies to send/recv. This PR copies the doc/warning to send/recv ops as well. Note: tag may be supplied, but will be ignored when used with nccl backend. Fixes #94819 #125079 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125278 Approved by: https://github.com/kwen2501	2024-05-01 02:53:30 +00:00
Sam Larsen	74e8817311	[inductor] Minor fixes to various tests before enabling fx graph caching in OSS by default (#125258 ) Summary: Discovered breakages by enabling codecache by default and doing a CI run. I'll commit these fixes first and eventually enabling caching by default will (hopefully) be a one-liner. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125258 Approved by: https://github.com/eellison	2024-05-01 02:34:01 +00:00
William Wen	0506e95433	[dynamo] support inactive context managers across graph breaks (#125203 ) Fix https://github.com/pytorch/pytorch/issues/124900. When we reconstruct `ContextWrappingVariable`s, we only reconstruct the context class, not the object. Normally, contexts are active (via `with ctx:`) and we initialize the context object in the resume function. But for the case of inactive contexts (contexts declared ahead of time before the `with` block), we do not reconstruct them properly in the optimized bytecode or resume function. So this PR adds initialization for inactive contexts in the resume function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125203 Approved by: https://github.com/jansel	2024-05-01 01:49:09 +00:00
Henry Hu	1b9d353e4f	[Torch] Add more mm kernel choices (#125000 ) Differential Revision: D56616836 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125000 Approved by: https://github.com/htyu	2024-05-01 01:40:24 +00:00
Sherlock Huang	a59dc14877	Keep node.meta when fusing subgraph (#125261 ) Summary: When CapabilityBasedPartitioner creates the fused subgraph as the call_module node, it didn't populate the node.meta["val"] field. Test Plan: OSS CI Differential Revision: D56789259 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125261 Approved by: https://github.com/zhxchen17	2024-05-01 01:38:28 +00:00
Menglu Yu	0ee5c14163	[PT2][Optimus] Read the patterns from the config instead of hard-code passes (#125136 ) Summary: Due to a compatibility issue, we hard-coded the passes that do the pattern optimization. Here, we revisit the method, since the changes have been in production packages for a while. We instead read from the config to decide whether to do a specific pattern optimization, which makes adding follow-up patterns easier. Differential Revision: D56659934 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125136 Approved by: https://github.com/jackiexu1992	2024-05-01 01:35:30 +00:00
drisspg	25691558d9	Change templated_attention -> flex_attention (#125251 ) # Summary Change all the names Pull Request resolved: https://github.com/pytorch/pytorch/pull/125251 Approved by: https://github.com/Chillee, https://github.com/yanboliang	2024-05-01 01:08:48 +00:00
Edward Z. Yang	a7023b89f8	Use torch._check for safety assert in _reshape_view_helper (#125187 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125187 Approved by: https://github.com/albanD	2024-05-01 00:40:31 +00:00
Wei Wang	1bcbc9158f	Add CUDA 12.4 workflows (#121684 ) Reference: https://github.com/pytorch/pytorch/pull/98492 Co-authored-by: Andrey Talman <atalman@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121684 Approved by: https://github.com/atalman	2024-04-30 23:03:24 +00:00
William Wen	d6c713884a	[dynamo, 3.12] xfail refleaking tests due to buggy getattr_static (#125062 ) For tracking https://github.com/pytorch/pytorch/issues/124302 so that we can re-enable the test once 3.12 updates with the bug fix for https://github.com/python/cpython/issues/118013. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125062 Approved by: https://github.com/anijain2305, https://github.com/jansel	2024-04-30 22:40:47 +00:00
Simon Fan	c12c85e919	Revert "[benchmark][cudagraph] Explicitly call aten.div with CUDA denominator for cudagraphs (#119729 )" (#125246 ) This reverts commit 62b5738a8bf325d79468b839b8412b87cb9951c1. https://github.com/pytorch/pytorch/pull/119729/ regresses cudagraph dashboard. Moving the one-time per iteration loss from CPU to CUDA is somehow causing a lot of copies: current (top) vs with revert (bottom) ![image](https://github.com/pytorch/pytorch/assets/9547562/62dfbf66-7edc-4a3c-ba7f-1ec057fba950) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125246 Approved by: https://github.com/eellison	2024-04-30 22:39:53 +00:00
Kazuaki Ishizaki	9fec26e231	Fix typo under torch/_inductor directory (#119658 ) This PR fixes typo in comments and msgs under `torch/_inductor` directory, and also changes the corresponding test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119658 Approved by: https://github.com/colesbury	2024-04-30 22:28:56 +00:00
PyTorch MergeBot	ca0f070065	Revert "Add registration API for torch.compile-eager (#121387 )" This reverts commit 61e937f3d6b904d6706594c1b3cfd7d0e56f9663. Reverted https://github.com/pytorch/pytorch/pull/121387 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/121387#issuecomment-2087541956))	2024-04-30 22:13:04 +00:00
Apurva	00dd4d55e3	Refactored _remove_auto_functionalization_from_graph_helper (#125180 ) Summary: Refactored the function to remove multiple list slicing and use unused variable. Test Plan: python test/run_test.py Reviewers: @drisspg Subscribers: Tasks: [T187526123](https://www.internalfb.com/intern/tasks/?t=187526123) [T93492332](https://www.internalfb.com/intern/tasks/?t=93492332) Tags: @pytorchbot merge -r viable/strict Pull Request resolved: https://github.com/pytorch/pytorch/pull/125180 Approved by: https://github.com/drisspg	2024-04-30 21:44:07 +00:00
PyTorch MergeBot	ea347fa6ce	Revert "Fix & optimze open device registration test. (#124712 )" This reverts commit f03cf9d4dc8ebe85552f450678988cac4e959da3. Reverted https://github.com/pytorch/pytorch/pull/124712 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/124712#issuecomment-2086971499))	2024-04-30 20:00:37 +00:00
Ke Wen	c1a3fcfa47	[pipelining] Add util and debug facilities (#124875 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124875 Approved by: https://github.com/H-Huang ghstack dependencies: #124776	2024-04-30 19:41:41 +00:00
PyTorch MergeBot	75fa54a9d1	Revert "Convert `ForeachFuncInfo` to `dataclass` (#125001 )" This reverts commit 9466335ae4cb049efd3f4c2b32b2115ba00694f3. Reverted https://github.com/pytorch/pytorch/pull/125001 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is breaking on ROCm `9466335ae4` ([comment](https://github.com/pytorch/pytorch/pull/125001#issuecomment-2086640674))	2024-04-30 19:05:53 +00:00
Xinya Zhang	56e4cbc69d	Fixes two build problems on ROCM 6.1 + Ubuntu 22.04 (#118216 ) Fixes two build problems on ROCM 6.1 + Ubuntu 22.04 ### Inconsistent value of CMAKE_PREFIX_PATH between `.ci/pytorch/build.sh` and Build Instructions Current `CMAKE_PREFIX_PATH` points to the base environment of the conda (commonly `/opt/conda`). However the conda environment used in the CI should be `/opt/conda/envs/py_<VERSION>`, which is supplied by `$CONDA_PREFIX`. This divergence may cause libstdc++ version conflicts because the base conda environment may ship a different libstdc++ than the `py_<VERSION>` environment and/or the system default environment. One notable issue is that on our internal CI system this script failed to build the AOTriton library on Ubuntu 22.04 due to libstdc++ version conflicts between the HIP compiler and the conda base environment. This PR fixes this and makes sure the CI script follows the official build instructions. ### Incorrect `tinfo` was linked on Ubuntu 22.04 due to flaws in parsing of `os-release` The code to parse /etc/os-release is incorrect and the distribution info was parsed as `PRETTY_Ubuntu` instead of `Ubuntu`. `libtinfo` will not be linked into the binary due to this flaw. Thus, cpp unit tests failed to build because of missing symbols from `libtinfo` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118216 Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet, https://github.com/atalman	2024-04-30 18:58:48 +00:00
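The commit above does not show the faulty parsing code, so the following is only a sketch of one way to read `os-release` content so that `NAME` and `PRETTY_NAME` cannot be confused. It is written in Python, not the shell or CMake code the PR touched, and the sample text is made up for illustration:

```python
def parse_os_release(text: str) -> dict[str, str]:
    """Parse KEY=value lines and match keys exactly, never by substring."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        fields[key] = value.strip().strip('"').strip("'")
    return fields


# Hypothetical sample in the style of /etc/os-release on Ubuntu 22.04.
sample = (
    'PRETTY_NAME="Ubuntu 22.04.4 LTS"\n'
    'NAME="Ubuntu"\n'
    'VERSION_ID="22.04"\n'
    "ID=ubuntu\n"
)

fields = parse_os_release(sample)
distro = fields["NAME"]
print(distro)
```

Because keys are compared whole, a lookup of `NAME` never picks up the `PRETTY_NAME` line, which is the kind of mix-up the `PRETTY_Ubuntu` result suggests.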
Jeff Daily	90258e8369	forward fix preferred blas backend and windows CI (#125080 ) PR #122106 broke windows tests. The feature should have been disabled for Windows but was not disabled correctly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125080 Approved by: https://github.com/clee2000	2024-04-30 18:38:31 +00:00
Wanchao Liang	04a241947a	[dtensor] delete the old unused mesh_alltoall (#124879 ) as titled, as we have a dedicated comm op, this is not needed anymore Pull Request resolved: https://github.com/pytorch/pytorch/pull/124879 Approved by: https://github.com/XilunWu, https://github.com/wz337 ghstack dependencies: #124871, #124872	2024-04-30 18:30:34 +00:00
Wanchao Liang	00df0d3e94	[dtensor] implement shard dim change with alltoall (#124872 ) as titled, we implement a dedicated communication op to allow efficient sharding dimension change using alltoall, to replace our previous allgather + local chunk Pull Request resolved: https://github.com/pytorch/pytorch/pull/124872 Approved by: https://github.com/XilunWu, https://github.com/yifuwang ghstack dependencies: #124871	2024-04-30 18:30:34 +00:00
Gagan Jain	02e7800b3f	[Torch][Timer] Skip expired timer logging for empty expired timers (#125039 ) Summary: same as title Test Plan: unit tests Differential Revision: D56636566 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125039 Approved by: https://github.com/kurman	2024-04-30 18:28:49 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	3946fa1c12	Fix bug in get_update_constraint (#125194 ) Summary: Title Test Plan: CI Differential Revision: D56726321 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125194 Approved by: https://github.com/pianpwk	2024-04-30 18:21:29 +00:00
James Wu	07958c538c	Setup initial testing harness and cache key generation for AOTAutograd Cache (#124642 ) This doesn't introduce any new behavior, but sets up a basic cache key generation mechanism that I can test. From here I will: - Add checks on the ops in an input FXGraph to make sure they are safe to cache. We'll be conservative in the first version here. - Add serialization for FX graphs - Save these FX graphs to disk in the cache - Support graphs with more complicated ops like higher order ops and specialized nn modules Pull Request resolved: https://github.com/pytorch/pytorch/pull/124642 Approved by: https://github.com/aorenste	2024-04-30 18:17:38 +00:00
andrewor14	8242fb62a7	[quant][pt2e] Fix conv-bn weight + bias per channel QAT (#125208 ) Summary: This commit fixes the pattern matching for conv-bn during QAT fusion where both weight and bias are quantized per channel. Previously this failed because weights and biases used the same example kwargs for their scales and zero points, causing these qparams to be tied during pattern matching. Test Plan: python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_qat_conv_bn_per_channel_weight_bias python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_qat_conv_bn_per_channel_weight_bias Reviewers: jerryzh168, angelayi Subscribers: jerryzh168, angelayi, supriyar Differential Revision: [D56740694](https://our.internmc.facebook.com/intern/diff/D56740694) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125208 Approved by: https://github.com/angelayi	2024-04-30 18:12:25 +00:00
Zejun Huang	05be0fb62d	[minimizer] Add exclusion function to minimizer base (#124504 ) Summary: Add exclusion list to minimizer: 1. some operations cannot be lowered when constructing subgraphs; this usually happens when they are isolated from operation group 2. exclude them in search strategies for automation Reviewed By: jimone1 Differential Revision: D56327289 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124504 Approved by: https://github.com/jfix71	2024-04-30 18:02:46 +00:00
Yanbo Liang	80046c315b	Add templated attention BLOCK_M & BLOCK_N default size for different head_dim (#125139 ) Run different head_dims [64, 128], which are the most popular ones across major GPT models. Enumerate different ```BLOCK_M``` and ```BLOCK_N``` candidates [16, 32, 64, 128], and get the best config as default one. ## Before ``` \| Type \| Speedup \| batch_size \| num_heads \| q_seq_len \| k_seq_len \| head_dim \| score_mod \| dtype \| \|---------\|-----------\|--------------\|-------------\|-------------\|-------------\|------------\|-------------\|----------------\| \| Average \| 0.704 \| \| \| \| \| \| \| \| \| Max \| 0.953 \| 1 \| 16 \| 512 \| 512 \| 64 \| noop \| torch.bfloat16 \| \| Min \| 0.482 \| 1 \| 16 \| 4096 \| 4096 \| 128 \| causal_mask \| torch.bfloat16 \| ``` ## After ``` \| Type \| Speedup \| batch_size \| num_heads \| q_seq_len \| k_seq_len \| head_dim \| score_mod \| dtype \| \|---------\|-----------\|--------------\|-------------\|-------------\|-------------\|------------\|-------------\|----------------\| \| Average \| 0.823 \| \| \| \| \| \| \| \| \| Max \| 0.926 \| 1 \| 16 \| 512 \| 512 \| 64 \| noop \| torch.bfloat16 \| \| Min \| 0.723 \| 1 \| 16 \| 512 \| 512 \| 128 \| causal_mask \| torch.bfloat16 \| ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125139 Approved by: https://github.com/Chillee	2024-04-30 17:40:44 +00:00
cyy	04c6424fbf	Remove caffe2 image and video (#125045 ) This PR tries to decompose https://github.com/pytorch/pytorch/pull/122527 into a smaller one. Caffe2 image and video folders are removed along with the related CMake code. Note that this was inspired by and co-developed with @r-barnes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125045 Approved by: https://github.com/eqy, https://github.com/albanD	2024-04-30 17:31:57 +00:00
Noam Siegel	a03b9a2189	fix: typo (#125226 ) Fixes spelling error: spacial is an incorrect spelling of spatial Pull Request resolved: https://github.com/pytorch/pytorch/pull/125226 Approved by: https://github.com/Skylion007	2024-04-30 16:57:39 +00:00
Sam Larsen	d699ade0cb	[dynamo] Refactor into torch/_inductor/runtime/compile_tasks.py (#124681 ) Differential Revision: [D56723769](https://our.internmc.facebook.com/intern/diff/D56723769) Co-authored-by: Sam Larsen <slarsen@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124681 Approved by: https://github.com/masnesral ghstack dependencies: #124592	2024-04-30 16:54:16 +00:00
Sam Larsen	254128c16e	[inductor] Remove usage of device_interface from _inductor.runtime (#124592 ) Differential Revision: [D56723770](https://our.internmc.facebook.com/intern/diff/D56723770) Co-authored-by: Sam Larsen <slarsen@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124592 Approved by: https://github.com/masnesral	2024-04-30 16:54:16 +00:00
Jithun Nair	5f4c6d9b49	Upgrade nightly wheels to rocm6.1 (#124811 ) Follow-up to https://github.com/pytorch/builder/pull/1789 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124811 Approved by: https://github.com/malfet	2024-04-30 16:30:19 +00:00
Masaki Kozuki	9466335ae4	Convert `ForeachFuncInfo` to `dataclass` (#125001 ) - `ForeachFuncInfo` to `dataclass` for smaller diff from `OpInfo` - `skips` to `decorators` and `skip` to `xfail` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125001 Approved by: https://github.com/janeyx99	2024-04-30 16:19:42 +00:00
ydwu4	ecc2e034f7	Fakify script object inputs and attributes for non-strict export (#124239 ) This PR fakifies ScriptObject inputs and attributes in export non-strict mode by default. The basic idea is to `only fakify the script object during tracing (i.e. aot_export)`. After we get the traced graph module, eagerly executing, serializing, or running more passes will use the real script objects. This essentially treats the script object as a constant tensor. Concretely, we 1. fakify all the script object inputs and module attributes (gathered by constant_attrs). 2. patch the module's attributes with the fakified script objects 3. right after aot_export, remove the patching (to avoid changing the original module), then set the exported graph module's attributes to the real script objects. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124239 Approved by: https://github.com/zou3519	2024-04-30 15:57:25 +00:00
Sunita Nadampalli	ab80a59677	CI: add opt-in aarch64 linux workflow (#121284 ) Triggered by `ciflow/linux-aarch64` and runs only `test_modules`, `test_mkldnn`, `test_mkldnn_fusion` and `test_openmp` as tests for now. TODOS: - Enable sccache for fast CI - Extend to a more reasonable test coverage Pull Request resolved: https://github.com/pytorch/pytorch/pull/121284 Approved by: https://github.com/atalman, https://github.com/malfet	2024-04-30 15:10:56 +00:00
Daohang Shi	b7d67e476d	upload pt2 cprofile stats to manifold (#125162 ) Summary: https://fb.workplace.com/groups/257735836456307/permalink/657458576484029/ upload cprofile to manifold D56696397 has a script to convert profiler stats to dot graphs (see its test plan) Test Plan: non-MAST `TORCH_COMPILE_CPROFILE=1 buck2 run mode/opt mode/inplace //pytorch/benchmark:run -- ads_mc_igctr_mc3_v0 -d cuda -t train --torchdynamo inductor --profile --profile-export-chrome-trace` https://www.internalfb.com/manifold/explorer/pyper_traces/tree/compilation_cprofile/test/20240428_234002_7562397568 MAST `buck2 run mode/opt aps_models/ads/icvr:icvr_launcher -- mode=mast_ctr_cvr_cmf_rep launcher.fbl_entitlement=ai_infra_training_rnd_tc features=ctr_cvr_conso_cmf_pipeline_features_455876776_3teach model=ctr_cvr_cmf_when_rep_config_msmn_3teach model_name=ctr_cvr_when model.when_arch.use_extended_residual_contexts=True optimizers.dense_default.lr_schedule.0.max_iters=20000 training.planner.storage_reservation_policy=FixedPercentage training.planner.storage_reservation_percentage=0.72 data_loader.dataset.batch_size=2048 trainer.garbage_collection.garbage_collection_interval=100 model.when_arch.layer_norm_init_weight=0.3 optimizers.dense_default.lr_schedule.0.value=0.001 model.when_arch.customized_mlp_init_scale=0.3 launcher.num_workers=128 launcher.max_retries=10 launcher.data_project=oncall_ads_model_platform launcher.hardware=ZIONEX_80G data_loader.dataset.table_ds="[2024-01-01]" launcher.job_name=test_inductor_logging` https://www.internalfb.com/manifold/explorer/pyper_traces/tree/compilation_cprofile/aps-test_inductor_logging-745febb51a Generating dotty files from D56696397 ``` Generating dot file from cprofile stats /home/daohang/aps-test_inductor_logging-745febb51a/0/0/_compile1.profile ... 
P1225733598: https://www.internalfb.com/intern/paste/P1225733598/ Dotty: https://www.internalfb.com/intern/graphviz/?paste=1225733598 Generating dot file from cprofile stats /home/daohang/aps-test_inductor_logging-745febb51a/0/0/_compile10.profile ... P1225733629: https://www.internalfb.com/intern/paste/P1225733629/ Dotty: https://www.internalfb.com/intern/graphviz/?paste=1225733629 Generating dot file from cprofile stats /home/daohang/aps-test_inductor_logging-745febb51a/0/0/_compile0.profile ... P1225733649: https://www.internalfb.com/intern/paste/P1225733649/ Dotty: https://www.internalfb.com/intern/graphviz/?paste=1225733649 ``` Differential Revision: D56679561 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125162 Approved by: https://github.com/anijain2305	2024-04-30 15:05:01 +00:00
Mikayla Gawarecki	2480e8b8a1	Add MAP_SHARED option for torch.load(mmap=True) (#124889 ) Fixes #124528 Going over the options for our MapAllocator and what they do, I don't think any other of them need to be piped up to `torch.load` `4f29103749/aten/src/ATen/MapAllocator.h (L8-L16)` ~However, I wonder if this `MmapVisibility(Enum)` is a good way to represent "or-ing" together of `mmap` flags if we want to extend it in the future. I looked over the flags for [`mmap(2)`](https://man7.org/linux/man-pages/man2/mmap.2.html), and could not immediately see how most of them would be useful for `torch.load` (would maybe `MAP_LOCKED` (like `mlock`) or `MAP_HUGE` ever be worthwhile?)~ Using the flags provided by the python `mmap` library so that we can extend the allowed flags and pipe them down to the cpp `mmap` call if there is a need for other flags in the future Pull Request resolved: https://github.com/pytorch/pytorch/pull/124889 Approved by: https://github.com/albanD	2024-04-30 15:02:19 +00:00
Guilherme Leobas	761a7b84ba	[Dynamo] Fix alias issue with respect to wrapped numbers (#124731 ) (#124774 ) This PR fixes an issue where calling `aten.alias(int)` raises a TypeError. ```python import torch import torch.autograd.forward_ad as fwAD def f(x): return 4312491 * x device = "cpu" with torch._subclasses.fake_tensor.FakeTensorMode(): with fwAD.dual_level(): x = torch.randn(3, device=device) y = torch.ones_like(x) dual = fwAD.make_dual(x, y) f(dual) ``` The test case above illustrates this bug. 1) `4312491` turns into a tensor that is a wrapped number 2) Forward mode AD calls `aten::alias` internally 3) The wrapped number (`4312491`) becomes a python integer 4) `aten.alias(int)` raises a `TypeError` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124774 Approved by: https://github.com/albanD, https://github.com/zou3519	2024-04-30 14:11:46 +00:00
Alex Morehead	9aed5dcfe6	Clarify wording in docstring for `CosineAnnealingWarmRestarts` within `lr_scheduler.py` (#125161 ) - Clarifies wording in the docstring for `CosineAnnealingWarmRestarts` within `lr_scheduler.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125161 Approved by: https://github.com/janeyx99	2024-04-30 14:01:22 +00:00
atalman	e3db465029	Re-enable nightly testing for linux and macos binaries (#123390 ) Related to: https://github.com/pytorch/pytorch/issues/123225 The skip tests logic lives here: https://github.com/pytorch/builder/blob/main/run_tests.sh#L19 Linux builds are using check_binary: https://github.com/pytorch/pytorch/actions/runs/8627625694/job/23649245546#step:16:339 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123390 Approved by: https://github.com/ZainRizvi	2024-04-30 12:53:40 +00:00
DanilBaibak	07d3af8e6a	Added ARC test jobs to all build jobs in the unstable bucket (#125142 ) Added ARC test jobs to all build jobs in the unstable bucket Pull Request resolved: https://github.com/pytorch/pytorch/pull/125142 Approved by: https://github.com/ZainRizvi, https://github.com/seemethere	2024-04-30 09:32:22 +00:00
Shunting Zhang	dc514df2af	[inductor] add triton code to SchedulerNode.debug_str (#125091 ) Here is an example print: https://gist.github.com/shunting314/75c161368a833a535bd0d240b8099d7e Pull Request resolved: https://github.com/pytorch/pytorch/pull/125091 Approved by: https://github.com/jansel ghstack dependencies: #125090	2024-04-30 08:27:53 +00:00
Shunting Zhang	a587a93f4c	[inductor][easy] add buffer layout to SchedulerNode.debug_str (#125090 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125090 Approved by: https://github.com/jansel	2024-04-30 08:27:53 +00:00
Yuanhao Ji	e0d2c24de1	Fix device type issue in `_get_device_handle` (#124390 ) Fix #124327 `device_type`, the first arg of [init_device_mesh()](`a0466061e1/torch/distributed/device_mesh.py (L503)`), does not support types with indexes, such as `cuda:0`. If `cuda:0` is used as a parameter, `_get_device_handle()` will not correctly return `torch.cuda`. So the exception should be thrown before creating the DeviceMesh object. > See https://github.com/pytorch/pytorch/issues/124327#issuecomment-2062551161 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124390 Approved by: https://github.com/wz337, https://github.com/wanchaol	2024-04-30 06:59:56 +00:00
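The early check described in the entry above can be pictured with a minimal sketch. The function name `validate_device_type` and the error wording are hypothetical and are not taken from the PR; the sketch only assumes that an indexed device string contains a colon.

```python
def validate_device_type(device_type: str) -> str:
    """Return device_type unchanged, or raise if it carries an index.

    Hypothetical helper: rejects strings such as "cuda:0" up front so that
    no mesh object is built from an unsupported device type.
    """
    if ":" in device_type:
        bare_type = device_type.split(":", 1)[0]
        raise RuntimeError(
            f"Device type with index is not supported, got {device_type!r}; "
            f"pass the bare type {bare_type!r} instead."
        )
    return device_type
```

Raising before any object is constructed keeps the failure close to the bad argument instead of surfacing later as a wrong device handle.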
Animesh Jain	5e5f890273	[dynamo][source] Remove inspect getattr_static from AttrSource (#125200 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125200 Approved by: https://github.com/jansel	2024-04-30 06:44:25 +00:00
Animesh Jain	8320b770fd	[dynamo] use lazy disable dynamo for manual seed (#125196 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125196 Approved by: https://github.com/fegin, https://github.com/yanboliang	2024-04-30 06:04:22 +00:00
Avik Chaudhuri	e7846447e0	dynamic shapes builder API (#124898 ) This PR introduces a new way of building `dynamic_shapes` for export. The idea is to build up a mapping from input tensors to the dynamic shapes that should be assigned to their corresponding fake tensors. This mapping is automatically converted to the current form of `dynamic_shapes`, which must exactly match the structure of inputs. We do this by using pytree utils. With the current `dynamic_shapes`, we had to be careful about user-defined classes that are registered with pytree, since such classes are not necessarily polymorphic containers; they may be fine containing tensors, but not dynamic shapes. Thus we had decided to allow input instances of such classes to be associated with dynamic shapes in flattened form. This decision needs to be mirrored in this PR as well. To make it easier to keep these code paths in sync, we refactor the current recursive procedure for associating inputs with dynamic shapes to use the same pytree utils. This needs minor fixes to a few tests where `dynamic_shapes` were not exactly matching the structure of inputs. Differential Revision: D56551992 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124898 Approved by: https://github.com/zhxchen17	2024-04-30 03:59:49 +00:00
hyperfraise	31801918e9	Add pooling support for 3d channels last (#116305 ) Part of a multi-PR work to improve #59168 Meant to complete Write native kernels for AvgPool3d Write native kernels for MaxPool3d Write native kernels for AdaptiveAvgPool3d Write native kernels for AdaptiveMaxPool3d Pull Request resolved: https://github.com/pytorch/pytorch/pull/116305 Approved by: https://github.com/ezyang	2024-04-30 03:51:49 +00:00
Pearu Peterson	16e8431963	Fix hybrid sparse COO tensor conversion to meta tensor (#125120 ) As in the title. Addresses a bug reported in https://github.com/pytorch/pytorch/pull/117907#issuecomment-2080035379 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125120 Approved by: https://github.com/ezyang, https://github.com/amjames	2024-04-30 03:43:42 +00:00
Hongtao Yu	74b7c56517	[Autotune] Use half the number of warps for reduction tuning on AMD. (#125084 ) I was seeing that, for a reduction kernel and a given block size on AMDGPU, the vectorization bandwidth (16-byte) for a thread was not fully leveraged, while it was not a problem for NVGPU. It appeared that each thread got less data to process because a whole row was processed by more threads, and the number of elements each thread got was not enough to saturate full vectorization. On AMDGPU, a warp has 64 lanes compared to 32 on the NV side. Therefore I'm tuning down the default number of warps (8 for NV) for AMD. I'm seeing a 10% speedup for an internal benchmark. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125084 Approved by: https://github.com/shunting314	2024-04-30 02:38:34 +00:00
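The arithmetic behind the entry above can be checked with a small worked example. The block size of 2048 elements and the 2-byte element type are assumptions chosen for illustration; only the warp sizes (32 and 64), the default of 8 warps, and the 16-byte vector width come from the commit message.

```python
VECTOR_BYTES = 16   # per-thread vectorized access width named in the message
ELEMENT_BYTES = 2   # assumption: fp16/bf16 elements
BLOCK_SIZE = 2048   # assumption: elements handled by one thread block


def bytes_per_thread(block_size: int, num_warps: int, warp_size: int) -> int:
    """Bytes each thread handles when a block is split evenly over all threads."""
    threads = num_warps * warp_size
    return (block_size // threads) * ELEMENT_BYTES


nv_default = bytes_per_thread(BLOCK_SIZE, num_warps=8, warp_size=32)   # 256 threads
amd_default = bytes_per_thread(BLOCK_SIZE, num_warps=8, warp_size=64)  # 512 threads
amd_halved = bytes_per_thread(BLOCK_SIZE, num_warps=4, warp_size=64)   # 256 threads

# Under these assumptions, only the 8-warp AMD case falls below the vector width.
print(nv_default, amd_default, amd_halved, VECTOR_BYTES)
```

With these numbers, halving the warp count on a 64-lane device restores the same per-thread workload as 8 warps on a 32-lane device.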
Andrew Gu	0969f01d73	[FSDP2] Accumulated in `reduce_dtype` if not syncing grads (#125191 ) For microbatching use cases (e.g. PP), we may use fp32 reduce-scatter (i.e. `MixedPrecisionPolicy(reduce_dtype=torch.float32)`), where we want to accumulate the unsharded gradients in fp32 across microbatches until reduce-scattering in fp32 upon the last microbatch. Note that the `unsharded_param` is in bf16, so we must save the fp32 accumulated gradient to an attribute different from `.grad`. Moreover, saving a new attribute on the `torch.Tensor` leads to some annoying type checking issues (where the attribute may not be defined), so this PR prefers to save the attribute on the `FSDPParam` class instead. One could argue that this behavior should be configurable, but since I think for large-scale training, everyone is leaning toward fp32 accumulation across microbatches, let us avoid adding another argument for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125191 Approved by: https://github.com/weifengpy ghstack dependencies: #125190	2024-04-30 02:19:13 +00:00
Andrew Gu	631d2b87f1	[FSDP2] Fixed fp32 param dtype/bf16 reduce dtype test (#125190 ) The unit test for fp32 `param_dtype` and bf16 `reduce_dtype` was disabled. This PR debugs the issue and identifies the root cause as numeric differences between NCCL bf16 all-reduce vs. bf16 reduce-scatter. We address this by having the baseline use reduce-scatter -> all-gather to implement all-reduce. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125190 Approved by: https://github.com/weifengpy, https://github.com/wanchaol	2024-04-30 02:15:33 +00:00
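The decomposition mentioned in the entry above (all-reduce expressed as reduce-scatter followed by all-gather) can be simulated in plain Python. This sketch models ranks as lists and uses exact integer sums, so it shows only the data movement; it does not reproduce the bf16 numeric differences the PR describes, and the function names are illustrative rather than any library API.

```python
def reduce_scatter(rank_inputs):
    """Rank r ends up with the element-wise sum of chunk r across all ranks."""
    world_size = len(rank_inputs)
    chunk = len(rank_inputs[0]) // world_size
    return [
        [sum(inp[r * chunk + i] for inp in rank_inputs) for i in range(chunk)]
        for r in range(world_size)
    ]


def all_gather(rank_shards):
    """Every rank receives the concatenation of all shards, in rank order."""
    full = [value for shard in rank_shards for value in shard]
    return [list(full) for _ in rank_shards]


def all_reduce_via_rs_ag(rank_inputs):
    """All-reduce built from the two collectives above."""
    return all_gather(reduce_scatter(rank_inputs))


# Two simulated ranks, each holding a 4-element gradient.
result = all_reduce_via_rs_ag([[1, 2, 3, 4], [10, 20, 30, 40]])
```

Building the baseline from the same two collectives means both sides of the comparison reduce values in the same order, which is the property the test fix relies on.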
Feng Yuan	2369ee49cc	Update torch-xpu-ops pin (ATen XPU implementation) (#125011 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125011 Approved by: https://github.com/EikanWang	2024-04-30 01:18:19 +00:00
PyTorch MergeBot	724c7491d0	Revert " [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987 )" This reverts commit b3fd94d15ef49c99ffa32a8226d1f00b0cc26f68. Reverted https://github.com/pytorch/pytorch/pull/124987 on behalf of https://github.com/ezyang due to broke downstream extensions ([comment](https://github.com/pytorch/pytorch/pull/124987#issuecomment-2083956511))	2024-04-30 00:37:53 +00:00
PyTorch MergeBot	e7631d6eae	Revert "CI: add aarch64 linux workflow (#121284 )" This reverts commit 32cf04cb7f7aa14aff4d1cf40517d5de797550e7. Reverted https://github.com/pytorch/pytorch/pull/121284 on behalf of https://github.com/malfet due to Test only changes has not been reverted ([comment](https://github.com/pytorch/pytorch/pull/121284#issuecomment-2083925890))	2024-04-30 00:24:11 +00:00
Nikita Shulga	744f341aa4	Fix ref leak in `dtype.to_complex()`/`to_real()` (#125154 ) By using `Py_NewRef` Also, wrap `THPDtype_to_real`/`THPDtype_to_complex` calls with `HANDLE_TH_ERRORS` Add regression test for the above issues, by calling to_complex for integral dtypes, which raises an exception, and by preserving the reference count across the same to_complex/to_real call to detect if a leak is happening. Replace ```cpp auto dtype = (PyObject*)torch::getTHPDtype(current_dtype); Py_INCREF(dtype); return dtype; ``` with a more compact/streamlined equivalent ```cpp return Py_NewRef(torch::getTHPDtype(current_dtype)); ``` Fixes https://github.com/pytorch/pytorch/issues/124868 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125154 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-04-29 23:59:27 +00:00
Catherine Lee	4d717cd7c3	[TD] Enable td on cpu windows (#125049 ) yolo Also * Ensure that at least 1 test always gets run (`//` does truncation which results in 0 if you have too few tests discovered) * Don't run test removal on slow tests - I'm not touching that yet I am avoiding everything other than pull + trunk workflows, so not doing this on windows CUDA, which runs on periodic Pull Request resolved: https://github.com/pytorch/pytorch/pull/125049 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi	2024-04-29 23:39:54 +00:00
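The truncation issue noted in the entry above is easy to reproduce. This is a minimal sketch, assuming the number of tests to keep is computed by floor division; the function name and the `divisor` parameter are hypothetical and not taken from the PR.

```python
def num_tests_to_run(num_discovered: int, divisor: int) -> int:
    """Number of tests to keep, never rounding a non-empty set down to zero.

    Hypothetical helper: plain `num_discovered // divisor` truncates to 0
    when few tests are discovered, so the result is clamped to at least 1
    (and to at most the number of tests that actually exist).
    """
    return min(num_discovered, max(1, num_discovered // divisor))


truncated = 3 // 4                 # floor division alone drops every test
clamped = num_tests_to_run(3, 4)   # the clamp keeps one test
```

The outer `min` only matters when nothing was discovered at all, in which case there is no test to run.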
eellison	8ee6105f84	Fix edge case in cudagraph pool detection (#124981 ) When we do cudagraph warmup, we record which outputs are in the cudagraph pool, so subsequently when we invoke a cudagraph and need to reclaim its memory we can free the prior run's outputs and make them error on access. In warmup, we detect this by ignoring outputs which are an alias of an input that is not a prior output. We did this by checking data pointer. In very rare situations, a data pointer of a non cudagraph input might get reallocated to a cudagraph pool and cause us to ignore it. This was happening with gpt-fast error with gemma 2 when coordinate_descent_tuning was set to False. This PR updates the check so that we detect aliasing with non-cudagraph inputs by looking at the storage pointer. Unrelated: saw very weird behavior where an output had the same data pointer as a supposedly live input but not the same cdata 🤔 I would think that is not possible. ``` out[0]._cdata in [ref()._cdata for ab in non_cudagraph_inps_storage_refs] # False out[0].data_ptr() in [ref().data_ptr() for ab in non_cudagraph_inps_storage_refs] # True ``` Differential Revision: [D56607721](https://our.internmc.facebook.com/intern/diff/D56607721) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124981 Approved by: https://github.com/ezyang	2024-04-29 23:37:34 +00:00
Wanchao Liang	e1e6ef753b	[dtensor] use str for reduce_op (#125172 ) This PR uses str for reduce_op directly instead of the c10d enum. Since our functional collective already uses str, there's no reason that we need the c10d enum anymore, as that requires a conversion. Also, the str hash + eq performance is actually significantly faster than the c10d type, so this would somewhat improve the CPU overhead too. Some local cpu benchmarks on `1000000` hash operations: ``` Hash performance for string type: 0.039897 seconds Hash performance for integer type: 0.304665 seconds ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125172 Approved by: https://github.com/awgu, https://github.com/XilunWu, https://github.com/tianyu-l	2024-04-29 23:30:24 +00:00
Randolf Scholz	ccaf03fd89	Fix: `nn.Parameter` return type identified as `Tensor` instead of `nn.Parameter` (#125106 ) Fixes #125105 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125106 Approved by: https://github.com/ezyang, https://github.com/albanD	2024-04-29 23:25:23 +00:00
Alexandre Ghelfi, PhD	26f8d96cab	Fix typo in `compile` docstring regarding default `cache_size_limit` (#125145 ) The docstring of `torch.compile` states that the default `torch._dynamo.config.cache_size_limit` is `64`, while the value is `8` in the corresponding py file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125145 Approved by: https://github.com/kit1980	2024-04-29 22:47:43 +00:00
drisspg	8c219251c5	Add backwards support to FlexAttention (#123902 ) # Summary This is part one of adding backwards support to FlexAttention. This PR focuses on the eager implementation and wiring up enough of the templated_attention_backward(name change soon 😉) to get through aot_eager. Notably this does not actually wire up the triton template just yet in order to make this PR easier to review. That will be the next follow up PR. #### Structure We pass both the forward and backward graph to the backwardsHOP since these are both needed to be inlined into the calculation for backwards: - the forward graph is needed in order to re-compute the scores - the joint graph is needed in order to construct the correct gradients post softmax_grad calc ### Attached AOT Graph https://gist.github.com/drisspg/ce4c041f8df8a5a7983c5174705cf2b5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123902 Approved by: https://github.com/Chillee	2024-04-29 22:34:22 +00:00
Andrew Ho	720e5f306d	Update CODEOWNERS - Dataloader (#125181 ) Fixes #124473 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125181 Approved by: https://github.com/gokulavasan, https://github.com/albanD	2024-04-29 21:37:18 +00:00
Catherine Lee	faee0e5ee8	[ez][CI] Move test_linalg and test_sparse_csr off CI_SERIAL_LIST (#125068 ) * https://github.com/pytorch/pytorch/pull/124649 for context Pull Request resolved: https://github.com/pytorch/pytorch/pull/125068 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi	2024-04-29 21:22:35 +00:00
Pian Pawakapan	946e202c07	[export] Restore user input names to unlifted graph modules (#124765 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/122842 Currently, calling ep.module() on an ExportedProgram leads to a GraphModule with a default forward signature (e.g. arg_0, arg_1, ...). This leads to original placeholder names disappearing for retracing/re-exporting. This is fixed by creating a forward_arg_names field (will take renaming suggestions for this) that stores the positional & keyword arg names that are used. These names aren't present in the call_spec currently stored, and require a major version bump for the ExportedProgram schema. Test Plan: Tests exist for export, but names are now changed from generic (e.g. arg_0, arg_1) to follow user inputs (e.g. x, y) Differential Revision: D56484994 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124765 Approved by: https://github.com/zhxchen17	2024-04-29 20:58:17 +00:00
PyTorch MergeBot	f1d1e3246f	Revert "[dtensor] implement shard dim change with alltoall (#124872 )" This reverts commit 6b79469d2437531fa506b48d42488be512a87f4d. Reverted https://github.com/pytorch/pytorch/pull/124872 on behalf of https://github.com/clee2000 due to broke distributed/tensor/parallel/test_tp_examples.py::DistTensorParallelExampleTest::test_transformer_training_is_seq_parallel_True https://github.com/pytorch/pytorch/actions/runs/8882762411/job/24389191482 `f7f018a0ed`. Bad TD ([comment](https://github.com/pytorch/pytorch/pull/124872#issuecomment-2083599445))	2024-04-29 20:26:16 +00:00
PyTorch MergeBot	3bd67dab32	Revert "[dtensor] delete the old unused mesh_alltoall (#124879 )" This reverts commit f7f018a0ed442f92eb5270150ced7b6117773368. Reverted https://github.com/pytorch/pytorch/pull/124879 on behalf of https://github.com/clee2000 due to broke distributed/tensor/parallel/test_tp_examples.py::DistTensorParallelExampleTest::test_transformer_training_is_seq_parallel_True https://github.com/pytorch/pytorch/actions/runs/8882762411/job/24389191482 `f7f018a0ed`. Bad TD ([comment](https://github.com/pytorch/pytorch/pull/124872#issuecomment-2083599445))	2024-04-29 20:26:15 +00:00
Aaron Orenstein	3d1dd79b80	make sure to stopTrace() on exception (#125131 ) If there's an exception during collection it can result in the profiler never being stopped properly. As a result all subsequent tests that use profiling will also fail - even if they pass in isolation. I'm hoping this fixes the flakiness in #124253, #124220, #82720, #119346, #119364, #119490, #119526, #119537 (and the currently closed #82864). Before: ``` (py312) $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/profiler/test_profiler.py ===================================================================================================================== FAILURES ===================================================================================================================== ============================================================================================================= short test summary info ============================================================================================================== FAILED test/profiler/test_profiler.py::TestExecutionTrace::test_execution_trace_with_kineto - AssertionError: Element counts were not equal: FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_conv2d_bias_followed_by_batchnorm2d_pattern - RuntimeError: Can't disable Kineto profiler when it's not running FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern - RuntimeError: Can't disable Kineto profiler when it's not running FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern_benchmark - AttributeError: 'NoneType' object has no attribute 'profiler' FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_fp32_matmul_pattern - AttributeError: 'NoneType' object has no attribute 'profiler' FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_matmul_dim_fp16_pattern - RuntimeError: Can't disable Kineto profiler when it's not running 
FAILED test/profiler/test_profiler.py::TestProfiler::test_kineto_multigpu - torch._dynamo.exc.InternalTorchDynamoError: 'NoneType' object has no attribute 'events' FAILED test/profiler/test_profiler.py::TestProfiler::test_oom_tracing - AssertionError: RuntimeError not raised FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_basic_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_close_in_scope_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_complex_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_multiple_preexisting_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_open_in_scope_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_optimizer_parameters_sgd - RuntimeError: Can't disable Kineto profiler when it's not running FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_refcounts - RuntimeError: Can't disable Kineto profiler when it's not running FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_sparse_tensors - RuntimeError: Can't disable Kineto profiler when it's not running ==================================================================================================== 16 failed, 26 passed, 53 skipped in 25.51s ==================================================================================================== ``` After: ``` (py312) $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/profiler/test_profiler.py 
===================================================================================================================== FAILURES ===================================================================================================================== ============================================================================================================= short test summary info ============================================================================================================== FAILED test/profiler/test_profiler.py::TestExecutionTrace::test_execution_trace_with_kineto - AssertionError: Element counts were not equal: FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern - RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/data/users/aorenste/pytorch/torch/csrc/autograd/profiler_python.cpp":969... FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern_benchmark - AttributeError: 'NoneType' object has no attribute 'profiler' FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_fp32_matmul_pattern - AttributeError: 'NoneType' object has no attribute 'profiler' FAILED test/profiler/test_profiler.py::TestProfiler::test_kineto_multigpu - torch._dynamo.exc.InternalTorchDynamoError: 'NoneType' object has no attribute 'events' FAILED test/profiler/test_profiler.py::TestProfiler::test_oom_tracing - AssertionError: RuntimeError not raised FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_optimizer_parameters_sgd - RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/data/users/aorenste/pytorch/torch/csrc/autograd/profiler_python.cpp":969, please... 
==================================================================================================== 7 failed, 35 passed, 53 skipped in 31.51s ===================================================================================================== ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125131 Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi	2024-04-29 19:07:37 +00:00
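The failure mode in the entry above (a profiler left running after an exception, which then breaks later tests) is the usual case for a `try`/`finally` guard. The sketch below illustrates that pattern only; `FakeProfiler`, `start_trace`, `stop_trace`, and `collect` are hypothetical stand-ins and do not reflect how the PR itself is implemented.

```python
class FakeProfiler:
    """Stand-in profiler that tracks whether a trace is active."""

    def __init__(self):
        self.running = False

    def start_trace(self):
        self.running = True

    def stop_trace(self):
        if not self.running:
            raise RuntimeError("Can't disable profiler when it's not running")
        self.running = False


def collect(profiler, work):
    """Run `work` under the profiler and always stop the trace afterwards."""
    profiler.start_trace()
    try:
        return work()
    finally:
        # Runs on success and on exception, so the profiler never leaks
        # into whatever uses it next.
        profiler.stop_trace()


def failing_work():
    raise ValueError("error during collection")
```

After `collect(profiler, failing_work)` raises, `profiler.running` is already `False`, so a later caller can start a fresh trace.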
cdzhan	a434d1487b	Fix EtcdServer leak in etcd_server_test.py file (#125121 ) As stated in the title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125121 Approved by: https://github.com/Skylion007	2024-04-29 18:59:05 +00:00
soulitzer	fab5bd5359	[checkpoint] Improve error message when use_reentrant=True is used with .grad() (#125155 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125155 Approved by: https://github.com/albanD	2024-04-29 18:57:35 +00:00
FFFrog	f03cf9d4dc	Fix & optimize open device registration test. (#124712 ) Fixes #100152 1. Fix the wrong tests about lazy init for PrivateUse1 named foo 2. Fix the wrong backend meta registry mechanism when compiling with clang++ (compiling with g++ works well; introduced by a static variable in an inline function) 3. Refactor the tests and make them more flexible 4. Disable the two tests temporarily - test_open_device_storage_pin_memory - test_compile_autograd_function_aliasing Pull Request resolved: https://github.com/pytorch/pytorch/pull/124712 Approved by: https://github.com/albanD, https://github.com/malfet	2024-04-29 18:55:38 +00:00
Sunita Nadampalli	32cf04cb7f	CI: add aarch64 linux workflow (#121284 ) aarch64 linux workflow is triggered for ciflow/aarch64 tags. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121284 Approved by: https://github.com/atalman, https://github.com/malfet	2024-04-29 18:25:40 +00:00
PyTorch MergeBot	ae13c7e593	Revert "[Meta Tensor] fix meta inplace set storage (#123880 )" This reverts commit cccae9355191a807040fb40a65178c4d7fe3f084. Reverted https://github.com/pytorch/pytorch/pull/123880 on behalf of https://github.com/izaitsevfb due to breaks cpu_inductor_torchbench (detectron2_fasterrcnn) ([comment](https://github.com/pytorch/pytorch/pull/123880#issuecomment-2083366385))	2024-04-29 18:19:42 +00:00
Bradley D	96cc73dc13	[oss][torch.package] fix multiple error messages within PackageExporter (#124943 ) Summary: fixes two issues: - when exporting with debug=True, the list of error-causing modules and a dependency path to them is not printed correctly, there's a missing newline after the path, meaning the name of the module for the next error is on the wrong line, which makes the output a confusing mess to read - when a pickled object references more than one mocked module directly, the error message incorrectly repeats the same information, claiming the referenced attribute is present in several different libraries, because the if condition references the last seen module name while walking the pickle ops, not the module name from the enclosing block `for module_name in all_dependencies:`. this is confusing because one error will print as O(all_dependencies) errors, all with different module names but the same attribute name Differential Revision: D56578035 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124943 Approved by: https://github.com/JonAmazon, https://github.com/houseroad	2024-04-29 18:11:28 +00:00
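The second issue in the entry above, a condition that ignores the enclosing loop variable, can be shown with a toy reconstruction. Everything here is an assumption made for illustration: the function name, the `(module, attribute)` pair representation of pickle references, and the `buggy` switch are not taken from `PackageExporter`.

```python
def mocked_reference_errors(all_dependencies, refs, *, buggy):
    """Return one "module.attribute" string per reported error.

    refs: (module, attribute) pairs in the order they are seen in the pickle.
    """
    errors = []
    for module_name in all_dependencies:
        for ref_module, attr in refs:
            if buggy:
                # Looks only at the module seen while walking the references
                # and ignores which dependency the outer loop is visiting.
                hit = ref_module in all_dependencies
            else:
                # Matches the reference against the enclosing loop variable.
                hit = ref_module == module_name
            if hit:
                errors.append(f"{module_name}.{attr}")
    return errors


deps = ["mocked_a", "mocked_b"]
refs = [("mocked_a", "foo"), ("mocked_b", "bar")]
buggy_errors = mocked_reference_errors(deps, refs, buggy=True)
fixed_errors = mocked_reference_errors(deps, refs, buggy=False)
```

In the buggy variant every attribute is reported once per dependency, which matches the repeated, misattributed messages the summary describes.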
Wanchao Liang	f7f018a0ed	[dtensor] delete the old unused mesh_alltoall (#124879 ) as titled, as we have a dedicated comm op, this is not needed anymore Pull Request resolved: https://github.com/pytorch/pytorch/pull/124879 Approved by: https://github.com/XilunWu, https://github.com/wz337 ghstack dependencies: #124871, #124872	2024-04-29 17:22:30 +00:00
Wanchao Liang	6b79469d24	[dtensor] implement shard dim change with alltoall (#124872 ) as titled, we implement a dedicated communication op to allow efficient sharding dimension change using alltoall, to replace our previous allgather + local chunk Pull Request resolved: https://github.com/pytorch/pytorch/pull/124872 Approved by: https://github.com/XilunWu, https://github.com/yifuwang ghstack dependencies: #124871	2024-04-29 17:22:30 +00:00
Wanchao Liang	8d46ab4104	[dtensor] move pad/unpad_tensor to separate utils (#124871 ) as titled, 1. pad/unpad is a general util not specific to the Shard placement, 2. for the purpose of the next PR, move these two out of the Shard placement itself, and add a pad_dim argument Pull Request resolved: https://github.com/pytorch/pytorch/pull/124871 Approved by: https://github.com/awgu, https://github.com/wz337, https://github.com/XilunWu	2024-04-29 17:22:25 +00:00
Andrew Gu	935a946241	[RFC][FSDP2] Renamed `FSDP` to `FSDPModule` (#124955 ) This PR renames the `FSDP` class to `FSDPModule`. This is a BC breaking change. The rationale is that `FSDPModule` is more descriptive since `fully_shard` is a module-level API (applied to a `module` arg), so the `FSDP` class will always correspond to a module. Also, users commonly import `FullyShardedDataParallel` as `FSDP`, so this can help avoid some name conflict in some cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124955 Approved by: https://github.com/wanchaol, https://github.com/wconstab ghstack dependencies: #124651, #124741, #124767, #124768, #124780, #124787	2024-04-29 16:33:18 +00:00
Manish Rajpal	da44d2f7fb	split out flop counting its own method (#125061 ) Summary: Modularizing code for reuse by splitting __torch_dispatch__ to move flop counting to its own method. Test Plan: unit tests Differential Revision: D56644523 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125061 Approved by: https://github.com/842974287	2024-04-29 14:13:44 +00:00
Edward Z. Yang	e5e623af4b	Codegen runtime asserts in Inductor (#124874 ) This completely subsumes https://github.com/pytorch/pytorch/pull/120816 This makes use of the unbacked binding machinery to teach Inductor how to generate deferred runtime asserts directly. There is some back story about why I did it this way, let me explain. Previously, our strategy for generating runtime asserts was that Dynamo would insert them into the FX graph after finishing tracing, and we would attempt to code generate them based on the FX graph. This is a good strategy for export, where we immediately export the graph. However, this strategy was afflicted by problems in eager, where we reuse the same ShapeEnv as before. In particular, on subsequent graph passes, we would immediately turn all of these assertions into noops, because when we evaluated their expressions, we would see that because we had a deferred runtime assert in the ShapeEnv, we know "oh, of course this expression is True" already. Oops! So, with this PR, we take the attitude that as long as the ShapeEnv sticks around, the ShapeEnv's list of deferred runtime asserts is the source of truth, and we don't put anything in the graph. So we just need to decide when to actually generate asserts, and the place I picked was Inductor lowering, since we already have an AssertScalar buffer concept, and so I just need to insert them at this point. AssertScalar also uses raw sympy.Expr rather than SymInt/Bool, so it is easier to prevent unrestricted simplification at this point. There are a few things jumbled together in this PR. I can split them if you want, but some of the changes are before I changed my strategy, but they're useful changes anyway. torch/_dynamo/output_graph.py and torch/_inductor/lowering.py - Here, we stop putting deferred runtime asserts in the graph. 
I also have to make sure we don't DCE unused symbol arguments; we're going to get some goofy graph arguments this way, will be good to restore that optimization eventually. We also just disable codegen for `_assert_scalar` entirely; we assume that ShapeEnv will be good enough to capture all of these. torch/_inductor/codegen/wrapper.py and torch/_inductor/ir.py - Add a way to codegen sizevars without forcing simplification torch/_inductor/graph.py - The main logic. Our strategy is to interpose in the same place we are testing that unbacked SymInts are properly showing up in lowered code. The logic is directly analogous to the logic in the existing insert deferred runtime asserts FX pass, but it's simpler because sympy expressions can be directly stored on inductor IR nodes. torch/fx/experimental/symbolic_shapes.py - For extra safety, we have a way of freezing runtime asserts, so that if you try to add more we error. This prevents us from adding runtime asserts after we've done lowering. There's a funny interaction with backwards which there's a comment for in graph.py torch/fx/passes/runtime_assert.py - This is not really needed in this PR, but I rewrote the runtime assert logic to use unbacked_bindings rather than inferring it by looking for unbacked SymInts. Now, keypaths are translated into FX node accessors. Unfortunately, I couldn't delete the old inference code, because you still need it to find backed SymInts from arguments (as this pass may be used on graphs which don't explicitly bind all their shape variables as arguments). There are some new tests exercising this. TODO: I think we need to generate asserts for replacements too. This is a preexisting problem that the old FX pass had too. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124874 Approved by: https://github.com/jansel ghstack dependencies: #124864	2024-04-29 10:19:29 +00:00
Edward Z. Yang	e498e28b2f	Remove API that allows for extra deferred runtime asserts during lowering (#124864 ) I want to generate runtime assert nodes during lowering, which means that I need a finalized list of asserts by the time I start lowering. This means this runtime assert introduced in https://github.com/pytorch/pytorch/pull/113839 must go. Fortunately, this runtime assert was never exercisable, apparently, and the test still "passes" without it. I replace it with a compile time test. We can revisit if this assert fails in practice. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124864 Approved by: https://github.com/jansel	2024-04-29 10:19:29 +00:00
Huamin Li	303880e16b	Update gen.py aoti_fm install dir (#125087 ) Summary: make it consistent with all the other install dir Test Plan: Sandcastle Differential Revision: D56660301 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125087 Approved by: https://github.com/frank-wei	2024-04-29 08:25:16 +00:00
cyy	5585138db9	Remove caffe2 contrib and experiments (#125038 ) This PR tries to decompose #122527 into a smaller one. Note that this was inspired by and co-developed with @r-barnes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125038 Approved by: https://github.com/malfet	2024-04-29 06:27:13 +00:00
CaoE	555f1aeb02	Fix module buffer mutation (#124586 ) Fixes #124583 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124586 Approved by: https://github.com/leslie-fang-intel, https://github.com/desertfire	2024-04-29 06:05:12 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	06b845dedc	Make metadata serialization more strict (#124411 ) Summary: When I was debugging an issue, this silent error made debugging harder. It is better to error earlier with a more descriptive error message. Test Plan: None Differential Revision: D56312433 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124411 Approved by: https://github.com/zhxchen17	2024-04-29 02:11:40 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	cc06c00a56	Don't run auto grad safe mode when predispatch is on (#125066 ) Summary: Title Test Plan: CI Differential Revision: D56646678 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125066 Approved by: https://github.com/zhxchen17	2024-04-29 01:53:23 +00:00
Aaron Gokaslan	e3b9b71684	[BE]: Ruff - TRY401 - Avoid verbose exception logging (#125126 ) Don't bother logging exception obj explicitly with logger, it's captured anyway and would generate verbose outputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125126 Approved by: https://github.com/ezyang	2024-04-28 21:44:33 +00:00
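The TRY401 rule described above rests on the fact that `logging.Logger.exception` already appends the active exception and its traceback to the record, so passing the exception object into the message is redundant. A minimal sketch follows, assuming a hypothetical logger name (`try401_demo`) and a hypothetical helper `parse_port`; neither comes from the PyTorch code base.

```python
import io
import logging

# Capture log output in memory so the effect is easy to inspect.
stream = io.StringIO()
logger = logging.getLogger("try401_demo")  # hypothetical logger name
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False  # keep the demo output out of the root logger


def parse_port(text):
    """Parse a port number, logging a failure without repeating the exception."""
    try:
        return int(text)
    except ValueError:
        # No `%s` for the exception object: logger.exception() records
        # the exception type, message, and traceback on its own.
        logger.exception("Could not parse port %r", text)
        return None


parse_port("eighty")
log_output = stream.getvalue()
```

Even though the message never mentions the exception, `log_output` contains the `ValueError` traceback, which is why interpolating the exception object only produces duplicated, more verbose output.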
Aaron Gokaslan	3e1fb96964	[BE]: RUF018 - ban assignment in assert (#125125 ) Ban assignment inside of assert. Python code should ideally not break with assertions disabled. Adds a ruff lint rule to enforce this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125125 Approved by: https://github.com/ezyang	2024-04-28 21:41:36 +00:00
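The reasoning behind RUF018 can be shown with a short sketch: when Python runs with assertions disabled (`python -O`), `assert` statements are removed entirely, so a name bound inside one is never bound. The function names below are hypothetical and only illustrate the pattern.

```python
def pop_checked_bad(stack):
    # Flagged pattern: the walrus assignment lives inside the assert.
    # Under `python -O` this statement is stripped, `item` is never bound,
    # and the return below would raise NameError.
    assert (item := stack.pop()) is not None
    return item


def pop_checked(stack):
    # Preferred pattern: do the work first, then assert on the result.
    # With assertions disabled the function still pops and returns the item.
    item = stack.pop()
    assert item is not None
    return item


result = pop_checked([1, 2, 3])
```

With assertions enabled both functions behave the same; only the second keeps working when they are disabled.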
Yuanhao Ji	a05b2ae302	Enable UFMT on `test/test_dataloader.py` (#124710 ) Part of: #123062 Ran lintrunner on: - test/test_custom_op_testing.py (already deleted) - test/test_dataloader.py Detail: ```bash $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124710 Approved by: https://github.com/soulitzer	2024-04-28 21:21:51 +00:00
hun	518ab48e85	Enable UFMT on test/test_functionalization.py (#123926 ) Part of #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123926 Approved by: https://github.com/ezyang, https://github.com/statelesshz	2024-04-28 17:02:34 +00:00
Yuanjing Shi	cccae93551	[Meta Tensor] fix meta inplace set storage (#123880 ) Fixes #123879 Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123880 Approved by: https://github.com/ezyang	2024-04-28 17:01:12 +00:00
Alana Xiang	6761b49551	Ensure autocast device_type is a string + Unit test (#125014 ) Reviving #124873 (already approved) to resolve CLA issues Fixes #124738 (Marked as draft until I get local unit tests to run) Edit: Tests passing Pull Request resolved: https://github.com/pytorch/pytorch/pull/125014 Approved by: https://github.com/mikaylagawarecki, https://github.com/soulitzer	2024-04-28 16:27:30 +00:00
Animesh Jain	1a0b247762	[dynamo] Bug fix for LOAD_GLOBAL and STORE_GLOBAL (#125002 ) Earlier globals of inlined functions from other files were not handled correctly. We were not tracking mutations on them. They were colliding with the same global name in the parent function etc. This PR overrides the LOAD/STORE_GLOBAL for inline tx and tracks mutation on them separately. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125002 Approved by: https://github.com/jansel ghstack dependencies: #125097, #125107	2024-04-28 15:24:17 +00:00
Animesh Jain	0f139b04b3	[dynamo] Fix test (#125107 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125107 Approved by: https://github.com/jansel ghstack dependencies: #125097	2024-04-28 15:24:17 +00:00
Aaron Gokaslan	49ca2b3429	[BE]: Apply RUF025 perf fixups (#125104 ) Uses `dict.fromkeys()` for more efficient dict construction. Automatically generated by RUF025 (prev). Pull Request resolved: https://github.com/pytorch/pytorch/pull/125104 Approved by: https://github.com/ezyang	2024-04-28 15:09:21 +00:00
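As a small illustration of the rewrite that this kind of lint fix applies, the sketch below compares a dict comprehension with a constant value against `dict.fromkeys()`. The key names are hypothetical; the caveat about mutable defaults is standard Python behavior and is included because it is the usual pitfall when applying the rewrite by hand.

```python
keys = ["weight", "bias", "running_mean"]  # hypothetical key names

# Pattern the lint rule flags: a comprehension whose value is a constant.
by_comprehension = {k: None for k in keys}

# Suggested replacement: builds the same mapping in a single call.
by_fromkeys = dict.fromkeys(keys)

# Caveat: the value passed to fromkeys is shared by every key,
# so a mutable default is NOT equivalent to `{k: [] for k in keys}`.
shared = dict.fromkeys(keys, [])
shared["weight"].append(1)
```

After the last line, every key in `shared` sees the appended element because all keys point at the same list object.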
Han, Xu	94b328ee45	Add likely/unlikely macros for compilers without C++20 support. (#124997 ) # Issue: The Intel validation team found that some older GCC versions that do not support C++20 hit the following error: ```cmd [2024-04-13T08:03:25.142Z] g++ /tmp/torchinductor_root/vd/cvdytwwwlhi63ofh3pwzqfpjga4w4xe7bjfdoavpblbo5khzf3b2.cpp -shared -fPIC -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -D_GLIBCXX_USE_CXX11_ABI=0 -I/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/include -I/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/include/TH -I/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/include/THC -I/root/anaconda3/envs/pytorch/include/python3.8 -L/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib -L/root/anaconda3/envs/pytorch/lib -L/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib -ltorch -ltorch_cpu -lgomp -ltorch_python -lc10 -mavx2 -mfma -DCPU_CAPABILITY_AVX2 -O3 -DNDEBUG -ffast-math -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -march=native -fopenmp -D C10_USING_CUSTOM_GENERATED_MACROS -o /tmp/torchinductor_root/vd/cvdytwwwlhi63ofh3pwzqfpjga4w4xe7bjfdoavpblbo5khzf3b2.so [2024-04-13T08:03:25.142Z] [2024-04-13T08:03:25.142Z] Output: [2024-04-13T08:03:25.142Z] /tmp/torchinductor_root/vd/cvdytwwwlhi63ofh3pwzqfpjga4w4xe7bjfdoavpblbo5khzf3b2.cpp: In function ‘T parse_arg(PyObject*, size_t) [with T = long int; PyObject = _object; size_t = long unsigned int]’: [2024-04-13T08:03:25.142Z] /tmp/torchinductor_root/vd/cvdytwwwlhi63ofh3pwzqfpjga4w4xe7bjfdoavpblbo5khzf3b2.cpp:117:10: error: expected identifier before ‘[’ token [2024-04-13T08:03:25.142Z] [[unlikely]] throw std::runtime_error("expected int arg"); [2024-04-13T08:03:25.142Z] ^ ``` The reason is that `unlikely` is a C++20 attribute; ref: https://en.cppreference.com/w/cpp/language/attributes/likely # Solution: Add a macro that falls back to a non-C++20 attribute on GNU compilers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124997 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-04-28 07:03:12 +00:00
leslie-fang-intel	42a192db0f	Fix Conv BN folding with deadcode (#124808 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/124286 The TorchBenchmark includes a method called `run_n_iterations` which runs the model multiple times. `43f4e71daa/benchmarks/dynamo/common.py (L2272-L2276)` https://github.com/pytorch/pytorch/pull/123399 enables tracing into a `UserDefinedObjectVariable` that's an instance method. It will trace the model into an FX graph multiple times within `run_n_iterations`. Then, in Inductor, `Conv-BN folding` at the module level will fuse the same Conv-BN module multiple times in this case, which leads to accuracy failures. This PR addresses the issue by ensuring that each Conv-BN module is fused only once. TestPlan ``` python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_folded_conv_bn_with_module_sharing python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_folded_conv_functional_bn_with_module_sharing python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_conv_bn_with_multi_bn_share_conv python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_conv_functional_bn_with_multi_bn_share_conv ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124808 Approved by: https://github.com/jansel, https://github.com/jgong5	2024-04-28 06:29:40 +00:00
Florian	c1e0dea023	Delete unused param 'OP' in KERNEL_PRIVATEUSEONE (#125008 ) Parameter 'OP' is unused but occupies a position, which makes the length of `__VA_ARGS__` smaller than expected. This change was missed in https://github.com/pytorch/pytorch/pull/124050. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125008 Approved by: https://github.com/FFFrog, https://github.com/leslie-fang-intel	2024-04-28 06:17:16 +00:00
Florian	5f7c4181b5	Correcting valid device name of privateuse1 (#125018 ) "privateuseone" is an invalid string for privateuse1 backend, the correct one should be returned from _get_privateuse1_backend_name(). Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/125018 Approved by: https://github.com/aaronenyeshi	2024-04-28 06:04:34 +00:00
haozhe.zhu	c5b1a4c269	[inductor] share more cse cache during swap buffer (#124921 ) `swap_buffer` prevents the `cse_cache` from being shared between the inside and the outside of the lambda function scope. For example, ``` auto tmp8 = -std::numeric_limits<float>::infinity(); auto tmp9 = [&] { auto tmp12 = -std::numeric_limits<float>::infinity(); return tmp12; } ``` `tmp12` should not be created since it is the same as `tmp8`. We make the `cse_cache` a read-only cache inside the scope (because it is unsafe to expose the cache built inside the scope, the outside scope cannot use it). Test Plan ``` python test/inductor/test_torchinductor.py -k test_AllenaiLongformerBase_repro_cpu ``` The `static_cast<int>(256)` will only occur once after this PR, since the inside scope can share the cse buffer of the outside scope. Before this PR, ``` cpp_fused_copy_full_like_0 = async_compile.cpp_pybinding(['const float', 'float'], ''' #include "/tmp/torchinductor_root/ub/cub6x5nmhqhp7xapkb3dlgjxef3t2bnkx7y7n4z2f4z5obnecxpy.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr1) { #pragma omp parallel num_threads(128) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(1L)) { #pragma GCC ivdep for(long x2=static_cast<long>(0L); x2<static_cast<long>(12L); x2+=static_cast<long>(1L)) { for(long x3=static_cast<long>(0L); x3<static_cast<long>(512L); x3+=static_cast<long>(16L)) { auto tmp0 = c10::convert<int>(x1); auto tmp1 = static_cast<int>(256); auto tmp2 = tmp0 < tmp1; auto tmp3 = [&] { auto tmp4 = c10::convert<int>(x3); auto tmp5 = at::vec::Vectorized<int>::arange(tmp4, 1); auto tmp6 = static_cast<int>(257); auto tmp7 = at::vec::Vectorized<int>(tmp6); auto tmp8 = at::vec::VecMask<int,1>(tmp5 < tmp7); auto tmp10 = at::vec::VecMask<float,1>::from(tmp2); auto tmp11 = tmp8 & tmp10; auto tmp9 = [&] { auto tmp12 = 
-std::numeric_limits<float>::infinity(); return tmp12; } ; auto tmp13 = [&] { if (tmp11.all_zero()) { return at::vec::Vectorized<float>(static_cast<float>(0.0)); } else { return decltype(at::vec::Vectorized<float>(tmp9()))::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), at::vec::Vectorized<float>(tmp9()), tmp11.template cast<float,1>()); } } () ; auto tmp14 = c10::convert<int>(c10::div_floor_integer(x1, 256L)); auto tmp15 = static_cast<int>(3); auto tmp16 = tmp14 < tmp15; auto tmp18 = tmp16 & tmp2; auto tmp17 = [&] { auto tmp19 = c10::convert<int>(x3); auto tmp20 = at::vec::Vectorized<int>::arange(tmp19, 1); auto tmp21 = static_cast<int>(256); auto tmp22 = at::vec::Vectorized<int>(tmp21); auto tmp23 = at::vec::VecMask<int,1>(tmp20 >= tmp22); auto tmp25 = at::vec::VecMask<float,1>::from(tmp18); auto tmp26 = tmp23 & tmp25; auto tmp24 = [&] { auto tmp27 = tmp26.template cast<float,1>().template loadu<float,1>(in_ptr0 + static_cast<long>((-256L) + x3 + (513L(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L(c10::div_floor_integer(x1, 256L))) + (787968Lx2) + (9455616Lx0))); return tmp27; } ; auto tmp28 = [&] { if (tmp26.all_zero()) { return at::vec::Vectorized<float>(static_cast<float>(0.0)); } else { return decltype(tmp24())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp24(), tmp26.template cast<float,1>()); } } () ; auto tmp29 = static_cast<float>(0.0); auto tmp30 = at::vec::Vectorized<float>(tmp29); auto tmp31 = decltype(tmp28)::blendv(tmp30, tmp28, tmp23.template cast<float,1>()); return tmp31; } ; auto tmp32 = tmp16 ? tmp17() : at::vec::Vectorized<float>(static_cast<float>(0.0)); auto tmp33 = static_cast<float>(0.0); auto tmp34 = at::vec::VecMask<float,1>::from(tmp16); auto tmp35 = at::vec::Vectorized<float>(tmp33); auto tmp36 = decltype(tmp32)::blendv(tmp35, tmp32, tmp34.template cast<float,1>()); auto tmp37 = decltype(tmp13)::blendv(tmp36, tmp13, tmp8.template cast<float,1>()); return tmp37; } ; auto tmp38 = tmp2 ? 
tmp3() : at::vec::Vectorized<float>(static_cast<float>(0.0)); auto tmp39 = c10::convert<int>(c10::div_floor_integer(x1, 256L)); auto tmp40 = static_cast<int>(3); auto tmp41 = tmp39 < tmp40; auto tmp42 = [&] { auto tmp43 = c10::convert<int>(x3); auto tmp44 = at::vec::Vectorized<int>::arange(tmp43, 1); auto tmp45 = static_cast<int>(256); auto tmp46 = at::vec::Vectorized<int>(tmp45); auto tmp47 = at::vec::VecMask<int,1>(tmp44 >= tmp46); auto tmp49 = at::vec::VecMask<float,1>::from(tmp41); auto tmp50 = tmp47 & tmp49; auto tmp48 = [&] { auto tmp51 = tmp50.template cast<float,1>().template loadu<float,1>(in_ptr0 + static_cast<long>((-256L) + x3 + (513L(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L(c10::div_floor_integer(x1, 256L))) + (787968Lx2) + (9455616Lx0))); return tmp51; } ; auto tmp52 = [&] { if (tmp50.all_zero()) { return at::vec::Vectorized<float>(static_cast<float>(0.0)); } else { return decltype(tmp48())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp48(), tmp50.template cast<float,1>()); } } () ; auto tmp53 = static_cast<float>(0.0); auto tmp54 = at::vec::Vectorized<float>(tmp53); auto tmp55 = decltype(tmp52)::blendv(tmp54, tmp52, tmp47.template cast<float,1>()); return tmp55; } ; auto tmp56 = tmp41 ? 
tmp42() : at::vec::Vectorized<float>(static_cast<float>(0.0)); auto tmp57 = static_cast<float>(0.0); auto tmp58 = at::vec::VecMask<float,1>::from(tmp41); auto tmp59 = at::vec::Vectorized<float>(tmp57); auto tmp60 = decltype(tmp56)::blendv(tmp59, tmp56, tmp58.template cast<float,1>()); auto tmp61 = at::vec::VecMask<float,1>::from(tmp2); auto tmp62 = decltype(tmp38)::blendv(tmp60, tmp38, tmp61.template cast<float,1>()); tmp62.store(out_ptr1 + static_cast<long>(x3 + (513Lx1) + (525312Lx2) + (6303744Lx0))); } #pragma omp simd simdlen(8) for(long x3=static_cast<long>(512L); x3<static_cast<long>(513L); x3+=static_cast<long>(1L)) { auto tmp0 = c10::convert<int64_t>(x1); auto tmp1 = static_cast<int64_t>(256); auto tmp2 = tmp0 < tmp1; auto tmp3 = [&] { auto tmp4 = c10::convert<int64_t>(x3); auto tmp5 = static_cast<int64_t>(257); auto tmp6 = tmp4 < tmp5; auto tmp7 = [&] { auto tmp8 = -std::numeric_limits<float>::infinity(); return tmp8; } ; auto tmp9 = tmp6 ? tmp7() : static_cast<decltype(tmp7())>(0.0); auto tmp10 = c10::convert<int64_t>(c10::div_floor_integer(x1, 256L)); auto tmp11 = static_cast<int64_t>(3); auto tmp12 = tmp10 < tmp11; auto tmp13 = [&] { auto tmp14 = c10::convert<int64_t>(x3); auto tmp15 = static_cast<int64_t>(256); auto tmp16 = tmp14 >= tmp15; auto tmp17 = [&] { auto tmp18 = in_ptr0[static_cast<long>((-256L) + x3 + (513L(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L(c10::div_floor_integer(x1, 256L))) + (787968Lx2) + (9455616Lx0))]; return tmp18; } ; auto tmp19 = tmp16 ? tmp17() : static_cast<decltype(tmp17())>(0.0); auto tmp20 = static_cast<float>(0.0); auto tmp21 = tmp16 ? tmp19 : tmp20; return tmp21; } ; auto tmp22 = tmp12 ? tmp13() : static_cast<decltype(tmp13())>(0.0); auto tmp23 = static_cast<float>(0.0); auto tmp24 = tmp12 ? tmp22 : tmp23; auto tmp25 = tmp6 ? tmp9 : tmp24; return tmp25; } ; auto tmp26 = tmp2 ? 
tmp3() : static_cast<decltype(tmp3())>(0.0); auto tmp27 = c10::convert<int64_t>(c10::div_floor_integer(x1, 256L)); auto tmp28 = static_cast<int64_t>(3); auto tmp29 = tmp27 < tmp28; auto tmp30 = [&] { auto tmp31 = c10::convert<int64_t>(x3); auto tmp32 = static_cast<int64_t>(256); auto tmp33 = tmp31 >= tmp32; auto tmp34 = [&] { auto tmp35 = in_ptr0[static_cast<long>((-256L) + x3 + (513L(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L(c10::div_floor_integer(x1, 256L))) + (787968Lx2) + (9455616Lx0))]; return tmp35; } ; auto tmp36 = tmp33 ? tmp34() : static_cast<decltype(tmp34())>(0.0); auto tmp37 = static_cast<float>(0.0); auto tmp38 = tmp33 ? tmp36 : tmp37; return tmp38; } ; auto tmp39 = tmp29 ? tmp30() : static_cast<decltype(tmp30())>(0.0); auto tmp40 = static_cast<float>(0.0); auto tmp41 = tmp29 ? tmp39 : tmp40; auto tmp42 = tmp2 ? tmp26 : tmp41; out_ptr1[static_cast<long>(x3 + (513Lx1) + (525312Lx2) + (6303744Lx0))] = tmp42; } } } } } } } ''') ``` After this PR, ``` cpp_fused_copy_full_like_0 = async_compile.cpp_pybinding(['const float', 'float'], ''' #include "/tmp/torchinductor_root/ub/cub6x5nmhqhp7xapkb3dlgjxef3t2bnkx7y7n4z2f4z5obnecxpy.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr1) { #pragma omp parallel num_threads(128) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(1L)) { #pragma GCC ivdep for(long x2=static_cast<long>(0L); x2<static_cast<long>(12L); x2+=static_cast<long>(1L)) { for(long x3=static_cast<long>(0L); x3<static_cast<long>(512L); x3+=static_cast<long>(16L)) { auto tmp0 = c10::convert<int>(x1); auto tmp1 = static_cast<int>(256); auto tmp2 = tmp0 < tmp1; auto tmp3 = [&] { auto tmp4 = c10::convert<int>(x3); auto tmp5 = at::vec::Vectorized<int>::arange(tmp4, 1); auto tmp6 = static_cast<int>(257); auto tmp7 = 
at::vec::Vectorized<int>(tmp6); auto tmp8 = at::vec::VecMask<int,1>(tmp5 < tmp7); auto tmp10 = at::vec::VecMask<float,1>::from(tmp2); auto tmp11 = tmp8 & tmp10; auto tmp9 = [&] { auto tmp12 = -std::numeric_limits<float>::infinity(); return tmp12; } ; auto tmp13 = [&] { if (tmp11.all_zero()) { return at::vec::Vectorized<float>(static_cast<float>(0.0)); } else { return decltype(at::vec::Vectorized<float>(tmp9()))::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), at::vec::Vectorized<float>(tmp9()), tmp11.template cast<float,1>()); } } () ; auto tmp14 = c10::convert<int>(c10::div_floor_integer(x1, 256L)); auto tmp15 = static_cast<int>(3); auto tmp16 = tmp14 < tmp15; auto tmp18 = tmp16 & tmp2; auto tmp17 = [&] { auto tmp19 = at::vec::Vectorized<int>(tmp1); auto tmp20 = at::vec::VecMask<int,1>(tmp5 >= tmp19); auto tmp22 = at::vec::VecMask<float,1>::from(tmp18); auto tmp23 = tmp20 & tmp22; auto tmp21 = [&] { auto tmp24 = tmp23.template cast<float,1>().template loadu<float,1>(in_ptr0 + static_cast<long>((-256L) + x3 + (513L(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L(c10::div_floor_integer(x1, 256L))) + (787968Lx2) + (9455616Lx0))); return tmp24; } ; auto tmp25 = [&] { if (tmp23.all_zero()) { return at::vec::Vectorized<float>(static_cast<float>(0.0)); } else { return decltype(tmp21())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp21(), tmp23.template cast<float,1>()); } } () ; auto tmp26 = static_cast<float>(0.0); auto tmp27 = at::vec::Vectorized<float>(tmp26); auto tmp28 = decltype(tmp25)::blendv(tmp27, tmp25, tmp20.template cast<float,1>()); return tmp28; } ; auto tmp29 = tmp16 ? 
tmp17() : at::vec::Vectorized<float>(static_cast<float>(0.0)); auto tmp30 = static_cast<float>(0.0); auto tmp31 = at::vec::VecMask<float,1>::from(tmp16); auto tmp32 = at::vec::Vectorized<float>(tmp30); auto tmp33 = decltype(tmp29)::blendv(tmp32, tmp29, tmp31.template cast<float,1>()); auto tmp34 = decltype(tmp13)::blendv(tmp33, tmp13, tmp8.template cast<float,1>()); return tmp34; } ; auto tmp35 = tmp2 ? tmp3() : at::vec::Vectorized<float>(static_cast<float>(0.0)); auto tmp36 = c10::convert<int>(c10::div_floor_integer(x1, 256L)); auto tmp37 = static_cast<int>(3); auto tmp38 = tmp36 < tmp37; auto tmp39 = [&] { auto tmp40 = c10::convert<int>(x3); auto tmp41 = at::vec::Vectorized<int>::arange(tmp40, 1); auto tmp42 = at::vec::Vectorized<int>(tmp1); auto tmp43 = at::vec::VecMask<int,1>(tmp41 >= tmp42); auto tmp45 = at::vec::VecMask<float,1>::from(tmp38); auto tmp46 = tmp43 & tmp45; auto tmp44 = [&] { auto tmp47 = tmp46.template cast<float,1>().template loadu<float,1>(in_ptr0 + static_cast<long>((-256L) + x3 + (513L(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L(c10::div_floor_integer(x1, 256L))) + (787968Lx2) + (9455616Lx0))); return tmp47; } ; auto tmp48 = [&] { if (tmp46.all_zero()) { return at::vec::Vectorized<float>(static_cast<float>(0.0)); } else { return decltype(tmp44())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp44(), tmp46.template cast<float,1>()); } } () ; auto tmp49 = static_cast<float>(0.0); auto tmp50 = at::vec::Vectorized<float>(tmp49); auto tmp51 = decltype(tmp48)::blendv(tmp50, tmp48, tmp43.template cast<float,1>()); return tmp51; } ; auto tmp52 = tmp38 ? 
tmp39() : at::vec::Vectorized<float>(static_cast<float>(0.0)); auto tmp53 = static_cast<float>(0.0); auto tmp54 = at::vec::VecMask<float,1>::from(tmp38); auto tmp55 = at::vec::Vectorized<float>(tmp53); auto tmp56 = decltype(tmp52)::blendv(tmp55, tmp52, tmp54.template cast<float,1>()); auto tmp57 = at::vec::VecMask<float,1>::from(tmp2); auto tmp58 = decltype(tmp35)::blendv(tmp56, tmp35, tmp57.template cast<float,1>()); tmp58.store(out_ptr1 + static_cast<long>(x3 + (513Lx1) + (525312Lx2) + (6303744Lx0))); } #pragma omp simd simdlen(8) for(long x3=static_cast<long>(512L); x3<static_cast<long>(513L); x3+=static_cast<long>(1L)) { auto tmp0 = c10::convert<int64_t>(x1); auto tmp1 = static_cast<int64_t>(256); auto tmp2 = tmp0 < tmp1; auto tmp3 = [&] { auto tmp4 = c10::convert<int64_t>(x3); auto tmp5 = static_cast<int64_t>(257); auto tmp6 = tmp4 < tmp5; auto tmp7 = [&] { auto tmp8 = -std::numeric_limits<float>::infinity(); return tmp8; } ; auto tmp9 = tmp6 ? tmp7() : static_cast<decltype(tmp7())>(0.0); auto tmp10 = c10::convert<int64_t>(c10::div_floor_integer(x1, 256L)); auto tmp11 = static_cast<int64_t>(3); auto tmp12 = tmp10 < tmp11; auto tmp13 = [&] { auto tmp14 = tmp4 >= tmp1; auto tmp15 = [&] { auto tmp16 = in_ptr0[static_cast<long>((-256L) + x3 + (513L(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L(c10::div_floor_integer(x1, 256L))) + (787968Lx2) + (9455616Lx0))]; return tmp16; } ; auto tmp17 = tmp14 ? tmp15() : static_cast<decltype(tmp15())>(0.0); auto tmp18 = static_cast<float>(0.0); auto tmp19 = tmp14 ? tmp17 : tmp18; return tmp19; } ; auto tmp20 = tmp12 ? tmp13() : static_cast<decltype(tmp13())>(0.0); auto tmp21 = static_cast<float>(0.0); auto tmp22 = tmp12 ? tmp20 : tmp21; auto tmp23 = tmp6 ? tmp9 : tmp22; return tmp23; } ; auto tmp24 = tmp2 ? 
tmp3() : static_cast<decltype(tmp3())>(0.0); auto tmp25 = c10::convert<int64_t>(c10::div_floor_integer(x1, 256L)); auto tmp26 = static_cast<int64_t>(3); auto tmp27 = tmp25 < tmp26; auto tmp28 = [&] { auto tmp29 = c10::convert<int64_t>(x3); auto tmp30 = tmp29 >= tmp1; auto tmp31 = [&] { auto tmp32 = in_ptr0[static_cast<long>((-256L) + x3 + (513L(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L(c10::div_floor_integer(x1, 256L))) + (787968Lx2) + (9455616Lx0))]; return tmp32; } ; auto tmp33 = tmp30 ? tmp31() : static_cast<decltype(tmp31())>(0.0); auto tmp34 = static_cast<float>(0.0); auto tmp35 = tmp30 ? tmp33 : tmp34; return tmp35; } ; auto tmp36 = tmp27 ? tmp28() : static_cast<decltype(tmp28())>(0.0); auto tmp37 = static_cast<float>(0.0); auto tmp38 = tmp27 ? tmp36 : tmp37; auto tmp39 = tmp2 ? tmp24 : tmp38; out_ptr1[static_cast<long>(x3 + (513Lx1) + (525312Lx2) + (6303744Lx0))] = tmp39; } } } } } } } ''') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124921 Approved by: https://github.com/jgong5, https://github.com/jansel ghstack dependencies: #124597	2024-04-28 04:33:25 +00:00
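The idea in the commit above (an inner lambda scope may read the outer scope's CSE entries, while expressions first computed inside it are not exposed to the outer scope) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Inductor's implementation: `ScopedCseCache`, `emit`, and the temporary-naming scheme are hypothetical.

```python
from itertools import count


class ScopedCseCache:
    """Toy CSE cache: a child scope reads its parent's entries but writes only locally."""

    def __init__(self, parent=None):
        self.parent = parent
        self.local = {}

    def lookup(self, expr):
        # Check this scope first, then fall back to enclosing scopes (read-only).
        if expr in self.local:
            return self.local[expr]
        if self.parent is not None:
            return self.parent.lookup(expr)
        return None

    def store(self, expr, var):
        # New entries never leak into the parent scope.
        self.local[expr] = var


_names = count()


def emit(cache, expr, lines):
    """Return the variable holding `expr`, emitting a definition only on a cache miss."""
    var = cache.lookup(expr)
    if var is None:
        var = f"tmp{next(_names)}"
        cache.store(expr, var)
        lines.append(f"auto {var} = {expr};")
    return var


lines = []
outer = ScopedCseCache()
a = emit(outer, "static_cast<int>(256)", lines)

inner = ScopedCseCache(parent=outer)             # models the lambda body
b = emit(inner, "static_cast<int>(256)", lines)  # reuses the outer temporary
c = emit(inner, "c10::convert<int>(x3)", lines)  # defined inside the scope only
```

Here the second request for `static_cast<int>(256)` emits no new line, mirroring how the constant appears only once in the generated kernel after the change, while the expression defined inside the inner scope stays invisible to `outer`.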
haozhe.zhu	57790fd088	[inductor] share cse cache during vectorized indirect load (#124597 ) Fix https://github.com/pytorch/pytorch/issues/123502 `swap_buffer` is not needed in vectorized indirect load; remove it to share the cse buffer. ``` auto tmp8 = [&] { __at_align__ std::array<int64_t, 16> tmpbuf; tmp7.store(tmpbuf.data()); return tmpbuf; } () ; // // other codes // // also store tmp7 here (redundant tmp16) auto tmp16 = [&] { __at_align__ std::array<int64_t, 16> tmpbuf; tmp7.store(tmpbuf.data()); return tmpbuf; } () ; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124597 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-04-28 01:02:48 +00:00
Yanbo Liang	7478b7f1ca	Add commonly used score_mod functions for templated attention (#124670 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/124670 Approved by: https://github.com/Chillee	2024-04-27 21:04:52 +00:00
Animesh Jain	df08140de2	[dynamo] Collect cell_and_freevars correctly (#125097 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125097 Approved by: https://github.com/Skylion007	2024-04-27 20:39:54 +00:00
Edward Z. Yang	7aa6bd7fa0	Refactor all top level usages of record_shapeenv_event to ShapeEnv class (#123735 ) This ensures that first argument to record_shapeenv_event is a ShapeEnv so we can appropriately short circuit when recording is not in progress. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123735 Approved by: https://github.com/ysiraichi, https://github.com/zou3519, https://github.com/albanD	2024-04-27 20:36:40 +00:00
Sergii Dymchenko	9ce58542ba	Ignore torch/distributed/_tensor/_collective_utils.py for TOR901 (#125082 ) Fixes https://github.com/pytorch/pytorch/issues/125050 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125082 Approved by: https://github.com/malfet, https://github.com/Skylion007	2024-04-27 20:14:02 +00:00
Wang, Eikan	b4a008209a	Expose tensor check from guard for reusing (#124836 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124836 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire	2024-04-27 18:35:35 +00:00
Sheng Fu	f0a5a0d298	OSS: Capture triton kernel in ET (#124775 ) This DIFF is to capture triton kernels in execution trace Pull Request resolved: https://github.com/pytorch/pytorch/pull/124775 Approved by: https://github.com/briancoutinho, https://github.com/aaronenyeshi	2024-04-27 18:01:18 +00:00
Milan Straka	8246f42864	Export torch.newaxis=None for Python Array API/Numpy consistency (#125026 ) Fixes #65307 For consistency with Python Array API (https://data-apis.org/array-api/latest/API_specification/constants.html) and NumPy (https://numpy.org/devdocs/reference/constants.html), I added `torch.newaxis = None`. Note that the consistency is directly mentioned also in the `__init__.py`, right above the added export. The `torch.newaxis` is also mentioned in #110636. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125026 Approved by: https://github.com/lezcano	2024-04-27 16:40:51 +00:00
Richard Barnes	9bf53b128c	[codemod] Remove unused variables in caffe2/aten/src/ATen/test/scalar_test.cpp (#125041 ) Summary: LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, e.g., in an `assert` statement that isn't present in production code. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Reviewed By: palmje Differential Revision: D56587751 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125041 Approved by: https://github.com/Skylion007	2024-04-27 15:53:16 +00:00
Richard Barnes	905318818d	[codemod] Fix missing field initializer in caffe2/torch/lib/libshm/core.cpp +2 (#125047 ) Summary: The LLVM warning `-Wmissing-field-initializers` has found one or more structs in this diff's files which were missing field initializers. This can be unintended such as: ``` my_struct s1 = {0}; // Initializes only the first field to zero; others to default values my_struct s2 = {}; // Initializes all fields to default values (often zero) ``` or it may be because only some of the members of a struct are initialized, perhaps because the items were added to the struct but not every instance of it was updated. To fix the problem, I've either used `{}` to initialize all fields to default or added appropriate default initializations to the missing fields. Test Plan: Sandcastle Reviewed By: palmje Differential Revision: D56614179 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125047 Approved by: https://github.com/Skylion007	2024-04-27 15:52:56 +00:00
Wang, Eikan	61e937f3d6	Add registration API for torch.compile-eager (#121387 ) This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545. In this PR, we intend to provide a registration API dedicated to eager-through-torch.compile. The major workflow of this API will be as follows. - Load cache - Check cache according to the input tensors - Cache Hit: Run the cached kernel directly - Cache Miss: Run the AOTI to produce kernel and run the produced kernel. If AOTI fails to produce the kernel, invoke the python fallback function. Currently, this PR always falls back to the Python kernel; the cache mechanism will be implemented in another PR - https://github.com/pytorch/pytorch/pull/116368 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121387 Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/zou3519, https://github.com/jgong5	2024-04-27 12:49:58 +00:00
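The cache-hit / cache-miss / fallback workflow listed in the commit above can be sketched in plain Python. This is only an illustration under stated assumptions, not the registration API itself: `make_cached_op`, `compile_kernel`, `python_fallback` and the type-based cache key are all hypothetical names and choices, and an in-memory dict stands in for the loaded cache.

```python
def make_cached_op(compile_kernel, python_fallback):
    """Return a callable following a hit / miss / fallback flow (hypothetical)."""
    cache = {}  # stand-in for the "load cache" step

    def run(*args):
        # Check the cache according to the inputs (here: just their types).
        key = tuple(type(a) for a in args)
        kernel = cache.get(key)
        if kernel is None:
            # Cache miss: try to produce a kernel.
            try:
                kernel = compile_kernel(*args)
            except Exception:
                # Kernel production failed: invoke the Python fallback.
                return python_fallback(*args)
            cache[key] = kernel
        # Cache hit, or freshly produced kernel: run it directly.
        return kernel(*args)

    return run


compile_calls = []


def toy_compile(x):
    # Hypothetical "compiler" that records each invocation.
    compile_calls.append(type(x))
    return lambda v: v * 2


doubled = make_cached_op(toy_compile, lambda v: v + v)
first = doubled(3)   # miss: compiles, then runs the kernel
second = doubled(4)  # hit: reuses the cached kernel
```

Keying on input types keeps the sketch short; a real cache would have to key on whatever input properties the produced kernel depends on.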
Kai Londenberg	620d808da0	[Pytorch 2] Forward fix for broken test (#125065 ) Summary: This is a forward Hotfix for T186742340. Some recent changes in Pytorch / Inductor ( D56458606) led to aten.addmm operators being inserted twice into the list of choices to select from during autotuning. This appears to have triggered a test failure in fbcode. This fix prevents the aten operators being added twice to the list of choices for autotuning. Test Plan: * Pytorch CI * CUDA_LAUNCH_BLOCKING=1 buck2 test 'fbcode//mode/opt' fbcode//accelerators/pytorch/lib/pt2_utils/tests:compile_pt2_test -- --exact 'accelerators/pytorch/lib/pt2_utils/tests:compile_pt2_test - test_compile_pt2 (accelerators.pytorch.lib.pt2_utils.tests.compile_pt2_test.TestCompilePT2)' Differential Revision: D56642879 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125065 Approved by: https://github.com/eellison	2024-04-27 10:27:44 +00:00
Yifu Wang	d4a1b3e093	Make c10d_functional ops call into _c10d_functional ops (#124979 ) This PR removes the legacy impls of c10d_functional ops which are now irrelevant. For backward compatibility purpose, c10d_functional ops now call into _c10d_functional ops. We also changed c10d_functional ops to be CompositeExplicitAutograd, so that when traced, only _c10d_functional ops appear in the graph. After this, we'll be able to remove the Inductor IR for the legacy functional collectives. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124979 Approved by: https://github.com/wanchaol	2024-04-27 08:08:02 +00:00
Yifu Wang	91a4740e72	Disable the CUDA fast path for split_with_sizes_copy when capturing (#125052 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125052 Approved by: https://github.com/awgu, https://github.com/eellison, https://github.com/eqy	2024-04-27 07:59:39 +00:00
cyy	b3fd94d15e	[Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987 ) This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following #124701. In addition, libfmt dependency is added in CMake code to enable using it in the headers. The libfmt has to be added as private dependency to torch_cuda and torch_hip because they include torch/csrc/distributed/c10d/Utils.hpp which uses libfmt. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124987 Approved by: https://github.com/malfet	2024-04-27 07:22:27 +00:00
Yanbo Liang	ce503c1b40	Dynamo x autograd.Function supports setup_context (#124802 ) Fixes part of #118397 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124802 Approved by: https://github.com/zou3519	2024-04-27 04:57:13 +00:00
eqy	a866bfff45	[cuDNN] cuDNN SDPA (Flash Attention) Backward (#122510 ) #113713 currently passing trivial smoke tests but I just totally pattern-matched bits and pieces of the autograd defs Will also collect benchmark data, CC @drisspg Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122510 Approved by: https://github.com/drisspg	2024-04-27 04:15:49 +00:00
Nikita Shulga	5944a53555	[MPS] Fix nextafter for negative values (#125029 ) By changing the logic to the following on older macOS: ```cpp bits += ((input > 0) ^ (input > other)) ? 1 : -1; ``` and using native `nextafter` on macOS Sonoma (i.e. if Metal 3.1 is available) TODO: - Add tests for infs and denorms Fixes https://github.com/pytorch/pytorch/issues/124985 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125029 Approved by: https://github.com/Skylion007	2024-04-27 02:58:05 +00:00
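The bit-increment rule quoted in the commit above can be illustrated on IEEE-754 doubles in plain Python. This is a sketch, not the Metal kernel: the helper name `nextafter_bits` is hypothetical, it assumes a finite, nonzero `x` with `x != other`, and it operates on 64-bit doubles so it can be compared against `math.nextafter` (Python 3.9+).

```python
import math
import struct


def nextafter_bits(x: float, other: float) -> float:
    """Step x one ULP toward other by adjusting its raw bit pattern.

    Hypothetical helper; assumes x is finite, nonzero, and x != other.
    """
    # Reinterpret the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    # Same rule as the commit: the magnitude grows when exactly one of
    # (x > 0) and (x > other) holds, and shrinks otherwise.
    bits += 1 if ((x > 0) ^ (x > other)) else -1
    (result,) = struct.unpack("<d", struct.pack("<Q", bits))
    return result


# Negative input stepping toward zero: the case named in the commit title.
assert nextafter_bits(-1.0, 0.0) == math.nextafter(-1.0, 0.0)
```

Because the sign bit is the top bit, adding 1 to the raw pattern always increases the magnitude, which is why the direction has to depend on the sign of the input as well as on the comparison with `other`.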
Xia, Weiwen	35b332882b	[Quant][PT2E] Enable linear-binary(-unary) post-op recipe for X86Inductor quantizer (#122387 ) As the title Test plan python test/test_quantization.py -k test_linear_binary Differential Revision: [D56288440](https://our.internmc.facebook.com/intern/diff/D56288440) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122387 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5 ghstack dependencies: #123240	2024-04-27 02:40:57 +00:00
Tristan Rice	dc4c75ba72	elastic/rendezvous: make barrier and rank assignment operations O(n) instead of O(n^2) (#124982 ) Summary: This makes barrier and rank operations linear instead of quadratic with the number of workers. This drastically improves performance for rendezvous when running with over 1000 hosts. This uses 2 approaches for different areas: * local rank assignment: each worker does 1 set and 1 get, local ranks are assigned on the rank 0 host in a O(n) operation which reduces total store operations to be linear with number of workers. * exit_barrier: use a counter and a final flag so each worker has to do max 1 set, 1 get and 1 add. At 4000 hosts we see torchelastic be able to run in as little as 10 seconds down from 373 seconds. Test Plan: This is testing using many small tests running on a remote cluster. {D56549942} ``` torchx run --scheduler mast -- --image=torchelastic_benchmark --j=4000x1 ``` Differential Revision: D56605193 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124982 Approved by: https://github.com/kiukchung, https://github.com/kurman	2024-04-27 02:21:44 +00:00
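The counter-plus-final-flag exit barrier described in the commit above can be sketched with a toy in-memory store. This is an assumption-laden illustration, not the torchelastic implementation: `InMemoryStore`, `exit_barrier`, `run` and the key names are hypothetical, and threads stand in for workers.

```python
import threading


class InMemoryStore:
    """Toy key-value store with set/get/add (hypothetical stand-in)."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()
        self.ops = 0  # total store operations, counted for illustration

    def add(self, key, amount):
        with self._cond:
            self.ops += 1
            self._data[key] = self._data.get(key, 0) + amount
            return self._data[key]

    def set(self, key, value):
        with self._cond:
            self.ops += 1
            self._data[key] = value
            self._cond.notify_all()

    def get(self, key):
        # Blocks until the key has been set.
        with self._cond:
            self.ops += 1
            self._cond.wait_for(lambda: key in self._data)
            return self._data[key]


def exit_barrier(store, world_size):
    # One add per worker; only the last arrival sets the final flag.
    if store.add("barrier/count", 1) == world_size:
        store.set("barrier/done", b"1")
    # One blocking get per worker.
    store.get("barrier/done")


def run(world_size):
    """Run the barrier with one thread per worker; return total store ops."""
    store = InMemoryStore()
    threads = [
        threading.Thread(target=exit_barrier, args=(store, world_size))
        for _ in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.ops
```

In this sketch the total is `2 * world_size + 1` store operations (one add and one get per worker, plus a single set), so it grows linearly with the number of workers.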
Simon Fan	1a6fef15ef	[compiled autograd] verbose logs for debugging cache misses (#124980 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124980 Approved by: https://github.com/jansel ghstack dependencies: #124954	2024-04-27 01:10:37 +00:00
Simon Fan	43a7ab2a21	[compiled autograd] introduce verbose logs, add autograd node info to graph (#124954 ) - sets it as a fake stack trace as we don't have a generic comment feature - when verbose is disabled, still adds a contextmanager and flag checks. the alternative is to use MACROS, but that wouldn't be usable with TORCH_LOGS Pull Request resolved: https://github.com/pytorch/pytorch/pull/124954 Approved by: https://github.com/jansel	2024-04-27 01:10:37 +00:00
Xia, Weiwen	e592a609fd	[Quant][ONEDNN] improve performance of qconv by reducing integration overhead (#123240 ) ## Description Framework overhead is found to be big for the onednn qconv op (used for quantization with PT2E X86Inductor backend). This PR reduces the integration overhead by modifying the implementation of qconv. ## performance results Running quantized Resnet50 on an Intel(R) Xeon(R) Platinum 8490H machine Before ``` Average latency: 8.378 ms. ------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ------------------------- ------------ ------------ ------------ ------------ ------------ ------------ onednn::qconv2d_pointwise 86.54% 6.954ms 87.42% 7.025ms 132.547us 53 ``` After ``` Average latency: 6.255 ms. ------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ------------------------- ------------ ------------ ------------ ------------ ------------ ------------ onednn::qconv2d_pointwise 85.05% 6.381ms 85.98% 6.451ms 121.717us 53 ``` Test script: ```python import torch import torchvision import time import copy import numpy as np from torch._export import capture_pre_autograd_graph from torch.ao.quantization.quantize_pt2e import ( prepare_pt2e, convert_pt2e, ) import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer torch._inductor.config.cpp.enable_kernel_profile=True torch._inductor.config.profiler_mark_wrapper_call = True torch._inductor.config.freezing = True torch._inductor.config.cpp_wrapper = True def bench_model(model, inputs): times =[] with torch.no_grad(): for _ in range(5): # warm-up output = model(inputs) for _ in range(20): start_time = time.time() output = model(inputs) end_time = time.time() 
times.append(end_time - start_time) print ('Average latency: %0.3f ms.' % (np.median(times) * 1000.0)) with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU]) as p: out_ipex = model(inputs) print(p.key_averages().table(sort_by="self_cpu_time_total", row_limit=-1)) def pt2e_ptq(m, example_inputs): m = m.eval() exported_model = capture_pre_autograd_graph(m, example_inputs) quantizer = X86InductorQuantizer() quantizer.set_global(xiq.get_default_x86_inductor_quantization_config()) prepared_model = prepare_pt2e(exported_model, quantizer) _ = prepared_model(example_inputs) converted_model = convert_pt2e(prepared_model) torch.ao.quantization.move_exported_model_to_eval(converted_model) with torch.no_grad(): optimized_model = torch.compile(converted_model) _ = optimized_model(example_inputs) _ = optimized_model(example_inputs) bench_model(optimized_model, example_inputs) return optimized_model if __name__ == "__main__": data = torch.randn(16, 3, 224, 224) model_fp = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT) pt2e_ptq(copy.deepcopy(model_fp), (data,)) ``` Differential Revision: [D56288440](https://our.internmc.facebook.com/intern/diff/D56288440) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123240 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168	2024-04-27 00:52:45 +00:00
Valentine233	368f5212fa	[cpu] [inductor] decompose bmm for memory bound in lowering (#124826 ) Fixes #124697. Resolve the issue of large regression of GPT-FAST MOE with `coordinate_descent_tuning` disabled. To get better perf for memory bound case, we decompose bmm in lowering. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124826 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-04-27 00:19:10 +00:00
Valentine233	ebb8905e0c	[cpu] add VecConvert between 8bits and 16bits (#124828 ) The perf benefit was found in https://github.com/pytorch/pytorch/issues/124697#issuecomment-2071658300. The PR adds intrinsic specializations between int8/uint8 and bf16/fp16. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124828 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-04-27 00:17:44 +00:00
Animesh Jain	fd24d8c05a	[dynamo][nn module] Use correct sources for _call_impl (#124970 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124970 Approved by: https://github.com/jansel ghstack dependencies: #124779, #124627	2024-04-26 23:18:30 +00:00
James Pang	43069c460e	Correct check for Boolean list input type (#124899 ) Summary: This diff fixes a bug in PyTorch where when creating a tensor from a List of booleans, PyTorch was throwing an error. This fix resolves that issue. All credit goes to swolchok for identifying the root cause of the issue and suggesting this fix. Test Plan: Running our model end to end works as expected and no error occurs. Differential Revision: D55990810 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124899 Approved by: https://github.com/zhxchen17	2024-04-26 22:25:43 +00:00
Xilun Wu	be2c09725a	[dtensor][experimental] local_map (#123676 ) Summary This PR is an attempt to land an experimental feature designed in #103686 . `local_map` is designed to allow users to apply to `DTensor` objects a function that was written to apply to `torch.Tensor`. As a function, `local_map` takes in 2 required arguments (`func` and `out_placements`) and 3 optional arguments (`device_mesh`, `in_placements`, `redistribute_inputs`). `func` is the function to be applied to each local shard of input `DTensor`. `out_placements` is the sharding specification of output `DTensor`. `local_map` returns a new function that does the following: 1. Infer `device_mesh` and `in_placements` from `DTensor` input if they're not provided. If `device_mesh` is provided, it must be identical to the device mesh of every `DTensor` input. If `in_placements` is provided, it serves as the required sharding specification of the corresponding `DTensor` input before feeding its local shard into `func`. If it differs from the `DTensor`'s sharding specification, an exception is raised when `redistribute_inputs=False`; otherwise the input is resharded to the required sharding. 2. Call `func` with the arguments passed in along with `device_mesh` except `DTensor`s. For `DTensor`, pass in its local shard. This `func` may include collectives. 3. For each output of `func` that has a valid (i.e. not `None`) sharding specification in `out_placements`, construct a new `DTensor` using the output and the specification. Use this `DTensor` as the output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123676 Approved by: https://github.com/wanchaol	2024-04-26 22:23:59 +00:00
Luca Wehrstedt	83e7b9d25f	[Inductor] Support fusion of chained reductions even if keepdims=True (#124843 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124843 Approved by: https://github.com/shunting314	2024-04-26 21:50:52 +00:00
Catherine Lee	a68a8c0f6b	Disable test_binary_op_list_error_cases in test_foreach (#125046 ) It's really flaky ex * https://github.com/pytorch/pytorch/issues/124636 * https://github.com/pytorch/pytorch/issues/124529 there are more Pull Request resolved: https://github.com/pytorch/pytorch/pull/125046 Approved by: https://github.com/huydhn	2024-04-26 21:25:38 +00:00
rzou	c6b7504d47	Fix torch.library.register_fake's module reporting (#125037 ) torch.library.register_fake reports the python module the fake impl is located in. This is used to check against `m.set_python_module("foo.bar")` calls in C++. The module reporting logic was wrong in most cases. This PR fixes it. Test Plan: - exhaustive tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/125037 Approved by: https://github.com/williamwen42	2024-04-26 20:53:33 +00:00
Kai Londenberg	cd06c73cbd	[Inductor Cutlass backend] Improved GEMM template (#124577 ) Improves the Cutlass backend GEMM template: * Adds code for creating stand-alone test runners for Cutlass GEMM kernels, which allows (manual) debugging of, for example, CUDA IMA errors or similar problems which occur in practice. Includes some utility code and tests to actually compile and run these standalone tests. * Cleans up the GEMM template code through various refactorings * Eliminates code sections and options that are unnecessary now that epilogue fusions are being removed. * Limits the scope of a workaround for (flaky) Cutlass issues with bias broadcasting to necessary cases. * Puts some CPU runtime checks into #if / #endif blocks, such that it's possible to compile CUTLASS kernels with lower CPU overhead. * Adds documentation comments Pull Request resolved: https://github.com/pytorch/pytorch/pull/124577 Approved by: https://github.com/jansel ghstack dependencies: #124576	2024-04-26 20:03:20 +00:00
Catherine Lee	4a6dfbe480	Add label to label config to auto apply labels based on other labels (#125042 ) * Implemented in https://github.com/pytorch/test-infra/pull/5127, * Tested in malfet/deleteme: https://github.com/malfet/deleteme/issues/85 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125042 Approved by: https://github.com/huydhn	2024-04-26 19:58:56 +00:00
Aaron Orenstein	4e2b4c6ed6	Fix broken docs (#124940 ) These were causing doctest to be unhappy. In particular, the doc from #124496 caused "trunk / win-vs2019-cpu-py3 / test" to fail when pushing #124771. Not sure why it wasn't a problem on the original PR. Testing: `./test/run_doctests.sh`: before: ``` === 4 warnings in 11.21 seconds === ``` after: ``` === in 11.11 seconds === ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124940 Approved by: https://github.com/zou3519, https://github.com/atalman, https://github.com/huydhn	2024-04-26 19:24:52 +00:00
Ashwin Hari	9266e472e2	rename ort to maia in dynamo's ort backend. (#124967 ) Fixes #124966 Co-authored-by: Thiago Crepaldi <thiagofc@microsoft.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124967 Approved by: https://github.com/thiagocrepaldi	2024-04-26 19:09:29 +00:00
Kurt Mohler	abcb42cdd2	Avoid COW materialize in various places (1) (#124984 ) Most, not all, of these cases were found automatically with `git grep -n '^\s\<const\>.\.=.*\<data_ptr\>'` Part of #97856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124984 Approved by: https://github.com/Skylion007	2024-04-26 19:06:28 +00:00
Daohang Shi	2ea1e84d40	log pt2 config dict to signpost from inductor post grad (#124593 ) Summary: Previous attempts did not ultimately work. D49720297 caused an online training SEV due to extra importing. D56299408 mitigated a tricky bug from the Distributed Shampoo constructor but unfortunately didn't correct the scuba logging either. See f552546983 Test Plan: {F1491621504} Differential Revision: D56378270 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124593 Approved by: https://github.com/anijain2305	2024-04-26 18:57:11 +00:00
YangQun1	91d565da0c	[dynamo] Add support for tensor's is_complex method (#124927 ) This PR is to add support for tensor's is_complex method in dynamo. Take the following code as an example: ```python def test_tensor_is_complex(x): if x.is_complex(): return x + 1 else: return x - 1 ``` Before this fix, the is_complex() call will cause a graph break "torch.* op returned non-Tensor bool call_method is_complex". After this fix, the graph break can be avoided. Fixes #122692 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124927 Approved by: https://github.com/ezyang	2024-04-26 18:28:14 +00:00
Catherine Lee	781ea00c90	[TD] Query Github API for base (#122214 ) A better query for the base commit of a PR. Some ghstack PRs are not connected to main so git merge-base doesn't work. Instead, use the Github API to query for the base of the PR, which should be more accurate Sanity checked on one of Ed's ghstack PRs Pull Request resolved: https://github.com/pytorch/pytorch/pull/122214 Approved by: https://github.com/seemethere	2024-04-26 18:21:24 +00:00
Huy Do	858fdd8c40	Remove cppwrapper option on inductor benchmark workflow (#124971 ) I'm restoring the `training` and `inference` options after github.com/pytorch/pytorch/pull/124795 and removing the lesser-known `cppwrapper` option instead, per @desertfire's suggestion. The total number of parameters remains at 10. Also, the default choices for training and inference are explicitly spelled out when dispatching the workflow manually, to catch developers' attention. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124971 Approved by: https://github.com/ezyang	2024-04-26 17:41:24 +00:00
chilli	392dc45597	Made FlexAttention rewrite getitem calls to use aten.index in score_mod (#124799 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124799 Approved by: https://github.com/drisspg ghstack dependencies: #124444	2024-04-26 17:22:13 +00:00
PyTorch MergeBot	b4d39a5de9	Revert "[TD] Query Github API for base (#122214 )" This reverts commit b003e0f29eeb4a810c47056400918924948b88c2. Reverted https://github.com/pytorch/pytorch/pull/122214 on behalf of https://github.com/clee2000 due to failing on main due to mistake ([comment](https://github.com/pytorch/pytorch/pull/122214#issuecomment-2079732105))	2024-04-26 16:42:51 +00:00
egienvalue	8461e7ed9e	Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614 ) Test the generic torch.Stream/Event with a fake device guard and hooks. Since we added a fake device backend, it is mutually exclusive with other backends. Tests will be skipped if TEST_CUDA or TEST_ROCM is true. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123614 Approved by: https://github.com/albanD ghstack dependencies: #123611, #123612	2024-04-26 16:17:54 +00:00
egienvalue	73744a2c00	torch.mtia module for MTIA device backend (#123612 ) MTIA device has its own Module in PyTorch now. torch.mtia has the following APIs, similar to other backends. The lazy_init is also supported. ``` __all__ = [ "init", "is_available", "synchronize", "device_count", "current_device", "current_stream", "default_stream", "set_stream", "stream", "device", ] ``` ------------ For device management, we expand AcceleratorHooksInterface to support generic device management; it can be used in both C++ and Python. ``` def _accelerator_hooks_device_count() -> _int: ... def _accelerator_hooks_set_current_device(device_index: _int) -> None: ... def _accelerator_hooks_get_current_device() -> _int : ... def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ... def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ... ``` --------- Adding get_device_module API to retrieve device modules for different device types. ``` def get_device_module(device: Optional[Union[torch.device, str]] = None) ``` --------- Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612 Approved by: https://github.com/albanD ghstack dependencies: #123611	2024-04-26 16:17:54 +00:00
xinan.lin	36af9c0d7d	[Aten] Fix XPU convolution_overrideable input memory format. (#124841 ) [Aten] Fix convolution_overrideable input memory format. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124841 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD	2024-04-26 15:55:01 +00:00
Aaron Orenstein	a8574a9719	Fix global flake8 issues (#124771 ) Prior to this `lintrunner --all-files --take FLAKE8` failed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771 Approved by: https://github.com/Skylion007 ghstack dependencies: #124428	2024-04-26 15:35:53 +00:00
Aaron Orenstein	609c958281	Fix mypy issues in fake_tensor.py (#124428 ) fake_tensor.py had mypy error ignored. That seems less than desirable. Also added SafePyObjectT<T> which is a tagged wrapper around a SafePyObject but provides static type checking (with no other guarantees). Used `SafePyObjectT<TorchDispatchModeKey>` on some of the TorchDispatchModeTLS API to ensure that we don't accidentally inject a different type than expected into the stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124428 Approved by: https://github.com/malfet	2024-04-26 15:35:53 +00:00
Shan19900305	8d12ba9acf	add methods for open device in PackedSequence module. (#124923 ) 1) add is_{custom_device_name}() and {custom_device_name}() for open device register; 2) fix open device failed testcases. @ezyang @bdhirsh Pull Request resolved: https://github.com/pytorch/pytorch/pull/124923 Approved by: https://github.com/ezyang	2024-04-26 15:26:20 +00:00
Catherine Lee	b003e0f29e	[TD] Query Github API for base (#122214 ) A better query for the base commit of a PR. Some ghstack PRs are not connected to main so git merge-base doesn't work. Instead, use the Github API to query for the base of the PR, which should be more accurate Sanity checked on one of Ed's ghstack PRs Pull Request resolved: https://github.com/pytorch/pytorch/pull/122214 Approved by: https://github.com/seemethere	2024-04-26 15:16:36 +00:00
PyTorch MergeBot	6b54f9d3e1	Revert "fix Invalid call to aoti_torch_tensor_copy_ #123039 (#124037 )" This reverts commit f9379ebbbf1369aad8179cac4a2eb7d72f25739e. Reverted https://github.com/pytorch/pytorch/pull/124037 on behalf of https://github.com/jeanschmidt due to introducing regressions in benchmark, see D56623194 for more details ([comment](https://github.com/pytorch/pytorch/pull/124037#issuecomment-2079574308))	2024-04-26 15:07:09 +00:00
DanilBaibak	6bef5e9f67	[CI] Add retry mechanism to check if the Docker daemon is running (#124728 ) What is done: * Skipped the 'Kill existing containers' step - ARC runners are always ephemeral. * Added a retry mechanism to check if the Docker daemon is running. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124728 Approved by: https://github.com/seemethere, https://github.com/zxiiro, https://github.com/ZainRizvi	2024-04-26 14:36:32 +00:00
Aaron Gokaslan	2f3b0befed	[BE]: Apply ruff FURB 118. (#124743 ) Replaces various lambdas with operator.itemgetter which is more efficient (as it's a builtin function). Particularly useful for when lambdas are used as 'key' functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124743 Approved by: https://github.com/albanD, https://github.com/malfet	2024-04-26 14:34:52 +00:00
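The kind of rewrite the commit above describes can be shown with a small, self-contained example. The data and variable names here are made up for illustration; only `operator.itemgetter` and `sorted` are real standard-library calls.

```python
import operator

# Hypothetical (name, count) pairs to sort by their second element.
pairs = [("conv", 3), ("relu", 1), ("bmm", 2)]

# Lambda used as a key function: the pattern being replaced.
by_count_lambda = sorted(pairs, key=lambda p: p[1])

# Equivalent key function built with operator.itemgetter.
by_count_getter = sorted(pairs, key=operator.itemgetter(1))
```

Both calls produce the same ordering; the second simply avoids defining a Python-level lambda for the lookup.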
Brian Hirsh	fc2aa23c1e	Test reland "AOTAutograd: gate view-replay behind config, not the def… (#124948 ) A parallel attempt at landing https://github.com/pytorch/pytorch/pull/124945, but attempting to land through fbcode first Pull Request resolved: https://github.com/pytorch/pytorch/pull/124948 Approved by: https://github.com/albanD	2024-04-26 13:16:26 +00:00
Prachi Gupta	fc13c1c850	[aot_inductor] Enable test_aot_inductor tests for ROCm (#123393 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/123393 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet	2024-04-26 13:15:35 +00:00
Stonepia	3d8585e501	[XPU] Add manual_seed and synchronize method (#124709 ) This PR sets the following device-specific settings for XPU (Intel GPU): 1. Set the manual seed for xpu 2. Set the synchronization method for xpu Pull Request resolved: https://github.com/pytorch/pytorch/pull/124709 Approved by: https://github.com/EikanWang, https://github.com/desertfire	2024-04-26 12:32:12 +00:00
Jerry Zhang	74afccdd80	[parametrization] fix `requires_grad` propagation (#124888 ) Summary: Previously the `requires_grad` is not propagated from original Tensor to decomposed tensors Test Plan: python test/test_parametrization.py -k test_register_parametrization_no_grad Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/124888 Approved by: https://github.com/lezcano	2024-04-26 10:19:31 +00:00
PyTorch MergeBot	d1b25596d5	Revert "Add common used score_mod functions for templated attention (#124670 )" This reverts commit ed120b08c4828c39f116cfe1fb39195c844be485. Reverted https://github.com/pytorch/pytorch/pull/124670 on behalf of https://github.com/jeanschmidt due to Breaking internal CI, more info can be found in D56571389 ([comment](https://github.com/pytorch/pytorch/pull/124670#issuecomment-2079084881))	2024-04-26 10:18:18 +00:00
lezcano	bba59b718b	Teach ShapeEnv that a <= b => a < b + 1 (#123436 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123436 Approved by: https://github.com/ezyang ghstack dependencies: #123342	2024-04-26 10:18:01 +00:00
lezcano	fa5ea29863	Apply guard knowledge to all simplifications (#123342 ) This was an oversight in a previous PR. We were just applying this knowledge when the expression had an unbacked int Pull Request resolved: https://github.com/pytorch/pytorch/pull/123342 Approved by: https://github.com/ezyang	2024-04-26 10:18:00 +00:00
PyTorch MergeBot	359ff49bf4	Revert "[dtensor] move pad/unpad_tensor to separate utils (#124871 )" This reverts commit 0b0eea222978e6b377e2c67f89902d5eb1aa7da3. Reverted https://github.com/pytorch/pytorch/pull/124871 on behalf of https://github.com/jeanschmidt due to Broke internal tests, see D56587991 for more details ([comment](https://github.com/pytorch/pytorch/pull/124871#issuecomment-2079001103))	2024-04-26 09:30:34 +00:00
PyTorch MergeBot	35a82d4a4a	Revert "Refresh OpOverloadPacket if a new OpOverload gets added (#124654 )" This reverts commit 872eeb0d7deebb58915289756d8c786f68630547. Reverted https://github.com/pytorch/pytorch/pull/124654 on behalf of https://github.com/jeanschmidt due to Broken lots of internal signals, check D56571345 for more details ([comment](https://github.com/pytorch/pytorch/pull/124654#issuecomment-2078940680))	2024-04-26 08:56:03 +00:00
PyTorch MergeBot	7324ddd80c	Revert "Delete erroneous print (#124972 )" This reverts commit 333f095d0779ecf0ce489ceecff35404abde8581. Reverted https://github.com/pytorch/pytorch/pull/124972 on behalf of https://github.com/jeanschmidt due to Need to revert #124654 but this PR depends on it :( ([comment](https://github.com/pytorch/pytorch/pull/124972#issuecomment-2078936303))	2024-04-26 08:52:27 +00:00
Yu, Guangye	19a83eacb5	add new API torch.amp.is_autocast_available (#124938 ) # Motivation expose `torch._is_autocast_available` to `torch.amp.is_autocast_available` as a public api. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124938 Approved by: https://github.com/albanD	2024-04-26 08:45:20 +00:00
PyTorch MergeBot	a46c27d961	Revert "Verify types in custom op schemas (#124520 )" This reverts commit 141888765bba129914448a9609ad5e182778cbdc. Reverted https://github.com/pytorch/pytorch/pull/124520 on behalf of https://github.com/jeanschmidt due to Breaking internal tests check D56588015 for more details ([comment](https://github.com/pytorch/pytorch/pull/124520#issuecomment-2078917978))	2024-04-26 08:42:11 +00:00
Nikita Shulga	9c7c81b897	[BE] Test everything against scipy-1.10.0 (#124983 ) Which is the oldest one that does not have a memory leak regression tracked in CVE-2023-25399 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124983 Approved by: https://github.com/kit1980	2024-04-26 07:03:34 +00:00
Shivam Raikundalia	63d4dc5a80	Remove TMP_LIBKINETO_NANOSECOND flag from Compilation (#124734 ) Summary: Now that we have reached nanosecond granularity, we can now remove the temporary guards that were previously required for nanosecond precision. Test Plan: Regression should cover this change Reviewed By: aaronenyeshi Differential Revision: D56444570 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124734 Approved by: https://github.com/aaronenyeshi	2024-04-26 06:57:03 +00:00
Iris Zhang (PyTorch)	4ad291d07f	[DeviceMesh] Removing mapping child_to_parent_mapping from `_MeshEnv` (#124890 ) Summary: The mapping is no longer needed after https://github.com/pytorch/pytorch/pull/124780, as we are not going to re-create the pgs during mesh slicing. Test Plan: CI Differential Revision: D56499001 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124890 Approved by: https://github.com/awgu	2024-04-26 06:40:36 +00:00
PyTorch MergeBot	f131c2c199	Revert "Fix mypy issues in fake_tensor.py (#124428 )" This reverts commit 25c0d3f3f0b19b7ca88bc92e9dc56e391d18e010. Reverted https://github.com/pytorch/pytorch/pull/124428 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))	2024-04-26 06:15:17 +00:00
PyTorch MergeBot	1ac60484c1	Revert "Fix global flake8 issues (#124771 )" This reverts commit f01275934bfa1ff358b1c01d3754f2807cd04ee2. Reverted https://github.com/pytorch/pytorch/pull/124771 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))	2024-04-26 06:15:17 +00:00
PyTorch MergeBot	e607dc8abb	Revert "Refactor all top level usages of record_shapeenv_event to ShapeEnv class (#123735 )" This reverts commit 87bec7db4e55f329e077eb7003af2f4817cd4210. Reverted https://github.com/pytorch/pytorch/pull/123735 on behalf of https://github.com/jeanschmidt due to Breaking internal signals, more info in D56587358 ([comment](https://github.com/pytorch/pytorch/pull/123735#issuecomment-2078695590))	2024-04-26 06:10:58 +00:00
Huy Do	e323c681ad	Update trymerge to honor the list of unstable failures from Dr.CI (#124965 ) After https://github.com/pytorch/test-infra/pull/5131, we want trymerge to honor the list of unstable failures from Dr.CI, because having the unstable keyword in the job name no longer covers all unstable jobs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124965 Approved by: https://github.com/clee2000	2024-04-26 05:10:50 +00:00
David Berard	b3cf36cb7c	Implement deepcopy / clone for SymNode, NestedIntSymNode (#121361 ) Motivation: There's a Meta-internal use case that deepcopies a bunch of metadata, which includes shapes. When we try to use NestedTensor with this tool, it errors out when we try to deepcopy the metadata, because SymNodes cannot be deepcopied. The change here is to add an implementation of `__deepcopy__`. Implementation: 1. `__deepcopy__` on SymNode calls clone() 2. Implement `clone()` in NestedIntSymNode, which previously didn't have this implemented Potential Issues: Right now, this works. But, regarding (2): Eventually we'll have some mapping between the NestedSymIntNode and its corresponding offsets/lengths tensor (cc @soulitzer who is working on this). How should this work with `__deepcopy__`? Should the offsets/lengths tensor also be cloned, or should the new symint reference the same offsets as the old symint? On one hand, we already have this issue with NestedIntSymNodeImpl::mul(): mul() creates a new NestedIntSymNodeImpl. On the other hand, `__deepcopy__` might imply different semantics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121361 Approved by: https://github.com/soulitzer	2024-04-26 04:18:29 +00:00
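Step 1 of the implementation in #121361 (`__deepcopy__` delegating to `clone()`) can be sketched in plain Python. The class below is a hypothetical stand-in, not the real SymNode; it only shows how the delegation makes `copy.deepcopy` work on a container of such objects:

```python
import copy


class NodeSketch:
    """Hypothetical stand-in for a node type that knows how to clone itself."""

    def __init__(self, value):
        self.value = value

    def clone(self):
        # Build an independent object carrying the same payload.
        return NodeSketch(self.value)

    def __deepcopy__(self, memo):
        # Delegate to clone(); copy.deepcopy records the result in memo itself.
        return self.clone()


metadata = {"shape": [NodeSketch(3), NodeSketch(5)]}
copied = copy.deepcopy(metadata)
print([n.value for n in copied["shape"]])
```

The open question raised in the commit message (whether associated offsets/lengths tensors should be cloned or shared) is not addressed by this sketch.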
Simon Fan	14430564ce	[cudagraphs] add cudagraph_skips counter (#124804 ) used in tests and benchmark csv Pull Request resolved: https://github.com/pytorch/pytorch/pull/124804 Approved by: https://github.com/eellison ghstack dependencies: #119729, #124700	2024-04-26 03:22:29 +00:00
Simon Fan	855939904b	[cudagraphs] add more info to skip messages (#124700 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124700 Approved by: https://github.com/eellison ghstack dependencies: #119729	2024-04-26 03:22:29 +00:00
Simon Fan	62b5738a8b	[benchmark][cudagraph] Explicitly call aten.div with CUDA denominator for cudagraphs (#119729 ) aten.div's output device will be its numerator's device. so it is acceptable to do cuda / cpu type divisions. post grad passes operate only on graphs and can't handle runtime graph inputs. so we change user code to move inputs to cuda for cudagraph. this affects any graph that has cpu tensors as graph inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119729 Approved by: https://github.com/eellison	2024-04-26 03:22:26 +00:00
briancoutinho	769b1e6cdc	[profiler] Split up profiler test file (#124856 ) To help with test timeouts, split the profiler test file into 4 files: profiler, record_function, execution_trace, and torch_tidy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124856 Approved by: https://github.com/shengfukevin, https://github.com/aaronenyeshi	2024-04-26 03:19:25 +00:00
Nikita Shulga	f9a611a3ce	Update Jinja to 3.1.3 (#124976 ) To fix CVE-2024-22195 Also, delete unused docs/cpp/requirements.txt and functorch/docs/requirements.txt Pull Request resolved: https://github.com/pytorch/pytorch/pull/124976 Approved by: https://github.com/kit1980	2024-04-26 02:57:55 +00:00
Iris Zhang (PyTorch)	43f4e71daa	Making _MeshEnv subclassing thread local (#124555 ) Because _mesh_resources is a global variable, when thread-based process group testing is used (aka spawn_threads_and_init_comms()), the last rank to register a given key overwrites the earlier ones. This isn't an issue in the regular process-based runtime, where each key is logically unique. Example failure: https://github.com/pytorch/pytorch/actions/runs/8779134353/job/24087295785 ``` RuntimeError: Could not resolve the process group registered under the name 8 or Throwing assert not none error ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124555 Approved by: https://github.com/xunnanxu, https://github.com/wanchaol	2024-04-26 02:45:42 +00:00
PyTorch MergeBot	e913f77c60	Revert "Made FlexAttention rewrite getitem calls to use aten.index in score_mod (#124799 )" This reverts commit 9bccafc31c9d489b727155e95633efd19adbceaa. Reverted https://github.com/pytorch/pytorch/pull/124799 on behalf of https://github.com/clee2000 due to broke tests but only on crossref https://github.com/pytorch/pytorch/actions/runs/8841521519/job/24279075171, added no td label so itll actually run this time ([comment](https://github.com/pytorch/pytorch/pull/124799#issuecomment-2078530797))	2024-04-26 02:35:14 +00:00
PyTorch MergeBot	b2f521f376	Revert "remove empty partition (#124920 )" This reverts commit 98835fff9fd498472b0e8f49a3a4670d86f3c5b7. Reverted https://github.com/pytorch/pytorch/pull/124920 on behalf of https://github.com/clee2000 due to I think Dr CI is wrong, the xla failure looks real `98835fff9f` https://github.com/pytorch/pytorch/actions/runs/8840540357/job/24278180954 ([comment](https://github.com/pytorch/pytorch/pull/124920#issuecomment-2078495051))	2024-04-26 02:03:01 +00:00
chilli	9bccafc31c	Made FlexAttention rewrite getitem calls to use aten.index in score_mod (#124799 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124799 Approved by: https://github.com/drisspg ghstack dependencies: #124444	2024-04-26 01:02:28 +00:00
chilli	7321005dd8	Add support for capturing tensors with score_mod (#124444 )
```
import torch
from torch import nn
import torch.nn.functional as F
import torch._inductor.config as config
# torch.set_default_device('cuda')
import torch
from torch.nn.attention._templated_attention import _templated_attention as templated_attention
from triton.testing import do_bench
from torch.nn.attention import SDPBackend, sdpa_kernel

index = torch.ops.aten
torch.manual_seed(0)

B = 16
H = 16
S = 2048
D = 64

head_scale = torch.randn(H, device='cuda')

def alibi(score, batch, head, token_q, token_kv):
    return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv)

bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda')
query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

compiled = torch.compile(templated_attention)
out = compiled(query, key, value, score_mod=alibi)
out2 = templated_attention(query, key, value,score_mod=alibi)
print((out - out2).abs().mean())
assert (out - out2).abs().mean() < 1e-3
print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value)))
print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias)))
print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi)))
```
<img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5"> Differential Revision: [D56583900](https://our.internmc.facebook.com/intern/diff/D56583900) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444 Approved by: https://github.com/jansel, https://github.com/drisspg	2024-04-26 01:02:28 +00:00
eellison	3a810bcf91	skip unsupported rocm test (#124968 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124968 Approved by: https://github.com/jithunnair-amd, https://github.com/davidberard98	2024-04-26 00:36:30 +00:00
Tuan Trieu	f9379ebbbf	fix Invalid call to aoti_torch_tensor_copy_ #123039 (#124037 ) fixes #123039 In abi mode, ExternKernelSchedulerNode generates code using `aoti_torch_tensor_copy_` which requires `AtenTensorHandle`, but the allocation generates ArrayRefTensor to allocate mem in stack. To fix this issue, this PR prevents ExternKernelSchedulerNode from using stack-mem-allocation in abi, and creates AtenTensorHandle instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124037 Approved by: https://github.com/desertfire	2024-04-26 00:16:16 +00:00
rzou	333f095d07	Delete erroneous print (#124972 ) I forgot to remove it before landing Pull Request resolved: https://github.com/pytorch/pytorch/pull/124972 Approved by: https://github.com/albanD	2024-04-26 00:07:54 +00:00
Edward Z. Yang	c4b6ed4609	guard_size_oblivious in unbind (#124959 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124959 Approved by: https://github.com/albanD	2024-04-25 23:45:14 +00:00
Xu Han	c715e76799	[inductor] optimize isa dry compile time. (#124602 ) Fixes #100378 The original issue was caused by the startup dry compile, which costs almost 1 second. This PR adds the compiler version, ISA build options, and PyTorch version to the hash of the test binary path, so runs with the same compiler, ISA, and PyTorch version can skip the dry compile. Local test: First time: <img width="1588" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/d0b83f5d-849e-4f37-9977-3b0276e5a5a5"> We need to compile all C++ modules, which costs 16.5s. Second time: <img width="1589" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/44f07fb0-5a15-4342-b0f6-dfe2c880b5d3"> We skip the dry compile because the ISA fingerprint is the same. It costs only 0.36s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124602 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-04-25 23:27:57 +00:00
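The caching idea in #124602 (fold the compiler version, ISA build options, and PyTorch version into a hash, and skip the dry compile when the hash was seen before) can be sketched as follows. All names and the example version strings are hypothetical; the actual Inductor code derives a binary path rather than using an in-memory set:

```python
import hashlib


def dry_compile_key(compiler_version: str, isa_flags: str, torch_version: str) -> str:
    """Hypothetical fingerprint: identical inputs always map to the same key."""
    payload = "\n".join([compiler_version, isa_flags, torch_version])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


_seen_keys = set()


def needs_dry_compile(key: str) -> bool:
    """Return True the first time a key is seen, False on every later call."""
    if key in _seen_keys:
        return False
    _seen_keys.add(key)
    return True


key = dry_compile_key("g++ 11.4.0", "-mavx2 -mfma", "2.4.0")
print(needs_dry_compile(key), needs_dry_compile(key))
```

Changing any one of the three inputs yields a different key, so a compiler upgrade or a different ISA triggers a fresh dry compile.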
Nikita Shulga	db3a2d751c	[MPS][BE] Error-check linear (#124952 ) Validate that all arguments are on MPS devices and dtypes are expected Fixes cryptic messages like ``` % python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32), torch.rand((32, 32), device='mps')))" RuntimeError: Placeholder storage has not been allocated on MPS device! ``` And hard crashes like ``` % python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32, device='mps'), torch.randint(-10, 10, (32, 32), dtype=torch.int8, device='mps')))" ``` Fixes https://github.com/pytorch/pytorch/issues/123995 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124952 Approved by: https://github.com/Skylion007	2024-04-25 23:25:20 +00:00
eqy	973d724e21	[CUDA] Fix 64-bit indexing in `vol2col` in conv3d (#124650 ) Similar to #118005, fixes sometimes silent IMAs that occur CC @atalman @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/124650 Approved by: https://github.com/soulitzer	2024-04-25 23:21:43 +00:00
PyTorch MergeBot	33fae4fcf4	Revert "Use recursive blob for package data (#119257 )" This reverts commit f20e3ae0c36146c962a5665018e9ad662a7cf211. Reverted https://github.com/pytorch/pytorch/pull/119257 on behalf of https://github.com/malfet due to This likely caused https://github.com/pytorch/pytorch/issues/124941, not sure why warning about recursive grep was ignored ([comment](https://github.com/pytorch/pytorch/pull/119257#issuecomment-2078312309))	2024-04-25 23:08:22 +00:00
Ma-Jian1	98835fff9f	remove empty partition (#124920 ) In some rare scenarios, the partitioner will produce an empty partition. it's a waste of time to compile an empty graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124920 Approved by: https://github.com/ezyang	2024-04-25 23:07:41 +00:00
angelayi	724f8dd8c5	[export] Serialize empty list based on argument type (#123748 ) Fixes https://github.com/pytorch/pytorch/issues/123480 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123748 Approved by: https://github.com/zhxchen17	2024-04-25 23:03:27 +00:00
Zhengxu Chen	7bb89bcaa4	[export] Fix state dict reparametrization in non-strict. (#124847 ) Summary: There are multiple things implemented incorrectly in non-strict mode for reparametrizing state dict: 1. The same fake tensor should be generated for duplicated weights. 2. We should snapshot state dict in the beginning to always hold the invariant that ep.state_dict == mod.state_dict() 3. We will overwrite real weights with fake weights if we don't restore the weights in LIFO ordering. 4. We don't turn on strict checking which could silently fail on corner cases. This diff aims to solve all these issues at once. Test Plan: CI Differential Revision: D56505020 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124847 Approved by: https://github.com/pianpwk	2024-04-25 22:44:16 +00:00
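Point 3 of #124847 (restore in LIFO order or a fake weight ends up overwriting the real one) is a general property of stacked patches. A minimal sketch with hypothetical names, using strings in place of tensors: the same attribute is patched twice, and only reverse-order restoration recovers the original value.

```python
class Holder:
    """Hypothetical module-like object with a single weight attribute."""

    def __init__(self):
        self.weight = "real"


def patch(obj, name, value, undo):
    # Remember whatever is currently there, then overwrite it.
    undo.append((obj, name, getattr(obj, name)))
    setattr(obj, name, value)


def restore(undo, lifo=True):
    # LIFO replays the saved values newest-first; FIFO is shown only for contrast.
    order = list(reversed(undo)) if lifo else list(undo)
    for obj, name, old in order:
        setattr(obj, name, old)
    undo.clear()


def demo(lifo: bool) -> str:
    m, undo = Holder(), []
    patch(m, "weight", "fake_1", undo)
    patch(m, "weight", "fake_2", undo)
    restore(undo, lifo=lifo)
    return m.weight


print(demo(lifo=True), demo(lifo=False))
```

In FIFO order the last value written back is the one saved by the second patch, which is the first fake value rather than the real weight.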
David Berard	4259e5d0e0	[inductor] Specialize on unguarded alignment of example inputs (#123319 ) When inductor generates triton code, the triton code can either assume that the inputs given to it are aligned or unaligned. If they are aligned, triton can use more efficient instructions (like vectorized loads or tensor cores). However, if we generate "aligned" code and pass in unaligned inputs, the triton code will error out; to fix this, we clone unaligned inputs that are passed to triton kernels that expect aligned inputs. This can lead to excessive clones if we have inputs that are not expected to be aligned. In this PR, we use the example input to decide whether the generated triton code should assume alignment or not. If the example input is aligned, then we will generate triton code that assumes alignment; if at runtime we receive an unaligned input, we'll make a clone. Meanwhile, if the example input is not aligned, the generated triton code will not assume inputs are aligned and we won't ever need to clone. Note that the alignment of the inputs is not guarded on; we found that adding guards on tensor offsets (a) was slow in cases where we do a lot of comparisons on tensor offsets, and (b) led to a lot of recompilations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123319 Approved by: https://github.com/eellison	2024-04-25 22:28:15 +00:00
Nikita Shulga	8db42e7688	[EZ][GHF] Rephrase cancelled message (#124947 ) To encourage people to reissue the command if merge timed out Pull Request resolved: https://github.com/pytorch/pytorch/pull/124947 Approved by: https://github.com/kit1980, https://github.com/clee2000	2024-04-25 22:24:08 +00:00
Arun Pa	00c5859aeb	[dynamo] Add support for DELETE_SUBSCR (#123526 ) Fixes #123317 Co-authored-by: Jason Ansel <jansel@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123526 Approved by: https://github.com/jansel	2024-04-25 22:07:24 +00:00
Lucas Furukawa Gadani	8c515a14fd	[caffe2] Add build configuration for linux-arm64 (#124618 ) Summary: This diff adds a new build configuration that works on linux-arm64. Test Plan: Before: ``` $ buck2 build @//arvr/mode/linux/jetson/opt :c10_ovrsource BUILD FAILED fbsource//xplat/caffe2/c10:c10_ovrsource is incompatible with cfg:linux-arm64-fbcode-platform010-aarch64-no-san#d47c4385e5d19fe0 (ovr_config//os:android unsatisfied), check the target's compatibility attributes ``` After: ``` $ buck2 build @//arvr/mode/linux/jetson/opt :c10_ovrsource BUILD SUCCEEDED ``` Differential Revision: D56088211 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124618 Approved by: https://github.com/izaitsevfb	2024-04-25 21:55:26 +00:00
angelayi	84fb96130f	[export] Fix check for optional tensor returns (#123739 ) Sorry for the delay! Addressing issue in https://www.internalfb.com/diff/D55455000?dst_version_fbid=1599488570890576&transaction_fbid=776042617791884 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123739 Approved by: https://github.com/zhxchen17	2024-04-25 20:51:26 +00:00
Jack Taylor	4b586a434f	[ROCm] Triton upstream AMD backend integration (#121801 ) Update ROCm-triton to use the AMD backend from https://github.com/openai/triton Note: `test__int_mm` can be enabled after https://github.com/pytorch/pytorch/pull/122431 is landed Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121801 Approved by: https://github.com/nmacchioni, https://github.com/malfet	2024-04-25 20:44:27 +00:00
Mu-Chu Lee	b8b04b26fb	Forward fix for D56289438 (#124882 ) Summary: D56289438 from OSS breaks test deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test - test_cpu_lower_merge_with_ibb_3 (deeplearning.aot_inductor.cpu.test.test_lowering_utils.CPULoweringTest) The issue is that we use partial for aten.cat that shouldn't be directly failed out with assertion Test Plan: ``` deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test - test_cpu_lower_merge_with_ibb_3 ``` Differential Revision: D56541352 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124882 Approved by: https://github.com/chenyang78	2024-04-25 18:42:09 +00:00
Yuanhao Ji	d5182bb75b	Enable UFMT on `test/test_cuda*.py` (#124352 ) Part of: #123062 Ran lintrunner on: - test/test_cuda.py - test/test_cuda_expandable_segments.py - test/test_cuda_multigpu.py - test/test_cuda_nvml_based_avail.py - test/test_cuda_primary_ctx.py - test/test_cuda_sanitizer.py - test/test_cuda_trace.py Detail: ```bash $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124352 Approved by: https://github.com/ezyang	2024-04-25 18:31:08 +00:00
Nikita Shulga	977dc5593a	[EZ] Get rid of utf-8 quotes (#124932 ) Replace `“important”` with `"important"` and `Taylor’s` with `Taylor's` Fixes the obvious symptoms of https://github.com/pytorch/pytorch/issues/124897 Test plan: Download [wheel](https://github.com/pytorch/pytorch/actions/runs/8833051644/artifacts/1447995459) and check that generated VF.pyi does not have any unicode characters by running following command: ``` % python3 -c "x=open('_VF.pyi', encoding='utf-8').read();uc=[(i, x[i]) for i in range(len(x)) if ord(x[i])>127];print(uc);assert(len(uc)==0)" ```	2024-04-25 11:22:20 -07:00
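The one-line check in the test plan of #124932 can be written more readably as a small helper. This is only a restatement of that command's logic; the function name is hypothetical:

```python
def non_ascii_chars(text: str) -> list:
    """Return (index, character) pairs for every code point above 127."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


print(non_ascii_chars("Taylor\u2019s"))  # curly apostrophe, as in the commit message
print(non_ascii_chars("Taylor's"))       # plain ASCII apostrophe
```

To reproduce the original check, pass the contents of the generated stub file (read with `encoding='utf-8'`) and assert that the result is empty.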
Bin Bao	751d9a319d	[AOTI] Add a unit test (#124486 ) Summary: The case from https://github.com/pytorch/pytorch/issues/123745 already seems to be fixed in the nightly, but the test is still worth adding. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124486 Approved by: https://github.com/chenyang78	2024-04-25 18:05:10 +00:00
cyy	a8aed4ce3f	Fix MPI_Group initialization errors (#124824 ) Fixes MPI_Group initialization errors introduced in #124156, since MPI_Group is not a pointer in some MPI implementations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124824 Approved by: https://github.com/ezyang	2024-04-25 17:27:30 +00:00
Edward Z. Yang	29b22fbef9	Typo fix: s/nonzero/unique/ (#124935 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124935 Approved by: https://github.com/albanD, https://github.com/Skylion007	2024-04-25 17:22:50 +00:00
Pian Pawakapan	93a319a4fc	[export] kill _process_constraints() (#123985 ) The process for populating range_constraints follows separate methods for non-strict (`make_constraints`), and strict (`_process_constraints`). The strict method is somewhat more convoluted, and the analysis that Dynamo performs for strict is already present as part of the non-strict process in make_constraints (produce_guards(), running the export constraint solver). This PR kills _process_constraints() and replaces calls with make_constraints, without duplicating the work that Dynamo already does. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123985 Approved by: https://github.com/avikchaudhuri	2024-04-25 16:58:57 +00:00
Kai Londenberg	9aeeb8e925	[Inductor Cutlass backend] Improve GEMM op filtering (#124576 ) Add configurable allowlist / denylist regular expressions to make it possible to exclude certain CUTLASS GEMM implementations ( for example "pingpong" Kernels due to undesired numerical behavior ). Remove usage of old 2.x Cutlass Kernels entirely. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124576 Approved by: https://github.com/jansel, https://github.com/eellison	2024-04-25 16:33:54 +00:00
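The allowlist / denylist filtering described in #124576 can be sketched with the standard `re` module. The function, its parameters, and the op names below are hypothetical and are not the actual Inductor config keys or CUTLASS kernel names; the sketch only shows the filtering rule (keep a name if it matches the allow pattern and does not match the deny pattern):

```python
import re


def filter_ops(names, allow_regex="", deny_regex=""):
    """Keep names matching allow_regex (if set) and not matching deny_regex (if set)."""
    allow = re.compile(allow_regex) if allow_regex else None
    deny = re.compile(deny_regex) if deny_regex else None
    kept = []
    for name in names:
        if allow is not None and not allow.search(name):
            continue
        if deny is not None and deny.search(name):
            continue
        kept.append(name)
    return kept


ops = ["gemm_f16_pingpong", "gemm_f16_cooperative", "conv_f16_cooperative"]
print(filter_ops(ops, allow_regex=r"^gemm_", deny_regex=r"pingpong"))
```

An empty pattern disables the corresponding filter, so the default call returns the list unchanged.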
PyTorch MergeBot	e04c7b19f4	Revert "torch.mtia module for MTIA device backend (#123612 )" This reverts commit 381653de63df4b1b31cc95531320caf83b1b60b3. Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to this PR broke ROCm with message RuntimeError: Cannot have MTIA with other devices ([comment](https://github.com/pytorch/pytorch/pull/123612#issuecomment-2077649762))	2024-04-25 16:06:46 +00:00
PyTorch MergeBot	4a1299cc0e	Revert "Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614 )" This reverts commit 355dc34f865036c4c625fcdafe54db846b2be2c2. Reverted https://github.com/pytorch/pytorch/pull/123614 on behalf of https://github.com/jeffdaily due to this PR broke ROCm with message RuntimeError: Cannot have MTIA with other devices ([comment](https://github.com/pytorch/pytorch/pull/123612#issuecomment-2077649762))	2024-04-25 16:06:46 +00:00
Animesh Jain	3de78a1b48	[dynamo][cpp-guards] EQUALS MATCH - Cache first passing value (#124627 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124627 Approved by: https://github.com/jansel ghstack dependencies: #124779	2024-04-25 15:24:12 +00:00
Lucas Pasqualin	87079f5e91	[DCP] Fix broken validate checkpoint api test (#124786 ) This test appears broken, but is somehow not failing CI/CD. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124786 Approved by: https://github.com/fegin, https://github.com/wz337	2024-04-25 14:50:58 +00:00
Yu, Guangye	cdc66e9dc3	refactor autocast python APIs (#124479 ) # Motivation Refactor autocast usage scenario in `torch/amp/autocast_mode.py` and `torch/utils/checkpoint.py` to fix the bug - convention conflict between `torch.xxx.get_autocast_xxx_dtype` defined in `autocast_mode.py` and `torch.xxx.get_autocast_dtype` defined in `checkpoint.py`. # Solution Use device-agnostic APIs like `torch.get_autocast_dtype`, ..., instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124479 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD ghstack dependencies: #124359	2024-04-25 14:33:33 +00:00
Aaron Orenstein	f01275934b	Fix global flake8 issues (#124771 ) Prior to this `lintrunner --all-files --take FLAKE8` failed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771 Approved by: https://github.com/Skylion007 ghstack dependencies: #124428	2024-04-25 14:25:00 +00:00
Xu Han	44bb5da529	Fix mkl cmake not support static mkl on Windows. (#124925 ) Fixes #124869 Fix mkl not support static library on Windows. # Local test: ## MKL static: ![image](https://github.com/pytorch/pytorch/assets/8433590/9c6ee5f8-9844-4383-acbd-6b22aff06daa) MKL backend check: <img width="724" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/e45e12a5-2dfc-47a1-ad94-32a667bd4799"> ## MKL shared, original path: ![image](https://github.com/pytorch/pytorch/assets/8433590/27a822c7-c4ab-4e5f-bbdb-8c4b085140e5) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124925 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-04-25 14:21:15 +00:00
Aaron Orenstein	25c0d3f3f0	Fix mypy issues in fake_tensor.py (#124428 ) fake_tensor.py had mypy error ignored. That seems less than desirable. Also added SafePyObjectT<T> which is a tagged wrapper around a SafePyObject but provides static type checking (with no other guarantees). Used `SafePyObjectT<TorchDispatchModeKey>` on some of the TorchDispatchModeTLS API to ensure that we don't accidentally inject a different type than expected into the stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124428 Approved by: https://github.com/malfet	2024-04-25 14:07:53 +00:00
Edward Z. Yang	87bec7db4e	Refactor all top level usages of record_shapeenv_event to ShapeEnv class (#123735 ) This ensures that first argument to record_shapeenv_event is a ShapeEnv so we can appropriately short circuit when recording is not in progress. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123735 Approved by: https://github.com/ysiraichi, https://github.com/zou3519, https://github.com/albanD ghstack dependencies: #124310, #124314, #124316, #124394, #124739, #124782, #124785	2024-04-25 14:02:48 +00:00
Edward Z. Yang	61e05f2fb4	Don't ignore fresh unbacked symbols in AOTAutograd forward analysis (#124785 ) This ensures we have correct SymInts when we allocate tangents. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124785 Approved by: https://github.com/lezcano ghstack dependencies: #124310, #124314, #124316, #124394, #124739, #124782	2024-04-25 14:02:48 +00:00
Edward Z. Yang	b4597fffce	Try to reuse old symbol name rather than new symbol name when renaming (#124782 ) Previously, unbacked SymInts would gradually get larger and larger as we kept rebinding them. Now, we do the replacement to preserve the old symbol. Actually doing this is a bit tricky. Here’s the order things happen when retracing data dependent: 1. Run fake tensor prop: allocate new unbacked SymInt 2. Run proxy tensor mode, calculate bindings and associate them with FX node 3. Run PropagateUnbackedSymInts, rename unbacked bindings to their old ones so they are consistent So the problem is when we calculate bindings in step (2), we don't know what the original names are yet, we only find out later at (3). But by the time (3) runs, we've already stuffed some new bindings in meta["unbacked_bindings"] and we don't know how to update them! To fix this, I introduce resolve_unbacked_bindings which post facto applies any of the renamings we discovered in (3). Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124782 Approved by: https://github.com/lezcano ghstack dependencies: #124310, #124314, #124316, #124394, #124739	2024-04-25 14:02:42 +00:00
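The post facto fix in #124782 amounts to rewriting the keys of bindings that were recorded in step (2) once the renamings from step (3) are known. A minimal sketch with plain dicts and strings, where the symbol names, binding paths, and the function name `apply_renamings` are all hypothetical (the real `resolve_unbacked_bindings` works on symbolic expressions and FX node metadata):

```python
def apply_renamings(bindings: dict, renamings: dict) -> dict:
    """Rewrite binding keys using a new-name -> old-name map discovered later."""
    return {renamings.get(sym, sym): path for sym, path in bindings.items()}


# Step (2) recorded bindings under freshly allocated names.
recorded = {"u5": ("size", 0), "u6": ("size", 1)}
# Step (3) later discovered that u5 should keep its old name u0.
renamings = {"u5": "u0"}

print(apply_renamings(recorded, renamings))
```

Symbols without a renaming pass through unchanged, so the helper is safe to call whether or not step (3) found anything to rename.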
Edward Z. Yang	4c44e2b236	Improved unbacked SymInt input support in Inductor (#124739 ) This is a subset of changes extracted from https://github.com/pytorch/pytorch/pull/124683/ This PR contains modifications to make Inductor work with unbacked symbol inputs, which can occur when a data-dependent sized tensor is saved for backwards. The problems to be fixed: * When binding initial symbols, we unconditionally bind unbacked symbols (instead of computing if they are needed, which only looks at backed symbols) * Benchmark generation code doesn't work with unbacked symints as we have no hints to actually feed in real values. So I pick a random number and you are expected to fix it if it doesn't work * Need to make sure we don't install dependencies on unbacked SymInt inputs, that puts us down the "promptly deallocate the input" path, but that's pointless for unbacked SymInt Fixes https://github.com/pytorch/pytorch/issues/124652 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124739 Approved by: https://github.com/jansel ghstack dependencies: #124310, #124314, #124316, #124394	2024-04-25 13:29:53 +00:00
cyy	1ac402a96c	[Distributed] [6/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124701 ) This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/124043. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124701 Approved by: https://github.com/ezyang	2024-04-25 11:39:23 +00:00
PyTorch MergeBot	f6ce94dca5	Revert "[inductor] Remove usage of device_interface from _inductor.runtime (#124592 )" This reverts commit 5d45eb77f1aeb57f13391990215b518a607b3c7e. Reverted https://github.com/pytorch/pytorch/pull/124592 on behalf of https://github.com/jeanschmidt due to breaking internal tests, check D56522594 ([comment](https://github.com/pytorch/pytorch/pull/124592#issuecomment-2076957668))	2024-04-25 11:28:23 +00:00
Peter Bell	58806d6531	[decomp] Remove dead device_hint function (#124849 ) The only use of this function is in `_to_copy` but the result is never used, so this is just dead code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124849 Approved by: https://github.com/lezcano	2024-04-25 11:25:51 +00:00
PyTorch MergeBot	5f9ea26185	Revert "OSS: Capture triton kernel in ET (#124775 )" This reverts commit c55309e58f88dd37e41e80425fd84a71d4b51548. Reverted https://github.com/pytorch/pytorch/pull/124775 on behalf of https://github.com/jeanschmidt due to need to revert so I can revert https://github.com/pytorch/pytorch/pull/124592 ([comment](https://github.com/pytorch/pytorch/pull/124775#issuecomment-2076954322))	2024-04-25 11:24:39 +00:00
PyTorch MergeBot	3890848ec2	Revert "[ROCm] Triton upstream AMD backend integration (#121801 )" This reverts commit 9888d7495ece6b6df3b7334fc7c2a9d869359250. Reverted https://github.com/pytorch/pytorch/pull/121801 on behalf of https://github.com/jeanschmidt due to need to revert so I can revert https://github.com/pytorch/pytorch/pull/124592 ([comment](https://github.com/pytorch/pytorch/pull/121801#issuecomment-2076951327))	2024-04-25 11:22:19 +00:00
PyTorch MergeBot	e520233526	Revert "[dynamo] Refactor into torch/_inductor/runtime/compile_tasks.py (#124681 )" This reverts commit 0792ceab4b6a61c6c217f65c3fecf51d75e65a9f. Reverted https://github.com/pytorch/pytorch/pull/124681 on behalf of https://github.com/jeanschmidt due to breaking internal tests, check D56522594 ([comment](https://github.com/pytorch/pytorch/pull/124681#issuecomment-2076937810))	2024-04-25 11:14:02 +00:00
Chien-Chin Huang	f3af049b88	[DDP][PT2D] Fix the import issue (#124846 ) As title Differential Revision: [D56521582](https://our.internmc.facebook.com/intern/diff/D56521582/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124846 Approved by: https://github.com/LucasLLC, https://github.com/wz337 ghstack dependencies: #124421, #124422, #123424	2024-04-25 11:08:27 +00:00
PyTorch MergeBot	0ca1ff3dce	Revert "Add support for capturing tensors with score_mod (#124444 )" This reverts commit 7c253a777641791247f7fcc19fe5c60f24be32b9. Reverted https://github.com/pytorch/pytorch/pull/124444 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, check D56522566 ([comment](https://github.com/pytorch/pytorch/pull/124444#issuecomment-2076908582))	2024-04-25 10:56:38 +00:00
PyTorch MergeBot	c0fd7894cc	Revert "Fast standalone symbolize for unwinding (#123966 )" This reverts commit 772ae6da1eb9be1f4238ff993830c56488ecae13. Reverted https://github.com/pytorch/pytorch/pull/123966 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, check D56522678 ([comment](https://github.com/pytorch/pytorch/pull/123966#issuecomment-2076821043))	2024-04-25 10:04:48 +00:00
leslie-fang-intel	2d7f709752	[Inductor] Force the parallel depth as outer loop fusion depth (#123899 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/123801 which causes a performance regression of `pyhpc_turbulent_kinetic_energy` after outer loop fusion. Root Cause - [Generated Kernel before Outer Loop Fusion](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209) - Taking the 2 kernels below as an example: - [Kernel 0](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209#file-pyhpc_turbulent_kinetic_energy-before-outer-loop-fusion-py-L255-L305) has 2 loop levels with size [200, 200]. Parallelization is not feasible due to the insufficient number of elements determined by [`decide_parallel_depth`](`aaec97a403/torch/_inductor/codegen/cpp.py (L2145-L2164)`). Therefore, the loop code will be generated with the `#pragma omp single` directive. - [Kernel 1](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209#file-pyhpc_turbulent_kinetic_energy-before-outer-loop-fusion-py-L306-L316) has 3 loop levels with size [200, 200, 26] which has enough elements to be parallelized. - [Generated Kernel after Outer Loop Fusion](https://gist.github.com/leslie-fang-intel/57a497b9d9c6aa82b1c6a686292fc887) - After outer loop fusion, `Kernel0` and `Kernel1` have been fused into one [OuterLoopFusedKernel](https://gist.github.com/leslie-fang-intel/57a497b9d9c6aa82b1c6a686292fc887#file-pyhpc_turbulent_kinetic_energy-after-outer-loop-fusion-py-L261-L497), the outer loop size is [200, 200] which does not contain enough elements to parallelize. In this PR, we propose a fix for `loop_nest` involving `OuterLoopFusedKernel`. The fix entails adding a specific heuristic for `OuterLoopFusedKernel` to determine the parallel depth by combining `outer_loop_fusion_depth` with the internal kernels' parallel depth. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123899 Approved by: https://github.com/jgong5, https://github.com/lezcano	2024-04-25 09:50:46 +00:00
PyTorch MergeBot	24ed909934	Revert "[CUDA] Fix 64-bit indexing in `vol2col` in conv3d (#124650 )" This reverts commit 71d92bace2b9ff6431976cda69c83df668d078f0. Reverted https://github.com/pytorch/pytorch/pull/124650 on behalf of https://github.com/jeanschmidt due to Reverting to check if it introduced regressions for linux-focal-rocm6.0-py3.8 tests ([comment](https://github.com/pytorch/pytorch/pull/124650#issuecomment-2076786795))	2024-04-25 09:46:21 +00:00
PyTorch MergeBot	678662a557	Revert "Made FlexAttention rewrite getitem calls to use aten.index in score_mod (#124799 )" This reverts commit acc4cbea395c25410c26d6fd3c88c072ce24c918. Reverted https://github.com/pytorch/pytorch/pull/124799 on behalf of https://github.com/jeanschmidt due to checking if this diff introduced regressions on linux-focal-py3.11-clang10 and linux-focal-py3.8-clang10 ([comment](https://github.com/pytorch/pytorch/pull/124799#issuecomment-2076756876))	2024-04-25 09:29:57 +00:00
PyTorch MergeBot	48a016157d	Revert "[benchmark][cudagraph] Explicitly call aten.div with CUDA denominator for cudagraphs (#119729 )" This reverts commit c021c9b8e48b8e787b75fd69a3076beffffb8208. Reverted https://github.com/pytorch/pytorch/pull/119729 on behalf of https://github.com/jeanschmidt due to one PR in this stack seems to have broken linux pull cuda12 tests ([comment](https://github.com/pytorch/pytorch/pull/119729#issuecomment-2076750595))	2024-04-25 09:26:25 +00:00
PyTorch MergeBot	6a92b352ee	Revert "[cudagraphs] add more info to skip messages (#124700 )" This reverts commit 0ed38c9b227f2099c77f4b34fbbe72afa176ac25. Reverted https://github.com/pytorch/pytorch/pull/124700 on behalf of https://github.com/jeanschmidt due to one PR in this stack seems to have broken linux pull cuda12 tests ([comment](https://github.com/pytorch/pytorch/pull/119729#issuecomment-2076750595))	2024-04-25 09:26:25 +00:00
PyTorch MergeBot	154157416c	Revert "[cudagraphs] add cudagraph_skips counter (#124804 )" This reverts commit fdad16b85108209bc021107f312f4b221422a012. Reverted https://github.com/pytorch/pytorch/pull/124804 on behalf of https://github.com/jeanschmidt due to one PR in this stack seems to have broken linux pull cuda12 tests ([comment](https://github.com/pytorch/pytorch/pull/119729#issuecomment-2076750595))	2024-04-25 09:26:25 +00:00
PyTorch MergeBot	7a6813b7b3	Revert "[cuDNN] cuDNN SDPA (Flash Attention) Backward (#122510 )" This reverts commit 64af899fdfc30c0c075d90bde111cec74ad9b4bb. Reverted https://github.com/pytorch/pytorch/pull/122510 on behalf of https://github.com/jeanschmidt due to Breaking amd gpu builds ([comment](https://github.com/pytorch/pytorch/pull/122510#issuecomment-2076743868))	2024-04-25 09:22:37 +00:00
chunyuan	9d139eedcf	[AOTI] set alignment for aot constant (#124272 ) GPU copies the constant blob to aligned memory ([RAII_cudaMalloc](`d0211e207c/torch/csrc/inductor/aoti_runtime/model.h (L46)` ), [64-alignment](`d0211e207c/torch/csrc/inductor/aoti_runtime/model.h (L324)`)) while CPU doesn't have this copy procedure for constant blob, which may result in sub-optimal performance when we want to directly use the constant blob buffer in the computation (for example when these constant blobs are the weight tensor to the oneDNN primitive). We set the alignment to the `constant.o` directly so that there's no need to copy the data to an aligned memory for CPU (when using `--rename-section`, the original section name would need to be specified for `--set-section-alignment`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/124272 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-04-25 08:37:44 +00:00
Animesh Jain	e68d65dae2	[dynamo][cpp-guards] Differentiate dict guards wrt to guarding on key order (#124779 ) We guard on key order 1) When a key is a non-constant object 2) When we actually need key order - like .values, .items etc For dicts/OrderedDicts that do not require key order guarding, we just rely on usual `GuardManager + DictGetItemGuardAccessor`. This is faster than going through the `list(d.keys())` based design for OrderedDicts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124779 Approved by: https://github.com/jansel	2024-04-25 08:20:35 +00:00
Animesh Jain	59a1f1f308	[dynamo][inline inbuilt nn modules] Do not inline for export (#124814 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124814 Approved by: https://github.com/jansel	2024-04-25 06:35:31 +00:00
Tiger Huo	94af62b000	Updated test_graph_grad_scaling to use new OptimizerInfo infrastructure (#123581 ) This PR targets the issue mentioned in #123451, and solves the specific task to update `test_graph_grad_scaling` in `test/test_cuda.py` to use the new OptimizerInfo infrastructure. `test_graph_grad_scaling` is moved to a new `TestCase` class called `TestCudaOptims` in order to use `instantiate_device_type_tests`. The test content remained the same. `@onlyCUDA` is applied to the new test; the original use of the wrapper function is also changed to a `@parametrize` decorator for better style. If we think that this migration is successful, we can delete the original test item under `TestCuda`. Currently it is left untouched to avoid any unexpected issues. Local linter passed. ``` $ lintrunner test/test_cuda.py ok No lint issues. ``` Local tests passed. ``` > python .\test\test_cuda.py -k test_graph_grad_scaling Ran 7 tests in 0.458s OK (skipped = 3) ``` Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123581 Approved by: https://github.com/janeyx99	2024-04-25 06:29:20 +00:00
chilli	acc4cbea39	Made FlexAttention rewrite getitem calls to use aten.index in score_mod (#124799 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124799 Approved by: https://github.com/drisspg	2024-04-25 06:19:55 +00:00
yuqingj	9a70e7f58c	[Nested Tensor]Add unit test that cover the internal use cases (#124880 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124880 Approved by: https://github.com/jbschlosser	2024-04-25 05:04:27 +00:00
Chien-Chin Huang	8b2f8ee5ef	[DDP][PT2D] Fix no_compiled_forward flag in the test (#124829 ) As title Differential Revision: [D56508696](https://our.internmc.facebook.com/intern/diff/D56508696/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124829 Approved by: https://github.com/yf225 ghstack dependencies: #124421, #124422, #123424	2024-04-25 04:55:39 +00:00
Masaki Kozuki	b21bf5e4e4	[foreach] Use same `dtypes` when `dtypesIfCUDA` is `None` (#124813 ) in order to avoid accidentally testing cuda path with fewer dtypes Pull Request resolved: https://github.com/pytorch/pytorch/pull/124813 Approved by: https://github.com/janeyx99	2024-04-25 04:42:24 +00:00
blaine-rister	84666389e1	[FX] Update opinfo tests (flattened diff) (#124657 ) Summary: This diff updates opinfo tests to compute more statistics. The results are described in this post: https://fb.workplace.com/groups/ai.acceleration.team/permalink/825131926110067/ New features: - Optionally dump kernels to a directory - Optionally disable block pointers - Impose a time limit (2 min) on individual tests - Report a variety of specific error codes when a test fails: - MIXED - FALLBACK - EXPORT_ERROR - COMPILE_ERROR - MULTIPLE_KERNELS - MISSING_KERNELS - TIMEOUT - Disable setting the RNG seed inside of opinfo, since Dynamo doesn't like this and it caused a lot of tests to fail which otherwise would be able to generate Triton. - Check each test's `(op,dtype)` pair against {HuggingFace, TIMM, TorchBench} benchmark logs, to see whether tests are representative of real-world usage. Test Plan: `buck2 test @//mode/{dev-nosan,mtia} fbcode//triton_mtia/python/test:` passed locally. This code is also exercised by the CI. Added a bunch of new unit tests: - Dumping kernels to a directory - Disabling block pointers - Mocking various error conditions in inductor - No kernels - Multiple kernels - ATen fallback - Partial ATen fallback (mixed Triton + ATen) - `torch.export` raised exception - `torch.inductor._compile` raised exception - Timeout while running test - Test harness raised uncaught exception - Check that return code == Success when exceptions were raised - Checking whether various (op,dtype) combos are in benchmarks - Check that `aten.add.Tensor` IS in the benchmarks - Check that a made up op is NOT in them Differential Revision: D56336160 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124657 Approved by: https://github.com/eellison	2024-04-25 04:38:44 +00:00
rzou	4e340a7f8b	[custom_op] setup_context fills in default values (#124852 ) This is to mirror autograd.Function's setup_context behavior. The PyTorch Dispatcher removes default values for "FC/BC reasons", but I convinced myself there's no FC/BC problem for the setup_context API. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124852 Approved by: https://github.com/albanD ghstack dependencies: #124637, #124805, #124806	2024-04-25 04:22:01 +00:00
Simon Fan	fdad16b851	[cudagraphs] add cudagraph_skips counter (#124804 ) used in tests and benchmark csv Pull Request resolved: https://github.com/pytorch/pytorch/pull/124804 Approved by: https://github.com/eellison ghstack dependencies: #119729, #124700	2024-04-25 03:38:09 +00:00
Simon Fan	0ed38c9b22	[cudagraphs] add more info to skip messages (#124700 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124700 Approved by: https://github.com/eellison ghstack dependencies: #119729	2024-04-25 03:38:09 +00:00
Simon Fan	c021c9b8e4	[benchmark][cudagraph] Explicitly call aten.div with CUDA denominator for cudagraphs (#119729 ) aten.div's output device will be its numerator's device. so it is acceptable to do cuda / cpu type divisions. post grad passes operate only on graphs and can't handle runtime graph inputs. so we change user code to move inputs to cuda for cudagraph. this affects any graph that has cpu tensors as graph inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119729 Approved by: https://github.com/eellison	2024-04-25 03:38:09 +00:00
Wanchao Liang	0b0eea2229	[dtensor] move pad/unpad_tensor to separate utils (#124871 ) as titled, 1. pad/unpad is a general util not specific to the Shard placement, 2. for the purpose of the next PR, move these two out of Shard placement itself, and give them an additional pad_dim argument Pull Request resolved: https://github.com/pytorch/pytorch/pull/124871 Approved by: https://github.com/awgu, https://github.com/wz337	2024-04-25 03:36:16 +00:00
Edward Z. Yang	13ab24f192	Reimplement unbacked symbol bindings in Inductor (#124394 ) This PR has a lot of "draw the rest of the fucking owl" energy. Here's how to break it down. 1. torch/_inductor/graph.py - We start by tightening unbacked symbol invariants. Specifically, as we lower FX nodes, we check whether or not every unbacked_binding recorded on the FX node meta, actually ends up getting bound (according to get_unbacked_symbol_defs) in all the buffers generated by the lowering. Hopefully this invariant is self evident. This leads to a lot of failures. 2. torch/_inductor/ir.py - Problem 1: There is softness in how Inductor computes defs of unbacked symbols in IR node. Previously, we tried to infer it by looking at the output sizes/strides/etc and see if new unbacked symbols popped up that we hadn't seen in the inputs. I don't know exactly what was buggy about the old code, but sometimes we would fail to notice an unbacked symbol had been bound, or rebind an unbacked symbol multiple times. Fortunately, thanks to the earlier PRs in our stack, we now have a nice list of unbacked symbol bindings from FX, so we now just store it directly on ExternKernel and use it directly to report defs. This has to be done twice: once for FallbackKernel (e.g., nonzero) and once for DynamicScalar (e.g., item) (see also torch/_inductor/lowering.py, torch/_inductor/codegen/wrapper.py and torch/_inductor/codegen/cpp_wrapper_cpu.py for the lowering and codegen changes for item) * process_kernel - Sidequest! It turns out that Inductor lowering can reallocate unbacked symbols. This happens specifically when we repropagate fake tensors through the operator in `process_kernel`. This repropagation process is necessary because Inductor may have changed the strides of input tensors, and it must now recompute the strides so that it can continue to appropriately plan the rest of the lowering process. 
This is fine: we just make sure we do the rebind unbacked + compute_unbacked_bindings dance we've been doing previously in the PR stack. But instead of putting unbacked_bindings on a new FX node, they go straight into our unbacked_bindings on the Inductor IR node. * codegen_unbacked_symbol_defs - Sidequest! FallbackKernel lowering is done in two steps. First, you emit the FallbackKernel buffer. Then, you emit MultiOutput buffers which actually give access to the individual outputs of FallbackKernel, which may have been multi-output. There is a design decision here: does the FallbackKernel bind the unbacked symbols, or the MultiOutput buffer? Historically, we put the binding on MultiOutput buffer, because it's more convenient: the FallbackKernel buffer is fake, in fact, it doesn't even get a name in C++ codegen. But it's kind of inconsistent with the keypath model that we've been tracking unbacked bindings with: if you have a multi-output node, you'd expect a keypath like `[0].size()[0]` representing the first output's first dimension size. That suggests that it's the FallbackKernel that should define the things. So that was my first implementation. Unfortunately, the C++ codegen is too cursed and I could not understand how to make it work in that case. So now we just unsoundly assume you cannot have multi-output data dependent output, and do the codegen in MultiOutput. There are some comments explaining exactly what we are improperly assuming. 3. _rename_unbacked_to in torch/fx/experimental/symbolic_shapes.py - Previously, when we renamed unbacked symbols, we clobbered any facts we previously knew about them. So for example, if we had a replacement `u0 -> s0` but then we renamed u0 to u1, we would now setup the replacement `u0 -> u1`, clobbering the old replacement. This apparently didn't matter in earlier PRs in the stack, but with Inductor now on the ball, there were some tests that indicated this was a problem. 
The solution is easy: if u0 had a preexisting replacement, reapply it to u1. However... * torch/_functorch/_aot_autograd/collect_metadata_analysis.py - When we run forward analysis, this triggers fake tensor repropagation and fresh allocations. Previously, we just cleared out the pending symbols when we finished the analysis. But with the change above, this would also migrate replacements to the new symbols... which are now dead. So now we explicitly suppress generation of these symbols with `ignore_fresh_unbacked_symbols` so that no rebinding happens at all. * torch/_dynamo/eval_frame.py - same deal; I just searched for all sites we called clear() on pending 4. The last step is fixing the long tail of extra problems that show up, now that unbacked_bindings are load bearing into Inductor * torch/_dynamo/eval_frame.py - Some of the exports are making copies of nodes without repropagating fake tensors, so in this case, it is important to also copy the `unbacked_bindings` (apparently this didn't matter before without the Inductor changes) * torch/_export/pass_base.py - I discover that this is doing fake tensor repropagation via a test suite failure. Do the same playbook as AOTAutograd: PropagateUnbackedSymInts too! Actually, they also have implemented their own tracer as well, so do the same playbook as proxy_tensor: record unbacked_bindings on the newly traced nodes. UGH code duplication. * torch/_subclasses/fake_tensor.py, torch/_subclasses/fake_impls.py (with call site updates at torch/_functorch/_aot_autograd/traced_function_transforms.py and torch/fx/passes/fake_tensor_prop.py) - What's this new epoch thing? I noticed that sometimes I would be retracing, call nonzero() on a fake tensor, and not allocate a new unbacked symbol. This is actually bad, because if I don't get a new unbacked symbol, I don't know there's a binding site, and `unbacked_bindings` is now missing a binding. 
The reason for this is memoization: if I reuse the exact same fake tensor on my retrace, it will already have an unbacked symint memoized on it and we will short circuit allocation. Well, that's no good. So I associate the memos with a fake tensor epoch, and every time you start a new fake tensor propagation from scratch, you bump the epoch so that I clear all the memos. * torch/_inductor/scheduler.py - I notice in unit tests that V.current_node is not always set when we call process_kernel. So I save it into the IR node and restore it when we are running `get_estimated_runtime`. * torch/fx/experimental/symbolic_shapes.py - A few things * rebind_unbacked (re _tensor_version). Ordinarily, when you have an unbacked SymInt, you persistently have it all the way to the end of the program. `_tensor_version` violates this: this generates an unbacked SymInt (for reasons I don't quite understand?) and then gets rid of it later. This triggered an assert violation. I think this op is kind of misusing unbacked SymInt, but I didn't know how to refactor it, so it gets a special case. * rebind_unbacked (re Simplify SymBool binding). Ugh, SymBool, what a pain in the butt. I have an assert that you can only rebind unbacked symbol to another unbacked symbol. This assert fails when a boolean is involved, because the result of running keypath on the result is not `u1`, it's `sympy.Piecewise(... sympy.Eq(u1, 1) ...)`. This is actually just `u1`, but Sympy doesn't know it because it doesn't know that `u1` value range is `[0, 1]`. So we manually implement the simplification needed to get the assert to pass. * compute_unbacked_bindings (re This is pretty fragile). There is a really funny disaster involving memoization and Inductor process kernel. Ordinarily when I retrace, if there was a memo hit in the old trace, there will be a memo hit in the new trace. 
However, Inductor process kernel breaks this, because it recreates fake tensor inputs to the operator call from scratch (since they might have different strides), and obviously these tensor inputs don't have the memo from the old one. I tried a little bit to try to manually transplant the memo to the new fake tensor but it seemed hopeless, so I just let the fresh symbol ride, allocating a new unbacked symbol. However, in one of our tests, we rely on knowing that the first nonzero call is equal to the second (memoized) nonzero call. The equality test looked pretty easy to discharge, so I just went ahead and added a deferred runtime assert to this effect and it worked. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124394 Approved by: https://github.com/jansel ghstack dependencies: #124310, #124314, #124316	2024-04-25 02:08:59 +00:00
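The epoch-based memo invalidation described in the commit above can be sketched in plain Python. This is only a toy model under stated assumptions: `ToyFakeMode`, `ToyFakeTensor`, and their methods are hypothetical names invented for illustration, not the classes in `torch/_subclasses/fake_tensor.py`.

```python
class ToyFakeMode:
    """Hypothetical stand-in for a fake tensor mode that tracks an epoch."""

    def __init__(self):
        self.epoch = 0
        self._next_symbol = 0

    def fresh_symbol(self):
        # Allocate a new unbacked symbol name: u0, u1, ...
        name = f"u{self._next_symbol}"
        self._next_symbol += 1
        return name

    def start_new_propagation(self):
        # Bumping the epoch invalidates every memo recorded earlier.
        self.epoch += 1


class ToyFakeTensor:
    """Hypothetical fake tensor that memoizes its nonzero() size symbol."""

    def __init__(self, mode):
        self.mode = mode
        self._memo = None
        self._memo_epoch = None

    def nonzero_size(self):
        # Reuse the memo only if it was recorded in the current epoch.
        if self._memo is not None and self._memo_epoch == self.mode.epoch:
            return self._memo
        self._memo = self.mode.fresh_symbol()
        self._memo_epoch = self.mode.epoch
        return self._memo


mode = ToyFakeMode()
t = ToyFakeTensor(mode)
first = t.nonzero_size()       # allocates a symbol
second = t.nonzero_size()      # memo hit within the same epoch
mode.start_new_propagation()   # a retrace starts from scratch
third = t.nonzero_size()       # stale memo ignored, new symbol allocated
```

Within one epoch the second call reuses the first symbol, while the call after the bump allocates a fresh one, so the retrace sees a binding site again.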
Edward Z. Yang	66b0156e0b	Ban replacements with unbacked SymInt on both sides (#124316 ) Fixes https://github.com/pytorch/pytorch/issues/123854 Important comment: ``` # Never replace unbacked symbols with other unbacked symbols. # This is error prone because you can cause references to # unbacked symbols to time travel backwards. E.g., # # u1 = x.item() # ... use of u1 ... # u2 = y.item() # u3 = z.item() # torch._check(u1 == u2 + u3) # # If you replace u1 with u2 + u3, then the use of u1 now # references u2 and u3 prior to them actually being bound at # runtime. It's pretty inconvenient to setup control # dependencies for substitutions, so ban it entirely. ``` This is kind of risky for the internal MRS workstream, because we added these substitutions upon their request in the first place. Fortunately, we still allow substitutions to backed SymInts and constants, and I believe that is what is actually load bearing. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124316 Approved by: https://github.com/ColinPeppler, https://github.com/lezcano ghstack dependencies: #124310, #124314	2024-04-25 02:08:59 +00:00
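The "time travel" hazard quoted in the commit above can be illustrated with a toy program model. This is a hedged sketch: the `("bind" | "use", symbols)` representation and both helper functions are assumptions made for this example and are unrelated to the real ShapeEnv code.

```python
def first_unbound_use(program):
    """Return the first symbol that is used before it is bound, else None."""
    bound = set()
    for kind, symbols in program:
        if kind == "bind":
            bound.update(symbols)
        else:  # "use"
            for symbol in symbols:
                if symbol not in bound:
                    return symbol
    return None


def substitute_in_uses(program, old, replacement):
    """Rewrite every use of `old` into uses of the `replacement` symbols."""
    rewritten = []
    for kind, symbols in program:
        if kind == "use":
            expanded = []
            for symbol in symbols:
                expanded.extend(replacement if symbol == old else [symbol])
            symbols = expanded
        rewritten.append((kind, symbols))
    return rewritten


program = [
    ("bind", ["u1"]),  # u1 = x.item()
    ("use", ["u1"]),   # ... use of u1 ...
    ("bind", ["u2"]),  # u2 = y.item()
    ("bind", ["u3"]),  # u3 = z.item()
]

before = first_unbound_use(program)
# Replacing u1 with u2 + u3 makes the early use refer to unbound symbols.
after = first_unbound_use(substitute_in_uses(program, "u1", ["u2", "u3"]))
```

Here `before` is `None` because every use follows its binding, while `after` is `"u2"`: the substituted use now refers to a symbol that is bound only later in the program.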
Edward Z. Yang	5e58227d27	Rebind and refresh unbacked bindings in FakeTensorUpdater (#124314 ) Like the previous two PRs, this is doing the rebinding and binding computation, just in FakeTensorUpdater. FakeTensorUpdater modifies FX graph in place so its usage pattern is slightly different, but still pretty short. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124314 Approved by: https://github.com/IvanKobzarev, https://github.com/lezcano ghstack dependencies: #124310	2024-04-25 02:08:55 +00:00
Edward Z. Yang	9692b954c6	FakeTensorProp works with unbacked bindings (#124310 ) This is a partial revert of https://github.com/pytorch/pytorch/pull/124059 Like in #124297, profiling has revealed that testing equality on every output is kind of expensive. So we only test equality when we know there is an unbacked binding. This is the same playbook as the previous PR, just on FakeTensorProp instead of PropagateUnbackedSymInts. Note that we also need to populate `unbacked_bindings` in proxy_tensor.py, since we're generating an entirely new graph in that case. We now have enough propagation that we're able to trigger a bug related to divisibility replacement. In https://github.com/pytorch/pytorch/pull/113165 we allowed to replace `u0` with `u1 * c` for some constant c, when we have determined that u0 is divisible by c. However, where does the binding for u1 come from? What we will have in practice is that there is some node that is supposed to have bound u1, but which actually is getting a `u1 * c` in its output. So, to get u1, we must divide out c. Fortunately, under the divisibility condition, this is always possible (but remember, we must test divisibility at runtime!) Because we have tightened up asserts, it is now an error to allocate unbacked SymInts and then fail to track them under unbacked_bindings. In torch/_dynamo/eval_frame.py and torch/_functorch/_aot_autograd/collect_metadata_analysis.py there are examples of benign cases where we repropagated fake tensors but then immediately threw away the results. In these cases, it's not appropriate to rebind, since we're still using the old FX graph that has all of the old symbols. So we just manually clear it. It is possible that other cases will need to be updated, so this PR is "risky" from the perspective of hitting fbcode. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124310 Approved by: https://github.com/lezcano	2024-04-25 02:08:51 +00:00
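The divisibility step in the commit above (recovering `u1` from an output that holds `u1 * c`) amounts to a guarded integer division. A minimal sketch, assuming concrete runtime integers; `recover_u1` is a hypothetical helper name, not a PyTorch function.

```python
def recover_u1(u0_value: int, c: int) -> int:
    """Recover u1 from a runtime value of u0, given the replacement u0 = u1 * c."""
    if c == 0:
        raise ValueError("the constant c must be nonzero")
    # The replacement is only valid when u0 is divisible by c,
    # so the divisibility condition is checked at runtime.
    if u0_value % c != 0:
        raise AssertionError(f"expected {u0_value} to be divisible by {c}")
    return u0_value // c


u1_value = recover_u1(12, 4)
```

With `u0 = 12` and `c = 4`, the sketch yields `u1 = 3`; a value such as `u0 = 10` fails the runtime check instead of silently producing a wrong binding.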
rzou	141888765b	Verify types in custom op schemas (#124520 ) Before this PR, we didn't check that types in a schema were valid. This is because TorchScript treats unknown types as type variables. This PR checks types in a schema for the TORCH_LIBRARY APIs. To do this, we add an `allow_typevars` flag to parseSchema so that TorchScript can use allow_typevars=True. We also add some error messages for common mistakes (e.g. using int64_t or double in schema). Test Plan: - new tests Differential Revision: [D56432690](https://our.internmc.facebook.com/intern/diff/D56432690) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124520 Approved by: https://github.com/albanD	2024-04-25 01:56:58 +00:00
Gustav Larsson	050dd65a87	[onnx.export] Track new nodes added during _run_symbolic_function (#123027 ) This PR is part of an effort to speed up torch.onnx.export (#121422). - This copies the shape and type from the node to the nodes that are produced by the export. However, for 1-to-N exports, which are very common, this doesn't make much sense and can leave the graph with broken shape or type information. As far as I can tell, a shape inference pass is used to propagate the correct shape and type for all intermediate (and final) nodes. - If there is a situation where this is necessary (shape inference turned off and only 1-to-1 ops are exported ??), perhaps this can be conditionally skipped. It does incur a quadratic cost. Another option is to set a global default for the metadata and use that for all nodes that get created. Again, this metadata may not make sense for all ops and seems dangerous to do. - Resolves (8) in #121422. (partial fix of #121422) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123027 Approved by: https://github.com/BowenBao	2024-04-25 01:56:36 +00:00
rzou	4f398eed0b	[custom_op] register_autograd supports non-tensor kwargonly-args (#124806 ) The user does not need to return gradients for these args. We also change how setup_context works to adapt to kwargonly-args. If the user's op has no kwonly-args, then their setup_context function must look like `setup_context(ctx, inputs, output)`: we require that the arguments have the same names. If the user's op has kwonly-args, then their setup_context function must look like `setup_context(ctx, inputs, keyword_only_inputs, output)`. We require that the arguments have the same names. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124806 Approved by: https://github.com/albanD, https://github.com/williamwen42 ghstack dependencies: #124637, #124805	2024-04-25 01:51:02 +00:00
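The naming rule for `setup_context` stated in the commit above can be sketched with `inspect.signature`. This is an illustration only: `check_setup_context` is a hypothetical helper written for this example, and it is not how PyTorch itself validates the function.

```python
import inspect


def check_setup_context(setup_context, has_kwonly_args):
    """Check that a setup_context function uses the required argument names."""
    if has_kwonly_args:
        expected = ["ctx", "inputs", "keyword_only_inputs", "output"]
    else:
        expected = ["ctx", "inputs", "output"]
    actual = list(inspect.signature(setup_context).parameters)
    if actual != expected:
        raise TypeError(f"expected arguments {expected}, got {actual}")


def setup_plain(ctx, inputs, output):
    pass


def setup_kwonly(ctx, inputs, keyword_only_inputs, output):
    pass


check_setup_context(setup_plain, has_kwonly_args=False)   # passes
check_setup_context(setup_kwonly, has_kwonly_args=True)   # passes
```

Passing `setup_plain` with `has_kwonly_args=True` raises `TypeError` in this sketch, mirroring the requirement that the argument names match exactly.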
rzou	31522391a8	[custom_op] Blanket ban kwarg-only Tensors (#124805 ) We can lift this if users ask for it, but I haven't yet seen an op that someone would use with this API that uses a kwarg-only Tensor. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124805 Approved by: https://github.com/albanD, https://github.com/williamwen42 ghstack dependencies: #124637	2024-04-25 01:51:02 +00:00
rzou	2b1c13e3a3	[custom_op] fix schema inference for kwarg-only args (#124637 ) Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124637 Approved by: https://github.com/williamwen42, https://github.com/albanD	2024-04-25 01:51:02 +00:00
Gagan Jain	c5e567c573	[Torch][Timer] Adding debug info logging interface for expired timers (#123883 ) Summary: Adding function to log additional debug information before killing the expired watchdog timers. Additional information like stack trace can be added in the debug function using worker process IDs from expired timers. Test Plan: buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test Differential Revision: D56044153 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123883 Approved by: https://github.com/kurman	2024-04-25 01:15:52 +00:00
eellison	43313a506a	Dont precompile if we search_autotune_cache but not max autotune is set (#124870 ) Differential Revision: [D56534950](https://our.internmc.facebook.com/intern/diff/D56534950) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124870 Approved by: https://github.com/xw285cornell	2024-04-25 01:07:21 +00:00
eellison	68225072e8	Match insignificant strides for sdpa inputs (#124859 ) Fix for https://github.com/pytorch/pytorch/issues/124289. There was a tensor which had a single, expanded element. inductor generated the strides as all 0, while sdpa expects a dense last dimension `t.stride(-1) == 1`. While these are equivalent, we still hit an error in the kernel. We could make fixes in sdpa, but matching the insignificant strides in inductor also works and I am less aware of the downstream sdpa kernel details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124859 Approved by: https://github.com/drisspg ghstack dependencies: #124751	2024-04-24 23:44:23 +00:00
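Why a stride can be "insignificant" is easy to check without any tensor library: the stride of a size-1 dimension is only ever multiplied by index 0, so it cannot change which storage element is addressed. The sketch below is a pure-Python illustration of that claim; `element_offsets` is a hypothetical helper, not part of Inductor or sdpa.

```python
from itertools import product


def element_offsets(shape, strides):
    """Return the storage offset of every element, in row-major index order."""
    index_ranges = [range(n) for n in shape]
    return [
        sum(i * s for i, s in zip(idx, strides))
        for idx in product(*index_ranges)
    ]


shape = (2, 3, 1)
# The last dimension has size 1, so its stride is never multiplied by a
# nonzero index; strides 0 and 1 address exactly the same elements.
offsets_zero_stride = element_offsets(shape, (3, 1, 0))
offsets_dense_stride = element_offsets(shape, (3, 1, 1))
print(offsets_zero_stride == offsets_dense_stride)
```

A kernel that checks `t.stride(-1) == 1` literally will still reject the zero-stride layout, which is why matching the strides on the producer side resolves the error.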
Andrew Gu	36c983a973	[DeviceMesh] Added `DeviceMesh.from_group()` (#124787 ) This PR adds a `DeviceMesh.from_group()` static method to convert an existing process group to a device mesh. Motivation: We need `DeviceMesh.from_group()` to allow FSDP2 to interoperate with distributed libraries that do not use `DeviceMesh` for all parallelisms. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124787 Approved by: https://github.com/wanchaol ghstack dependencies: #124651, #124741, #124767, #124768, #124780	2024-04-24 23:16:06 +00:00
bhack	cb94845b14	Force upsample to be float32 (#121324 ) Fixes #121072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121324 Approved by: https://github.com/albanD	2024-04-24 23:14:41 +00:00
Zhengxu Chen	02ed2992d9	[export] Capture tensor.to() under export. (#123732 ) Summary: We used to skip tensor.to() during tracing when the device is the same. This brings some performance improvement in eager mode but makes graph capture lose the semantics of the original model. In this diff, we add an additional condition to skip the fast path when we don't have actual data inside a tensor, which is the case when we're using FakeTensor / FunctionalTensor to trace the model. This won't have a perf impact on existing eager models while making sure we can capture the _to_copy() node in the graph. Test Plan: buck test mode/opt caffe2/test:test_export -- -r device_to Differential Revision: D55969674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123732 Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan	2024-04-24 23:12:19 +00:00
Catherine Lee	4f29103749	[ez][CI] Move test_cuda off CI_SERIAL_LIST (#124649 ) Tag test cases with large tensor with serial, also tag a few more that failed on a previous iteration of this PR Move test_cuda and test_cuda_expandable_segments off the serial list Pull Request resolved: https://github.com/pytorch/pytorch/pull/124649 Approved by: https://github.com/ZainRizvi	2024-04-24 22:04:23 +00:00
andrewor14	85b28ffc3a	[quant][pt2e] Move batch norm op between eval/train for cuda (#123957 ) Summary: Before in `move_exported_model_to_train/eval`, we only switched the CPU versions of the batch norm op. This commit adds support for the cuda versions of the op too. Note that this fix is temporary; we won't have to differentiate between these two cases once we have batch norm consolidation. Test Plan: python test/test_quantization.py -k test_move_exported_model_bn Reviewers: jerryzh168 Subscribers: jerryzh168, leslie-fang-intel, supriyar Differential Revision: [D56070054](https://our.internmc.facebook.com/intern/diff/D56070054) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123957 Approved by: https://github.com/jerryzh168	2024-04-24 22:01:50 +00:00
Jeff Daily	82fe9071c2	[ROCm][CI] fix 5.7 nightly wheel build (#124797 ) Fixes broken ROCm 5.7 build caused by #122106. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124797 Approved by: https://github.com/atalman	2024-04-24 21:55:24 +00:00
Jeff Daily	a89f442f0b	add -fclang-abi-compat=17 to HIP_HIPCC_FLAGS (#124862 ) C++20 mangling rules were recently added to hip-clang. This flag maintains compatibility since pytorch is at C++17. Otherwise the linker fails. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124862 Approved by: https://github.com/malfet, https://github.com/pruthvistony	2024-04-24 21:46:50 +00:00
wz337	7809b34288	[DTensor][Easy] Update OpSchema __repr__ to show args_schema in format print (#124812 ) When printing op_schema with `print(f"{op_schema=}")`: Before -- can't view into the OpStrategy/TupleStrategy in format print: ``` # A pointwise strategy op_schema=OpSchema(op=aten.relu.default, args_schema=(<torch.distributed._tensor.op_schema.OpStrategy object at 0x7f4e763e0520>,), kwargs_schema={}) # A pointwise strategy pointwise_strategy -- op_schema=OpSchema(op=aten.threshold_backward.default, args_schema=(<torch.distributed._tensor.op_schema.OpStrategy object at 0x7f4e763e1540>, <torch.distributed._tensor.op_schema.OpStrategy object at 0x7f4e763e1510>, 0), kwargs_schema={}) # A tuple strategy op_schema=OpSchema(op=aten._foreach_lerp_.Scalar, args_schema=(<torch.distributed._tensor.op_schema.TupleStrategy object at 0x7f4e763e31f0>, <torch.distributed._tensor.op_schema.TupleStrategy object at 0x7f4e763e3460>, 0.09999999999999998), kwargs_schema={}) ``` After -- printing out the OpStrategy/TupleStrategy string: ``` # A pointwise strategy op_schema=OpSchema(op=aten.relu.default, args_schema=(OpStrategy:[None -> R] @ mesh: (4,)), kwargs_schema={}) # A pointwise strategy op_schema=OpSchema(op=aten.threshold_backward.default, args_schema=(OpStrategy:[None -> R] @ mesh: (4,), OpStrategy:[None -> R] @ mesh: (4,), 0), kwargs_schema={}) # A tuple strategy op_schema=OpSchema(op=aten._foreach_lerp_.Scalar, args_schema=(TupleStrategy(OpStrategy:[None -> S(0)] @ mesh: (4,)), TupleStrategy(OpStrategy:[None -> S(0)] @ mesh: (4,)),0.09999999999999998), kwargs_schema={}) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124812 Approved by: https://github.com/wanchaol	2024-04-24 21:34:39 +00:00
Prachi Gupta	a248c24694	[ROCm][Inductor] Disable conv cache emptying with hipgraphs (#124791 ) When we warm up hipgraphs, we use the cudagraph memory pool to allocate a large part of the memory. We don't necessarily execute the kernels on the GPUs, so we don't want to free this allocated memory. However, this conflicts with the emptyCache call inside findAlgorithm, where convolution algorithm benchmarking happens. For benchmarking, we might use large memory allocations to cache algorithm results. As a fix, we disable the emptyCache() call during cudagraph warmup. As with the cuDNN PR that did the same thing for CUDA, this did not have a significant effect on memory footprint. `a8ff647e42` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124791 Approved by: https://github.com/eellison, https://github.com/jeffdaily	2024-04-24 21:21:10 +00:00
Aaron Enye Shi	80ab062103	[MemoryViz] Improve description of blocks with missing frames (#124784 ) Summary: It is common for blocks to be missing frames and there are many users asking why. Let's improve this output message to cover common reasons: 1) block was allocated before _record_memory_history was enabled 2) context or stacks passed to _record_memory_history does not include this block 3) backward events allocated with C++ stack and will not show if stacks = python Test Plan: CI and ran it locally: ![image](https://github.com/pytorch/pytorch/assets/17602366/60a03a22-0e3e-43d8-9ee7-b14358096fc7) Differential Revision: D56490921 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/124784 Approved by: https://github.com/zdevito	2024-04-24 21:16:31 +00:00
Shen Xu	8885638f95	[quant][pt2e] Propagate get_attr meta through known ops only (#124415 ) Summary: Avoid the situation where the graph traversal finds a matmul node with a `get_attr` as its `args[0]` and incorrectly propagates the `get_attr`'s meta to everything downstream. Test Plan: CI Differential Revision: D56219120 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124415 Approved by: https://github.com/jerryzh168	2024-04-24 20:55:56 +00:00
egienvalue	355dc34f86	Add test_cpp_extensions tests for stream_and_event and mtia_backend (#123614 ) Test the generic torch.Stream/Event with a fake device guard and hooks. Differential Revision: [D56443358](https://our.internmc.facebook.com/intern/diff/D56443358) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123614 Approved by: https://github.com/albanD ghstack dependencies: #123611, #123612	2024-04-24 20:51:20 +00:00
egienvalue	381653de63	torch.mtia module for MTIA device backend (#123612 ) The MTIA device now has its own module in PyTorch. torch.mtia has the following APIs, similar to other backends. lazy_init is also supported. ``` __all__ = [ "init", "is_available", "synchronize", "device_count", "current_device", "current_stream", "default_stream", "set_stream", "stream", "device", ] ``` ------------ For device management, we expand AcceleratorHooksInterface to support generic device management; it can be used from both C++ and Python. ``` def _accelerator_hooks_device_count() -> _int: ... def _accelerator_hooks_set_current_device(device_index: _int) -> None: ... def _accelerator_hooks_get_current_device() -> _int : ... def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ... def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ... ``` --------- Adding get_device_module API to retrieve device modules for different device types. ``` def get_device_module(device: Optional[Union[torch.device, str]] = None) ``` --------- Differential Revision: [D56443356](https://our.internmc.facebook.com/intern/diff/D56443356) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612 Approved by: https://github.com/albanD ghstack dependencies: #123611	2024-04-24 20:51:20 +00:00
egienvalue	408aa0182c	Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611 ) This diff intends to build device generic torch.Stream and torch.Event for newly added accelerators in PyTorch. ------------ torch.Stream APIs ``` # Defined in torch/csrc/Stream.cpp class Stream(_StreamBase): stream_id: _int # Stream id device_index: _int device_type: _int device: _device # The device of the stream @overload def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ... @overload def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ... def wait_event(self, event: Event) -> None: ... def wait_stream(self, other: Stream) -> None: ... def record_event(self, event: Optional[Event] = None) -> Event: ... def query(self) -> None: ... def synchronize(self) -> None: ... def __hash__(self) -> _int: ... def __repr__(self) -> str: ... def __eq__(self, other: object) -> _bool: ... ``` ------------------ torch.Event APIs: - IPC related APIs are not implemented, since many device backends don't support it, but we leave interfaces there for future adaption of torch.cuda.Stream. - currently only the enable_timing is supported, since it is the most common one used in other device backends. We have to refactor the event flag system in PyTorch to support more fancy flag. - elapsedTime API is added to c10::Event ``` # Defined in torch/csrc/Event.cpp class Event(_EventBase): device: _device # The device of the Event event_id: _int # The raw event created by device backend def __new__(self, device: Optional[DeviceLikeType] = None, enable_timing: _bool = False, blocking: _bool = False, interprocess: _bool = False) -> Event: ... @classmethod def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ... def record(self, stream: Optional[Stream] = None) -> None: ... def wait(self, stream: Optional[Stream] = None) -> None: ... def query(self) -> _bool: ... 
def elapsed_time(self, other: Event) -> _float: ... def synchronize(self) -> None: ... def ipc_handle(self) -> bytes: ... def __repr__(self) -> str: ... ``` ----------- c10::Event provides new APIs - calculate elapsedTime. - Get raw event id - Synchronize event. ``` double elapsedTime(const Event& event) const { return impl_.elapsedTime(event.impl_); } void* eventId() const { return impl_.eventId(); } void synchronize() const { return impl_.synchronize(); } ``` ---------- TODO: need to find a good way to test them in PyTorch with API mocks. Differential Revision: [D56443357](https://our.internmc.facebook.com/intern/diff/D56443357) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611 Approved by: https://github.com/albanD, https://github.com/jeffdaily	2024-04-24 20:51:17 +00:00
Edward Z. Yang	a22847a9cb	We should not be in kernel invocation before we restore fake mode (#124762 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124762 Approved by: https://github.com/eellison ghstack dependencies: #124760	2024-04-24 20:32:59 +00:00
Edward Z. Yang	0d58aeb73a	Handle size/etc accessors in FakeTensor, support accessing symbolic types from toInt/etc in IValue (#124760 ) Fixes https://github.com/pytorch/pytorch/issues/122772 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124760 Approved by: https://github.com/albanD, https://github.com/eellison	2024-04-24 20:32:59 +00:00
Peter Bell	9bd6e93a04	[inductor] Add option to create parent directory for write_atomic (#124646 ) In #124640 I see the error ``` File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 887, in load compiled_graph = FxGraphCache._lookup_graph(key, example_inputs) File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 776, in _lookup_graph write_atomic(artifact_path, graph.source_code) File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 412, in write_atomic with tmp_path.open(write_mode) as f: File "/opt/conda/envs/py_3.10/lib/python3.10/pathlib.py", line 1119, in open return self._accessor.open(self, mode, buffering, encoding, errors, FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp02wlik2v/iu/.28383.139931139675904.tmp' ``` Which is fixed by creating the parent directory first. Since this is what you want to do in most cases, I add an argument to `write_atomic` to do so itself. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124646 Approved by: https://github.com/lezcano	2024-04-24 20:12:23 +00:00
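The fix described in #124646 can be sketched as follows. This is a minimal standalone illustration, not the code in `torch/_inductor/codecache.py`; the `make_dirs` parameter name and the temporary-file naming scheme are assumptions modeled on the traceback above.

```python
import os
import threading
from pathlib import Path
from typing import Union


def write_atomic(path: str, content: Union[str, bytes], make_dirs: bool = False) -> None:
    """Write content to a temp file next to path, then move it into place."""
    target = Path(path)
    if make_dirs:
        # Without this, opening the temp file raises FileNotFoundError
        # when the parent directory does not exist yet.
        target.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = target.parent / f".{os.getpid()}.{threading.get_ident()}.tmp"
    write_mode = "w" if isinstance(content, str) else "wb"
    with tmp_path.open(write_mode) as f:
        f.write(content)
    # os.replace overwrites the destination in a single step when both
    # paths are on the same filesystem.
    os.replace(tmp_path, target)
```

Keeping directory creation opt-in preserves the old behavior for callers that want a missing parent directory to surface as an error.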
yuqingj	adbf62cd0a	Fix layer norm in static runtime when input is non-contiguous (#124789 ) Test: The added unit test fails before this fix and passes after it. The fix comes from @swolchok in D56087067. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124789 Approved by: https://github.com/davidberard98	2024-04-24 19:49:36 +00:00
Eddie Yan	71d92bace2	[CUDA] Fix 64-bit indexing in `vol2col` in conv3d (#124650 ) Similar to #118005, fixes sometimes-silent IMAs (illegal memory accesses). CC @atalman @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/124650 Approved by: https://github.com/soulitzer	2024-04-24 19:47:18 +00:00
Catherine Lee	8fe0b8b6a8	No CPP or xdist process level reruns (#124798 ) xdist doesn't play well with current process level rerun scheme Pull Request resolved: https://github.com/pytorch/pytorch/pull/124798 Approved by: https://github.com/huydhn	2024-04-24 19:44:51 +00:00
Sheng Fu	c55309e58f	OSS: Capture triton kernel in ET (#124775 ) This DIFF is to capture triton kernels in execution trace Pull Request resolved: https://github.com/pytorch/pytorch/pull/124775 Approved by: https://github.com/briancoutinho	2024-04-24 19:39:37 +00:00
rzou	872eeb0d7d	Refresh OpOverloadPacket if a new OpOverload gets added (#124654 ) If a user accesses an OpOverloadPacket, then creates a new OpOverload, then uses the OpOverloadPacket, the new OpOverload never gets hit. This is because OpOverloadPacket caches OpOverloads when it is constructed. This PR fixes the problem by "refreshing" the OpOverloadPacket if a new OpOverload gets constructed and the OpOverloadPacket exists. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124654 Approved by: https://github.com/albanD	2024-04-24 19:30:52 +00:00
Florian	7ad6dc2cf3	[Profiler][PrivateUse1] Profiler support PrivateUse1 key (#124818 ) Summary: 1. Package the public Kineto headers when USE_KINETO is set so that PrivateUse1 users can use them. 2. Add the PrivateUse1 key to ActivityType. 3. Support the PrivateUse1 key in the functions deviceTypeFromActivity and _supported_activities. 4. Fix some bugs when processing profiler results. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124818 Approved by: https://github.com/aaronenyeshi	2024-04-24 18:52:08 +00:00
Ke Wen	f07b6227e6	Initial add of torch.distributed.pipelining (#124776 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124776 Approved by: https://github.com/wconstab	2024-04-24 18:51:20 +00:00
Aaron Gokaslan	40cf38fd15	[BE]: Apply ruff rule FURB192 (#124742 ) Apply RUFF rule FURB192 to remove unnecessary sorts and replace them with min / max. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124742 Approved by: https://github.com/albanD, https://github.com/malfet	2024-04-24 18:44:08 +00:00
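The kind of rewrite FURB192 targets can be shown with a small before/after pair. The variable names below are illustrative and not taken from the PyTorch change itself.

```python
values = [7, 3, 9, 1, 4]

# Before: sorts the whole list (O(n log n)) just to read one element.
smallest_sorted = sorted(values)[0]
largest_sorted = sorted(values)[-1]

# After: a single linear pass each, with no intermediate list.
smallest = min(values)
largest = max(values)

print(smallest, largest)
```

The two forms are not identical on every input: on an empty sequence `sorted(values)[0]` raises `IndexError` while `min(values)` raises `ValueError`, so code that catches one of these exceptions needs a second look when applying the rule.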
Andrew Gu	48312a7fc3	[DeviceMesh] Removed unneeded `.to(cpu)` (#124768 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124768 Approved by: https://github.com/wz337 ghstack dependencies: #124651, #124741, #124767	2024-04-24 18:07:20 +00:00
atalman	927ae80afa	Release 2.3 compatibility matrix (#124861 ) Update release compatibility matrix with latest release Pull Request resolved: https://github.com/pytorch/pytorch/pull/124861 Approved by: https://github.com/svekars, https://github.com/seemethere, https://github.com/malfet	2024-04-24 18:05:14 +00:00
Andrew Gu	1db7d64af2	[DeviceMesh] Initialized mesh tensor with CPU context (#124767 ) This PR makes sure to construct the `DeviceMesh`'s `mesh` tensor on CPU device in `init_device_mesh()`. This means that we can call `init_device_mesh()` under meta-device context and still construct the correct `mesh` tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124767 Approved by: https://github.com/wz337 ghstack dependencies: #124651, #124741	2024-04-24 18:04:06 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	674e15ae07	Back out "Switch to predispatch" (#124860 ) Summary: Original commit changeset: 1f155b3a0bfc Original Phabricator Diff: D56273267 Test Plan: CI Differential Revision: D56526505 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124860 Approved by: https://github.com/angelayi	2024-04-24 17:28:33 +00:00
Jack Taylor	9888d7495e	[ROCm] Triton upstream AMD backend integration (#121801 ) Update ROCm-triton to use the AMD backend from https://github.com/openai/triton Note: `test__int_mm` can be enabled after https://github.com/pytorch/pytorch/pull/122431 is landed Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121801 Approved by: https://github.com/nmacchioni, https://github.com/malfet	2024-04-24 17:28:12 +00:00
Yanbo Liang	ed120b08c4	Add commonly used score_mod functions for templated attention (#124670 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124670 Approved by: https://github.com/Chillee	2024-04-24 17:04:36 +00:00
Saaketh	977105466f	Remove activation checkpointing tag to get correct FQNs (#124698 ) Fixes #124546 When setting `use_orig_params = False` and using activation checkpointing, the FQN mapping as retrieved by the `_get_fqns` function is incorrect because the prefix that is added to the name of each activation checkpointed module, `_checkpoint_wrapped_module`, can still be present. I think this is an edge case with the `_get_fqns` function that was not addressed by this previous commit #118119. Without the change, the list of object names for an activation checkpointed module with FSDP (and `use_orig_params=False`) can be something like: ``` ['model', '_fsdp_wrapped_module', 'transformer', 'blocks', '0', '_fsdp_wrapped_module', '_checkpoint_wrapped_module', '_flat_param'] ``` Which will incorrectly return just one FQN, `{'model.transformer.blocks.0._flat_param'}`, when all the FQNs of the parameters of the transformer block should be returned. With the change, the list of object names will now have `_checkpoint_wrapped_module` removed: ``` ['model', '_fsdp_wrapped_module', 'transformer', 'blocks', '0', '_fsdp_wrapped_module', '_flat_param'] ``` And the FQNs are correctly retrieved and returned in `_get_fqns` when [this condition](`ea61c9cb29/torch/distributed/checkpoint/state_dict.py (L168)`) is satisfied. 
The correct FQNs are: ``` {'model.transformer.blocks.0.attn.Wqkv.bias', 'model.transformer.blocks.0.ffn.up_proj.bias', 'model.transformer.blocks.0.attn.out_proj.weight', 'model.transformer.blocks.0.norm_2.weight', 'model.transformer.blocks.0.ffn.down_proj.weight', 'model.transformer.blocks.0.attn.Wqkv.weight', 'model.transformer.blocks.0.norm_2.bias', 'model.transformer.blocks.0.ffn.up_proj.weight', 'model.transformer.blocks.0.ffn.down_proj.bias', 'model.transformer.blocks.0.norm_1.bias', 'model.transformer.blocks.0.norm_1.weight', 'model.transformer.blocks.0.attn.out_proj.bias'} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124698 Approved by: https://github.com/Skylion007	2024-04-24 16:47:50 +00:00
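The name clean-up step in #124698 amounts to dropping the activation-checkpoint wrapper component before the FQN lookup runs. The sketch below reproduces only that step on the name list quoted above; `strip_checkpoint_tag` is a hypothetical helper, and the expansion of `_flat_param` into per-parameter FQNs that `_get_fqns` performs afterwards is not modeled.

```python
CHECKPOINT_TAG = "_checkpoint_wrapped_module"


def strip_checkpoint_tag(obj_names):
    """Drop activation-checkpoint wrapper components from a dotted-name path."""
    return [name for name in obj_names if name != CHECKPOINT_TAG]


before = [
    "model", "_fsdp_wrapped_module", "transformer", "blocks", "0",
    "_fsdp_wrapped_module", "_checkpoint_wrapped_module", "_flat_param",
]
after = strip_checkpoint_tag(before)
print(after)
```

With the tag gone, `_flat_param` directly follows an FSDP-wrapped module in the list, which is the arrangement the commit message says the existing condition in `_get_fqns` expects.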
Catherine Lee	bf834d388b	Mark test_xavier_uniform as slow (#124801 ) takes 17+ minutes, sometimes 30+ min Pull Request resolved: https://github.com/pytorch/pytorch/pull/124801 Approved by: https://github.com/huydhn	2024-04-24 15:48:04 +00:00
PyTorch MergeBot	e739a2d59e	Revert "[quant][pt2e] Move batch norm op between eval/train for cuda (#123957 )" This reverts commit 4efb28c90025ea3d979b720942cd97a274fac6da. Reverted https://github.com/pytorch/pytorch/pull/123957 on behalf of https://github.com/jeanschmidt due to reverting to check if it will fix rocm jobs on main ([comment](https://github.com/pytorch/pytorch/pull/123957#issuecomment-2075158146))	2024-04-24 15:02:11 +00:00
PyTorch MergeBot	92295fbacd	Revert "Verify types in custom op schemas (#124520 )" This reverts commit 5b98d43488bed0836b4da5996a50bafd0dd2c11c. Reverted https://github.com/pytorch/pytorch/pull/124520 on behalf of https://github.com/zou3519 due to broke static runtime tests ([comment](https://github.com/pytorch/pytorch/pull/124520#issuecomment-2075111935))	2024-04-24 14:41:26 +00:00
Kai Londenberg	7d94f52a8a	[Inductor Cutlass backend] clean up CUTLASSGemmTemplate.add_cutlass_gemm_choices (#124575 ) Clean up CUTLASSGemmTemplate.add_cutlass_gemm_choices, removing code that became unnecessary once EVT-based epilogue fusion was removed. Test Plan: Already covered by test_cutlass_backend.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/124575 Approved by: https://github.com/jansel ghstack dependencies: #121497, #123930, #123932, #121734, #124107, #124574	2024-04-24 14:00:12 +00:00
Pearu Peterson	49f0d127fb	Fix a bug in retrieving approximate bsr_dense_addmm kernel meta data (#124371 ) Fixes #124333 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124371 Approved by: https://github.com/eqy, https://github.com/lezcano	2024-04-24 13:59:18 +00:00
Kai Londenberg	a47f4253ab	[Inductor Cutlass backend] Set INDUCTOR_TEST_DISABLE_FRESH_CACHE in test setup (#124574 ) The diff https://github.com/pytorch/pytorch/pull/122661 introduces a new automatic cache refresh mechanism during all inductor-derived test cases. But this refresh mechanism seems not to work properly across process boundaries, specifically when using autotune_in_subproc, which many tests in test_cutlass_backend.py rely on. Solution: Set the env var INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 early during test setup within test_cutlass_backend.py Test Plan: This is a change to unit tests only. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124574 Approved by: https://github.com/aakhundov ghstack dependencies: #121497, #123930, #123932, #121734, #124107	2024-04-24 13:58:29 +00:00
Kai Londenberg	e76b5e3cc8	[Inductor Cutlass backend] Disable epilogue fusions (#124107 ) This diff disables Cutlass backend EVT epilogue fusions. It does not yet contain the removal of most of the underlying implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124107 Approved by: https://github.com/jansel ghstack dependencies: #121497, #123930, #123932, #121734	2024-04-24 13:56:44 +00:00
Kai Londenberg	537aebc99f	[Inductor cutlass backend] Add bmm support (#121734 ) Add support for bmm ( batch matrix multiply ) op through the Cutlass backend. Test Plan: * CI * Added test in test_cutlass_backend.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/121734 Approved by: https://github.com/eellison ghstack dependencies: #121497, #123930, #123932	2024-04-24 13:54:54 +00:00
Jane Xu	fb69eef1b4	Add int testing for foreach_copy on cuda (#124703 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124703 Approved by: https://github.com/crcrpar, https://github.com/albanD	2024-04-24 13:11:56 +00:00
Andrew Gu	89ca0cb7a0	[FSDP2] Added test to show rank 0 CPU full state dict flow (#124741 ) This PR adds a unit test to show how we can convert FSDP2 GPU sharded state dicts to a CPU full state dict on rank 0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124741 Approved by: https://github.com/wanchaol, https://github.com/wz337 ghstack dependencies: #124651	2024-04-24 13:02:19 +00:00
Edward Z. Yang	e0e2d897ed	Handle Tensor returns in PropagateUnbackedSymInts (#124297 ) This subsumes https://github.com/pytorch/pytorch/pull/124069 In the original PR, my idea was that when we run PropagateUnbackedSymInts, we check that the sizes before and after are exactly the same. This ended up turning up lots of bugs that I didn't feel like fixing. Separately, Ivan let me know that this pass was quite expensive in terms of compile time, since we spent a lot of time thinking about the equalities. To kill two birds with one stone, we now only check for equality precisely when an unbacked SymInt was bound (thanks to the previous PR in this stack, we now have this information). Specifically, we look to see if `meta["unbacked_bindings"]` is set on the old node, and if it is, we assert the old value is equal to the new value from the repropagation. Note that the pytree key is used to actually extract the new value from the example value, as it may be nested inside an, e.g., tensor size. We do something a bit naughty at the end: we use `defer_runtime_assert` to actually teach ShapeEnv about the equality. This is implementationally equivalent to what we used to do, but we're going to change this later soon. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124297 Approved by: https://github.com/lezcano ghstack dependencies: #124290	2024-04-24 12:18:33 +00:00
Edward Z. Yang	b04dca1502	Add pending_fresh_unbacked_symbols, populate unbacked_bindings for Dynamo (#124290 ) The important comment: ``` # Whenever we allocate a fresh unbacked Symbol, we add it to this # pending list. Unbacked symbol allocation can occur at unpredictable # points during meta tensor propagation, but at some point, the we # have to know what the binding site for an unbacked symbol is, and # this is computed when we actually place the node in the graph. The # important thing is that we always actually handle every unaccounted # for unbacked symbol, so this list helps us keep track of them and # then make sure they are all accounted for. # # We could potentially give rise to errors earlier by lexically # scoping when we do propagation, and only allowing unbacked symbols # to be allocated at this point in time. However this is inconvenient # to do in Dynamo, because fake tensor propagation is far from when we # analyze binding sites (set_example_value), so we do it in a more # mutatey way. # # NB: fresh unbacked symbols NEVER get substitutions applied to them, # they are binding sites! ``` The compute_unbacked_bindings is the other half of the equation: the thing that actually consumes the pending_fresh_unbacked_symbols and does something with them. Important comment: ``` After having run fake tensor propagation and producing example_value result, traverse example_value looking for freshly bound unbacked symbols and record their paths for later. It is an error if we have allocated an unbacked SymInt but it cannot be found in example_value. (NB: this means if you have a multi-output function, you must call this on the tuple of tensor output, you cannot wait!) ``` For example, if I return a tensor with size `[u0, u1]`, and u1 is a fresh unbacked SymInt, then I'll have `{u1: KeyPath(".size(1)")}`, telling me I can get u1 by running `size(1)` on the result of this node. 
u0 is not fresh (it probably flowed in as an argument), so I don't generate a binding for it. I eventually intend to propagate this information all the way to Inductor lowering, where extra metadata about unbacked symbol binding will be canonically used for codegen, instead of trying to infer it from defs/uses. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124290 Approved by: https://github.com/lezcano	2024-04-24 09:11:34 +00:00
DanilBaibak	0848051844	Migrate linux-test Job to ARC (#124386 ) Migrate linux-test Job to ARC * Separated `_linux-test-label.yml` workflow to use the `label`; * Separated `_linux-test-rg.yml` workflow to use the `group`; Pull Request resolved: https://github.com/pytorch/pytorch/pull/124386 Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt	2024-04-24 06:48:19 +00:00
Chien-Chin Huang	290bfbe01f	[DDP][PT2D] Lazy Initialization of DDP Module for Replicate API (#123424 ) In order to make replicate work with meta tensors, we need to do lazy initialization for the replicate API. This PR implements the lazy initialization and ensures that replicate still works with the new DDP compilation. Differential Revision: [D55787340](https://our.internmc.facebook.com/intern/diff/D55787340/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123424 Approved by: https://github.com/yf225 ghstack dependencies: #124421, #124422	2024-04-24 06:30:19 +00:00
Teja Rao	81740fd1f6	[DCP] minor readability fix: make param name consistent with overridden function (#124770 ) Summary: This diff has no logic changes. It updates the variable names to be in sync with the name used in prepare_global_plan in StorageWriter. Pasting func signature for easy reference - abc.abstractmethod def prepare_global_plan(self, plans: List[SavePlan]) -> List[SavePlan]: Differential Revision: D56480396 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124770 Approved by: https://github.com/fegin	2024-04-24 05:31:26 +00:00
Florian	34f468e66f	remove the redundant '* 1000' to timestamp (#124374 ) activity->timestamp() is already in nanosecond granularity, so there is no need to multiply by 1000. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124374 Approved by: https://github.com/aaronenyeshi	2024-04-24 04:57:44 +00:00
Wanchao Liang	0da94f3a08	[device_mesh] add a private init backend option (#124780 ) This PR adds a private init backend option to address an issue with submesh creation: during device mesh slicing we don't want to create process groups again, so it is useful to turn group creation off explicitly. There may also be more submesh creation functionality in the future, so having this flag ensures that no new group is created. Differential Revision: [D56497780](https://our.internmc.facebook.com/intern/diff/D56497780) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124780 Approved by: https://github.com/awgu	2024-04-24 04:31:58 +00:00
Boyuan Feng	b91f83f181	[cudagraph] add config for cudagraph managed input mutation support (#124754 ) Summary: [#123231](https://github.com/pytorch/pytorch/pull/123231) adds cudagraph support for more types of functions (i.e., cudagraph managed input mutation). These newly supported functions may have mutated static inputs, leading to assertion errors in some workloads which previously skipped cudagraph. This diff adds a config to opt in to the new feature. Test Plan: ci Differential Revision: D56481353 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124754 Approved by: https://github.com/eellison	2024-04-24 04:23:53 +00:00
Huy Do	bee924d173	Enable test config selection when doing workflow dispatch (#124795 ) Fixes https://github.com/pytorch/test-infra/issues/4468 This is done by updating the filter config script to accept a list of test configs coming from workflow dispatch. For example, having `inductor_huggingface_perf,inductor_timm_perf,inductor_torchbench_perf` will benchmark all 3 datasets, while having `inductor_torchbench_perf` will only run TorchBench. This is exposed via a new string workflow dispatch parameters called `benchmark_configs`. Note that GH limits the maximum number of workflow dispatch parameters to 10, so I need to consolidate `training` and `inference` into `training_and_inference` to squeeze the new parameter into the list. ### Testing Run the script manually and confirm that the filtered list of test config is correct. Also manually dispatch the job with the new parameter https://github.com/pytorch/pytorch/actions/runs/8808159905 and only the selected `inductor_huggingface_perf` is kept https://github.com/pytorch/pytorch/actions/runs/8808159905/job/24176683708#step:11:128 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124795 Approved by: https://github.com/clee2000	2024-04-24 03:13:38 +00:00
Alexander Grund	9dded148d0	Fix test_extension_backend on non-AVX systems (#117272 ) The test checks for a substring "loadu" in generated code. On AVX systems that line is: > auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(i0)) however on non-AVX systems it is > auto tmp0 = in_ptr0[static_cast<long>(i0)]; the difference depends on `codecache.valid_vec_isa_list()` being non-empty. See torch/_inductor/codegen/cpp.py:2639 Modify the test to account for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117272 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-04-24 02:55:12 +00:00
Peter Y Yeh	2e7b8ff116	[ROCm] Fix Int_mm() Integration with hipblasLT (#122431 ) The PR - fixes int_mm() /int8_gemm() integration with hipblasLT backend (require ROCm 6.0). - enables/fixes the following tests on Rocm - test__int_mm_k_16_n_16_use_transpose_a_False_use_transpose_b_False_cuda - test__int_mm_k_16_n_16_use_transpose_a_False_use_transpose_b_True_cuda - test__int_mm_k_16_n_16_use_transpose_a_True_use_transpose_b_False_cuda - test__int_mm_k_16_n_16_use_transpose_a_True_use_transpose_b_True_cuda - test__int_mm_k_16_n_32_use_transpose_a_False_use_transpose_b_False_cuda - test__int_mm_k_16_n_32_use_transpose_a_False_use_transpose_b_True_cuda - test__int_mm_k_16_n_32_use_transpose_a_True_use_transpose_b_False_cuda - test__int_mm_k_16_n_32_use_transpose_a_True_use_transpose_b_True_cuda - test__int_mm_k_32_n_16_use_transpose_a_False_use_transpose_b_False_cuda - test__int_mm_k_32_n_16_use_transpose_a_False_use_transpose_b_True_cuda - test__int_mm_k_32_n_16_use_transpose_a_True_use_transpose_b_False_cuda - test__int_mm_k_32_n_16_use_transpose_a_True_use_transpose_b_True_cuda - test__int_mm_k_32_n_32_use_transpose_a_False_use_transpose_b_False_cuda - test__int_mm_k_32_n_32_use_transpose_a_False_use_transpose_b_True_cuda - test__int_mm_k_32_n_32_use_transpose_a_True_use_transpose_b_False_cuda - test__int_mm_k_32_n_32_use_transpose_a_True_use_transpose_b_True_cuda Pull Request resolved: https://github.com/pytorch/pytorch/pull/122431 Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet, https://github.com/atalman	2024-04-24 02:29:33 +00:00
Sergii Dymchenko	f0f7452e31	Do not propogate (#124769 ) Fix the propogate typos. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124769 Approved by: https://github.com/Skylion007	2024-04-24 02:18:18 +00:00
Tristan Rice	952a00eda7	torchelastic: change monitor_interval default to 0.1 (#124692 ) This reduces the default monitor_interval for torchelastic to 0.1s as testing shows negligible load for common use cases. Even at the extremes, 100k processes is only 45.4% cpu util of a single core. Torchelastic monitor_interval only monitors the processes on a single worker so under typical loads even for huge jobs we expect ~8 subprocesses per machine with one per GPU. As an external datapoint, Python's wait polls every 50usec-50ms (https://github.com/python/cpython/blob/main/Lib/subprocess.py#L2035). ## Motivation This setting is used to control how frequently we poll for failed processes in elastic. * For some jobs of note we run elastic 3 times per try so with the default timeout of 5 seconds we should save ~15 seconds per retry. * @kiukchung's use case: Apparently this is annoying in notebooks etc since it adds delay to shutdown when testing things ## Results This is measured in cores (100% is a single core under full load). \| monitor_interval (s) \| nproc-per-node \| CPU util (highest observed) \| \| -------------------- \| -------------- \| --------------------------- \| \| 1.0 \| 10 \| 0.2% \| \| 0.1 \| 1 \| 0.4% \| \| 0.1 \| 10 \| 0.4% \| \| 0.01 \| 10 \| 0.9% \| \| 0.001 \| 10 \| 4.0% \| \| 0.1 \| 100 \| 0.5% \| \| 0.1 \| 1000 \| 2.2% \| \| 0.1 \| 10000 \| 15.7% \| \| 0.1 \| 100000 \| 45.4% \| ## Methodology ```sh # run command $ LOGLEVEL=INFO torchrun --nnodes 1 --nproc-per-node 10 --monitor-interval 0.1 ~/wait.py # wait a few seconds for all processes to start and reach steady state and then run, wait ~30s or 3 prints and take the highest $ top -b -d 10 -c \| rg 'torchrun.wait ``` wait.py ```py import time time.sleep(1060) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124692 Approved by: https://github.com/kiukchung, https://github.com/kurman	2024-04-24 01:44:41 +00:00
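To illustrate what a monitor interval controls in the entry above, here is a minimal stdlib-only sketch of a polling loop that watches child processes for failure. It is not torchelastic's implementation; `monitor_workers` and its parameters are hypothetical names chosen for this example.

```python
import subprocess
import sys
import time


def monitor_workers(procs, monitor_interval=0.1):
    """Poll child processes until all succeed or one fails.

    Hypothetical helper (not a torchelastic API). Returns the list of
    exit codes, in the same order as `procs`.
    """
    while True:
        codes = [p.poll() for p in procs]  # None means still running
        if any(c not in (None, 0) for c in codes):
            # A worker failed: stop the remaining workers and report.
            for p in procs:
                if p.poll() is None:
                    p.kill()
            for p in procs:
                p.wait()
            return [p.returncode for p in procs]
        if all(c == 0 for c in codes):
            return codes
        # A shorter interval detects failures sooner at the cost of more polling.
        time.sleep(monitor_interval)


if __name__ == "__main__":
    workers = [
        subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"]),
        subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"]),
    ]
    print(monitor_workers(workers, monitor_interval=0.1))
```

In this sketch the worst-case delay between a worker failing and the monitor noticing is roughly one `monitor_interval`, which is why lowering the default shortens shutdown and retry latency.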
Zain Rizvi	03fa2421dc	Get ARC jobs to run on both classic and ARC infra (#124753 ) ARC jobs are too unstable right now. We're going to mitigate this by: - Reverting ARC jobs to run on the classic infra (https://github.com/pytorch/pytorch/pull/124748) - Spin up new jobs in parallel to run on the new infra. (this PR) - Mark these ARC jobs as unstable (will be done before merging this PR) More details in https://github.com/pytorch/ci-infra/issues/149 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124753 Approved by: https://github.com/zxiiro, https://github.com/seemethere	2024-04-24 01:34:50 +00:00
wz337	2716e77cf7	[FSDP1][2D] Fix FSDP1 2D state_dict to use run_check=False (#123802 ) `from_local` with replicate placement would run mesh_broadcast if `run_check=True`, and by default `from_local` has `run_check=True`. For the FSDP state_dict case, however, we know for sure that these are already replicated on the dp dimension (FSDP + TP), so we don't need to check or force a check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123802 Approved by: https://github.com/wanchaol	2024-04-24 01:25:11 +00:00
Jiang, Yanbing	57a12d9d0f	Add Half support to torch.sparse.addmm for CPU (#124694 ) This PR is to add Half support to torch.sparse.addmm for CPU. It is a requested feature in model DCRNN for Half data type. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124694 Approved by: https://github.com/pearu	2024-04-24 01:24:01 +00:00
Jeff Daily	1ab0b3c9f8	[ROCm] avoid heap buffer overflow in hiprtc failure logs (#121865 ) hiprtc doesn't seem to include the null byte automatically in the failure logs, resulting in heap buffer overflow. Initializing the log string with the null byte avoids the problem. Found by rocm address sanitizer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121865 Approved by: https://github.com/malfet	2024-04-24 01:09:08 +00:00
andrewor14	4efb28c900	[quant][pt2e] Move batch norm op between eval/train for cuda (#123957 ) Summary: Before in `move_exported_model_to_train/eval`, we only switched the CPU versions of the batch norm op. This commit adds support for the cuda versions of the op too. Note that this fix is temporary; we won't have to differentiate between these two cases once we have batch norm consolidation. Test Plan: python test/test_quantization.py -k test_move_exported_model_bn Reviewers: jerryzh168 Subscribers: jerryzh168, leslie-fang-intel, supriyar Differential Revision: [D56070054](https://our.internmc.facebook.com/intern/diff/D56070054) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123957 Approved by: https://github.com/jerryzh168	2024-04-24 01:02:59 +00:00
eqy	64af899fdf	[cuDNN] cuDNN SDPA (Flash Attention) Backward (#122510 ) #113713 currently passing trivial smoke tests but I just totally pattern-matched bits and pieces of the autograd defs Will also collect benchmark data, CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/122510 Approved by: https://github.com/drisspg	2024-04-24 01:00:34 +00:00
leslie-fang-intel	31ca27af62	Add the quant lift up pass in convert phase (#122777 ) Summary Lift up the quant node before view-like nodes. It can benefit the performance of attention-like blocks. For example, we have the pattern as: ``` DQ DQ LINEAR LINEAR VIEW VIEW PERMUTE PERMUTE TRANSPOSE Q Q DQ DQ Matmul DIV ADD SOFTMAX ``` We want to lift up the quant nodes from `matmul` before view-like nodes as the output of the Linear node. ``` DQ DQ LINEAR LINEAR Q Q VIEW VIEW PERMUTE PERMUTE TRANSPOSE DQ DQ Matmul DIV ADD SOFTMAX ``` It produces a `DQ->LINEAR->Q` pattern which can be fused by the backend. Test Plan ``` python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_attention_block ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122777 Approved by: https://github.com/jerryzh168, https://github.com/jgong5 ghstack dependencies: #122776	2024-04-24 00:57:59 +00:00
Tugsbayasgalan Manlaibaatar	c933af2709	Switch to predispatch (#123573 ) This PR switches export IR from aot-dispatch to pre-dispatch IR. What is pre-dispatch IR and why should you care? Currently the default IR returned by torch.export can contain only functional ATen operators after ALL pytorch dispatcher decompositions (for example, CompositeImplicitAutograd) run. In contrast, pre-dispatch IR refers to an IR that can contain all functional ATen operators (i.e., not just from the core subset), before any decomposition happens, as well as operators that manipulate autograd state. Pre-dispatch IR closely resembles eager PyTorch computation, but is still functional and serializable by torch.export. As a result: - You can train the pre-dispatch IR in eager mode as the IR contains necessary information for the autograd engine to automatically generate a backward graph. - You can write sound graph transformations more easily as the IR is functional. - Since it is an ATen IR, it is still normalized. For example, torch.add has multiple overloads, but aten.add.Tensor is unique in this IR. If you want to get the core aten IR out of `torch.export`, you will need to: ``` ep = torch.export.export(M(), inputs) ep_for_core_aten = ep.run_decompositions() ``` Differential Revision: [D56273267](https://our.internmc.facebook.com/intern/diff/D56273267) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123573 Approved by: https://github.com/gmagogsfm	2024-04-24 00:51:09 +00:00
Aaron Enye Shi	3145522427	[Profiler] Update third_party/kineto submodule hash (#124737 ) Summary: Include improvements such as: - AMD: roctracer crash fix and roctracer external correlations - NCCL metadata: process group id to process group name - Complete nanosecond transition for Kineto - Remove PrivateUse1Type function causing gpu track to be above cpu tracks - Use relative time and fix gpu user annotation causing events to overlap Test Plan: CI and Github CI (full suite) Reviewed By: sraikund16 Differential Revision: D56475055 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/124737 Approved by: https://github.com/davidberard98, https://github.com/malfet	2024-04-24 00:30:17 +00:00
Andrew Gu	e8f9f37b03	[FSDP2] Added test to show rank 0 broadcast meta-device flow (#124651 ) This PR includes two things: 1. Changes to support `load_state_dict(assign=True)` - These changes are not ideal, but until we have `DTensor` padding the local tensor and general `swap_tensors` adoption, we may need to make do. 2. Example of how to convert a full state dict on rank 0 to sharded state dict on all ranks via broadcast - ~~To-do: check for `recordStream` from the funcol broadcast; if being called, remediate either via `async_op=False` c10d broadcast or use `TORCH_NCCL_AVOID_RECORD_STREAMS=1`~~ switched to using c10d `async_op=False` broadcast - To-do: check for broadcast latency since not using any coalescing Pull Request resolved: https://github.com/pytorch/pytorch/pull/124651 Approved by: https://github.com/wanchaol	2024-04-24 00:18:23 +00:00
Jeff Daily	a21327e0b0	[ROCm] update hipDataType support and hipify mappings (#120751 ) The hipDataType support and mappings are now up to date as of ROCm 5.7. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120751 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet	2024-04-23 23:21:56 +00:00
Kurman Karabukaev	1c4ad87396	[TorchElastic] Option to enable TCPStore libuv backed (#124684 ) Summary: The libuv backend isn't enabled in PTD by default now. Add an option to enable the libuv backend to improve scaling of the rendezvous process. Tries not to make assumptions about the default libuv settings in TCPStore since they may change in the next release. Test Plan: CI Differential Revision: D56435815 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124684 Approved by: https://github.com/d4l3k, https://github.com/XilunWu	2024-04-23 23:12:17 +00:00
eellison	3999b72d46	Dont error in check consistency if old is None (#124751 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124751 Approved by: https://github.com/ezyang	2024-04-23 22:26:52 +00:00
Zain Rizvi	98ffdf930c	Revert ARC jobs to run on classic infra again (#124748 ) ARC jobs are too unstable right now. We're going to mitigate this by: 1. Reverting ARC jobs to run on the classic infra (this PR) 2. Spin up new jobs in parallel, marked as unstable, to run on the new infra. (coming soon) More details in https://github.com/pytorch/ci-infra/issues/149 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124748 Approved by: https://github.com/seemethere, https://github.com/zxiiro, https://github.com/malfet, https://github.com/jeanschmidt	2024-04-23 22:24:31 +00:00
PyTorch MergeBot	cc268a710d	Revert "AOTAutograd: gate view-replay behind config, not the default (#124488 )" This reverts commit 47330ca13321a42d4f1e75f091e17183227ae073. Reverted https://github.com/pytorch/pytorch/pull/124488 on behalf of https://github.com/seemethere due to submodule update causes xla to start failing see job on branch: https://github.com/pytorch/pytorch/actions/runs/8789091145/job/24124569508, Dr. CI incorrectly marked this as flaky and allowed the merge ([comment](https://github.com/pytorch/pytorch/pull/124488#issuecomment-2073568651))	2024-04-23 22:21:50 +00:00
rzou	4ceb44c40d	Add torch.library.opcheck (#124496 ) This PR: - exposes torch.testing._internal.optests.opcheck as torch.library.opcheck - Adds support for CustomOpDef (aka functions decorated with torch.library.custom_op) to opcheck. Test Plan: - Updated tests - We validated opcheck's design internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124496 Approved by: https://github.com/williamwen42	2024-04-23 21:48:00 +00:00
Guilherme Leobas	763dc26e59	[Dynamo] Add dynamo support to torch.func.linearize (#123118 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123118 Approved by: https://github.com/zou3519	2024-04-23 21:31:49 +00:00
Laith Sakka	8cf54929e3	compiletime->compile_time (#124579 ) Summary: title. Test Plan: run strobelight profiler. Reviewed By: oulgen Differential Revision: D56395415 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124579 Approved by: https://github.com/oulgen	2024-04-23 20:50:53 +00:00
Zhengxu Chen	d40774f4ed	[export] Fix up nn_module_stack for nodes occured around tracepoint ops. (#124457 ) Summary: as title. Test Plan: hg checkout D55901896 buck run mode/opt torchrec/ir/tests:test_serializer -- --filter-regex test_serialize_deserialize_ebc Differential Revision: D56340319 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124457 Approved by: https://github.com/tugsbayasgalan	2024-04-23 20:26:44 +00:00
Catherine Lee	e94c846cf7	[ez][TD] Unique td_exclusions file name (#124301 ) * Fix after #124082 I keep forgetting that these files overwrite each other Unrelated but TIL if you want to show the pr/issue title when you link it, it should be in a list Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124301 Approved by: https://github.com/malfet, https://github.com/ZainRizvi	2024-04-23 20:25:27 +00:00
Bin Bao	57e92162eb	[inductor] Keep inductor cache for tests when TORCH_COMPILE_DEBUG is specified (#124755 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124755 Approved by: https://github.com/masnesral	2024-04-23 20:22:55 +00:00
Jason Ansel	5532c7949f	Fix import error in update_failures.py (#124695 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124695 Approved by: https://github.com/zou3519	2024-04-23 20:09:49 +00:00
Pian Pawakapan	e112792a69	[export] refactor _AddRuntimeAssertionsForInlineConstraintsPass (#124503 ) Summary: The current _AddRuntimeAssertionsForInlineConstraintsPass has 2 known issues caused by its use of torch.fx.Interpreter: 1. SymInt-related ops (e.g. item()) are executed, causing new Unbacked SymInts to appear in the graph during the pass. 2. The graph is reconstructed, and node names/indices can be different from before, causing mismatches with `module_call_graph`, and leading to issues during unflattening. This refactors the pass to use PassBase instead of _ExportPassBaseDeprecatedDoNotUse, only constructing new nodes for assertions. Test Plan: This pass is called on all strict-mode export calls with range_constraints, test that behavior remains unchanged. Differential Revision: D56360137 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124503 Approved by: https://github.com/zhxchen17	2024-04-23 20:07:49 +00:00
Edward Z. Yang	35a448f3cb	Record structured log for overall AOTAutograd backwards compilation (#124648 ) It's sort of similar to CompilationMetrics but also not quite the same, quite open to bikeshedding. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124648 Approved by: https://github.com/bdhirsh ghstack dependencies: #124626	2024-04-23 19:51:14 +00:00
Aaron Shi	abdd569e41	[easy][test_profiler.py] if tqdm is not available, pass instead of None (#124729 ) Change the try exception to pass when it cannot import tqdm. To address comment: https://github.com/pytorch/pytorch/pull/124409#discussion_r1576327365 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124729 Approved by: https://github.com/malfet, https://github.com/shengfukevin	2024-04-23 18:39:39 +00:00
Matthew Hoffman	1d3a13d3d1	Conform torch.mps to device module interface (#124676 ) Right now `torch.fork_rng()` doesn't support MPS. MPS' device module functions don't line up with the others'. There is a step of `fork_rng` to call `device_count()`: `302d7e9a6e/torch/random.py (L146)` It is pretty simple to know the MPS device count, based on whether it is built and available. Also: `302d7e9a6e/torch/random.py (L168)` `302d7e9a6e/torch/random.py (L175)` `get_rng_state` and `set_rng_state` are expected to be able to accept a `device` parameter. @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/124676 Approved by: https://github.com/ezyang	2024-04-23 18:38:48 +00:00
Shengbao Zheng	4e66aaa010	update kineto submodel commit id to include new pg naming (#124332 ) Summary: Update kineto submodule commit id so that pytorch profiler can pick up kineto PR #906 Test Plan: N/A Differential Revision: D56273619 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124332 Approved by: https://github.com/aaronenyeshi	2024-04-23 17:58:10 +00:00
chilli	7c253a7776	Add support for capturing tensors with score_mod (#124444 ) ``` import torch from torch import nn import torch.nn.functional as F import torch._inductor.config as config # torch.set_default_device('cuda') import torch from torch.nn.attention._templated_attention import _templated_attention as templated_attention from triton.testing import do_bench from torch.nn.attention import SDPBackend, sdpa_kernel index = torch.ops.aten torch.manual_seed(0) B = 16 H = 16 S = 2048 D = 64 head_scale = torch.randn(H, device='cuda') def alibi(score, batch, head, token_q, token_kv): return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv) bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda') query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) compiled = torch.compile(templated_attention) out = compiled(query, key, value, score_mod=alibi) out2 = templated_attention(query, key, value,score_mod=alibi) print((out - out2).abs().mean()) assert (out - out2).abs().mean() < 1e-3 print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value))) print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias))) print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi))) ``` <img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444 Approved by: https://github.com/jansel, https://github.com/drisspg	2024-04-23 17:54:08 +00:00
Jason Ansel	0792ceab4b	[dynamo] Refactor into torch/_inductor/runtime/compile_tasks.py (#124681 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124681 Approved by: https://github.com/masnesral ghstack dependencies: #124592	2024-04-23 17:51:25 +00:00
Jason Ansel	5d45eb77f1	[inductor] Remove usage of device_interface from _inductor.runtime (#124592 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124592 Approved by: https://github.com/masnesral	2024-04-23 17:51:25 +00:00
Chen, Zejun	25a2d18dd9	[Profiler] iterate frontend function events for profiler post processing (#124596 ) The `function_events` in `_parse_kineto_results` is used to contain all function events from the result. It contains 2 kinds of events. One is frontend function events whose correlation id is 0, for example, `aten::add`, `aten::mul`. They are on the top level of the profile results. The other is the backend events, which are associated with the frontend events and its correlation id is > 0, for example, `at::native::vectorized_elementwise_kernel`, it should be the backend event of a frontend element-wise op. They have the device execution duration for the related frontend op. In the following post processing code, the frontend function events should be iterated to find its correlated backend events in `device_corr_map`, instead of iterating all function events, because `device_corr_map` is designed as a dict, whose key is the id of the frontend function event. `3af12447f8/torch/autograd/profiler.py (L543-L560)` `3af12447f8/torch/autograd/profiler.py (L537-L540)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124596 Approved by: https://github.com/aaronenyeshi	2024-04-23 17:40:32 +00:00
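The iteration change described in the entry above can be sketched with plain Python data structures. This is a simplified illustration, not the code in `torch/autograd/profiler.py`; the `Event` class, its fields, and `attach_device_events` are hypothetical, and the only details taken from the commit message are that frontend events have correlation id 0 and that `device_corr_map` is keyed by the frontend event's id.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical stand-in for a profiler function event."""
    id: int
    name: str
    correlation_id: int = 0  # 0 for frontend events, > 0 for backend events
    kernels: list = field(default_factory=list)


def attach_device_events(function_events, device_corr_map):
    """Attach backend events to their frontend events.

    Only frontend events are looked up, because the map is keyed by
    frontend event id; backend events are skipped.
    """
    for fe in function_events:
        if fe.correlation_id != 0:
            continue  # backend event: not a valid key into device_corr_map
        fe.kernels.extend(device_corr_map.get(fe.id, []))
    return function_events


if __name__ == "__main__":
    add = Event(id=1, name="aten::add")
    kernel = Event(id=2, name="vectorized_elementwise_kernel", correlation_id=7)
    attach_device_events([add, kernel], {1: [kernel]})
    print([k.name for k in add.kernels])
```

Skipping backend events avoids looking them up under ids that the map was never designed to hold.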
Chien-Chin Huang	05db64024c	[DDP][PT2D] Correctly calculate the numel with symint in DDP fusion (#124422 ) As title Differential Revision: [D56315533](https://our.internmc.facebook.com/intern/diff/D56315533/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124422 Approved by: https://github.com/yf225 ghstack dependencies: #124421	2024-04-23 17:06:18 +00:00
Brian Hirsh	47330ca133	AOTAutograd: gate view-replay behind config, not the default (#124488 ) Fixes https://github.com/pytorch/pytorch/issues/124499 (I also changed the warn to an info to avoid noise) That'll take some investigation, but rather than reverting I'm gating the view-replay behind a config that I default to False. To get the behavior back for XLA, can you have `import torch_xla` set this config? Pull Request resolved: https://github.com/pytorch/pytorch/pull/124488 Approved by: https://github.com/ezyang, https://github.com/Microve	2024-04-23 16:15:50 +00:00
Bin Bao	b2fd224f27	[AOTI] Add more ABI-compatiblity unit test (#123900 ) Summary: Follow https://github.com/pytorch/pytorch/pull/123848, and test more c10 util functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123900 Approved by: https://github.com/chenyang78	2024-04-23 16:06:40 +00:00
Scott Wolchok	e558008a05	[PyTorch] Add test that canEnableStaticRuntime rejects prim::CallMethod (#120853 ) Rejecting prim::CallMethod is called out in a comment in impl.cpp, but doesn't seem to be tested. Now it is. Differential Revision: [D54338261](https://our.internmc.facebook.com/intern/diff/D54338261/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120853 Approved by: https://github.com/houseroad	2024-04-23 15:56:42 +00:00
Zain Rizvi	fb6d052e9c	Specify the exact table we upload metrics to (#124321 ) Part of https://github.com/pytorch/ci-infra/issues/113 Since this table is only located in one AWS account, but the ARC account also needs to access it, explicitly specify the account name for the table	2024-04-23 10:55:52 -05:00
zdevito	772ae6da1e	Fast standalone symbolize for unwinding (#123966 ) We've had issues using addr2line. On certain versions of CentOS it is on a version that has a performance regression making it very slow, and even normally it is not that fast, taking several seconds even when parallelized for a typical memory trace dump. Folly Symbolize or LLVMSymbolize are fast, but they require PyTorch to take a dependency on those libraries, and given the number of environments we run stuff in, we end up hitting cases where we fall back to slow addr2line behavior. This adds a standalone symbolizer to PyTorch, similar to the unwinder, which has no external dependencies and is ~20x faster than addr2line for unwinding PyTorch frames. I've tested this on some memory profiling runs using all combinations of {gcc, clang} x {dwarf4, dwarf5} and it seems to do a good job at getting line numbers and function names right. It is also careful to route all reads of library data through the `CheckedLexer` object, which ensures it is not reading out of bounds of the section. Errors are routed through UnwindError so that those exceptions get caught and we produce a ?? frame rather than crash. I also added a fuzz test which gives all our symbolizer options random addresses in the process to make sure they do not crash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123966 Approved by: https://github.com/ezyang	2024-04-23 15:27:18 +00:00
Pian Pawakapan	cf98cab1b6	[export] Forward fix XNNPackQuantizer.module_type_config to detect str nn_module_stack (#123662 ) https://github.com/pytorch/pytorch/pull/123308 previously changed the nn_module_stack format (module type -> module str). This modifies XNNPackQuantizer's module_type_config to detect class module strs instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123662 Approved by: https://github.com/williamwen42	2024-04-23 15:21:37 +00:00
Peter Bell	7ecbbc40c3	[HOP][inductor] Add higher order associative scan operator (#119430 ) Currently only supports single tensor scans, e.g. `cumsum`, `cumprod`, `logcumsumexp` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119430 Approved by: https://github.com/Chillee	2024-04-23 14:40:13 +00:00
Edward Z. Yang	64491c0811	Restore CompileContext as well in backwards (#124626 ) This should fix many of the unknown compile id problems currently afflicting tlparse backwards analysis. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124626 Approved by: https://github.com/bdhirsh	2024-04-23 14:39:52 +00:00
PyTorch MergeBot	4f3e1f1c93	Revert "Add support for capturing tensors with score_mod (#124444 )" This reverts commit e0c5113dec79608941db69ae091dfe8893f9a14f. Reverted https://github.com/pytorch/pytorch/pull/124444 on behalf of https://github.com/malfet due to This is weird, but somehow profile test started to timeout after this PR, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=noGPU_AVX512 ([comment](https://github.com/pytorch/pytorch/pull/124444#issuecomment-2072506731))	2024-04-23 14:39:37 +00:00
rzou	5b98d43488	Verify types in custom op schemas (#124520 ) Before this PR, we didn't check that types in a schema were valid. This is because TorchScript treats unknown types as type variables. This PR checks types in a schema for the TORCH_LIBRARY APIs. To do this, we add an `allow_typevars` flag to parseSchema so that TorchScript can use allow_typevars=True. We also add some error messages for common mistakes (e.g. using int64_t or double in schema). Test Plan: - new tests Differential Revision: [D56432690](https://our.internmc.facebook.com/intern/diff/D56432690) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124520 Approved by: https://github.com/albanD	2024-04-23 14:18:35 +00:00
Amadeusz Skrzypczak	107f944f22	Support fp8 quantization (#123161 ) This commit enables float8_e5m2 and float8_e4m3fn dtypes in fx quantization and PT2E. Motivation for using fp8 quantization instead of int8: - it works better to run inference with the same datatype the model was trained with, - fp8 can handle outliers better, which is one of the problems in LLMs activations. The numerical recipe we want to use it for is fp8 inference: - bgemms/gemms running in float8_e4m3fn, - Per-Tensor-Quantization/Scaling, - amax observer for measurement with input_backoff and weight_backoff. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123161 Approved by: https://github.com/jgong5, https://github.com/jerryzh168	2024-04-23 13:35:27 +00:00
Kai Londenberg	f8f6c460cd	[Inductor max autotune] Make autotuning robust against very slow Kernels (#123932 ) If a Kernel does not return in a reasonable amount of time during autotuning, it can delay inductor compilation a lot. This change introduces soft / hard kill timeouts and a mechanism to kill Kernels being profiled in subprocesses if they take too long. Correspondingly, a few new config options are introduced within _inductor/config.py, all of them with inline docs. Test Plan: Existing tests within test_max_autotune.py and test_cutlass_backend.py cover the new codepaths. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123932 Approved by: https://github.com/jansel ghstack dependencies: #121497, #123930	2024-04-23 11:56:15 +00:00
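The soft / hard kill idea in the entry above can be illustrated with a stdlib-only sketch. It does not use Inductor's actual config options or subprocess pool; `run_with_timeouts`, `soft_timeout`, and `hard_timeout` are hypothetical names, and the sketch assumes the slow work runs in its own child process.

```python
import subprocess
import sys


def run_with_timeouts(cmd, soft_timeout, hard_timeout):
    """Run `cmd` in a subprocess with a two-stage timeout.

    Hypothetical helper. After `soft_timeout` seconds the child is asked to
    terminate; if it is still alive at `hard_timeout` seconds it is killed.
    Returns (returncode, outcome), where outcome is one of
    "finished", "soft_killed", or "hard_killed".
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=soft_timeout)
        return proc.returncode, "finished"
    except subprocess.TimeoutExpired:
        proc.terminate()  # soft kill: polite termination request
    try:
        proc.wait(timeout=max(hard_timeout - soft_timeout, 0))
        return proc.returncode, "soft_killed"
    except subprocess.TimeoutExpired:
        proc.kill()  # hard kill: cannot be ignored by the child
        proc.wait()
        return proc.returncode, "hard_killed"


if __name__ == "__main__":
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeouts(slow, soft_timeout=0.5, hard_timeout=1.5))
```

Separating the two stages gives a well-behaved child a chance to clean up before the unconditional kill, while still bounding the total time spent on any one candidate.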
Yu, Guangye	25f321b84f	Refactor autocast C++ APIs to be device-agnostic (#124359 ) # Motivation This PR aims to refactor autocast C++ APIs to be device-agnostic and deprecate the device-specific autocast C++ APIs. In C++ side, - `is_enabled()` -> `is_enabled(device_type)`. - `set_enabled(new_enabled)` -> `set_enabled(device_type, new_enabled)`. - `get_autocast_dtype()` -> `get_autocast_dtype(device_type)` - `set_autocast_dtype(dtype)` -> `set_autocast_dtype(device_type, dtype)` These following C++ APIs are deprecated and should be removed in PyTorch 2.5 - `is_cpu_enabled` - `set_cpu_enabled` - `get_autocast_cpu_dtype` - `set_autocast_cpu_dtype` - `is_xpu_enabled` - `set_xpu_enabled` - `get_autocast_xpu_dtype` - `set_autocast_xpu_dtype` - `is_ipu_enabled` - `set_ipu_enabled` - `get_autocast_ipu_dtype` - `set_autocast_ipu_dtype` - `is_hpu_enabled` - `set_hpu_enabled` - `get_autocast_hpu_dtype` - `set_autocast_hpu_dtype` - `is_xla_enabled` - `set_xla_enabled` - `get_autocast_xla_dtype` - `set_autocast_xla_dtype` - `is_privateuseone_enabled` - `set_privateuseone_enabled` - `get_autocast_privateuseone_dtype` - `set_autocast_privateuseone_dtype` In Python side, provide 4 generic autocast APIs: - `torch.is_autocast_enabled(device_type)` - `torch.set_autocast_enabled(device_type, new_enabled)` - `torch.get_autocast_dtype(device_type)` - `torch.set_autocast_dtype(device_type, dtype)` # Additional Context We will submit another PR to refactor autocast Python APIs based on this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124359 Approved by: https://github.com/jgong5, https://github.com/albanD	2024-04-23 10:38:50 +00:00
haozhe.zhu	3c964ad1ca	add fused_sgd_kernel support for CPU device (#123629 ) Add fused_sgd_kernel support for CPU. ## Bench result: 32 core/sockets ICX Test Scripts: https://gist.github.com/zhuhaozhe/688763e17e93e4c5e12f25f676ec90d9 https://gist.github.com/zhuhaozhe/ad9938694bc7fae8b66d376f4dffc6c9 ``` Tensor Size: 262144, Num Tensor 4, Num Threads: 1 _single_tensor_sgd time: 0.2301 seconds _fused_sgd time: 0.0925 seconds Tensor Size: 4194304, Num Tensor 32, Num Threads: 32 _single_tensor_sgd time: 2.6195 seconds _fused_sgd time: 1.7543 seconds ``` ## Test Plan: ``` python test_optim.py -k test_fused_matches_forloop python test_optim.py -k test_fused_large_tensor python test_optim.py -k test_can_load_older_state_dict python test_optim.py -k test_grad_scaling_autocast_fused_optimizers python test_torch.py -k test_grad_scaling_autocast_fused python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step ``` There are already some PRs under issue https://github.com/pytorch/pytorch/issues/123451 to unify the UTs, so I did not modify the UTs in this PR. Co-authored-by: Jane Xu <janeyx@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123629 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-04-23 08:28:19 +00:00
Nikita Shulga	4efb980e07	[BE] Update older scipy used in CI to 1.8.1 (#124675 ) As older scipy are affected by CVE-2023-29824 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124675 Approved by: https://github.com/kit1980	2024-04-23 06:59:48 +00:00
Chien-Chin Huang	7b6e354ecd	[DDP][PT2D] Fix some tracing bugs of DDP (#124421 ) 1. We need to clear the cache of get_legacy_mod_inlinelist to ensure the newly added rule will be captured. 2. Don't add the hook if the parameter does not require gradient. Differential Revision: [D56315534](https://our.internmc.facebook.com/intern/diff/D56315534/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124421 Approved by: https://github.com/yf225	2024-04-23 06:43:48 +00:00
lezcano	9a5b4d2403	Do not forward parent's value range to CSE variable for variables created within codegen. (#123099 ) Consider we are generating code for `ops.gt`, and within it we call `ops.to_dtype`. Before, we would forward the bounds from `gt` to the result of `to_dtype`, which is wrong. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123099 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-04-23 06:26:39 +00:00
Isuru Fernando	edcd968b51	Add out wrappers to some decompositions (#115437 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115437 Approved by: https://github.com/lezcano	2024-04-23 06:26:11 +00:00
chilli	e0c5113dec	Add support for capturing tensors with score_mod (#124444 ) ``` import torch from torch import nn import torch.nn.functional as F import torch._inductor.config as config # torch.set_default_device('cuda') import torch from torch.nn.attention._templated_attention import _templated_attention as templated_attention from triton.testing import do_bench from torch.nn.attention import SDPBackend, sdpa_kernel index = torch.ops.aten torch.manual_seed(0) B = 16 H = 16 S = 2048 D = 64 head_scale = torch.randn(H, device='cuda') def alibi(score, batch, head, token_q, token_kv): return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv) bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda') query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) compiled = torch.compile(templated_attention) out = compiled(query, key, value, score_mod=alibi) out2 = templated_attention(query, key, value,score_mod=alibi) print((out - out2).abs().mean()) assert (out - out2).abs().mean() < 1e-3 print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value))) print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias))) print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi))) ``` <img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444 Approved by: https://github.com/jansel, https://github.com/drisspg	2024-04-23 06:20:13 +00:00
Mikayla Gawarecki	c82fcb7b30	Add testing and fix `weights_only` load for quantized types and nn.Parameters with python attrs (#124330 ) Adds the following to allowed globals for the `weights_only` unpickler - [x] `torch._utils._rebuild_qtensor` and qtensor related types - [x] `torch._utils._rebuild_parameter_with_state` (used deserializing a parameter that has user-defined attributes like `Param.foo`) The remaining rebuild functions that have not been allowlisted are - [x] `torch._utils._rebuild_wrapper_subclass` (allowlisted in above PR) - [ ] `torch._utils._rebuild_device_tensor_from_numpy` - [ ] `torch._utils._rebuild_xla_tensor` (legacy) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124330 Approved by: https://github.com/albanD	2024-04-23 04:13:26 +00:00
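The allowlisting idea behind the `weights_only` unpickler in #124330 above can be illustrated with the standard library's documented pattern of overriding `pickle.Unpickler.find_class`. This is only a sketch of the general technique, not PyTorch's unpickler: `ALLOWED_GLOBALS`, `AllowlistUnpickler`, and `restricted_loads` are hypothetical names, and the single allowlisted entry is chosen purely for illustration.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical allowlist of (module, name) pairs the loader may resolve.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}


class AllowlistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Resolve a global only if it was explicitly allowlisted.
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            f"global '{module}.{name}' is not allowlisted"
        )


def restricted_loads(data):
    """Unpickle bytes while refusing any global outside the allowlist."""
    return AllowlistUnpickler(io.BytesIO(data)).load()


if __name__ == "__main__":
    print(restricted_loads(pickle.dumps(OrderedDict(a=1))))
    try:
        restricted_loads(pickle.dumps(len))  # builtins.len is not allowlisted
    except pickle.UnpicklingError as exc:
        print("rejected:", exc)
```

Under this scheme, supporting a new serialized type means adding its rebuild function to the allowlist, which is what the checklist in the commit message tracks.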
Nikita Shulga	de5d689cf9	[EZ] Update pillow to 10.3.0 (#124614 ) As older versions are subject to [CVE-2024-28219](https://nvd.nist.gov/vuln/detail/CVE-2024-28219), although it's not super important from CI PoV Modernize `torch/utils/tensorboard/summary.py` to use Pillow-9+ APIs (is this file even used for anything anymore?) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124614 Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi	2024-04-23 03:22:23 +00:00
atalman	7706cd7d12	Extend CPU inductor merge rule (#124671 ) To help unblock: https://github.com/pytorch/pytorch/pull/123710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124671 Approved by: https://github.com/leslie-fang-intel, https://github.com/huydhn	2024-04-23 02:18:00 +00:00
Edward Z. Yang	660db767ef	Don't clean up fresh inductor cache on error (#124620 ) Useful for local debugging. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124620 Approved by: https://github.com/oulgen, https://github.com/desertfire, https://github.com/jansel	2024-04-23 02:13:05 +00:00
Oguz Ulgen	7e095be4b6	Fix test_max_autotune_remote_caching (#124655 ) D55206000 broke this test. It is not clear why it did not run in the CI but here's the fix. Differential Revision: [D56439213](https://our.internmc.facebook.com/intern/diff/D56439213/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124655 Approved by: https://github.com/aorenste	2024-04-23 01:41:04 +00:00
Fadi Botros	375ec25f55	Add missing aten::sort.any op for assistant lm models (#123982 ) Differential Revision: D56084098 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123982 Approved by: https://github.com/JacobSzwejbka	2024-04-23 01:35:07 +00:00
cyy	ea61c9cb29	[Distributed] [5/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124043 ) This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/124032. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124043 Approved by: https://github.com/ezyang	2024-04-23 00:43:50 +00:00
Ashwin Hari	5f5778476a	rename ort to maia (#123265 ) Fixes #123264 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123265 Approved by: https://github.com/albanD	2024-04-23 00:33:25 +00:00
leslie-fang-intel	bffecb5aff	[Inductor] Enable VecMask store (#123710 ) Summary Enable the vectorization of store with `bool` dtype. Test Plan ``` python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_decomposed_fake_quant_per_channel ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123710 Approved by: https://github.com/jgong5, https://github.com/lezcano ghstack dependencies: #123512	2024-04-23 00:29:47 +00:00
leslie-fang-intel	dd440ac734	Add Matmul recipe into x86_inductor_quantizer (#122776 ) Summary Add `matmul` in the quantization recipes, noting that it's not a general recipe but tailored to meet accuracy criteria for specific models. `matmul` recipe is disabled by default. Test Plan ``` python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_attention_block ``` Differential Revision: [D56288468](https://our.internmc.facebook.com/intern/diff/D56288468) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122776 Approved by: https://github.com/jgong5, https://github.com/jerryzh168	2024-04-23 00:25:41 +00:00
soulitzer	1fcdea8cd6	Do not import transformers when import torch._dynamo (#124634 ) Fixes https://github.com/pytorch/pytorch/issues/123954 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124634 Approved by: https://github.com/thiagocrepaldi, https://github.com/Chillee ghstack dependencies: #124343	2024-04-23 00:25:20 +00:00
nopperl	0c21161488	Add meta function for `torch.histc` (#124548 ) Registers a meta function for the `aten.histc.default` and `aten.histc.out` ops to support `torch.compile(dynamic=True)`. Fixes #124512. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124548 Approved by: https://github.com/lezcano, https://github.com/peterbell10	2024-04-23 00:24:59 +00:00
Edward Z. Yang	6054789874	Make numel equal test size oblivious in reshape_symint (#124611 ) Fixes https://github.com/pytorch/pytorch/issues/124581 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124611 Approved by: https://github.com/bdhirsh ghstack dependencies: #124139	2024-04-22 23:59:40 +00:00
Nikita Shulga	abf3f90781	[MPS] Fix large copy (#124635 ) By slicing `copyFromBuffer:sourceOffset:toBuffer:destinationOffset:size:` into 2Gb chunks Add regression test, but limit it to machines with 12Gb of RAM or more, and MacOS 14+, as on MacOS 13 attempt to alloc 4Gb tensor fails with: ``` /AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:724: failed assertion `[MPSNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32' ``` Fixes https://github.com/pytorch/pytorch/issues/124335 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124635 Approved by: https://github.com/kulinseth	2024-04-22 23:43:11 +00:00
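The chunking approach in #124635 above (splitting one large copy into pieces of bounded size) can be sketched in plain Python. This is an illustration of the arithmetic only, not the MPS code: `chunk_ranges` and `chunked_copy` are hypothetical helpers, and the default limit of 2**31 bytes is an assumption standing in for the "2Gb" mentioned in the commit message.

```python
def chunk_ranges(total_bytes, max_chunk=2**31):
    """Yield (offset, size) pairs that cover total_bytes, each size <= max_chunk."""
    offset = 0
    while offset < total_bytes:
        size = min(max_chunk, total_bytes - offset)
        yield offset, size
        offset += size


def chunked_copy(src, dst, max_chunk):
    """Copy src into dst one bounded slice at a time."""
    for offset, size in chunk_ranges(len(src), max_chunk):
        dst[offset:offset + size] = src[offset:offset + size]


if __name__ == "__main__":
    # A 5 GiB copy becomes two full 2 GiB chunks plus a 1 GiB remainder.
    for offset, size in chunk_ranges(5 * 2**30):
        print(offset, size)
```

Each slice carries its own source and destination offset, which matches the shape of a blit call that takes `sourceOffset`, `destinationOffset`, and `size`.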
Yanbo Liang	72a34eeb99	Dynamo x autograd.Function supports non-{Tensor, symnode, constant} inputs (#124360 ) Fixes #118395 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124360 Approved by: https://github.com/zou3519	2024-04-22 23:32:54 +00:00
atalman	302d7e9a6e	[Binary Build] Increase timeout for Linux nightly binary builds (#124668 ) Related issue: https://github.com/pytorch/pytorch/issues/124667. Please note, this is mitigation PR. Will follow up with investigation and proper fix for this. Similar to: `94d6463255` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124668 Approved by: https://github.com/huydhn	2024-04-22 22:38:39 +00:00
Shuqiang Zhang	87a35d5a29	Use new function to log one cluster per line (#124628 ) Summary: For motivation behind the overall stack of diffs see D56218385 summary. This particular diff makes cpp_dumper take a custom printer function to log callstacks one-group-at-a-time and as such no longer running into 30K characters limit of `LOG(INFO)`. Test Plan: ``` [romanmal@46150.od /data/sandcastle/boxes/fbsource/fbcode (520a7b7b5)]$ buck2 test //caffe2/torch/csrc/distributed/c10d/... File changed: fbcode//common/base/ThreadStackTrace.cpp File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/fb/TraceUtils.cpp File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp 4 additional file change events Buck UI: https://www.internalfb.com/buck2/d8ceae86-7d6f-4779-ad0c-8e37eddcff98 Network: Up: 0B Down: 0B Jobs completed: 2. Time elapsed: 1.5s. Tests finished: Pass 0. Fail 0. Fatal 0. Skip 0. Build failure 0 NO TESTS RAN [romanmal@46150.od /data/sandcastle/boxes/fbsource/fbcode (520a7b7b5)]$ ``` Tested to print the stack trace: P1220109730 Differential Revision: D56218360 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124628 Approved by: https://github.com/wconstab	2024-04-22 21:57:39 +00:00
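A small sketch of the "custom printer function" idea in #124628 above: instead of building one very large log record, the dumper hands each group of call stacks to a caller-supplied callback. The names `dump_groups` and `printer` are hypothetical, and the real cpp_dumper is C++; this only illustrates the design in Python.

```python
def dump_groups(groups, printer=print):
    """Emit one log record per group of call stacks.

    groups: iterable of lists of frame names (hypothetical data shape).
    printer: callback that receives one string per group.
    """
    for index, frames in enumerate(groups):
        printer(f"group {index}: " + " <- ".join(frames))


if __name__ == "__main__":
    stacks = [["allreduce", "step", "main"], ["barrier", "main"]]
    dump_groups(stacks)
```

Because every record is bounded by the size of a single group, a per-record character limit in the logging backend is much less likely to truncate the output.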
William Wen	501edc7e59	[inductor, test] remove cast for test_tmp_not_defined_issue2_cpu (#114910 ) Does this verify that https://github.com/pytorch/pytorch/issues/94017 is fixed? Pull Request resolved: https://github.com/pytorch/pytorch/pull/114910 Approved by: https://github.com/angelayi	2024-04-22 21:51:53 +00:00
Sheng Fu	ba3c00c266	[test_profiler.py] Disable tqdm monitor thread and torch.compile with compile_threads=1 (#124409 ) Summary: if tqdm is not shutdown properly, it will leave the monitor thread alive. This causes an issue in the multithreading test because we check all events in that test with their tids. The events that correspond to these lingering threads all have TID of (uint64_t)(-1) which is invalid. The work around is turning off monitoring thread when tqdm is loaded. Since these are unit tests, it is safe to turn off monitor thread. Test Plan: buck test mode/dev-sand caffe2/test:profiler Differential Revision: D56310301 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124409 Approved by: https://github.com/aaronenyeshi	2024-04-22 21:51:14 +00:00
IvanKobzarev	c01499ecc6	[sym_shapes][perf] Cache ShapeEnv constrain_symbol_range calls (#124610 ) Differential Revision: [D56422688](https://our.internmc.facebook.com/intern/diff/D56422688) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124610 Approved by: https://github.com/ezyang, https://github.com/lezcano	2024-04-22 21:49:08 +00:00
Wanchao Liang	05addd5658	[tp] add kwargs support to prepare_module_input (#124114 ) as titled, this PR adds kwargs support to PrepareModuleInput style, where there might be modules who have only kwargs inputs but no positional args, so we should support this Pull Request resolved: https://github.com/pytorch/pytorch/pull/124114 Approved by: https://github.com/XilunWu	2024-04-22 21:46:31 +00:00
Jithun Nair	5785b02ba6	Skip workspace permission change for ROCm CI (#123816 ) PR https://github.com/pytorch/pytorch/pull/122922 added chown steps to test.sh and used the trap mechanism to ensure that, even if the test scripts fails and exits with a non-zero code, it will call the cleanup_workspace function on EXIT. However, this doesn't work as intended when the CI job gets cancelled for eg. if a PR pushes new commits and the older commit CI job gets cancelled. The trap function doesn't get called as the test script immediately aborts. Any subsequent jobs scheduled on the same runner then fail in the 'Checkout PyTorch' step when they try to delete the workspace. This has been resulting in a slew of CI failures on the HUD. Example of this situation playing out on one of the ROCm runners: Cancelled job: https://github.com/pytorch/pytorch/actions/runs/8563212279/job/23469711035 ![image](https://github.com/pytorch/pytorch/assets/37884920/7192e4fe-8cff-4256-abc8-9f874a3918ff) Subsequent failed job: https://github.com/pytorch/pytorch/actions/runs/8564517036/job/23472675041 ![image](https://github.com/pytorch/pytorch/assets/37884920/24b0af66-cfe9-431f-851a-24a1ccc18e84) This PR skips the logic introduced by PR 122922 for ROCm CI. Alternative to https://github.com/pytorch/pytorch/pull/123468 and https://github.com/pytorch/pytorch/pull/123588 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123816 Approved by: https://github.com/pruthvistony, https://github.com/zxiiro, https://github.com/kit1980, https://github.com/malfet	2024-04-22 21:27:32 +00:00
Bin Bao	bb37910e30	[AOTI] Fixes ScatterFallback codegen (#124580 ) Summary: For https://github.com/pytorch/pytorch/issues/123184. ScatterFallback currently relies on op name matching for codegen, which makes its cpp codegen fragile. Refactor to use op_overload and fix the relevant unit test failures. Differential Revision: [D56417815](https://our.internmc.facebook.com/intern/diff/D56417815) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124580 Approved by: https://github.com/chenyang78	2024-04-22 20:47:26 +00:00
Catherine Lee	fd59554be6	Scripts to compile reruns + td exclusions and upload to s3 (#124312 ) Edits upload_test_stats to also upload a condensed version that contains reruns, and one that contains the list of td_exclusions. Grouped by build name + test config Pull Request resolved: https://github.com/pytorch/pytorch/pull/124312 Approved by: https://github.com/malfet	2024-04-22 20:19:35 +00:00
Edward Z. Yang	0bbbc754dd	Add AOTInductor generated cpp code to TORCH_TRACE (#124617 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124617 Approved by: https://github.com/albanD	2024-04-22 19:25:20 +00:00
Jason Ansel	0093735ccd	[inductor] Use compile time config values in runtime (#124561 ) This removes usage of torch._inductor.config from `torch._inductor.runtime`. Fixing two issues: 1) If configs change we should really use the compile time ones 2) In compile workers, we want to use the parent process config Pull Request resolved: https://github.com/pytorch/pytorch/pull/124561 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557, #124559, #124560, #124569	2024-04-22 18:46:40 +00:00
Jason Ansel	cb9fe91f5c	[inductor] Remove config check for 3D tiling (#124569 ) This makes the check per-kernel (if 3D tiling is used), rather than global config. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124569 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557, #124559, #124560	2024-04-22 18:46:40 +00:00
Jason Ansel	4620a45542	[inductor] Refactor runtime files into torch._inductor.runtime (part 5) (#124560 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124560 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557, #124559	2024-04-22 18:46:35 +00:00
Jason Ansel	0cc0e60e30	[inductor] Refactor runtime files into torch._inductor.runtime (part 4) (#124559 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124559 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557	2024-04-22 18:46:29 +00:00
Jason Ansel	7fd8870e6b	[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553	2024-04-22 18:46:24 +00:00
Jason Ansel	bb8815bc31	[inductor] Refactor runtime files into torch._inductor.runtime (part 2) (#124553 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124553 Approved by: https://github.com/yanboliang ghstack dependencies: #124552	2024-04-22 18:46:20 +00:00
Jason Ansel	480585fd2b	[inductor] Refactor runtime files into torch._inductor.runtime (part 1) (#124552 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124552 Approved by: https://github.com/yanboliang	2024-04-22 18:41:12 +00:00
PyTorch MergeBot	16eea7c6a5	Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 1) (#124552 )" This reverts commit a7035cc11aa3aefe1a45a9ba6d0cb4d2a6f2e7c1. Reverted https://github.com/pytorch/pytorch/pull/124552 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))	2024-04-22 18:28:05 +00:00
PyTorch MergeBot	56714cb497	Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 2) (#124553 )" This reverts commit f4d47f5bbb07bed98b1eb8313607be6e94686269. Reverted https://github.com/pytorch/pytorch/pull/124553 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))	2024-04-22 18:28:05 +00:00
PyTorch MergeBot	0b90af0bf5	Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557 )" This reverts commit fcf28b0ad59b1912d5783688b0f25f18b46efeb3. Reverted https://github.com/pytorch/pytorch/pull/124557 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))	2024-04-22 18:28:05 +00:00
PyTorch MergeBot	b3d6c2fe9b	Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 4) (#124559 )" This reverts commit 9ea2a0951005c4bcb2491556a8548319c6cccfdb. Reverted https://github.com/pytorch/pytorch/pull/124559 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))	2024-04-22 18:28:05 +00:00
PyTorch MergeBot	0f44ef93ab	Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 5) (#124560 )" This reverts commit 3ac30bc32ad300d70391ec552e5738d6ed66f9a5. Reverted https://github.com/pytorch/pytorch/pull/124560 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))	2024-04-22 18:28:05 +00:00
PyTorch MergeBot	8973c5b846	Revert "[inductor] Remove config check for 3D tiling (#124569 )" This reverts commit 317c0af149855b5924a59170a18abecca97e2ce0. Reverted https://github.com/pytorch/pytorch/pull/124569 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))	2024-04-22 18:28:05 +00:00
PyTorch MergeBot	30dec1da84	Revert "[inductor] Use compile time config values in runtime (#124561 )" This reverts commit 3af12447f85dfede191a113c052e58fa7b21a8b3. Reverted https://github.com/pytorch/pytorch/pull/124561 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124561#issuecomment-2070537634))	2024-04-22 18:24:38 +00:00
rzou	d77e7b7c54	Make some kernel static asserts clearer (#124519 ) Users get int/int64_t and double/float confused a lot. Test Plan: - tested locally Pull Request resolved: https://github.com/pytorch/pytorch/pull/124519 Approved by: https://github.com/Skylion007	2024-04-22 18:17:40 +00:00
Isuru Fernando	c2f8bfae9c	Make torch._inductor.dependencies.Dep a proper class (#124407 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124407 Approved by: https://github.com/peterbell10	2024-04-22 17:09:34 +00:00
Aleksei Nikiforov	77c35334c1	Fix build on s390x (#123250 ) Rename s390x-specific zvector functions with same name. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123250 Approved by: https://github.com/malfet	2024-04-22 16:57:08 +00:00
Aleksei Nikiforov	be2e56b5ab	s390x: update using vectorization builtins (#124396 ) With gcc >= 12 on s390x store builtins are accidentally optimized out due to bad type aliasing. Ensure that proper corresponding types are used, and if types do mismatch, first store data into array of correct type and then memcpy it to destination pointer. See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114676 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124396 Approved by: https://github.com/malfet	2024-04-22 16:55:18 +00:00
mengfeil	0ee514e628	[CI] Upgrade xpu driver to LTS_803.29 (#123920 ) Upgrade xpu driver from 647.21 to LTS 803.29 Works for #114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123920 Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/huydhn	2024-04-22 16:45:01 +00:00
Thiago Crepaldi	9c2ac4476c	Allow ONNX models without parameters (#121904 ) Currently, if initializers are available, they are included in the ONNX model. If they are not available, the model is serialized without them. However, there are times in which the initializers are available, but the user prefers not to include them in the model, say for visualizing it on Netron or because the initializers will be specified along with the inputs in the onnx runtime of choice. This PR allows users to pass `include_initializers` to `ONNXProgram.save()` API. Fixes #100996 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121904 Approved by: https://github.com/titaiwangms	2024-04-22 15:53:38 +00:00
Jeff Daily	6ede882c0b	preferred blas library; cublaslt gemm implementation (#122106 ) Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources. The default blas implementation remains cublas or hipblas. cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106 Approved by: https://github.com/lezcano	2024-04-22 15:38:22 +00:00
Adnan Akhundov	9a322ba1b0	[user triton] Return unbacked SymInts used in the grid (#124594 ) Summary: When unbacked SymInts are used only in a grid of a user-written Triton kernel call, there is no dependency between the Triton kernel's buffer and those unbacked SymInts. As a result, the definitions of the unbacked SymInts are not codegen-ed and the code using them in the grid definition breaks. Here we add the unbacked SymInts used in the grid to the `get_unbacked_symbol_uses` returned by the `UserDefinedTritonKernel` alongside those used in the `kwargs` (returned by `ExternKernel`). Test Plan: ``` $ python test/inductor/test_aot_inductor.py -k test_triton_kernel_unbacked_symint ... ---------------------------------------------------------------------- Ran 24 tests in 155.764s OK (skipped=16) ``` Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D56406991](https://our.internmc.facebook.com/intern/diff/D56406991) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124594 Approved by: https://github.com/oulgen	2024-04-22 15:33:30 +00:00
PyTorch MergeBot	277ab8a4c0	Revert "[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449 )" This reverts commit a56e057814565b2ae33b2106b4d0136179aa18f8. Reverted https://github.com/pytorch/pytorch/pull/119449 on behalf of https://github.com/jeanschmidt due to Broken internal signals, @albanD please help get this sorted :) ([comment](https://github.com/pytorch/pytorch/pull/119449#issuecomment-2069716129))	2024-04-22 14:44:44 +00:00
Kai Londenberg	d5037c389c	[Inductor cutlass backend] Fix tests: skipIfROCm always skips when using as class annotation (#123930 ) I previously added @skipIfRocm as a class annotation within test/inductor/test_cutlass_backend.py - turns out this annotation always skips if applied at class level, so I need to skip Cutlass tests on ROCm differently.. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123930 Approved by: https://github.com/jansel ghstack dependencies: #121497	2024-04-22 13:59:59 +00:00
ZhiweiYan-96	ad7b5d32b6	Intel GPU oneDNN Upstreaming: Convolution operators support (#117529 ) # Motivation This PR is a part of RFC #114848. This PR depends on oneDNN compilation in #117098, basic integration support in #117112, and Conv integration code in #117512. Some runtime support is needed in #116019. This PR implements the convolution and deconvolution operators for XPU that should be defined in `aten` libraries. Backward is also supported. With this PR, the conv-related operators should be functionally ready. Co-authored-by: xiaolil1 <xiaoli.liu@intel.com> Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/117529 Approved by: https://github.com/EikanWang, https://github.com/malfet ghstack dependencies: #117512	2024-04-22 13:22:36 +00:00
ZhiweiYan-96	9d4bef6248	Intel GPU oneDNN upstreaming: Conv primitive integration (#117512 ) # Motivation This PR is a part of RFC #114848. This PR depends on oneDNN compilation in #117098 and basic integration support in #117112. Some runtime support is needed in #116019. This PR provides the oneDNN integration code for Convolution and Deconvolution related operators. All aten convolution operators (conv, deconv, and conv-pointwise fusion) will go through this layer before executing the oneDNN primitive. The integration code is responsible for providing the correct memory description for the primitive, along with the primitive attribute description. Once this PR lands, we will add Conv related operators together with their registration. Co-authored-by: xiaolil1 <xiaoli.liu@intel.com> Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/117512 Approved by: https://github.com/EikanWang, https://github.com/malfet	2024-04-22 12:20:54 +00:00
Kai Londenberg	42bd1abc62	[Inductor Cutlass backend] Tolerate dynamic shapes (#121497 ) Previously, when the Cutlass backend was enabled, using dynamic shapes could lead to exceptions during JIT. With this change, there are guards in place to just disable the Cutlass backend if dynamic dimensions are involved. In addition, if no choices for a GEMM are available using the selected backends, then an ATen Kernel is used as fallback, even if the ATen backend is not enabled. Test: CI Additional unit test in test_cutlass_backend.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/121497 Approved by: https://github.com/jansel	2024-04-22 12:05:50 +00:00
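The selection logic described in #121497 above (skip Cutlass when a dimension is dynamic, and fall back to ATen when nothing else is left) can be sketched as follows. This is a toy model under stated assumptions, not Inductor's code: `select_gemm_choices` is a hypothetical function, backends are plain strings, and a dynamic dimension is represented by any non-`int` entry in the shape.

```python
def select_gemm_choices(enabled_backends, shape):
    """Return the backends that may provide a GEMM kernel for this shape.

    Assumption for illustration: a dimension is dynamic if it is not an int.
    """
    has_dynamic_dim = any(not isinstance(dim, int) for dim in shape)
    choices = []
    for backend in enabled_backends:
        if backend == "CUTLASS" and has_dynamic_dim:
            continue  # guard: Cutlass is disabled for dynamic shapes
        choices.append(backend)
    if not choices:
        # Fallback applies even if ATen was not among the enabled backends.
        choices.append("ATEN")
    return choices


if __name__ == "__main__":
    print(select_gemm_choices(["CUTLASS"], (128, "s0")))
    print(select_gemm_choices(["CUTLASS", "TRITON"], (128, 64)))
```

The unconditional fallback guarantees that compilation always has at least one kernel to pick, at the cost of ignoring the user's backend selection in that one case.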
PyTorch MergeBot	34bce27f0d	Revert "fix Invalid call to aoti_torch_tensor_copy_ #123039 (#124037 )" This reverts commit 6e24cc012b130869d0029280dcbb34efdd0032cc. Reverted https://github.com/pytorch/pytorch/pull/124037 on behalf of https://github.com/jeanschmidt due to seems to have introduced a regression in pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 3, 5, linux.4xlarge.nvidia.gpu) ([comment](https://github.com/pytorch/pytorch/pull/124037#issuecomment-2068659093))	2024-04-22 07:20:10 +00:00
Jason Ansel	3af12447f8	[inductor] Use compile time config values in runtime (#124561 ) This removes usage of torch._inductor.config from `torch._inductor.runtime`. Fixing two issues: 1) If configs change we should really use the compile time ones 2) In compile workers, we want to use the parent process config Pull Request resolved: https://github.com/pytorch/pytorch/pull/124561 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557, #124559, #124560, #124569	2024-04-22 04:51:30 +00:00
Jason Ansel	317c0af149	[inductor] Remove config check for 3D tiling (#124569 ) This makes the check per-kernel (if 3D tiling is used), rather than global config. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124569 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557, #124559, #124560	2024-04-22 04:51:30 +00:00
Jason Ansel	3ac30bc32a	[inductor] Refactor runtime files into torch._inductor.runtime (part 5) (#124560 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124560 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557, #124559	2024-04-22 04:51:24 +00:00
Jason Ansel	9ea2a09510	[inductor] Refactor runtime files into torch._inductor.runtime (part 4) (#124559 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124559 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557	2024-04-22 04:51:20 +00:00
Jason Ansel	fcf28b0ad5	[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553	2024-04-22 04:51:15 +00:00
Jason Ansel	f4d47f5bbb	[inductor] Refactor runtime files into torch._inductor.runtime (part 2) (#124553 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124553 Approved by: https://github.com/yanboliang ghstack dependencies: #124552	2024-04-22 04:51:09 +00:00
Jason Ansel	a7035cc11a	[inductor] Refactor runtime files into torch._inductor.runtime (part 1) (#124552 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124552 Approved by: https://github.com/yanboliang	2024-04-22 04:51:05 +00:00
Tuan Trieu	6e24cc012b	fix Invalid call to aoti_torch_tensor_copy_ #123039 (#124037 ) fixes #123039 In abi mode, ExternKernelSchedulerNode generates code using `aoti_torch_tensor_copy_` which requires `AtenTensorHandle`, but the allocation generates ArrayRefTensor to allocate mem in stack. To fix this issue, this PR prevents ExternKernelSchedulerNode from using stack-mem-allocation in abi, and creates AtenTensorHandle instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124037 Approved by: https://github.com/desertfire	2024-04-22 01:34:22 +00:00
Chen, Zejun	b1984237a0	[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247 ) This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing. #suppress-api-compatibility-check Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247 Approved by: https://github.com/aaronenyeshi	2024-04-22 01:26:55 +00:00
Aaron Gokaslan	c5fafe9f48	[BE]: TRY002 - Ban raising vanilla exceptions (#124570 ) Adds a ruff lint rule to ban raising raw exceptions. Most of these should at the very least be runtime errors, value errors, type errors or some other errors. There are hundreds of instances of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get committers to rethink what exception type they should raise when they submit a PR. I also encourage people to gradually go and fix all the existing noqas that have been added so they can be removed over time and our exception typing can be improved. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570 Approved by: https://github.com/ezyang	2024-04-21 22:26:40 +00:00
Chirag Pandya	fd90991790	[rfc] opentelemetry in pytorch (#122999 ) 1. Add current latest version (opentelemetry-cpp version v1.14.2) to PyTorch library. Steps: ``` $cd pytorch $git submodule add https://github.com/open-telemetry/opentelemetry-cpp.git third_party/opentelemetry-cpp $cd third_party/opentelemetry-cpp $git checkout v1.14.2 $git add third_party/opentelemetry-cpp .gitmodules $git commit ``` Expected change in checkout size: ``` (/home/cpio/local/a/pytorch-env) [cpio@devvm17556.vll0 ~/local/pytorch (gh/c-p-i-o/otel)]$ git count-objects -vH count: 654 size: 3.59 MiB in-pack: 1229701 packs: 17 size-pack: 1.17 GiB prune-packable: 76 garbage: 0 size-garbage: 0 bytes ``` 2. TODO - [x] Figure out how dynamic linking works. App builders will somehow need to `target_include` opentelemetry-cpp at runtime. - [ ] Examples on how to use opentelemetry + pytorch - [ ] Tests + documentation (e.g. using null opentelemetry implementation). Pull Request resolved: https://github.com/pytorch/pytorch/pull/122999 Approved by: https://github.com/ezyang	2024-04-21 15:20:21 +00:00
Aaron Gokaslan	29cc293725	[BE]: FURB142 - Remove set mutations. Use set update (#124551 ) Uses set mutation methods instead of manually reimplementing (update, set_difference etc). Pull Request resolved: https://github.com/pytorch/pytorch/pull/124551 Approved by: https://github.com/ezyang	2024-04-21 14:12:33 +00:00
Aaron Gokaslan	5a1216bb2e	[BE]: Update ruff to 0.4.1 (#124549 ) Update ruff to 0.4.1. This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes. Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds, courtesy of https://astral.sh/blog/ruff-v0.4.0 \| Repository \| Linter (v0.3) \| Linter (v0.4) \| Formatter (v0.3) \| Formatter (v0.4) \| \|----------------------------------------------------\|---------------\|---------------\|------------------\|------------------\| \| [pytorch/pytorch](https://github.com/pytorch/pytorch) \| 328.7 \| 251.8 \| 351.1 \| 274.9 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549 Approved by: https://github.com/ezyang	2024-04-21 14:06:23 +00:00
Edward Z. Yang	f34905f61d	Assert that TracingContext is available when set_example_value is called (#124284 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124284 Approved by: https://github.com/Chillee ghstack dependencies: #124105, #124059, #124176, #124283	2024-04-21 11:23:13 +00:00
Edward Z. Yang	0e6367dd44	Factor var_to_range assignments to _update_var_to_range helper (#124283 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124283 Approved by: https://github.com/IvanKobzarev ghstack dependencies: #124105, #124059, #124176	2024-04-21 11:23:13 +00:00
Colin Peppler	cbf420b67a	[inductor] for UserDefinedTritonKernels don't mark all inputs as mutating (#124425 ) Take this example: ``` def _mul2(x): y = torch.empty_like(x) mul2_kernel[(10,)]( in_ptr0=x, out_ptr=y, n_elements=x.numel(), BLOCK_SIZE=1, ) return y def f(x): for _ in range(4): x = _mul2(x) return x + 1 ``` Currently, the codegen will show up like this. Notice, how we allocate 5 buffers of the same size. ``` # Source Nodes: [triton_kernel_wrapper_mutation], Original ATen: [] buf0 = empty_strided_cuda((10, ), (1, ), torch.float32) mul2_kernel_0.run(in_ptr0=arg0_1, out_ptr=reinterpret_tensor(buf0, (10, ), (1, ), 0) ...) # Source Nodes: [triton_kernel_wrapper_mutation_1], Original ATen: [] buf4 = empty_strided_cuda((10, ), (1, ), torch.float32) mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf0, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf4, (10, ), (1, ), 0) ...) # Source Nodes: [triton_kernel_wrapper_mutation_2], Original ATen: [] buf8 = empty_strided_cuda((10, ), (1, ), torch.float32) mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf4, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf8, (10, ), (1, ), 0) ...) # Source Nodes: [triton_kernel_wrapper_mutation_3], Original ATen: [] buf12 = empty_strided_cuda((10, ), (1, ), torch.float32) mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf8, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf12, (10, ), (1, ), 0) ...) # Source Nodes: [add], Original ATen: [aten.add] buf16 = empty_strided_cuda((10, ), (1, ), torch.float32) triton_poi_fused_add_0.run(buf12, buf16, 10, grid=grid(10), stream=stream0)...) return (buf16, ) ``` With this PR, we want to see this. Notice, how we only allocate 2 buffers this time. The other 3 buffers are re-used. ``` # Source Nodes: [triton_kernel_wrapper_mutation], Original ATen: [] buf0 = empty_strided_cuda((10, ), (1, ), torch.float32) mul2_kernel_0.run(in_ptr0=arg0_1, out_ptr=reinterpret_tensor(buf0, (10, ), (1, ), 0), ...) 
del arg0_1 # Source Nodes: [triton_kernel_wrapper_mutation_1], Original ATen: [] buf2 = empty_strided_cuda((10, ), (1, ), torch.float32) mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf0, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf2, (10, ), (1, ), 0) ...) # Source Nodes: [triton_kernel_wrapper_mutation_2], Original ATen: [] buf4 = buf0; del buf0 # reuse mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf2, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf4, (10, ), (1, ), 0) ...) # Source Nodes: [triton_kernel_wrapper_mutation_3], Original ATen: [] buf6 = buf2; del buf2 # reuse mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf4, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf6, (10, ), (1, ), 0) ...) del buf4 # Source Nodes: [add], Original ATen: [aten.add] buf8 = buf6; del buf6 # reuse triton_poi_fused_add_0.run(buf8, 10, grid=grid(10), stream=stream0) return (buf8, ) ``` Differential Revision: [D56379307](https://our.internmc.facebook.com/intern/diff/D56379307) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124425 Approved by: https://github.com/oulgen	2024-04-21 06:00:14 +00:00
Yanbo Liang	0d90d4d613	[Dynamo] Fix NamedTuple hasattr bug (#124531 ) Fixes #124402 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124531 Approved by: https://github.com/jansel	2024-04-21 04:36:22 +00:00
Joël Tang	a6a3f2e06b	[MPS] Fixes GELU, LeakyRELU and MISH on non-contiguous tensors (#123049 ) Fixes GELU, LeakyRELU and MISH activation functions on non-contiguous tensors (for instance, when a transpose operation was applied on the tensors prior to the MPS operator), for both forward and backward passes. I also extended tests on the 3 activation functions to check: full-precision and half-precision, contiguous and non-contiguous, and several dims of tensors: scalars, 1D, empty, 2D, > 3D. I had issues with Mish and GELU activations when asserting the gradients vs. CPU with sum() in some cases, so I reverted to the previous setup by setting a gradient parameter on .backward(). This PR also fixes an issue with LeakyRELU on empty tensors. Fixes #98212 huggingface/transformers#22468 huggingface/transformers#19353 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123049 Approved by: https://github.com/kulinseth	2024-04-21 00:12:32 +00:00
Aidyn-A	98f3e0214b	[NCCL][TEST] Synchronize proper devices (#124517 ) There are multiple instances of `torch.cuda.synchronize()` calls without arguments. These calls cause device 0 to be synchronized from multiple ranks while the rest of the devices are not. I am pretty sure that was not intended. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124517 Approved by: https://github.com/wconstab, https://github.com/eqy	2024-04-20 23:42:32 +00:00
FFFrog	d6f88105ce	Fix the problem about load_state_dict with unexpected key whose prefix matches a valid key (#124385 ) Fixes https://github.com/pytorch/pytorch/issues/123510 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124385 Approved by: https://github.com/mikaylagawarecki	2024-04-20 23:19:25 +00:00
Edward Z. Yang	afa78ad08c	Call writeline from writelines (#124515 ) This makes it more convenient to add a breakpoint here. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124515 Approved by: https://github.com/albanD	2024-04-20 15:45:30 +00:00
Animesh Jain	a32eac345f	[dynamo] Return gm.forward for eager backend (#124109 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124109 Approved by: https://github.com/yanboliang, https://github.com/jansel ghstack dependencies: #124445	2024-04-20 14:11:05 +00:00
Animesh Jain	febc4d8759	[dynamo][easy] forbid_in_graph check to use getattr_static (#124445 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124445 Approved by: https://github.com/yanboliang, https://github.com/jansel	2024-04-20 14:11:05 +00:00
Isuru Fernando	97ccfad915	Fix test_decomp test for ops with py_impl(CompositeImplicitAutograd) (#116832 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116832 Approved by: https://github.com/lezcano	2024-04-20 11:10:38 +00:00
Yanbo Liang	a3e3693afc	[Dynamo] Fix missing bracket in ListVariable (#124532 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/124532 Approved by: https://github.com/williamwen42	2024-04-20 08:26:30 +00:00
Timmy Xiao	f20e3ae0c3	Use recursive glob for package data (#119257 ) setup.py now supports recursive glob for package data. I only added `.cpp`, `.h`, and `.yaml` files. Not sure if you want to include BAZEL or other files in package_data. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119257 Approved by: https://github.com/zou3519	2024-04-20 06:33:39 +00:00
Michael Lazos	0d0b5b2655	Enable dynamo rosenbrock sparse tests (#124542 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124542 Approved by: https://github.com/yf225 ghstack dependencies: #124540, #124541	2024-04-20 05:54:41 +00:00
Michael Lazos	184f16016e	Enable dynamo-traced deepcopy test for RMSprop (#124541 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124541 Approved by: https://github.com/yf225 ghstack dependencies: #124540	2024-04-20 05:54:41 +00:00
Michael Lazos	6a730698e2	Enable dynamo-traced Adamax tests (#124540 ) Enabling tests related to https://github.com/pytorch/pytorch/issues/121178 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124540 Approved by: https://github.com/yf225	2024-04-20 05:54:41 +00:00
drisspg	f1cbaf1764	Adds LSE output for templated-attention-hop if inputs require grad (#124308 ) Adds LSE output for templated-attention-hop if inputs require grad Prep PR for adding autograd support to templated-attention-hop. The kernel needs to output the LSE during the forward which will be used during backwards. ### Output code https://gist.github.com/drisspg/2aea3ce5db75811e7e143eeecb774d8a ## Before \| Type \| Speedup \| batch_size \| num_heads \| q_seq_len \| k_seq_len \| head_dim \| score_mod \| dtype \| \|---------\|-----------\|--------------\|-------------\|-------------\|-------------\|------------\|---------------\|----------------\| \| Average \| 1.159 \| \| \| \| \| \| \| \| \| Max \| 1.342 \| 16 \| 16 \| 512 \| 512 \| 64 \| noop \| torch.bfloat16 \| \| Min \| 1.016 \| 1 \| 16 \| 512 \| 512 \| 64 \| relative_bias \| torch.bfloat16 \| ## After Type \| Speedup \| batch_size \| num_heads \| q_seq_len \| k_seq_len \| head_dim \| score_mod \| dtype \| \|---------\|-----------\|--------------\|-------------\|-------------\|-------------\|------------\|-------------\|----------------\| \| Average \| 1.155 \| \| \| \| \| \| \| \| \| Max \| 1.339 \| 16 \| 16 \| 512 \| 512 \| 64 \| noop \| torch.bfloat16 \| \| Min \| 1.009 \| 1 \| 16 \| 512 \| 512 \| 64 \| head_bias \| torch.bfloat16 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/124308 Approved by: https://github.com/Chillee	2024-04-20 05:45:56 +00:00
Oguz Ulgen	0d64b82f0b	Make CompiledFxGraph portable between machines (#124438 ) As we prepare FxGraphCache to move to remote, we need to make sure there's no data that is on the disk. Differential Revision: [D56363808](https://our.internmc.facebook.com/intern/diff/D56363808) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124438 Approved by: https://github.com/jansel	2024-04-20 05:26:14 +00:00
Shunting Zhang	c5a4ba2257	[inductor] consider pointwise nodes when deciding reduction hint (#124131 ) In certain rare scenarios, inductor can generate a reduction kernel with really bad perf. E.g., if - the reduction kernel contains a reduction node followed by a pointwise node - and the pointwise node uses a transposed layout - the reduction node is an inner reduction - and rnumel <= 1024, then inductor will generate a persistent reduction kernel, which causes really bad perf when doing tl.store for the pointwise node since we use a very skinny tile `(XBLOCK=1, RBLOCK=next_power_of_2(rnumel))`. I've tried a few versions of the fix. - The first version: if any pointwise node in a reduction kernel uses a non-contiguous dependency, we use ReductionHint.DEFAULT. This causes an 8s compilation time increase for huggingface with no perf wins... The reason is that ReductionHint.DEFAULT does more autotuning. - Then I changed the code to be more specific. We change the hint from INNER to DEFAULT if we are sure that the pointwise kernel can use a >1 stride for the lowest dimension. Kernels meeting this condition should mostly have really bad perf anyway. The situation mentioned above is rare, but it was reported by internal users. I'll also run one more perf test. Testing script: https://gist.github.com/shunting314/9d3389891fa43633b49b8b7564ad6d8b . Something equivalent is also added as a unit test. For this specific test from user reports, we improve the mentioned reduction kernel's perf by 4.14x (451us -> 109us) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124131 Approved by: https://github.com/jansel	2024-04-20 05:07:56 +00:00
Xiaodong Wang	57f64197f3	Reduce warning msg in torch.profiler (#124469 ) Summary: This is actually quite noisy and my logs are full of this soft assertion msg. Maybe making it log once? Test Plan: On AMD GPU side, I got a lot of those warnings: ``` W0415 01:40:45.109864 917160 collection.cpp:602] Warning: Memcpy ? (? -> ?) (function operator())” ``` So just suppress the excessive logs Reviewed By: aaronenyeshi, yoyoyocmu Differential Revision: D55602788 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124469 Approved by: https://github.com/aaronenyeshi	2024-04-20 04:45:12 +00:00
Jianping Wu	b79b0d3d6a	Enable UFMT on test/test_legacy_vmap.py (#124381 ) Part of https://github.com/pytorch/pytorch/issues/123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124381 Approved by: https://github.com/ezyang	2024-04-20 03:37:57 +00:00
Scott Wolchok	3d8b903d95	[PyTorch] Remove ArrayRefTensor::numel_ (#124516 ) ArrayRefTensor::numel_ is redundant with the size of the contained MiniArrayRef. Reclaiming the space entirely would break ABI compatibility, but at least we have 4-8 bytes for future expansion. Differential Revision: [D56366829](https://our.internmc.facebook.com/intern/diff/D56366829/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D56366829/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/124516 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-04-20 02:44:20 +00:00
Andrew Gu	f9fce110af	[FSDP2][ez] Removed error check for swap tensors flag (#124513 ) Since `DTensor` uses `swap_tensors` path automatically now, we can remove this check for the global flag. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124513 Approved by: https://github.com/weifengpy ghstack dependencies: #124319, #120256	2024-04-20 00:46:36 +00:00
Andrew Gu	1c2cb36811	[FSDP2] Added CPU offloading (#120256 ) #### Overview This PR adds CPU offloading via the `offload_policy: OffloadPolicy` argument. - We incur one H2D copy for each parameter before all-gather. - We incur one D2H copy for each gradient after reduce-scatter. - We run optimizer on CPU. #### Example (Mixed Precision and CPU Offloading) This example uses a small 125M numel model, which is not too representative. We can try to run with a larger model like Llama-7B. However, since the current optimizer step is already too slow, we may want to patch a faster CPU optimizer. Forward ![Screenshot 2024-02-21 at 10 36 29 AM](https://github.com/pytorch/pytorch/assets/31054793/00ed95db-3a55-49bb-ac98-9b9162feaacd) ![Screenshot 2024-02-21 at 10 39 12 AM](https://github.com/pytorch/pytorch/assets/31054793/10e29854-1907-4001-b3dc-aab6c3bf153c) Backward ![Screenshot 2024-02-21 at 10 37 47 AM](https://github.com/pytorch/pytorch/assets/31054793/7039cb2e-eb78-4f53-b83f-67bae61ebddd) ![Screenshot 2024-02-21 at 10 38 44 AM](https://github.com/pytorch/pytorch/assets/31054793/e34615d6-6b6b-4995-aef1-9c7563034799) Overall CPU (CPU optimizer step dominates) ![Screenshot 2024-02-21 at 10 39 47 AM](https://github.com/pytorch/pytorch/assets/31054793/7a2a929a-3a40-4b35-891b-016cf57e8079) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120256 Approved by: https://github.com/weifengpy ghstack dependencies: #124319	2024-04-20 00:42:58 +00:00
soulitzer	cf5ca58e7f	[NJT] Inline through torch.nested.nested_tensor_from_jagged instead of graph break (#124343 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124343 Approved by: https://github.com/jbschlosser	2024-04-19 23:13:59 +00:00
Laith Sakka	acbf888a13	rename sl to strobelight (#124455 ) Summary: TORCH_COMPILE_SL_PROFILE ->TORCH_COMPILE_STROBELIGHT SL_MAX_STACK_LENGTH -> COMPILE_STROBELIGHT_MAX_STACK_LENGTH SL_MAX_PROFILE_TIME -> COMPILE_STROBELIGHT_MAX_PROFILE_TIME profile_with_sl() -> strobelight() compiletime_sl_profile_meta() -> compiletime_strobelight_meta() Test Plan: 1. run and verify ``` TORCH_COMPILE_STROBELIGHT=TRUE buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compiletime_profiler_example ``` 2. run and verify ``` buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:function_profiler_example --local-only ``` 3. run and verify truncated stack for ``` TORCH_COMPILE_STROBELIGHT=TRUE COMPILE_STROBELIGHT_MAX_STACK_LENGTH=1 buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compiletime_profiler_example ``` 4. add infinite loop in _verify and verify samples for ``` COMPILE_STROBELIGHT_MAX_PROFILE_TIME=30 TORCH_COMPILE_STROBELIGHT=TRUE buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compiletime_profiler_example ``` Reviewed By: oulgen Differential Revision: D56327139 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124455 Approved by: https://github.com/oulgen	2024-04-19 22:50:13 +00:00
PyTorch MergeBot	0feab7d6c3	Revert "Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611 )" This reverts commit cb17721899d4d6a55d66d4f7188e36c20a078231. Reverted https://github.com/pytorch/pytorch/pull/123611 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))	2024-04-19 22:44:26 +00:00
PyTorch MergeBot	929242a15c	Revert "torch.mtia module for MTIA device backend (#123612 )" This reverts commit d7e1bf9ff908d2a9c20d5354426d34c539fcb7a1. Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))	2024-04-19 22:44:26 +00:00
PyTorch MergeBot	52da03edeb	Revert "Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614 )" This reverts commit b6f0159db08c1ad55fe57a5e92d8933e21ea543e. Reverted https://github.com/pytorch/pytorch/pull/123614 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))	2024-04-19 22:44:26 +00:00
cdzhan	f8f7cfbeee	Add __torch_function__ support for generated tensor methods/property of PrivateUse1 (#121723 ) support following case: ```python import torch ... class CustomFooTensor(torch.Tensor): @classmethod def __torch_function__(cls, func, types, args=(), kwargs=None): ... a = CustomFooTensor([3]) print(a.is_foo) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121723 Approved by: https://github.com/albanD	2024-04-19 22:34:34 +00:00
eellison	19850d770d	update triton pin (#124429 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124429 Approved by: https://github.com/shunting314, https://github.com/malfet	2024-04-19 22:34:28 +00:00
drisspg	d8a98ddd60	Prep PR for cutlass 3.5 update (#124412 ) # Summary These changes are needed for the upgrade to cutlass 3.5 #123458 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124412 Approved by: https://github.com/Skylion007, https://github.com/nWEIdia, https://github.com/malfet	2024-04-19 22:10:37 +00:00
Yuanhao Ji	b3504af56e	Enable UFMT on `test/scripts` and some files (#124137 ) Part of: #123062 Ran lintrunner on: - `test/scripts` - `test/simulate_nccl_errors.py` - `test/test_ao_sparsity.py` - `test/test_autocast.py` - `test/test_binary_ufuncs.py` - `test/test_bundled_images.py` - `test/test_bundled_inputs.py` - `test/test_comparison_utils.py` - `test/test_compile_benchmark_util.py` - `test/test_complex.py` - `test/test_cpp_api_parity.py` - `test/test_cpp_extensions_aot.py` - `test/test_cpp_extensions_jit.py` - `test/test_cpp_extensions_open_device_registration.py` Detail: ```bash $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124137 Approved by: https://github.com/soulitzer	2024-04-19 22:01:27 +00:00
rzou	f0560f7b3b	[opcheck] Stop doing test_aot_dispatch_static by default (#124495 ) Motivations: - this is pretty redundant with test_aot_dispatch_dynamic. - The user story for opcheck is that a user should use opcheck to see if their operator was "registered correctly". If a user's custom op only supports dynamic shapes, then it's a bit awkward for one of the tests (e.g. `test_aot_dispatch_static`) to fail. - We've already stopped running test_aot_dispatch_static in all of our opcheck tests. Test Plan: - wait for CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/124495 Approved by: https://github.com/williamwen42 ghstack dependencies: #124180, #124200, #124299, #124134, #124199, #124403, #124414	2024-04-19 21:57:22 +00:00
rzou	37d18966ea	[custom_op] set some tags when constructing the op (#124414 ) - the op is automatically "pt2-compliant" - In general we want to turn on needs_fixed_stride_order for all custom ops, but this needs some more work, so we're just going to turn it on for the new custom op API. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124414 Approved by: https://github.com/albanD ghstack dependencies: #124180, #124200, #124299, #124134, #124199, #124403	2024-04-19 21:57:22 +00:00
Andrew Gu	1900f79b72	[FSDP2] Added `set_reshard_after_backward` (#124319 ) This PR adds a `set_reshard_after_backward` method to allow disabling resharding after backward. `reshard_after_backward=False` can be used with `reshard_after_forward=False` to implement "ZeRO-1", where there is only all-gather on the first microbatch forward and reduce-scatter on the last microbatch backward. ``` for microbatch_idx, microbatch in dataloader: is_last_microbatch = microbatch_idx == num_microbatches - 1 model.set_requires_gradient_sync(is_last_microbatch) model.set_reshard_after_backward(is_last_microbatch) model.set_is_last_backward(is_last_microbatch) microbatch_fwd_bwd(model, microbatch, microbatch_idx) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124319 Approved by: https://github.com/weifengpy	2024-04-19 21:49:35 +00:00
Pian Pawakapan	10b9d4d19c	[export] handle Dim.lower = 0, 1 for ep.run_decompositions() (#123602 ) Summary: With pre-dispatch export and ep.run_decompositions(), range constraints are updated by looking at ShapeEnv.var_to_range. However, the lower bounds on these may be incorrect: analysis on un-specialized symbols is done with lower bounds of 2, which mismatch the user-specified bounds (may be 0, 1). This updates `_get_updated_range_constraints()` to use the old range constraints if possible. Test Plan: Existing pre-dispatch/dynamic shapes test case. Differential Revision: D55899872 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123602 Approved by: https://github.com/tugsbayasgalan	2024-04-19 21:29:36 +00:00
Nikita Shulga	c74dfca5e7	Int4MM: Unswizzle for different dtypes (#124448 ) If dtype is not the one this platform is optimized for, it might need different unswizzling patterns. Implement ones for the non-vectorized flavor of the kernel, so that int4mm can be used with float32 and float16 dtypes Pull Request resolved: https://github.com/pytorch/pytorch/pull/124448 Approved by: https://github.com/jgong5, https://github.com/mikekgfb	2024-04-19 21:17:15 +00:00
eellison	000d55870a	Enable in oss (#124031 ) Biggest movement is 4% HF inference, 9% TIMM inference. Note, this is max-autotune mode so we are more tolerant of compilation increases. We could improve compilation time by limiting: ``` # Take how many of the top triton kernels to benchmark epilogue max_epilogue_benchmarked_choices = 3 ``` There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. When you turn off epilogue fusion, it fixes the accuracy. I bisected the failure to an epilogue, however when you compare the results of that epilogue with the corresponding separate kernels the results of the output are equivalent. Inference: <img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c"> Training: <img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124031 Approved by: https://github.com/Chillee, https://github.com/shunting314 ghstack dependencies: #124030, #122642, #123229, #122825	2024-04-19 20:28:55 +00:00
Nikita Shulga	e6a788ac26	Fix compilation on aarch64 with gcc (#124511 ) Which is more stringent than clang when equivalently sized NEON registers are cast to each other. In particular, at one point `uint16x4_t` was cast to `int16x4_t`, which gcc does not allow. Added `vreinterpret_s16_u16` (which is a no-op) to solve this and tested in https://godbolt.org/z/sYb4ThM6M Test plan: Build aarch64 wheels Pull Request resolved: https://github.com/pytorch/pytorch/pull/124511 Approved by: https://github.com/mikekgfb	2024-04-19 19:53:19 +00:00
eellison	179108f14d	Use separate flags for MultiTemplates from BenchmarkFusion (#122825 ) Two changes: - Make the flag for multi template buffer independent from benchmark fusion. While benchmark fusion can be useful, the compilation time/performance trade-offs are different than for just templates, which we'd like to enable by default. - Don't do MultiTemplateBuffers/benchmark-fusion for templates which have custom input gen fns (which currently only exist internally). Threading the custom input gen fns to benchmark fusion is NYI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122825 Approved by: https://github.com/shunting314 ghstack dependencies: #124030, #122642, #123229	2024-04-19 19:50:42 +00:00
IvanKobzarev	73f56e1e81	[sym_shapes][perf] Do not calculate hint in advice_is_size (#124472 ) Differential Revision: [D56352412](https://our.internmc.facebook.com/intern/diff/D56352412) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124472 Approved by: https://github.com/ezyang	2024-04-19 19:10:24 +00:00
Xiaodong Wang	661fd23640	[AMD] TunableOp take priority over DISABLE_ADDMM_HIP_LT (#124161 ) Summary: It seems super confusing that if we set DISABLE_ADDMM_HIP_LT + PYTORCH_TUNABLEOP_ENABLED, the former takes priority. This is because the former goes through the gemm_and_bias path and tunable op is integrated with the gemm path. Until we can integrate tunable op with gemm_and_bias, we'll just let tunable op take priority. Test Plan: Run a simple linear program and verified. Differential Revision: D56183954 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124161 Approved by: https://github.com/jeffdaily, https://github.com/nmacchioni	2024-04-19 19:08:06 +00:00
PyTorch MergeBot	f87c788a34	Revert "Capture triton kernel in execution trace (#124140 )" This reverts commit 89407eca3b0be3c0272b5c583f8e77b9108a71f8. Reverted https://github.com/pytorch/pytorch/pull/124140 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/124140#issuecomment-2067137104))	2024-04-19 19:05:44 +00:00
IvanKobzarev	761de37ab7	[sym_shape][perf] eval_static: guards, unbacked compute once (#124217 ) Differential Revision: [D56212345](https://our.internmc.facebook.com/intern/diff/D56212345) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124217 Approved by: https://github.com/ezyang	2024-04-19 19:03:04 +00:00
Xiaodong Wang	8869b543e8	[AMD] Remove deprecated macro from ConvUtils (#124158 ) Summary: This is not great, but our ATen-cpu is not completely GPU agnostic. Previously we have worked on D54453492 (https://github.com/pytorch/pytorch/pull/121082) and D54528255, but there are a few things we haven't resolved, and it's exploding here. So we'll continue to fix them until all are gone. This ROCm block is for 4.3, which is very old. I don't think it should be supported any more, so let's just kill this macro. Test Plan: CI Differential Revision: D56172660 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124158 Approved by: https://github.com/jeffdaily, https://github.com/nmacchioni	2024-04-19 19:00:31 +00:00
Zhuoran Zhao	b0d83726bd	[5/x][AMD][Lowering Enablement] Hipifying aoti code_wrapper (#124241 ) Summary: as title Test Plan: CI & unit test patch on top of https://www.internalfb.com/phabricator/paste/view/P1214895953 to test Differential Revision: D56223917 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124241 Approved by: https://github.com/jansel, https://github.com/desertfire	2024-04-19 18:57:38 +00:00
rzou	25c65d6642	Change register_autograd to reflect ordering of setup_context and backward (#124403 ) old: `register_autograd(setup_context, backward, /)` new: `register_autograd(backward, /, *, setup_context=None)` Motivations: - We introduce these APIs as "give us a backward and use setup_context to save things for backward". - setup_context isn't always necessary. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124403 Approved by: https://github.com/albanD ghstack dependencies: #124180, #124200, #124299, #124134, #124199	2024-04-19 17:56:30 +00:00
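The signature change in the entry above can be illustrated with a plain-Python stand-in. This is a minimal sketch, not the torch.library code: the `register_autograd` below is a hypothetical stub that only mimics the new parameter layout (`backward` positional-only, `setup_context` keyword-only and optional), and the two callbacks are placeholders.

```python
# Hypothetical stand-in that mirrors only the *shape* of the new signature:
#   register_autograd(backward, /, *, setup_context=None)
# It is not the torch.library implementation.
def register_autograd(backward, /, *, setup_context=None):
    return {"backward": backward, "setup_context": setup_context}


def backward(ctx, grad):
    # Placeholder backward: passes the gradient through unchanged.
    return grad


def setup_context(ctx, inputs, output):
    # Placeholder: a real setup_context would save values on ctx.
    pass


# backward alone is enough when nothing has to be saved for it.
only_backward = register_autograd(backward)

# setup_context must now be passed by keyword.
both = register_autograd(backward, setup_context=setup_context)

# The old positional order (setup_context, backward) is rejected,
# because only one positional argument is accepted.
try:
    register_autograd(setup_context, backward)
    old_order_accepted = True
except TypeError:
    old_order_accepted = False
```

Making `setup_context` keyword-only means a call written against the old positional order fails immediately instead of silently swapping the two callbacks.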
rzou	a8e17b2d4d	Move schema inference to torch._library (#124199 ) After this PR, we can delete torch._custom_op/torch._custom_ops (except that there are external libraries depending on it). Pull Request resolved: https://github.com/pytorch/pytorch/pull/124199 Approved by: https://github.com/albanD ghstack dependencies: #124180, #124200, #124299, #124134	2024-04-19 17:56:30 +00:00
rzou	a78450a00b	Excise uses of the old custom ops APIs (#124134 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124134 Approved by: https://github.com/albanD ghstack dependencies: #124180, #124200, #124299	2024-04-19 17:56:26 +00:00
eellison	9489019085	Small fixes for deferred epilogue (#123229 ) Two small fixes: - preserve rng around compile_fx_inner - Now that we precompile in the background while lowering multiple templates in parallel, we can no longer allocate inputs at the beginning of the function because we would have multiple sets of inputs allocated at the same time. Instead, allocate them when needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123229 Approved by: https://github.com/shunting314 ghstack dependencies: #124030, #122642	2024-04-19 17:41:29 +00:00
eellison	39fc280dce	Dont precompile already seen keys, limit epilogue choices (#122642 ) Two changes: - in epilogue benchmark fusion, only take top 6 choices. There were basically no choices taken after this in HF. - Share a single precompilation function among matmuls with same key. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122642 Approved by: https://github.com/shunting314 ghstack dependencies: #124030	2024-04-19 17:34:22 +00:00
JackCaoG	7ae835eee4	Enable SourcelessBuilder to build GraphModule generated by make_fx (#123673 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123673 Approved by: https://github.com/ezyang, https://github.com/anijain2305 ghstack dependencies: #123680	2024-04-19 17:23:51 +00:00
Michael Lazos	68a027f144	Fixes for 123400 (#123406 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123406 Approved by: https://github.com/janeyx99 ghstack dependencies: #123324, #123404, #123405, #124309	2024-04-19 17:20:57 +00:00
Michael Lazos	5050e627dc	Defer marking_static_address (#124309 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124309 Approved by: https://github.com/anijain2305 ghstack dependencies: #123324, #123404, #123405	2024-04-19 17:20:57 +00:00
Michael Lazos	1531a29fb9	Enable tests related to 116061 (#123405 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123405 Approved by: https://github.com/janeyx99 ghstack dependencies: #123324, #123404	2024-04-19 17:20:54 +00:00
Michael Lazos	406d99e46c	Fix for 117147 (#123404 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123404 Approved by: https://github.com/Skylion007, https://github.com/janeyx99 ghstack dependencies: #123324	2024-04-19 17:20:50 +00:00
Michael Lazos	203d111c54	Enable dynamo test_forloop_goes_right_direction_multi_gpu (#123324 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123324 Approved by: https://github.com/janeyx99	2024-04-19 17:20:41 +00:00
ydwu4	293f756cdc	Support aot_export torchbind op (#123370 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123370 Approved by: https://github.com/zou3519 ghstack dependencies: #123367	2024-04-19 17:17:27 +00:00
ydwu4	e62169a8fa	Support torchbind op dispatch in python (#123367 ) We override the `__call__` method and register fake, functional, proxy default dispatch mode implementation in its python_key_mode_table. The idea is: 1. when inputs contains FakeScriptObject, we dispatch it through _get_dispatch mechanism. We implement dispatch mode keys automatically in the operator's constructor. 2. when inputs are not fakified, we dispatch through the original c++ dispatcher. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123367 Approved by: https://github.com/zou3519	2024-04-19 17:17:27 +00:00
eellison	136f8378e1	Re-land precompile triton templates (#124030 ) Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030 Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu	2024-04-19 17:03:33 +00:00
rzou	bad8d25881	Add torch.library.register_kernel (#124299 ) This mirrors the .register_kernel method on the object produced by the custom_op decorator. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124299 Approved by: https://github.com/albanD ghstack dependencies: #124180, #124200	2024-04-19 13:54:21 +00:00
rzou	3918dfedc5	[custom_op] Rename register_impl to register_kernel (#124200 ) Motivation: - The API is used for registering an implementation for a specific device type. - "impl" is ambiguous and can be confused with Library.impl. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124200 Approved by: https://github.com/albanD ghstack dependencies: #124180	2024-04-19 13:54:21 +00:00
rzou	22a2f676c3	[custom_op] add ability to provide manual schema (#124180 ) Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124180 Approved by: https://github.com/albanD	2024-04-19 13:54:13 +00:00
cyy	a56e057814	[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449 ) This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449 Approved by: https://github.com/malfet, https://github.com/albanD	2024-04-19 13:39:41 +00:00
GdoongMathew	8b1ad51881	Better Error Message in `ChainedScheduler` and `SequentialLR` (#121633 ) Fixes #121577 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121633 Approved by: https://github.com/janeyx99	2024-04-19 13:37:41 +00:00
Jesse Cai	c9db59e9e4	[sparse] Add fast semi-structured sparsification kernels (#122350 ) This PR adds fast semi-structured sparsification kernels to PyTorch. These kernels allow for accelerated semi-structured sparsification in PyTorch. The kernels have been added as aten native functions. In particular, three new functions have been added: * `torch._sparse_semi_structured_tile` This function will return the packed representation and metadata for both X and X', as well as the thread masks. Note that this applies 2:4 sparsity in a 4x4 tile instead of a 1x4 strip as usual. * `torch._sparse_semi_structured_apply` This function takes in an input tensor and thread masks from the above function and returns a packed representation and metadata from applying thread masks to the input tensor. * `torch._sparse_semi_structured_apply_dense` This function does the same thing as above but instead of returning the tensor in the sparse representation it returns it in the dense representation. The subclasses have also been updated to add a new `prune_dense_static_sort` classmethod to create sparse tensors with this format. I've added some additional documentation on how to calculate the compressed tensors needed to create a SparseSemiStructuredTensor oneself. To this end, there are two new helper functions added: `sparse_semi_structured_tile` `compute_compressed_swizzled_bitmask` Differential Revision: [D56190801](https://our.internmc.facebook.com/intern/diff/D56190801) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350 Approved by: https://github.com/cpuhrsch	2024-04-19 13:31:58 +00:00
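For readers unfamiliar with the "1x4 strip" pattern that the entry above contrasts with its 4x4 tile approach, here is a minimal pure-Python sketch of ordinary 2:4 pruning. It is not the kernel from the PR and does not implement the 4x4 tile algorithm; the helper name `prune_2_4_strip` is hypothetical, and magnitude-based selection is assumed as the pruning criterion.

```python
def prune_2_4_strip(row):
    """Keep the two largest-magnitude values in every group of four; zero the rest."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two entries with the largest absolute value.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.1, -2.0, 0.3, 1.5, 4.0, 0.0, -0.5, 0.25]
pruned = prune_2_4_strip(row)
```

Every group of four consecutive values ends up with at most two nonzeros, which is the property that allows the packed representation plus metadata mentioned in the entry.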
Cen Zhao	96724a769b	[ptd] drop ncclGroupStart/end for ncclCommInit (#124363 ) (#124416 ) Summary: ``` ncclGroupStart() ncclCommInit(..) ncclGroupEnd() ``` The above pattern is only needed when a single thread manages multiple GPUs. In our case, we always have 1 process managing 1 GPU, so we don't need the group operation. Test Plan: CI Differential Revision: D56274975 Co-authored-by: Cen Zhao <cenzhao@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124416 Approved by: https://github.com/shuqiangzhang	2024-04-19 13:12:42 +00:00
pratiklp00	88fa843e58	Add vectorized norm fill for ppc64le (#113351 ) This patch adds the vectorized norm fill for ppc64le. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113351 Approved by: https://github.com/jgong5	2024-04-19 12:34:00 +00:00
chilli	8e280862ff	Add custom joint graph passes (#124443 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124443 Approved by: https://github.com/aorenste, https://github.com/malfet	2024-04-19 11:54:46 +00:00
Jane Xu	b412b75b42	[optim] add fused_adam/adamw_kernel support for CPU device (#123074 ) On par with `CUDA` implementation. For `autocast` logic, same with `CUDA` + `Fused Adam`: - check inf in `gradscalar.step` - In fused kernel, if there is `inf`, do nothing. If not, unscale the grad ( also write back) and update the param. TestPlan: ``` # extend CUDA only test for CPU fused adagrad python test_optim.py -k test_fused_matches_forloop python test_optim.py -k test_fused_large_tensor python test_torch.py -k test_grad_scaling_autocast_fused # extend fused test python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step python test_optim.py -k test_can_load_older_state_dict # newly added test (follow `6b1f13ea2f/test/test_cuda.py (L1108)`) python test_optim.py -k test_grad_scaling_autocast_fused_optimizers ``` Benchmark: 5.1x on 56 core SPR Parameter-size=1M Nparams=10 [test script](https://gist.github.com/zhuhaozhe/ef9a290ad3f8f4067b3373a3bdaa33e7) ``` numactl -C 0-55 -m 0 python bench_adam.py non-fused 6.0174267292022705 s fused 1.1787631511688232 s ``` Note: Fused kernel accuracy The accuracy failure in CI shows a little higher than default tolerance ``` 2024-04-02T06:09:16.2213887Z Mismatched elements: 21 / 64 (32.8%) 2024-04-02T06:09:16.2214339Z Greatest absolute difference: 1.5735626220703125e-05 at index (6, 6) (up to 1e-05 allowed) 2024-04-02T06:09:16.2214813Z Greatest relative difference: 1.0073336852656212e-05 at index (4, 1) (up to 1.3e-06 allowed) ``` I have debug it step by step and unfortunately we may not able to make the `fused kernel` exactly same with `non fused` one due to compiler optimizations. 
For example, in non-fused impl ``` exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2) ``` and in fused impl ``` exp_avg_sq_ptr[d] = scalar_t(beta2) * exp_avg_sq_ptr[d]; // std::cout << "exp_avg_sq " << exp_avg_sq_ptr[d] << std::endl; exp_avg_sq_ptr[d] = exp_avg_sq_ptr[d] + scalar_t(exp_avg_sq_grad_coefficient) * grad_val * grad_val; ``` If I keep `std::cout`, I can get exactly the same results in UT ``` ===============param 0.6796758770942688 0.6796758770942688 ``` But when I comment it out, there will be a difference ``` ===============param 0.6796758770942688 0.6796759366989136 ``` So I will make the tolerance a little higher than the default one. Co-authored-by: Jane Xu <janeyx@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123074 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-04-19 11:14:04 +00:00
Boyuan Feng	9a71d12d92	[CUDAGraphTree] Support mutated inputs from prior cudagraph pool (#123231 ) # PR This PR supports mutating inputs in cudagraph trees, if these inputs are outputs from a previous cudagraph. Please check #121861 for more details. # Note on Optimistic Mutation Check To determine whether to apply cudagraph, we need to check input mutations, which fall into four categories: a) no mutation, b) mutation on parameters/buffers, c) mutation on cudagraph recorded tensors, d) mutation on non-cudagraph recorded tensors. We can apply cudagraph for types a, b, c but not for type d. The input mutation type depends on function, current_node, and inputs. Since `check_for_mutation` is slow, there is a trade-off between making type c or type d faster. - To make type d) faster, we want to `check_for_mutation` and call the eager function early. However, this adds unnecessary overhead to types a, b, c due to the extra check. - To make type c) faster, we want to skip `check_for_mutation` at the beginning and only `check_for_mutation` before `record_function` for a new function. This removes the overhead of `check_for_mutation` for types a, b, c. However, this adds extra overhead to type d due to `check_invariants` for all children nodes. Instead, we design an optimistic mutation check. The assumption is that, given a function and a node, the input mutation types usually remain the same across inputs. So, if we have ever detected a function on a node as type d, we will never detect it as type c. The detailed design is: - [Slow Path] On the first invocation of a function on a node, we run `check_for_mutation` once and cache the input mutation type as `non_cudagraph_managed_mutation[node_id][func_id]`. - [Fast Path] On the subsequent invocations of a function on a node, we skip `check_for_mutation`. If `non_cudagraph_managed_mutation[node_id][func_id]` is true, we directly call the eager function. Otherwise, we `check_invariants` and call the cudagraph function. 
- [Slow Path] Before `record_function`, we run `check_for_mutation` again. Q1: Would there be overhead for type a,b,c,d? A: No. We only check input mutation types for the first invocation of a function on a node. Q2: If a function happens to be type c during the first invocation on a node, could we detect it as type d in the future? A: Yes. This is done by `check_invariants` and guarantees the correctness. Q3: If a function happens to be type d during the first invocation on a node, could it still be recognized as type c in the future? A: No. But this should happen rarely according to our assumption. In the rare case that it happens, there would not be any correctness issues and the performance is the same as the eager (or inductor optimized) function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123231 Approved by: https://github.com/eellison	2024-04-19 10:32:12 +00:00
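The slow-path/fast-path caching described in the entry above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the CUDAGraphTree implementation: the class name, the `run` method, the `slow_checks` counter and the callables passed in are all hypothetical; only the `non_cudagraph_managed_mutation[node_id][func_id]` cache layout is taken from the message. The `check_invariants` step and the second slow path before `record_function` are deliberately left out.

```python
# Minimal sketch of the optimistic mutation check. Hypothetical names throughout,
# except for the cache name non_cudagraph_managed_mutation.
from collections import defaultdict


class OptimisticMutationCache:
    def __init__(self, check_for_mutation):
        # check_for_mutation(func_id, inputs) -> True when a mutated input is
        # not managed by cudagraph (type d in the message above).
        self.check_for_mutation = check_for_mutation
        self.non_cudagraph_managed_mutation = defaultdict(dict)
        self.slow_checks = 0  # counts how often the slow check actually ran

    def run(self, node_id, func_id, inputs, run_eager, run_cudagraph):
        cache = self.non_cudagraph_managed_mutation[node_id]
        if func_id not in cache:
            # Slow path: first invocation of this function on this node.
            self.slow_checks += 1
            cache[func_id] = self.check_for_mutation(func_id, inputs)
        if cache[func_id]:
            # Type d was seen once: always fall back to eager.
            return run_eager(inputs)
        # Fast path: skip the mutation check and use the cudagraph function.
        return run_cudagraph(inputs)


cache = OptimisticMutationCache(lambda func_id, inputs: func_id == "mutates_foreign")
eager = lambda inputs: ("eager", inputs)
graph = lambda inputs: ("cudagraph", inputs)
results = [
    cache.run(0, "pure", [1], eager, graph),
    cache.run(0, "pure", [2], eager, graph),
    cache.run(0, "mutates_foreign", [3], eager, graph),
    cache.run(0, "mutates_foreign", [4], eager, graph),
]
```

Four invocations trigger only two slow checks, one per (node, function) pair, which is the overhead saving the message argues for.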
Tobias Ringwald	58e403c739	Added a docstring for torch.Size.numel. (#124186 ) Fixes #61231. Fixes #124167. This PR documents a rather long-standing issue w.r.t. unexpected behavior of `torch.Size.numel`, first reported almost 5 years ago. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124186 Approved by: https://github.com/janeyx99	2024-04-19 09:23:02 +00:00
PyTorch MergeBot	520bc1080e	Revert "[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247 )" This reverts commit 768ce2cddad2057349d1194274a5f93c47c5ac88. Reverted https://github.com/pytorch/pytorch/pull/123247 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123247#issuecomment-2066152611))	2024-04-19 09:09:03 +00:00
Nikita Shulga	b2f6cfd9c0	Fix AVX2 int4pack_mm_kernel crash if weights are unaligned (#124433 ) Followup after https://github.com/pytorch/pytorch/pull/124128 `s/_mm256_load_si128/_mm256_loadu_si128/` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124433 Approved by: https://github.com/desertfire	2024-04-19 05:17:38 +00:00
Xuehai Pan	a6f044a490	[dynamo, 3.8-3.9] support dataclass with `frozen=True` in Python 3.8/3.9 (#124393 ) Closes #114966 Frozen field assignment in `__init__` in Python 3.8-3.9: `f5bd65ed37/Lib/dataclasses.py (L402-L411)` ```python import builtins BUILTINS = builtins def _field_assign(frozen, name, value, self_name): # If we're a frozen class, then assign to our fields in __init__ # via object.__setattr__. Otherwise, just use a simple # assignment. # # self_name is what "self" is called in this function: don't # hard-code "self", since that might be a field name. if frozen: return f'BUILTINS.object.__setattr__({self_name},{name!r},{value})' return f'{self_name}.{name}={value}' ``` Frozen field assignment in `__init__` in Python 3.10+: `812245ecce/Lib/dataclasses.py (L436-L445)` ```python __dataclass_builtins_object__ = object def _field_assign(frozen, name, value, self_name): # If we're a frozen class, then assign to our fields in __init__ # via object.__setattr__. Otherwise, just use a simple # assignment. # # self_name is what "self" is called in this function: don't # hard-code "self", since that might be a field name. if frozen: return f'__dataclass_builtins_object__.__setattr__({self_name},{name!r},{value})' return f'{self_name}.{name}={value}' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124393 Approved by: https://github.com/jansel	2024-04-19 05:10:33 +00:00
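Both `_field_assign` variants quoted in the entry above generate a call to `object.__setattr__` for frozen classes. The following minimal sketch (the `Point` class is a made-up example) shows why: ordinary assignment on a frozen instance raises, while `object.__setattr__` bypasses the guard, which is what the generated `__init__` relies on.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Point:
    x: int
    y: int


p = Point(1, 2)

# Ordinary assignment is blocked on a frozen instance.
try:
    p.x = 10
    assignment_blocked = False
except FrozenInstanceError:
    assignment_blocked = True

# The generated __init__ sidesteps the guard by calling object.__setattr__
# directly; the same call is reproduced by hand here.
q = Point.__new__(Point)
object.__setattr__(q, "x", 3)
object.__setattr__(q, "y", 4)
```

The difference between the two Python versions is only the name through which the generated code reaches `object` (`BUILTINS.object` versus `__dataclass_builtins_object__`), as the quoted snippets show.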
PyTorch MergeBot	e408d9ca25	Revert "Migrate linux-focal-cuda11_8-py3_10-gcc9-build to ARC (#123721 )" This reverts commit d032a780080646828bdda15f3af0277288b2fa34. Reverted https://github.com/pytorch/pytorch/pull/123721 on behalf of https://github.com/malfet due to ARC is too flaky ([comment](https://github.com/pytorch/pytorch/pull/123721#issuecomment-2065750954))	2024-04-19 04:51:35 +00:00
PyTorch MergeBot	96a067b190	Revert "Migrate linux-focal-cuda12_1-py3_10-gcc9-build to ARC (#123722 )" This reverts commit b5d4ebe9aeabc1fc46ca39dee2d446f9b5e9e114. Reverted https://github.com/pytorch/pytorch/pull/123722 on behalf of https://github.com/malfet due to ARC is too flaky ([comment](https://github.com/pytorch/pytorch/pull/123722#issuecomment-2065749522))	2024-04-19 04:49:07 +00:00
Nikita Shulga	1ba85b34dd	[AOTI] Enable mmaped weights when CUDA is used (#124346 ) This is done by refactoring the logic that returns the pointer to the start of the constants into a `_get_constants_start()` method and calling it from both the CUDA and CPU readers. It has no runtime impact, but export time is down from 10m to 3m if mmaped weights are used on AWS p4d.24xlarge. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124346 Approved by: https://github.com/mikekgfb, https://github.com/desertfire	2024-04-19 04:47:27 +00:00
Kiuk Chung	87f44d70b1	[torch/distributed] Check gloo availability when doing isinstance(pg,… (#124233 ) Fixes a bug where a reference to `_ProcessGroupWrapper` is used without first checking whether gloo is available. This fails on pytorch builds that do not include gloo because `_ProcessGroupWrapper` is only pybinded when building with `USE_GLOO=1`. Therefore, creation of a new process group fails with a `NameError` when only NCCL is available as the backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124233 Approved by: https://github.com/rohan-varma, https://github.com/d4l3k	2024-04-19 04:07:00 +00:00
Chen, Zejun	768ce2cdda	[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247 ) This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing. #suppress-api-compatibility-check Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247 Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui	2024-04-19 03:31:13 +00:00
rraminen	803a08f8ae	[ROCm] Add cublasGemmAlgo_t -> hipblasGemmAlgo_t (#121030 ) This PR is to add cublasGemmAlgo_t -> hipblasGemmAlgo_t to cuda_to_hip_mappings.py. It is required for DeepSpeed transformer extension build on ROCm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121030 Approved by: https://github.com/jeffdaily, https://github.com/ezyang	2024-04-19 02:57:16 +00:00
Sam Larsen	290e3e7abb	Add ability to save TORCH_COMPILE_DEBUG logs for CI failures (#124408 ) Summary: The intent is that we can whitelist certain benchmarks to a) enable TORCH_COMPILE_DEBUG=1, and b) save the generated artifacts in test/debug in case of a failure. Via the rules in action.yml, we can then upload test/debug/ to S3 whenever it exists. I chose to introduce a new directory (test/debug/) rather than using an existing one (e.g., test/test-reports/), because these don't seem like test reports and we can later add other debug-related artifacts if we find it useful. For example, we might want to later explore including the inductor cache artifacts. Test Plan: See artifacts generated when I force a failure: https://hud.pytorch.org/pr/124234 Specifically: https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/8729891826/1/artifact/debug-test-inductor_torchbench-2-2-linux.g5.4xlarge.nvidia.gpu_23953679574.zip Pull Request resolved: https://github.com/pytorch/pytorch/pull/124408 Approved by: https://github.com/desertfire	2024-04-19 02:46:00 +00:00
rzou	889e3eeed3	Avoid cuda init to FakeTensorMode (#124413 ) Also partially fixes #122109 This PR: - We add a C++ flag (only_lift_cpu_tensors) to toggle the torch.tensor(1, device='cuda') ctor strategy. When false (default), it does the current PyTorch behavior of unconditionally constructing a concrete CUDA tensor then calling lift_fresh on it. When true, we instead construct a concrete CPU tensor, call lift_fresh, and then call Tensor.to(device) (under any ambient modes). - FakeTensorMode flips this flag depending on if CUDA is available or not. We don't unconditionally set the flag to True because that is likely BC-breaking. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124413 Approved by: https://github.com/eellison	2024-04-19 02:39:35 +00:00
chilli	e620c3e814	Optimized templated attention to use exp2 (#124356 ) 0.705 (vs. FA2) to 0.860 after this change. [before/after benchmark screenshots] Pull Request resolved: https://github.com/pytorch/pytorch/pull/124356 Approved by: https://github.com/drisspg	2024-04-19 01:58:19 +00:00
eqy	0bde4efa84	Fix broken test in `test_aot_inductor.py` (#124329 ) Doesn't seem to run in upstream CI due to sm90 requirement but it is failing on our end due to the extra positional argument Pull Request resolved: https://github.com/pytorch/pytorch/pull/124329 Approved by: https://github.com/chenyang78	2024-04-19 01:54:30 +00:00
Jianping Wu	0affd23014	Enable UFMT on test/test_python_dispatch.py (#124373 ) Part of https://github.com/pytorch/pytorch/issues/123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124373 Approved by: https://github.com/ezyang	2024-04-19 00:57:18 +00:00
Tristan Rice	ddd0ed1b43	distributed: templated ring attention (#124215 ) This adds a templated version of the ring attention forwards function and tests it with memory efficient attention. This doesn't add support for memory efficient attention in DTensor. That will be added in a follow up PR. This templating is also a POC of how to support other attention ops such as Jagged/nested tensor, as well as how to implement striped attention in a scalable way. Misc changes: * Fixes all_to_all_single autograd implementation with CUDA + adds NCCL test * Adds compile support to the ring attention implementations (required some tweaks to process groups) Test plan: ``` pytest test/distributed/_tensor/test_attention.py pytest test/distributed/test_functional_api.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124215 Approved by: https://github.com/wanchaol	2024-04-19 00:57:08 +00:00
Bin Bao	4946638f06	[AOTI] Add ABI-compatibility tests (#123848 ) Summary: In AOTInductor generated CPU model code, there can be direct references to some aten/c10 utility functions and data structures, e.g. at::vec and c10::Half. These are performance critical and thus it doesn't make sense to create a C shim for them. Instead, we make sure they are implemented in a header-only way, and use this set of tests to guard future changes. There are more header files to be updated, but we will do it in other followup PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123848 Approved by: https://github.com/jansel ghstack dependencies: #123847	2024-04-19 00:51:24 +00:00
Bin Bao	cbefaf2a37	[AOTI] Move c10/util ostream function implementations to their headers (#123847 ) Summary: AOTInductor generated code for CPU models may have direct reference to these c10-implemented data types, see _inductor/codegen/cpp_prefix.h. To make sure the AOTI generated code is ABI backward compatible, we need to change those headers to a header-only implementation. The next PR in this stack will add tests to use those data types without linking against libtorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123847 Approved by: https://github.com/jansel	2024-04-19 00:51:24 +00:00
JackCaoG	9ed9b22ec0	Implement efficient_conv_bn_eval_decomp_graph_transform to handle conv and bn fusion after decomp (#123680 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123680 Approved by: https://github.com/ezyang, https://github.com/youkaichao	2024-04-19 00:22:25 +00:00
Shuqiang Zhang	ca6a0e1348	[c10d] remove the env of TORCH_NCCL_ABORT_IN_DESTROY_PG (#124334 ) Summary: This ENV was introduced to safely roll out the behavior change in destroy process group (e.g., call ncclCommsAbort). Now that this behavior change has already been rolled out, we no longer need this env and we should clean it up to keep our code cleaner. Test Plan: Modified/existing ut pass Pull Request resolved: https://github.com/pytorch/pytorch/pull/124334 Approved by: https://github.com/wconstab	2024-04-18 23:42:55 +00:00
Iris Z	2f45be46f6	[DeviceMesh][Test] Add 3d unit test for `get_local_rank()` (#124142 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124142 Approved by: https://github.com/xunnanxu, https://github.com/fegin, https://github.com/XilunWu	2024-04-18 23:19:17 +00:00
Kulin Seth	e0792cf3d6	Make copy_cast, softmax and cat_out unranked (#123191 ) This helps with the performance as it removes multiple copies of the graphs saved due to their shapes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123191 Approved by: https://github.com/DenisVieriu97	2024-04-18 23:14:55 +00:00
eellison	e4f6340f21	realize inputs to mem bound mm decomposition (#123165 ) Differential Revision: [D55639709](https://our.internmc.facebook.com/intern/diff/D55639709) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123165 Approved by: https://github.com/jackiexu1992	2024-04-18 23:10:04 +00:00
Mikayla Gawarecki	5ba6bb7b2f	Add swap_tensors path to nn parametrizations (#124130 ) Fixes #123859 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124130 Approved by: https://github.com/albanD	2024-04-18 22:22:08 +00:00
Wei Wei	87f651c7e7	fix cpu test errors (#124116 ) Similar fix is from @int3 but not landed. Credit to @int3 too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124116 Approved by: https://github.com/chenyang78	2024-04-18 20:30:58 +00:00
ydwu4	2e48b39603	Fix example_value of map (#124203 ) Previously, we didn't expand the shape of example_value of map to the same as inputs (edit: the first mapped dimension). This pr fixes this bug. To make this easier, we change _call_function_and_unflatten_output to accept example_values directly instead of retrieving them from the variable trackers. Also remove a redundant call function node in strict_mode higher order op in dynamo. Test Plan: existing tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124203 Approved by: https://github.com/ezyang, https://github.com/zou3519	2024-04-18 19:18:36 +00:00
PyTorch MergeBot	4a0900d04b	Revert "[NJT] Inline through torch.nested.nested_tensor_from_jagged instead of graph break (#124343 )" This reverts commit ef93402f619f58d651845981ccd1eba1d68da077. Reverted https://github.com/pytorch/pytorch/pull/124343 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124343#issuecomment-2064937192))	2024-04-18 18:55:48 +00:00
PyTorch MergeBot	61bc188f42	Revert "[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449 )" This reverts commit b51f66c1950a582dd18d1b2ee67df840a8c4dbbe. Reverted https://github.com/pytorch/pytorch/pull/119449 on behalf of https://github.com/malfet due to Broke gcc9 builds ([comment](https://github.com/pytorch/pytorch/pull/119449#issuecomment-2064936414))	2024-04-18 18:53:59 +00:00
Sheng Fu	89407eca3b	Capture triton kernel in execution trace (#124140 ) Summary: This DIFF is to capture triton kernels in execution trace. Test Plan: buck test mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2 Differential Revision: D56162599 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124140 Approved by: https://github.com/briancoutinho	2024-04-18 18:38:26 +00:00
angelayi	74bedbb9e1	[export] Serialize rational symint ranges (#123884 ) Some symints result in rational ranges like 10/3 which runs into an error ([example](https://www.internalfb.com/intern/everpaste/?handle=GMG2AxkeoFUrh-UDAFcE8pKPgjoUbsIXAAAB)). Ed will eventually get rid(?) of these rational ranges but as a workaround export can just clamp the results during serialization time Pull Request resolved: https://github.com/pytorch/pytorch/pull/123884 Approved by: https://github.com/zhxchen17	2024-04-18 18:20:11 +00:00
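The clamping mentioned in #123884 above can be illustrated with a minimal sketch. It assumes the clamp rounds each rational bound inward to the nearest integer; the commit message does not say which direction export actually uses, and `clamp_range_to_ints` is a hypothetical helper, not a PyTorch function.

```python
import math
from fractions import Fraction


def clamp_range_to_ints(lower, upper):
    """Shrink a rational interval [lower, upper] to the integer interval inside it.

    Assumption: bounds are rounded inward (ceil for lower, floor for upper).
    """
    lo = math.ceil(lower)
    hi = math.floor(upper)
    if lo > hi:
        raise ValueError(f"no integer lies in [{lower}, {upper}]")
    return lo, hi


# A lower bound of 10/3 cannot be stored as an integer, so it becomes 4.
print(clamp_range_to_ints(Fraction(10, 3), Fraction(25, 2)))  # (4, 12)
```

Rounding inward keeps the serialized range a subset of the original one, which is one plausible reason to prefer it over rounding outward.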
egienvalue	b6f0159db0	Add test_cpp_extensions tests for stream_and_event and mtia_backend (#123614 ) Test the generic torch.Stream/Event with fake device guard and hooks. @exported-using-ghexport Differential Revision: [D55902506](https://our.internmc.facebook.com/intern/diff/D55902506/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123614 Approved by: https://github.com/albanD ghstack dependencies: #123611, #123612	2024-04-18 17:40:13 +00:00
Aaron Orenstein	37215a4fa2	Fix memory leak in pattern_matcher (#124345 ) #121313 changed precompiled patterns so they are more integrated with the pattern matching code. This resulted in a list of "known" patterns (with their example data) being stored globally. Unfortunately, since small FakeTensors store a constant of the original tensor, this meant that we leaked CUDA tensors in the example data. Fix this by clearing out the constant storage for the example data that we keep around. Fixes #124081 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124345 Approved by: https://github.com/xuzhao9	2024-04-18 17:38:12 +00:00
egienvalue	d7e1bf9ff9	torch.mtia module for MTIA device backend (#123612 ) MTIA device has its own Module in PyTorch now. torch.mtia has the following APIs, similar to other backends. The lazy_init is also supported. ``` __all__ = [ "init", "is_available", "synchronize", "device_count", "current_device", "current_stream", "default_stream", "set_stream", "stream", "device", ] ``` ------------ For device management. We expand AcceleratorHooksInterface to support generic device management and it can be used in both C++ and Python. ``` def _accelerator_hooks_device_count() -> _int: ... def _accelerator_hooks_set_current_device(device_index: _int) -> None: ... def _accelerator_hooks_get_current_device() -> _int : ... def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ... def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ... ``` --------- Adding get_device_module API to retrieve device modules for different device types. ``` def get_device_module(device: Optional[Union[torch.device, str]] = None) ``` --------- @exported-using-ghexport Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612 Approved by: https://github.com/albanD ghstack dependencies: #123611	2024-04-18 17:38:06 +00:00
egienvalue	cb17721899	Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611 ) This diff intends to build device generic torch.Stream and torch.Event for newly added accelerators in PyTorch. ------------ torch.Stream APIs ``` # Defined in torch/csrc/Stream.cpp class Stream(_StreamBase): stream_id: _int # Stream id device_index: _int device_type: _int device: _device # The device of the stream @overload def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ... @overload def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ... def query(self) -> _bool: ... def synchronize(self) -> None: ... def wait_event(self, event: Event) -> None: ... def wait_stream(self, other: Stream) -> None: ... def record_event(self, event: Optional[Event] = None) -> Event: ... def query(self) -> None: ... def synchronize(self) -> None: ... def __hash__(self) -> _int: ... def __repr__(self) -> str: ... def __eq__(self, other: object) -> _bool: ... ``` ------------------ torch.Event APIs: - IPC related APIs are not implemented, since many device backends don't support it, but we leave interfaces there for future adaption of torch.cuda.Stream. - currently only the enable_timing is supported, since it is the most common one used in other device backends. We have to refactor the event flag system in PyTorch to support more fancy flag. - elapsedTime API is added to c10::Event ``` # Defined in torch/csrc/Event.cpp class Event(_EventBase): device: _device # The device of the Event event_id: _int # The raw event created by device backend def __new__(self, device: Optional[DeviceLikeType] = None, enable_timing: _bool = False, blocking: _bool = False, interprocess: _bool = False) -> Event: ... @classmethod def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ... def record(self, stream: Optional[Stream] = None) -> None: ... 
def wait(self, stream: Optional[Stream] = None) -> None: ... def query(self) -> _bool: ... def elapsed_time(self, other: Event) -> _float: ... def synchronize(self) -> None: ... def ipc_handle(self) -> bytes: ... def __repr__(self) -> str: ... ``` ----------- c10::Event provides new APIs - calculate elapsedTime. - Get raw event id - Synchronize event. ``` double elapsedTime(const Event& event) const { return impl_.elapsedTime(event.impl_); } void* eventId() const { return impl_.eventId(); } void synchronize() const { return impl_.synchronize(); } ``` ---------- TODO: need to find a good way to test them in PyTorch with API mocks. Differential Revision: [D55351839](https://our.internmc.facebook.com/intern/diff/D55351839/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611 Approved by: https://github.com/albanD	2024-04-18 17:35:09 +00:00
Jason Ansel	7a6edb0b66	Possible fix for einops warning (#124084 ) See https://github.com/arogozhnikov/einops/issues/315 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124084 Approved by: https://github.com/peterbell10	2024-04-18 17:09:50 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	a8cf91c395	Fix predispatch tracing for aten::lift_fresh_copy (#124198 ) Differential Revision: D56200666 Previously, when we hit the Functionalize kernel for lift_fresh_copy, we directly dispatched self.clone() to proxy dispatch. As a result, we ended up receiving a functional tensor at proxy dispatch. As a workaround, I unwrap self manually. Not sure why it works OK in aot-dispatch, though. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124198 Approved by: https://github.com/bdhirsh	2024-04-18 17:02:38 +00:00
Zhengxu Chen	e1062f5738	[export] Add a printer to unflattened module. (#124315 ) Summary: add a helper method to print graph in every level of unflattened module. Test Plan: {F1489609684} Differential Revision: D56263195 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124315 Approved by: https://github.com/tugsbayasgalan	2024-04-18 16:35:51 +00:00
vfdev-5	415a8f6398	Fixed issue in affine_grid_backward when grad_grid is non-contiguous (#124370 ) Description: - replaced .view with .reshape to fix the problem when grad_grid is channels last 2d/3d - added a consistency test Fixes #124154 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124370 Approved by: https://github.com/lezcano	2024-04-18 16:30:10 +00:00
Boyuan Feng	aa2da0cdd2	[Export] Add runtime assert to non-strict export (#123681 ) This PR moves insert_deferred_runtime_asserts from dynamo to torch.fx.passes and uses it to add runtime assertion for non-strict export. Differential Revision: D55944267 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123681 Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi	2024-04-18 16:13:27 +00:00
Nikita Shulga	5677128cb8	[MPS] Fix crash when binary_cross_entropy is invoked for half dtypes (#124258 ) Fixed by creating constants using the input tensor's dtype. One-line reproducer: ``` python -c "import torch; x=torch.arange(3, dtype=torch.float16,device='mps');print(torch.nn.functional.binary_cross_entropy(x, x))" ``` Before the change: ``` loc("mps_subtract"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":233:0)): error: input types 'tensor<f32>' and 'tensor<3xf16>' are not broadcast compatible LLVM ERROR: Failed to infer result type(s). ``` After: ``` tensor(-33.7812, device='mps:0', dtype=torch.float16) ``` Fixes https://github.com/pytorch/pytorch/issues/124252 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124258 Approved by: https://github.com/kulinseth	2024-04-18 15:21:01 +00:00
soulitzer	ef93402f61	[NJT] Inline through torch.nested.nested_tensor_from_jagged instead of graph break (#124343 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124343 Approved by: https://github.com/jbschlosser	2024-04-18 14:42:54 +00:00
Andrew Gu	bbb6e36495	[FSDP2] Fixed `set_requires_gradient_sync`'s `recurse` arg (#124318 ) The `recurse` argument was not being respected for `set_requires_gradient_sync`. This PR fixes that. The previous unit test did not have nested FSDP modules with managed parameters, so the `recurse=False` was not being exercised. We augment the unit test to try only disabling gradient sync for the root module and not children. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124318 Approved by: https://github.com/weifengpy ghstack dependencies: #120952, #124293	2024-04-18 14:21:57 +00:00
PyTorch MergeBot	9385ef2a5d	Revert "Skip workspace permission change for ROCm CI (#123816 )" This reverts commit 4322a0e782119f870ba1a17aec2be8a0ef1103d7. Reverted https://github.com/pytorch/pytorch/pull/123816 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123816#issuecomment-2063949316))	2024-04-18 14:07:09 +00:00
Yu, Guangye	1325fd94a4	Support xpu autocast policy (#124052 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124052 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD	2024-04-18 14:06:48 +00:00
cyy	b51f66c195	[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449 ) This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449 Approved by: https://github.com/albanD	2024-04-18 13:35:48 +00:00
rzou	1542874311	Delete qualname from custom_op decorator (#124092 ) I forgot to delete this in an earlier PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124092 Approved by: https://github.com/albanD ghstack dependencies: #123937, #124064, #124065, #124066, #124071, #124089	2024-04-18 12:48:04 +00:00
rzou	648c39c47d	Add OpOverload.redispatch; use it in new custom ops API (#124089 ) A kernel has "dispatcher convention" if there is an additional keyset arg at the beginning of the argument list. This PR: - adds a way to register kernels with dispatcher_convention using Library.impl (pass dispatcher_convention = True) - adds OpOverload.redispatch We use both of the above in the new custom ops API: we register the autograd kernel in dispatcher convention so that we can actually call redispatch like how pytorch built-in ops do it. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124089 Approved by: https://github.com/albanD ghstack dependencies: #123937, #124064, #124065, #124066, #124071	2024-04-18 12:48:04 +00:00
rzou	645173a0b5	Add torch.library.register_autograd (#124071 ) Allows registering autograd for all custom op entry points: - the new-style custom op API (custom_op) - the old-style torch.library APIs - C++ operator registration Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124071 Approved by: https://github.com/albanD ghstack dependencies: #123937, #124064, #124065, #124066	2024-04-18 12:47:59 +00:00
rzou	8135c4b921	torch.library.register_fake now accepts more types (#124066 ) We allow it to accept: - a string with the op name - an opoverload - a new-style custom op If any of these are referring to a new-style custom op (created with the custom_op decorator), then we dispatch to CustomOpDef.register_fake. Otherwise, we do what we previously did. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124066 Approved by: https://github.com/albanD ghstack dependencies: #123937, #124064, #124065	2024-04-18 12:47:55 +00:00
Yu, Guangye	a0466061e1	Support xpu host allocator (#123080 ) # Motivation This PR mainly covers the caching host allocator supported on the xpu backend. # Solution `XPUCachingHostAllocator` adopts the same caching mechanism as CUDA via two abstract interfaces: `CachingHostAllocatorImpl` and `CachingHostAllocatorInterface`. # Additional Context Following CUDA, this PR adds a new API `getPinnedMemoryAllocator` to support pinning a tensor's memory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123080 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD	2024-04-18 12:29:21 +00:00
DanilBaibak	b5d4ebe9ae	Migrate linux-focal-cuda12_1-py3_10-gcc9-build to ARC (#123722 ) Migrate linux-focal-cuda12_1-py3_10-gcc9-build to ARC Pull Request resolved: https://github.com/pytorch/pytorch/pull/123722 Approved by: https://github.com/jeanschmidt	2024-04-18 12:06:57 +00:00
DanilBaibak	d032a78008	Migrate linux-focal-cuda11_8-py3_10-gcc9-build to ARC (#123721 ) Migrate linux-focal-cuda11_8-py3_10-gcc9-build to ARC Pull Request resolved: https://github.com/pytorch/pytorch/pull/123721 Approved by: https://github.com/jeanschmidt	2024-04-18 12:06:28 +00:00
xinan.lin	6fcbeb3489	[ATen] Add CPU fp16 support for nll_loss and cross_entropy_loss (#123256 ) Add CPU FP16 support for nll_loss and cross_entropy_loss. Resolve issue #123328. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123256 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet	2024-04-18 11:44:38 +00:00
IvanKobzarev	d59f1da62f	[sym_shapes][perf] _find not update unchanged replacements (#124274 ) Differential Revision: [D56236380](https://our.internmc.facebook.com/intern/diff/D56236380) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124274 Approved by: https://github.com/ezyang	2024-04-18 08:32:02 +00:00
IvanKobzarev	9eba1995d0	[sym_shapes][perf] Use sympy xreplace instead of subs (#124208 ) https://github.com/sympy/sympy/issues/22240 Differential Revision: [D56207553](https://our.internmc.facebook.com/intern/diff/D56207553) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124208 Approved by: https://github.com/ezyang, https://github.com/lezcano	2024-04-18 08:19:03 +00:00
PyTorch MergeBot	2b82345e48	Revert "Re-land precompile triton templates (#124030 )" This reverts commit 030bb13fe84c88ab5c988351543362b60fefb556. Reverted https://github.com/pytorch/pytorch/pull/124030 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2063191117))	2024-04-18 07:21:41 +00:00
Animesh Jain	704fac5618	[dynamo][cpp-guard] Reland Attempt 1 - Enable cpp guard manager (#124231 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124231 Approved by: https://github.com/jansel ghstack dependencies: #124230, #124237	2024-04-18 06:36:20 +00:00
PyTorch MergeBot	6e86a40694	Revert "[Dynamo] Check for __bool__ attribute before accessing it (#120943 )" This reverts commit dd7aeedb72f8a96d0f168308292e0d41c095f01b. Reverted https://github.com/pytorch/pytorch/pull/120943 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/120943#issuecomment-2063098295))	2024-04-18 06:34:32 +00:00
PyTorch MergeBot	8ff85b42f9	Revert "Add swap_tensors path to nn parametrizations (#124130 )" This reverts commit 64f6ddf12c11738c3f4b1ed01cf4f699541496bf. Reverted https://github.com/pytorch/pytorch/pull/124130 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124130#issuecomment-2063074856))	2024-04-18 06:12:54 +00:00
Yu, Guangye	ec608a5d66	Refactor CUDA's amp cast policy to be generic (#124051 ) # Motivation This PR intends to create several op lists for different policies: - `AT_FORALL_LOWER_PRECISION_FP` for policy `lower_precision_fp` - `AT_FORALL_FP32` for policy `fp32` - `AT_FORALL_FP32_SET_OPT_DTYPE` for policy `fp32_set_opt_dtype` - `AT_FORALL_PROMOTE` for policy `promote`. To make sure the other backend can reuse the policy op list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124051 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD ghstack dependencies: #124050	2024-04-18 04:35:25 +00:00
Zhuoran Zhao	8ad66e05d2	[4/x][AMD][Lowering Enablement] Enabling meta internal AOTInductor compilation on ROCM (#124123 ) Summary: as title Test Plan: CI & unit test Differential Revision: D56163334 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124123 Approved by: https://github.com/chenyang78, https://github.com/jansel	2024-04-18 04:19:37 +00:00
Xiaodong Wang	de1c0d2497	[cublas] Keep explicit workspace creation to avoid OOM (#124250 ) Summary: We explicitly set the cublas workspace even though CUDA 12.2+ fixed the issue where memory usage increased during graph capture. Original issue: https://github.com/pytorch/pytorch/pull/83461 This is because in CUDA 12.2+, the use of cudaMallocAsync in cublas will allocate memory dynamically (even if they're cheap) outside PyTorch's CUDA caching allocator. It's possible that CCA used up all the memory and cublas's cudaMallocAsync will return OOM Test Plan: CI Differential Revision: D56226746 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124250 Approved by: https://github.com/houseroad, https://github.com/eqy	2024-04-18 04:17:38 +00:00
xinan.lin	c9ab9248ce	[Inductor Intel GPU backend Upstream] Generalize device-bias code in (#124249 ) Generalize device-bias code in triton_utils.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/124249 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jansel	2024-04-18 03:54:31 +00:00
Yanan Cao (PyTorch)	27daa110c8	Back out "Refresh OpOverloadPacket if a new OpOverload gets added (#123578 )" (#124324 ) Summary: Original commit changeset: 528276bc8a92 Original Phabricator Diff: D56057952 Differential Revision: D56271240 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124324 Approved by: https://github.com/davidberard98	2024-04-18 03:33:54 +00:00
Animesh Jain	f213f262af	[dynamo][cpp-guards] Improve when to use Dict vs DictSubclassGuardManager (#124237 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124237 Approved by: https://github.com/jansel, https://github.com/mlazos ghstack dependencies: #124230	2024-04-18 03:33:37 +00:00
Iris Z	9fed2e826b	[DTensor][Test] Add unit tests to keep track of DTensor sharding for 2D (#123687 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123687 Approved by: https://github.com/wanchaol	2024-04-18 03:29:16 +00:00
William Wen	dca24d70ba	[dynamo, test] remove skip for unhandled exception test (#123876 ) This test might no longer segfault in CI due to changes to how we allocate and free shadow frames in dynamo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123876 Approved by: https://github.com/jansel	2024-04-18 03:02:34 +00:00
William Wen	812bae09be	[dynamo] fix 3.11+ refleak (#124238 ) Fixes https://github.com/pytorch/pytorch/issues/119607 for 3.11+. In 3.11+, `_PyFrame_FastToLocalsWithError` could implicitly run `COPY_FREE_VARS` on the original frame, leading to double increfs since the dynamo shadow frame can rerun `COPY_FREE_VARS`. So the solution is to skip the first `COPY_FREE_VARS` instruction in the shadow frame if it was already executed in the original frame. Also move the location for clearing the original frame in 3.12 to handle error cases more thoroughly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124238 Approved by: https://github.com/jansel	2024-04-18 03:02:29 +00:00
Simon Fan	7c94652d7d	[benchmarks] Add --use-warm-peak-memory (#124326 ) Measuring peak memory on the first run can capture cases where compiled artifacts leak into runtime, but it also introduces a lot of noise from cudnn/triton autotuning which generally uses as much memory as it can. Setting this flag as a default will need some discussion, so I will only add it to unblock compiled backward benchmarking (where all autotuning memory use is exposed) ``` e.g. resnet50 # without --warm-peak-memory memory: eager: 1.95 GB, dynamo: 6.68 GB, ratio: 0.29 # with --warm-peak-memory memory: eager: 1.96 GB, dynamo: 2.06 GB, ratio: 0.95 ``` ![image](https://github.com/pytorch/pytorch/assets/9547562/36cd8687-a7f7-4ec6-b989-7e1263aa7d37) This issue may also affect large models. Here's an example case of cudnn_convolution_backward autotuning allocating 30GB to tune a model otherwise using 5GB memory: ![image](https://github.com/pytorch/pytorch/assets/9547562/4e544b11-3579-4c69-811a-91d896f1ba66) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124326 Approved by: https://github.com/jansel ghstack dependencies: #119411	2024-04-18 02:57:01 +00:00
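The cold-versus-warm distinction in #124326 above can be sketched with the standard library alone. This is only an analogy under stated assumptions: `tracemalloc` stands in for the benchmark's GPU memory accounting, and `make_workload` and `peak_bytes` are hypothetical helpers, not part of the benchmark harness.

```python
import tracemalloc


def make_workload():
    """Build a step function with a large one-off allocation on its first call.

    The 10 MB buffer is a stand-in for autotuning scratch memory.
    """
    state = {"tuned": False}

    def step():
        if not state["tuned"]:
            scratch = bytearray(10_000_000)  # temporary "autotuning" buffer
            state["tuned"] = True
            del scratch
        return bytearray(100_000)  # steady-state working set

    return step


def peak_bytes(fn, warmup_runs=0):
    """Return the peak traced memory of one call to fn, after optional warm-up calls."""
    for _ in range(warmup_runs):
        fn()  # warm-up calls happen before tracing starts
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


cold = peak_bytes(make_workload())                 # includes the one-off buffer
warm = peak_bytes(make_workload(), warmup_runs=1)  # measured after a warm-up run
print(f"cold peak: {cold} bytes, warm peak: {warm} bytes")
```

As the commit message notes, the cold measurement is still useful for catching compiled artifacts that leak into runtime, so the warm measurement complements it rather than replacing it.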
Simon Fan	0ddd17bdc6	[benchmarks] Add --snapshot-memory to get memory pickles for eager vs compiled (#119411 ) creates memory snapshot pickles e.g. ``` inductor_no_cudagraphs_torchbench_amp_training_cuda_performance_compiled_pytorch_stargan.pickle inductor_no_cudagraphs_torchbench_amp_training_cuda_performance_eager_pytorch_stargan.pickle ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119411 Approved by: https://github.com/jansel	2024-04-18 02:57:01 +00:00
Animesh Jain	6b4b857a60	[dynamo][nn_module] Enable torch.compile/disable as decorators on the class (#124187 ) Support something like the following. This is a UI change, so please review carefully. ~~~ @torch._dynamo.disable class SimpleLinear(torch.nn.Module): def __init__(self): super().__init__() self.layer0 = torch.nn.Linear(4, 4) def forward(self, inp): return self.layer0(torch.sigmoid(inp)) @torch.compile(backend=cnts) class SimpleModel(torch.nn.Module): def __init__(self): super().__init__() self.layer0 = SimpleLinear() self.layer1 = torch.nn.Linear(4, 4) def forward(self, inp): z = self.layer0(torch.sin(inp)) return self.layer1(z) ~~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/124187 Approved by: https://github.com/yanboliang, https://github.com/jansel	2024-04-18 02:51:30 +00:00
Simon Fan	b6b757701e	[aot] trim refcount for subclass runtime wrapper (#124155 ) On torchtrain, before <img width="1218" alt="image" src="https://github.com/pytorch/pytorch/assets/9547562/b340c114-071a-440c-904c-c042de4d92c5"> after ![image](https://github.com/pytorch/pytorch/assets/9547562/ee3b6e6f-6e46-46bc-a93d-d4603673ee63) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124155 Approved by: https://github.com/jansel, https://github.com/bdhirsh ghstack dependencies: #124127	2024-04-18 02:34:52 +00:00
Sun, Jiayi	1f04c29be5	[inductor] Freeze the layout of the conv input to channels_last (#122765 ) Fix https://github.com/pytorch/pytorch/issues/118082. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122765 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2024-04-18 02:23:38 +00:00
Sun, Jiayi	51a56efbb9	[inductor] modify the output_stride of ConcatKernel (#122761 ) Fix https://github.com/pytorch/pytorch/issues/121613. Modify the `output_stride` of `ConcatKernel`: If any input to `Concat` is `Pointwise`, check the layout of all inputs to `Pointwise`, if any of the inputs is in channels_last format, set channels_last strides for the `output_stride`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122761 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2024-04-18 02:19:46 +00:00
Sun, Jiayi	78f3b99a94	[inductor] Modify the rules for freezing the layout of x.unwrap_view() in convert_to_reinterpret_view (#122760 ) Fix https://github.com/pytorch/pytorch/issues/121607 Modify the rules for freezing the layout of `x.unwrap_view()` in `convert_to_reinterpret_view`: If any read of `x.unwrap_view()` is in channels_last format, freeze the layout of `x.unwrap_view()` to channels_last format. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122760 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel	2024-04-18 02:17:07 +00:00
Shunting Zhang	b71423c2e4	[inductor] let coordesc tuner respect max RBLOCK (#124325 ) Fix https://github.com/pytorch/pytorch/issues/124251 . The coordesc tuner needs to respect max RBLOCK. When rnumel is a multiple of max-RBLOCK, inductor codegen will skip rmask. If the coordesc tuner does not consider max-RBLOCK and picks an RBLOCK larger than that, we get a CUDA IMA (illegal memory access) error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124325 Approved by: https://github.com/Chillee, https://github.com/jansel	2024-04-18 02:12:35 +00:00
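The failure mode described in #124325 above comes down to simple arithmetic, sketched here in plain Python. The value 2048 for the maximum RBLOCK and both helper functions are illustrative assumptions, not inductor's actual code.

```python
def out_of_bounds_reads(rnumel, rblock):
    """Count indices >= rnumel touched by an unmasked reduction loop.

    Each iteration reads a full block of rblock elements, with no mask.
    """
    touched = 0
    for start in range(0, rnumel, rblock):
        end = start + rblock
        touched += max(0, end - rnumel)
    return touched


def propose_rblock(current, max_rblock):
    """Neighbouring RBLOCK candidates (halve or double), dropping any above max_rblock."""
    candidates = [current // 2, current * 2]
    return [c for c in candidates if 1 <= c <= max_rblock]


MAX_RBLOCK = 2048        # assumed limit, for illustration only
rnumel = 3 * MAX_RBLOCK  # a multiple of MAX_RBLOCK, so the mask is assumed skipped

print(out_of_bounds_reads(rnumel, 2048))  # 0
print(out_of_bounds_reads(rnumel, 4096))  # 2048
print(propose_rblock(2048, MAX_RBLOCK))   # [1024]
```

With the mask skipped, a block of 4096 does not divide 6144 evenly, so the second iteration runs 2048 elements past the end; filtering candidates against the maximum avoids proposing such a block.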
Pearu Peterson	43b4ac956e	Add index_reduce decomposition (#122579 ) As in the title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122579 Approved by: https://github.com/peterbell10 ghstack dependencies: #123375	2024-04-18 01:30:47 +00:00
eellison	030bb13fe8	Re-land precompile triton templates (#124030 ) Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030 Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu	2024-04-18 01:22:13 +00:00
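The cache check described in #124030 above follows a common pattern, sketched here with hypothetical names. `select_kernel`, `fake_precompile`, and `fake_benchmark` are placeholders, and a plain dict stands in for whatever cache inductor actually uses.

```python
def select_kernel(key, cache, precompile, benchmark):
    """Return the cached choice for key, precompiling only on a cache miss."""
    if key in cache:
        return cache[key]        # cache hit: skip precompilation entirely
    compiled = precompile(key)   # expensive step, reached only on a miss
    choice = benchmark(compiled)
    cache[key] = choice
    return choice


calls = []


def fake_precompile(key):
    """Placeholder for template precompilation; records each invocation."""
    calls.append(key)
    return [f"{key}-cfg0", f"{key}-cfg1"]


def fake_benchmark(compiled):
    """Placeholder for benchmarking; simply picks the first candidate."""
    return compiled[0]


cache = {}
first = select_kernel("mm_128x128", cache, fake_precompile, fake_benchmark)
second = select_kernel("mm_128x128", cache, fake_precompile, fake_benchmark)
print(first, second, len(calls))  # mm_128x128-cfg0 mm_128x128-cfg0 1
```

Counting precompile invocations, as the `calls` list does here, is one simple way a test could assert that a cache hit performs no precompilation.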
albanD	fae31495ff	Try to speed up lintrunner in CI (#124311 ) Before timing: clang is 19min and noclang is 16min After timing: clang is 17min and noclang is 15min This is still crazy slow, so most likely more could be done, but I didn't check the logs in detail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124311 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2024-04-18 01:17:47 +00:00
Yu, Guangye	cc66c43d51	Make macro with AMP more generic (#124050 ) # Motivation According to [[RFC] Intel GPU Upstreaming](https://github.com/pytorch/pytorch/issues/114723), we would like to upstream the amp autocast policy to facilitate the functionality and accuracy of `torch.compile` on e2e benchmarks. # Solution The first PR aims to make the `KERNEL` macro generic. It accepts two types of inputs, like `(DISPATCH, OP, POLICY)` and `(DISPATCH, OP, OVERLOAD, POLICY)`. The second PR intends to refactor CUDA's autocast policy so that it can be shared with the `XPU` backend. The final PR adds support for the XPU autocast policy, which shares the same recipe as the `CUDA` backend. # Additional Context Another motivation is to unify the autocast API and provide generic APIs, such as: - `torch.get_autocast_dtype(device_type)` - `torch.set_autocast_dtype(device_type)` - `torch.is_autocast_enabled(device_type)` - `torch.set_autocast_enabled(device_type)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124050 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD	2024-04-18 01:15:03 +00:00
Michael Lazos	102a223216	Enable dynamo test_state_dict_deterministic (#123323 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123323 Approved by: https://github.com/janeyx99 ghstack dependencies: #123498, #123322	2024-04-18 01:06:28 +00:00
Michael Lazos	d88fcb86d8	Enable dynamo traced test_forloop_goes_right_direction (#123322 ) Removed a bunch of skips, I also updated test_forloop_goes_right_direction to not use the closure when dynamo is tracing. The reason for this is that testing the disabled optimizer doesn't actually test anything. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123322 Approved by: https://github.com/janeyx99 ghstack dependencies: #123498	2024-04-18 00:50:10 +00:00
Michael Lazos	57a3dc56d4	Small Adamax fix (#123498 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123498 Approved by: https://github.com/janeyx99	2024-04-18 00:50:03 +00:00
Yuanhao Ji	21f7cbdc1c	Enable UFMT on `test/test_autograd.py` (#124141 ) Part of: #123062 Ran lintrunner on: - `test/test_autograd.py` Detail: ```bash $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124141 Approved by: https://github.com/soulitzer	2024-04-18 00:16:23 +00:00
Catherine Lee	025387f4dd	[ez][CI] Reduce CI_SERIAL_LIST pt2 (#124298 ) #124085 Add @serialTest() to some tests. Slow gradcheck already runs serially. Doing this slowly so it's easier to check for flaky issues that might be introduced. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124298 Approved by: https://github.com/kit1980	2024-04-18 00:13:36 +00:00
William Wen	38bfe7bcd1	add link to custom ops troubleshooting page on tensor data_ptr error (#124240 ) Fix part of https://github.com/pytorch/pytorch/issues/123603. Example traceback on branch https://github.com/pytorch/vision/compare/main...wwen/custom_ops_test: ``` running my_custom_op! Traceback (most recent call last): File "/data/users/williamwen/torchvision/playground.py", line 13, in <module> print(opt_fn1(torch.randn(3, 3))) File "/data/users/williamwen/pytorch2/torch/_dynamo/eval_frame.py", line 387, in _fn return fn(args, kwargs) File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 977, in catch_errors return callback(frame, cache_entry, hooks, frame_state, skip=1) File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 818, in _convert_frame result = inner_convert( File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 411, in _convert_frame_assert return _compile( File "/data/users/williamwen/pytorch2/torch/_utils_internal.py", line 70, in wrapper_function return function(args, *kwargs) File "/data/users/williamwen/py310-env/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 700, in _compile guarded_code = compile_inner(code, one_graph, hooks, transform) File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 266, in time_wrapper r = func(args, *kwargs) File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 568, in compile_inner out_code = transform_code_object(code, transform) File "/data/users/williamwen/pytorch2/torch/_dynamo/bytecode_transformation.py", line 1116, in transform_code_object transformations(instructions, code_options) File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 173, in _fn return fn(args, kwargs) File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 515, in transform tracer.run() File 
"/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 2237, in run super().run() File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 875, in run while self.step(): File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 790, in step self.dispatch_table[inst.opcode](self, inst) File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 492, in wrapper return inner_fn(self, inst) File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 1260, in CALL_FUNCTION self.call_function(fn, args, {}) File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 730, in call_function self.push(fn.call_function(self, args, kwargs)) File "/data/users/williamwen/pytorch2/torch/_dynamo/variables/torch.py", line 747, in call_function tensor_variable = wrap_fx_proxy( File "/data/users/williamwen/pytorch2/torch/_dynamo/variables/builder.py", line 1425, in wrap_fx_proxy return wrap_fx_proxy_cls(target_cls=TensorVariable, kwargs) File "/data/users/williamwen/pytorch2/torch/_dynamo/variables/builder.py", line 1510, in wrap_fx_proxy_cls example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True) File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1804, in get_fake_value raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1736, in get_fake_value ret_val = wrap_fake_exception( File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1251, in wrap_fake_exception return fn() File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1737, in <lambda> lambda: run_node(tx.output, node, args, kwargs, nnmodule) File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1872, in run_node raise RuntimeError(make_error_message(e)).with_traceback( File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1854, in 
run_node return node.target(args, kwargs) File "/data/users/williamwen/pytorch2/torch/_ops.py", line 870, in __call__ return self_._op(args, *(kwargs or {})) torch._dynamo.exc.TorchRuntimeError: Failed running call_function torchvision.my_custom_op1((FakeTensor(..., size=(3, 3)),), **{}): The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory. If you're using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel. To fix this, please wrap the custom kernel into an opaque custom op. Please see the following for details: https://docs.google.com/document/d/1W--T6wz8IY8fOI0Vm8BF44PdBgs283QvpelJZWieQWQ from user code: File "/data/users/williamwen/torchvision/playground.py", line 5, in fn1 return torch.ops.torchvision.my_custom_op1(x) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124240 Approved by: https://github.com/zou3519	2024-04-18 00:08:09 +00:00
rzou	5a60a1abde	Move the implementation of register_fake onto torch.library.Library (#124065 ) Motivations: - This makes things more consistent: using a Library object, you should be able to do all of the registration APIs that tie registrations to the lifetime of the Library. - I need this for the next PR up in the stack, where we will have torch.library.register_fake support both CustomOpDef (from the new custom ops API) and other custom ops. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124065 Approved by: https://github.com/albanD ghstack dependencies: #123937, #124064	2024-04-17 23:51:20 +00:00
rzou	d1e1d671ef	Stop requiring a pystub for register_fake by default (#124064 ) Previously, if someone used `register_fake` to add a fake impl for an operator defined in C++, we would require them to add a `m.set_python_module(<module>)` call to C++. This was to avoid situations where a user imported the C++ operator without importing the fake impl. This "breaks" open registration: there's no way to add a fake impl outside of a repository that defines an operator, so we want to turn this behavior off by default in open source. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124064 Approved by: https://github.com/albanD ghstack dependencies: #123937	2024-04-17 23:51:20 +00:00
PyTorch MergeBot	f5049de242	Revert "[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449 )" This reverts commit 5bef127c2ea49280e7fda4f9fa7cad6fa4078e7d. Reverted https://github.com/pytorch/pytorch/pull/119449 on behalf of https://github.com/PaliC due to your using TORCH_INTERNAL_ASSERT incorrectly ([comment](https://github.com/pytorch/pytorch/pull/119449#issuecomment-2062696010))	2024-04-17 23:44:00 +00:00
Mikayla Gawarecki	64f6ddf12c	Add swap_tensors path to nn parametrizations (#124130 ) Fixes #123859 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124130 Approved by: https://github.com/albanD	2024-04-17 23:37:28 +00:00
Andrew Gu	b5235694f4	[FSDP2] Made `unshard` return type consistent (#124293 ) We can always return an `UnshardHandle` if `async_op=True` even if the FSDP module does not manage any parameters and hence does not have an `FSDPParamGroup`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124293 Approved by: https://github.com/weifengpy ghstack dependencies: #120952	2024-04-17 23:33:46 +00:00
Andrew M. James	64f42bfd52	[dynamo] Support list.reverse (#124210 ) fixes #123974 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124210 Approved by: https://github.com/peterbell10	2024-04-17 23:33:32 +00:00
Matthias Reso	dd7aeedb72	[Dynamo] Check for __bool__ attribute before accessing it (#120943 ) This PR checks if __bool__ attribute is available before accessing it when handling a UserDefinedObjectVariable Fixes #119782 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120943 Approved by: https://github.com/zou3519	2024-04-17 23:26:55 +00:00
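The `__bool__` check in the entry above mirrors how plain Python decides truthiness: a user-defined object need not define `__bool__` at all. The following is a minimal sketch of that fallback order in ordinary Python, not Dynamo's actual code; the helper name `truthiness_sketch` and the three classes are hypothetical and exist only for illustration.

```python
def truthiness_sketch(obj):
    """Approximate Python's truth-value protocol for an arbitrary object."""
    cls = type(obj)
    # Special methods are looked up on the type, not the instance.
    if hasattr(cls, "__bool__"):
        return cls.__bool__(obj)
    # Without __bool__, Python falls back to __len__ (empty means falsy).
    if hasattr(cls, "__len__"):
        return cls.__len__(obj) != 0
    # With neither method, every instance is truthy.
    return True


class NoBool:
    """Hypothetical class that defines neither __bool__ nor __len__."""


class EmptySized:
    """Hypothetical class that only defines __len__."""

    def __len__(self):
        return 0


class AlwaysFalse:
    """Hypothetical class that defines __bool__ explicitly."""

    def __bool__(self):
        return False
```

Accessing `__bool__` unconditionally on an instance of `NoBool` would raise `AttributeError`, which is why a presence check has to come first.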
Nikita Shulga	00372b1211	Extend int[48]mm ops to float32 input (#124287 ) Just for completeness Pull Request resolved: https://github.com/pytorch/pytorch/pull/124287 Approved by: https://github.com/mikekgfb	2024-04-17 23:10:49 +00:00
Diogo Teles Sant'Anna	14162eecfc	Update Security Policy to provide Security Guidance for users (#120531 ) Fixes #120530 Co-authored-by: albanD <desmaison.alban@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120531 Approved by: https://github.com/malfet, https://github.com/albanD	2024-04-17 23:08:48 +00:00
ZhiweiYan-96	9875a834e4	[Intel GPU] oneDNN GPU GEMM support (#117202 ) # Motivation This PR is a part of RFC #114848, and it is a successor PR of #116249 and #116019. This PR depends on the oneDNN compilation in #116249. Some runtime support from #116019 is also needed. ATen operators like `addmm` and `baddmm` are defined in `Blas.cpp` in `aten/src/ATen/native/mkldnn/xpu/`. Alongside these files, which provide the core functionality, `BlasImpl.h`, `Utils.h` and other files provide basic utilities for them. For instance, `Utils.h` provides common memory descriptor query utilities for `Matmul.h`, and these utility functions will also be used in other primitives, like `convolution`. `BlasImpl.h` is a header file that provides helpers for shape info processing in matmul-related operators. It helps not only basic GEMM operators like `addmm, baddmm` but also fusion operators used in `torch.compile`, like `linear_pointwise` in #117824. In the next stage, we will continue to complete the oneDNN support by enabling `matmul fusion` and `convolution` related code. Co-authored-by: xiaolil1 <xiaoli.liu@intel.com> Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/117202 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/malfet ghstack dependencies: #117098, #117112	2024-04-17 23:06:38 +00:00
vfdev-5	6330acae76	Refactored implementation for upsample_nearest decompostions (#122783 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122783 Approved by: https://github.com/peterbell10	2024-04-17 23:05:40 +00:00
Edward Z. Yang	bebdbb63ce	Introduce set_example_value and use it throughout Dynamo (#124176 ) I'm going to setup some extra behavior when we set example value, so I need a convenient place to interpose. I cannot easily do it on meta itself because its a generic dict with no interposition point. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124176 Approved by: https://github.com/oulgen ghstack dependencies: #124105, #124059	2024-04-17 22:57:11 +00:00
Tugsbayasgalan Manlaibaatar	d23bf9cef0	Add fake impl for aten.unique2 (#124306 ) Reapply of: https://github.com/pytorch/pytorch/pull/121571 Differential Revision: [D56258431](https://our.internmc.facebook.com/intern/diff/D56258431) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124306 Approved by: https://github.com/gmagogsfm	2024-04-17 22:55:27 +00:00
ZhiweiYan-96	cc18afa25f	Intel GPU oneDNN upstreaming for primitive integration (#117112 ) # Motivation As proposed in https://github.com/pytorch/pytorch/issues/114848 and https://github.com/pytorch/pytorch/issues/114723, the oneDNN library is an important component of the Intel GPU software ecosystem. The current PR is based on #117098, where the oneDNN library for Intel GPU should be ready. This PR is the integration code from ATen to oneDNN. GEMM integration code is the core part of this PR. Along with GEMM, more basic support like runtime (device, stream) and primitive attr is also included. We put the oneDNN integration code in the directory `aten/src/ATen/native/mkldnn/xpu/detail`. We add a namespace `at::native::xpu::onednn` for oneDNN integration. The code in this PR will be used in following PRs, where ATen operators will call the functions in this integration code. We separate the PRs because the oneDNN integration is logically separable from the ATen operator implementation. This also eases the burden of reviewing by avoiding too much code in a single PR. Co-authored-by: xiaolil1 <xiaoli.liu@intel.com> Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/117112 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD	2024-04-17 22:49:56 +00:00
PyTorch MergeBot	944d046645	Revert "[DeviceMesh][Test] Add 3d unit test for `get_local_rank()` (#124142 )" This reverts commit a403757913689d200683a4158c565bc3dbade74b. Reverted https://github.com/pytorch/pytorch/pull/124142 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/124142#issuecomment-2062587289))	2024-04-17 22:31:30 +00:00
Tristan Rice	1ec05c769b	all_gather and reduce_scatter autograd (#123989 ) This adds `all_gather_tensor_autograd` and `reduce_scatter_tensor_autograd` to the functional_collectives library. This only supports `sum` mode for `reduce_scatter` but should be easy to extend in the future. The backwards implementations match the behavior in https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py This follows the pattern of #123599 . Test plan: ```sh pytest test/distributed/test_functional_api.py -k Autograd ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123989 Approved by: https://github.com/wanchaol	2024-04-17 21:32:22 +00:00
wz337	a403757913	[DeviceMesh][Test] Add 3d unit test for `get_local_rank()` (#124142 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/124142 Approved by: https://github.com/xunnanxu, https://github.com/fegin, https://github.com/XilunWu	2024-04-17 20:45:49 +00:00
wz337	cdc855af97	[Test][2D] Turn on 2D state_dict tests for uneven sharding (#124255 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/124255 Approved by: https://github.com/wanchaol	2024-04-17 20:45:34 +00:00
Xuehai Pan	93e249969b	[BE] enable `ruff` rule `RSE` and remove useless parentheses in `raise` statements (#124261 ) Remove useless parentheses in `raise` statements if the exception type is raised with no argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261 Approved by: https://github.com/albanD	2024-04-17 19:29:34 +00:00
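The `RSE` change in the entry above relies on a Python fact: raising an exception class with no arguments instantiates it implicitly, so the empty parentheses are redundant. A minimal sketch in plain Python follows; the function names are hypothetical and not taken from the PyTorch codebase.

```python
def check_with_parens(x):
    if x < 0:
        # Empty call parentheses: the form the lint rule asks to simplify.
        raise ValueError()
    return x


def check_without_parens(x):
    if x < 0:
        # Python instantiates the class itself when a bare class is raised.
        raise ValueError
    return x


def raised_exception(fn, arg):
    """Return the exception instance raised by fn(arg), or None."""
    try:
        fn(arg)
    except ValueError as exc:
        return exc
    return None
```

Both functions raise a `ValueError` instance with empty `args`, so removing the parentheses does not change behavior.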
eqy	b726a23d4e	change `tf32` thresholds for `test_per_sample_grads_embeddingnet` (#124104 ) TF32 causes issues with the tolerances here; we might also consider migrating some of the `with_tf32_off` tests in this file to `tf32_on_and_off` in case it would be useful to get signal for TF32. CC @malfet @atalman Pull Request resolved: https://github.com/pytorch/pytorch/pull/124104 Approved by: https://github.com/zou3519	2024-04-17 19:16:32 +00:00
doloresgarcia	4efdf9a6a6	fix pytorch version for onnx in doc (#124182 ) Fixes [ 123845](https://github.com/pytorch/pytorch/issues/123845) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124182 Approved by: https://github.com/albanD	2024-04-17 18:05:15 +00:00
Oguz Ulgen	24cecf06d7	Update autotune jk knobs (#124214 ) Differential Revision: [D56201145](https://our.internmc.facebook.com/intern/diff/D56201145/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124214 Approved by: https://github.com/aakhundov	2024-04-17 17:49:25 +00:00
Animesh Jain	f433517181	[dynamo][decorator] Support disable on nn modules (#124185 ) Fixes https://github.com/pytorch/pytorch/issues/123979 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124185 Approved by: https://github.com/weifengpy, https://github.com/yoyoyocmu	2024-04-17 16:20:34 +00:00
Nikita Shulga	46324fe073	Speedup int4mm_kernel with NEON (#124257 ) By unrolling the middle loop by 16 elements and using NEON to decode packed int4 to float32. Unrolling the entire `n` loop actually makes it a tad slower, probably because ARM has a smaller register file than x86. Before/after performance running stories110M on M2Pro: eager 28 (before) vs. 57 (after); compile 31 (before) vs. 104 (after). Pull Request resolved: https://github.com/pytorch/pytorch/pull/124257 Approved by: https://github.com/mikekgfb	2024-04-17 16:04:25 +00:00
maple@max	9b1d6c8d98	improve F.adaptive_avg_pool2d error messages on mps (#124143 ) Gives better error messages on mps. Partially fixes #123725 in the case of `F.adaptive_avg_pool2d`. This also relates to #96056. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124143 Approved by: https://github.com/albanD, https://github.com/malfet	2024-04-17 16:04:09 +00:00
Xuehai Pan	7e1c98c171	[dynamo] support `object.__setattr__(obj, name, value)` (#124068 ) Resolves #114964 Resolves #114966 - #114964 - #114966 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124068 Approved by: https://github.com/jansel	2024-04-17 15:57:14 +00:00
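For context on the entry above, `object.__setattr__(obj, name, value)` is the standard way to bypass a class's own `__setattr__` override, for example on a frozen dataclass. A minimal sketch in plain Python (no `torch` involved; the `Point` class is hypothetical) shows the behavior a tracer would need to preserve:

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Point:
    x: int
    y: int


p = Point(1, 2)

# Normal attribute assignment goes through the frozen __setattr__ and fails.
try:
    p.x = 10
    blocked = False
except FrozenInstanceError:
    blocked = True

# Calling the base implementation directly skips the override.
object.__setattr__(p, "x", 10)
```

After the last line `p.x` is 10 even though ordinary assignment was rejected.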
PyTorch MergeBot	36f6928a37	Revert "[Profiler][PrivateUse1] Profiler support PrivateUse1 key (#120556 )" This reverts commit 41613a0803f7cde7956f039bc80f94253b0843f9. Reverted https://github.com/pytorch/pytorch/pull/120556 on behalf of https://github.com/aaronenyeshi due to Breaks GPU Chrome trace UI ([comment](https://github.com/pytorch/pytorch/pull/120556#issuecomment-2061578951))	2024-04-17 15:38:14 +00:00
Pearu Peterson	d2b0c0a34e	Fix index_reduce sampler filter when op_info.variant_test_name is specified (#123375 ) As in the title: `index_reduce` sample must correspond to reduction type specified by `variant_test_name`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123375 Approved by: https://github.com/zou3519, https://github.com/peterbell10	2024-04-17 15:31:28 +00:00
Nikita Shulga	5a735ece6b	Remove @abock from ONNX approvers/codeowners (#124259 ) As he is no longer interested in the project Pull Request resolved: https://github.com/pytorch/pytorch/pull/124259 Approved by: https://github.com/kit1980, https://github.com/BowenBao	2024-04-17 14:13:53 +00:00
Nikita Shulga	b880a71010	[BE] Add missing `std::` prefix to `Unique.mm` (#124232 ) Follow up after https://github.com/pytorch/pytorch/pull/124117 fixes following warning ``` /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Unique.mm:282:26: warning: use of function template name with no prior declaration in function call with explicit template arguments is a C++20 extension [-Wc++20-extensions] return std::make_tuple(get<0>(out).to("mps"), get<1>(out).to("mps"), get<2>(out).to("mps")); ^ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124232 Approved by: https://github.com/kit1980, https://github.com/Skylion007	2024-04-17 14:12:29 +00:00
Kai Londenberg	5f378e1853	[Inductor cutlass backend] Fix flaky test ( CUDA IMA ) (#124106 ) A unit test within test_cutlass_backend.py can fail with CUDA illegal memory accesses due to the fact that some CUTLASS Kernels contain bugs. By using autotuning in subprocesses, this CUDA illegal memory access simply leads to the buggy Cutlass Kernels being filtered out, instead of causing it to bring down the entire process. Test Plan: This is a change to a unit test. It's recommended to use autotune_in_subproc when using the Cutlass backend anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124106 Approved by: https://github.com/eellison	2024-04-17 13:19:13 +00:00
rzou	47dbfecd37	Rename impl_abstract to register_fake, part 1/2 (#123937 ) This PR: - adds a new torch.library.register_fake and deprecates torch.library.impl_abstract. The motivation is that we have a lot of confusion around the naming so we are going to align the naming with the actual subsystem (FakeTensor). - renames `m.impl_abstract_pystub("fbgemm_gpu.sparse_ops")` to `m.has_python_registration("fbgemm_gpu.sparse_ops")`. No deprecation here yet; I need to test how this works with static initialization. - Renames a bunch of internals to match (e.g. abstractimplpystub -> pystub). I'm scared to rename the Python-side internal APIs (e.g. torch._library.abstract_impl) because of torch.package concerns. I'll do that in its own isolated PR next just in case it causes problems. DEPRECATION NOTE: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use register_fake. We'll delete impl_abstract in a future version of PyTorch. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123937 Approved by: https://github.com/albanD	2024-04-17 12:46:01 +00:00
Yuanhao Ji	6efcb6c718	Fix wrong ufmt exclusions in `.lintrunner.toml` (#124135 ) Part of: #123062 In this pull request(#123809), there were some exclusions that should have been removed, but weren't. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124135 Approved by: https://github.com/ezyang	2024-04-17 12:22:50 +00:00
PyTorch MergeBot	2dc15b6849	Revert "[sparse] Add fast semi-structured spasification kernels (#122350 )" This reverts commit 14b2273b0c58b4000e10b2e441341eeafb7dd2f6. Reverted https://github.com/pytorch/pytorch/pull/122350 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/122350#issuecomment-2061070350))	2024-04-17 11:47:02 +00:00
PyTorch MergeBot	3f89f565bb	Revert "Re-land precompile triton templates (#124030 )" This reverts commit d68196e7ef5eb8f62064ef70c75032f4d8b4a4fa. Reverted https://github.com/pytorch/pytorch/pull/124030 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2061044960))	2024-04-17 11:31:33 +00:00
PyTorch MergeBot	77ad630f5d	Revert "Dont precompile already seen keys, limit epilogue choices (#122642 )" This reverts commit 050051f412e50d98d506adf0d05aa6e4ceab54bd. Reverted https://github.com/pytorch/pytorch/pull/122642 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2061044960))	2024-04-17 11:31:32 +00:00
FFFrog	acc466751b	Add bfloat16 support to binary_cross_entropy for CPU (#123823 ) Fixes #123715 As the title stated. But, maybe we should pay attention to this https://github.com/pytorch/pytorch/pull/33206, which removed the half support for cpu about 4 years ago. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123823 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-04-17 09:44:07 +00:00
DanilBaibak	c4878abab0	Fix Setup Linux for ARC (#124171 ) We can't get information about `ami-id`, `instance-id`, `instance-type` for the ARC runners: ``` 2024-04-16T11:10:17.0098276Z curl: (22) The requested URL returned error: 401 2024-04-16T11:10:17.0110775Z ami-id: 2024-04-16T11:10:17.0159131Z curl: (22) The requested URL returned error: 401 2024-04-16T11:10:17.0167378Z instance-id: 2024-04-16T11:10:17.0219464Z curl: (22) The requested URL returned error: 401 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124171 Approved by: https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/zxiiro	2024-04-17 09:25:02 +00:00
chunyuan	d0211e207c	inductor cpp wrapper: add GIL release back (#123897 ) Fixes https://github.com/pytorch/pytorch/issues/123517. This PR adds the GIL release (originally added in https://github.com/pytorch/pytorch/pull/111888) back following the suggestion here: https://github.com/pytorch/pytorch/pull/123897#discussion_r1562509705. We added a default constructor and an assignment operator for the `RAIIPyObject` class (https://github.com/pytorch/pytorch/pull/123897#discussion_r1566262575) in order to declare the `custom_op_wrapper` outside of the GIL acquisition scope. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123897 Approved by: https://github.com/peterbell10, https://github.com/jgong5	2024-04-17 07:18:14 +00:00
Yuanhao Ji	e3effa5855	Enable UFMT on all of `test/distributed` (#123539 ) Partially addresses #123062 Ran lintrunner on: - `test/distributed` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539 Approved by: https://github.com/ezyang	2024-04-17 06:46:02 +00:00
bhack	ed22dde877	Pointer to the nonzero limit ticket (#124244 ) For the nonzero impl limits, we are still asking users at runtime to file a new ticket, but we already have more than one. So I am pointing to the current open ticket instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124244 Approved by: https://github.com/ezyang	2024-04-17 06:15:36 +00:00
Tugsbayasgalan Manlaibaatar	dd3cea3291	Fix derived dim bugs in ep.run_decomp (#123326 ) Differential Revision: [D55730289](https://our.internmc.facebook.com/intern/diff/D55730289) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123326 Approved by: https://github.com/avikchaudhuri	2024-04-17 04:00:55 +00:00
Edward Z. Yang	236b0d12fa	Don't clamp slices generated from cat kernel (#124139 ) Fixes https://github.com/pytorch/pytorch/issues/123793 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124139 Approved by: https://github.com/Microve, https://github.com/peterbell10, https://github.com/Skylion007	2024-04-17 03:13:10 +00:00
eellison	050051f412	Dont precompile already seen keys, limit epilogue choices (#122642 ) Two changes: - in epilogue benchmark fusion, only take top 6 choices. There were basically no choices taken after this in HF. - Share a single precompilation function among matmuls with same key. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122642 Approved by: https://github.com/shunting314 ghstack dependencies: #124030	2024-04-17 03:08:59 +00:00
Animesh Jain	51cc808ac7	[dynamo][cpp-guards] Missing decref on early returns in DictSubclassGuardManager (#124230 ) I am sad that I missed this earlier. Good thing is that CI caught it. Will be more careful next time. This was the reason https://github.com/pytorch/pytorch/pull/123547 is reverted - https://github.com/pytorch/pytorch/pull/123547#issuecomment-2058350245 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124230 Approved by: https://github.com/mlazos	2024-04-17 02:49:07 +00:00
eellison	d68196e7ef	Re-land precompile triton templates (#124030 ) Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030 Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu	2024-04-17 02:30:46 +00:00
CK Luk	32ca18ea3b	Handle the case when one of the output of forward pass is None (#123988 ) Summary: When applying FSDP-2 to FM-FB benchmark with FullModel model, we ran into an error that one of the output tensors of a forward pass is None. I double checked that the same output tensor is also None in FSDP-1. So, we just need to handle the None properly here. Test Plan: See that in the internal diff. Differential Revision: D56087956 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123988 Approved by: https://github.com/awgu	2024-04-17 02:18:32 +00:00
Valentine233	6e4c4e93b6	[Inductor] add contiguous layout optm for bmm input (#122599 ) Fixes #117743. Add contiguous layout optimization for `bmm` input, to avoid additional copies. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122599 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison	2024-04-17 02:12:20 +00:00
Oguz Ulgen	1fd9e320ea	Remove unnecessary FileLock in Fx Graph Cache (#124212 ) Writing to file happens via `write_atomic`, there's no need to take a global lock on the file system. This is likely creating unnecessary waits. Differential Revision: [D56208628](https://our.internmc.facebook.com/intern/diff/D56208628/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124212 Approved by: https://github.com/masnesral, https://github.com/eellison	2024-04-17 01:02:41 +00:00
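The reasoning in the entry above is that an atomic write already protects readers from half-written files, so a global lock adds only waiting. The following is a minimal sketch of the general temp-file-plus-rename pattern, assuming a POSIX-like filesystem; `write_atomic_sketch` is a hypothetical helper and not the implementation of `write_atomic` used by the FX graph cache.

```python
import os
import tempfile


def write_atomic_sketch(path, data: bytes):
    """Write data to path so readers see the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the destination directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # os.replace swaps the file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern, concurrent writers may overwrite each other, but the last completed rename wins and no reader observes a torn file.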
Theodore Ehrenborg	f56c4572a6	Fix typos in docs (#124218 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124218 Approved by: https://github.com/albanD	2024-04-17 00:46:08 +00:00
Andrew Gu	bf45ac8c98	[FSDP2] Added explicit `unshard(async_op)` API (#120952 ) This PR adds an `unshard(async_op: bool = False)` API to manually unshard the parameters via all-gather. This can be used for reordering the all-gather with other collectives (e.g. all-to-all). This currently requires the user to set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` to avoid `recordStream` from `ProcessGroupNCCL` and get expected memory behaviors. Differential Revision: [D56148725](https://our.internmc.facebook.com/intern/diff/D56148725) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120952 Approved by: https://github.com/wanchaol	2024-04-17 00:39:34 +00:00
Catherine Lee	0abd3f60fd	[CI] Reduce CI_SERIAL_LIST list (#124085 ) Add a serial marker for individual tests so the test file can be removed from the CI serial list. Run serial-marked tests first, in serial. Run all other tests afterwards, in parallel. Slowly reduce the list and mark individual tests as serial instead. Hope the number of serial tests is small so sharding evenness doesn't get too messed up. Hopefully can do 3 procs for sm86 and cpu? serial no longer looks like a real word to me. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124085 Approved by: https://github.com/seemethere, https://github.com/malfet	2024-04-17 00:23:47 +00:00
Catherine Lee	946b50c788	[ez][TD] Increase logging (#124082 ) Increase logging during TD; generate an artifact that says which tests got excluded; fix a minor bug where filter test configs couldn't get commit messages. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124082 Approved by: https://github.com/seemethere, https://github.com/malfet	2024-04-17 00:18:28 +00:00
IvanKobzarev	e7cf6f81ea	[sym_shapes][perf] Skip assert in check_is_size (#124209 ) Differential Revision: [D56207943](https://our.internmc.facebook.com/intern/diff/D56207943) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124209 Approved by: https://github.com/ezyang	2024-04-17 00:10:06 +00:00
Edward Z. Yang	cebf65126c	FakeTensorProp assert consistency of sizes when metadata previously existed (#124059 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124059 Approved by: https://github.com/bdhirsh, https://github.com/thiagocrepaldi ghstack dependencies: #124105	2024-04-16 23:28:42 +00:00
Jithun Nair	4322a0e782	Skip workspace permission change for ROCm CI (#123816 ) PR https://github.com/pytorch/pytorch/pull/122922 added chown steps to test.sh and used the trap mechanism to ensure that, even if the test script fails and exits with a non-zero code, it will call the cleanup_workspace function on EXIT. However, this doesn't work as intended when the CI job gets cancelled, e.g. if a PR pushes new commits and the older commit's CI job gets cancelled. The trap function doesn't get called because the test script aborts immediately. Any subsequent jobs scheduled on the same runner then fail in the 'Checkout PyTorch' step when they try to delete the workspace. This has been resulting in a slew of CI failures on the HUD. Example of this situation playing out on one of the ROCm runners: Cancelled job: https://github.com/pytorch/pytorch/actions/runs/8563212279/job/23469711035 ![image](https://github.com/pytorch/pytorch/assets/37884920/7192e4fe-8cff-4256-abc8-9f874a3918ff) Subsequent failed job: https://github.com/pytorch/pytorch/actions/runs/8564517036/job/23472675041 ![image](https://github.com/pytorch/pytorch/assets/37884920/24b0af66-cfe9-431f-851a-24a1ccc18e84) This PR skips the logic introduced by PR 122922 for ROCm CI. Alternative to https://github.com/pytorch/pytorch/pull/123468 and https://github.com/pytorch/pytorch/pull/123588 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123816 Approved by: https://github.com/pruthvistony, https://github.com/zxiiro, https://github.com/kit1980	2024-04-16 23:26:34 +00:00
andrewor14	3eea300680	[quant] Do not decompose choose_qparams_per_token_asymmetric (#124178 ) Summary: https://github.com/pytorch/pytorch/pull/123452 added backward support to this op by turning it into CompositeImplicitAutograd, which meant it gets decomposed during export/compile. However, this is not desirable behavior for the PTQ case when we try to lower the model. This commit enables QAT without breaking PTQ by refactoring the impl into a separate op that does have backward support. Test Plan: python test/test_quantization.py -k test_decomposed_choose_qparams_per_token_asymmetric_backward Reviewers: jerryzh168, digantdesai, zou3519 Subscribers: jerryzh168, digantdesai, zou3519, supriyar Differential Revision: [D56192116](https://our.internmc.facebook.com/intern/diff/D56192116) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124178 Approved by: https://github.com/digantdesai	2024-04-16 22:58:48 +00:00
Shunting Zhang	3e90e93a78	[inductor] disable comprehensive padding in fbcode (#124191 ) Comprehensive padding causes a small NE change and fails an internal test. Disable it for the internal use case to mitigate. Differential Revision: [D56197430](https://our.internmc.facebook.com/intern/diff/D56197430) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124191 Approved by: https://github.com/jansel	2024-04-16 22:44:08 +00:00
Xilun Wu	b3f88317ec	[dtensor][5/N] have table-wise sharding use LocalShardsWrapper on participating ranks only (#122853 ) Summary We wrap DTensor's local tensor in `LocalShardsWrapper` for torchrec's table-wise sharding. The exception is non-participating ranks, where the local tensor is an empty torch.Tensor object. The reason for this design is to avoid the complexity of supporting the empty tensor case in `LocalShardsWrapper`. Test `torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122853 Approved by: https://github.com/wz337 ghstack dependencies: #120265, #121392, #122843	2024-04-16 22:27:30 +00:00
Xilun Wu	d419fcd19f	[dtensor][4/N] have row-wise sharding always use LocalShardsWrapper (#122843 ) Summary Always wrap the local tensor in a `LocalShardsWrapper`. This is for uniformity, and it eases adoption of DTensor as a wrapper for local shard(s) representation. To support more tensor ops over `LocalShardsWrapper`, users need to extend its `__torch_dispatch__`. Test `torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise-even` Result ``` Row-wise even sharding example in DTensor Col 0-15 ------- ---------- Row 0-1 cuda:0 Row 2-3 cuda:1 Row 4-5 cuda:2 Row 6-7 cuda:3 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122843 Approved by: https://github.com/wz337 ghstack dependencies: #120265, #121392	2024-04-16 22:27:30 +00:00
Xilun Wu	1d7ac7baa0	[dtensor][3/N] add torchrec row-wise uneven sharding example (#121392 ) Test `torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise-uneven` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121392 Approved by: https://github.com/wanchaol ghstack dependencies: #120265	2024-04-16 22:27:28 +00:00
Xilun Wu	9d3543df9a	[dtensor][2/N] add torchrec table-wise sharding example (#120265 ) Summary This PR serves as a start of this effort by adding an example test that represents TorchRec's `ShardingType.TABLE_WISE` using DTensor. Test `torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise` Pull Request resolved: https://github.com/pytorch/pytorch/pull/120265 Approved by: https://github.com/wanchaol	2024-04-16 22:27:24 +00:00
PyTorch MergeBot	9d88339b53	Revert "make sure dynamo doesn't inline DTensor __new__ or __torch_dispatch__ (#123347 )" This reverts commit 63dcb5b0f2ef3578e81841fd8a2166e732c0ca99. Reverted https://github.com/pytorch/pytorch/pull/123347 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/123347#issuecomment-2059994989))	2024-04-16 22:08:24 +00:00
Lucas Pasqualin	440e4353c7	[DCP] Remove overlapping loader in async case (#123942 ) In the async case, the state dict is already on CPU, so maintaining this buffer makes no sense. Additionally, using the overlapping cpu loader introduces new cuda synchronize calls, leading to additional unnecessary overhead. Differential Revision: [D56065250](https://our.internmc.facebook.com/intern/diff/D56065250/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123942 Approved by: https://github.com/fegin ghstack dependencies: #123941	2024-04-16 21:19:31 +00:00
Shawn Xu	606c4f1367	[PT] [ST] fix test_sharded_tensor (#124103 ) Summary: https://github.com/pytorch/pytorch/pull/123230 formalizes the rank validation to support sub groups. It broke a few UTs, some of which got fixed in https://github.com/pytorch/pytorch/pull/123778 This is to fix the remaining one reported by DanilBaibak Test Plan: CI Differential Revision: D56155076 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124103 Approved by: https://github.com/fegin	2024-04-16 21:18:22 +00:00
Lucas Pasqualin	46a25cc0db	[DCP] Adds support for non-primitives in async_save by deep copying during cpu offloading (#123941 ) Adds support for non-primitives in async_save by deep copying during cpu offloading. If users are not type checking, the expectation in async is likely that the object is copied. Differential Revision: [D56065237](https://our.internmc.facebook.com/intern/diff/D56065237/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123941 Approved by: https://github.com/fegin	2024-04-16 20:49:25 +00:00
Jesse Cai	14b2273b0c	[sparse] Add fast semi-structured sparsification kernels (#122350 ) This PR adds fast semi-structured sparsification kernels to PyTorch. These kernels allow for accelerated semi-structured sparsification in PyTorch. The kernels have been added as aten native functions. In particular, three new functions have been added: * `torch._sparse_semi_structured_tile` This function will return the packed representation and metadata for both X and X', as well as the thread masks. Note that this applies 2:4 sparsity in a 4x4 tile instead of a 1x4 strip as usual. * `torch._sparse_semi_structured_apply` This function takes in an input tensor and thread masks from the above function and returns a packed representation and metadata from applying thread masks to the input tensor. * `torch._sparse_semi_structured_apply_dense` This function does the same thing as above, but instead of returning the tensor in the sparse representation it returns it in the dense representation. The subclasses have also been updated to add a new `prune_dense_static_sort` classmethod to create sparse tensors with this format. I've added some additional documentation on how to calculate the compressed tensors needed to create a SparseSemiStructuredTensor oneself. To this end, there are two new helper functions added: `sparse_semi_structured_tile` `compute_compressed_swizzled_bitmask` Differential Revision: [D56190801](https://our.internmc.facebook.com/intern/diff/D56190801) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350 Approved by: https://github.com/cpuhrsch	2024-04-16 20:31:52 +00:00
Mikayla Gawarecki	383d2d1f6c	Add testing and fix issues for weights_only load for LRScheduler (#123775 ) Fixes https://github.com/pytorch/pytorch/issues/98921 There were two issues detected: - `MultiStepLR`: issue is described in https://github.com/pytorch/pytorch/issues/98921, this is resolved by allowlisting `collections.Counter` - `OneCycleLR`: `state_dict['anneal_func']` is either `<function OneCycleLR._annealing_cos at 0x7f364186f5b0>` or `<function OneCycleLR._annealing_linear at 0x7f39aa483640>` depending on the `anneal_func` kwarg. This leads to `WeightsUnpickler error: Unsupported class __builtin__.getattr` from the `weights_only` Unpickler. Fixed the above in a BC-compatible manner by adding `OneCycleLR._anneal_func_type` as a string attribute and removing `OneCycleLR.anneal_func` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123775 Approved by: https://github.com/albanD, https://github.com/malfet	2024-04-16 20:29:27 +00:00
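The fix described in the entry above (store a string tag instead of a function object in the scheduler state) can be illustrated with a small sketch. `OneCycleSketch` is a hypothetical class written for this note, not the code in `torch/optim/lr_scheduler.py`; only the attribute name `_anneal_func_type` comes from the commit message.

```python
import math


class OneCycleSketch:
    """Hypothetical scheduler-like class; not the real OneCycleLR."""

    def __init__(self, anneal_strategy="cos"):
        if anneal_strategy not in ("cos", "linear"):
            raise ValueError(f"unknown anneal_strategy: {anneal_strategy!r}")
        # Only a plain string is stored, so the state holds no function objects.
        self._anneal_func_type = anneal_strategy

    @staticmethod
    def _annealing_cos(start, end, pct):
        # Cosine interpolation from `start` (pct=0) to `end` (pct=1).
        cos_out = math.cos(math.pi * pct) + 1
        return end + (start - end) / 2.0 * cos_out

    @staticmethod
    def _annealing_linear(start, end, pct):
        # Linear interpolation from `start` (pct=0) to `end` (pct=1).
        return (end - start) * pct + start

    def _anneal_func(self, *args):
        # Dispatch on the stored string at call time.
        if self._anneal_func_type == "cos":
            return self._annealing_cos(*args)
        return self._annealing_linear(*args)

    def state_dict(self):
        return dict(self.__dict__)
```

Because the state now contains only primitive values, a restricted unpickler has nothing function-like to resolve when the state is loaded.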
PyTorch MergeBot	1f89bf4188	Revert "[reland] `_foreach_copy` with different src/dst dtypes (#123844 )" This reverts commit ff1e3ff5a503a520c1a310c8e72a383657f9a4bc. Reverted https://github.com/pytorch/pytorch/pull/123844 on behalf of https://github.com/malfet due to Perhaps it enabled it for different dtype, but broke for the same ([comment](https://github.com/pytorch/pytorch/pull/123844#issuecomment-2059861767))	2024-04-16 20:23:14 +00:00
Shengbao Zheng	42e22bb444	[nccl-pg] Pass pg name and desc to NCCL communicator (#124149 ) Summary: Pass Process Group Name and Desc to NCCL communicator in order to access pg information in NCCL layer. The information is passed as a commDesc string (i.e. "<pg_desc>:<pg_name>"). This is only available when NCCL_COMM_DESCRIPTION is defined. Differential Revision: D55703310 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124149 Approved by: https://github.com/shuqiangzhang	2024-04-16 20:08:07 +00:00
Rohan	72271fb07e	Add NEON ISA support on aarch64 (#123584 ) Fixes #104729 This improves the compiled mode performance of Softmax (by 20%) and other operations (like batchnorm) that invoke the reduce_all function. Thereby also improves BERT inference by around 8%. Tested on a graviton 3 instance (c7g.4xl). Tests were run in a single-threaded manner. Script attached below. Command: `OMP_NUM_THREADS=1 LRU_CACHE_CAPACITY=1024 DNNL_DEFAULT_FPMATH_MODE=BF16 python TestSoftmax.py` [TestSoftmax.txt](https://github.com/pytorch/pytorch/files/14910754/TestSoftmax.txt) ```python import torch import torch.nn as nn from torch.profiler import profile, record_function, ProfilerActivity model = nn.Softmax().eval() compiled_model = torch.compile(model) inputs = torch.randn(1024, 1024) with torch.set_grad_enabled(False): for _ in range(50): compiled_model(inputs) #Warmup print("Warmup over") with profile(activities=[ProfilerActivity.CPU]) as prof: with record_function("model_inference"): for _ in range(100): compiled_model(inputs) print(prof.key_averages().table(sort_by="self_cpu_time_total")) # Check if the compiled model inference and the eager model inference are similar using torch.allclose print(torch.allclose(compiled_model(inputs), model(inputs))) ``` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123584 Approved by: https://github.com/jgong5, https://github.com/malfet	2024-04-16 18:49:52 +00:00
Simon Fan	67bd43b510	[compiled autograd][dynamo] use aliases for stack restore when partial graphs steal inputs (#124127 ) same idea as https://github.com/pytorch/pytorch/pull/123359, but for when we restore stack variables after calling a partial graph: Illustrated by the test case: before: ```python def function(inputs): graph_out_0 = __compiled_fn_2(inputs) getitem_1 = graph_out_0[0] add = inputs[1] <---- error inputs is already cleared del graph_out_0 add_1 = add + getitem_1 add = None getitem_1 = None cpu = add_1.cpu() add_1 = None return (cpu,) ``` after: ```python def function(inputs): inputs_ref_0 = inputs[1] graph_out_1 = __compiled_fn_2(inputs) getitem_1 = graph_out_1[0] add = inputs_ref_0 del graph_out_1 add_1 = add + getitem_1 add = None getitem_1 = None cpu = add_1.cpu() add_1 = None return (cpu,) ``` Co-authored-by: Jason Ansel <jansel@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124127 Approved by: https://github.com/jansel	2024-04-16 17:01:34 +00:00
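The before/after listings in the entry above can be reproduced in plain Python as a minimal sketch. `stealing_graph` is a hypothetical stand-in for a compiled partial graph that clears its input list; none of these names come from dynamo itself.

```python
def stealing_graph(inputs):
    """Hypothetical stand-in for a partial graph that steals (clears) its inputs."""
    first = inputs[0]
    inputs.clear()  # the caller's list is now empty
    return (first * 2,)


def restore_without_alias(inputs):
    graph_out = stealing_graph(inputs)
    # Reading inputs[1] here raises IndexError: the callee cleared the list.
    return inputs[1] + graph_out[0]


def restore_with_alias(inputs):
    inputs_ref = inputs[1]  # take the alias before the list is cleared
    graph_out = stealing_graph(inputs)
    return inputs_ref + graph_out[0]
```

Calling `restore_with_alias([1, 10])` succeeds because the second element was captured before the call, while `restore_without_alias([1, 10])` fails on the cleared list.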
Lucas Pasqualin	d838cc8f66	[DCP] Returns a copy of sd in copy sd (#123567 ) I found that returning the copy is actually useful in situations where you might do something like: ``` ret = _copy_state_dict(obj, cache) ret.update(some_other_values) ``` and would like `cache` not to change structure from `ret.update(some_other_values)`. Open to some notes here, not returning a copy might force the user to do some additional copies for this case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123567 Approved by: https://github.com/wz337	2024-04-16 15:29:32 +00:00
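A small sketch of why returning a copy helps in the situation described in the entry above. `copy_state_dict_sketch` is a hypothetical helper written for illustration and does not reflect the actual `_copy_state_dict` implementation or signature.

```python
import copy


def copy_state_dict_sketch(state_dict, cache):
    """Hypothetical helper: fill `cache` from `state_dict`, return a separate dict."""
    cache.clear()
    cache.update(copy.deepcopy(state_dict))
    # Return a new top-level dict so that later updates to the
    # result do not change the structure of `cache`.
    return dict(cache)


cache = {}
ret = copy_state_dict_sketch({"step": 3}, cache)
ret.update({"extra": 1})  # only `ret` gains the new key
```

After the update, `cache` still has only the `"step"` key, which is the property the commit message asks for.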
Sijia Chen	0f6ce45bcb	[Inductor] handle AMD special launch options (#124146 ) Summary: `matrix_instr_nonkdim` and `waves_per_eu` are AMD specific launch configs that can't be treated as fn input args Test Plan: HIP_VISIBLE_DEVICES=7 numactl --cpunodebind=1 --membind=1 buck2 run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true -c fbcode.rocm_arch=mi300 //hammer/modules/sequential/encoders/tests:hstu_bench -- --torch-compile=True the E2E works well on the magic model Differential Revision: D56165438 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124146 Approved by: https://github.com/aakhundov	2024-04-16 11:07:17 +00:00
William Wen	4dc160864b	[dynamo, 3.12] enable dynamo-wrapped tests in CI (#123307 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123307 Approved by: https://github.com/jansel, https://github.com/malfet ghstack dependencies: #124095, #124100, #124124	2024-04-16 08:44:43 +00:00
William Wen	962096bce6	[dynamo, 3.12] skip some failing profiler dynamo-wrapped tests (#124124 ) The dynamo wrapped tests and normal tests give the same results locally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124124 Approved by: https://github.com/jansel, https://github.com/aaronenyeshi ghstack dependencies: #124095, #124100	2024-04-16 08:44:43 +00:00
William Wen	5e17f62d10	[dynamo, 3.12] move functorch/test_aotdispatch.py::TestAOTAutograd::test_view_detach from dynamo xfail to skip (#124100 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124100 Approved by: https://github.com/zou3519, https://github.com/jansel ghstack dependencies: #124095	2024-04-16 08:44:43 +00:00
William Wen	9309580d69	[dynamo, 3.12] handle possibility of NULL local variables during graph breaks (#124095 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124095 Approved by: https://github.com/jansel	2024-04-16 08:44:43 +00:00
William Wen	2b3594f90e	[dynamo] fix call_finally issue in Python 3.8 (#124122 ) Fix https://github.com/pytorch/pytorch/issues/97811 again... Pull Request resolved: https://github.com/pytorch/pytorch/pull/124122 Approved by: https://github.com/jansel	2024-04-16 08:36:20 +00:00
Nikita Shulga	298eb69c91	[EZ] Make `weight_int4pack_mm` compilable for `half` input dtype (#124136 ) To enable efficient int4 quantization on ARM Followup after https://github.com/pytorch/pytorch/pull/124022 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124136 Approved by: https://github.com/mikekgfb	2024-04-16 08:10:59 +00:00
Animesh Jain	bb0c768c5b	[dynamo][refactor] Move LazyGraphModule handling (#124113 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124113 Approved by: https://github.com/jansel ghstack dependencies: #124078	2024-04-16 06:39:45 +00:00
PyTorch MergeBot	530bf391cc	Revert "[dynamo] Turn on CPP guard manager (#123547 )" This reverts commit 3e98bdd66d2b051a918e58d5f7bb80b366677bf8. Reverted https://github.com/pytorch/pytorch/pull/123547 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123547#issuecomment-2058337419))	2024-04-16 06:38:15 +00:00
PyTorch MergeBot	52be63eb2c	Revert "Enable UFMT on all of `test/distributed` (#123539 )" This reverts commit 89ac37fe919997e844f0baa6e28965d0d52b0682. Reverted https://github.com/pytorch/pytorch/pull/123539 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123539#issuecomment-2058329471))	2024-04-16 06:33:21 +00:00
Xuehai Pan	2e48f7b044	[pytree] add `tree_iter` function (#123913 ) - Add a new `tree_iter` function. - Bump `optree` version to `0.11.0` for C++ version of `tree_iter`. This PR is split from #120300. - #120300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123913 Approved by: https://github.com/zou3519	2024-04-16 06:02:08 +00:00
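As a rough illustration of what a lazy leaf iterator over a nested container looks like, here is a minimal sketch. `tree_iter_sketch` is a hypothetical function that handles only `dict`, `list`, and `tuple`, and visits dict values in insertion order; the real `tree_iter` in PyTorch and `optree` supports registered node types and may order leaves differently.

```python
from typing import Any, Iterator


def tree_iter_sketch(tree: Any) -> Iterator[Any]:
    """Yield leaves of a nested dict/list/tuple lazily (illustrative only)."""
    if isinstance(tree, dict):
        for value in tree.values():
            yield from tree_iter_sketch(value)
    elif isinstance(tree, (list, tuple)):
        for item in tree:
            yield from tree_iter_sketch(item)
    else:
        # Anything that is not a known container is treated as a leaf.
        yield tree
```

Being a generator, it lets a caller stop early without first building the full list of leaves.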
Xuehai Pan	0eab740db3	[Docs][Distributed] Add migration notes for `--local-rank` option style change for `torchrun` in PyTorch 2.0 (#109480 ) Fixes https://github.com/pytorch/pytorch/pull/94505#issuecomment-1722777767 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109480 Approved by: https://github.com/ezyang	2024-04-16 05:51:57 +00:00
Arun Pa	7530c5a85d	[DOC] Fix example and typo (#123959 ) Fixes #123554 and fixes #123053 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123959 Approved by: https://github.com/mikaylagawarecki	2024-04-16 05:38:24 +00:00
cyy	5bef127c2e	[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449 ) This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449 Approved by: https://github.com/albanD	2024-04-16 04:39:20 +00:00
Nikita Shulga	83ef3bb128	Fix AVX512 int4pack_mm_kernel crash if weights are unaligned (#124128 ) By replacing `_mm256_load_si256` with `_mm256_loadu_si256`, as there are no guarantees that tensor should be aligned Fixes crash reported in https://github.com/pytorch/pytorch/issues/124034 though I'm unsure about perf implications if tensors are properly aligned Pull Request resolved: https://github.com/pytorch/pytorch/pull/124128 Approved by: https://github.com/mikekgfb	2024-04-16 04:35:25 +00:00
Shunting Zhang	df5829d0ba	[inductor] let rand_strided support fp8 (#124120 ) I'm working on https://fb.workplace.com/groups/1075192433118967/posts/1411161629522044/ (this is a meta internal link about an inefficient inner/persistent reduction kernel generated by inductor). I found the generated benchmark code for a kernel ( https://gist.github.com/shunting314/13a0105f72a1c54d9c220370c7fd3845 ) can not be run since rand_strided failed to generate tensors for fp8. Errors are like ``` RuntimeError: "normal_kernel_cpu" not implemented for 'Float8_e4m3fn' ``` for CPU or ``` RuntimeError: "normal_kernel_cuda" not implemented for 'Float8_e4m3fn' ``` for GPU. This PR works around that problem. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124120 Approved by: https://github.com/Chillee, https://github.com/jansel	2024-04-16 04:15:56 +00:00
Yuanhao Ji	89ac37fe91	Enable UFMT on all of `test/distributed` (#123539 ) Partially addresses #123062 Ran lintrunner on: - `test/distributed` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539 Approved by: https://github.com/ezyang	2024-04-16 03:23:56 +00:00
Edward Z. Yang	e4efa311f1	Refactor test_tensor_set_data to be parametrized (#124105 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124105 Approved by: https://github.com/albanD	2024-04-16 03:23:41 +00:00
FFFrog	791e5db705	Part 3: UFMT fix the rest files in torch/optim due to the pr-sanity-checks (#124055 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124055 Approved by: https://github.com/ezyang ghstack dependencies: #124048, #124053, #124054	2024-04-16 03:22:39 +00:00
FFFrog	ac74a6783b	Part 2: UFMT fix 2 files in torch/optim due to the pr-sanity-checks (#124054 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124054 Approved by: https://github.com/ezyang ghstack dependencies: #124048, #124053	2024-04-16 03:20:21 +00:00
FFFrog	560efaa471	Part 1: UFMT partial files in torch/optim due to the pr-sanity-checks (#124053 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124053 Approved by: https://github.com/ezyang ghstack dependencies: #124048	2024-04-16 03:17:18 +00:00
FFFrog	f30704f5f3	add preparatory work for torch/optim/lr_scheduler.py (#124048 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124048 Approved by: https://github.com/albanD	2024-04-16 03:17:18 +00:00
Sam Larsen	6babf00014	[inductor] Bypass FX graph cache when we have HigherOrderOperators (#123325 ) Summary: The initial motivation was to avoid caching when we have triton higher order ops, but it's probably safer to avoid the cache for all higher order ops and allow/implement if/when we find it necessary. Test Plan: Unit test cribbed from: https://docs-preview.pytorch.org/pytorch/tutorials/2783/recipes/torch_compile_user_defined_triton_kernel_tutorial.html?highlight=triton Pull Request resolved: https://github.com/pytorch/pytorch/pull/123325 Approved by: https://github.com/eellison	2024-04-16 02:51:49 +00:00
Masaki Kozuki	ff1e3ff5a5	[reland] `_foreach_copy` with different src/dst dtypes (#123844 ) Attempt to reland https://github.com/pytorch/pytorch/pull/121717. The change is the array bounds check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123844 Approved by: https://github.com/janeyx99	2024-04-16 02:20:58 +00:00
Joona Havukainen	a4c8002ee0	MPS FFT implementation bug (#123274 ) Current implementation drops the negative frequency components even when the user doesn't ask for the one-sided transform. The tests for the negative frequency components seem to have worked by accident due to internal implementation details but the issue becomes evident in macOS 14.4. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123274 Approved by: https://github.com/malfet	2024-04-16 02:02:37 +00:00
Richard Barnes	eeb626b46a	[BE] Do not use `using namespace` in mps headers (#124117 ) - Remove `using namespace std` from `MPSDevice.h` - Add `std::` prefix to 1st argument of `MPSProfiler::StartTrace` - Do the same in front of `numeric_limits` template instantiation in `ReduceOps.mm` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124117 Approved by: https://github.com/malfet	2024-04-16 01:39:42 +00:00
Fuzzkatt	1cf62e86a4	skip various unit tests for Jetson (#122531 ) skip multiprocessing, cuda expandable segments, mem eff and flash attention tests on Jetson due to hanging / sigkill issues from nvidia internal testing Pull Request resolved: https://github.com/pytorch/pytorch/pull/122531 Approved by: https://github.com/eqy, https://github.com/malfet	2024-04-16 01:26:26 +00:00
Kai Londenberg	aaad0554b4	[Inductor] Fix endless recursion in codecache.DLLWrapper.__getattr__ (#123931 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123931 Approved by: https://github.com/peterbell10	2024-04-16 00:52:21 +00:00
cyy	c2596fd3e0	[Distributed] [4/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124032 ) This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/123312. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124032 Approved by: https://github.com/Skylion007	2024-04-16 00:42:18 +00:00
Shivam Raikundalia	9079c76689	Fix Asynchronous PyTorch Profiler Trace (#124080 ) Summary: With the merge of D55925068, we have introduced an overflow issue when recording a trace using dyno gputrace. This is because it is possible for TorchOPs to be enumerated but not have an end time since they were running as the recording ended. By default these events have an end time set to INT_MIN. When finding the duration() for such events using end-start, we get an overflow resulting in a very long duration. This was avoided before because we were dividing the INT_MIN by 1000 because we were trying to convert uS to nS. This change introduces a patch for TorchOps and a future PR will be added to create a more universal guard in kineto. Test Plan: Trace recorded using resnet test. Trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1713199267/localhost/libkineto_activities_2247224.json.gz&bucket=gpu_traces Differential Revision: D56144914 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124080 Approved by: https://github.com/aaronenyeshi	2024-04-16 00:24:32 +00:00
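The overflow described in the entry above can be sketched in Python by emulating fixed-width signed arithmetic. The 64-bit width, the helper names, and the choice to report a duration of 0 for unfinished events are all assumptions made for illustration; the commit message does not describe the actual patch in that detail.

```python
INT64_MIN = -(2**63)  # assumed sentinel for "no end time recorded"


def wrap_int64(value):
    """Emulate two's-complement wraparound of a signed 64-bit integer."""
    return (value + 2**63) % 2**64 - 2**63


def naive_duration(start, end):
    # end - start underflows when end is the sentinel, wrapping to a huge value.
    return wrap_int64(end - start)


def guarded_duration(start, end):
    # Hypothetical guard: treat events without a valid end time as zero-length.
    if end < start:
        return 0
    return end - start
```

With `start = 1000` and the sentinel as `end`, `naive_duration` wraps around to a very large positive number, which matches the "very long duration" symptom, while `guarded_duration` returns 0.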
Will Constable	1885c3972d	[C10D] Add dist.get_node_local_rank helper (#123992 ) Fixes #122816 Summarizing the pros/cons of the request and motivation from #122816 - (+) it's really common for users to do `os.environ["LOCAL_RANK"]` so we should provide a helper - (-) we can't really control if/how local rank information is made available, but it is handled automatically if torchrun is used. We can assume local rank is correctly passed if it is passed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123992 Approved by: https://github.com/shuqiangzhang, https://github.com/zdevito, https://github.com/XilunWu	2024-04-16 00:09:46 +00:00
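The pattern this helper is meant to replace can be sketched with the standard library alone. `node_local_rank_sketch` and its `fallback_rank` parameter are hypothetical names used for illustration; this is not the implementation of `dist.get_node_local_rank`, only the environment-variable lookup the commit message refers to.

```python
import os
from typing import Optional


def node_local_rank_sketch(fallback_rank: Optional[int] = None) -> int:
    """Read LOCAL_RANK from the environment, with an optional fallback."""
    value = os.environ.get("LOCAL_RANK")
    if value is not None:
        # torchrun exports LOCAL_RANK as a decimal string.
        return int(value)
    if fallback_rank is not None:
        return int(fallback_rank)
    raise RuntimeError("LOCAL_RANK is not set and no fallback_rank was given")
```

Raising when neither source is available reflects the point in the message that the library cannot control whether local rank information is provided.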
rzou	2b54b00e30	Update some more APIs to have positional-only args (#124063 ) Not BC-breaking since we haven't released these yet Pull Request resolved: https://github.com/pytorch/pytorch/pull/124063 Approved by: https://github.com/albanD ghstack dependencies: #123615, #124062	2024-04-15 23:32:47 +00:00
rzou	3c25b18d76	Excise old custom ops prototype from custom_op_db (#124062 ) Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124062 Approved by: https://github.com/albanD ghstack dependencies: #123615	2024-04-15 23:32:47 +00:00
rzou	a03711d24d	[custom_ops] Support TensorList inputs/outputs (#123615 ) We add a `supports_tensorlist` decorator that gives an autograd.Function the ability to handle TensorLists. Test Plan: - custom_op_db tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123615 Approved by: https://github.com/albanD	2024-04-15 23:32:43 +00:00
Markus Hennerbichler	5a15cbfa44	Fix typo in TorchScript annotate docstring (#123719 ) It's already in the docstring for torch.jit.Attribute to use Attribute in a __init__ method of a Module. However, this was wrong in the `annotate` docstring Pull Request resolved: https://github.com/pytorch/pytorch/pull/123719 Approved by: https://github.com/mikaylagawarecki	2024-04-15 22:52:20 +00:00
-	70ad64e8a6	update docs for separate context and forward functions (#121955 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121955 Approved by: https://github.com/soulitzer	2024-04-15 22:31:12 +00:00
Shengbao Zheng	9fa922c2ed	[profiler] Log process group name instead of pg uid (#124035 ) Summary: As part of the work of unifying process group identifier, log <group_name, group_desc>, instead of pg uid in profiler. - group_name remains as the unique identifier, e.g. "0", "1" - group_desc will be the user specified name, e.g. "fsdp". Reviewed By: aaronenyeshi, kwen2501 Differential Revision: D55610682 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124035 Approved by: https://github.com/aaronenyeshi	2024-04-15 21:49:06 +00:00
Nikita Shulga	bd222473fc	[EZ][BE] Fix unknown pragma warning (#124086 ) By using `C10_DIAGNOSTIC_` macros instead of `#pragma clang diagnostic` that puts appropriate compiler supported pragmas. Fixes following warning during the bazel build ``` INFO: From Compiling aten/src/ATen/native/TensorFactories.cpp: aten/src/ATen/native/TensorFactories.cpp:372: warning: ignoring #pragma clang diagnostic [-Wunknown-pragmas] 372 \| #pragma clang diagnostic push \| aten/src/ATen/native/TensorFactories.cpp:373: warning: ignoring #pragma clang diagnostic [-Wunknown-pragmas] 373 \| #pragma clang diagnostic ignored "-Wmissing-prototypes" \| aten/src/ATen/native/TensorFactories.cpp:375: warning: ignoring #pragma clang diagnostic [-Wunknown-pragmas] 375 \| #pragma clang diagnostic pop \| ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124086 Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/Skylion007	2024-04-15 21:44:31 +00:00
PHLens	9aba918bd8	Support Accelerator OOM Error (#121200 ) (#121702 ) Fixes #121200 This PR introduces AcceleratorOutOfMemoryError for all privateuse1 backends. For python, there is a PyError object which will be set only when privateuse1 is registered. All privateuse1 backends can then use this error for memory errors. More error types may be added in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121702 Approved by: https://github.com/guangyey, https://github.com/albanD	2024-04-15 21:41:46 +00:00
Andrew Gu	495a4d4a42	[FSDP2] Added `mesh` arg to `fsdp_pre_all_gather` (#123953 ) This PR adds a `mesh: DeviceMesh` argument to `fsdp_pre_all_gather()` so that the extension can know over which mesh the all-gather is happening. This can be useful in recovering the post-all-gather tensor size in the `fsdp_post_all_gather()` (e.g. for `NF4Tensor`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/123953 Approved by: https://github.com/Skylion007, https://github.com/wanchaol ghstack dependencies: #119302, #122908	2024-04-15 21:35:51 +00:00
Andrew Gu	d1a0821e7e	[FSDP2] Added pre/post-all-gather extensions (subclass) (#122908 ) Overview This PR adds pre/post-all-gather extensions to FSDP2. - The pre/post-all-gather extensions are specified at the tensor-level on the `sharded_param._local_tensor` (i.e. the tensor wrapped by the sharded `DTensor`). If the user has a tensor-subclass parameter on the module passed to FSDP that preserves the subclass through the sharding ops (e.g. `new_zeros`, `chunk`, etc.), then the `sharded_param._local_tensor` will naturally be of that subclass. - The pre-all-gather function has signature: ``` def fsdp_pre_all_gather(self) -> Tuple[Tuple[torch.Tensor, ...], Any] ``` - The first return value is a `Tuple[torch.Tensor, ...]` of the all-gather inputs. It is a tuple since a subclass could contribute >1 inner tensors. - The second return value is any optional metadata needed to pass through to the post-all-gather. - The post all-gather function has signature: ``` def fsdp_post_all_gather( self, all_gather_outputs: Tuple[torch.Tensor, ...], metadata: Any, param_dtype: torch.dtype, , out: Optional[torch.Tensor] = None, ) -> Union[Tuple[torch.Tensor, Tuple[torch.Tensor, ...]], None]: ``` - The `all_gather_outputs` are exactly the all-gathered versions of the `fsdp_pre_all_gather` 1st return value (representing the all-gather inputs). We make sure to unflatten these back to ND for the user. - The `metadata` is the `fsdp_pre_all_gather` 2nd return value, untouched. - The `param_dtype` is the parameter dtype based on the passed-in `MixedPrecisionPolicy`. Namely, if no policy is passed in, then `param_dtype` is the original dtype, and otherwise, it is the `MixedPrecisionPolicy.param_dtype`. - If `out` is not specified, then the return value has type `Tuple[torch.Tensor, Tuple[torch.Tensor, ...]]`. The first tuple item is the unsharded parameter (e.g. re-wrapping into some subclass). The second tuple item is a tuple of unsharded inner tensors that FSDP should free during reshard. 
These should be derived from the all-gather outputs. - The `out` argument is required due to FSDP's `resize_` usage. We require an in-place variant for the backward all-gather. Here, `out` will be exactly the object returned as the first tuple item in the out-of-place variant mentioned before. The unsharded inner tensors will be allocated before calling `fsdp_post_all_gather`. When `out` is specified, the `fsdp_post_all_gather` should return `None`. If the post-all-gather does not do any out-of-place ops, then the `out` variant can just be a no-op since the unsharded inner tensors will be the same as the all-gather outputs, which FSDP directly writes to after all-gather. (E.g., this is the case for both float8 and `NF4Tensor`.) - We check for `fsdp_pre_all_gather` and `fsdp_post_all_gather` directly via `hasattr` to accommodate monkey patching so that we do not strictly require the user to use a tensor subclass. The monkey patch must happen after the local tensors have been finalized (after applying FSDP and after any meta-device init). - For now, we require that all gradients in one FSDP parameter group share the same dtype. This is fine for float8 and `NF4Tensor` use cases. If this requirement is too strict, then in the future we can issue 1 reduce-scatter per dtype per group. Design Notes* - We assume that the `sharded_param._local_tensor` is padded on dim-0. - This assumption should not block immediate use cases, and when we pad the `DTensor._local_tensor` by default, this assumption will always be true. - This assumption allows us to call `sharded_param._local_tensor.fsdp_pre_all_gather()`; i.e. it tells us from which tensor object to invoke `fsdp_pre_all_gather()`. - Suppose we want to compose with CPU offloading. Then, CPU offloading's H2D copy should run first, i.e. `sharded_param._local_tensor.to("cuda").fsdp_pre_all_gather()`, where `_local_tensor.to("cuda")` should return an instance of the subclass so that it still defines `fsdp_pre_all_gather()`. 
Note that in this case, the subclass instance on GPU is a temporary, which means caching values on it would not be possible. One possibility would be to have `.to("cuda")` move any cached values too. - `fsdp_post_all_gather` can either return an unsharded parameter that aliases with the all-gather output or does not alias, but there is no way to know a priori. - If the unsharded parameter aliases with the all-gather output, then we should _not_ free the all-gather output in `unshard`. - If the unsharded parameter does not alias with the all-gather output, then we prefer to free the all-gather output in `unshard` to avoid holding the unneeded temporary. - One approach is for eager-mode to check for this alias (by comparing data pointers). However, this might be adversarial to full-graph compilation. The compromise for simplicity can be to always free the all-gather output in `reshard`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122908 Approved by: https://github.com/weifengpy, https://github.com/wanchaol ghstack dependencies: #119302	2024-04-15 21:35:51 +00:00
Andrew Gu	ea52918e81	[FSDP2] Generalized all-gather outputs to >1 per parameter (#119302 ) This PR is part of the FSDP extensions work. For subclasses such as for QLoRA's `NF4Tensor` (using block-wise quantization) that have multiple inner tensors per parameter, we must generalize to allow each parameter to contribute >1 all-gather inputs and hence have >1 all-gather outputs. This PR does this generalization by converting `FSDPParam.all_gather_input: torch.Tensor` to `FSDPParam.all_gather_inputs: List[torch.Tensor]`. Unfortunately, since we need to preserve the mapping from all-gather inputs/outputs to their source parameter, we have to introduce `List[List]` instead of simply `List` in several places. Furthermore, we still require the flattened 1D `List` for `torch.split` calls, introducing some redundancy between data structures. Nonetheless, I do not see a way to avoid this if we want the generalization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119302 Approved by: https://github.com/weifengpy, https://github.com/wanchaol	2024-04-15 21:35:46 +00:00
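The redundancy between the nested `List[List]` and the flattened 1D `List` mentioned above can be sketched as follows. Strings stand in for tensors, and `flatten_with_counts` and `regroup` are hypothetical helper names chosen for illustration; the real FSDP code may organize this differently.

```python
from itertools import chain
from typing import List, Sequence, Tuple


def flatten_with_counts(nested: Sequence[Sequence[str]]) -> Tuple[List[str], List[int]]:
    """Flatten per-parameter inputs and record how many each parameter contributed."""
    counts = [len(inner) for inner in nested]
    flat = list(chain.from_iterable(nested))
    return flat, counts


def regroup(flat: Sequence[str], counts: Sequence[int]) -> List[List[str]]:
    """Rebuild the per-parameter nesting from the flat list and the counts."""
    out, start = [], 0
    for n in counts:
        out.append(list(flat[start:start + n]))
        start += n
    return out


# One plain parameter and one parameter with two inner tensors (placeholders).
per_param_inputs = [["w0.data"], ["w1.quantized", "w1.scales"]]
flat_inputs, counts = flatten_with_counts(per_param_inputs)
regrouped = regroup(flat_inputs, counts)
```

The flat list is what a single split call would consume, while the counts preserve the mapping back to each source parameter.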
Isuru Fernando	975f77784f	Fix CUDA out of memory error message formatting (#123984 ) We need a string instead of an integer here. With device 0, the string was getting NULL terminated leading to a truncated error message Pull Request resolved: https://github.com/pytorch/pytorch/pull/123984 Approved by: https://github.com/eqy, https://github.com/peterbell10	2024-04-15 21:00:17 +00:00
Nikita Shulga	95a090fb56	[CI] Update bazel deps (#124076 ) - Update `WORKSPACE` to actually use Python-3.10 as job name claims it is - Get rid of unneeded `future` and `six` dependencies (Removed long time ago) - Update `requests`, `typing-extensions` and `setuptools` to the latest releases - Mark `tools/build/bazel/requirements.txt` as a generated file This also updates idna to 3.7 that contains a fix for [CVE-2024-3651](https://github.com/advisories/GHSA-jjg7-2v4v-x38h), though as we are not shipping a binary with it, it does not expose the CI system to any actual risks TODOs: - Add periodic job that runs `pip compile` to update those to the latest version - Unify various requirements .txt (i.e. bazel requirements and requirements-ci should be one and the same) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124076 Approved by: https://github.com/seemethere, https://github.com/DanilBaibak	2024-04-15 20:39:50 +00:00
Animesh Jain	601112fdb4	[dynamo][log] Print missing skipped frame info on debug (#124078 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124078 Approved by: https://github.com/yanboliang	2024-04-15 20:33:17 +00:00
Sam Larsen	e5b404b809	[inductor] Fix fresh_inductor_cache() (#122661 ) Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts. Test Plan: - New unit test - All existing inductor tests will exercise fresh_inductor_cache() Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661 Approved by: https://github.com/oulgen	2024-04-15 20:28:54 +00:00
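The registration pattern described above, where caches opt in to being cleared whenever a fresh cache scope starts, might look roughly like this. It is a standalone sketch: `clear_on_fresh_cache`, `fresh_cache`, and `_registered_caches` are hypothetical names, and the mocking of the top-level cache directory is deliberately left out.

```python
import functools
from contextlib import contextmanager

# Hypothetical registry of objects that expose cache_clear().
_registered_caches = []


def clear_on_fresh_cache(cached_fn):
    """Register an lru_cache-wrapped function (or similar) for clearing."""
    _registered_caches.append(cached_fn)
    return cached_fn


@contextmanager
def fresh_cache():
    # Clear all registered in-memory caches before the body runs.
    for cached in _registered_caches:
        cached.cache_clear()
    yield


@clear_on_fresh_cache
@functools.lru_cache(maxsize=None)
def expensive_lookup(key):
    return key.upper()


expensive_lookup("a")
size_before = expensive_lookup.cache_info().currsize

with fresh_cache():
    size_inside = expensive_lookup.cache_info().currsize
```

The registration decorator is applied outermost so that it receives the `lru_cache` wrapper, which is the object that carries `cache_clear()`.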
Bert Maher	99059affb9	Use packed metadata from triton to reduce launch latency (#123842 ) https://github.com/openai/triton/pull/3633 converts some kernel launch metadata from a namedtuple to a regular tuple, which is faster to parse. Using it here shaves off a microsecond or so from the apparently extremely-sensitive launch path. Fixes #123597 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123842 Approved by: https://github.com/jansel, https://github.com/shunting314 ghstack dependencies: #123841	2024-04-15 19:43:06 +00:00
Bert Maher	6c9f5064ea	Avoid retrieving launch metadata if launch_enter_hook is not installed (#123841 ) Fixes #123597 There's a sizable comment in the PR about why this is needed, but essentially the launch path is really really perf sensitive (running `launch` is ~30 microseconds, and according to the linked issue, regressing it to 33us is worth 6% overall on torchbench). The `bin.launch_metadata` call doesn't look super expensive, but microseconds matter, and this is only useful when we have a launch hook installed (which seems pretty rare?). This change is worth about 2us, and when combined with the other diff in the stack seems to completely eliminate the torchbench regression. Differential Revision: [D56046347](https://our.internmc.facebook.com/intern/diff/D56046347) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123841 Approved by: https://github.com/jansel, https://github.com/shunting314	2024-04-15 19:43:06 +00:00
Pian Pawakapan	90d1720861	[export] Restore original placeholder names (part 3: constant input de/serialization) (#123590 ) Summary: note: breaking the original diff D55225818 into 3 parts (top-level renaming, higher-order-op subgraphs, constant input de/serialization) because of its size. Stacked PR to restore original names to placeholder nodes, replacing the default names arg0_1, arg1_1, ... This PR supports constant argument placeholder (e.g. forward(self, x, y=1)) names and de/serialization, by adding a name field for ConstantArguments in the graph signature, and ConstantInputSpec in the input specs for serialization. Test Plan: verification checks on placeholder names for all export() calls, unit test in test/export/test_export.py Differential Revision: D55506949 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123590 Approved by: https://github.com/angelayi, https://github.com/zhxchen17	2024-04-15 19:09:41 +00:00
Shunting Zhang	fb6f6270d6	[inductor] comprehensive padding (#120758 ) This PR adds the ability to pad tensor strides during lowering. The goal is to make sure (if possible) tensors with bad shape can have aligned strides so GPU can access the memory more efficiently. By testing BlenderbotSmallForConditionalGeneration I already see 2.5ms speedup. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120758 Approved by: https://github.com/jansel	2024-04-15 19:05:51 +00:00
Nikita Shulga	221c397e2e	Use NEON to speedup `int8pack_mm` on aarch64 (#124023 ) Just vectorizing inner loop as follows: ```cpp float32x4_t c_val = vdupq_n_f32(0.0); for (int k = 0; k < K; k += 8) { float16x8_t a_val = vld1q_f16(reinterpret_cast<const float16_t*>(A) + m * lda + k); int16x8_t b_val = vmovl_s8(vld1_s8(B + n * ldb + k)); auto a_val_low = vcvt_f32_f16(vget_low_f16(a_val)); auto a_val_high = vcvt_f32_f16(vget_high_f16(a_val)); auto b_val_low = vcvtq_f32_s32(vmovl_s16(vget_low_s16(b_val))); auto b_val_high = vcvtq_f32_s32(vmovl_s16(vget_high_s16(b_val))); c_val = vaddq_f32(c_val, vmulq_f32(a_val_low, b_val_low)); c_val = vaddq_f32(c_val, vmulq_f32(a_val_high, b_val_high)); } float scale_val = static_cast<float>(scales[n]); C[m * ldc + n] = reduce(c_val) * scale_val; ``` Which bumps perf from 35 to 58 tokens per second (65% perf gain). Unrolling both inner and outer loops bumps perf to 64 tokens per sec (i.e. another 10% gain) Before/after performance running stories110M on M2Pro \| eager (before) \| eager (after) \| compile (before) \| compile (after) \| \| ---- \| --- \| -- \| -- \| \| 35 \| 64 \| 56 \| 132 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/124023 Approved by: https://github.com/mikekgfb ghstack dependencies: #124022	2024-04-15 18:57:59 +00:00
Yuanhao Ji	8ce29f1416	Enable UFMT on `test/onnx_caffe2`, `test/optim`, `test/package` and `test/profiler` (#123901 ) Part of: #123062 Ran lintrunner on: - `test/onnx_caffe2` - `test/optim` - `test/package` - `test/profiler` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123901 Approved by: https://github.com/ezyang	2024-04-15 17:46:59 +00:00
Brian Hirsh	63dcb5b0f2	make sure dynamo doesn't inline DTensor __new__ or __torch_dispatch__ (#123347 ) Fixes https://github.com/pytorch/pytorch/issues/122459, https://github.com/pytorch/torchtrain/issues/61 Even with the previous PR ("support DTensor/subclass constructors directly in the graph"), I still see some errors when running the repro above that start some logs showing that dynamo is inlining `__new__`. I noticed that putting `@torch._dynamo.disable` on DTensor's `__new__` makes the entire repro pass. Why does having dynamo try to inline `Subclass.__new__` run into problems? Morally, dynamo probably shouldn't be inlining __new__ ("creating a subclass" is a blackbox operation that AOTAutograd can trace through anyway). But concretely, we can end up with a node in the dynamo FX graph that has a "partially initialized tensor subclass" as its example value, because the subclass has been created but its fields have not been assigned to yet. This breaks a bunch of invariants throughout dynamo: there are many places where if we have a tensor subclass node, we want to look at its inner tensors, to see if they are FakeTensors, what their FakeTensorMode is, and if they have dynamic shapes. One option is to decide that "uninitialized subclass" is a first-class thing that anyone looking at the FX node examples values on the dynamo graph needs to handle, but this seems like a lot of work when in reality we don't need dynamo to trace the __new__ at all. Hence the `torch._dynamo.disable`. I still wasn't very satisfied, since it was unclear to me why dynamo was inlining the `__new__` call, instead of interposing on the `DTensor()` constructor directly. After a long chat with @anijain2305, he explained that with code like this: ``` @torch._dynamo.disable(recursive=False) def f(x): out = SubclassConstructor(x) ``` Dynamo will never get the chance to interpose on the subclass constructor. 
Instead, what will happen is: (1) Dynamo hands back control to cpython to run `f()`, since we disabled that frame (2) `SubclassConstructor(x)` is run in eager mode (3) `SubclassConstructor(x)` eventually calls `SubclassConstructor__new__` (4) this is a new frame, that cpython then allows dynamo to intercept and start compiling So it looks like we are basically forced to handle the situation where dynamo might directly start compiling `Subclass.__new__` All of the above does not explain the story for `__torch_dispatch__` though. Empirically, I have a repro in torchtrain where looking at the dynamo logs, we see dynamo try to inline `__torch_dispatch__`. ``` [rank0]:DEBUG: Skipping frame because no content in function call _prepare_output_fn /data/users/hirsheybar/b/pytorch/torch/distributed/tensor/parallel/style.py 318 [rank0]:DEBUG: torchdynamo start compiling __torch_dispatch__ /data/users/hirsheybar/b/pytorch/torch/distributed/_tensor/api.py:297, stack (elided 5 frames): ``` I haven't been able to create a smaller repro of the problem (even using `_dynamo.disable(recursive=False)`), although in theory, if there is a `torch.` op that you were to inline (where one of the inputs is a subclass), the next frame would likely be `__torch_dispatch__`. Dynamo always treats `torch.` operations as not-inlinable though, so in theory we shouldn't ever see dynamo inline `__torch_dispatch__`, but a `_dynamo.disable()` fixes the problem. I asked Animesh if we can have dynamo automatically apply this behavior to subclasses instead of needing it to be added explicitly. He pointed out that for `disable(recursive=False)`, we can't really do this within dynamo Pull Request resolved: https://github.com/pytorch/pytorch/pull/123347 Approved by: https://github.com/zou3519 ghstack dependencies: #122502, #122751, #123348	2024-04-15 17:23:20 +00:00
Aaron Gokaslan	9c4fc5fa34	[BE][Ez]: Fix minor potential perf regression from #123960 (#124013 ) The `non_blocking` arg here is useless if the values are all eagerly consumed, so revert the change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124013 Approved by: https://github.com/ezyang	2024-04-15 16:51:45 +00:00
Milan Straka	fea1b99d89	Remove warning from LazyModuleMixin constructor. (#123968 ) Remove warning from `LazyModuleMixin` about lazy modules being a new feature under heavy development. The last nontrivial change to the code happened more than three years ago. Fixes #123928 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123968 Approved by: https://github.com/mikaylagawarecki	2024-04-15 15:36:55 +00:00
Edward Z. Yang	af9a707233	Use uv in lintrunner init when it is available. (#124033 ) Before, a no-op lintrunner init takes 12s. After, it takes 1s; a full order of magnitude improvement. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124033 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2024-04-15 14:47:21 +00:00
PyTorch UpdateBot	7cd7a7aa8e	[xla hash update] update the pinned xla hash (#124042 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124042 Approved by: https://github.com/pytorchbot	2024-04-15 10:50:54 +00:00
Yuanhao Ji	e3ac61587a	Enable UFMT on `test/functorch` (#123541 ) Partially addresses #123062 Ran lintrunner on: - `test/functorch` Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123541 Approved by: https://github.com/zou3519, https://github.com/ezyang	2024-04-15 06:21:52 +00:00
Adnan Akhundov	03a05e791a	Don't add non-integer Triton kernel arg 1 to equal_to_1 (#123886 ) Summary: Triton compiler adds constant argument 1 to `equal_to_1` [only when it's an int](`8c5e33c77e/python/triton/runtime/jit.py (L275)`). Here we restrict Inductor's `equal_to_1` in the same way. Test Plan: ``` $ python test/inductor/test_triton_kernels.py -k test_triton_kernel_equal_to_1_float_arg ... ---------------------------------------------------------------------- Ran 1 test in 6.528s OK $ python test/inductor/test_triton_kernels.py -k test_triton_kernel_equal_to_1_arg ... ---------------------------------------------------------------------- Ran 2 tests in 10.142s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123886 Approved by: https://github.com/oulgen ghstack dependencies: #123703	2024-04-14 20:34:05 +00:00
Edward Z. Yang	19f50333e9	Improve assert message for unbacked symint not written out (#123965 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123965 Approved by: https://github.com/Skylion007	2024-04-14 20:03:43 +00:00
Nikita Shulga	a096e99a5d	Enable int8mm kernel for float16 (#124022 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124022 Approved by: https://github.com/mikekgfb	2024-04-14 19:48:43 +00:00
Xuehai Pan	9bb54c7f3c	[pytree] enable `functools.wraps` in Python pytree with dynamo (#124012 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124012 Approved by: https://github.com/Skylion007	2024-04-14 09:25:05 +00:00
Aleksandar Samardžić	f5331aade5	Simplify ATen sparse semi-structured operators based on CUTLASS (#123473 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123473 Approved by: https://github.com/cpuhrsch	2024-04-14 06:57:41 +00:00
WeiChunyu-star	635c238bad	Enable UFMT on all of test/quantization/jit &pt2e (#124010 ) Partially addresses #123062 Ran lintrunner on: - test/quantization/jit - test/quantization/pt2e Detail: ``` $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` cc, please @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/124010 Approved by: https://github.com/ezyang	2024-04-14 06:07:23 +00:00
William Wen	0dfe72c63b	[dynamo, 3.12] fix positions and offsets of added instructions when we clean (#123991 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123991 Approved by: https://github.com/jansel ghstack dependencies: #123978	2024-04-14 03:58:04 +00:00
Krzysztof Jordan	88a7159493	[NT] Fix typo in declared strides variable (#123856 ) Summary: Looks like it's missing an s in the declaration so pyre is throwing an error {F1484357040} Test Plan: expect no pyre errors Differential Revision: D56023743 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123856 Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer	2024-04-13 19:55:57 +00:00
Jason Ansel	f3fd280238	[dynamo] Relax strict_mode for autograd.Function forward inputs (#123910 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123910 Approved by: https://github.com/oulgen	2024-04-13 19:41:59 +00:00
William Wen	6440d1baa6	[dynamo, 3.12] fix the block stack... again (#123978 ) Some changes to how we handle blocks in 3.11+: - We only keep track of with blocks that are not enclosed in a try block - We do not compile partial graphs if we are in a block that is not in a tracked with block - i.e. any block enclosed in some non-with try/except/etc. block Pull Request resolved: https://github.com/pytorch/pytorch/pull/123978 Approved by: https://github.com/jansel	2024-04-13 17:07:02 +00:00
Xuehai Pan	da7db5d345	[BE] migrate import sorter configurations to `pyproject.toml` (#123846 ) Migrate import sorter configurations to `pyproject.toml` and delete `.isort.cfg`. Also, set the line length to 88 (which is the default of `black`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/123846 Approved by: https://github.com/Skylion007	2024-04-13 12:54:14 +00:00
cyy	b60af92c17	[Distributed] [3/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#123312 ) This PR continues to fix some clang-tidy warnings in distributed code, following https://github.com/pytorch/pytorch/pull/122892. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123312 Approved by: https://github.com/Skylion007	2024-04-13 11:45:00 +00:00
PyTorch MergeBot	97261be0a8	Revert "Simplify ATen sparse semi-structured operators based on CUTLASS (#123473 )" This reverts commit b2a0b8c446234f0b35a66aff87501c4596ea5d51. Reverted https://github.com/pytorch/pytorch/pull/123473 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/123473#issuecomment-2053561077))	2024-04-13 07:47:32 +00:00
WeiChunyu-star	6ac8fe46dd	Enable UFMT on all of test/quantization/ao_migration &bc (#123994 ) Partially addresses #123062 Ran lintrunner on: - test/quantization/ao_migration - test/quantization/bc Detail: ``` $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/123994 Approved by: https://github.com/ezyang	2024-04-13 06:36:10 +00:00
Jason Ansel	285c93d64d	[inductor] Write generated files from parent process (#123409 ) Before this PR we would pass generated source code over a pipe to the compile worker then the compile worker would write out the file. Doing it this way is faster and results in smaller messages to the workers (and lets us skip creating the workers in the warm start case). Pull Request resolved: https://github.com/pytorch/pytorch/pull/123409 Approved by: https://github.com/desertfire	2024-04-13 06:31:28 +00:00
Avik Chaudhuri	5961e23e76	primitive attribute assignment (#123898 ) This PR ensures that assignment of attributes of primitive type work without needing any code changes in non-strict mode. (In a previous PR we banned attribute assignments of tensor type unless such attributes are registered as buffers.) While strict mode errors on (all) attribute assignments, non-strict doesn't care, so one might assume that this kind of attribute assignment should already work in non-strict. However, there's a problem: we run through the program once for metadata collection and then run through it again for tracing, so the values observed during tracing (and potentially burned into the graph) do not reflect what should have been observed had the metadata collection pass not run. So the only thing this PR needs to do is restore values of assigned attributes of primitive type once the metadata collection pass has run. We do this by moving the attribute assignment detecting context manager from the overall `aot_export` call in `_trace.py` to the metadata collection pass in `aot_autograd.py`, and extending it. The rest of the PR moves some utils around. Differential Revision: D56047952 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123898 Approved by: https://github.com/angelayi	2024-04-13 05:27:52 +00:00
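The restore step described above can be sketched with a small context manager that snapshots primitive-typed attributes before a first pass and puts them back afterwards. This is a plain-Python illustration, not the code in `aot_autograd.py`; `restore_primitive_attrs` and `Counter` are hypothetical names, and attributes newly created during the pass are not handled.

```python
from contextlib import contextmanager

_PRIMITIVES = (int, float, bool, str, type(None))


@contextmanager
def restore_primitive_attrs(obj):
    """Snapshot primitive attributes of obj and restore them on exit."""
    snapshot = {k: v for k, v in vars(obj).items() if isinstance(v, _PRIMITIVES)}
    try:
        yield
    finally:
        for name, value in snapshot.items():
            setattr(obj, name, value)


class Counter:
    def __init__(self):
        self.calls = 0

    def forward(self, x):
        self.calls += 1  # primitive attribute assignment
        return x + self.calls


m = Counter()
with restore_primitive_attrs(m):
    first_pass = m.forward(10)  # stands in for the metadata collection pass
second_pass = m.forward(10)     # stands in for the tracing pass
```

Without the restore, the second pass would observe `calls == 1` on entry and return a different value than a single run of the program would.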
lezcano	891736f115	Fix links rendering when surrounding code in Dynamo deepdive (#123427 ) I thought the RST was rendering correctly, but here we are. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123427 Approved by: https://github.com/peterbell10	2024-04-13 04:55:15 +00:00
Qingpeng Li	7e3f80f00f	accelerate `binary_cross_entropy_with_logits` (#122789 ) Following https://github.com/pytorch/pytorch/pull/115539 Same benchmark as in #115539: \|avg time (ms)\|with `pos_weight`\|no `pos_weight`\| \|-\|-\|-\| \|before #115539 \|2049\|1736\| \|after #115539 \|1320\|1049\| \|this PR \|907 \|801\| This PR is 24-31% faster than the version after #115539. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122789 Approved by: https://github.com/peterbell10	2024-04-13 04:18:47 +00:00
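The 24-31% range quoted above follows from the table, assuming "faster" is measured as the reduction in average time relative to the version after #115539:

```python
def pct_time_reduction(before_ms: float, after_ms: float) -> float:
    """Percentage by which the average time dropped."""
    return (before_ms - after_ms) / before_ms * 100.0


# Numbers taken from the benchmark table in the commit message.
with_pos_weight = pct_time_reduction(1320, 907)
no_pos_weight = pct_time_reduction(1049, 801)

print(f"with pos_weight: {with_pos_weight:.1f}%")
print(f"no pos_weight:   {no_pos_weight:.1f}%")
```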
sanchitintel	d39e6b3156	Cleanup: Remove redundant `inference_patterns` PatternMatcherPass (#121602 ) ## Summary Removes a redundant `PatternMatcherPass` in Inductor post-grad passes Pull Request resolved: https://github.com/pytorch/pytorch/pull/121602 Approved by: https://github.com/jgong5, https://github.com/eellison	2024-04-13 04:04:11 +00:00
Sanjeev Grampurohit	eeea3b12aa	Fix _LazyConvXdMixin.initialize_parameters and add related tests (#123756 ) Fixes #123257 _LazyConvXdMixin.initialize_parameters did not handle positional args (other than input) and kwargs to be passed on to the corresponding non-lazy class' .forward() method. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123756 Approved by: https://github.com/mikaylagawarecki	2024-04-13 03:58:37 +00:00
statelesshz	2216068559	Enable UFMT on test/test_ops* (#123935 ) Part of https://github.com/pytorch/pytorch/issues/123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123935 Approved by: https://github.com/ezyang	2024-04-13 03:31:56 +00:00
Taras Tsugrii	71b8363f40	[inductor] Remove unused local variable. (#120227 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120227 Approved by: https://github.com/Skylion007	2024-04-13 03:19:13 +00:00
Yifu Wang	2a2e1d8e4f	[functional collective] change the Python APIs to only use the native funcol ops (#123777 ) ## Summary After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR: - Removed `use_native_funcol()`. - Removed the code path in the Python APIs when `use_native_funcol()` is `False`. - Changed the CI tests that run on both native funcol and legacy funcol through the Python API to only run with native funcol. ## Test Changes `test_functional_api.py` - Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable and has been removed from the native funcol. - Removed `TestWaitiness` which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` `b7fac76fc2/test/distributed/test_c10d_functional_native.py (L114-L116)` `test/distributed/_tensor/test_dtensor.py` `test/distributed/_tensor/test_dtensor_compile.py` `test/distributed/test_device_mesh.py` `test/distributed/_tensor/experimental/test_tp_transform.py` `test/distributed/_tensor/test_matrix_ops.py` `test/distributed/test_inductor_collectives.py` - All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol. `test/distributed/test_c10d_functional_native.py` - Removed the `run_with_native_funcol` decorators. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123777 Approved by: https://github.com/wanchaol ghstack dependencies: #123776	2024-04-13 03:08:36 +00:00
Yifu Wang	2da3e113ca	[functional_collective] remove the logic that forces torch-xla to use legacy funcol (#123776 ) After https://github.com/pytorch/xla/pull/6887, torch-xla now also uses the all_reduce from native funcol. So we can remove this logic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123776 Approved by: https://github.com/wanchaol	2024-04-13 03:08:36 +00:00
Animesh Jain	58afcd7b61	[dynamo][dict] Add UnspecializedNNModuleVariable to dict keys (#122812 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122812 Approved by: https://github.com/jansel ghstack dependencies: #122943, #123877, #123878	2024-04-13 02:07:35 +00:00
Animesh Jain	fefe6e2fea	[dynamo][3.12] Stop backend detection on the first RETURN_VALUE (#123878 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123878 Approved by: https://github.com/williamwen42 ghstack dependencies: #122943, #123877	2024-04-13 02:07:35 +00:00
willfengg	f1654fd4b0	[PT2D][FSDP] skip FSDP hooks base on dynamo config (#123021 ) unit test: `pytest test/distributed/_composable/fsdp/test_fully_shard_compile.py` For FSDP, we turn compiling hooks on or off based on `torch._dynamo.config.skip_fsdp_hooks` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123021 Approved by: https://github.com/yf225, https://github.com/anijain2305	2024-04-13 01:47:25 +00:00
cyy	77a45883ce	[Reland] [Distributed] [2/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#123821 ) Reland of #122892 with problematic changes reverted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123821 Approved by: https://github.com/Skylion007	2024-04-13 00:57:03 +00:00
Sheng Fu	879f5b9a39	Pass triton kernel info to record function (#123871 ) Summary: This diff passes triton kernel information, such as kernel python file, kernel type, grid, and stream, to record_function. With this information, the execution trace can capture triton kernels and replay them in PARAM. Test Plan: unit test buck2 test caffe2/test:profiler -- test_record_function_fast Differential Revision: D56021651 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123871 Approved by: https://github.com/sraikund16	2024-04-13 00:55:44 +00:00
albanD	7234f180f3	Add mtia to codeowner (#123975 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123975 Approved by: https://github.com/egienvalue	2024-04-13 00:46:08 +00:00
Aaron Gokaslan	1d6c5972c1	[BE]: Optimize min/max/sum comprehensions C419 (#123960 ) Automatic fixes that replaces certain list comprehensions with generator ones where appropriate so that they are immediately consumed. This is preview functionality in ruff for rule C419 and it was automatically applied. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123960 Approved by: https://github.com/malfet	2024-04-12 23:54:15 +00:00
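The kind of rewrite described above replaces a list comprehension that is immediately consumed by `sum`, `min`, or `max` with a generator expression. The snippet below is an illustrative example, not code taken from the PR:

```python
values = [3, 1, 4, 1, 5]

# Before: builds a temporary list only to reduce it right away.
total_from_list = sum([v * v for v in values])
largest_from_list = max([abs(v - 3) for v in values])

# After: the generator feeds the reduction directly, with no temporary list.
total_from_gen = sum(v * v for v in values)
largest_from_gen = max(abs(v - 3) for v in values)
```

Both forms compute the same result; the generator form simply avoids materializing the intermediate list.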
Brian Hirsh	961eb39348	AOT logging: log fw_metadata with each graph (#118646 ) Log fw_metadata for each AOT graph. This is helpful for seeing information about subclass graph inputs/outputs/tangents, and lots of other stuff Pull Request resolved: https://github.com/pytorch/pytorch/pull/118646 Approved by: https://github.com/tugsbayasgalan, https://github.com/ezyang ghstack dependencies: #118645	2024-04-12 23:53:53 +00:00
Shengbao Zheng	585cd117e6	[nccl-pg] print broadcast ncclunique id duration (#123963 ) Summary: Print NCCL PG broadcast nccl unique id duration for measurement. Differential Revision: D56048059 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123963 Approved by: https://github.com/wconstab	2024-04-12 23:33:11 +00:00
Animesh Jain	3e98bdd66d	[dynamo] Turn on CPP guard manager (#123547 ) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/123547 Approved by: https://github.com/jansel	2024-04-12 23:30:56 +00:00
Aart Bik	d564fe7dca	[sparse] add proper path for cloning sparse tensors (#123127 ) The code does the right thing (rather than crashing). This is a small step towards https://github.com/pytorch/pytorch/issues/117188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123127 Approved by: https://github.com/pearu, https://github.com/cpuhrsch	2024-04-12 23:19:51 +00:00
NiTIAN	3dde6a461f	fix cpp path in torch/_C/_autograd.pyi (#123924 ) The file `tools/autograd/init.cpp` does not exist, I think the right path is `torch/csrc/autograd/init.cpp`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123924 Approved by: https://github.com/Skylion007	2024-04-12 22:32:00 +00:00
Victor Toni	380180c918	Fix typo (#123767 ) Fixes a tiny typo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123767 Approved by: https://github.com/Skylion007	2024-04-12 22:26:08 +00:00
Xuehai Pan	7b11fb4695	[Dynamo] fix opcode `YIELD_FROM` and `SEND` (#123912 ) This PR is split from #120300. - #120300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123912 Approved by: https://github.com/anijain2305	2024-04-12 21:57:47 +00:00
Tristan Rice	4b889d1247	stop TestMakeFx leaking to other tests (#123958 ) Fixes #123916 Due to MultiThreadedTestCase we're leaking is_fx_tracing_flag to other tests which causes any dynamo based tests to fail. The test execution order is arbitrary which caused this to not be caught in development. Test plan: ```sh pytest --random-order test/distributed/test_functional_api.py -k 'TestMakeFx or test_all_to_all_single_compile_True' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123958 Approved by: https://github.com/yifuwang	2024-04-12 21:43:12 +00:00
Zitong Zeng	c65aa5af6e	[Pytorch] doc sync-stream-and-free-HBM counter in memory_stats (#123799 ) Differential Revision: D56000503 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123799 Approved by: https://github.com/malfet	2024-04-12 21:19:45 +00:00
rzou	5d1f9bd2bc	Move the trace_rules.py docs up (#123873 ) I always remember that the docs exist but cannot actually find it in the file because it is on line 3000. Moving it to the top of the file for visibility. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123873 Approved by: https://github.com/yanboliang	2024-04-12 20:18:38 +00:00
albanD	79deff689f	Update compile doc to suggest Module.compile (#123951 ) For users for whom fqn change is problematic Pull Request resolved: https://github.com/pytorch/pytorch/pull/123951 Approved by: https://github.com/msaroufim	2024-04-12 20:13:21 +00:00
andrewor14	762e19606e	[quant] Enable backward for choose_qparams_per_token_asymmetric (#123452 ) Summary: When running the backward for this op, we get the error: ``` RuntimeError: derivative for aten::aminmax is not implemented ``` This commit replaces this call with separate amin and amax calls instead, which do have implemented derivatives. Test Plan: python test/test_quantization.py -k test_decomposed_choose_qparams_per_token_asymmetric_backward Reviewers: jerryzh168, digantdesai Subscribers: jerryzh168, digantdesai, supriyar Differential Revision: [D55805170](https://our.internmc.facebook.com/intern/diff/D55805170) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123452 Approved by: https://github.com/digantdesai, https://github.com/jerryzh168, https://github.com/zou3519	2024-04-12 20:05:56 +00:00
Jane Xu	3346ec8263	[BE] Document what is tested in TestOptim (#123853 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123853 Approved by: https://github.com/soulitzer	2024-04-12 19:59:29 +00:00
eqy	2f0fc04fa3	[CUDA][64-bit indexing] Bump large tensor threshold of `test_cross_entropy_large_tensor` to 70GiB (#123772 ) `torch.cuda.max_memory_reserved()` here shows 68729962496 (about 65546 MiB). CC @malfet @crcrpar Pull Request resolved: https://github.com/pytorch/pytorch/pull/123772 Approved by: https://github.com/mikaylagawarecki	2024-04-12 19:18:20 +00:00
Jason Ansel	8069469081	[dynamo] Support Tuple[int] args to autograd.Function (#123887 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123887 Approved by: https://github.com/anijain2305 ghstack dependencies: #123700, #123705, #123786, #123790, #123803, #123804, #123896	2024-04-12 19:03:13 +00:00
Jason Ansel	70b8c58f84	[dynamo] Emit warning to turn on capture_scalar_outputs (#123896 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123896 Approved by: https://github.com/anijain2305 ghstack dependencies: #123700, #123705, #123786, #123790, #123803, #123804	2024-04-12 19:03:13 +00:00
Jason Ansel	e3935783f7	[dynamo] Fix @property on user-defined nn.Module (#123804 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123804 Approved by: https://github.com/anijain2305 ghstack dependencies: #123700, #123705, #123786, #123790, #123803	2024-04-12 19:03:13 +00:00
Jason Ansel	6bac183dc2	[dynamo] Support numpy.iinfo/finfo (#123803 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123803 Approved by: https://github.com/anijain2305 ghstack dependencies: #123700, #123705, #123786, #123790	2024-04-12 19:03:13 +00:00
Jason Ansel	11e6f84ad8	[dynamo] Graph break on uninitialized nn.Module (#123790 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123790 Approved by: https://github.com/anijain2305 ghstack dependencies: #123700, #123705, #123786	2024-04-12 19:03:13 +00:00
Jason Ansel	6022600cc6	[inductor] Handle meta tensor ops in graph (#123786 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123786 Approved by: https://github.com/anijain2305 ghstack dependencies: #123700, #123705	2024-04-12 19:03:13 +00:00
Jason Ansel	6b0ba6bbd3	[dynamo] Improve constant-prop for regex/torch.__version__ (#123705 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123705 Approved by: https://github.com/anijain2305 ghstack dependencies: #123700	2024-04-12 19:03:13 +00:00
Yuanhao Ji	a625705290	Enable UFMT on all of `test/nn` (#123809 ) Part of: #123062 Ran lintrunner on: - `test/nn` with command: ```bash lintrunner -a --take UFMT --all-files ``` Co-authored-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123809 Approved by: https://github.com/mikaylagawarecki	2024-04-12 18:32:25 +00:00
Shawn Xu	04acdad829	[PT] [FSDP] [test] add barrier device ids (#123866 ) Summary: without this the `ProcessGroupNCCL` lib would try to infer the device id and emit a warning. This doesn't change the behavior just makes it explicit. > ProcessGroupNCCL.cpp:3720] [PG 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. Test Plan: CI Differential Revision: D55998175 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123866 Approved by: https://github.com/awgu	2024-04-12 18:29:32 +00:00
brothergomez	366b24e242	[Inductor] Add a device agnostic DeviceGuard class to inductor (#123338 ) Summary: Currently, although only in one place in inductor, the `device` context manager from the device interface is used. This PR creates an inductor specific `DeviceGuard` class for use in these cases, which keeps a reference to the `DeviceInterface` class which is defined and added out of tree. This then offloads the device specific work to the device interface, instead of having to define this logic on the device class which isn't strictly necessary for inductor. Ideally I would have used the existing `DeviceGuard` class, but these are defined per device and don't work well with inductor's device agnostic/out of tree compatible design. With the existing classes in mind, I am happy to take suggestions on the renaming of this class. Whilst I was there, I also took the opportunity to rename `gpu_device` to `device_interface` to clarify this is not necessarily a GPU. Test Plan: None currently, happy to add some. Co-authored-by: Matthew Haddock <matthewha@graphcore.ai> Co-authored-by: Adnan Akhundov <adnan.akhundov@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123338 Approved by: https://github.com/aakhundov	2024-04-12 18:21:27 +00:00
Oguz Ulgen	6367eab1a6	Normalize remote/local cache names (#123914 ) Differential Revision: D56027380 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123914 Approved by: https://github.com/aakhundov	2024-04-12 18:18:09 +00:00
Thiago Crepaldi	23dbe2b517	Add test for skipping hf logging during export (#123410 ) https://github.com/pytorch/pytorch/pull/123402 already supports hf logging because the HF logger is based on the logging module. This PR only adds a test to guard against regressions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123410 Approved by: https://github.com/BowenBao, https://github.com/malfet	2024-04-12 17:42:46 +00:00
Jez Ng	3c09c6b91a	Fix memory planning compile error (#123867 ) Summary: We should be using CppPrinter in the cpp wrapper codegen, not the ExprPrinter (which prints expressions for Python) Not really a memory-planning-specific bug, but exposed by mem planning because it tends to emit more complicated expressions Differential Revision: D56025683 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123867 Approved by: https://github.com/hl475, https://github.com/chenyang78	2024-04-12 17:34:58 +00:00
David Chiu	ab647bd325	Add missing interfaces of `torch.optim.swa_utils` (#117036 ) Add type hints for the function/class interfaces that appear in torch/optim/swa_utils.py but are missing in torch/optim/swa_utils.pyi. - get_ema_multi_avg_fn - get_swa_multi_avg_fn - get_ema_avg_fn - get_swa_avg_fn - AveragedModel.__init__(multi_avg_fn) - SWALR.get_lr Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/117036 Approved by: https://github.com/janeyx99	2024-04-12 17:17:36 +00:00
andrewor14	5c0a380bdf	[pt2e][qat] Support conv-transpose-bn[-relu] QAT fusion (#123652 ) Summary: This commit adds support for QAT fusion for the [conv-transpose-bn] and [conv-transpose-bn-relu] patterns. Test Plan: python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_qat_conv_transpose_bn python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_qat_conv_transpose_bn_relu python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_qat_conv_transpose_bn python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_qat_conv_transpose_bn_relu Reviewers: jerryzh168 Subscribers: jerryzh168, supriyar Tasks: https://github.com/pytorch/pytorch/issues/122224 Differential Revision: [D55930704](https://our.internmc.facebook.com/intern/diff/D55930704) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123652 Approved by: https://github.com/jerryzh168	2024-04-12 17:16:02 +00:00
Yukio Siraichi	e4c887fbf6	[AOTAutograd] Replay views on output using `FunctionalTensor` metas. (#121007 ) Fix: #120336 This PR fixes an issue in AOTAutograd, specifically on backends that don't support views by themselves (e.g. XLA). Previously, AOTAutograd tried to reconstruct output views by calling `as_strided` on the concrete bases using sizes and strides of the outputs that aliased them. Since backends such as XLA don't support tensor aliasing, the sizes and strides would be those of a contiguous tensor (not a view tensor). Because of that, calling `as_strided` would error, since the output tensor would be bigger than its base. Instead, this PR applies the sequence of `ViewMeta` gathered for each output during the functionalization phase. Note: we intentionally don't support base tensors that went through metadata mutation, i.e. in-place view operations. In summary, this PR: - Introduces one `FunctionalTensorWrapper` member function alongside its Python APIs - `apply_view_metas(base)`: applies the `ViewMeta` sequence of the given instance onto another base - Introduces a `OutputAliasInfo.functional_tensor` field - Saves the `FunctionalTensorWrapper` instance collected by the functionalization phase - Wraps it with a new `FunctionalTensorMetadataEq` class for comparing only the metadata of the tensors - Plumbs `OutputAliasInfo.functional_tensor` to `gen_alias_from_base` function - Applies the `ViewMeta` sequence of the saved `FunctionalTensor` onto `aliased_base_tensor` - Propagates `OutputAliasInfo.functional_tensor` when updating `fw_metadata` (this PR description was updated in order to reflect the most recent changes) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121007 Approved by: https://github.com/bdhirsh	2024-04-12 16:54:13 +00:00
Brian Hirsh	435db051d0	get torch.distributed.breakpoint() to work under Python/Meta contexts (#118645 ) I noticed that when I put a `torch.distributed.breakpoint()` in [here](https://github.com/pytorch/pytorch/blob/main/torch/_subclasses/meta_utils.py#L605), it would fail. This fixes it. In theory, it would probably be better to have a way to get the `barrier()` call to skip the dispatcher completely. I wasn't sure how to do that though, and this seems to cover 90% of issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118645 Approved by: https://github.com/yifuwang	2024-04-12 16:36:52 +00:00
Pian Pawakapan	07e0faf3ef	[export] set AOTAutograd ctx to enable_grad with pre-dispatch export (#123671 ) Summary: Currently, torch.export (through AOTAutograd) compiles with a torch.no_grad() wrapper, which affects the presence of `set_grad_enabled` nodes in pre-dispatch export graphs. This changes the wrapper to nullcontext (i.e. enable grad) if `pre_dispatch=True`. An example that previously failed without `with torch.no_grad()` is below: ``` class Model(torch.nn.Module): def forward(self, x, y): with torch.enable_grad(): x = x + y return x model = Model() exported_program = torch.export._trace._export( model, (torch.tensor(2), torch.tensor(3)), dynamic_shapes=None, pre_dispatch=True, strict=False ) ``` The pass would inline the add call, but then try to construct a HOO subgraph with no inputs/outputs: ``` def forward(self): _set_grad_enabled_1 = torch._C._set_grad_enabled(False) ``` Test Plan: Test case checking that nullcontext & no_grad wrappers lead to different export behaviors (regarding set grad subgraphs). Reviewed By: tugsbayasgalan Differential Revision: D55777804 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123671 Approved by: https://github.com/tugsbayasgalan	2024-04-12 16:16:23 +00:00
Aart Bik	757daece95	[sparse] add meta support for add operation (and copy) (#123594 ) This is a small step towards #117188 @pearu to review (this was split off from #117907) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123594 Approved by: https://github.com/pearu, https://github.com/peterbell10	2024-04-12 15:50:30 +00:00
Zhengxu Chen	951582949b	[export] Enforce final classes in serialization. (#123861 ) Summary: as title, these are private API and not meant to be used across repos. Test Plan: CI Differential Revision: D56027954 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123861 Approved by: https://github.com/tugsbayasgalan	2024-04-12 15:44:56 +00:00
Andres Lugo-Reyes	2cb3301f80	[ROCm] Add cast to kFloat in amax calculation (#123872 ) necessary cast to kFloat missed in previous amax PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/123872 Approved by: https://github.com/drisspg	2024-04-12 15:38:41 +00:00
Cui, Yifeng	b024c0c2ef	Convert MKL symbols from global to local (#122284 ) PyTorch is statically linked to MKL, but the MKL symbols are globally visible, which may cause symbol conflicts. Such an error has been observed when a different version of MKL is dynamically linked to other components: `libtorch_cpu.so` was invoked incorrectly when an MKL descriptor was freed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122284 Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/ezyang	2024-04-12 15:36:47 +00:00
Shivam Raikundalia	616446cc0a	Update Kineto Hash to fix OSS json output (#123885 ) Summary: Need to have a temporary flag in Kineto so the correct JSON output is used. Will delete all temporary flags afterwards. Test Plan: Tested traces using updated hash. Values matched the expected order of magnitude and general range. Differential Revision: D56045866 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123885 Approved by: https://github.com/aaronenyeshi	2024-04-12 14:49:57 +00:00
Florian	41613a0803	[Profiler][PrivateUse1] Profiler support PrivateUse1 key (#120556 ) Summary: 1. Package public headers of kineto if USE_KINETO so that they can be used by PrivateUse1 users. 2. Add PrivateUse1 key to ActivityType. 3. Support PrivateUse1 key in the functions deviceTypeFromActivity and _supported_activities. 4. Fix some bugs when processing profiler results. Co-authored-by: albanD <desmaison.alban@gmail.com> Co-authored-by: Aaron Shi <enye.shi@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120556 Approved by: https://github.com/aaronenyeshi	2024-04-12 14:28:19 +00:00
DanilBaibak	6f4c7eeb08	Migrate linux-focal-py3_11-clang10-build to ARC (#123441 ) Migrate linux-focal-py3_11-clang10-build to ARC Pull Request resolved: https://github.com/pytorch/pytorch/pull/123441 Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt	2024-04-12 14:23:12 +00:00
DanilBaibak	e023863474	Migrate linux-focal-py3_8-clang10-build to ARC (#123440 ) Migrate linux-focal-py3_8-clang10-build to ARC Pull Request resolved: https://github.com/pytorch/pytorch/pull/123440 Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt	2024-04-12 14:03:42 +00:00
ZhiweiYan-96	1cdde98df4	Intel GPU oneDNN upstreaming for library compilation (#117098 ) # Motivation As proposed in https://github.com/pytorch/pytorch/issues/114848 and https://github.com/pytorch/pytorch/issues/114723, the oneDNN library is an important component of the Intel GPU software ecosystem. This PR is intended to enable oneDNN compilation for Intel GPU. It is the first step toward enabling operators such as `at::baddmm`. With this PR, a static library `libdnnl.a` for GPU would be compiled in directory `/build/xpumkldnn_proj-prefix`. It can be further linked to `libtorch_xpu.so` in future. The compilation depends on the `USE_XPU` boolean variable and on runtime checks such as SYCL, which are defined in https://github.com/pytorch/pytorch/pull/116019 for runtime support. Once #116019 is merged, the compilation can be triggered. The modification is independent of oneDNN CPU compilation, hence no changes are introduced to the CPU CMake files (e.g. FindMKLDNN.cmake) Co-authored-by: xiaolil1 <xiaoli.liu@intel.com> Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/117098 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/atalman	2024-04-12 13:46:22 +00:00
PyTorch MergeBot	3120dbbf81	Revert "[sparse] Add fast semi-structured spasification kernels (#122350 )" This reverts commit aaec97a40364bb6ccfd968f28d309cfff8748d20. Reverted https://github.com/pytorch/pytorch/pull/122350 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122350#issuecomment-2051757450))	2024-04-12 13:26:10 +00:00
Brian Hirsh	f9f7ef33c4	AOTAutograd: add config to error when overlapping input checks would cause slow compile / runtimes (#123455 ) We should eventually make the non-overlapping checks faster when dynamic shapes are enabled, but this is pretty difficult to do. So for now this PR adds a config that lets us fail fast when this situation happens, instead of causing compile times to secretly come to a crawl. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123455 Approved by: https://github.com/ezyang	2024-04-12 13:25:33 +00:00
PyTorch MergeBot	f0eb162730	Revert "Switch quantized_decomposed over to new custom ops API (#123454 )" This reverts commit 638729c0cdf3ce4274f4d68f8e46e5a1cd36cbe8. Reverted https://github.com/pytorch/pytorch/pull/123454 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/123454#issuecomment-2051738976))	2024-04-12 13:14:59 +00:00
Simon Fan	7fc3aa5f81	[compiled autograd][aot] Trim runtime refs for list inputs from dynamo (#122535 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122535 Approved by: https://github.com/bdhirsh ghstack dependencies: #123630, #123674, #122353, #123359	2024-04-12 10:29:09 +00:00
Simon Fan	540b451e91	[compiled autograd][dynamo] Codegen aliases to keep grad mutated tensors alive (#123359 ) The current codegen is problematic if __compiled_fn_0 clears the inputs list, since we need it for assignment afterwards ```python def forward(inputs): __compiled_fn_0 = ... # The actual function needs to be provided graph_out_0 = __compiled_fn_0(inputs) # clears inputs temp_list = [] temp_list.append(graph_out_0[0]) inputs[4].grad = graph_out_0[1] # inputs is empty, index error inputs[7].grad = graph_out_0[2] inputs[8].grad = graph_out_0[3] inputs[9].grad = graph_out_0[3] del graph_out_0 return temp_list ``` With this fix, we use aliases to keep the tensors alive ```python def forward(inputs): __compiled_fn_0 = ... # The actual function needs to be provided inputs_ref_1 = inputs[9] inputs_ref_2 = inputs[4] inputs_ref_3 = inputs[8] inputs_ref_4 = inputs[7] graph_out_0 = __compiled_fn_0(inputs) temp_list = [] temp_list.append(graph_out_0[0]) inputs_ref_2.grad = graph_out_0[1] inputs_ref_4.grad = graph_out_0[2] inputs_ref_3.grad = graph_out_0[3] inputs_ref_1.grad = graph_out_0[3] del graph_out_0 return temp_list ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123359 Approved by: https://github.com/jansel ghstack dependencies: #123630, #123674, #122353	2024-04-12 10:29:09 +00:00
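The failure mode in the commit above can be reproduced without PyTorch. The following is a minimal sketch, assuming hypothetical stand-ins (`FakeTensor`, `compiled_fn`, `forward_broken`, `forward_fixed`) that are not part of dynamo; it only illustrates why aliases taken before the call survive the input list being cleared.

```python
class FakeTensor:
    """Hypothetical stand-in for a tensor that has a .grad slot."""

    def __init__(self, name):
        self.name = name
        self.grad = None


def compiled_fn(inputs):
    """Hypothetical compiled graph: it frees its boxed inputs before returning."""
    n = len(inputs)
    inputs.clear()  # the caller's list is now empty
    return [f"out{i}" for i in range(n)]


def forward_broken(inputs):
    graph_out = compiled_fn(inputs)
    inputs[1].grad = graph_out[1]  # IndexError: inputs was cleared by compiled_fn
    return [graph_out[0]]


def forward_fixed(inputs):
    inputs_ref_1 = inputs[1]  # alias taken before the call keeps the object reachable
    graph_out = compiled_fn(inputs)
    inputs_ref_1.grad = graph_out[1]
    return [graph_out[0]]
```

Calling `forward_broken([a, b])` raises `IndexError`, while `forward_fixed([a, b])` assigns `b.grad` through the alias, which mirrors the `inputs_ref_N` names in the generated code quoted in the commit message.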
Simon Fan	d274d57037	[compiled autograd][dynamo] Make compiled graph take in boxed inputs (#122353 ) ### Context In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container. And [Dynamo generates](`fdc281f258/torch/_dynamo/codegen.py (L371)`) the runtime function's signature using the graph's graphargs. This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](https://github.com/pytorch/pytorch/pull/83137#issuecomment-1211320670). ```python # original code def forward(inputs): a, b, c, d, e = inputs inputs.clear() out = a out += b del b # frees memory out += c del c # frees memory out += d del d # frees memory out += e del e # frees memory return out # compiled code: def forward(a, b, c, d, e): # b, c, d, e can't be freed before end of function ``` This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory. ### Solution We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case). This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`. 
With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case. ```python def forward(inputs): # a, b, c, d, e can be freed within the function now ``` Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](`597f479643/torch/_inductor/compile_fx.py (L1454-L1478)`), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122353 Approved by: https://github.com/jansel ghstack dependencies: #123630, #123674	2024-04-12 10:29:09 +00:00
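The memory argument in the commit above can be observed in plain Python. This is a rough sketch, assuming CPython's reference counting and hypothetical names (`Buffer`, `flat`, `boxed`, `run_flat`, `run_boxed`); it is not dynamo's generated code. A weak reference serves as a probe for whether an object is still alive inside the callee.

```python
import weakref


class Buffer:
    """Hypothetical stand-in for a large activation tensor."""


def flat(a, b, probe):
    del b  # drops only the callee's name; the caller still references the object
    return probe() is not None


def boxed(inputs, probe):
    a, b = inputs
    inputs.clear()  # the list no longer references the elements
    del b  # last reference dropped, so the object is freed here (CPython refcounting)
    return probe() is not None


def run_flat():
    a, b = Buffer(), Buffer()
    probe = weakref.ref(b)
    return flat(a, b, probe)  # the caller's locals a and b stay alive across the call


def run_boxed():
    inputs = [Buffer(), Buffer()]
    probe = weakref.ref(inputs[1])
    return boxed(inputs, probe)  # the caller holds only the list, not the elements
```

Under these assumptions `run_flat()` reports that the second buffer is still alive after `del b`, while `run_boxed()` reports that it has been freed, which is the property the boxed calling convention is meant to provide.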
Pritam Damania	9dfeec9cdc	Add a mode to avoid clone() in DDPSink (#122927 ) DDPSink clones the outputs of DDP to avoid in-place modification of loss (see https://github.com/pytorch/pytorch/issues/61982). However, when outputs are really large (2-3GB) this adds a lot of overhead for peak memory. As a result, adding a mode to avoid this clone in cases where users are not modifying loss in-place. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122927 Approved by: https://github.com/fegin, https://github.com/rohan-varma	2024-04-12 08:56:10 +00:00
Shengbao Zheng	4e9094533e	[c10d/nccl-pg] allow user to pass process group description (#123472 ) Summary: We need a way to allow users to set a customized description for a process group, e.g. FSDP, PP. Here are several use cases of a user-specified group_desc: - Logging: we can easily match a log line and understand what this collective/pg is used for. - PyTorch traces (e.g. Kineto, Execution Trace) can benefit from the PG desc since trace analysis and benchmarks will be able to easily differentiate PG purposes like FSDP, PP. - Lower layer collectives (e.g. NCCL) debug: we will be able to expose PG desc to NCCL communicator so NCCL layer operations can be easily correlated to a PG. Solution: Add a group_desc field to c10d Differential Revision: D55781850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123472 Approved by: https://github.com/kwen2501	2024-04-12 08:44:21 +00:00
Xuehai Pan	73f0ecc1ac	[BE] UFMT directory `torch/_functorch` (#123723 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123723 Approved by: https://github.com/Skylion007	2024-04-12 08:04:51 +00:00
Zhenghao Zhao	7ff53e169f	add option to turn on return_tuple in _SplitterBase (#123868 ) Summary: as title. split the oss change from D55871896 into this separate diff Test Plan: deploy Differential Revision: D56032268 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123868 Approved by: https://github.com/ZhengkaiZ, https://github.com/DanilBaibak	2024-04-12 07:50:28 +00:00
PyTorch MergeBot	d994d993c0	Revert "[inductor] Fix fresh_inductor_cache() (#122661 )" This reverts commit cda383e7bcdac029a6d5508d63c0355a40bb0d32. Reverted https://github.com/pytorch/pytorch/pull/122661 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122661#issuecomment-2051171028))	2024-04-12 07:26:50 +00:00
PyTorch MergeBot	e881d567f4	Revert "[inductor] Write generated files from parent process (#123409 )" This reverts commit 79c565b24e6c305c09c8c908e27f4023f41dd567. Reverted https://github.com/pytorch/pytorch/pull/123409 on behalf of https://github.com/DanilBaibak due to Needs to be reverted because it blocks reverting of the broken PR. ([comment](https://github.com/pytorch/pytorch/pull/123409#issuecomment-2051166617))	2024-04-12 07:23:57 +00:00
PyTorch MergeBot	5669334175	Revert "Add Matmul recipe into x86_inductor_quantizer (#122776 )" This reverts commit e8e9261b906f69b397e4027362be801f98a68d62. Reverted https://github.com/pytorch/pytorch/pull/122776 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122776#issuecomment-2051073373))	2024-04-12 06:29:27 +00:00
Kurt Mohler	db895ace1d	Only run backward part of COW test if results are strided (#123870 ) Fixes #123792 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123870 Approved by: https://github.com/ezyang	2024-04-12 04:43:02 +00:00
DaiFu	87f7486df9	Support SparseCsrPrivateUse1 (#123826 ) As in the title, the changes support SparseCsr tensors working on PrivateUse1 devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123826 Approved by: https://github.com/ezyang	2024-04-12 04:13:35 +00:00
Yuanhao Ji	706f7d1f22	Enable UFMT on `test/jit_hooks`, `test/lazy` and some files (#123807 ) Part of: #123062 Ran lintrunner on: - `test/jit_hooks` - `test/lazy` - `test/linear.py` - `test/load_torchscript_model.py` - `test/mkl_verbose.py` - `test/mkldnn_verbose.py` with command: ```bash lintrunner -a --take UFMT --all-files ``` Co-authored-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123807 Approved by: https://github.com/ezyang	2024-04-12 03:39:38 +00:00
Animesh Jain	4e3022dbe9	[dynamo][logs] Print bytecode before tracing (#123877 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123877 Approved by: https://github.com/jansel ghstack dependencies: #122943	2024-04-12 02:32:58 +00:00
Animesh Jain	ede9e8237a	[dynamo] Bug fix for GET_YIELD_FROM_ITER (#122943 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122943 Approved by: https://github.com/jansel	2024-04-12 02:32:58 +00:00
Jesse Cai	aaec97a403	[sparse] Add fast semi-structured spasification kernels (#122350 ) This PR adds fast semi-structured sparsification kernels to PyTorch. These kernels accelerate semi-structured sparsification and have been added as aten native functions. In particular, three new functions have been added: * `torch._sparse_semi_structured_tile` This function will return the packed representation and metadata for both X and X', as well as the thread masks. Note that this applies 2:4 sparsity in a 4x4 tile instead of a 1x4 strip as usual. * `torch._sparse_semi_structured_apply` This function takes in an input tensor and thread masks from the above function and returns a packed representation and metadata from applying thread masks to the input tensor. * `torch._sparse_semi_structured_apply_dense` This function does the same thing as above, but instead of returning the tensor in the sparse representation it returns it in the dense representation. The subclasses have also been updated to add a new `prune_dense_static_sort` classmethod to create sparse tensors with this format. I've added some additional documentation on how to calculate the compressed tensors needed to create a SparseSemiStructuredTensor oneself. To this end, there are two new helper functions added: `sparse_semi_structured_tile` `compute_compressed_swizzled_bitmask` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350 Approved by: https://github.com/cpuhrsch	2024-04-12 02:22:56 +00:00
David Berard	868e5ced5d	[Dispatcher] Collect autograd sequence numbers on PythonTLSSnapshot dispatch keys (#123304 ) Fixes #121758 TL;DR: When profiling is turned on, the dispatcher will sometimes attach the autograd sequence number to the recorded profiler event. This PR expands the set of profiler events onto which we attach sequence numbers. Before, we'd only attach a sequence number if the current dispatch key was an Autograd dispatch key. Now, we attach a sequence number if the current dispatch key set contains Autograd. Context: The use case for this is torch.profiler for python subclasses. Autograd attaches a "sequence number" to all ops that it encounters during the forward pass. Then, the corresponding sequence number can be associated with a backward kernel when backward is executed. This is used by the profiler to associate the forward ops to the backward ops; a forward op and a backward op with the same sequence number are "linked" in some post-processing step. Prior to this PR, this profiler feature didn't work for python subclasses. The reason is that we don't collect profiler information for all the dispatches for a given kernel; we only dispatch the initial `call`, and not the subsequent `redispatch` invocations. Normally, an Autograd key (if we're running with autograd) is the highest dispatch key, so the initial `call` that we profile is an Autograd key, and we collect the sequence number. But when we're dealing with a python subclass, the first dispatch key is PythonTLSSnapshot, which eventually redispatches into Autograd. We don't record the Autograd sequence number in that case because we don't record redispatches. To fix this, this PR collects a sequence number whenever the dispatch key set contains an Autograd key. That means we might sometimes collect multiple events with the same sequence number, or possibly attach sequence numbers when we won't actually use them? (e.g. 
maybe if the initial dispatch key handler removes Autograd for some reason). Although this might be a bit confusing for users looking directly at the sequence_nr, I think the main use case is for the profiler to create fwd-bwd links; and those should still be generated correctly in these cases. Differential Revision: [D55724190](https://our.internmc.facebook.com/intern/diff/D55724190) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123304 Approved by: https://github.com/soulitzer	2024-04-12 02:01:15 +00:00
Tristan Rice	358ace1a1b	functional_collectives: add first differentiable collective -- all_to_all_single_grad (#123599 ) This adds the differentiable collective -- all_to_all_single_grad. This is the initial proof of concept PR and I will be adding the remaining collectives in follow up PRs. This adds a new function called `all_to_all_single_autograd` which is the autograd variant of `all_to_all_single`. For backwards compatibility + initial testing we wanted to make the autograd variant separate to avoid regressions. This uses `autograd::Function` to register an Autograd op that calls the original `_c10d_functional::all_to_all_single` via the dispatcher. This works with compile and inductor as opposed to the previous Python implementation that had issues. As this uses the existing `_c10d_functional` ops we don't need to register any meta functions or lowering. To avoid cudaStream issues this explicitly calls `wait_tensor` in the backward method to ensure it runs under the same stream as the async operation. This hurts performance but can be alleviated potentially using `compile`. Related work: https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py Test plan: ``` pytest test/distributed/test_functional_api.py -k test_all_to_all_single_compile pytest test/distributed/test_functional_api.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123599 Approved by: https://github.com/yifuwang	2024-04-12 01:48:49 +00:00
statelesshz	5b648afba4	Enable UFMT on test/test_multiprocessing (#123840 ) part of https://github.com/pytorch/pytorch/issues/123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123840 Approved by: https://github.com/ezyang	2024-04-12 01:21:54 +00:00
Brian Hirsh	7efaf54dc4	Fakeifying views shouldnt create symbols when dynamic=False (#123348 ) Fixes https://github.com/pytorch/pytorch/issues/123298 I was also seeing some crashes in torchtrain due to dynamic shapes, even when I set `compile(dynamic=False)` (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @wanchaol). This doesn't fix the underlying dynamic shape issues with compile + DTensor, but it does prevent dynamic shapes from leaking in. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123348 Approved by: https://github.com/ezyang ghstack dependencies: #122502, #122751	2024-04-12 01:12:23 +00:00
Brian Hirsh	96fe3c5d46	fix correctness for dynamo inlining RangeVariable __contains__ (#122751 ) Fixes https://github.com/pytorch/pytorch/issues/122379 It looks like `iter_contains()` in dynamo expects to take in something like `iter_contains(List[VariableTracker], VariableTracker])`. Previously, when we called this function where the list in question was a `RangeVariable`, we would pass in `RangeVariable.items` as our list. This is wrong, though since `RangeVariable.items` just contains the underlying [start, stop, step]. It looks like `unpack_var_sequence` does the right thing of "materializing" the range into a list of `VariableTrackers`, so I used that instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122751 Approved by: https://github.com/anijain2305, https://github.com/jansel ghstack dependencies: #122502	2024-04-12 01:12:23 +00:00
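The bug described in #122751 can be illustrated in plain Python, without dynamo. This is a minimal sketch: the helper names `contains_in_fields` and `contains_in_materialized` are hypothetical and are not dynamo functions. It only shows why a membership test against the raw [start, stop, step] fields disagrees with a test against the materialized range.

```python
def contains_in_fields(r: range, value: int) -> bool:
    # Mimics the buggy behaviour: search only the three fields of the range.
    return value in [r.start, r.stop, r.step]


def contains_in_materialized(r: range, value: int) -> bool:
    # Mimics the fix: expand the range into its elements before searching.
    return value in list(r)


r = range(0, 10, 2)  # elements: 0, 2, 4, 6, 8
for v in (4, 10):
    print(v, contains_in_fields(r, v), contains_in_materialized(r, v))
```

Here 4 is an element of the range but is not one of its fields, while 10 is the `stop` field but not an element, so the two checks disagree in both directions.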
Brian Hirsh	2fe672b146	compile: ban mutations on non-compositional uses of as_strided (#122502 ) Fixes https://github.com/pytorch/pytorch/issues/104505 I was originally going to ban all usages of as_strided + mutation in functionalization. But I'm pretty sure that as_strided + mutation is fine when we are calling as_strided on a base tensor. So in this PR I added a slightly more conservative check: if we see an as_strided + mutation, where the input to an as_strided was another view op, then I error loudly in functionalization and link to the github issue above (in case anyone runs into this in the real world) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122502 Approved by: https://github.com/ezyang, https://github.com/albanD	2024-04-12 01:12:23 +00:00
Shuqiang Zhang	22ba180e55	[c10d] add more fields for periodic logging (#123860 ) Summary: Added the names of the last enqueued, started and completed collectives, in addition to their seq ID Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/123860 Approved by: https://github.com/XilunWu	2024-04-12 00:11:07 +00:00
Honglin Zhu	78824fd212	[inductor] Fix recompiles bug for torch.full (#123811 ) Fixes #123810 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123811 Approved by: https://github.com/peterbell10	2024-04-12 00:07:47 +00:00
Shawn Xu	5b8c81eb82	[PT] [FSDP] fix HSDP sharding placement (#123778 ) Summary: https://github.com/pytorch/pytorch/pull/123230 formalized the contract for `ShardedTensor` sub group rank placement validation by making sure the placement rank is global rank, to align with general `torch.distributed` convention. The current HSDP allows for both `ShardedTensor` and `DTensor`. While `DTensor` will eventually replace `ShardedTensor`, its usage still exists and there's at least one test verifying the state dict with ST output. This got broken as the test is run periodically only so it didn't block the other PR. Fixes [#123749](https://github.com/pytorch/pytorch/issues/123749) Test Plan: CI Differential Revision: D55991256 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123778 Approved by: https://github.com/Skylion007, https://github.com/wz337	2024-04-12 00:05:49 +00:00
chilli	7f6884f620	Added some extra repr to triton template buffers and added autotuned block configs to templated attention (#123813 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123813 Approved by: https://github.com/drisspg, https://github.com/shunting314 ghstack dependencies: #123768	2024-04-11 23:57:47 +00:00
Adnan Akhundov	3d9dc976ae	Handle unqualified imports in custom Triton kernels (#123703 ) Summary: If in a custom (user-written) Triton kernel an externally imported symbol is used directly, we need to codegen the corresponding import outside the kernel body in the Python wrapper. E.g., if the user code has this: ``` from triton.language.extra.cuda.libdevice import fast_dividef @triton.jit def my_kernel(...): ... x = fast_dividef(...) ... ``` The `from triton.language.extra.cuda.libdevice import fast_dividef` line needs to be carried over together with the `my_kernel` function. The PR adds this. Test Plan: ``` $ python test/inductor/test_triton_kernels.py ... ---------------------------------------------------------------------- Ran 464 tests in 113.512s OK ``` Differential Revision: [D55953241](https://our.internmc.facebook.com/intern/diff/D55953241) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123703 Approved by: https://github.com/jansel, https://github.com/oulgen	2024-04-11 23:49:25 +00:00
Yuanhao Ji	604c9c5601	Enable UFMT on all of `test/jit` (#123623 ) Partially addresses #123062 Ran lintrunner on: - `test/jit` with command: ```bash lintrunner -a --take UFMT --all-files ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123623 Approved by: https://github.com/ezyang	2024-04-11 23:45:05 +00:00
Pian Pawakapan	d0ccf599cc	[export] Restore original placeholder names (part 2: higher-order-op subgraph naming) (#123587 ) Summary: note: breaking the original diff [D55225818](https://www.internalfb.com/diff/D55225818) into 3 parts (top-level renaming, higher-order-op subgraphs, constant input de/serialization) because of its size. Stacked PR to restore original names to placeholder nodes, replacing the default names arg0_1, arg1_1, ... This PR propagates node names to higher-order-op subgraph placeholders, retaining the top-level names and handling naming collisions by suffixing other non-placeholder nodes in the subgraph with an index. This is the same handling as in fx.Graph/fx.Node, but implemented separately as a pass. Since the input schemas of HOO subgraphs are very different, they are enumerated in _name_hoo_subgraph_placeholders(). Currently cond, map_impl, and wrap_with_set_grad_enabled are handled, but other ops can be easily added. Test Plan: verification checks on placeholder names for all export() calls, unit test in test/export/test_export.py Differential Revision: D55456749 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123587 Approved by: https://github.com/angelayi	2024-04-11 22:40:46 +00:00
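The collision handling described in #123587 (keep placeholder names, suffix other colliding nodes with an index) can be sketched in plain Python. The function `resolve_name_collisions` below is hypothetical and is not the pass implemented in `_name_hoo_subgraph_placeholders()`; it assumes node names are plain strings and that the non-placeholder names are unique among themselves.

```python
def resolve_name_collisions(placeholder_names, other_names):
    """Keep placeholder names; suffix colliding non-placeholder names with an index."""
    taken = set(placeholder_names)
    renamed = {}
    for name in other_names:
        candidate = name
        index = 1
        # Bump the suffix until the candidate no longer clashes with a taken name.
        while candidate in taken:
            candidate = f"{name}_{index}"
            index += 1
        taken.add(candidate)
        renamed[name] = candidate
    return renamed


print(resolve_name_collisions(["x", "y"], ["x", "add"]))
```

In this sketch the placeholder `x` keeps its name and the non-placeholder node `x` becomes `x_1`, while `add` is left alone because it does not collide.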
Animesh Jain	b9675e820e	[dynamo][cpp-guards] Improve the logs (#123780 ) For this program ~~~ @torch.compile(backend="eager") def fn(x, y, d): return x * y * d["foo"] * d["bar"] ~~~ Python logs are ~~~ V0410 15:48:57.778000 140318524949632 torch/_dynamo/guards.py:1785] [0/0] [__guards] GUARDS: V0410 15:48:57.778000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] ___check_type_id(L['d'], 8833952) # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:48:57.778000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] len(L['d']) == 2 # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] list(L['d'].keys()) == ['foo', 'bar'] # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] hasattr(L['x'], '_dynamo_dynamic_indices') == False # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] hasattr(L['y'], '_dynamo_dynamic_indices') == False # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] ___check_type_id(L['d']['bar'], 8842592) # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] L['d']['bar'] == 2 # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] ___check_type_id(L['d']['foo'], 8842592) # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] L['d']['foo'] == 4 # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in 
fn V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] utils_device.CURRENT_DEVICE == None # _dynamo/output_graph.py:450 in init_ambient_guards V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] check_tensor(L['x'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[4], stride=[1]) # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:48:57.780000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] check_tensor(L['y'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[4], stride=[1]) # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn ~~~ CPP logs are ~~~ V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1792] [0/0] [__guards] GUARDS: V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] TREE_GUARD_MANAGER: V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] +- RootGuardManager V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| +- DEFAULT_DEVICE: utils_device.CURRENT_DEVICE == None # _dynamo/output_graph.py:450 in init_ambient_guards V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| +- GLOBAL_STATE: ___check_global_state() V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| +- DictSubclassGuardManager: source=L['d'], accessed_by=DictGetItemGuardAccessor(d) V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| +- KeyValueManager pair at index=0 V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| \| +- KeyManager: GuardManager: 
source=list(L['d'].keys())[0] V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| \| \| +- EQUALS_MATCH: list(L['d'].keys())[0] == 'foo' # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| \| +- ValueManager: GuardManager: source=L['d']['foo'] V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| \| \| +- EQUALS_MATCH: L['d']['foo'] == 4 # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| +- KeyValueManager pair at index=1 V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| \| +- KeyManager: GuardManager: source=list(L['d'].keys())[1] V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| \| \| +- EQUALS_MATCH: list(L['d'].keys())[1] == 'bar' # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| \| +- ValueManager: GuardManager: source=L['d']['bar'] V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| \| \| +- EQUALS_MATCH: L['d']['bar'] == 2 # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| +- GuardManager: source=L['x'], accessed_by=DictGetItemGuardAccessor(x) V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| +- TENSOR_MATCH: check_tensor(L['x'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[4], stride=[1]) # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:49:41.607000 140481927914624 
torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| +- NO_HASATTR: hasattr(L['x'], '_dynamo_dynamic_indices') == False # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| +- NO_TENSOR_ALIASING: check_no_aliasing(L['x'], L['y']) V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| +- GuardManager: source=L['y'], accessed_by=DictGetItemGuardAccessor(y) V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| +- TENSOR_MATCH: check_tensor(L['y'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[4], stride=[1]) # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| +- NO_HASATTR: hasattr(L['y'], '_dynamo_dynamic_indices') == False # return x * y * d["foo"] * d["bar"] # examples/ord_dicts.py:24 in fn V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] \| \| +- NO_TENSOR_ALIASING: check_no_aliasing(L['x'], L['y']) ~~~~ This info is also present in this gist for better viewing - https://gist.github.com/anijain2305/b418706e4ad4ec2d601530bc24cf8a20 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123780 Approved by: https://github.com/ezyang, https://github.com/jansel ghstack dependencies: #123773, #123787	2024-04-11 22:23:28 +00:00
Animesh Jain	2e6871f924	[dynamo][guards-cpp] Early return in DictGuardManager for empty dicts (#123787 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123787 Approved by: https://github.com/jansel ghstack dependencies: #123773	2024-04-11 22:23:28 +00:00
Animesh Jain	b0b7aa201c	[dynamo][cpp-guards] Introduce DictSubclassGuardManager (#123773 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123773 Approved by: https://github.com/jansel	2024-04-11 22:23:28 +00:00
Peter Bell	bd225189f1	[inductor] Change OverridesData to take callables instead of strings (#123397 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123397 Approved by: https://github.com/lezcano	2024-04-11 22:22:54 +00:00
Shubhraprakash Das	af1e03fb8f	Quantized relu (#123004 ) Summary: Add Quantized relu ops. Test Plan: Run vulkan api test: # buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output" # buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_api_test_binAppleMac Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc [==========] Running 418 tests from 1 test suite. [----------] Global test environment set-up. [----------] 418 tests from VulkanAPITest .... [----------] Global test environment tear-down [==========] 418 tests from 1 test suite ran. (4510 ms total) [ PASSED ] 417 tests. [ SKIPPED ] 1 test, listed below: [ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log YOU HAVE 9 DISABLED TESTS Run quantized vulkan api test: Note the linear quantized are failing but all the convolution tests still pass. Linear failures are being debugged. # buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output" # buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc [==========] Running 86 tests from 1 test suite. [----------] Global test environment set-up. [----------] 86 tests from VulkanAPITest ... [ PASSED ] 77 tests. 
[ FAILED ] 9 tests, listed below: [ FAILED ] VulkanAPITest.linear_2d_flat [ FAILED ] VulkanAPITest.linear_2d_small [ FAILED ] VulkanAPITest.linear_2d_large [ FAILED ] VulkanAPITest.linear_3d_flat [ FAILED ] VulkanAPITest.linear_3d_small [ FAILED ] VulkanAPITest.linear_3d_large [ FAILED ] VulkanAPITest.linear_4d_flat [ FAILED ] VulkanAPITest.linear_4d_small [ FAILED ] VulkanAPITest.linear_4d_large 9 FAILED TESTS YOU HAVE 8 DISABLED TESTS Reviewed By: copyrightly Differential Revision: D52344264 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123004 Approved by: https://github.com/copyrightly	2024-04-11 21:55:25 +00:00
Andrew Gu	c75a32b9f9	[FSDP2] Fixed `is_last_backward` for 1f1b (#123857 ) `FSDPState` only uses `TrainingState.PRE_BACKWARD` as a backward training state, not `TrainingState.POST_BACKWARD`, because the FSDP state itself does not run post-backward (only its `FSDPParamGroup`, which may not exist if the state does not manage any parameters). This meant that when `is_last_backward=False`, the `FSDPState` was incorrectly still in `TrainingState.PRE_BACKWARD`, and the next `_pre_forward` would not run due to the early return logic for activation checkpointing: `7c451798cc/torch/distributed/_composable/fsdp/_fsdp_state.py (L148-L151)` We fix this by always transitioning to `TrainingState.IDLE` at the end of the current backward task, regardless of `is_last_backward`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123857 Approved by: https://github.com/weifengpy	2024-04-11 21:46:38 +00:00
Lucas Pasqualin	13070e2753	[DCP] Adds better handling in logging of specific kwargs (#123658 ) Adds additional signpost integrations to DCP Logger, to add support for MLU and metric collection. Differential Revision: [D55803461](https://our.internmc.facebook.com/intern/diff/D55803461/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123658 Approved by: https://github.com/fegin	2024-04-11 21:09:38 +00:00
Lucas Pasqualin	b7fac76fc2	[DCP] fixes for _load_state_dict_keys and supports nested keys (#123679 ) Fixes some issues with `_load_state_dict_keys`, including: * updates broken test, which was failing due to incorrect parameters * adds support for specifying nested keys, e.g. load state dict keys can now specify something like `"optimizer.state"`, which loads all keys under `optimizer.state`. * updates call site to use the private implementation of `_load_state_dict`, which properly handles empty state dicts (otherwise the keys are ignored) Big shout out to @diego-urgell who not only identified current issues, but recommended the right solutions! Pull Request resolved: https://github.com/pytorch/pytorch/pull/123679 Approved by: https://github.com/diego-urgell, https://github.com/wz337	2024-04-11 20:52:06 +00:00
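The nested-key behaviour described in #123679 can be pictured as prefix selection over flattened key names. This is a minimal sketch under stated assumptions: the helper `select_keys` is hypothetical (it is not the DCP implementation), and it assumes checkpoint keys are flattened into dot-separated strings.

```python
def select_keys(flat_keys, requested):
    """Return the keys equal to, or nested under, any requested prefix."""
    selected = []
    for key in flat_keys:
        for prefix in requested:
            # Match the key itself or anything below it, e.g. "a.b" matches "a.b.c".
            if key == prefix or key.startswith(prefix + "."):
                selected.append(key)
                break
    return selected


keys = [
    "model.weight",
    "optimizer.state.step",
    "optimizer.state.exp_avg",
    "optimizer.param_groups",
]
print(select_keys(keys, ["optimizer.state"]))
```

Requiring the `.` separator after the prefix keeps a request for `optimizer.state` from also matching an unrelated key such as `optimizer.state_extra`.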
William Wen	e70bf23b7b	[dynamo] apply certain bytecode cleaning transformations unconditionally (#123785 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123785 Approved by: https://github.com/jansel	2024-04-11 20:25:21 +00:00
Catherine Lee	3cd06f56b1	[ez] test_profiler in serial (#123665 ) Add test_profiler to the serial list since we keep needing to reopen disable issues and I think it's due to being incompatible with parallelism Pull Request resolved: https://github.com/pytorch/pytorch/pull/123665 Approved by: https://github.com/ZainRizvi, https://github.com/huydhn	2024-04-11 20:24:47 +00:00
Brian Hirsh	fa013f69bb	dynamo assertion that graph has no fake-tensor constants should check for subclasses (#118644 ) This would have caught some of the nasty errors in https://github.com/pytorch/pytorch/pull/118191 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118644 Approved by: https://github.com/tugsbayasgalan, https://github.com/zou3519 ghstack dependencies: #118647	2024-04-11 20:10:15 +00:00
ydwu4	e979f45610	[while_loop] add a simple op_info test (#123814 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123814 Approved by: https://github.com/tugsbayasgalan, https://github.com/zou3519	2024-04-11 19:59:04 +00:00
Rohan	f82d20c207	[NEON] Remove implicit type promotion in `Vectorized<c10::Half>::operator!=` (#123864 ) To make the code compilable with `gcc`, which, unlike `clang`, does not allow transparent type promotion between vectorized NEON types of the same size; see https://godbolt.org/z/xoasoGM81 for an example Pull Request resolved: https://github.com/pytorch/pytorch/pull/123864 Approved by: https://github.com/malfet	2024-04-11 19:37:11 +00:00
Jason Ansel	5a7fd20aa1	[dynamo] Support autograd.FunctionCtx.needs_input_grad (#123700 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123700 Approved by: https://github.com/anijain2305	2024-04-11 19:30:55 +00:00
Gagan Jain	016ca546aa	Adding health check server hook in torch elastic (#122750 ) (#123504 ) Summary: Building a hook for an external mechanism to monitor the health of the torch elastic launcher. The health check server depends on FileTimerServer to check whether the launcher is healthy. It will always be healthy if FileTimerServer is disabled. Implementation of start_healthcheck_server is unsupported; however, a tcp/http server can be started on a specific port to monitor the aliveness of worker_watchdog and take action accordingly. Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test Differential Revision: D55837899 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123504 Approved by: https://github.com/kurman	2024-04-11 19:10:56 +00:00
Kurt Mohler	ee869c9bb7	Avoid COW materialization in backward ops (4) (#123798 ) Affected ops: * embedding_bag * mse_loss * huber_loss * grid_sample * ctc_loss * nll_loss * pdist * _segment_reduce Part of #97856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123798 Approved by: https://github.com/ezyang ghstack dependencies: #123797	2024-04-11 18:41:41 +00:00
Kurt Mohler	69249a218b	Avoid COW materialization in backward ops (3) (#123797 ) Affected ops: * conv ops * glu * prelu * scaled_dot_product_attention * threshold * logsigmoid * binary_cross_entropy * gelu * unfold * smooth_l1_loss * embedding Part of #97856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123797 Approved by: https://github.com/ezyang	2024-04-11 18:35:08 +00:00
Brian Hirsh	9c3b87833a	AOTAutograd: keep set_() input mutations in the graph, ban other cases (#122981 ) We have some (limited) support for `set_()` input mutations in `torch.compile`, but one restriction today is that we force them to run outside of the graph, in the opaque runtime epilogue. This is a problem for ppFSDP. Why? The usage pattern of ppFSDP forward graphs look something like this: ``` def forward_fsdp(sacrificial_param, sharded_param, inp): allgathered_param = allgather(sharded_param) sacrificial_param.set_(allgathered_param) # hidden in an autograd.Function that we trace out = matmul(sacrificial_param, inp) sacrificial_param.untyped_storage().resize_(0) return out ``` When we functionalize this graph, `sacrificial_param` sees two distinct types of input mutations, that we must preserve: a `set_`, and a `resize_`. Importantly, at runtime the `set_()` must run before the `resize_()`. Why? the `set_()` updates the storage of our sacrificial param to the allgather'd data, which allows the call to `sacrificial_param.resize_()` to free the allgathered data later. If we run the two mutations in reverse order, we will never free the allgathered data. We want to put the `resize_()` mutation op inside of the graph (see next PR, also there's a much longer description in that PR for anyone interested). However, this will require us to put `set_()` in the graph as well, in order for them to run in the correct order. In order to do this, I had to add some extra restrictions: You are now required to run `set_()` under `no_grad()` if you use it with `torch.compile`, and if you perform any other mutations to the input, those must be under no_grad as well (otherwise, the mutations may mutate the `grad_fn` of the input, making it no longer safe to keep in the graph). 
These restrictions are hopefully reasonable, since `set_()` doesn't see much usage today (and the original impetus for adding set_() support a few months ago was for fsdp anyway) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122981 Approved by: https://github.com/jansel ghstack dependencies: #122433, #123646	2024-04-11 18:21:57 +00:00
Avik Chaudhuri	10a03c56e5	fix leaky fake tensor on attribute assignment, support buffer assignment (#122337 ) In non-strict, assignment of attributes in a model causes their state to contain fake tensors post-tracing, which leads to incorrect results on running the exported model. We now error when this happens, asking the user to use buffers instead. Next, we add support for assignment of buffers. The final values of the buffers turn into outputs of the graph. Since the buffers are already lifted as inputs and populated with the initial values when the model is run, this leads to a simple programming model where the driver of the model can feed the outputs back as inputs for successive runs. Differential Revision: D55146852 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122337 Approved by: https://github.com/bdhirsh, https://github.com/tugsbayasgalan	2024-04-11 18:08:31 +00:00
Jason Ansel	7c451798cc	[inductor] Disable channels_last heuristic when channels==1 (#123758 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123758 Approved by: https://github.com/shunting314	2024-04-11 17:47:07 +00:00
Jason Ansel	79c565b24e	[inductor] Write generated files from parent process (#123409 ) Before this PR we would pass generated source code over a pipe to the compile worker then the compile worker would write out the file. Doing it this way is faster and results in smaller messages to the workers (and lets us skip creating the workers in the warm start case). Pull Request resolved: https://github.com/pytorch/pytorch/pull/123409 Approved by: https://github.com/desertfire	2024-04-11 17:39:16 +00:00
willfengg	902374cc09	[CI] show doc coverage repro instructions (#123688 ) Remind devs they can reproduce the doc coverage error locally with the following message: ```You can reproduce locally by running 'cd pytorch/docs && make coverage && cat build/coverage/python.txt'``` I spent 20 minutes figuring out how to test locally, so I want to enrich the error message. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123688 Approved by: https://github.com/clee2000	2024-04-11 17:34:47 +00:00
David Berard	6b24ec480c	[Tensor] Detect more cases of symbolic sizes/strides (#123696 ) Previously, we'd just check `has_symbolic_sizes_strides()` to know whether a tensor has symbolic sizes or strides; if it does, we skip some profiler logic. But sometimes `has_symbolic_sizes_strides()` returns false even though we do actually have symbolic sizes or strides. So in this change, we add `may_have_symbolic_sizes_strides()` - which should never return false if the tensor has symbolic sizes and strides. Why not change `has_symbolic_sizes_strides()`? It seems like there's preexisting logic that assumes that "if has_symbolic_sizes_strides(), then we can assume that this tensor is guaranteed to have symbolic sizes or strides". In this case, we have python-implemented sizes or strides, which should follow a different code path. Differential Revision: [D55947660](https://our.internmc.facebook.com/intern/diff/D55947660/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123696 Approved by: https://github.com/aaronenyeshi, https://github.com/soulitzer	2024-04-11 16:51:52 +00:00
PyTorch MergeBot	fe092da874	Revert "[quant] Enable backward for choose_qparams_per_token_asymmetric (#123452 )" This reverts commit c83900887f2fb5c7a04e7fd78ad8de7a20f356d4. Reverted https://github.com/pytorch/pytorch/pull/123452 on behalf of https://github.com/clee2000 due to broke test_quantization.py::TestQuantizedTensor::test_decomposed_choose_qparams_per_token_asymmetric_backward on multiple jobs `c83900887f` https://github.com/pytorch/pytorch/actions/runs/8648781225/job/23714753103, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/123452#issuecomment-2050056601))	2024-04-11 16:19:28 +00:00
Edward Z. Yang	efa36ef092	Natively support int truncation, don't guard on positive/negative (#122827 ) This doesn't entirely fix the original problem that prompted this, but it seems to just be getting stuck in export constraint formatting now which seems like progress to me. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122827 Approved by: https://github.com/avikchaudhuri	2024-04-11 15:22:32 +00:00
andrewor14	c83900887f	[quant] Enable backward for choose_qparams_per_token_asymmetric (#123452 ) Summary: When running the backward for this op, we get the error: ``` RuntimeError: derivative for aten::aminmax is not implemented ``` This commit replaces this call with separate amin and amax calls instead, which do have implemented derivatives. Test Plan: python test/test_quantization.py -k test_decomposed_choose_qparams_per_token_asymmetric_backward Reviewers: jerryzh168, digantdesai Subscribers: jerryzh168, digantdesai, supriyar Differential Revision: [D55805170](https://our.internmc.facebook.com/intern/diff/D55805170) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123452 Approved by: https://github.com/digantdesai, https://github.com/jerryzh168	2024-04-11 14:51:42 +00:00
Brian Hirsh	134e56fa33	inductor: log unique id to match output_code to aot graphs (#118647 ) I found it helpful to be able to see, given some inductor output code, which AOT graph it came from. When you have large models with multiple graphs floating around this can be difficult, so I added the aot_config.aot_id to the printed inductor output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118647 Approved by: https://github.com/ezyang	2024-04-11 14:37:07 +00:00
rzou	638729c0cd	Switch quantized_decomposed over to new custom ops API (#123454 ) We are taking API feedback. Changes: - I removed some of the default values (they weren't being used). - I was unable to convert the last op (which is essentially an autograd.Function registered as CompositeImplicitAutograd). That one is "incorrectly registered"; I punt fixing it to the future. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123454 Approved by: https://github.com/andrewor14 ghstack dependencies: #123453, #123578	2024-04-11 13:18:06 +00:00
rzou	1b4419dc4d	Refresh OpOverloadPacket if a new OpOverload gets added (#123578 ) If a user accesses an OpOverloadPacket, then creates a new OpOverload, then uses the OpOverloadPacket, the new OpOverload never gets hit. This is because OpOverloadPacket caches OpOverloads when it is constructed. This PR fixes the problem by "refreshing" the OpOverloadPacket if a new OpOverload gets constructed and the OpOverloadPacket exists. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123578 Approved by: https://github.com/albanD ghstack dependencies: #123453	2024-04-11 13:18:06 +00:00
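The stale-cache problem and the "refresh" fix described in #123578 can be modelled with a toy registry. This is only a sketch: `ToyRegistry` and `ToyPacket` are hypothetical stand-ins, not PyTorch classes, and the op name `mylib::foo` is made up for illustration.

```python
class ToyRegistry:
    """Hypothetical stand-in for an operator registry; not a PyTorch class."""

    def __init__(self):
        self.overloads = {}  # op name -> list of overload names
        self.packets = {}    # op name -> live ToyPacket

    def get_packet(self, op):
        if op not in self.packets:
            self.packets[op] = ToyPacket(self, op)
        return self.packets[op]

    def add_overload(self, op, overload):
        self.overloads.setdefault(op, []).append(overload)
        # The "refresh" step: update the cached view of a packet that already exists.
        packet = self.packets.get(op)
        if packet is not None:
            packet.refresh()


class ToyPacket:
    """Hypothetical stand-in for a packet that caches its overloads at construction."""

    def __init__(self, registry, op):
        self.registry = registry
        self.op = op
        self.refresh()

    def refresh(self):
        # Snapshot the overloads known right now.
        self.cached = tuple(self.registry.overloads.get(self.op, ()))


registry = ToyRegistry()
registry.add_overload("mylib::foo", "default")
packet = registry.get_packet("mylib::foo")
registry.add_overload("mylib::foo", "Tensor")  # added after the packet was created
print(packet.cached)
```

Without the `refresh()` call in `add_overload`, the packet in this sketch would keep reporting only the overload that existed when it was constructed.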
rzou	8a5e7a01b5	[custom_op] Schema inference now includes default values (#123453 ) If the function has default values, we should be able to do schema inference and put the default values into the schema. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123453 Approved by: https://github.com/albanD	2024-04-11 13:18:02 +00:00
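The idea in #123453 of carrying default values into an inferred schema can be sketched with the standard `inspect` module. The helper `infer_schema_sketch` and the example function `scale` are hypothetical, and the string format produced here is illustrative only; it is not PyTorch's actual schema grammar or its inference code.

```python
import inspect


def infer_schema_sketch(fn) -> str:
    """Build a schema-like string from a function signature, keeping defaults."""
    parts = []
    for name, param in inspect.signature(fn).parameters.items():
        annotation = param.annotation
        type_name = getattr(annotation, "__name__", str(annotation))
        if param.default is inspect.Parameter.empty:
            parts.append(f"{type_name} {name}")
        else:
            # Carry the Python default value into the schema string.
            parts.append(f"{type_name} {name}={param.default!r}")
    return "(" + ", ".join(parts) + ")"


def scale(x: float, factor: float = 2.0, clamp: bool = False):
    return x


print(infer_schema_sketch(scale))
```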
Aleksandar Samardžić	b2a0b8c446	Simplify ATen sparse semi-structured operators based on CUTLASS (#123473 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123473 Approved by: https://github.com/cpuhrsch	2024-04-11 11:56:27 +00:00
Nikita Shulga	4f244cfaa0	Enable int4mm for both half and bfloat16 (#123794 ) By making performant kernels a template specialization. This is a prep change for enabling ARM+float16 fast int4 kernel TODO: - Add float32 and some testing Pull Request resolved: https://github.com/pytorch/pytorch/pull/123794 Approved by: https://github.com/mikekgfb, https://github.com/jgong5	2024-04-11 11:40:56 +00:00
Episkey0109	02b29e7d07	Add meta function for channel_shuffle operation (#123033 ) This commit introduces a meta function for the `channel_shuffle` operation, enabling PyTorch to perform shape inference and optimizations related to this operation without actual computation. The meta function assumes input shape (*, C, H, W) and validates that the number of channels (C) is divisible by the specified number of groups. Fixes #122771 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123033 Approved by: https://github.com/ezyang, https://github.com/mikaylagawarecki	2024-04-11 10:07:18 +00:00
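The checks a meta function of this kind performs can be sketched on plain shape tuples. `channel_shuffle_meta_shape` is a hypothetical helper written for illustration, assuming the `(*, C, H, W)` layout stated in the commit message; it is not the function registered by the PR.

```python
def channel_shuffle_meta_shape(shape, groups):
    """Validate a channel_shuffle call and return the output shape.

    No data is touched: only the shape is inspected, which is the point
    of a meta function.
    """
    if len(shape) < 3:
        raise ValueError("expected input of shape (*, C, H, W)")
    if groups <= 0:
        raise ValueError("groups must be positive")
    channels = shape[-3]
    if channels % groups != 0:
        raise ValueError(
            f"number of channels ({channels}) must be divisible by groups ({groups})"
        )
    # Shuffling only permutes channels, so the output shape equals the input shape.
    return tuple(shape)


out_shape = channel_shuffle_meta_shape((2, 6, 4, 4), groups=3)

try:
    channel_shuffle_meta_shape((2, 6, 4, 4), groups=4)
    rejected = False
except ValueError:
    rejected = True
```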
chilli	84580f76d9	fix flop counter issue with out parameters (#123768 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123768 Approved by: https://github.com/zou3519	2024-04-11 09:39:53 +00:00
leslie-fang-intel	e8e9261b90	Add Matmul recipe into x86_inductor_quantizer (#122776 ) Summary Add `matmul` in the quantization recipes, noting that it's not a general recipe but tailored to meet accuracy criteria for specific models. `matmul` recipe is disabled by default. Test Plan ``` python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_attention_block ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122776 Approved by: https://github.com/jgong5, https://github.com/jerryzh168 ghstack dependencies: #122775	2024-04-11 09:32:47 +00:00
leslie-fang-intel	8798f5bf0d	Add Quantization recipe filter per operator type for x86_inductor_quantizer (#122775 ) Summary Default recipes are enabled in `X86InductorQuantizer`, and users have requested ways to customize recipes based on these defaults: - Avoid annotation propagation and restrict annotation to `conv`/`linear` only. - Add `matmul` to the quantization recipes, noting that it's not a general recipe but tailored to meet accuracy criteria for specific models. To meet these requests, this PR introduces the interfaces `set_function_type_qconfig` and `set_module_type_qconfig`. - `set_function_type_qconfig` accepts a functional input such as `torch.nn.functional.linear` or `torch.matmul`; `set_module_type_qconfig` accepts an nn.Module input such as `torch.nn.Conv2d`. - To disable the recipe for an operator, the user can exclude it from the list of operations with `quantizer.set_function_type_qconfig(op, None)`. - To modify or extend the default recipe for an operator, the user can customize it with `quantizer.set_function_type_qconfig(op, config)`. Test Plan ``` python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_filter_conv2d_recipe python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_filter_linear_recipe python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_filter_maxpool2d_recipe ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122775 Approved by: https://github.com/jgong5, https://github.com/jerryzh168	2024-04-11 09:30:31 +00:00
vfdev-5	6b7741546b	Fixed arange decomp for float dtype (#121013 ) ## Description: - [x] Fixed arange decomp for float dtype - [x] Added a test ## Current state Arange graph and C++ generated code are not optimal when arange is created directly using float32 dtype: ```python import torch def func(x): s = x.shape[-1] a = torch.arange(s, dtype=torch.float32) return s + a c_func = torch.compile(func) out = c_func(torch.rand(10)) ``` Graph on `main`: ``` ===== Forward graph 0 ===== /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module): def forward(self): # File: check_arange_decomp.py:8 in func, code: a = torch.arange(s, dtype=torch.float32) iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False) convert_element_type: "f64[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float64); iota = None mul: "f64[10]" = torch.ops.aten.mul.Tensor(convert_element_type, 1); convert_element_type = None add: "f64[10]" = torch.ops.aten.add.Tensor(mul, 0); mul = None convert_element_type_1: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32); add = None # File: check_arange_decomp.py:9 in func, code: return s + a add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type_1, 10); convert_element_type_1 = None return (add_1,) ===== AFTER POST GRAD ===== /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module): def forward(self): # File: check_arange_decomp.py:15 in func, code: a = torch.arange(s, dtype=torch.float32) iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False) convert_element_type: "f64[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float64); iota = None mul: "f64[10]" = torch.ops.aten.mul.Tensor(convert_element_type, 1); convert_element_type = None add: "f64[10]" = torch.ops.aten.add.Tensor(mul, 0); mul = 
None convert_element_type_1: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32); add = None # File: check_arange_decomp.py:16 in func, code: return s + a add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type_1, 10); convert_element_type_1 = None return (add_1,) ``` and C++ ```c++ extern "C" void kernel(float* out_ptr0) { { #pragma GCC ivdep for(long x0=static_cast<long>(0L); x0<static_cast<long>(10L); x0+=static_cast<long>(1L)) { auto tmp0 = c10::convert<long>(x0); auto tmp1 = c10::convert<double>(tmp0); // <---- useless ops auto tmp2 = static_cast<double>(1.0); // <---- auto tmp3 = decltype(tmp1)(tmp1 * tmp2); // <---- auto tmp4 = static_cast<double>(0.0); // <---- auto tmp5 = decltype(tmp3)(tmp3 + tmp4); // <---- auto tmp6 = c10::convert<float>(tmp5); auto tmp7 = static_cast<float>(10.0); auto tmp8 = decltype(tmp6)(tmp6 + tmp7); out_ptr0[static_cast<long>(x0)] = tmp8; } } } ``` However, if we manually create arange on i64 and then put to float32, generated graph and C++ code are more natural and benefit of a speed-up. 
```python import torch def func(x): s = x.shape[-1] a = torch.arange(s).to(dtype=torch.float32) return s + a c_func = torch.compile(func) out = c_func(torch.rand(10)) ``` Graph on `main`: ``` ===== Forward graph 0 ===== /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module): def forward(self): # File: check_arange_decomp.py:14 in func, code: a = torch.arange(s).to(dtype=torch.float32) iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False) convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float32); iota = None # File: check_arange_decomp.py:15 in func, code: return s + a add: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10); convert_element_type = None return (add,) ===== AFTER POST GRAD ===== /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module): def forward(self): # File: check_arange_decomp.py:21 in func, code: a = torch.arange(s).to(dtype=torch.float32) iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False) convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float32); iota = None # File: check_arange_decomp.py:22 in func, code: return s + a add: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10); convert_element_type = None return (add,) ``` C++ on `main` ```c++ extern "C" void kernel(float* out_ptr0) { { #pragma GCC ivdep for(long x0=static_cast<long>(0L); x0<static_cast<long>(10L); x0+=static_cast<long>(1L)) { auto tmp0 = c10::convert<long>(x0); auto tmp1 = c10::convert<float>(tmp0); auto tmp2 = static_cast<float>(10.0); auto tmp3 = decltype(tmp1)(tmp1 + tmp2); out_ptr0[static_cast<long>(x0)] = tmp3; } } } ``` For example, the speed-up seen on upsample_nearest2d on cpu: ``` 
[----------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cpu ----------------------------------------------------------------------------------------------------------------------------------------------] \| Eager (2.3.0a0+gitb4324ed) PR \| Compiled (2.3.0a0+gitb4324ed) PR \| Compiled (2.3.0a0+git0d1e705) Nightly \| speed-up PR vs Nightly \| Eager (2.3.0a0+git0d1e705) Nightly 1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format \| mode: nearest, align_corners: None, osize: (256, 256) \| 287.988 (+-10.399) \| 200.034 (+-8.630) \| 285.143 (+-8.412) \| 1.425 (+-0.000) \| 287.991 (+-11.302) Input (1, 3, 500, 400), torch.uint8, torch.channels_last \| mode: nearest, align_corners: None, osize: (256, 256) \| 697.206 (+-27.033) \| 171.650 (+-7.381) \| 193.280 (+-5.840) \| 1.126 (+-0.000) \| 701.642 (+-26.461) Input (1, 3, 500, 400), torch.float32, torch.contiguous_format \| mode: nearest, align_corners: None, osize: (256, 256) \| 149.149 (+-6.045) \| 222.780 (+-6.852) \| 299.968 (+-12.354) \| 1.346 (+-0.000) \| 145.055 (+-7.232) Input (1, 3, 500, 400), torch.float32, torch.channels_last \| mode: nearest, align_corners: None, osize: (256, 256) \| 596.741 (+-27.970) \| 205.923 (+-8.648) \| 233.912 (+-7.742) \| 1.136 (+-0.000) \| 598.000 (+-25.630) Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format \| mode: nearest, align_corners: None, osize: (256, 256) \| 1095.734 (+-51.658) \| 700.850 (+-24.852) \| 1044.255 (+-38.216) \| 1.490 (+-0.000) \| 1097.977 (+-35.521) Input (4, 3, 500, 400), torch.uint8, torch.channels_last \| mode: nearest, align_corners: None, 
osize: (256, 256) \| 2741.813 (+-122.917) \| 583.073 (+-16.998) \| 665.029 (+-36.331) \| 1.141 (+-0.000) \| 2722.388 (+-116.263) Input (4, 3, 500, 400), torch.float32, torch.contiguous_format \| mode: nearest, align_corners: None, osize: (256, 256) \| 578.183 (+-37.266) \| 833.295 (+-42.264) \| 1131.341 (+-54.710) \| 1.358 (+-0.000) \| 584.953 (+-45.549) Input (4, 3, 500, 400), torch.float32, torch.channels_last \| mode: nearest, align_corners: None, osize: (256, 256) \| 2332.508 (+-103.556) \| 840.194 (+-47.664) \| 935.625 (+-47.467) \| 1.114 (+-0.000) \| 2334.314 (+-91.644) Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format \| mode: nearest, align_corners: None, osize: (200, 300) \| 272.631 (+-11.348) \| 195.988 (+-5.748) \| 274.021 (+-9.475) \| 1.398 (+-0.000) \| 272.752 (+-12.716) Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last \| mode: nearest, align_corners: None, osize: (200, 300) \| 640.409 (+-25.465) \| 164.773 (+-7.372) \| 185.018 (+-8.349) \| 1.123 (+-0.000) \| 639.390 (+-30.761) Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format \| mode: nearest, align_corners: None, osize: (200, 300) \| 158.602 (+-6.593) \| 220.478 (+-6.809) \| 286.376 (+-8.981) \| 1.299 (+-0.000) \| 158.557 (+-6.143) Input (1, 3, 1200, 1300), torch.float32, torch.channels_last \| mode: nearest, align_corners: None, osize: (200, 300) \| 548.903 (+-22.889) \| 202.788 (+-9.158) \| 227.404 (+-8.995) \| 1.121 (+-0.000) \| 554.096 (+-21.330) Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format \| mode: nearest, align_corners: None, osize: (200, 300) \| 1036.061 (+-35.285) \| 680.728 (+-30.925) \| 986.254 (+-42.732) \| 1.449 (+-0.000) \| 1038.718 (+-43.070) Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last \| mode: nearest, align_corners: None, osize: (200, 300) \| 2504.520 (+-125.805) \| 550.067 (+-21.383) \| 628.000 (+-27.589) \| 1.142 (+-0.000) \| 2523.134 (+-113.336) Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format 
\| mode: nearest, align_corners: None, osize: (200, 300) \| 1058.188 (+-57.853) \| 1216.427 (+-76.160) \| 1380.231 (+-98.939) \| 1.135 (+-0.000) \| 1057.031 (+-66.075) Input (4, 3, 1200, 1300), torch.float32, torch.channels_last \| mode: nearest, align_corners: None, osize: (200, 300) \| 2305.911 (+-116.864) \| 1080.189 (+-79.934) \| 1141.561 (+-67.959) \| 1.057 (+-0.000) \| 2306.606 (+-121.544) Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format \| mode: nearest, align_corners: None, osize: (600, 700) \| 1689.489 (+-60.579) \| 1077.401 (+-44.948) \| 1634.264 (+-64.340) \| 1.517 (+-0.000) \| 1693.945 (+-67.998) Input (1, 3, 300, 400), torch.uint8, torch.channels_last \| mode: nearest, align_corners: None, osize: (600, 700) \| 4198.368 (+-179.096) \| 886.656 (+-30.355) \| 1028.568 (+-46.310) \| 1.160 (+-0.000) \| 4174.351 (+-141.020) Input (1, 3, 300, 400), torch.float32, torch.contiguous_format \| mode: nearest, align_corners: None, osize: (600, 700) \| 716.572 (+-51.954) \| 1175.864 (+-52.191) \| 1674.373 (+-51.815) \| 1.424 (+-0.000) \| 715.724 (+-41.104) Input (1, 3, 300, 400), torch.float32, torch.channels_last \| mode: nearest, align_corners: None, osize: (600, 700) \| 3604.989 (+-132.489) \| 1096.933 (+-54.290) \| 1270.347 (+-60.932) \| 1.158 (+-0.000) \| 3601.864 (+-140.218) Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format \| mode: nearest, align_corners: None, osize: (600, 700) \| 6721.610 (+-355.997) \| 4203.213 (+-134.362) \| 6423.763 (+-225.311) \| 1.528 (+-0.000) \| 6715.626 (+-288.233) Input (4, 3, 300, 400), torch.uint8, torch.channels_last \| mode: nearest, align_corners: None, osize: (600, 700) \| 16695.467 (+-709.620) \| 3460.013 (+-149.456) \| 4001.810 (+-218.093) \| 1.157 (+-0.000) \| 16621.138 (+-713.320) Input (4, 3, 300, 400), torch.float32, torch.contiguous_format \| mode: nearest, align_corners: None, osize: (600, 700) \| 3020.017 (+-147.314) \| 4743.164 (+-135.850) \| 6709.494 (+-281.025) \| 1.415 (+-0.000) \| 3015.602 
(+-105.852) Input (4, 3, 300, 400), torch.float32, torch.channels_last \| mode: nearest, align_corners: None, osize: (600, 700) \| 14456.688 (+-752.839) \| 5150.893 (+-201.571) \| 5737.315 (+-138.011) \| 1.114 (+-0.000) \| 14464.472 (+-720.027) Times are in microseconds (us). ``` ## PR This PR improves arange decomp such that `arange(s, dtype=torch.float32)` removing extra dtype conversion to double: Code: ```python import torch def func(x): s = x.shape[-1] a = torch.arange(s, dtype=torch.float32) return s + a c_func = torch.compile(func) out = c_func(torch.rand(10)) ``` Graph on this PR: ``` ===== Forward graph 0 ===== /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module): def forward(self): # File: check_arange_decomp.py:15 in func, code: a = torch.arange(s, dtype=torch.float32) iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False) mul: "i64[10]" = torch.ops.aten.mul.Tensor(iota, 1); iota = None add: "i64[10]" = torch.ops.aten.add.Tensor(mul, 0); mul = None convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32); add = None # File: check_arange_decomp.py:16 in func, code: return s + a add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10); convert_element_type = None return (add_1,) ===== AFTER POST GRAD ===== /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module): def forward(self): # File: check_arange_decomp.py:16 in func, code: a = torch.arange(s, dtype=torch.float32) iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False) mul: "i64[10]" = torch.ops.aten.mul.Tensor(iota, 1); iota = None add: "i64[10]" = torch.ops.aten.add.Tensor(mul, 0); mul = None convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32); add = None # File: check_arange_decomp.py:17 
in func, code: return s + a add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10); convert_element_type = None return (add_1,) ``` and C++ on this PR: ```c++ extern "C" void kernel(float* out_ptr0) { { #pragma GCC ivdep for(long x0=static_cast<long>(0L); x0<static_cast<long>(10L); x0+=static_cast<long>(1L)) { auto tmp0 = c10::convert<long>(x0); auto tmp1 = c10::convert<float>(tmp0); auto tmp2 = static_cast<float>(10.0); auto tmp3 = decltype(tmp1)(tmp1 + tmp2); out_ptr0[static_cast<long>(x0)] = tmp3; } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121013 Approved by: https://github.com/peterbell10	2024-04-11 09:02:31 +00:00
Michael Lazos	2ac99d539b	Only initialize state if needed in SGD (#123757 ) Fixes [T184381726](https://www.internalfb.com/intern/tasks/?t=184381726) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123757 Approved by: https://github.com/janeyx99	2024-04-11 08:56:06 +00:00
Shuqiang Zhang	e00282fecf	[c10d] make monitorThread sleep when we try to dump (#123788 ) Summary: We separated the FR dump logic from the desync debug logic, so we no longer set collectiveDebugInfoMode_ to true when we just need an FR dump. That's why the monitor thread did not sleep and tried to kill the process without waiting for the dump. The fix is simple: we should sleep whenever shouldDump_ is true. Test Plan: Existing unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123788 Approved by: https://github.com/wconstab	2024-04-11 07:10:46 +00:00
Simon Fan	a510afb885	[aot] refactor runtime_wrapper's epilogue args access (#123674 ) I want runtime_wrapper args to be stealable by call_func_at_runtime_with_args, since the args may contain activations which we don't want to hold alive in this scope. The args to runtime_wrapper should always be from a list created within aot_autograd, so it should always be safe to steal them: `a4a49f77b8/torch/_functorch/aot_autograd.py (L928-L932)` There are some accesses after we execute the compiled_fn, but those index accesses are already inferred at compile time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123674 Approved by: https://github.com/jansel, https://github.com/bdhirsh ghstack dependencies: #123630	2024-04-11 07:07:50 +00:00
Simon Fan	a8d2504eec	[aot] always pass inputs to runtime_wrapper as list and add type annotations (#123630 ) `runtime_wrapper` unpacking the arguments as a Tuple[arg] prevents them from being freed within its scope. This is problematic if Inductor wants to free those inputs, which could be activations in the compiled backwards case. This PR only changes the signature to pass the inputs as a list, but does not clear it, keeping the same refcount as before. It also adds some mypy annotations. Ideally, instead of `Any`, I would want a type that describes a single arg, which usually seems to be a Tensor or an int. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123630 Approved by: https://github.com/jansel, https://github.com/bdhirsh	2024-04-11 07:07:50 +00:00
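Why a list makes the inputs "stealable" can be shown with plain Python reference counting. In this minimal sketch, `Activation` and `consume` are hypothetical names and nothing here is the actual `runtime_wrapper` code: a callee that receives the caller's list can clear it, so the object is freed as soon as the callee drops its own reference, whereas a tuple of arguments would stay alive in the caller's frame.

```python
import gc
import weakref


class Activation:
    """Hypothetical stand-in for a large tensor we want freed early."""


def consume(args):
    # Take ownership of the inputs, then empty the caller's list so the
    # caller no longer keeps the objects alive.
    local = list(args)
    args.clear()
    count = len(local)
    del local  # last strong reference to the inputs goes away here
    return count


act = Activation()
probe = weakref.ref(act)   # lets us observe when the object is freed
inputs = [act]
del act                    # the list is now the only owner

count = consume(inputs)
gc.collect()               # not needed on CPython, harmless elsewhere
freed_after_call = probe() is None
```

After the call, `inputs` is empty and `freed_after_call` is `True`, even though the caller's `inputs` variable is still in scope.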
Sherlock Huang	c2f687f32c	Option to include stride and device annotation in gm.print_readable() (#123690 ) Summary: Sample output for gm.print_readable(include_stride=True, include_device=True) ``` getitem_21: "i32[1200][1]cuda:0" = auto_functionalized_4[1] copy_2: "f32[2, 60][60, 1]cuda:1" = .... ``` Test Plan: CI Differential Revision: D55949129 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123690 Approved by: https://github.com/Chillee	2024-04-11 06:53:10 +00:00
Edward Z. Yang	8aad72b0d3	Support all unsigned int sizes on unique (#123643 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123643 Approved by: https://github.com/albanD, https://github.com/kit1980	2024-04-11 06:50:12 +00:00
Nikita Shulga	416f532753	[AOTI] Serialize large weights (#123002 ) By appending them to the end of the shared library and mmapping them afterwards. Disabled by default, but overridable by `config.aot_inductor.force_mmap_weights`. Implemented by adding a `USE_MMAP_SELF` define to `inductor/aoti_runtime/model.h`, which is defined when weights are appended to the binary. In that case, the shared library name is determined by calling `dladdr`, the library is mmapped, and finally checked against a random magic number embedded at the end of the weights as well as in the const section of the library in question. Added unit tests to validate that it works as expected. TODO: - Extend support to CUDA - munmap region if the same library is reused Pull Request resolved: https://github.com/pytorch/pytorch/pull/123002 Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/mikekgfb	2024-04-11 06:39:58 +00:00
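The append-then-mmap idea can be sketched in Python with the standard `mmap` module. Everything below is an assumption made for illustration: the fixed `MAGIC` value (the commit describes a random one), the placeholder file contents, and the helper names. The real implementation lives in C++ and locates the library via `dladdr`.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed marker; the commit describes a random magic number.
MAGIC = struct.pack("<Q", 0x5EEDF00DCAFEBEEF)


def append_weights(path, weights):
    """Append raw weight bytes followed by the magic marker to a file."""
    with open(path, "ab") as f:
        f.write(weights)
        f.write(MAGIC)


def load_weights(path, nbytes):
    """Map the file and return the trailing weights after checking the marker."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            end = len(mm) - len(MAGIC)
            if end < nbytes or mm[end:len(mm)] != MAGIC:
                raise ValueError("magic number mismatch")
            return bytes(mm[end - nbytes:end])


with tempfile.TemporaryDirectory() as tmp:
    lib_path = os.path.join(tmp, "model.so")
    with open(lib_path, "wb") as f:
        f.write(b"placeholder for the compiled library")
    append_weights(lib_path, b"\x01\x02\x03\x04")
    restored = load_weights(lib_path, 4)
```

The marker check guards against mapping a file that never had weights appended, or whose tail was truncated.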
Nikita Shulga	7fc4b170d8	[EZ] Update mypy to 1.9.0 (#123595 ) TODO: - Add linter that keeps `requirements-ci.txt` and `.lintrunner.toml` in sync Pull Request resolved: https://github.com/pytorch/pytorch/pull/123595 Approved by: https://github.com/kit1980	2024-04-11 06:36:09 +00:00
Jiong Gong	cacc8e27a5	[inductor][cpp] refactor code to use define_kernel and call_kernel similar to CUDA (#123704 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123704 Approved by: https://github.com/jansel, https://github.com/desertfire	2024-04-11 06:34:44 +00:00
Nikita Shulga	2a597cfd2c	[EZ] Pin scipy to 1.12 for Py-3.12 (#123795 ) This caused false positive failures/reverts for https://github.com/pytorch/pytorch/pull/123689 and https://github.com/pytorch/pytorch/pull/123595 Fixes https://github.com/pytorch/pytorch/issues/123655 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123795 Approved by: https://github.com/huydhn	2024-04-11 06:32:16 +00:00
Oguz Ulgen	57a2032c7a	Delete Lark (#123689 ) Now that we are using MLIR bindings inside triton, let's delete the Lark parser. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123689 Approved by: https://github.com/jansel	2024-04-11 05:51:06 +00:00
Edward Z. Yang	bbcdd28409	Report LRU cache stats at end of program for symbolic shapes (#123724 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123724 Approved by: https://github.com/Chillee	2024-04-11 05:12:43 +00:00
Shivam Raikundalia	3ebbeb75fd	[Profiler] Make Kineto traces export ns granularity for finer timestamps (#122425 ) (#123650 ) Summary: Kineto traces use microsecond level granularity because of chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself. This diff contains profiler changes only. Libkineto changes found in D54964435. Test Plan: Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master. Zoomer: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=796886748550189 Ran key_averages() to make sure FunctionEvent code working as expected: -- ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ProfilerStep* 0.74% 3.976ms 64.40% 346.613ms 69.323ms 0.000us 0.00% 61.710ms 12.342ms 5 Optimizer.zero_grad#SGD.zero_grad 0.76% 4.109ms 0.76% 4.109ms 821.743us 0.000us 0.00% 0.000us 0.000us 5 ## forward ## 6.89% 37.057ms 27.19% 146.320ms 29.264ms 0.000us 0.00% 58.708ms 11.742ms 5 aten::conv2d 0.22% 1.176ms 7.74% 41.658ms 157.199us 0.000us 0.00% 27.550ms 103.962us 265 aten::convolution 0.79% 4.273ms 7.52% 40.482ms 152.762us 0.000us 0.00% 27.550ms 103.962us 265 aten::_convolution 0.69% 3.688ms 6.73% 36.209ms 136.637us 0.000us 0.00% 27.550ms 103.962us 265 aten::cudnn_convolution 6.04% 32.520ms 6.04% 32.520ms 122.719us 27.550ms 8.44% 27.550ms 103.962us 265 aten::add_ 2.42% 13.045ms 2.42% 13.045ms 30.694us 12.700ms 3.89% 12.700ms 29.882us 425 aten::batch_norm 0.19% 1.027ms 8.12% 43.717ms 164.971us 0.000us 0.00% 16.744ms 63.185us 265 aten::_batch_norm_impl_index 0.31% 1.646ms 7.93% 42.691ms 161.096us 0.000us 0.00% 16.744ms 63.185us 265 ------------------------------------------------------- ------------ ------------ ------------ ------------ 
------------ ------------ ------------ ------------ ------------ ------------ Differential Revision: D55925068 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123650 Approved by: https://github.com/aaronenyeshi	2024-04-11 04:29:20 +00:00
chunyuan	ec00daf4f1	[aotinductor] Fix benchmarks with self.autocast for run_performance_test (#123699 ) ## Pitch Similar to https://github.com/pytorch/pytorch/pull/110490 which fixes the `self.autocast` in the `check_accuracy` function, this PR fixes the `self.autocast` context in the `run_performance_test` function. ## Description The code inside `check_accuracy` after the fix on https://github.com/pytorch/pytorch/pull/110490: `a4a49f77b8/benchmarks/dynamo/common.py (L2490-L2500)` The current code on main branch before this PR in `run_performance_test` does not have the `self.autocast` context: `a4a49f77b8/benchmarks/dynamo/common.py (L2685-L2692)` For eager mode, the `model_iter_fn` (which is actually [forward_pass](`e8ad5460c0/benchmarks/dynamo/huggingface.py (L556-L558)`)) is used in [warmup](`e8ad5460c0/benchmarks/dynamo/common.py (L2690)`) and [speedup_experiment](`e8ad5460c0/benchmarks/dynamo/common.py (L648)`). The `forward_pass` has the `self.autocast` context thus it could run into BF16 when AMP is on. While for AOTInductor, we will call `export_aot_inductor` in both [warmup](`e8ad5460c0/benchmarks/dynamo/common.py (L2695)`) and [speedup_experiment](`e8ad5460c0/benchmarks/dynamo/common.py (L644-L646)`), which doesn't have the `autocast` context thus will always run into FP32. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123699 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-04-11 01:40:44 +00:00
Yifu Wang	902cb2c842	[multi_tensor_apply] revert the optimization introduced in #119764 (#123763 ) The optimization causes regression in some torchbench benchmarks and with some older versions of nvcc. The regression is preventable, but it might require additional template specialization which would increase the binary size. Reverting it for now to re-evaluate. Keeping the introduced tests and cuda-to-hip-mappings since these are not specific to the optimization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123763 Approved by: https://github.com/janeyx99	2024-04-11 01:39:49 +00:00
chunyuan	0d0fd80033	[AOTI] fix relocation overflow error when .data is large (#123639 ) https://github.com/pytorch/pytorch/pull/123164 removed the below code (so that constants are not readonly) to support module buffer mutation: `a9a9ce6d9c/torch/_inductor/codecache.py (L1685-L1691)` However, it may cause relocation overflow when the `.data` section is large. Below is part of the output from `ld --verbose` (`GNU ld (GNU Binutils for Ubuntu) 2.38`). `.data` is in between `.text` and `.bss`. When `.data` is too large, the relocation of `.text` against `.bss` may overflow during linking. Rename it to `.ldata` (perhaps that's why `.lrodata` was previously used instead of `.rodata`) so that it won't be in between the `.text` and `.bss` sections ``` .text .rodata .data .bss .lrodata .ldata ``` We met this issue when fixing https://github.com/pytorch/pytorch/issues/114450 and running the below models on CPU: - AlbertForMaskedLM - AlbertForQuestionAnswering - BlenderbotForCausalLM - DebertaV2ForMaskedLM - DebertaV2ForQuestionAnswering - XGLMForCausalLM Pull Request resolved: https://github.com/pytorch/pytorch/pull/123639 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-04-11 01:37:43 +00:00
Kurt Mohler	281810e307	Avoid COW materialization in backward ops (2) (#123740 ) Affected ops: * pooling ops * relu * pad * interpolate * upsample * multi_margin_loss * multilabel_margin_loss * multilabel_soft_margin_loss Part of #97856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123740 Approved by: https://github.com/ezyang ghstack dependencies: #123657	2024-04-11 01:35:38 +00:00
PHLens	793df52dc5	Avoid sync for privateuse1 backend inside Onehot. (#123621 ) Onehot skips the class value check for the cuda and mps backends, which avoids a sync; the same is needed for privateuse1. This PR adds a privateuse1 check to avoid the sync there too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123621 Approved by: https://github.com/ezyang	2024-04-11 01:19:53 +00:00
Grace Zhang	3e43dc086a	implement bmm to support nested tensor and normal tensor (#119762 ) implement bmm to support nested and normal tensor Pull Request resolved: https://github.com/pytorch/pytorch/pull/119762 Approved by: https://github.com/cyyever, https://github.com/ezyang	2024-04-11 01:10:04 +00:00
Jason Ansel	b3feb01910	[dynamo] Update co_names if needed in fix_vars (#123697 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123697 Approved by: https://github.com/williamwen42	2024-04-11 01:00:01 +00:00
Andrew Gu	c64184b097	[FSDP] Made patch functions thread safe with barrier (#123754 ) I think if we do not have barriers as added in the PR, we could have a race condition with multi-threading (e.g. MTPG). I think this mainly matters if the test function itself does not run collectives. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123754 Approved by: https://github.com/weifengpy ghstack dependencies: #122962, #123290, #123362	2024-04-11 00:59:16 +00:00
Jez Ng	4d1d71ecac	Add aoti_torch_dtype<> specializations for half and bfloat16 (#123692 ) Fixes #122989. (Note that while the missing symbol issue is fixed, the test itself is still disabled, because the test runner now segfaults on `atexit()`; but I think that issue is unrelated to the missing symbol.) In addition to defining the specializations, I also `= delete`d the default un-specialized version of `aoti_torch_dtype`, so future missing dtype references will show up as compile-time instead of link-time errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123692 Approved by: https://github.com/chenyang78	2024-04-11 00:16:05 +00:00
leslie-fang-intel	e29e990ddc	Add the VecConvert between 8bits and float (#123512 ) Summary Fix the issue https://github.com/pytorch/pytorch/issues/123448 by adding intrinsic specialization between `int8/uint8` and `float32`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123512 Approved by: https://github.com/jgong5	2024-04-11 00:15:32 +00:00
Brian Hirsh	01ab5a3104	aot_eager and aot_eager_decomp_partition: include input mutations in graph (#123646 ) In the next PR I force `set_()` input mutations to always be in the graph. It's a lot easier to do this if we make our other debugging backends allow input mutations in the graph. Input mutations are relatively hardened at this point, so I'd rather just have our debugging backends consistently allow input mutations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123646 Approved by: https://github.com/ezyang ghstack dependencies: #122433	2024-04-11 00:07:20 +00:00
Brian Hirsh	8d36354187	AOTAutograd: fix detection for mutations under no_grad when there are outstanding aliases (#122433 ) Fixes https://github.com/pytorch/pytorch/issues/122436. The problem was that even though we were detecting when mutations happen under no_grad or not, we were recording these mutations when they happened to the FunctionalTensor - we should really just be recording them on the underlying storage. In particular, what would happen is that we would mutate an alias under no_grad (marking the mutation as under no_grad properly), but if we use the base tensor outside of the no_grad region, we would lazily regenerate the base at this point, propagate the mutation to the base, and at that point mark that the base witnessed a mutation (outside of the no_grad region) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122433 Approved by: https://github.com/ezyang	2024-04-11 00:07:20 +00:00
Edward Z. Yang	57634ce74f	Don't intersect when clamping for size oblivious (#123675 ) Fixes https://github.com/pytorch/pytorch/issues/123651 Previously, when we performed a size oblivious test, we would only modify the lower bound, e.g., if we knew something had range `[0, 100]`, the size oblivious test would do `[2, 100]`. But what if your original range was `[0, 1]`? Naively intersecting this with `[2, sympy.oo]` would result in an empty set: that's a big no no. And in general, this intersection is kind of questionable: if your original range was `[0, 2]`, do we really want to assume that this quantity is exactly equal to 2 in the size oblivious test? So here's an idea: when we're doing a size oblivious test, just forget about the max bound entirely. The idea is that the max bound probably wasn't actually helping you discharge the size oblivious test (because size oblivious tests are all about "well, if we can assume thing isn't zero or one, we know what the static value is.") So you can use the max bound OR you can use the size oblivious bound, but you're not allowed to use both at the same time. (It doesn't actually seem necessary to use the max bound, but it would be easy to permit this without using the size oblivious refinement.) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123675 Approved by: https://github.com/PaulZhang12	2024-04-10 23:10:41 +00:00
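The range reasoning in the entry above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PyTorch implementation: `intersect` and `size_oblivious_range` are hypothetical helper names, and ranges are modelled as simple `(lower, upper)` tuples rather than sympy value ranges.

```python
import math

def intersect(lo1, hi1, lo2, hi2):
    """Naive closed-interval intersection; returns None when the result is empty."""
    lo, hi = max(lo1, lo2), min(hi1, hi2)
    return (lo, hi) if lo <= hi else None

def size_oblivious_range(lo, hi):
    """Hypothetical sketch of the new behaviour: raise the lower bound to 2
    and forget the max bound entirely instead of intersecting."""
    return (max(lo, 2), math.inf)

# Old approach: intersecting [0, 1] with [2, oo] yields an empty set.
print(intersect(0, 1, 2, math.inf))
# Old approach: [0, 2] collapses to exactly 2, which is a questionable assumption.
print(intersect(0, 2, 2, math.inf))
# Sketch of the new approach: the max bound is dropped, so the range is never empty.
print(size_oblivious_range(0, 1))
```

The design choice described in the commit is that a test may use either the max bound or the size oblivious lower bound, but not both at once, which is why the sketch never combines them.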
Edward Z. Yang	b36b523c05	Fix guard_size_oblivious on non-symbolic expression (#123743 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123743 Approved by: https://github.com/avikchaudhuri	2024-04-10 22:45:54 +00:00
Thanh Ha	a13cf5d396	Workaround dind-rootless bind mount permissions (#123641 ) ARC uses dind-rootless which causes bind mounts to always be mounted as the "root" user inside the container rather than the "jenkins" user as expected. We run chown to ensure that the workspace gets mapped to the jenkins user as well as a trap to ensure this change gets reverted when the script ends for any reason. This is the same workaround as in #122922 but adapted for onnx tests. Issue: pytorch/ci-infra#112 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123641 Approved by: https://github.com/jeanschmidt, https://github.com/seemethere	2024-04-10 22:44:57 +00:00
Andres Lugo-Reyes	69c6e0b851	[ROCm] Fix ROCm bug that causes numerical errors in float8_experimental (#123275 ) Recently there has been work in an experimental repo to start implementing the intrinsics necessary to handle F8 workloads. (see: https://github.com/pytorch-labs/float8_experimental) A recent PR was submitted to add support for AMD F8 types (fnuz). This PR uncovered a bug in the ROCm code that caused unit tests to fail due to numerical inaccuracy. This PR fixes that bug by swapping `abs_()` with `abs()`, as the former performs elementwise absolute value on the tensor in-place, causing the final assertion to fail due to the tensor only containing positive values. Important to note, this fix is part of a workaround as hipblasLT does not yet support amax (HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER). This functionality has been implemented internally and is going through the proper channels to propagate to the community. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123275 Approved by: https://github.com/drisspg, https://github.com/jeffdaily	2024-04-10 21:52:02 +00:00
PyTorch MergeBot	6b18daf205	Revert "Delete Lark (#123689 )" This reverts commit a631461eef7317efccf981989c5cf5c5b486ab0a. Reverted https://github.com/pytorch/pytorch/pull/123689 on behalf of https://github.com/PaliC due to This PR seems to be breaking test_binary_ufuncs.py ([comment](https://github.com/pytorch/pytorch/pull/123689#issuecomment-2048489549))	2024-04-10 21:48:04 +00:00
Kurt Mohler	49d5553f5a	Avoid COW materialization in backward ops (1) (#123657 ) Affected ops: * cdist * sparse.sampled_addmm * sparse.mm * cross_entropy * norm ops Part of #97856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123657 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2024-04-10 21:07:07 +00:00
William Wen	4bee4c7c25	[3.12] enable inductor unittests (#123654 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123654 Approved by: https://github.com/jansel	2024-04-10 20:51:43 +00:00
Edward Z. Yang	f688d7a2f7	Only suggest debug envvar when debug is on (#123647 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123647 Approved by: https://github.com/Chillee	2024-04-10 20:41:39 +00:00
Sam Larsen	cda383e7bc	[inductor] Fix fresh_inductor_cache() (#122661 ) Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts. Test Plan: - New unit test - All existing inductor tests will exercise fresh_inductor_cache() Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661 Approved by: https://github.com/oulgen	2024-04-10 20:38:56 +00:00
PyTorch MergeBot	cf8139b956	Revert "Fix derived dim bugs in ep.run_decomp (#123326 )" This reverts commit 43228742820d8045a3980826f3ef85158dc9032c. Reverted https://github.com/pytorch/pytorch/pull/123326 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/123326#issuecomment-2048389042))	2024-04-10 20:35:01 +00:00
David Yan	63c24f73ef	Upsample2d backwards to int64_t (#123682 ) Summary: To unblock training where upsamplenearest2d involves input or output tensors which are larger than 2^31. Comes up frequently in image & video applications. Test Plan: ``` buck2 test mode/opt //caffe2/test:test_nn_cuda -- test_upsamplingnearest2d_backward_64bit_indexing ``` Benchmarking (N5207417): ``` device_ms, cpu_ms, gb/device_ms*1000 # before changes 118.03993721008301 124.09385920000001 98.72685525972494 # after changes 118.05780944824218 124.10893509999994 98.71190944734577 ``` Differential Revision: D55625666 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123682 Approved by: https://github.com/ezyang	2024-04-10 20:26:08 +00:00
Oguz Ulgen	a631461eef	Delete Lark (#123689 ) Now that we are using MLIR bindings inside triton, let's delete the Lark parser. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123689 Approved by: https://github.com/jansel	2024-04-10 19:41:54 +00:00
PyTorch MergeBot	8d9af8b91c	Revert "[Quant][PT2E] Enable linear-binary(-unary) post-op recipe for X86Inductor quantizer (#122387 )" This reverts commit 82e0153487c2cd1abc92598963be5b57ab1948d4. Reverted https://github.com/pytorch/pytorch/pull/122387 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122387#issuecomment-2048294643))	2024-04-10 19:34:26 +00:00
Jason Ansel	30c4efe6d2	Update preferred-citation in CITATION.cff (#123575 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123575 Approved by: https://github.com/ezyang	2024-04-10 19:01:38 +00:00
Aaron Orenstein	2bcc83dfbd	Preserve dispatch state across function tracing (#122073 ) If we throw an exception in the "wrong" place we can end up with the dispatch state being in a weird state which can cause all future dispatching to fail. Preserve and restore it as part of `preserve_global_state` so we know it's sane after that. Also fake_tensor's in_kernel_invocation_manager() was leaving a bit set in the dispatcher (DispatchKey.Dense) which affected follow-on code. Fixed that to reset after as well. Repro: before: ``` $ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64 $ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64' ======== 1 passed, 6173 deselected in 5.21s ============= $ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64' ========= 1 skipped, 6172 deselected, 1 error in 5.29s ========= ``` (note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes on its own but failed when including the skipped test_export.py tests) after: ``` $ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64 $ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64' ===================== 1 passed, 6173 deselected in 5.42s ===================== $ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64' ===================== 1 passed, 1 skipped, 6172 deselected in 7.30s ====================== ``` (note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes in both runs) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122073 Approved by: https://github.com/zou3519	2024-04-10 18:57:01 +00:00
PyTorch MergeBot	a65e9a06f0	Revert "[AOTI] Serialize large weights (#123002 )" This reverts commit 27eb5daee494c42425392a327feff7b3e78c342c. Reverted https://github.com/pytorch/pytorch/pull/123002 on behalf of https://github.com/DanilBaibak due to There is conflict to land the diff internally ([comment](https://github.com/pytorch/pytorch/pull/123002#issuecomment-2048215990))	2024-04-10 18:54:31 +00:00
Tugsbayasgalan Manlaibaatar	4322874282	Fix derived dim bugs in ep.run_decomp (#123326 ) Differential Revision: [D55730289](https://our.internmc.facebook.com/intern/diff/D55730289) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123326 Approved by: https://github.com/avikchaudhuri	2024-04-10 18:54:03 +00:00
Huy Do	cd3c1132a9	Create a mock benchmark results for torchao cudagraphs_low_precision (#123419 ) I copy the results of `cudagraph` backend as the mock data. The new backend is called `cudagraphs_low_precision` and the new dtype is `quant`. One gotcha thing is that GitHub has an upper limit of 10 inputs for a workflow dispatch. Before we can sort it out, I let torchao `cudagraphs_low_precision` mock data generated as part of cudagraphs. If this works out ok, I'll need to work on another PR on test-infra to add the new `quant` dtype on the dashboard. ### Testing Manually dispatch one round of training and inference benchmark with cudagraphs https://github.com/pytorch/test-infra/pull/5066, this should start populating the mock data into Rockset. The dashboard change is at https://github.com/pytorch/test-infra/pull/5066. The mock data shows on the preview at https://torchci-git-fork-huydhn-add-torchao-inducto-132c5a-fbopensource.vercel.app/benchmark/compilers?startTime=Fri%2C%2029%20Mar%202024%2020%3A00%3A48%20GMT&stopTime=Fri%2C%2005%20Apr%202024%2020%3A00%3A48%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=quant&lBranch=torchao-benchmark-template&lCommit=bc2ef535b412f84a9d071727fa6f0628b231fbd2&rBranch=torchao-benchmark-template&rCommit=bc2ef535b412f84a9d071727fa6f0628b231fbd2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123419 Approved by: https://github.com/ZainRizvi, https://github.com/desertfire	2024-04-10 18:51:21 +00:00
statelesshz	c3de2cc154	Enable UFMT on test/test_foreach.py (#123718 ) Part of https://github.com/pytorch/pytorch/issues/123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123718 Approved by: https://github.com/ezyang	2024-04-10 18:22:12 +00:00
Shivam Raikundalia	c9c099b271	Add kwargs to RecordFunctionFast (#123600 ) Differential Revision: [D55897888](https://our.internmc.facebook.com/intern/diff/D55897888/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123600 Approved by: https://github.com/davidberard98	2024-04-10 18:17:50 +00:00
Edward Z. Yang	26a9b05bce	Set stacklevel on checkpoint warning (#123717 ) Partially addresses https://github.com/pytorch/pytorch/issues/123626 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123717 Approved by: https://github.com/Skylion007	2024-04-10 17:25:06 +00:00
Jeff Daily	66e61af467	[ROCm][CI] skip float16 for TestTemplatedSDPA (#123668 ) Fixes #123531 and #123610 by replacing DISABLED issues with code skips. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123668 Approved by: https://github.com/atalman	2024-04-10 17:09:22 +00:00
FFFrog	49be96efe8	Instantiate VaryingShape<c10::Stride> (#123542 ) Fixes #123248 As the ISSUE stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123542 Approved by: https://github.com/ezyang	2024-04-10 16:35:59 +00:00
DanilBaibak	7a925c2657	Migrate linux-focal-py3_8-clang10-onnx-build to ARC (#123435 ) Migrate linux-focal-py3_8-clang10-onnx-build to ARC Pull Request resolved: https://github.com/pytorch/pytorch/pull/123435 Approved by: https://github.com/zxiiro, https://github.com/atalman	2024-04-10 16:18:47 +00:00
PyTorch MergeBot	d017645dc7	Revert "Support all unsigned int sizes on unique (#123643 )" This reverts commit 8aa08b8b9d1fab2a13dc5fbda74c553cb2a08729. Reverted https://github.com/pytorch/pytorch/pull/123643 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing lots of jobs with the new dtype `8aa08b8b9d` ([comment](https://github.com/pytorch/pytorch/pull/123643#issuecomment-2047905094))	2024-04-10 15:49:40 +00:00
FFFrog	5c1bde99c0	Fix the uncorrect return value of Tensor.numpy() (#123538 ) Fixes #123494 As the ISSUE stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123538 Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007	2024-04-10 14:47:24 +00:00
Xuehai Pan	bdee35a870	[BE] rewrite logical-`and` expression to if-statement (#123638 ) ```diff - push and self.push(value) + if push: + self.push(value) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123638 Approved by: https://github.com/ezyang	2024-04-10 14:34:17 +00:00
Edward Z. Yang	8aa08b8b9d	Support all unsigned int sizes on unique (#123643 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123643 Approved by: https://github.com/albanD, https://github.com/kit1980	2024-04-10 11:46:10 +00:00
William Wen	4a4baff0f3	[dynamo, 3.12] force LOAD_SUPER_ATTR second bit on (#123686 ) This was pretty painful to find haha Pull Request resolved: https://github.com/pytorch/pytorch/pull/123686 Approved by: https://github.com/jansel	2024-04-10 10:31:46 +00:00
willfengg	d60135e915	[FSDP1] fix _same_storage check for DTensor (#123617 ) for FSDP (SHARD_GRAD_OP + use_orig_params) + TP, params in the backward are DTensors. However, ``DTensor.untyped_storage().data_ptr()`` does not work in ``_same_storage``. Thus desugar to ``DTensor._local_tensor.untyped_storage().data_ptr()`` https://github.com/pytorch/pytorch/issues/123272 credit to @bigning for the original fix. after landing, we would not need patching in mosaic composer https://github.com/mosaicml/composer/pull/3175/files Pull Request resolved: https://github.com/pytorch/pytorch/pull/123617 Approved by: https://github.com/awgu	2024-04-10 10:26:12 +00:00
wz337	37fd547518	[DeviceMesh] Make dtype of mesh tensor from `init_device_mesh()` consistent with directly calling `DeviceMesh()` (#123677 ) Currently, mesh tensor from `init_device_mesh()` has a dtype of `torch.int64` while mesh tensor from `DeviceMesh()` would have dtype of `torch.int32`. Making them consistent in this PR. DeviceMesh ctor dtype pointer: https://github.com/pytorch/pytorch/blob/main/torch/distributed/device_mesh.py#L217 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123677 Approved by: https://github.com/xunnanxu, https://github.com/wanchaol	2024-04-10 09:14:34 +00:00
Animesh Jain	1346ebf12e	[dynamo][guards] Delay DUPLICATE_INPUT guard because of incorrect ordering (#123605 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123605 Approved by: https://github.com/jansel ghstack dependencies: #123606	2024-04-10 07:30:02 +00:00
Animesh Jain	1dc4e1e335	[dynamo][logs] Bug fix (#123606 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123606 Approved by: https://github.com/jansel, https://github.com/ezyang	2024-04-10 07:30:02 +00:00
chunyuan	cdc47ad991	fix amp for AOTInductor (#122883 ) ## Pitch This PR disables the amp when calling the inference_compiler in AOTInductor path (after having exported the model graph), following the way we disable AMP in Inductor path in https://github.com/pytorch/pytorch/pull/86515. ## Description When testing AOTInductor AMP accuracy on CPU using the dynamo benchmark suites, multiple workloads will fail in this assertion: [assert pattern_repr not in _seen_patterns](`1d52c2d985/torch/_inductor/pattern_matcher.py (L1095)`) which is called when registering SDPA patterns. The `inference_compiler` ([fw_compiler_base](`1d52c2d985/torch/_inductor/compile_fx.py (L1234)`)) will call into [_recursive_joint_graph_passe](`1d52c2d985/torch/_inductor/compile_fx.py (L1241)`) and then [_sfdp_init](`1d52c2d985/torch/_inductor/fx_passes/fuse_attention.py (L847)`). When testing accuracy, we'll set [inductor_config.fallback_random = True](`1d52c2d985/benchmarks/dynamo/common.py (L3526)`), which will make the `search_fn` to be `None` [here](`1d52c2d985/torch/_inductor/fx_passes/serialized_patterns/central_index.py (L117-L118)`), thus the pattern will be generated runtime [here](`1d52c2d985/torch/_inductor/pattern_matcher.py (L1083-L1084)`). When AMP is on, the generated pattern for SDPA FP32 will be the same as that of FP16, which makes the assertion fail. Inductor path disables amp inside [aot_dispatch_base](`1d52c2d985/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L124-L128)`). We follow the same way to disable for AOTInductor path here (after having exported the model graph) to fix this issue. ## UT For the added UT, there's one case `python test/inductor/test_aot_inductor.py -k test_amp_fallback_random_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface` fails with the below error which is not caused by this PR itself. Marked it as skipped for now. ``` RuntimeError: Error in dlopen: /tmp/torchinductor_user/cf5vk3gqkbvud56qeotdxqvns4wbk3sjnlnuadolt7b6g7a6kspb/cfzjo5ackvrth2gp6oq4lfpdyfafoagodfpjvbzhsi2u64hza2vn.so: undefined symbol: _Z16aoti_torch_dtypeIN3c108BFloat16EEiv ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122883 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-04-10 06:20:17 +00:00
FFFrog	fe4d1aff05	UFMT formatting on test/export (#123520 ) Partially addresses https://github.com/pytorch/pytorch/issues/123062 Ran lintrunner on: test/export Detail: ```Shell $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123520 Approved by: https://github.com/ezyang	2024-04-10 05:38:42 +00:00
Weizhuo Zhang	e8ad5460c0	Fix skip logic bug in dynamo benchmark runner (#123544 ) Fix the issue where huggingface and timms_model did not use the TorchBenchmarksRunner class. ![image](https://github.com/pytorch/pytorch/assets/84730719/358eed37-4d70-4034-85f9-58a922b5c532) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123544 Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/desertfire	2024-04-10 05:14:31 +00:00
Gufan Yin	65710d95c9	Fix example in torch.distributed.new_subgroups docstring (#123492 ) Summary: As title Test Plan: Run the example locally Reviewed By: zhaojuanmao Differential Revision: D55617871 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123492 Approved by: https://github.com/wconstab, https://github.com/wz337	2024-04-10 03:33:07 +00:00
statelesshz	713f065c8d	Enable UFMT on test/test_dispatch (#123644 ) Part of https://github.com/pytorch/pytorch/issues/123062 Ran lintrunner on: test/test_dispatch.py Detail: ``` $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123644 Approved by: https://github.com/ezyang	2024-04-10 03:09:38 +00:00
eqy	e15ae63a42	[cuBLAS][cuBLASLt] Remove CUDA 11 heuristics for dispatching to cuBLASLt (#119939 ) Revisiting an old workaround to see if things have improved since then... Pull Request resolved: https://github.com/pytorch/pytorch/pull/119939 Approved by: https://github.com/atalman	2024-04-10 03:01:59 +00:00
mantaionut	247646333e	Fix py opcode (#118977 ) Added a C file that includes the symbols _PyOpcode_Deopt and _PyOpcode_Caches since they are not available in the python lib but they are available on Linux in order to fix linking issues in Windows in python 3.11. Fixes #93854 Test by running on python 3.11 `python test/functorch/test_dims.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118977 Approved by: https://github.com/ezyang	2024-04-10 02:39:17 +00:00
Yu, Guangye	b7f898c4a6	Generalize host allocator to be device-agnostic (#123079 ) # Motivation According to [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we would like to generalize device and host allocator to be device-agnostic. We prioritize the host allocator as it is simpler and more native than the device allocator. In this PR, we intend to refactor the host allocator to make it be shared across different backends. In 2nd PR, we will support host allocator on XPU backend. # Design The previous design: - `CUDAHostAllocatorWrapper` inherits from `c10::Allocator`, and `CUDAHostAllocator` is an implementation of `CUDAHostAllocatorWrapper`. The design in this PR: - `CachingHostAllocatorImpl` is an interface that implements the caching host allocator logic that can be sharable across each backend. - `CachingHostAllocatorInterface` inherits from `c10::Allocator` as an interface and accepts `CachingHostAllocatorImpl` as its implementation. - `CUDACachingHostAllocator` is a CUDA host allocator whose implementation is `CUDACachingHostAllocatorImpl` which is specialized from `CachingHostAllocatorImpl`. This design can - share most code of caching mechanism across different backends, and - keep the flexibility to expand its exclusive feature on each backend. # Additional Context In addition, we will continue to generalize the device allocator in the next stage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123079 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/albanD, https://github.com/gujinghui	2024-04-10 02:38:07 +00:00
Xia, Weiwen	82e0153487	[Quant][PT2E] Enable linear-binary(-unary) post-op recipe for X86Inductor quantizer (#122387 ) As the title Test plan python test/test_quantization.py -k test_linear_binary Pull Request resolved: https://github.com/pytorch/pytorch/pull/122387 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5	2024-04-10 01:34:14 +00:00
Will Feng	69c7bd4587	[Compile FSDP2][3/n] Check all_gather_work is distributed_c10d.Work before calling .wait() (#123491 ) In FSDP2, we have this: ```python if all_gather_work is not None: # async op all_gather_work.wait() ``` In eager, there are only two possible values for `all_gather_work`: 1. `distributed_c10d.Work` object (when `async_op=True`) 2. `None` (when `async_op=False`) So the existing `if` statement is sufficient for eager mode. In compile, there is one additional possible value for `all_gather_work` which is `FakeTensor` object (not None), because we return regular tensor for collective call in compile mode. If we use the existing `if` statement as-is, we will always call `.wait()` on `all_gather_work`, which is not the same semantics as eager. There are a few ways to fix this: Option 1: Properly support `distributed_c10d.Work` in Dynamo. This is the best long-term fix but it will take much more time to make it work. Option 2: Allow calling `.wait()` on FakeTensor in compile mode (and just return None there) - this seems hacky because FakeTensor wouldn't normally have this method. Option 3: Check whether `all_gather_work` is `distributed_c10d.Work` before calling `.wait()` on it. <-- This PR Option 3 is chosen in this PR because it seems to also make the eager program semantics clearer (we don't need to think about whether `all_gather_work` can be `.wait()` on in all scenarios, as long as we know `distributed_c10d.Work` can be waited on). Pull Request resolved: https://github.com/pytorch/pytorch/pull/123491 Approved by: https://github.com/awgu	2024-04-10 01:23:54 +00:00
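Option 3 from the entry above can be sketched without any PyTorch dependency. This is a minimal illustration, not the FSDP2 code: `Work` and `FakeTensorStandIn` are hypothetical stand-ins for `distributed_c10d.Work` and the `FakeTensor` seen under compile, and `maybe_wait` is an assumed helper name.

```python
class Work:
    """Hypothetical stand-in for distributed_c10d.Work (async collective handle)."""

    def __init__(self):
        self.waited = False

    def wait(self):
        self.waited = True


class FakeTensorStandIn:
    """Hypothetical stand-in for the FakeTensor returned under compile; it has no wait()."""


def maybe_wait(all_gather_work):
    """Wait only when the handle really is a Work object.

    A plain `is not None` check would also try to wait on the FakeTensor
    stand-in, which is the behaviour the commit avoids.
    """
    if isinstance(all_gather_work, Work):
        all_gather_work.wait()
        return True
    return False


work = Work()
print(maybe_wait(work), work.waited)    # eager, async_op=True: wait is called
print(maybe_wait(None))                 # eager, async_op=False: nothing to wait on
print(maybe_wait(FakeTensorStandIn()))  # compile: not a Work, so wait is skipped
```

The type check keeps eager semantics unchanged for the two eager cases while making the compile case skip the call explicitly.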
Aaron Enye Shi	7bcac56140	Update Kineto Submodule in PyTorch (#123565 ) Summary: Update the Kineto Submodule in PyTorch third_party/kineto. Test Plan: GitHub CI Differential Revision: D55875347 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/123565 Approved by: https://github.com/sraikund16	2024-04-10 01:07:42 +00:00
Matěj Kripner	bd59e1113d	Improve docstring for tensorboard add_embedding() (#120408 ) Fixes missing parameter documentation (`metadata_header`). Fixes a typo. Adds a note explaining a somewhat confusing behavior of Tensorboard Projector where categorical values with more than 50 unique values are not permitted to be used for coloring. This was not documented anywhere. The confusion caused https://github.com/tensorflow/tensorboard/issues/61. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120408 Approved by: https://github.com/albanD	2024-04-10 00:32:29 +00:00
Jiong Gong	0288fa7cae	[inductor][cpp] expose config options via env vars (#123519 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123519 Approved by: https://github.com/leslie-fang-intel, https://github.com/desertfire	2024-04-10 00:11:32 +00:00
PyTorch MergeBot	786c6db519	Revert "UFMT formatting on test/export (#123520 )" This reverts commit ec7551d1b783e284cedddeb9aeabb285e653c480. Reverted https://github.com/pytorch/pytorch/pull/123520 on behalf of https://github.com/PaliC due to lint is still broken ([comment](https://github.com/pytorch/pytorch/pull/123520#issuecomment-2046223260))	2024-04-10 00:06:30 +00:00
eqy	624e58f2c6	[CUDA] Update `size_1` conv tests with TF32 thresholds (#118022 ) Seeing some numerical mismatches on A100 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118022 Approved by: https://github.com/atalman	2024-04-09 23:49:40 +00:00
RoboSchmied	af27bc443b	fix typo in 4 files (#123529 ) fix typo: `information` has no plural. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123529 Approved by: https://github.com/albanD	2024-04-09 23:37:35 +00:00
Jorge Pineda	8a566161cd	[ATen-VK] Remove duplicate function from Resource.cpp (#123659 ) Summary: We want to bundle both ATen-VK and ET-VK in one library. There's a lot of copied code between the two libraries and most of it's fine since it is guarded by different namespaces. This function is the one exception and so we delete it from ATen-VK. ``` Action failed: fbsource//xplat/wearable/wrist/ml:wristmlcore (cxx_link libwristmlcore.so) Local command returned non-zero exit code 1 Reproduce locally: `env -- 'BUCK_SCRATCH_PATH=buck-out/v2/tmp/fbsource/c81367d319075390/xplat/wearable/wrist/ml/__wristm ...<omitted>... /fbsource/c81367d319075390/xplat/wearable/wrist/ml/__wristmlcore__/libwristmlcore.so.linker.argsfile (run `buck2 log what-failed` to get the full command)` stdout: stderr: ld.lld: error: duplicate symbol: operator<<(std::__ndk1::basic_ostream<char, std::__ndk1::char_traits<char>>&, VmaTotalStatistics) >>> defined at Resource.cpp:6 (xplat/caffe2/aten/src/ATen/native/vulkan/api/Resource.cpp:6) >>> Resource.cpp.pic.o:(operator<<(std::__ndk1::basic_ostream<char, std::__ndk1::char_traits<char>>&, VmaTotalStatistics)) in archive buck-out/v2/gen/fbsource/c81367d319075390/xplat/caffe2/__torch_vulkan_api__/libtorch_vulkan_api.pic.a >>> defined at Resource.cpp:14 (xplat/executorch/backends/vulkan/runtime/api/Resource.cpp:14) >>> Resource.cpp.pic.o:(.text._ZlsRNSt6__ndk113basic_ostreamIcNS_11char_traitsIcEEEE18VmaTotalStatistics+0x1) in archive buck-out/v2/gen/fbsource/c81367d319075390/xplat/executorch/backends/vulkan/__vulkan_compute_api__/libvulkan_compute_api.pic.a clang: error: linker command failed with exit code 1 (use -v to see invocation) Buck UI: https://www.internalfb.com/buck2/fc1cf878-690d-48ab-acdb-ece2c48dab42 Network: Up: 43MiB Down: 391MiB (reSessionID-830cf8b1-c9c8-474a-b8ed-45c37fceb21b) Jobs completed: 9227. Time elapsed: 1:31.3s. Cache hits: 22%. Commands: 5665 (cached: 1261, remote: 4002, local: 402) BUILD FAILED Failed to build 'fbsource//xplat/wearable/wrist/ml:wristmlcore (ovr_config//platform/android:arm32-clang-r21e-api29-opt-malibu#c81367d319075390)' ``` Test Plan: ``` LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin ``` Reviewed By: copyrightly Differential Revision: D55926906 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123659 Approved by: https://github.com/SS-JIA	2024-04-09 23:28:16 +00:00
FFFrog	ec7551d1b7	UFMT formatting on test/export (#123520 ) Partially addresses https://github.com/pytorch/pytorch/issues/123062 Ran lintrunner on: test/export Detail: ```Shell $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123520 Approved by: https://github.com/shink, https://github.com/ezyang	2024-04-09 23:24:13 +00:00
Adnan Akhundov	c773913407	Add torch.while_loop support to AOT Inductor (#123586 ) Summary: Previously, `torch.while_loop` was supported only in JIT inductor (added in https://github.com/pytorch/pytorch/pull/122069). Here we extend the support to AOT Inductor. Test Plan: ``` $ python test/inductor/test_aot_inductor.py -k test_while_loop ... ---------------------------------------------------------------------- Ran 24 tests in 129.236s OK (skipped=8) $ python test/inductor/test_control_flow.py ... ---------------------------------------------------------------------- Ran 50 tests in 136.199s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123586 Approved by: https://github.com/jansel, https://github.com/chenyang78	2024-04-09 22:53:10 +00:00
Kurt Mohler	3908ebca86	Test COW materialization in backward ops (#123593 ) Part of #97856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123593 Approved by: https://github.com/ezyang	2024-04-09 22:31:50 +00:00
angelayi	298171df5c	[benchmark] Add namedtuple pytree serialization (#123648 ) Fixes https://github.com/pytorch/pytorch/pull/123388#issuecomment-2045289729 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123648 Approved by: https://github.com/desertfire	2024-04-09 22:25:36 +00:00
Edward Z. Yang	60d7fbe89a	Register matmul out variant so it is used (#122979 ) Fixes https://github.com/pytorch/pytorch/issues/122774 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122979 Approved by: https://github.com/Chillee, https://github.com/Skylion007	2024-04-09 22:21:37 +00:00
Nikita Shulga	27eb5daee4	[AOTI] Serialize large weights (#123002 ) By appending them to the end of the shared library and mmaping afterwards. Disabled by default, but overridable by `config._force_mmap_aoti_weights`. Implemented by adding a `USE_MMAP_SELF` define to `inductor/aoti_runtime/model.h`, which is defined when weights are appended to the binary. In that case, the shared library name is determined by calling `dladdr`, mmaped, and finally checked against a random magic number embedded at the end of the weights as well as in the const section of the library in question. Added unit tests to validate that it works as expected. TODO: - Extend support to CUDA - munmap region if the same library is reused Co-authored-by: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123002 Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/mikekgfb	2024-04-09 22:18:57 +00:00
Isuru Fernando	d6fb1da806	Fix doc example of masked_scatter (#123664 ) mask has to be a bool tensor Pull Request resolved: https://github.com/pytorch/pytorch/pull/123664 Approved by: https://github.com/peterbell10, https://github.com/albanD	2024-04-09 22:15:12 +00:00
Jane Xu	adcfc2b582	Add meta reg for addcdiv/addcmul ScalarList (#123486 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123486 Approved by: https://github.com/awgu	2024-04-09 22:05:58 +00:00
Angela Yi	b287dbbc24	[export] Fix naming if state dict contains colons (#123601 ) Test Plan: buck2 run mode/opt //aps_models/pyper/ads:train\[inplace\] +training.ir_serializer=on_disk https://www.internalfb.com/intern/everpaste/?handle=GICWmAB0g_Z1StMCAMxuhJI6U9pHbsIXAAAz Reviewed By: tugsbayasgalan Differential Revision: D55894742 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123601 Approved by: https://github.com/pianpwk	2024-04-09 21:25:08 +00:00
brothergomez	a96e4ad0d1	[Inductor] Pass device interface to the worker compile (#122492 ) Summary: In `codecache.py` pass the device_interface directly to `_worker_compile()` instead of calling `get_device_interface()` from inside the function. If the device_interface is registered by an out-of-tree module then it will only be registered inside the main process and not inside the worker process. This fixes this issue. Happy to add a test if required. Test plan: No tests added Co-authored-by: brothergomez <brothergomez@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122492 Approved by: https://github.com/ezyang	2024-04-09 21:23:33 +00:00
Jithun Nair	f7cdc1b9bb	Add test_aot_inductor to test_inductor (#123340 ) AOTI changes have been breaking for ROCm on trunk because we do not have testing of AOTI in inductor/pull/trunk workflow for ROCm. This PR adds `test_aot_inductor` to inductor workflow to catch such issues. More context here: https://github.com/pytorch/pytorch/pull/123164#issuecomment-2033494012 Runtime increase for inductor workflow: CUDA: PR corresponding to base commit used for this PR: [100 mins](https://github.com/pytorch/pytorch/actions/runs/8545475047/job/23415210028?pr=123290) This PR: [183 mins](https://github.com/pytorch/pytorch/actions/runs/8562003098/job/23465530389?pr=123340) ROCM: PR corresponding to base commit used for this PR: [105 mins](https://github.com/pytorch/pytorch/actions/runs/8545475047/job/23416422145?pr=123290) This PR: [148 mins](https://github.com/pytorch/pytorch/actions/runs/8562003098/job/23466516866?pr=123340) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123340 Approved by: https://github.com/atalman, https://github.com/desertfire	2024-04-09 21:22:11 +00:00
Michael Lazos	bff321716c	Remove special handling of step with closure (#123620 ) Implements https://github.com/pytorch/pytorch/issues/123479 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123620 Approved by: https://github.com/anijain2305 ghstack dependencies: #123496, #123497, #123551, #123552, #123618	2024-04-09 21:15:24 +00:00
Eddie Yan	3db618d656	[CUDA] Use 64-bit indexing in `CUDA_KERNEL_LOOP` in `im2col` (#118005 ) #117736 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118005 Approved by: https://github.com/atalman	2024-04-09 21:04:20 +00:00
PyTorch MergeBot	b3eb1b2f74	Revert "fix amp for AOTInductor (#122883 )" This reverts commit a4a49f77b8c45ea459263c2242ab391b3d0577f2. Reverted https://github.com/pytorch/pytorch/pull/122883 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122883#issuecomment-2046026363))	2024-04-09 20:51:53 +00:00
Tailing Yuan	041be901b3	fix ctc_loss zero-length/neg-length corner cases (#123193 ) Fixes #84827, fixes #86596, fixes #88047, fixes #89208. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123193 Approved by: https://github.com/mikaylagawarecki	2024-04-09 20:39:39 +00:00
Aidyn-A	a6080f79e9	[Build] Add linker script optimization (#121975 ) This PR adds a linker script optimization based on prioritized symbols that can be extracted from the profiles of popular workloads. The present linker script was generated to target ARM+CUDA and can be extended later if necessary. The reason we target ARM is shown below: > PyTorch and other applications that access more than 24x 2MB code regions in quick succession can result in performance bottlenecks in the CPU front-end. The link-time optimization improves executable code locality and improves performance. We recommend always turning on the optimization for PyTorch and other applications that behave similarly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121975 Approved by: https://github.com/ptrblck, https://github.com/atalman	2024-04-09 20:22:25 +00:00
Jez Ng	178ce1433c	Hoist out auxiliary values in optional-typed arguments (#123613 ) This fixes #123176, and partially addresses #121814 too. #123176 uses an optional device arg while #121814 uses an optional list arg. For optional arguments that have auxiliary info -- specifically, tuples / lists with their length parameter, and device types with their device index -- we need to hoist out the extra argument. E.g. when passing a device with ID 1, we want to emit ``` auto var_0 = cached_torch_device_type_cpu; aoti_torch_foo(..., &var_0, 1); ``` instead of the (syntactically incorrect) ``` auto var_0 = cached_torch_device_type_cpu,1; aoti_torch_foo(..., &var_0); ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123613 Approved by: https://github.com/desertfire	2024-04-09 20:17:35 +00:00
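The hoisting described in the commit above can be sketched as a tiny code generator. The helper names `emit_optional_device_arg` and `emit_call` are hypothetical and are not Inductor's wrapper codegen API; only the emitted C++ text is taken from the commit message.

```python
def emit_optional_device_arg(var_name, device_type, device_index):
    """Return (declaration, call_args) for an optional device argument.

    The device type goes into a local variable passed by pointer, while the
    device index (the auxiliary value) is hoisted out as a separate argument.
    """
    declaration = f"auto {var_name} = cached_torch_device_type_{device_type};"
    call_args = [f"&{var_name}", str(device_index)]
    return declaration, call_args


def emit_call(func_name, leading_args, var_name, device_type, device_index):
    """Emit the declaration followed by the call that uses the hoisted index."""
    declaration, call_args = emit_optional_device_arg(
        var_name, device_type, device_index
    )
    args = ", ".join(list(leading_args) + call_args)
    return f"{declaration} {func_name}({args});"
```

Calling `emit_call("aoti_torch_foo", ["..."], "var_0", "cpu", 1)` reproduces the well-formed snippet quoted in the commit message.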
Yuzhen Huang	1970a802b3	Only print bw result for the first time we benchmark a kernel (#123568 ) Summary: As title. Before this change, we printed the cached benchmark result every time a kernel was called. The information is the same each time, so print it only on the first iteration. Test Plan: Local test. Differential Revision: D55878382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123568 Approved by: https://github.com/jackiexu1992	2024-04-09 19:53:57 +00:00
Aaron Orenstein	5712c326a5	Teach pattern_matcher to use a pre-traced pattern if given (#121314 ) The check_fn portion of pattern_matcher was retracing the pattern even if a pre-traced pattern was provided. I think that as long as the patterns don't have control flow based on their inputs then this should be safe. For this benchmark ``` python benchmarks/dynamo/huggingface.py --training --amp --performance --only MobileBertForQuestionAnswering --backend=inductor ``` this improves the performance of `joint_graph_passes` from about 9s down to 3s. In the performance dashboard it seems to be a small win - most of the compilation times dropped by a couple seconds: Torchbench 126s -> 124s Huggingface 114s -> 110s TIMM models 209s -> 208s Dynamic 44s -> 43s Blueberries 84s -> 81s Pull Request resolved: https://github.com/pytorch/pytorch/pull/121314 Approved by: https://github.com/eellison ghstack dependencies: #121313	2024-04-09 19:42:19 +00:00
Aaron Orenstein	4044e93a51	Add mm_pattern and bmm_pattern to serialized_patterns (#121313 ) Make it easier to serialize patterns by adding `pattern_matcher.gen_register_replacement()` which is like `pattern_matcher.register_replacement()` but also requires the replacement to be precompiled. To precompile patterns (and save to disk) run: ``` torchgen/fuse_attention_patterns/gen_attention_patterns.py ``` - Updated the sfdp patterns to use `gen_register_replacement`. - Add serialized patterns for mm_pattern and bmm_pattern (The 'misc' patterns don't serialize cleanly so can't be added). - Updated the testing so it checked the round-trip patterns match and not just that it serialized the same way. - Checking that the patterns round-trip properly found that the `users` field wasn't being serialized properly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121313 Approved by: https://github.com/eellison	2024-04-09 19:42:19 +00:00
Alexander Soare	f772ea5493	Improve return value docs for Module.load_state_dict (#123637 ) Sorry to add this to your plate but I hope it helps. I find it's ambiguous what "missing keys" and "unexpected keys" are, and the documentation does not add clarity. Today I realized I've been second-guessing myself on this for years. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123637 Approved by: https://github.com/mikaylagawarecki	2024-04-09 19:39:31 +00:00
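The two terms can be pinned down with plain set arithmetic. The sketch below is an illustration, not PyTorch's implementation; `diff_state_dict_keys` is a hypothetical helper. "Missing" keys are those the module expects but the provided state dict lacks; "unexpected" keys are those the state dict provides but the module does not expect.

```python
def diff_state_dict_keys(module_keys, state_dict_keys):
    """Return (missing, unexpected) key lists, each sorted for stable output."""
    module_keys = set(module_keys)
    state_dict_keys = set(state_dict_keys)
    # Expected by the module, absent from the provided state dict.
    missing = sorted(module_keys - state_dict_keys)
    # Present in the provided state dict, not expected by the module.
    unexpected = sorted(state_dict_keys - module_keys)
    return missing, unexpected
```

For example, a module expecting `weight` and `bias` that is handed `weight` and `running_mean` reports `bias` as missing and `running_mean` as unexpected.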
PyTorch MergeBot	10d06fc92e	Revert "[EZ] Update mypy to 1.9.0 (#123595 )" This reverts commit f61b04a1f0ed55aa3f9b75e1266e7a5dc71fc90d. Reverted https://github.com/pytorch/pytorch/pull/123595 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/123595#issuecomment-2045865407))	2024-04-09 18:53:55 +00:00
Thiago Crepaldi	1b5944358e	Ignore logging.Logger.* calls during dynamo export (#123402 ) Follow up for https://github.com/pytorch/pytorch/pull/123368 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123402 Approved by: https://github.com/williamwen42	2024-04-09 18:51:00 +00:00
Guilherme Leobas	2a37793249	[Dynamo] Ensure that Higher Order Ops can be composed in dynamo (#123357 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123357 Approved by: https://github.com/zou3519 ghstack dependencies: #122211	2024-04-09 18:50:17 +00:00
Yu, Guangye	497bac223c	Add XPU backend check on NamedTensor (#123081 ) # Motivation Support `NamedTensor` on XPU backend. # Additional Context No UTs needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123081 Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/ezyang	2024-04-09 18:45:17 +00:00
albanD	56dd7603da	Cleanup comment (#123467 ) Realized that these comments are not correct anymore. Updated to match what the code does Pull Request resolved: https://github.com/pytorch/pytorch/pull/123467 Approved by: https://github.com/mikaylagawarecki	2024-04-09 18:00:20 +00:00
Will Feng	7a78534468	[Compile FSDP2][1/n] Support using user-defined object instance method as hook (#123399 ) FSDP2 has this pattern of using user-defined object instance method as hook, and it will throw this error under compile: `torch._dynamo.exc.Unsupported: call_function UserDefinedObjectVariable(_pre_forward) [FSDPManagedNNModuleVariable(), TupleVariable(), ConstDictVariable()] {}` This PR adds support for it by always allowing to trace into a UserDefinedObjectVariable that's an instance method (i.e. `MethodType`). Supersedes https://github.com/pytorch/pytorch/pull/123320. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123399 Approved by: https://github.com/jansel	2024-04-09 17:29:08 +00:00
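The distinction the commit above relies on, a bound instance method versus other callables, can be checked with the standard library alone. `HookOwner` and `is_instance_method_hook` are hypothetical names for illustration; this is not Dynamo's tracing logic.

```python
import types


class HookOwner:
    """Stand-in for an object whose bound method is registered as a hook."""

    def __init__(self):
        self.calls = []

    def _pre_forward(self, module, args, kwargs):
        # Record the call and pass the inputs through unchanged.
        self.calls.append((module, args, kwargs))
        return args, kwargs


def is_instance_method_hook(hook):
    """True when `hook` is a bound method, i.e. a `types.MethodType` instance."""
    return isinstance(hook, types.MethodType)
```

Accessing `_pre_forward` through an instance yields a `MethodType` object, while accessing it through the class yields a plain function, so only the former matches the check.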
Edward Z. Yang	9a661636e3	Make lint clean on OS X (#123052 ) I don't know why I get different mypy problems when I run on my Macbook, but they weren't too hard to fix so I just fixed them. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123052 Approved by: https://github.com/tugsbayasgalan, https://github.com/cyyever, https://github.com/albanD	2024-04-09 17:10:16 +00:00
sifengyang	46903d978b	fix maybe_initialize_device for custom device. (#121379 ) 1. Fix maybe_initialize_device for custom device. @wanchaol @albanD I am very sorry that I have resubmitted this PR from a new e-mail address. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121379 Approved by: https://github.com/albanD	2024-04-09 16:58:52 +00:00
Yu, Guangye	270dd99180	Fix record issue on XPUGuardImpl (#123523 ) # Motivation Previously, `xpu_event` became a dangling pointer because the stack variable it pointed to was destroyed when the scope ended. As a result, the event-related functions (`destroyEvent`, `record`, `block`, and `queryEvent`) used in `c10/core/impl/InlineEvent.h`, which serves `c10::Event`, did not work correctly. # Solution Allocate the event on the heap with `new` and assign it to `xpu_event` to avoid the dangling pointer. # Additional Context Add a UT to cover this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123523 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD	2024-04-09 16:24:13 +00:00
Arun Pa	266e278ccf	UFMT formatting on test/distributions, test/error_messages, test/forward_backward_compatability (#123527 ) Partially addresses #123062 UFMT formatting on - test/distributions - test/error_messages, test/forward_backward_compatability Pull Request resolved: https://github.com/pytorch/pytorch/pull/123527 Approved by: https://github.com/huydhn	2024-04-09 16:03:46 +00:00
Yuanhao Ji	c96bd3de06	Enable UFMT on all of `test/fx` (#123622 ) Partially addresses #123062 Ran lintrunner on: - `test/fx` with command: ```bash lintrunner -a --take UFMT --all-files ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123622 Approved by: https://github.com/ezyang	2024-04-09 15:59:17 +00:00
Jakub Marcowski	3b3962f7b3	Enable UFMT on `torch_version.py` and `types.py` (#123131 ) Part of efforts described in #123062. --- This PR enables the `µfmt` formatting for the following files: - `torch_version.py` - `types.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123131 Approved by: https://github.com/ezyang	2024-04-09 15:03:17 +00:00
DanilBaibak	2834c68deb	Migrate linux-jammy-py3_10-clang15-asan-build to ARC (#123434 ) Migrate linux-jammy-py3_10-clang15-asan-build to ARC Pull Request resolved: https://github.com/pytorch/pytorch/pull/123434 Approved by: https://github.com/zxiiro, https://github.com/atalman	2024-04-09 14:23:18 +00:00
FFFrog	6980c5048d	UFMT formatting on test/mobile (#123521 ) Partially addresses https://github.com/pytorch/pytorch/issues/123062 Ran lintrunner on: test/mobile Detail: ```Shell $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123521 Approved by: https://github.com/shink, https://github.com/ezyang	2024-04-09 14:06:22 +00:00
Nikita Shulga	f61b04a1f0	[EZ] Update mypy to 1.9.0 (#123595 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123595 Approved by: https://github.com/kit1980	2024-04-09 13:36:57 +00:00
Bin Bao	491e2ed6d1	[AOTI] Fix an internal test regression (#123481 ) Summary: https://github.com/pytorch/pytorch/issues/123174 causes some internal tests to fail, because when the generated model.so uses the MinimalArrayRefInterface, inputs are in ArrayRefTensor which still need to be converted using convert_arrayref_tensor_to_tensor. So let's bring back the relevant code with an enhanced way to detect numbers. Test Plan: CI Differential Revision: D55823570 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123481 Approved by: https://github.com/chenyang78	2024-04-09 13:03:44 +00:00
chunyuan	a4a49f77b8	fix amp for AOTInductor (#122883 ) ## Pitch This PR disables the amp when calling the inference_compiler in AOTInductor path (after having exported the model graph), following the way we disable AMP in Inductor path in https://github.com/pytorch/pytorch/pull/86515. ## Description When testing AOTInductor AMP accuracy on CPU using the dynamo benchmark suites, multiple workloads will fail in this assertion: [assert pattern_repr not in _seen_patterns](`1d52c2d985/torch/_inductor/pattern_matcher.py (L1095)`) which is called when registering SDPA patterns. The `inference_compiler` ([fw_compiler_base](`1d52c2d985/torch/_inductor/compile_fx.py (L1234)`)) will call into [_recursive_joint_graph_passe](`1d52c2d985/torch/_inductor/compile_fx.py (L1241)`) and then [_sfdp_init](`1d52c2d985/torch/_inductor/fx_passes/fuse_attention.py (L847)`). When testing accuracy, we'll set [inductor_config.fallback_random = True](`1d52c2d985/benchmarks/dynamo/common.py (L3526)`), which will make the `search_fn` to be `None` [here](`1d52c2d985/torch/_inductor/fx_passes/serialized_patterns/central_index.py (L117-L118)`), thus the pattern will be generated runtime [here](`1d52c2d985/torch/_inductor/pattern_matcher.py (L1083-L1084)`). When AMP is on, the generated pattern for SDPA FP32 will be the same as that of FP16, which makes the assertion fail. Inductor path disables amp inside [aot_dispatch_base](`1d52c2d985/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L124-L128)`). We follow the same way to disable for AOTInductor path here (after having exported the model graph) to fix this issue. ## UT For the added UT, there's one case `python test/inductor/test_aot_inductor.py -k test_amp_fallback_random_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface` fails with the below error which is not caused by this PR itself. Marked it as skipped for now. 
``` RuntimeError: Error in dlopen: /tmp/torchinductor_user/cf5vk3gqkbvud56qeotdxqvns4wbk3sjnlnuadolt7b6g7a6kspb/cfzjo5ackvrth2gp6oq4lfpdyfafoagodfpjvbzhsi2u64hza2vn.so: undefined symbol: _Z16aoti_torch_dtypeIN3c108BFloat16EEiv ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122883 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-04-09 12:08:33 +00:00
DanilBaibak	4656ea5768	Migrate linux-jammy-py3_8-gcc11-pch to ARC (#123433 ) Migrate linux-jammy-py3_8-gcc11-pch to ARC Pull Request resolved: https://github.com/pytorch/pytorch/pull/123433 Approved by: https://github.com/zxiiro, https://github.com/atalman	2024-04-09 10:51:45 +00:00
chunyuan	15745a52b0	Inductor: don't change the stride_order of FlexibleLayout if it's already the same as required (#122945 ) ## Pitch Fixes https://github.com/pytorch/pytorch/issues/122489. Don't change the `stride_order` of `FlexibleLayout` if it already has the stride with the order required. ## Description For a layout that's both contiguous and channels last contiguous (for example `size=[s0, 1, 28, 28]`, `stride=[784, 784, 28, 1]` where the `C` dim is `1`), the behavior of calling [require_stride_order](`069270db60/torch/_inductor/ir.py (L4053)`) (where the order is specified as channels last) on it is different when it's a `FixedLayout` or a `FlexibleLayout`. - For a `FixedLayout`, the size and stride is unchanged after the call: `size=[s0, 1, 28, 28]`, `stride=[784, 784, 28, 1]`. - For a `FlexibleLayout`, it will become `size=[s0, 1, 28, 28]`, `stride=[784, 1, 28, 1])`. When weight is not prepacked (in dynamic shapes cases), the Conv extern kernel returns output in channels first for input with `size=[s0, 1, 28, 28]`, `stride=[784, 784, 28, 1]` but output in channels last for `size=[s0, 1, 28, 28]`, `stride=[784, 1, 28, 1])`. In this PR, for a `FlexibleLayout`, we add a check to see if it already has the stride in the required order. If that's the case, we don't change its stride order when freezing it. This makes the behavior of calling [require_stride_order](`069270db60/torch/_inductor/ir.py (L4053)`) aligned for `FixedLayout` and `FlexibleLayout`. ## Additional context For a `FixedLayout`, when calling [require_stride_order](`069270db60/torch/_inductor/ir.py (L4053)`), it will firstly run into [x.get_layout().is_stride_ordered(order)](`069270db60/torch/_inductor/ir.py (L4067-L4070)`) to check if it's already ordered as expected. 
If it is a `FlexibleLayout`, when calling [require_stride_order](`069270db60/torch/_inductor/ir.py (L4053)`), it runs into [as_storage_and_layout](`069270db60/torch/_inductor/ir.py (L4063-L4065)`), which will always [freeze_layout_with_stride_order](`069270db60/torch/_inductor/ir.py (L1805)`) and will always call [as_stride_order](`069270db60/torch/_inductor/ir.py (L2909)`), without checking if the default stride of this `FlexibleLayout` (which has been realized before) is already as expected ([link](`069270db60/torch/_inductor/ir.py (L2693-L2700)`)). Pull Request resolved: https://github.com/pytorch/pytorch/pull/122945 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-04-09 10:00:30 +00:00
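The "already in the required order" check from the commit above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Inductor's `is_stride_ordered`: sizes are concrete integers (the symbolic `s0` is replaced by 8), `order[i]` is taken to be the rank of dimension `i` with 0 as the fastest-moving dimension, and size-1 dimensions are ignored because their stride does not affect the memory layout.

```python
CHANNELS_LAST = [3, 0, 2, 1]  # ranks for N, C, H, W: C moves fastest
CONTIGUOUS = [3, 2, 1, 0]     # ranks for N, C, H, W: W moves fastest


def is_stride_ordered(size, stride, order):
    """Return True if `stride` already follows `order`, ignoring size-1 dims."""
    kept = [i for i, dim in enumerate(size) if dim != 1]
    # Visit the remaining dimensions from fastest-moving to slowest-moving.
    by_rank = sorted(kept, key=lambda i: order[i])
    strides_by_rank = [stride[i] for i in by_rank]
    # The strides must be non-decreasing along that traversal.
    return all(a <= b for a, b in zip(strides_by_rank, strides_by_rank[1:]))
```

With `size=[8, 1, 28, 28]` and `stride=[784, 784, 28, 1]`, the check passes for both `CONTIGUOUS` and `CHANNELS_LAST`, which is the case where the layout can be frozen without rewriting its strides.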
Michael Lazos	7c23fed12c	Move step to cpu if state is already initialized (#123618 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123618 Approved by: https://github.com/anijain2305 ghstack dependencies: #123496, #123497, #123551, #123552	2024-04-09 09:04:18 +00:00
Oguz Ulgen	526a69f5ee	Remove incorrect check (#123616 ) Summary: This was a micro optimization that I thought would save time but it is not correct. For example, we cannot compare fake tensors. Test Plan: ``` buck2 run 'fbcode//mode/opt' fbcode//langtech/edge/ns/tools/tests:test_ns_jit_traced_model_all_optimization_f328819347_portal_ns ``` now passes Differential Revision: D55904083 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123616 Approved by: https://github.com/aakhundov	2024-04-09 08:45:34 +00:00
PyTorch MergeBot	d04957c0c6	Revert "Ignore logging.Logger.* calls during dynamo export (#123402 )" This reverts commit 75933ff5231b1caed333065ea9f5a847caa4cdaa. Reverted https://github.com/pytorch/pytorch/pull/123402 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123402#issuecomment-2044236088))	2024-04-09 06:28:12 +00:00
PyTorch MergeBot	b9d2b75bac	Revert "Add test for skipping hf logging during export (#123410 )" This reverts commit ba55ef8e2165c718a269e5bca0cb83c635731c83. Reverted https://github.com/pytorch/pytorch/pull/123410 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123402#issuecomment-2044236088))	2024-04-09 06:28:12 +00:00
Valentine233	666a628bea	[Inductor pattern] support int8 woq mm pattern matcher with freezing pass (#122955 ) The previous PR (https://github.com/pytorch/pytorch/pull/120985) that added support for the int8 WOQ mm pattern matcher has some issues. This PR further optimizes it. 1. New patterns are added to match int8 woq mm in gpt-fast model, due to different input layouts. 2. In constant folding, `int8_weight -> dq -> bf16_weight` should be kept for pattern match. 3. Currently, GPT-Fast enables `coordinate_descent_tuning` for CPU. This flag is only useful for CUDA, but it could change the graph: from the non-decomposed fallback pass to the decomposed one. We will disable the flag in GPT-Fast script for CPU, in order to have neat patterns. @yanbing-j Pull Request resolved: https://github.com/pytorch/pytorch/pull/122955 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-04-09 05:06:52 +00:00
Xia, Weiwen	d86cb9c747	[Quant][Inductor] Add qlinear_pointwise.binary op for X86Inductor backend (#123144 ) Note: This is a reopen of https://github.com/pytorch/pytorch/pull/122288, which was merged by `ghstack land` to its base (not main) by mistake. Description Add qlinear_binary op for X86Inductor backend of quantization PT2E. It only supports `add` and `add_relu` now. It will use post op sum if the extra input has the same dtype as output. Otherwise, it uses binary add. ``` +-------------------+--------------+---------------+ \| Extra input dtype \| Output dtype \| Post op \| +-------------------+--------------+---------------+ \| Fp32/bf16 \| fp32/bf16 \| sum or add* \| +-------------------+--------------+---------------+ \| Fp32/bf16 \| int8 \| add \| +-------------------+--------------+---------------+ \| int8 \| fp32/bf16 \| not supported \| +-------------------+--------------+---------------+ \| int8 \| int8 \| sum \| +-------------------+--------------+---------------+ Use sum if extra input and output have the same dtype; otherwise use add. ``` Test plan* python test_quantization.py -k test_qlinear_add_pt2e python test_quantization.py -k test_qlinear_add_relu_pt2e Pull Request resolved: https://github.com/pytorch/pytorch/pull/123144 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168	2024-04-09 04:56:37 +00:00
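The post-op table in the commit above reduces to a short decision function. The sketch below only restates that table; `choose_post_op` and the string dtype names are hypothetical and are not part of the X86Inductor quantizer API.

```python
FLOAT_DTYPES = {"fp32", "bf16"}
SUPPORTED_DTYPES = FLOAT_DTYPES | {"int8"}


def choose_post_op(extra_input_dtype, output_dtype):
    """Pick the post op ("sum" or "add") for a qlinear binary fusion."""
    for dtype in (extra_input_dtype, output_dtype):
        if dtype not in SUPPORTED_DTYPES:
            raise ValueError(f"unknown dtype: {dtype}")
    # An int8 extra input with a floating-point output is not supported.
    if extra_input_dtype == "int8" and output_dtype in FLOAT_DTYPES:
        raise ValueError("int8 extra input with fp32/bf16 output is not supported")
    # Use sum when the extra input and output share a dtype; otherwise add.
    return "sum" if extra_input_dtype == output_dtype else "add"
```

The single same-dtype rule covers both the `int8`/`int8` row and the starred `sum or add` row of the table.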
Animesh Jain	7283c37c98	[dynamo] Keep guards on global function (#123423 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123423 Approved by: https://github.com/jansel	2024-04-09 04:23:11 +00:00
Jiong Gong	98cf183629	[merge rules] add mkldnn_lowerings.py to CPU inductor rule (#123627 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123627 Approved by: https://github.com/kit1980	2024-04-09 04:07:20 +00:00
Edward Z. Yang	0d3a771f7b	Allow for git worktree when computing clangtidy scm root (#123060 ) When you make a git worktree, the .git "folder" in the worktree is not a directory, it's a file pointing at the actual .git directory. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123060 Approved by: https://github.com/albanD	2024-04-09 03:49:27 +00:00
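A root-finding routine that tolerates worktrees only needs to test that `.git` exists, not that it is a directory. The sketch below is a minimal illustration; `find_scm_root` and `demo` are hypothetical names, not the clang-tidy linter's actual code.

```python
import tempfile
from pathlib import Path


def find_scm_root(start):
    """Walk upwards from `start` until a `.git` entry (file or directory) is found."""
    current = Path(start).resolve()
    for candidate in [current, *current.parents]:
        # In a worktree `.git` is a regular file, so `is_dir()` would miss it.
        if (candidate / ".git").exists():
            return candidate
    raise RuntimeError(f"no .git entry found above {current}")


def demo():
    """Simulate a worktree whose `.git` is a file pointing at the real git dir."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        (root / ".git").write_text("gitdir: /path/to/main/.git/worktrees/example\n")
        nested = root / "tools" / "linter"
        nested.mkdir(parents=True)
        return find_scm_root(nested) == root
```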
Jiong Gong	0fd072bf90	[inductor] easy: move mkldnn lowerings to its own file (#123556 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123556 Approved by: https://github.com/peterbell10, https://github.com/jansel	2024-04-09 03:44:27 +00:00
William Wen	f07c0977d5	[dynamo, 3.12] avoid using co_lnotab in symbolic_convert (#123577 ) Accessing co_lnotab causes a deprecation warning to be issued, causing some dynamo-wrapped tests to fail. We do not need to remove co_lnotab from tests as of now, as they are still useful as an additional check for linetable correctness, but we will need to deal with co_lnotab removal by 3.14. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123577 Approved by: https://github.com/jansel	2024-04-09 03:40:05 +00:00
Menglu Yu	b63477faa2	[PT2][Inductor] Add decompose_mem_bound_mm to the customization pre and post grad passes (#123376 ) Summary: Titled. It can give more flexibility to customize passes Test Plan: # unit test ``` buck2 test @mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm ``` # local reproduce ### with decompose ``` buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split-decompose --model_type "cmf" --flow_id 540761965 ``` optimus parameter sent to the scuba: P1204802500 ``` {'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLxXNRNO6q8ixo4CAIn5QalwADlsbr0LAAAz', 'BatchLayernormFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GN8MQhmQrZaii4sFAJ37FLW-yjkobr0LAAAz', 'BatchSigmoidPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GKYr2xa3vOkEKIQDAL5eKqkDWQAebr0LAAAz', 'normalization_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GAjzORm9OYV951kBAF5WyqbckVY2br0LAAAz', 'remove_split_with_size_one_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMpzQhbeucEI_BwDAOK0nUGoCsZkbr0LAAAz', 'merge_getitem_cat_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GDJ2whaLisgDsYMDABd4ox_-2gp5br0LAAAz', 'merge_splits_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GJY4Pxkg0hntj9UCALgYP3xMdmMMbr0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GO1gCxfWSaDFqhIBABzCPhU827F7br0LAAAz', 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GPzyNBaxHtNFJdADADH7AsWMwixBbr0LAAAz', 'BatchMulPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GBxaWAofHuojr0EBALKAINF-n_Ebbr0LAAAz', 'after_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GPTR_RZdqWhlGmwEADUfB1t_xKN-br0LAAAz', 'inductor': 
Counter({'pattern_matcher_nodes': 3615, 'pattern_matcher_count': 3231, 'normalization_pass': 825, 'remove_split_with_size_one_pass': 673, 'merge_splits_pass': 85, 'merge_getitem_cat_pass': 11, 'batch_aten_mul': 11, 'scmerge_split_sections_removed': 5, 'scmerge_split_removed': 4, 'scmerge_cat_removed': 4, 'decompose_mm': 4, 'decompose_mmt': 4, 'batch_aten_sub': 3, 'batch_sigmoid': 2, 'batch_linear': 2, 'batch_aten_add': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'batch_relu': 1, 'batch_linear_post_grad': 1}), 'PreGradBatchLinearFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GEp2SBdLi4Q9SfYCALG5AsLl-LJubr0LAAAz', 'BatchReLuPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMry_BbFwSc8epcBAP7-LFeL-aRbbr0LAAAz', 'BatchAddPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GKpamxaU2v4MlyANANGbWkDgUAQabr0LAAAz', 'BatchSubPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GOMotxYcfE3jsWEBAFi0ABcmUboYbr0LAAAz', 'PostGradBatchLinearFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GC3yQRku3hY9VmkBAH3QvuAf5z8Cbr0LAAAz'} ``` ### without decompose optimus parameter sent to the scuba: P1204807273 ``` {'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLDLYxbo4HnP1ssDAKDGl5fN9SUnbr0LAAAz', 'BatchLayernormFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GOKrQBnK6dfZg3YDALrJX7r23dN8br0LAAAz', 'BatchSigmoidPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GER6ChcNzZ9NX94DAH6ZWJFFD5Uzbr0LAAAz', 'normalization_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GNmRphbUGk2zvswAAJ3sOh3WWGBAbr0LAAAz', 'remove_split_with_size_one_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GDYJJBQRWfOYB0wFAJpCr7RsFnsQbr0LAAAz', 'merge_getitem_cat_pass_pre_grad': 
'https://www.internalfb.com/intern/everpaste/?color=0&handle=GM2ABxPOewvvdm8FAMPnyXSb6Fwzbr0LAAAz', 'merge_splits_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GOkqSBYgyv9G4tQCAFBtGCq1OUhkbr0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMdbSBeSWQGyOGkDANtexORtG0lMbr0LAAAz', 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GF8rvghuhGGVZXMBAPKAC7WPIeUGbr0LAAAz', 'BatchMulPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GDWyCheBejGMvq0FAApYMMDOu7Jwbr0LAAAz', 'after_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GDRgqxE_qCrmMyIDAL5TQ977TQknbr0LAAAz', 'inductor': Counter({'pattern_matcher_nodes': 2323, 'pattern_matcher_count': 2071, 'normalization_pass': 825, 'remove_split_with_size_one_pass': 673, 'merge_splits_pass': 85, 'merge_getitem_cat_pass': 11, 'scmerge_split_sections_removed': 5, 'scmerge_split_removed': 4, 'scmerge_cat_removed': 4, 'batch_sigmoid': 2, 'batch_linear': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'batch_aten_mul': 1, 'batch_relu': 1}), 'PreGradBatchLinearFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GNyUxRYGchkL_gMLAKk5mC-cbU9zbr0LAAAz', 'BatchReLuPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GD19_BZMbHm46BMNAE05wMFtvB9mbr0LAAAz'} ``` # e2e ads_dper3:f044a2eb340c5477d3347540114a83b0 training_platform:cb4809460a19df86785e790d3a9b92a6 ### with decompose use old config: ``` "decompose_mem_bound_mm": true ``` f549189471 {F1478026031} add to the post grad fusion options: ``` "post_grad_fusion_options": { "batch_linear_post_grad": {}, "decompose_mm_pass": { "min_first_dimension_decomposition": 10240, "max_other_dimention_decomposition": 32 } }, ``` f549189811 {F1478026133} ### without decompose with optimize_compress off f549190692 {F1478027534} with optimize_compress on f549191653 
### QPS and NE {F1481917745} {F1481917870}{F1481917871} ### conclusion 1. when compare with no optimize_compress, has ~2% qps gain with NE neutral 2. when compare with optimize_compress, qps is neutral but the optimize_compress shows NE gap compared to the baseline Differential Revision: D55679277 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123376 Approved by: https://github.com/jackiexu1992	2024-04-09 03:35:36 +00:00
Peter Bell	8865425ff7	[minifier] Add config flag to ignore non-fp values (#123006 ) When minifying, the after-aot minifier ignores non-floating values by default but does check them when running the initial graph dump step. This means we may capture a graph that doesn't fail the tester and doesn't have any meaningful divergence. For example, the derivative of `elu(x)` depends on `x > 0` so this value is saved for backwards and so becomes a graph output. However, the difference between `FLT_MIN` and `0` in `x` is now enough to trigger an accuracy failure. I fix this by adding a config variable and environment variable to ignore these non floating point values. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123006 Approved by: https://github.com/ezyang ghstack dependencies: #123005	2024-04-09 03:34:09 +00:00
Menglu Yu	6d005ca590	[PT2][Observability] Add model_type and global_rank for the scuba log for the dashboard Optimus pattern frequency monitor (#123398 ) Summary: We also log the model type and global rank to make Scuba queries easier when developing the dashboard monitor. More context: https://docs.google.com/document/d/1RuUCOBOgVt9pp-Jgoo4oEXWvoYv6GN0DljypsqgVTp4/edit Test Plan: # local reproduce ``` buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf" --flow_id 524546542 ``` optimus parameter sent to the scuba: ``` {'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GO3rCxej_mk0RV0DAPE1wtdadgNkbr0LAAAz', 'BatchLayernormFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GESWQhm_XNqiIYYCAJ2nCcg9PPwnbr0LAAAz', 'BatchSigmoidPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GAOv_RZb5hEwKIQBAPc7kNFDN2kEbr0LAAAz', 'normalization_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMUFOxkqRm1ellcDAFLjROHAy4NXbr0LAAAz', 'remove_split_with_size_one_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GAuEghZfCNtAVtcCACAqgBH3h4R0br0LAAAz', 'merge_getitem_cat_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GL_q2xZIJ9gRUp4GAAnBc-_frnUpbr0LAAAz', 'merge_splits_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GGIerBZMJvpn5moBAH4lzgkY5_Rjbr0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GKPngRabKVDgodEHAJNTi6H37kwbbr0LAAAz', 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GBuDPxmwQPFoGJkCAOsLt_QwVNxvbr0LAAAz', 'BatchMulPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GJQypRaGi3AMr3MBAMWUDs5rHztkbr0LAAAz', 'after_recompile_post_grad': 
'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMce9xaOaCu3l9YCAM41j-H0hWZMbr0LAAAz', 'inductor': Counter({'pattern_matcher_nodes': 2281, 'pattern_matcher_count': 2081, 'normalization_pass': 864, 'remove_split_with_size_one_pass': 748, 'merge_splits_pass': 82, 'merge_getitem_cat_pass': 11, 'scmerge_split_sections_removed': 4, 'batch_layernorm': 1, 'batch_sigmoid': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'scmerge_split_removed': 1, 'scmerge_cat_removed': 1, 'batch_aten_mul': 1}), 'model_type': None, 'global_rank': None} ``` # e2e test I have no resources to run the test right now due to the MC proposal deadline. Will add it next week. It should be OK based on the local reproduction results. Differential Revision: D55777055 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123398 Approved by: https://github.com/Yuzhen11	2024-04-09 03:28:10 +00:00
angelayi	7420b8c5be	[effects] Add way to register effectul op (#122348 ) This adds a way to register an operator as being effectful. I also added a test case which mimics our solution for intermediate logging ([doc](https://docs.google.com/document/d/1eLyGDVe4iplVFiO0I021uLgA4Y6HxK9eqn55e9KzQkc/edit#heading=h.uwec2ukkwhea)), which is by creating a custom op and registering it as effectful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122348 Approved by: https://github.com/zou3519 ghstack dependencies: #122347	2024-04-09 03:22:32 +00:00
angelayi	493478db4a	[effects] Add inductor support for tokens (#122347 ) Given the following code/dynamo graph: ``` class GraphModule(torch.nn.Module): def forward(self, L_x_ : torch.Tensor): l_x_ = L_x_ _print = torch.ops.aten._print('moo') res = l_x_ + l_x_; l_x_ = None _print_1 = torch.ops.aten._print('moo') return (res,) ``` AOTAutograd will trace the following program, threading tokens from the inputs, through the effectful operator calls (torch.ops.aten._print), and as an output: ``` class <lambda>(torch.nn.Module): def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2, 3]"): with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops.aten._print.default, 'moo'); arg0_1 = None getitem: "f32[0]" = with_effects[0]; with_effects = None add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1); arg1_1 = None with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo'); getitem = None getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None return (getitem_2, add) ``` However when we get to inductor, since we want the inductor generated code to not have any token inputs/outputs for better readability, we want to modify the aten graph by removing the tokens from inputs, and creating them through `torch.ops.aten._make_dep_token`, and sinking them through the `torch.ops.aten._sink_tokens` operators. This has to be done after the partitioner, otherwise the partitioner will add the make_token/sink_token operators to the backwards graph. 
``` class <lambda>(torch.nn.Module): def forward(self, arg1_1: "f32[2, 3]"): _make_dep_token_default: "f32[0]" = torch.ops.aten._make_dep_token.default() with_effects = torch._higher_order_ops.effects.with_effects(_make_dep_token_default, torch.ops.aten._print.default, 'moo'); _make_dep_token_default = None getitem: "f32[0]" = with_effects[0]; with_effects = None add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1); arg1_1 = None with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo'); getitem = None getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None _sink_tokens_default = torch.ops.aten._sink_tokens.default((getitem_2,)); getitem_2 = None return (add,) ``` When doing inductor lowering, we convert `with_effects` calls to an `EffectfulKernel`, which is just a `FallbackKernel` but with a pointer to the previous effectful operator's call. During scheduling, we will create a `StarDep` between the EffectfulKernel and its previous EffectfulKernel so that they don't get reordered. The inductor-generated Python code looks like: ``` def call(args): arg1_1, = args args.clear() assert_size_stride(arg1_1, (2, 3), (3, 1)) # Source Nodes: [_print], Original ATen: [] buf2 = aten._print.default('moo') # Source Nodes: [_print_1], Original ATen: [] buf3 = aten._print.default('moo') buf4 = empty_strided_cpu((2, 3), (3, 1), torch.float32) cpp_fused_add_0(arg1_1, buf4) del arg1_1 return (buf4, ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122347 Approved by: https://github.com/bdhirsh	2024-04-09 03:22:32 +00:00
drisspg	9bd6d6e8b0	Add mem-eff-attention's sliding window arg to align with xformers (#123571 ) # Summary Updates to align with implementation in xformers Pull Request resolved: https://github.com/pytorch/pytorch/pull/123571 Approved by: https://github.com/danthe3rd	2024-04-09 02:19:12 +00:00
Michael Lazos	565e8c0645	[Reland] Enable dynamo'd tests disabled for #115679 (#123552 ) Relanding https://github.com/pytorch/pytorch/pull/123315 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123552 Approved by: https://github.com/anijain2305 ghstack dependencies: #123496, #123497, #123551	2024-04-09 02:14:32 +00:00
Pian Pawakapan	8bd6223730	[export] construct set_grad_enabled HOO subgraph inside other HOO subgraphs (#123391 ) Summary: Reference: https://github.com/pytorch/pytorch/pull/121736 Previously set_grad_enabled nodes in HOO subgraphs (e.g. cond) were inlined and not replaced with their own HOO subgraphs. This diff recursively does that. Example: ``` class Model(torch.nn.Module): def forward(self, x, y): def true_fn(x, y): with torch.enable_grad(): return x - y return torch.cond( x.sum() > 0, true_fn, lambda x, y: x + y, [x, y], ) ``` Before (printing out `ep.graph_module.true_graph_0`): ``` class <lambda>(torch.nn.Module): def forward(self, arg0_1: "i64[]", arg1_1: "i64[]"): # No stacktrace found for following nodes _set_grad_enabled = torch._C._set_grad_enabled(True) sub: "i64[]" = torch.ops.aten.sub.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None _set_grad_enabled_1 = torch._C._set_grad_enabled(False) return (sub,) ``` After: ``` class GraphModule(torch.nn.Module): def forward(self, arg0_1: "i64[]", arg1_1: "i64[]"): # No stacktrace found for following nodes submod_3 = self.submod_1 sub: "i64[]" = torch._higher_order_ops.wrap.wrap_with_set_grad_enabled(True, submod_3, arg0_1, arg1_1); submod_3 = arg0_1 = arg1_1 = None return (sub,) class GraphModule(torch.nn.Module): def forward(self, arg0_1: "i64[]", arg1_1: "i64[]"): # No stacktrace found for following nodes sub: "i64[]" = torch.ops.aten.sub.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None return sub ``` Differential Revision: D55770138 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123391 Approved by: https://github.com/tugsbayasgalan	2024-04-09 02:08:03 +00:00
Jeff Daily	3969f85769	add TORCH_NCCL_HIGH_PRIORITY option (#122830 ) There are many existing ProcessGroupNCCL features controlled by env vars. This PR adds TORCH_NCCL_HIGH_PRIORITY to force the use of high-priority CUDA or HIP streams for the NCCL or RCCL kernels, respectively. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122830 Approved by: https://github.com/kwen2501	2024-04-09 01:11:18 +00:00
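As a usage sketch for the option described in the entry above: the helper below only sets the environment variable. It assumes that the value `"1"` is read as true and that the variable must be set before the NCCL process group is created; neither detail is stated in the commit message, and `request_high_priority_nccl_streams` is a hypothetical name.

```python
import os


def request_high_priority_nccl_streams(env=None):
    """Set TORCH_NCCL_HIGH_PRIORITY in the given mapping (default: os.environ).

    Assumption: "1" is interpreted as true by ProcessGroupNCCL.
    """
    env = os.environ if env is None else env
    env["TORCH_NCCL_HIGH_PRIORITY"] = "1"
    return env
```

In a training script this would be called before `torch.distributed.init_process_group(backend="nccl")`; exporting the variable in the launch shell should have the same effect.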
Colin Peppler	9faa8848ea	[aotinductor] Add test case for outputs with views (#123415 ) Also test views instead of .contiguous() for outputs with multiple aliases. ``` output_handles[0] = buf0.release(); output_handles[1] = output_handles[0]; output_handles[2] = wrap_with_raii_handle_if_needed(tmp_tensor_handle_0).release(); ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123415 Approved by: https://github.com/chenyang78	2024-04-08 23:56:01 +00:00
Yifu Wang	c3d37a88ed	Fix a perf regression in MultiTensorApply (#123566 ) #119764 inadvertently increased the register usage of `multi_tensor_apply_for_fused_optimizer` and caused a perf regression. The increase was due to an unnecessary indirection from `multi_tensor_apply_kernel` to `multi_tensor_apply_kernel_dev`. This PR fixes the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123566 Approved by: https://github.com/eqy, https://github.com/janeyx99 ghstack dependencies: #119764	2024-04-08 23:29:21 +00:00
Thiago Crepaldi	ba55ef8e21	Add test for skipping hf logging during export (#123410 ) https://github.com/pytorch/pytorch/pull/123402 already supports HF logging because the HF logger is based on the `logging` module. This PR only adds a test to guard against regressions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123410 Approved by: https://github.com/BowenBao, https://github.com/malfet ghstack dependencies: #123402	2024-04-08 23:20:30 +00:00
Jez Ng	1b9eebb6bb	[AOTI] Handle null outputs (#123460 ) Summary: I skipped over the codegen for output handle assignment if the outputs are null -- in addition to being redundant, it was causing compile errors. I also modified the runtime to do the necessary null checks. Fixes #123173. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123460 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-04-08 23:07:03 +00:00
Thiago Crepaldi	75933ff523	Ignore logging.Logger.* calls during dynamo export (#123402 ) Follow up for https://github.com/pytorch/pytorch/pull/123368 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123402 Approved by: https://github.com/williamwen42	2024-04-08 22:50:54 +00:00
Lucas Pasqualin	aa9aed2fcf	Removes forgotten print statement (#123579 ) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/123579 Approved by: https://github.com/weifengpy, https://github.com/wz337	2024-04-08 22:37:19 +00:00
Jason Ansel	d8e0c26e64	[dynamo] Support warnings.catch_warnings (#123511 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123511 Approved by: https://github.com/anijain2305	2024-04-08 22:27:46 +00:00
Michael Lazos	6951626735	[Reland] Enable tests disabled for #115607 (#123551 ) Relanding https://github.com/pytorch/pytorch/pull/123314 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123551 Approved by: https://github.com/anijain2305 ghstack dependencies: #123496, #123497	2024-04-08 21:29:28 +00:00
Gustav Larsson	4765570359	[onnx.export] Avoid building vals_to_params_map (#123025 ) This PR is part of an effort to speed up torch.onnx.export (#121422). - Building vals_to_params_map costs linear time in N (number of nodes), when instead we can index into this dictionary directly. - No need to call HasField on the final else, since c10::nullopt is the default returned value if a field does not exist. - Resolves (3) in #121422. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123025 Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi	2024-04-08 21:02:55 +00:00
Sam Larsen	c48f6680ff	Skip test_artificial_grid_cpp_wrapper (#123211 ) Summary: This test is actually broken and probably succeeding by mistake because of a cache hit. Forcing a fresh cache or removing the errant setting causes a consistent failure. Disabling for now until we have time to investigate further. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123211 Approved by: https://github.com/desertfire	2024-04-08 20:55:47 +00:00
Angela Yi	1be2126ff6	[pytree] Fix namedtuple serialization (#123388 ) Summary: Previously we were serializing namedtuple treespecs incorrectly: ```python Point = namedtuple("Point", ["x", "y"]) p = Point(1, 2) flat, spec = pytree.tree_flatten(p) print(flat) # [1, 2] print(spec) # TreeSpec(type=namedtuple, context=Point, children=[, ]) dumped_spec = pytree.treespec_dumps(spec) print(dumped_spec) """ We only serialize the name of the class and the fields of the namedtuple: TreeSpec { type='collections.namedtuple', context={class_name='Point', class_fields={'x', 'y'}}, children=[Leaf, Leaf] } """ reconstructed_spec = pytree.treespec_loads(dumped_spec) print(reconstructed_spec) """ When we load, we create a new namedtuple class containing the same fields as before, but the class is now a completely different class than the original one: TreeSpec(type=namedtuple, context=torch.utils._pytree.Point, children=[, ]) """ spec == reconstructed_spec # False ``` So, we introduce a new API called `pytree._register_namedtuple` where users can pass in the serialized name for each namedtuple class: ```python Point = namedtuple("Point", ["x", "y"]) pytree._register_namedtuple(Point, "Point") p = Point(1, 2) flat, spec = pytree.tree_flatten(p) print(flat) # [1, 2] print(spec) # TreeSpec(type=namedtuple, context=Point, children=[, ]) dumped_spec = pytree.treespec_dumps(spec) print(dumped_spec) """ TreeSpec { type='collections.namedtuple', context='Point', children=[Leaf, Leaf] } """ reconstructed_spec = pytree.treespec_loads(dumped_spec) print(reconstructed_spec) # TreeSpec(type=namedtuple, context=Point, children=[, ]) spec == reconstructed_spec # True ``` Test Plan: `python test/test_pytree.py` Differential Revision: D55771058 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123388 Approved by: https://github.com/zou3519	2024-04-08 20:55:19 +00:00
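The root cause in the entry above, that rebuilding a namedtuple from its name and fields yields a distinct class, can be reproduced with the standard library alone. The registry below is a hypothetical stand-in for the idea behind `pytree._register_namedtuple`, not its implementation; `REGISTRY`, `register`, `dumps_type`, and `loads_type` are illustrative names.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])

# What a naive loader does: rebuild a class from the name and the fields.
Rebuilt = namedtuple("Point", ["x", "y"])

same_fields = Point._fields == Rebuilt._fields  # the fields agree
same_class = Point is Rebuilt                   # but the classes differ

# Hypothetical registry keyed by a user-chosen serialized name.
REGISTRY = {}


def register(cls, serialized_name):
    """Remember which class a serialized name refers to."""
    REGISTRY[serialized_name] = cls


def dumps_type(cls):
    """Serialize a registered class as its registered name."""
    for name, registered in REGISTRY.items():
        if registered is cls:
            return name
    raise KeyError(f"{cls!r} is not registered")


def loads_type(name):
    """Return the original class object instead of rebuilding one."""
    return REGISTRY[name]


register(Point, "Point")
round_tripped = loads_type(dumps_type(Point))
```

Because the loader returns the registered class object rather than constructing a new one, any equality check that compares classes holds again after the round trip.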
Yuanhao Ji	c797fbc4e1	Enable UFMT on `test/cpp_api_parity`, `test/cpp_extensions`, `test/create_dummy_torchscript_model.py`, `test/custom_backend`, `test/custom_operator` (#123518 ) Partially addresses #123062 Ran lintrunner on: - `test/cpp_api_parity` - `test/cpp_extensions` - `test/create_dummy_torchscript_model.py` - `test/custom_backend` - `test/custom_operator` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123518 Approved by: https://github.com/huydhn	2024-04-08 20:18:42 +00:00
Chien-Chin Huang	b279034e5a	[DDP][PT2D] Add the trace rules for DDP (#121741 ) Add the trace rules for DDP and refactor the tests to verify both DDP and replicate. Differential Revision: [D54815909](https://our.internmc.facebook.com/intern/diff/D54815909/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121741 Approved by: https://github.com/yf225 ghstack dependencies: #123206, #123207	2024-04-08 19:53:13 +00:00
Michael Lazos	89e6292d48	Defer setting capturable in optimizer variable (#123497 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123497 Approved by: https://github.com/anijain2305 ghstack dependencies: #123496	2024-04-08 19:31:25 +00:00
Michael Lazos	73e235f0a6	Swap to ID guard for optimizer Variable (#123496 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123496 Approved by: https://github.com/anijain2305	2024-04-08 19:28:25 +00:00
Chien-Chin Huang	6a3b47ec8f	[PT2D][DDP] Remove the hack to pass None as the process group (#123207 ) Functional collectives can now handle None as the process group. Differential Revision: [D55658338](https://our.internmc.facebook.com/intern/diff/D55658338/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123207 Approved by: https://github.com/kwen2501 ghstack dependencies: #123206	2024-04-08 19:24:29 +00:00
Darshan Sanghani	aa73d5bb5c	Register COLLECTIVE_COMM profiler activity type, if available (#121461 ) Summary: Instantiate a collective communications profiler. If a collectives profiler exists then add COLLECTIVE_COMM activity type to the CUDA activity types. Test Plan: Sample output trace (internal use only): https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/clientAPI/0/1707520266/devgpu007.pci1/nccl_activities_2643654_1707520266977.json.gz&bucket=gpu_traces Co-authored-by: Darshan Sanghani <dsang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121461 Approved by: https://github.com/aaronenyeshi	2024-04-08 18:23:56 +00:00
Chien-Chin Huang	a2327d203b	[PT2D][DDP] Remove some hacks to get the test work (#123206 ) It seems that these bugs have been fixed (not sure by which PRs), so we no longer need to disable buffer reuse. Differential Revision: [D55657388](https://our.internmc.facebook.com/intern/diff/D55657388/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123206 Approved by: https://github.com/kwen2501, https://github.com/yifuwang	2024-04-08 17:40:14 +00:00
PyTorch MergeBot	3e8d3577be	Revert "Swap to ID guard for optimizer Variable (#123496 )" This reverts commit 26bf05ccacc0377f0ef40d1d9c792c403267d5d5. Reverted https://github.com/pytorch/pytorch/pull/123496 on behalf of https://github.com/PaliC due to seems to have broken distributed/fsdp/test_fsdp_hybrid_shard.py as per `26bf05ccac` ([comment](https://github.com/pytorch/pytorch/pull/123496#issuecomment-2043251234))	2024-04-08 17:06:05 +00:00
PyTorch MergeBot	d9ac80f80c	Revert "Defer setting capturable in optimizer variable (#123497 )" This reverts commit 76b290344f917ee0b9e1c69863ae04354a298dd2. Reverted https://github.com/pytorch/pytorch/pull/123497 on behalf of https://github.com/PaliC due to seems to have broken distributed/fsdp/test_fsdp_hybrid_shard.py as per `26bf05ccac` ([comment](https://github.com/pytorch/pytorch/pull/123496#issuecomment-2043251234))	2024-04-08 17:06:05 +00:00
Yang Chen	e4e5449dfc	[aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation (#123136 ) After we codegen a triton kernel in the triton codegen backend, we cache the generated triton source code in the wrapper to avoid producing multiple triton kernels with the same content. In AOTI compilation flow, this caching mechanism imposes a strong requirement on the codegen that we must generate the same triton source code for the same schedule node in both python and cpp codegen phases. Otherwise, we would end up with a mismatch between the kernel name formed in the cpp codegen and the cuda kernel key produced from the python codegen. Consequently, we would hit a missing-cuda-kernel error. The precomputed symbol replacements saved in V.graph.sizevars can cause such source-code inconsistency related to the code for indexing tensors. For example, let's say in the python codegen phase, we produce "ks2*48" as part of indexing an input for schedule node A while yielding a replacement pair "ks0 -> ks2*48" in the precomputed replacements. In the second cpp codegen phase, we would produce "ks0" for the same indexing code of schedule node A due to the "ks0 -> ks2*48" replacement pair. This PR fixes the issue by clearing precomputed_replacements and inv_precomputed_replacements before cpp wrapper codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123136 Approved by: https://github.com/desertfire	2024-04-08 16:51:43 +00:00
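The two-phase mismatch described in the entry above can be modeled with a toy. This is a simplified sketch under stated assumptions, not inductor's code: `ToySizeVars`, `codegen_index`, and `run_two_phases` are hypothetical, and only the attribute names `precomputed_replacements` and `inv_precomputed_replacements` come from the commit message.

```python
class ToySizeVars:
    """Toy stand-in for the state that persists across codegen phases."""

    def __init__(self):
        self.precomputed_replacements = {}      # expression -> symbol
        self.inv_precomputed_replacements = {}  # symbol -> expression

    def register(self, expr, symbol):
        self.precomputed_replacements[expr] = symbol
        self.inv_precomputed_replacements[symbol] = expr

    def clear(self):
        self.precomputed_replacements.clear()
        self.inv_precomputed_replacements.clear()


def codegen_index(sizevars, expr):
    """Emit source for an index expression, reusing a known symbol if any."""
    return sizevars.precomputed_replacements.get(expr, expr)


def run_two_phases(clear_between):
    sv = ToySizeVars()
    # Python phase: no replacement exists yet, so the raw expression is
    # emitted; the pair "ks0 -> ks2*48" is recorded afterwards.
    python_src = codegen_index(sv, "ks2*48")
    sv.register("ks2*48", "ks0")
    if clear_between:
        sv.clear()
    # Cpp phase: a leftover replacement changes the emitted source.
    cpp_src = codegen_index(sv, "ks2*48")
    return python_src, cpp_src
```

Without clearing, the two phases emit different text for the same node, so a cache keyed on source text misses; with clearing, both phases emit the same text.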
Catherine Lee	61be8843c9	[TD] Use label to configure td on distributed for rollout (#122976 ) Gate TD on distributed behind label TODO: auto add label to certain people's prs Pull Request resolved: https://github.com/pytorch/pytorch/pull/122976 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi	2024-04-08 15:53:55 +00:00
DanilBaibak	4f66db80ca	Migrate linux-jammy-py3.8-gcc11-no-ops to ARC (#123432 ) Migrate linux-jammy-py3.8-gcc11-no-ops to ARC Pull Request resolved: https://github.com/pytorch/pytorch/pull/123432 Approved by: https://github.com/zxiiro, https://github.com/malfet, https://github.com/atalman	2024-04-08 15:50:53 +00:00
PyTorch UpdateBot	3a7351bf91	[xla hash update] update the pinned xla hash (#123549 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123549 Approved by: https://github.com/pytorchbot	2024-04-08 11:27:18 +00:00
Michael Lazos	76b290344f	Defer setting capturable in optimizer variable (#123497 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123497 Approved by: https://github.com/anijain2305 ghstack dependencies: #123496	2024-04-08 08:34:19 +00:00
Michael Lazos	26bf05ccac	Swap to ID guard for optimizer Variable (#123496 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123496 Approved by: https://github.com/anijain2305	2024-04-08 05:03:34 +00:00
Animesh Jain	bb04f3f66a	[dynamo][logger] Log graph break on Unsupported bytecodes (#122684 ) This would have saved me a few hours while debugging an internal model. We could not support a LOAD_ATTR bytecode, because it was a property, and the inlining failed due to skip. Since LOAD_ATTR does not support a continuation function, we would fall back to eager for the whole frame, aka skip. But we should also log this as a graph break. This PR does that. Bonus: removes skip from a test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122684 Approved by: https://github.com/ezyang	2024-04-08 01:50:04 +00:00
Animesh Jain	07cecf4168	[dynamo][cpp-guards] Fix bug for slices (#123516 ) Automatic testing as soon as we turn on cpp guards by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123516 Approved by: https://github.com/jansel ghstack dependencies: #123515	2024-04-07 21:09:05 +00:00
Animesh Jain	6ceec53579	[dynamo][cpp-guards] Fix test for CPP guard manager (#123515 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123515 Approved by: https://github.com/guilhermeleobas, https://github.com/jansel	2024-04-07 21:09:05 +00:00
Jason Ansel	212e460dce	[dynamo] Support custom __setattr__ on UserDefinedObjectVariable (#123318 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123318 Approved by: https://github.com/anijain2305	2024-04-07 21:06:52 +00:00
Oguz Ulgen	89724843bb	Use graph.find_nodes in pattern matcher (#122331 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122331 Approved by: https://github.com/jansel ghstack dependencies: #121565, #122255, #122256, #122257, #122258	2024-04-07 18:51:22 +00:00
Oguz Ulgen	5aab2b9acf	Use graph.find_nodes in functorch (#122258 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122258 Approved by: https://github.com/jansel ghstack dependencies: #121565, #122255, #122256, #122257	2024-04-07 18:51:22 +00:00
Oguz Ulgen	287680176b	Use graph.find_nodes in dynamo (#122257 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122257 Approved by: https://github.com/jansel ghstack dependencies: #121565, #122255, #122256	2024-04-07 18:51:18 +00:00
Oguz Ulgen	f8465df9f0	Use graph.find_nodes in inductor (#122256 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122256 Approved by: https://github.com/jansel ghstack dependencies: #121565, #122255	2024-04-07 18:51:14 +00:00
Oguz Ulgen	33783e43e9	Use graph.find_nodes in inductor/fx_passes (#122255 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122255 Approved by: https://github.com/jansel ghstack dependencies: #121565	2024-04-07 18:51:09 +00:00
Oguz Ulgen	03b13851d9	[FX] Add side table to FX Graph for O(1) op/target query (#121565 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121565 Approved by: https://github.com/jansel	2024-04-07 18:51:05 +00:00
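The idea of a side table for op/target queries can be sketched in plain Python. The classes below are a hypothetical toy and not the `torch.fx` implementation: the graph keeps a dictionary keyed by `(op, target)` next to its node list, so a query is a dictionary lookup instead of a scan over every node.

```python
from collections import defaultdict


class ToyNode:
    def __init__(self, op, target, name):
        self.op = op
        self.target = target
        self.name = name


class ToyGraph:
    def __init__(self):
        self.nodes = []
        # Side table: (op, target) -> insertion-ordered set of nodes,
        # modeled as a dict whose values are unused.
        self._side_table = defaultdict(dict)

    def add_node(self, op, target, name):
        node = ToyNode(op, target, name)
        self.nodes.append(node)
        self._side_table[(op, target)][node] = None
        return node

    def erase_node(self, node):
        # list.remove is linear in this toy; only the side table matters here.
        self.nodes.remove(node)
        del self._side_table[(node.op, node.target)][node]

    def find_nodes(self, op, target):
        """Look up matching nodes without scanning self.nodes."""
        return list(self._side_table.get((op, target), {}))


g = ToyGraph()
a = g.add_node("call_function", "add", "add")
g.add_node("call_function", "mul", "mul")
b = g.add_node("call_function", "add", "add_1")
found = [n.name for n in g.find_nodes("call_function", "add")]
g.erase_node(a)
found_after_erase = [n.name for n in g.find_nodes("call_function", "add")]
```

The cost of the approach is that every node insertion and removal must also update the table, which is why the toy routes both through `add_node` and `erase_node`.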
His-Wardship	0355f6e954	[Bug Fix] Fix Cuda 12.4 compilation - Refactor SFINAE boxing logic (#123377 ) Summary: PyTorch fails to compile from source using CUDA 12.4. The relevant log is extracted below. This was a recurring issue, which would cause the compilation to fail again on further objects if the first offending object was skipped. While searching for whether others had experienced this issue before attempting a fix myself, I found this suggested fix by @christian-heusel in https://github.com/pytorch/pytorch/issues/122169#issuecomment-2008455468 written by @lahwaacz. The code written by @lahwaacz at `bb1f1a4c54` fixes the issue. The original issue (#122169) seems to have gone quiet, so I am submitting this PR. I made no substantive adjustments to @lahwaacz's code. My only adjustment was, for the sake of consistency, to remove the double underscores in the struct name, as double underscores are reserved to the implementation in the C++ Standard. My change has no functional effect on the original code. The ArchLinux package from which the original code was committed is licensed under the BSD license. 
https://archlinux.org/packages/extra/x86_64/python-pytorch/ ``` [7900/8804] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/torch/csrc/distributed/c10d/quantization/quantization_gpu.cu.o FAILED: caffe2/CMakeFiles/torch_cuda.dir/__/torch/csrc/distributed/c10d/quantization/quantization_gpu.cu.o /usr/bin/ccache /usr/local/cuda-12.4/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUSPARSELT -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/elliot/compile_test-pytorch/build/aten/src -I/home/elliot/compile_test-pytorch/aten/src -I/home/elliot/compile_test-pytorch/build -I/home/elliot/compile_test-pytorch -I/home/elliot/compile_test-pytorch/cmake/../third_party/benchmark/include -I/home/elliot/compile_test-pytorch/third_party/onnx -I/home/elliot/compile_test-pytorch/build/third_party/onnx -I/home/elliot/compile_test-pytorch/third_party/foxi -I/home/elliot/compile_test-pytorch/build/third_party/foxi -I/home/elliot/compile_test-pytorch/aten/src/THC -I/home/elliot/compile_test-pytorch/aten/src/ATen/cuda -I/home/elliot/compile_test-pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/elliot/compile_test-pytorch/build/caffe2/aten/src -I/home/elliot/compile_test-pytorch/aten/src/ATen/.. -I/home/elliot/compile_test-pytorch/build/nccl/include -I/home/elliot/compile_test-pytorch/c10/cuda/../.. -I/home/elliot/compile_test-pytorch/c10/.. 
-I/home/elliot/compile_test-pytorch/third_party/tensorpipe -I/home/elliot/compile_test-pytorch/build/third_party/tensorpipe -I/home/elliot/compile_test-pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/elliot/compile_test-pytorch/torch/csrc/api -I/home/elliot/compile_test-pytorch/torch/csrc/api/include -isystem /home/elliot/compile_test-pytorch/build/third_party/gloo -isystem /home/elliot/compile_test-pytorch/cmake/../third_party/gloo -isystem /home/elliot/compile_test-pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/elliot/compile_test-pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/elliot/compile_test-pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/elliot/compile_test-pytorch/third_party/protobuf/src -isystem /home/elliot/miniforge3/envs/torchtest/include -isystem /home/elliot/compile_test-pytorch/third_party/gemmlowp -isystem /home/elliot/compile_test-pytorch/third_party/neon2sse -isystem /home/elliot/compile_test-pytorch/third_party/XNNPACK/include -isystem /home/elliot/compile_test-pytorch/third_party/ittapi/include -isystem /home/elliot/compile_test-pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda-12.4/include -isystem /home/elliot/compile_test-pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/elliot/compile_test-pytorch/third_party/ideep/include -isystem /home/elliot/compile_test-pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_86,code=sm_86 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda 
-Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DMKL_HAS_SBGEMM -DMKL_HAS_SHGEMM -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler=-Wall,-Wextra,-Wdeprecated,-Wno-unused-parameter,-Wno-unused-function,-Wno-missing-field-initializers,-Wno-unknown-pragmas,-Wno-type-limits,-Wno-array-bounds,-Wno-unknown-pragmas,-Wno-strict-overflow,-Wno-strict-aliasing,-Wno-maybe-uninitialized -Wno-deprecated-copy -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/torch/csrc/distributed/c10d/quantization/quantization_gpu.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/torch/csrc/distributed/c10d/quantization/quantization_gpu.cu.o.d -x cu -c /home/elliot/compile_test-pytorch/torch/csrc/distributed/c10d/quantization/quantization_gpu.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/torch/csrc/distributed/c10d/quantization/quantization_gpu.cu.o /home/elliot/compile_test-pytorch/aten/src/ATen/core/IListRef_inl.h: In static member function ‘static c10::detail::IListRefConstRef<at::OptionalTensorRef> c10::detail::IListRefTagImpl<c10::IListRefTag::Boxed, at::OptionalTensorRef>::iterator_get(const c10::List<std::optional<at::Tensor> >::const_iterator&)’: /home/elliot/compile_test-pytorch/aten/src/ATen/core/IListRef_inl.h:171:13: warning: possibly dangling reference to a temporary [-Wdangling-reference] 171 \| const auto& ivalue = (it).get(); \| ^~~~~~ /home/elliot/compile_test-pytorch/aten/src/ATen/core/IListRef_inl.h:171:33: note: the temporary was destroyed at the end of the full expression ‘(& it)->c10::impl::ListIterator<std::optional<at::Tensor>, __gnu_cxx::__normal_iterator<c10::IValue, std::vector<c10::IValue> > >::operator().c10::impl::ListElementReference<std::optional<at::Tensor>, __gnu_cxx::__normal_iterator<c10::IValue, std::vector<c10::IValue> > >::get()’ 171 \| const auto& 
ivalue = (it).get(); \| ~~~~~~~~~~~^~ /home/elliot/compile_test-pytorch/aten/src/ATen/core/boxing/impl/boxing.h: At global scope: /home/elliot/compile_test-pytorch/aten/src/ATen/core/boxing/impl/boxing.h:42:103: error: expected primary-expression before ‘>’ token 42 \| struct has_ivalue_to<T, std::void_t<decltype(std::declval<IValue>().to<T>())>> \| ^ /home/elliot/compile_test-pytorch/aten/src/ATen/core/boxing/impl/boxing.h:42:106: error: expected primary-expression before ‘)’ token 42 \| struct has_ivalue_to<T, std::void_t<decltype(std::declval<IValue>().to<T>())>> \| ^ /home/elliot/compile_test-pytorch/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h: In lambda function: /home/elliot/compile_test-pytorch/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h:154:24: warning: possibly dangling reference to a temporary [-Wdangling-reference] 154 \| for (const at::Tensor& tensor : ivalue.toTensorList()) { \| ^~~~~~ /home/elliot/compile_test-pytorch/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h:154:53: note: the temporary was destroyed at the end of the full expression ‘__for_begin .c10::impl::ListIterator<at::Tensor, __gnu_cxx::__normal_iterator<c10::IValue, std::vector<c10::IValue> > >::operator().c10::impl::ListElementReference<at::Tensor, __gnu_cxx::__normal_iterator<c10::IValue, std::vector<c10::IValue> > >::operator std::conditional_t<true, const at::Tensor&, at::Tensor>()’ 154 \| for (const at::Tensor& tensor : ivalue.toTensorList()) { \| ^ ... ninja: build stopped: subcommand failed. 
``` ``` PyTorch version: 2.4.0a0+git595613d Is debug build: False CUDA used to build PyTorch: 12.4 ROCM used to build PyTorch: N/A OS: Ubuntu 23.10 (x86_64) GCC version: (Ubuntu 13.2.0-4ubuntu3) 13.2.0 Clang version: 16.0.6 (15) CMake version: version 3.29.0 Libc version: glibc-2.38 Python version: 3.11.8 \| packaged by conda-forge \| (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] (64-bit runtime) Python platform: Linux-6.5.0-26-generic-x86_64-with-glibc2.38 Is CUDA available: True CUDA runtime version: 12.4.131 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 Ti Nvidia driver version: 550.67 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.9.0.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.0.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.0.0 /usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.0.0 /usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.0.0 /usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.0.0 /usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.0.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.0.0 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 46 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 24 On-line CPU(s) list: 0-23 Vendor ID: GenuineIntel Model name: 13th Gen Intel(R) Core(TM) i7-13700K CPU family: 6 Model: 183 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 1 Stepping: 1 CPU(s) scaling MHz: 19% CPU max MHz: 5400.0000 CPU min MHz: 800.0000 BogoMIPS: 6835.20 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt 
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities Virtualization: VT-x L1d cache: 640 KiB (16 instances) L1i cache: 768 KiB (16 instances) L2 cache: 24 MiB (10 instances) L3 cache: 30 MiB (1 instance) NUMA node(s): 1 NUMA node0 CPU(s): 0-23 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Versions of relevant libraries: [pip3] numpy==1.26.4 [pip3] optree==0.11.0 [pip3] pytorch-triton==3.0.0+989adb9a29 [pip3] torch==2.4.0a0+git595613d [conda] magma-cuda124 2.6.1 1 pytorch [conda] mkl-include 2024.1.0 intel_691 intel [conda] mkl-static 2024.1.0 intel_691 intel [conda] numpy 1.26.4 py311h64a7726_0 conda-forge [conda] optree 0.11.0 py311h9547e67_0 conda-forge [conda] pytorch-triton 3.0.0+989adb9a29 pypi_0 pypi [conda] torch 2.4.0a0+git595613d pypi_0 pypi ``` Tagging @colesbury per 
https://github.com/pytorch/pytorch/issues/122169#issuecomment-2008232619 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123377 Approved by: https://github.com/cyyever, https://github.com/malfet	2024-04-07 18:37:47 +00:00
Li-Huai (Allan) Lin	1ea2f1eaa1	[BE][MPS] Reorganize logics and naming in copy.mm (#123310 ) Was trying to address https://github.com/pytorch/pytorch/issues/119367 but hesitated to do so without knowing how the blit copy works under the hood. So I did some BE work on the naming and logic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123310 Approved by: https://github.com/kulinseth	2024-04-07 07:14:02 +00:00
Gao Tianlin	77681facac	[fix] inductor `split` lowering fails if `item()` is captured (#123032 ) Fixes #122937 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123032 Approved by: https://github.com/jansel	2024-04-07 04:23:57 +00:00
Jason Ansel	e3ea316623	[dynamo] Save/restore cublas_allow_tf32 in convert_frame (#123509 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123509 Approved by: https://github.com/anijain2305	2024-04-07 03:37:47 +00:00
Pearu Peterson	eff1e4899c	Add sparse COO/CSR/CSC/BSR/BSC meta tensor input support to torch.sum (#121673 ) As in the title. Fixes an issue reported in https://github.com/pytorch/pytorch/pull/117907#issuecomment-1987212514 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121673 Approved by: https://github.com/cpuhrsch	2024-04-06 21:11:22 +00:00
lezcano	7ce42ebd44	Generalise mod value ranges (#123253 ) We also add the usual comment where we note that we don't handle negative values in mod properly. We should also fix this in the definition of ModularIndexing. I'll do that in a later PR, as for that one I'll also need to fix a number of tests that are testing an incorrect behaviour. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123253 Approved by: https://github.com/peterbell10	2024-04-06 20:19:24 +00:00
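The value-range reasoning mentioned in 7ce42ebd44 can be illustrated with a small sketch. This is not the PR's implementation: `mod_upper_bound` is a hypothetical helper, and it assumes a non-negative dividend and a strictly positive modulus. The last lines show why negative operands need separate care: Python's `%` follows the sign of the divisor, while a C-style remainder (`math.fmod`) follows the sign of the dividend.

```python
import math


def mod_upper_bound(x_upper: int, m_upper: int) -> int:
    """Largest possible x % m for 0 <= x <= x_upper and 1 <= m <= m_upper.

    Hypothetical helper for illustration only.
    """
    if x_upper < 0 or m_upper < 1:
        raise ValueError("sketch only covers x >= 0 and m >= 1")
    # x % m never exceeds x, and it never exceeds m - 1.
    return min(x_upper, m_upper - 1)


def brute_force_upper_bound(x_upper: int, m_upper: int) -> int:
    """Exhaustively compute the same bound to check the formula."""
    return max(x % m for x in range(x_upper + 1) for m in range(1, m_upper + 1))


# Floor-style (Python) and truncation-style (C) remainders differ for negatives.
python_style = -7 % 3        # 2
c_style = math.fmod(-7, 3)   # -1.0
```

The brute-force function is only there to make the bound checkable on small ranges; a real range analysis would not enumerate values.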
Laith Sakka	caed7f6727	profile pt2 compile time with strobelight (#123311 ) For OSS, this diff adds a decorator @profile_sb_fbcode that is a no-op for non-Meta workloads. Facebook: With this diff someone can generate a strobelight profile for pt2 compilation. Users need to set the env variable TORCH_COMPILE_SL_PROFILE=TRUE. For example: ``` TORCH_COMPILE_SL_PROFILE=TRUE buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compiletime_profile_example ``` See the sample output below, at the end of the summary. The way this works is that a unique id is generated and associated with all samples that are collected for functions decorated with profile_sb_fbcode. This id can then be used to combine different strobelight profiles into one (for example, three compilation events happen in the code below). Right now the following two functions are annotated with profile_sb_fbcode: bw_compiler and _compile. If profile_sb_fbcode is called recursively, the recursive invocations are ignored and a log is printed. 
The output is: ``` Strobelight is enabled for pt2 compilation Unique user-id for this run is: 2024-04-03-13:59:49147091devvm4561.ash0.facebook.com You can use the following link to access the strobelight profile at the end of the run: https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22purposes%22%3A[]%2C%22end%22%3A%22now%22%2C%22start%22%3A%22-30%20days%22%2C%22filterMode%22%3A%22DEFAULT%22%2C%22modifiers%22%3A[]%2C%22sampleCols%22%3A[]%2C%22cols%22%3A[%22namespace_id%22%2C%22namespace_process_id%22]%2C%22derivedCols%22%3A[]%2C%22mappedCols%22%3A[]%2C%22enumCols%22%3A[]%2C%22return_remainder%22%3Afalse%2C%22should_pivot%22%3Afalse%2C%22is_timeseries%22%3Afalse%2C%22hideEmptyColumns%22%3Afalse%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22compare%22%3A%22none%22%2C%22samplingRatio%22%3A%221%22%2C%22metric%22%3A%22count%22%2C%22aggregation_field%22%3A%22async_stack_complete%22%2C%22top%22%3A10000%2C%22aggregateList%22%3A[]%2C%22param_dimensions%22%3A[%7B%22dim%22%3A%22py_async_stack%22%2C%22op%22%3A%22edge%22%2C%22param%22%3A%220%22%2C%22anchor%22%3A%220%22%7D]%2C%22order%22%3A%22weight%22%2C%22order_desc%22%3Atrue%2C%22constraints%22%3A[[%7B%22column%22%3A%22run_user%22%2C%22op%22%3A%22eq%22%2C%22value%22%3A[%22[%5C%222024-04-03-13:59:49147091devvm4561.ash0.facebook.com%5C%22]%22]%7D]]%2C%22c_constraints%22%3A[[]]%2C%22b_constraints%22%3A[[]]%2C%22ignoreGroupByInComparison%22%3Afalse%7D&view=GraphProfilerView&&pool=uber&graphprofiler_filter=&graphprofiler_column_to_sort_by=exclusive the link below takes you to the collected strobelight profile 
https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22dimensions%22%3A%5B%5D%2C%22param_dimensions%22%3A%5B%7B%22anchor%22%3A%220%22%2C%22param%22%3A%220%22%2C%22op%22%3A%22edge%22%2C%22dim%22%3A%22py_async_stack%22%7D%5D%2C%22constraints%22%3A%5B%5B%7B%22value%22%3A%5B%22%5B%5C%22-6800545191281321%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_id%22%7D%2C%7B%22value%22%3A%5B%22%5B%5C%222024-04-03-13%3A59%3A49147091devvm4561.ash0.facebook.com%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_user%22%7D%5D%5D%2C%22top%22%3A10000%2C%22end%22%3A%221712181610%22%2C%22start%22%3A%221712174410%22%7D&view=GraphProfilerView& 1 storbelight success runs out of 1 non-ignored runs. strobelight run id is: 6181728288420687 the link below takes you to the collected strobelight profile https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22dimensions%22%3A%5B%5D%2C%22param_dimensions%22%3A%5B%7B%22anchor%22%3A%220%22%2C%22param%22%3A%220%22%2C%22op%22%3A%22edge%22%2C%22dim%22%3A%22py_async_stack%22%7D%5D%2C%22constraints%22%3A%5B%5B%7B%22value%22%3A%5B%22%5B%5C%226181728288420687%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_id%22%7D%2C%7B%22value%22%3A%5B%22%5B%5C%222024-04-03-13%3A59%3A49147091devvm4561.ash0.facebook.com%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_user%22%7D%5D%5D%2C%22top%22%3A10000%2C%22end%22%3A%221712181621%22%2C%22start%22%3A%221712174421%22%7D&view=GraphProfilerView& 2 storbelight success runs out of 2 non-ignored runs. 
strobelight run id is: -1026103682715688 the link below takes you to the collected strobelight profile https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22dimensions%22%3A%5B%5D%2C%22param_dimensions%22%3A%5B%7B%22anchor%22%3A%220%22%2C%22param%22%3A%220%22%2C%22op%22%3A%22edge%22%2C%22dim%22%3A%22py_async_stack%22%7D%5D%2C%22constraints%22%3A%5B%5B%7B%22value%22%3A%5B%22%5B%5C%22-1026103682715688%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_id%22%7D%2C%7B%22value%22%3A%5B%22%5B%5C%222024-04-03-13%3A59%3A49147091devvm4561.ash0.facebook.com%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_user%22%7D%5D%5D%2C%22top%22%3A10000%2C%22end%22%3A%221712181647%22%2C%22start%22%3A%221712174447%22%7D&view=GraphProfilerView& 3 storbelight success runs out of 3 non-ignored runs. ``` Test Plan: Was tested on buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compiletime_profile_example This was also tested in one of the ads benchmarks ``` TORCH_COMPILE_SL_PROFILE =TRUE buck2 run mode/opt mode/inplace //pytorch/benchmark:run -- ads_mc_igctr_mc3_v0 -d cuda -t train --torchdynamo inductor ``` The results matches the results reported in https://fb.workplace.com/groups/257735836456307/permalink/657458576484029 Differential Revision: D55672271 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123311 Approved by: https://github.com/aorenste	2024-04-06 18:57:44 +00:00
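The behaviour described in caed7f6727 (an id attached to a profiled region, with recursive invocations ignored and logged) can be sketched with a re-entrancy guard. This is a minimal illustration, not the actual `profile_sb_fbcode` implementation: `profile_outermost`, `events`, and the per-call `uuid` id are all hypothetical, and the sketch records events in a list instead of starting a real profiler.

```python
import functools
import threading
import uuid

_state = threading.local()
events = []  # (kind, function name, run id) tuples, kept for inspection


def profile_outermost(fn):
    """Hypothetical decorator: 'profile' only the outermost decorated call."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if getattr(_state, "run_id", None) is not None:
            # Already inside a profiled region: ignore this invocation and log it.
            events.append(("ignored", fn.__name__, _state.run_id))
            return fn(*args, **kwargs)
        # Fresh id shared by everything recorded during this outermost call.
        _state.run_id = uuid.uuid4().hex
        events.append(("profiled", fn.__name__, _state.run_id))
        try:
            return fn(*args, **kwargs)
        finally:
            _state.run_id = None

    return wrapper


@profile_outermost
def inner(x):
    return x + 1


@profile_outermost
def outer(x):
    return inner(x) * 2
```

Calling `outer(1)` records one "profiled" event for `outer` and one "ignored" event for `inner`, both carrying the same id; calling `inner` on its own is profiled normally.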
PyTorch MergeBot	c66d503194	Revert "[Profiler][submodule] Make Kineto traces export ns granularity for finer timestamps (#122425 )" This reverts commit 6f7dd2f84a4237b31eac29054b86a5284ef6cb6b. Reverted https://github.com/pytorch/pytorch/pull/122425 on behalf of https://github.com/malfet due to Breaks ROCM builds ([comment](https://github.com/pytorch/pytorch/pull/122425#issuecomment-2041129241))	2024-04-06 16:19:00 +00:00
Nikita Shulga	89cbb2d86d	Allow docs build on workflow dispatch (#123493 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123493 Approved by: https://github.com/kit1980	2024-04-06 14:42:39 +00:00
PyTorch MergeBot	ecb2418dd6	Revert "Adding health check server hook in torch elastic (#122750 )" This reverts commit 61d431fab07f65d3e54c28f1ec420c517c7ada92. Reverted https://github.com/pytorch/pytorch/pull/122750 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/122750#issuecomment-2041104931))	2024-04-06 14:31:07 +00:00
Will Feng	7b02910163	[Compile FSDP2][2/n] Support streams created outside of compile region (#123487 ) FSDP2 creates CUDA streams outside of compile region in its 1st iteration eager run, and then torch.compile will attempt to record method calls on these streams (e.g. `stream.record_event()`) in >1st iteration compiled run. Before this PR, stream proxy is None which causes "None doesn't have attribute record_event" error when we try to call `record_event()` on it. After this PR, stream proxy has the correct value which makes calling methods on it possible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123487 Approved by: https://github.com/jansel	2024-04-06 08:42:42 +00:00
Shivam Raikundalia	6f7dd2f84a	[Profiler][submodule] Make Kineto traces export ns granularity for finer timestamps (#122425 ) Summary: Kineto traces use microsecond-level granularity because Chrome tracing defaults to that precision. Fix by adding a preprocessor flag to the TARGETS and BUCK files. Also remove any unnecessary ns-to-us conversions made in the profiler itself. This diff contains profiler changes only. Libkineto changes are found in D54964435. Test Plan: Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be the same as master. Tracing with flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_37_22.4155151.pt.trace.json.gz&bucket=gpu_traces Tracing without flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_39_15.4166047.pt.trace.json.gz&bucket=gpu_traces Tracing on main: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_42_43.4177559.pt.trace.json.gz&bucket=gpu_traces Ran key_averages() to make sure the FunctionEvent code is working as expected:

| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | # of Calls |
|---|---|---|---|---|---|---|---|---|---|---|
| ProfilerStep* | 0.74% | 3.976ms | 64.40% | 346.613ms | 69.323ms | 0.000us | 0.00% | 61.710ms | 12.342ms | 5 |
| Optimizer.zero_grad#SGD.zero_grad | 0.76% | 4.109ms | 0.76% | 4.109ms | 821.743us | 0.000us | 0.00% | 0.000us | 0.000us | 5 |
| ## forward ## | 6.89% | 37.057ms | 27.19% | 146.320ms | 29.264ms | 0.000us | 0.00% | 58.708ms | 11.742ms | 5 |
| aten::conv2d | 0.22% | 1.176ms | 7.74% | 41.658ms | 157.199us | 0.000us | 0.00% | 27.550ms | 103.962us | 265 |
| aten::convolution | 0.79% | 4.273ms | 7.52% | 40.482ms | 152.762us | 0.000us | 0.00% | 27.550ms | 103.962us | 265 |
| aten::_convolution | 0.69% | 3.688ms | 6.73% | 36.209ms | 136.637us | 0.000us | 0.00% | 27.550ms | 103.962us | 265 |
| aten::cudnn_convolution | 6.04% | 32.520ms | 6.04% | 32.520ms | 122.719us | 27.550ms | 8.44% | 27.550ms | 103.962us | 265 |
| aten::add_ | 2.42% | 13.045ms | 2.42% | 13.045ms | 30.694us | 12.700ms | 3.89% | 12.700ms | 29.882us | 425 |
| aten::batch_norm | 0.19% | 1.027ms | 8.12% | 43.717ms | 164.971us | 0.000us | 0.00% | 16.744ms | 63.185us | 265 |
| aten::_batch_norm_impl_index | 0.31% | 1.646ms | 7.93% | 42.691ms | 161.096us | 0.000us | 0.00% | 16.744ms | 63.185us | 265 |

Differential Revision: D55087993 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122425 Approved by: https://github.com/aaronenyeshi	2024-04-06 06:04:28 +00:00
Shunting Zhang	a4ef9cdd28	benchmark: raise tolerance to unblock triton upgrade (#123484 ) Debugging is happening in https://github.com/pytorch/pytorch/issues/123126 . Upgrading triton causes accuracy failures for mixer_b16_224 and levit_128. mixer_b16_224 was debugged specifically: the failure is due to extra FMA instructions being used in a single kernel, and that kernel by itself introduces only a small numerical difference. We conclude that this is not a 'real' accuracy issue and that we should raise the tolerance to unblock the triton pin update. The tolerance is picked such that the CI accuracy test can pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123484 Approved by: https://github.com/jansel	2024-04-06 03:43:25 +00:00
Zhicheng Yan	77643ed2eb	[torch quantization]raise exception when OOM during combine histogram in observer (#123309 ) Summary: Even with the changes in D55347133, it is still possible to OOM in the histogram observer, because the size of the allocated tensor also depends on downsample_rate. For example, I still see OOM from an attempt to allocate a 10GB+ histogram tensor in a multi-task model. To handle this better, we use a try-catch clause and raise an exception when combining histograms would OOM. Empirically, we set the max size of a single histogram tensor to 1 GB. Test Plan: Test the change for a Multi-Task model (depth + segmentation) Differential Revision: D55567292 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123309 Approved by: https://github.com/jerryzh168	2024-04-06 03:15:02 +00:00
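The size cap described in 77643ed2eb can be sketched as a check made before allocating, with the caller catching the resulting exception. This is an illustration under stated assumptions, not the observer's actual code: `checked_histogram_bytes`, `try_combine`, the 4-byte bin size, and the use of `MemoryError` are all hypothetical; only the 1 GB limit comes from the commit message.

```python
MAX_HISTOGRAM_BYTES = 1 << 30  # 1 GiB cap, mirroring the "1 GB" limit in the message


def checked_histogram_bytes(num_bins: int, bytes_per_bin: int = 4,
                            max_bytes: int = MAX_HISTOGRAM_BYTES) -> int:
    """Return the bytes a histogram would need, or raise if it exceeds the cap."""
    required = num_bins * bytes_per_bin
    if required > max_bytes:
        raise MemoryError(
            f"histogram needs {required} bytes, cap is {max_bytes} bytes"
        )
    return required


def try_combine(num_bins: int) -> bool:
    """Report whether combining is allowed instead of crashing the process."""
    try:
        checked_histogram_bytes(num_bins)
    except MemoryError:
        return False
    return True
```

Checking the size up front turns an out-of-memory crash into an ordinary exception that the caller can handle, for example by skipping the update.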
Animesh Jain	d3596cf004	[dynamo][cpp-guards] Fix missing decref in GradGuardAccessor (#123488 ) Found that there was a peak mem increase while running HF suite. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123488 Approved by: https://github.com/jansel ghstack dependencies: #123485	2024-04-06 03:09:29 +00:00
sraikund16	6fa72480d3	Enhance RecordFunctionFast input args and use input args in triton_heuristics.py (#123459 ) Summary: Now that we can input shapes as input args for RecordFunctionFast, let's add that to the triton heuristics. Also, let's add the ability to pass a tuple to the RecordFunctionFast constructor. Test Plan: Ran both the _inductor/test_profile.py and profiler/test_profiler.py unit tests. Also added a tuple-based unit test to profiler/test_profiler.py. Ran record_function_fast.py from the following branch: https://github.com/pytorch/pytorch/compare/sraikund/record_funct_test?expand=1 Cases: "No shape or args" tests function fast with no args and profile without record_shapes; "With shape" tests function fast with args and profile with record_shapes true; "Args no shape" tests function fast with args passed but record_shapes set to false; "Args shape tuple" tests function fast with args passed as a tuple and record_shapes true. Stdout: No shape or args:: 1.8491458892822266 us With shape:: 2.211381196975708 us Args no shape:: 1.9212646484375 us With shape tuple:: 2.245788335800171 us Differential Revision: D55809967 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123459 Approved by: https://github.com/davidberard98	2024-04-06 02:44:06 +00:00
Animesh Jain	8c84fe3c86	[dynamo][guards] Forward fix for #123302 (#123485 ) For some reason, adding a `TYPE_CHECK` in the DATA_PTR_MATCH guard in https://github.com/pytorch/pytorch/issues/123302 increases optimizer guard overhead for `MT5ForConditionalGeneration` by 10x. There is nothing special about MT5. As we are going to move towards the CPP guards soon, there is no reason to investigate this deeper. We can use `ID_MATCH` instead of `DATA_PTR` match. Today neither can be serialized, so there is no reason to prefer one over the other. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123485 Approved by: https://github.com/mlazos	2024-04-06 02:34:06 +00:00
William Wen	841112d074	[dynamo, 3.12] fix graph break issues with BINARY/STORE_SLICE (#123401 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123401 Approved by: https://github.com/jansel ghstack dependencies: #123392	2024-04-06 02:19:15 +00:00
William Wen	284b07ba63	[dynamo, 3.12] fix block stack related issues (#123392 ) `JUMP_BACKWARD` in 3.12+ may not be in the exception table even though it should be considered a part of the block. Also fix an issue where we didn't propagate the exception table entry to new instructions when expanding the `POP_JUMP_IF_[NOT_]NONE` instruction. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123392 Approved by: https://github.com/jansel	2024-04-06 02:19:15 +00:00
Peter Bell	9189d04cb1	[inductor] Add explicit ops.fma and use it in softmax_backward (#122518 ) This allows us to generate an fma even when fp-fusion is disabled in the compiler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122518 Approved by: https://github.com/lezcano, https://github.com/Chillee	2024-04-06 02:15:16 +00:00
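For context on what an explicit fma changes numerically (9189d04cb1), the sketch below contrasts a fused multiply-add, which rounds once, with a separate multiply and add, which round twice. It is an emulation in plain Python using exact rational arithmetic, not Inductor's `ops.fma`; the function names are hypothetical.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate FMA rounding: compute a*b + c exactly, then round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Round after the multiply and again after the add."""
    return a * b + c


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 in double precision,
# so the unfused result loses the small term entirely.
unfused = unfused_multiply_add(a, b, c)  # 0.0
fused = fused_multiply_add(a, b, c)      # -2**-60
```

Differences of this size are why toggling fusion in a compiler can shift results slightly without indicating a real accuracy bug.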
Peter Bell	3e8c64a637	[AOTInductor] Fix non-determinism in CUfunction declarations (#123266 ) These use the ordering of sets and dictionaries to determine the output order, which leads to run-to-run variance in the output code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123266 Approved by: https://github.com/desertfire	2024-04-06 02:08:17 +00:00
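The source of the run-to-run variance named in 3e8c64a637 is easy to reproduce in plain Python: iteration order over a set of strings can change between processes because string hashing is randomized by default. A minimal sketch of the usual remedy is to sort before emitting. The function name and the declaration text are hypothetical and are not the generated code's real format.

```python
def emit_cufunction_declarations(kernel_names):
    """Return one declaration line per kernel, in a stable (sorted) order."""
    # sorted() removes any dependence on set or dict iteration order.
    return [f"CUfunction {name} = nullptr;" for name in sorted(kernel_names)]


# The same names in any container or insertion order give identical output.
from_set = emit_cufunction_declarations({"triton_b", "triton_a", "triton_c"})
from_list = emit_cufunction_declarations(["triton_c", "triton_a", "triton_b"])
```

Sorting is only one option; keeping an insertion-ordered container from the start would also make the output reproducible.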
PyTorch MergeBot	e94b81b254	Revert "Enable tests disabled for #115607 (#123314 )" This reverts commit 9564e204c1616ce78434abfdea0f3fd428b675f3. Reverted https://github.com/pytorch/pytorch/pull/123314 on behalf of https://github.com/atalman due to break TestOptimRenewedCPU::test_foreach_matches_forloop_Adamax_cpu_float64 ([comment](https://github.com/pytorch/pytorch/pull/123314#issuecomment-2040854499))	2024-04-06 01:59:22 +00:00
briancoutinho	239abb2a14	add record function Id to Torch ops (#122948 ) Fixes #122833 Add record function ID as additional metadata for PyTorch op events. This enables correlation with PyTorch Execution traces. * Adds a new field "Record function id" for all PyTorch Op events. This value comes from `handle` in record function callback. This is a unique ID to correlate with the PyTorch Execution Trace. * Updated unit tests. ## Test Run a simple example uncommenting the `print trace` in the test below ```pytest test/profiler/test_profiler.py -k test_execution_trace_with_kineto``` We can see the new record function ID field in ET and Kineto Note: the name is "Record function id" now to match the other strings Kineto ![Screenshot 2024-03-28 at 5 48 55 PM](https://github.com/pytorch/pytorch/assets/6922212/08243698-8167-4ea0-9be6-2aede9fe9c43) Execution Trace. ![Screenshot 2024-03-28 at 5 49 14 PM](https://github.com/pytorch/pytorch/assets/6922212/22e4e876-9fbe-43da-9150-dae2927b6e31) We also see for cases where "External ID" is drifting but "Record function ID" is still matching. Kineto ![Screenshot 2024-03-28 at 5 50 34 PM](https://github.com/pytorch/pytorch/assets/6922212/60905ea4-0da1-4c4b-a0d0-24500e8f7006) Execution Trace ![Screenshot 2024-03-28 at 5 50 28 PM](https://github.com/pytorch/pytorch/assets/6922212/680db244-6725-48bf-a7ab-995c658a01ee) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122948 Approved by: https://github.com/davidberard98, https://github.com/shengfukevin	2024-04-06 01:35:03 +00:00
drisspg	f4e2a226aa	ScoreMod API (#121845 )

# Summary

This PR adds a new higher-order op: `templated_attention`. This op is designed to extend the functionality of `torch.nn.functional.scaled_dot_product_attention`. PyTorch has efficient pre-written fused-attention kernels. However, users want to modify how scores are computed (a substep inside attention), which traditionally requires the user to write their own attention kernel. One such modification to attention scores that is not currently supported by the top-level SDPA op is [Attention with Linear Biases (ALiBi)](https://arxiv.org/abs/2108.12409). This higher-order op instead accepts a callable (`score_mod`) that, through torch.compile, is used to create an efficient attention kernel instantiation.

### Details

This HOP utilizes the existing FX and HOP infra to capture the user's `score_mod` function and convert it to an FX graph module. Inductor then consumes this HOP, which has an `ir.Subgraph` input. It inlines this lowered subgraph into a Triton kernel that performs fused attention with the modification to the scores matrix inlined.

### API

The API for a score_mod function should be as follows:

```Python
def score_mod(score: torch.Tensor, batch: torch.Tensor, head: torch.Tensor, token_q: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor: ...
```

This function receives five parameters:
- `score`: A scalar tensor representing the attention score, with the same data type and device as the query, key, and value tensors.
- `batch`, `head`, `token_q`, `token_kv`: Scalar tensors indicating the batch index, head index, query index, and key/value index, respectively, with torch.int data type and located on the same device as the score tensor.

Consider inputs query, key, value of shapes (2, 4, 16, 8), leading to an intermediate attention score matrix of shape (2, 4, 16, 16). The score_mod function will be vectorized over each element of this matrix.
For instance, modifying the score at the position corresponding to the 0th batch, 2nd head, between the 8th query and the 9th key element, would be invoked as:

```Python
score_mod(score[0,2,8,9], torch.tensor(0), torch.tensor(2), torch.tensor(8), torch.tensor(9))
```

### Examples

```Python
import torch
from torch.nn.attention.templated_attention import templated_attention

torch.manual_seed(0)

# Lets create some input tensors
# The input tensor has shape (batch_size, num_heads, seq_len, head_dim)
query = torch.randn(8, 8, 2048, 64, device="cuda", dtype=torch.float32)
key = torch.randn(8, 8, 2048, 64, device="cuda", dtype=torch.float32)
value = torch.randn(8, 8, 2048, 64, device="cuda", dtype=torch.float32)

# Lets create a fun new score_modification! I will call this
# Checkerboard. It will reduce the score for neighboring tokens (1 step apart)
# in the sequence. And increase the score for tokens 2 steps apart. For everything
# else, the score will remain the same.
def checkerboard(score, batch, head, token_q, token_kv):
    score = torch.where(torch.abs(token_kv - token_q) == 1, score * 0.5, score)
    score = torch.where(torch.abs(token_kv - token_q) == 2, score * 2.0, score)
    return score

# Lets call templated_attention with this new score modification
output = templated_attention(query, key, value, score_mod=checkerboard)

compiled_templated_attention = torch.compile(templated_attention)
out_compiled = compiled_templated_attention(query, key, value, score_mod=checkerboard)

torch.testing.assert_close(output, out_compiled, atol=2e-2, rtol=2e-2)
```

### Future Work

- This PR is currently forward only. However, the Triton kernel for backwards, where score modifications do not rely on external buffers, has been explored here: https://github.com/drisspg/transformer_nuggets/blob/main/transformer_nuggets/flash/flash_attention.py
- Kernel Improvements: There have been some larger updates to the fused attention implementation that Triton uses in its tutorials.
The implementation of this kernel is based on a prior version and should be updated.
- We may want to unify this API under the top level SDPA API and leave that as a follow up once this is more stable
- Should we error on CPU?
- There are some issues with dynamic shapes
- Capturing of free variables and lifting to inputs to the subgraph is not working correctly today

### Performance

Comparisons generated by this benchmark:

| Type | Speedup | batch_size | num_heads | q_seq_len | k_seq_len | head_dim | score_mod | dtype |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average | 5.412 | | | | | | | |
| Max | 8.882 | 16 | 16 | 4096 | 4096 | 64 | relative_bias | torch.bfloat16 |
| Min | 3.645 | 8 | 16 | 512 | 512 | 64 | causal_mask | torch.bfloat16 |
| Min | 0.345 | 1 | 16 | 1024 | 1024 | 64 | pathological | torch.bfloat16 |

For reference

| Configuration | Forward Time (µs) | Backend | Speedup |
|-----------------------------------------------|--------------------------|------------------|---------|
| Fastest Config in Sweep (`8 16 4096 4096 64 relative_bias torch.bfloat16`) | 3608 | Templated Attention | 1.0 |
| Compiled SDPA (No Mask) | 9928 | Math | 2.75x |
| Compiled SDPA (With Mask) | 11898 | Math | 3.29x |
| Compiled SDPA (With Mask) | 8704 | Memory Efficient Attention | 2.42x |
| Compiled SDPA (No Mask) | 2548 | FlashAttention2 | 0.706x |

The speedups measure compiled templated attention speed versus different calls to torch.nn.functional.scaled_dot_product_attention.

<details>
<summary> FULL PERFORMANCE SWEEP NUMBERS </summary>

| batch_size | num_heads | q_seq_len | k_seq_len | head_dim | score_mod | dtype | eager_time | compiled_time | speedup |
|--------------|-------------|-------------|-------------|------------|---------------|----------------|--------------|-----------------|-----------|
| 1 | 16 | 512 | 512 | 64 | causal_mask | torch.bfloat16 | 331.444 | 67.221 | 4.931 |
| 1 | 16 | 512 | 512 | 64 | relative_bias | torch.bfloat16 | 335.300 | 64.187 | 5.224 |
| 1 | 16 | 512 | 512 | 64 | head_bias | torch.bfloat16 | 352.039 | 63.806 | 5.517 |
| 1 | 16 | 512 | 512 | 64 | pathological | torch.bfloat16 | 371.699 | 711.349 | 0.523 |
| 1 | 16 | 1024 | 1024 | 64 | causal_mask | torch.bfloat16 | 333.488 | 86.455 | 3.857 |
| 1 | 16 | 1024 | 1024 | 64 | relative_bias | torch.bfloat16 | 322.363 | 82.469 | 3.909 |
| 1 | 16 | 1024 | 1024 | 64 | head_bias | torch.bfloat16 | 349.967 | 82.233 | 4.256 |
| 1 | 16 | 1024 | 1024 | 64 | pathological | torch.bfloat16 | 486.359 | 1412.453 | 0.344 |
| 1 | 16 | 4096 | 4096 | 64 | causal_mask | torch.bfloat16 | 2794.597 | 551.188 | 5.070 |
| 1 | 16 | 4096 | 4096 | 64 | relative_bias | torch.bfloat16 | 3965.150 | 513.101 | 7.728 |
| 1 | 16 | 4096 | 4096 | 64 | head_bias | torch.bfloat16 | 2408.013 | 504.759 | 4.771 |
| 1 | 16 | 4096 | 4096 | 64 | pathological | torch.bfloat16 | 6850.531 | 16733.675 | 0.409 |
| 8 | 16 | 512 | 512 | 64 | causal_mask | torch.bfloat16 | 441.939 | 123.576 | 3.576 |
| 8 | 16 | 512 | 512 | 64 | relative_bias | torch.bfloat16 | 560.379 | 116.710 | 4.801 |
| 8 | 16 | 512 | 512 | 64 | head_bias | torch.bfloat16 | 421.172 | 115.825 | 3.636 |
| 8 | 16 | 512 | 512 | 64 | pathological | torch.bfloat16 | 994.492 | 2132.806 | 0.466 |
| 8 | 16 | 1024 | 1024 | 64 | causal_mask | torch.bfloat16 | 1436.430 | 309.495 | 4.641 |
| 8 | 16 | 1024 | 1024 | 64 | relative_bias | torch.bfloat16 | 1892.216 | 290.186 | 6.521 |
| 8 | 16 | 1024 | 1024 | 64 | head_bias | torch.bfloat16 | 1360.665 | 282.956 | 4.809 |
| 8 | 16 | 1024 | 1024 | 64 | pathological | torch.bfloat16 | 3525.532 | 8359.702 | 0.422 |
| 8 | 16 | 4096 | 4096 | 64 | causal_mask | torch.bfloat16 | 22026.839 | 3864.604 | 5.700 |
| 8 | 16 | 4096 | 4096 | 64 | relative_bias | torch.bfloat16 | 31262.746 | 3609.551 | 8.661 |
| 8 | 16 | 4096 | 4096 | 64 | head_bias | torch.bfloat16 | 20219.079 | 3480.402 | 5.809 |
| 8 | 16 | 4096 | 4096 | 64 | pathological | torch.bfloat16 | 54654.647 | 116652.357 | 0.469 |
| 16 | 16 | 512 | 512 | 64 | causal_mask | torch.bfloat16 | 820.606 | 188.683 | 4.349 |
| 16 | 16 | 512 | 512 | 64 | relative_bias | torch.bfloat16 | 1058.362 | 179.295 | 5.903 |
| 16 | 16 | 512 | 512 | 64 | head_bias | torch.bfloat16 | 784.372 | 175.714 | 4.464 |
| 16 | 16 | 512 | 512 | 64 | pathological | torch.bfloat16 | 1890.792 | 4212.877 | 0.449 |
| 16 | 16 | 1024 | 1024 | 64 | causal_mask | torch.bfloat16 | 2781.830 | 557.017 | 4.994 |
| 16 | 16 | 1024 | 1024 | 64 | relative_bias | torch.bfloat16 | 3694.050 | 525.249 | 7.033 |
| 16 | 16 | 1024 | 1024 | 64 | head_bias | torch.bfloat16 | 2634.164 | 507.613 | 5.189 |
| 16 | 16 | 1024 | 1024 | 64 | pathological | torch.bfloat16 | 6959.917 | 15331.116 | 0.454 |
| 16 | 16 | 4096 | 4096 | 64 | causal_mask | torch.bfloat16 | 43889.096 | 7582.018 | 5.789 |
| 16 | 16 | 4096 | 4096 | 64 | relative_bias | torch.bfloat16 | 62784.293 | 7075.846 | 8.873 |
| 16 | 16 | 4096 | 4096 | 64 | head_bias | torch.bfloat16 | 40308.606 | 6829.587 | 5.902 |
| 16 | 16 | 4096 | 4096 | 64 | pathological | torch.bfloat16 | 108892.137 | 233090.953 | 0.467 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121845 Approved by: https://github.com/Chillee, https://github.com/zou3519	2024-04-06 01:10:44 +00:00
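The per-element semantics described in f4e2a226aa ("vectorized over each element of this matrix") can be spelled out with a slow reference loop. This is a plain-Python sketch over nested lists, written only to show which arguments each call receives; `apply_score_mod` is a hypothetical helper and says nothing about how the fused kernel is actually generated. The `checkerboard` function is the scalar analogue of the one in the commit message.

```python
def apply_score_mod(scores, score_mod):
    """Apply score_mod to every element of scores[batch][head][q][kv]."""
    return [
        [
            [
                [score_mod(s, b, h, q, kv) for kv, s in enumerate(row)]
                for q, row in enumerate(head_scores)
            ]
            for h, head_scores in enumerate(batch_scores)
        ]
        for b, batch_scores in enumerate(scores)
    ]


def checkerboard(score, batch, head, token_q, token_kv):
    """Halve scores 1 step apart, double scores 2 steps apart."""
    distance = abs(token_kv - token_q)
    if distance == 1:
        return score * 0.5
    if distance == 2:
        return score * 2.0
    return score


# One batch, one head, 3 queries x 3 keys, all scores equal to 1.0.
ones = [[[[1.0, 1.0, 1.0] for _ in range(3)]]]
modified = apply_score_mod(ones, checkerboard)
# modified[0][0] == [[1.0, 0.5, 2.0], [0.5, 1.0, 0.5], [2.0, 0.5, 1.0]]
```

A reference like this can serve as a correctness oracle for small shapes when experimenting with new score modifications.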
Animesh Jain	8e98fda7a9	[dynamo][easy] Add AC test and improve graph break message (#121394 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121394 Approved by: https://github.com/yanboliang	2024-04-06 01:02:45 +00:00
PyTorch MergeBot	954d750516	Revert "Enable dynamo'd tests disabled for #115679 (#123315 )" This reverts commit d472ebf94a3f3a3dec31e9d8b2038127b2309727. Reverted https://github.com/pytorch/pytorch/pull/123315 on behalf of https://github.com/atalman due to break TestOptimRenewedCPU::test_foreach_matches_forloop_Adamax_cpu_float64 ([comment](https://github.com/pytorch/pytorch/pull/123315#issuecomment-2040835229))	2024-04-06 00:57:42 +00:00
Michael Lazos	d9d25076fe	Reduce guards of optimizer state dict to guard once per param group (#123413 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123413 Approved by: https://github.com/anijain2305	2024-04-06 00:12:59 +00:00
Yuzhen Huang	f7e41a2b7a	[pt2] Clean up for removing 2 decompose patterns (#123422 ) Summary: Follow up for D55759235. should_decompose_mmt and should_decompose_mm_largek should be removed as well. Test Plan: NA Differential Revision: D55786581 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123422 Approved by: https://github.com/jackiexu1992	2024-04-05 23:31:51 +00:00
Michael Lazos	d472ebf94a	Enable dynamo'd tests disabled for #115679 (#123315 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123315 Approved by: https://github.com/janeyx99 ghstack dependencies: #123313, #123314	2024-04-05 23:21:53 +00:00
Michael Lazos	9564e204c1	Enable tests disabled for #115607 (#123314 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123314 Approved by: https://github.com/janeyx99 ghstack dependencies: #123313	2024-04-05 23:21:53 +00:00
Gagan Jain	61d431fab0	Adding health check server hook in torch elastic (#122750 ) Summary: Adds a hook that lets an external mechanism monitor the health of the torch elastic launcher. The health check server depends on FileTimerServer to determine whether the launcher is healthy; it always reports healthy if FileTimerServer is disabled. The implementation of start_healthcheck_server is unsupported; however, a TCP/HTTP server can be started on a specific port to monitor the liveness of worker_watchdog and take action accordingly. Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test Differential Revision: D55108182 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122750 Approved by: https://github.com/kurman	2024-04-05 23:17:30 +00:00
Animesh Jain	22b9987144	[dynamo][cpp-guards] ListGetItemGuardAccessor and TupleGetItemGuardAccessor (#123396 ) Speeds up the guard-overhead microbenchmark by around 10% normalized to main-branch CPP guards
~~~
import torch

@torch.compile(backend="eager")
def fn(x, lst):
    for l in lst:
        x = x + l
    return x

n = 1000
lst = [i for i in range(n)]

x = torch.randn(4)
print(fn(x, lst))
print("Success")
~~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123396 Approved by: https://github.com/jansel ghstack dependencies: #123285, #123302, #123303	2024-04-05 22:10:04 +00:00
rzou	cd6c58baea	[custom_ops] mutated_args -> mutates_args (#123437 ) This seemed better, since when you're constructing a custom op you need to provide "the args that the custom op mutates". Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123437 Approved by: https://github.com/albanD ghstack dependencies: #123108, #123109, #123110, #123129	2024-04-05 22:03:51 +00:00
rzou	81e7a7c955	Add mutated_args field to custom_op (#123129 ) If provided, we: - autogenerate an ADInplaceOrView implementation - assume that no mutated inputs are returned as outputs. There are already aliasing runtime checks that check this. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123129 Approved by: https://github.com/albanD ghstack dependencies: #123108, #123109, #123110	2024-04-05 22:03:51 +00:00
rzou	9e8d2b6de2	Add register_autograd to register backward formulas for custom ops (#123110 ) The user provides a `setup_context` and a `backward_function`. These get put into a torch.autograd.Function that gets registered as the custom op's autograd implementation. Test Plan: - we update custom ops in the custom_op_db to use the new register_autograd API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123110 Approved by: https://github.com/albanD ghstack dependencies: #123108, #123109	2024-04-05 22:03:47 +00:00
rzou	d8e1c1087d	Add is_tensorlist_like_type helper (#123109 ) Checks if the type of an argument in a schema is some form of TensorList. Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/123109 Approved by: https://github.com/albanD ghstack dependencies: #123108	2024-04-05 22:03:42 +00:00
rzou	067851dd0d	Expand is_functional_schema to work with torch._C._FunctionSchema (#123108 ) Previously it worked with torchgen.model.FunctionSchema. This PR extends it to work with torch._C._FunctionSchema by making torchgen.model.FunctionSchema look more like torch._C._FunctionSchema. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123108 Approved by: https://github.com/albanD	2024-04-05 22:03:39 +00:00
Pian Pawakapan	42c2a5477c	[export] nn_module_stack to return class name str (#123308 ) Previously, `node.meta["nn_module_stack"]` had type `Dict[str, Tuple[str, class]]` when exported, and later `Dict[str, Tuple[str, str]]` after de/serialization. This PR changes it to consistently be `Dict[str, Tuple[str, str]]` for round-trippability, i.e. ``` {..., 'L__self___conv': ('conv', 'torch.nn.modules.conv.Conv2d')} ``` `source_fn_stack` is left untouched in this PR. note: the `Union[type, str]` type annotations in ONNX are because ONNX goes through both `export.export()` and `_dynamo.export()` (which still has the original `Dict[str, Tuple[str, class]]` format). nn_module_stack from `export.export()` should consistently have the new format, and we verify/test for that in `_trace.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123308 Approved by: https://github.com/zhxchen17, https://github.com/thiagocrepaldi	2024-04-05 21:48:22 +00:00
Adnan Akhundov	63c221b7fa	Clone mutated inputs in first pass of CPP wrapper compilation (#123316 ) Summary: CPP wrapper compilation is currently done in two passes: in the first pass, the Python wrapper is generated and run to compile Triton kernels as a side effect; in the second pass, the C++ wrapper is generated and compiled. When model inputs are mutated, running the Python wrapper in the first pass mutates the inputs, even though the first pass (including the Python wrapper run) is strictly part of the compilation process and hence must not introduce any side effects on the example inputs. In this PR, we clone mutated inputs in the first pass to avoid input mutation. Fixes https://github.com/pytorch/pytorch/issues/117364. Test Plan:
```
$ TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k test_inductor_layout_optimization_input_mutations_cuda
...
.
----------------------------------------------------------------------
Ran 1 test in 6.368s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123316 Approved by: https://github.com/jansel, https://github.com/chenyang78, https://github.com/desertfire	2024-04-05 21:47:19 +00:00
Peter Bell	4946558dd4	[minifier] Don't recompile for accuracy minification (#123005 ) `backend_aot_accuracy_fails` reruns `compile_fx_inner` on the real inputs which means the graph is recompiled with static shapes. This meant accuracy failures related to dynamic shapes would never be captured by `REPRO_AFTER=aot`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123005 Approved by: https://github.com/ezyang	2024-04-05 21:22:57 +00:00
Huy Do	f5b8c9b730	Ignore some known duplicated modules in doc build config script (#123425 ) This is a follow-up fix of https://github.com/pytorch/pytorch/pull/123244#discussion_r1552935150 as @clee2000 points out a better way to ignore those duplicated entries. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123425 Approved by: https://github.com/clee2000	2024-04-05 21:12:14 +00:00
Shawn Xu	b0c86a5bc1	[PT] [ST] support non contiguous rank validation in sharded tensor (#123230 ) Summary: Previously, the validation logic assumed that the sharded tensors' global ranks range over `[0 .. WS]`. This is true for 1D flat sharding, but with 2D+ sharding the ranks may no longer be contiguous, e.g.
```
[0, 2]
[1, 3]
```
The group size is 2 but ranks may be >= 2. Going forward, ST will be replaced by DTensor, so this is less of an issue; this change just makes it work for stacks that still rely on ST (like torchrec). Test Plan: added UT CI Differential Revision: D55671872 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123230 Approved by: https://github.com/kwen2501	2024-04-05 21:05:01 +00:00
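The difference between the two validation strategies described in the commit above can be sketched in a few lines. This is a minimal illustration, not the ShardedTensor implementation: the function names `validate_contiguous` and `validate_in_group` are hypothetical, and ranks are modeled as plain integers.

```python
from typing import Iterable, Sequence


def validate_contiguous(shard_ranks: Iterable[int], world_size: int) -> bool:
    """Range-style check (hypothetical): every rank must lie in [0, world_size)."""
    return all(0 <= r < world_size for r in shard_ranks)


def validate_in_group(shard_ranks: Iterable[int], group_ranks: Sequence[int]) -> bool:
    """Membership-style check (hypothetical): every rank must belong to the group."""
    members = set(group_ranks)
    return all(r in members for r in shard_ranks)


# 2D layout from the commit message: groups [0, 2] and [1, 3], each of size 2.
group = [1, 3]
print(validate_contiguous(group, world_size=len(group)))  # range check rejects rank 3
print(validate_in_group(group, group))                    # membership check accepts it
```

The membership check only needs the set of global ranks in the group, so it does not care whether those ranks are contiguous.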
Tugsbayasgalan Manlaibaatar	d78991a738	Make torch_geometric models compatible with export (#123403 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123403 Approved by: https://github.com/angelayi	2024-04-05 20:58:16 +00:00
William Wen	cbde0f048b	[dynamo, 3.12] enable tests disabled due to missing dynamo 3.12 support (#123300 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123300 Approved by: https://github.com/jansel, https://github.com/malfet, https://github.com/zou3519	2024-04-05 20:13:17 +00:00
Shengbao Zheng	ae6f8d923c	Pass and record process_group_name when creating ProcessGroupNCCL (#123117 ) Summary: Pass python c10d group_name to c++ ProcessGroupNCCL so that the pg name will be consistent across different layers. Also record pg_name in flight recorder entry. Differential Revision: D55597200 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123117 Approved by: https://github.com/wconstab	2024-04-05 18:57:45 +00:00
Pian Pawakapan	d7f23f6826	[export] Restore original placeholder names (part 1: top-level renaming) (#122904 ) Summary: This PR restores original names to placeholder nodes, replacing the default names arg0_1, arg1_1, and so on. User inputs now follow the signature of mod.forward(), for example forward(x, y) produces nodes x, y. If the tensors are nested in dictionaries, lists, tuples, or dataclasses, the names are a concatenation of the path to the tensor, e.g. x = {'a': torch.randn(4), 'b': [torch.randn(4), torch.randn(4)]} produces nodes x_a, x_b_0, x_b_1. Parameters, buffers, constants, and custom objects follow the FQN of the object, prefixed by "p", "b", "c", and "obj" respectively. For example, self.bar.l0.weight gets you p_bar_l0_weight. Effect tokens are named token_1, token_2, and so on, since they are not grounded in model inputs or named attributes. note: breaking the original diff into 3 parts (top-level renaming, higher-order-op subgraphs, constant input de/serialization) because of its size. 
Examples:
```python
# params, buffers, constants, inputs, torch.cond
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, p_l0_weight: "f32[4, 4]", p_l0_bias: "f32[4]", c_alpha: "f32[4]", b_beta: "f32[4]", x_0_a: "f32[4, 4]", y: "f32[4, 4]"):
            # No stacktrace found for following nodes
            mul: "f32[4, 4]" = torch.ops.aten.mul.Tensor(x_0_a, x_0_a)
            t: "f32[4, 4]" = torch.ops.aten.t.default(p_l0_weight); p_l0_weight = None
            addmm: "f32[4, 4]" = torch.ops.aten.addmm.default(p_l0_bias, y, t); p_l0_bias = y = t = None
            return addmm

# model code
class Bar(torch.nn.Module):
    def forward(self, x):
        return x * x

class Foo(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.bar = Bar()
        self.l0 = torch.nn.Linear(4, 4)
        self.alpha = torch.randn(4)
        self.register_buffer('beta', torch.randn(4))

    def forward(self, x, y):
        x = x[0]['a']
        mul = self.bar(x)
        z1 = self.l0(y)
        return z1

# custom objects, dataclasses, tokens, constant inputs
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, token_1: "f32[0]", obj_attr, data_x: "f32[4, 4]", data_y: "f32[4, 4]", mode):
            # No stacktrace found for following nodes
            mul: "f32[4, 4]" = torch.ops.aten.mul.Scalar(data_x, 30); data_x = None
            div: "f32[4, 4]" = torch.ops.aten.div.Tensor_mode(data_y, 1.0, rounding_mode = 'floor'); data_y = None
            add: "f32[4, 4]" = torch.ops.aten.add.Tensor(mul, div); mul = div = None
            with_effects = torch._higher_order_ops.effects.with_effects(token_1, torch.ops._TorchScriptTesting.takes_foo.default, obj_attr, add); token_1 = obj_attr = add = None
            getitem: "f32[0]" = with_effects[0]
            getitem_1: "f32[4, 4]" = with_effects[1]; with_effects = None
            return (getitem, getitem_1)

# model code
class Foo(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.attr = torch.classes._TorchScriptTesting._Foo(10, 20)

    def forward(self, data, a=1.0, mode="floor"):
        x = self.attr.add_tensor(data.x) + torch.div(data.y, a, rounding_mode=mode)
        x = torch.ops._TorchScriptTesting.takes_foo(self.attr, x)
        return x

@dataclass
class DataClass:
    x: Tensor
    y: Tensor

register_dataclass_as_pytree_node(
    DataClass,
    serialized_type_name="test.DataClass"
)

args = (DataClass(x=torch.randn(4, 4), y=torch.randn(4, 4)), )
kwargs = {'mode': 'floor'}
ep = torch.export.export(Foo(), args, kwargs, strict=False)
```
Test Plan: verification checks on placeholder names for all export() calls, unit test in test/export/test_export.py Differential Revision: D55456418 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122904 Approved by: https://github.com/angelayi, https://github.com/thiagocrepaldi	2024-04-05 18:56:00 +00:00
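The naming scheme described in the commit above (path concatenation for nested inputs, a kind prefix plus the FQN for parameters and buffers) can be sketched in plain Python. This is an illustrative sketch only, not the export code: `placeholder_names` and `attribute_name` are hypothetical helpers, leaves are arbitrary objects rather than tensors, and dataclass inputs are not handled.

```python
def placeholder_names(prefix, value):
    """Yield one name per leaf, joining the access path with underscores."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from placeholder_names(f"{prefix}_{key}", child)
    elif isinstance(value, (list, tuple)):
        for index, child in enumerate(value):
            yield from placeholder_names(f"{prefix}_{index}", child)
    else:
        # A leaf (a tensor in the real setting) takes the accumulated path.
        yield prefix


def attribute_name(fqn, kind="p"):
    """Prefix a fully qualified name with its kind: p, b, c, or obj."""
    return f"{kind}_{fqn.replace('.', '_')}"


x = {"a": object(), "b": [object(), object()]}
print(list(placeholder_names("x", x)))
print(attribute_name("bar.l0.weight"))
print(attribute_name("beta", kind="b"))
```

With the nested input from the commit message, the sketch produces the names `x_a`, `x_b_0`, `x_b_1`, and the parameter `bar.l0.weight` becomes `p_bar_l0_weight`.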
Arun Pa	f71e368969	UFMT formatting on test/autograd test/ao test/cpp test/backends (#123369 ) Partially addresses #123062. Ran lintrunner on:
- test/_test_bazel.py
- test/ao
- test/autograd
- test/backends
- test/benchmark_utils
- test/conftest.py
- test/bottleneck_test
- test/cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123369 Approved by: https://github.com/huydhn	2024-04-05 18:51:38 +00:00
Lucas Pasqualin	de7edeea25	[DCP] DCP logger (#121352 ) Adds additional logging for improved observability in DCP. Differential Revision: [D54512626](https://our.internmc.facebook.com/intern/diff/D54512626/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121352 Approved by: https://github.com/wz337, https://github.com/fegin	2024-04-05 17:50:50 +00:00
Dmitry Ulyanov	c8e117fb76	Tiny comments improvement (#123426 ) Fixed a typo in `functional.py` and moved comment line to correct place in `transformer.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123426 Approved by: https://github.com/mikaylagawarecki	2024-04-05 17:25:42 +00:00
briancoutinho	9b8f446e95	[pytorch profiler] Add metrics for performance timing and other statistics (#123412 ) Measure the performance of various calls in the PyTorch profiler and save them to the `_ProfilerStats` structure. Differential Revision: [D55457386](https://our.internmc.facebook.com/intern/diff/D55457386/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123412 Approved by: https://github.com/aaronenyeshi	2024-04-05 17:00:45 +00:00
Guilherme Leobas	32f9453c2a	[dynamo] Emit FUNCTORCH_STACK_MATCH guard in vmap(compile(f)) case (#122786 ) Fixes: #122201 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122786 Approved by: https://github.com/zou3519	2024-04-05 15:04:16 +00:00
PyTorch MergeBot	8c7d8f0ff2	Revert "Make torch_geometric models compatible with export (#123403 )" This reverts commit 2ffab6e663b9c6951048b8c8ba82d2cc5ca5c2fc. Reverted https://github.com/pytorch/pytorch/pull/123403 on behalf of https://github.com/atalman due to Related issue basic_gnn_gin ([comment](https://github.com/pytorch/pytorch/pull/123403#issuecomment-2039817292))	2024-04-05 13:34:41 +00:00
Nikita Shulga	5b0ce8f334	[Wheel] Change libtorch_cpu OpenMP search path (#123417 ) To prevent delocate from double-packing it, which makes Torch wheels unusable with torch.compile out of the box Fixes https://github.com/pytorch/pytorch/issues/122705 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123417 Approved by: https://github.com/atalman	2024-04-05 13:02:38 +00:00
DanilBaibak	cfd06bd60c	[CI] Switched to the _linux-build-label workflow for pull, rocm, slow and trunk jobs (#123255 ) Switched to the _linux-build-label workflow for pull requests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123255 Approved by: https://github.com/jeanschmidt, https://github.com/atalman	2024-04-05 09:34:30 +00:00
xinan.lin	9743e3a19c	[Inductor Intel GPU backend Upstream] Add Inductor Intel GPU backend. (#121895 ) Following the design in RFC https://github.com/pytorch/pytorch/issues/114856, this PR implements the Intel GPU Inductor backend by:
- Reusing WrapperCodegen and TritonScheduling for Python wrapper and kernel code generation, and implementing device-specific code generation in XPUDeviceOpOverrides.
- Reusing fx_pass, lowering, codecache, Triton kernel auto-tuning, and compilation.

For testing, this PR provides test/inductor/test_xpu_basic.py for basic Inductor backend functionality testing. We'll reuse all the existing Inductor test cases in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121895 Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire	2024-04-05 09:05:11 +00:00
leslie-fang-intel	9078191666	[Inductor] Add the possible fusions group by priority (#123067 ) Summary: Refactor the `Scheduler.fuse_nodes` changes in https://github.com/pytorch/pytorch/pull/121625. In the previous implementation of `Scheduler.fuse_nodes` in https://github.com/pytorch/pytorch/pull/121625, we used the `enable_outer_loop_fusion` context to ensure `OuterLoopFusion` happens after all the norm fusions. There was a discussion in https://github.com/pytorch/pytorch/pull/121625/files#r1527177141 about reusing the current `score_fusion` mechanism. However, given that [fuse_nodes](`f4ff063c33/torch/_inductor/scheduler.py (L1679-L1698)`) will invoke `fuse_nodes_once` 10 times, we are concerned that the score approach may disrupt pairs of regular fusion nodes in the 2nd invocation of `fuse_nodes_once` if they have already been picked up by the outer loop fusion in the 1st invocation. In this PR, we propose adding an abstraction, `filter_possible_fusions_by_priority`. In each invocation of `fuse_nodes_once`, the possible fusions are grouped by their priority from the backend, and only the group with the highest priority is fused in that invocation. In this way, we can ensure `OuterLoopFusion` happens after all the norm fusions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123067 Approved by: https://github.com/lezcano, https://github.com/jgong5 ghstack dependencies: #121625	2024-04-05 06:30:41 +00:00
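The grouping idea in the commit above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Inductor scheduler code: `filter_by_priority`, `priority_of`, and the string labels are hypothetical, and a lower number is assumed to mean a higher priority.

```python
from collections import defaultdict


def filter_by_priority(possible_fusions, priority_of):
    """Return only the candidates that fall in the highest-priority group."""
    groups = defaultdict(list)
    for candidate in possible_fusions:
        groups[priority_of(candidate)].append(candidate)
    if not groups:
        return []
    # Assumption for this sketch: the smallest key is the highest priority.
    return groups[min(groups)]


# Hypothetical candidates: (kind, node_a, node_b).
PRIORITY = {"normal": 0, "outer_loop": 1}
candidates = [
    ("outer_loop", "n0", "n1"),
    ("normal", "n2", "n3"),
    ("normal", "n4", "n5"),
]

first_round = filter_by_priority(candidates, lambda c: PRIORITY[c[0]])
print(first_round)

# Once no normal candidates remain, the outer-loop group becomes eligible.
remaining = [c for c in candidates if c not in first_round]
print(filter_by_priority(remaining, lambda c: PRIORITY[c[0]]))
```

Because each invocation considers only one priority group, lower-priority candidates cannot claim nodes that a higher-priority fusion could still use.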
leslie-fang-intel	bac2a39aee	[Inductor] [ReImplement] Outer Loop Fusion for CPP Backend (#121625 ) Summary Re-implement of https://github.com/pytorch/pytorch/pull/121064 Test Plan ``` python -u -m pytest -s -v test_cpu_repro.py -k test_outer_loop_fusion ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121625 Approved by: https://github.com/lezcano, https://github.com/jgong5	2024-04-05 06:24:57 +00:00
Tugsbayasgalan Manlaibaatar	2ffab6e663	Make torch_geometric models compatible with export (#123403 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123403 Approved by: https://github.com/angelayi	2024-04-05 05:26:01 +00:00
Lucas Pasqualin	18c9d46068	Fixes format utils executable (#123407 ) Fixes an issue with the format utils executable, which was causing it to run as a no-op. :( Pull Request resolved: https://github.com/pytorch/pytorch/pull/123407 Approved by: https://github.com/wz337, https://github.com/fegin	2024-04-05 03:53:22 +00:00
Kulin Seth	7b575f0814	Handle transposes in second batch of matrices in bmm (#122194 ) 1. Add support for Unranked placeholders in the MPS backend. 2. PR is now fusing the Transposes into the GEMM kernel dispatches in MPS backend. This improves the performance of Transformer networks by 5-8%. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122194 Approved by: https://github.com/DenisVieriu97, https://github.com/malfet	2024-04-05 03:40:16 +00:00
BowenBao	86c5cc6559	[ONNX][dynamo_export] Integrate onnx-rewriter optimizer (#123379 ) Introduces common standard ONNX optimizations such as constant folding, `if` folding, control-flow folding, and pattern rewrites. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123379 Approved by: https://github.com/thiagocrepaldi, https://github.com/justinchuby	2024-04-05 03:29:40 +00:00
Guilherme Leobas	7ffad9ab04	Use out-of-place version of `put` inside `take_backward` (#123268 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123268 Approved by: https://github.com/zou3519 ghstack dependencies: #122211, #122212, #122213	2024-04-05 03:29:11 +00:00
Guilherme Leobas	c575e378ba	Update torch.compile_faq w.r.t to functorch (#122213 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122213 Approved by: https://github.com/zou3519 ghstack dependencies: #122211, #122212	2024-04-05 03:29:11 +00:00
Guilherme Leobas	dbe0c474a9	Ensure all `torch.func.*` functions capture can be disabled (#122212 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122212 Approved by: https://github.com/zou3519 ghstack dependencies: #122211	2024-04-05 03:29:11 +00:00
Guilherme Leobas	84658d9c4f	Enable `capture_func_transforms` by default (#122211 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122211 Approved by: https://github.com/zou3519	2024-04-05 03:29:11 +00:00
Huy Do	3d20cc1332	Cleanup some duplicated placeholder py:module docs (#123244 ) Fixes https://github.com/pytorch/pytorch/issues/123068 Fixes https://github.com/pytorch/pytorch/issues/111256 While investigating the flaky doc build failure w.r.t. the duplicated `torch.ao.quantization.quantize` docstring warning, i.e. https://github.com/pytorch/pytorch/actions/runs/8532187126/job/23376591356#step:10:1260, I discovered an old but still open bug in Sphinx: https://github.com/sphinx-doc/sphinx/issues/4459. These warnings have always been there, but they were hidden because we use `-j auto` to build docs with multiple threads. It's just by chance that they started to surface now. The issue can be reproduced by removing `-j auto` from https://github.com/pytorch/pytorch/blob/main/docs/Makefile#L5 and running `make html` locally. Then these warnings show up consistently. As `make html` treats warnings as errors, they fail the build.
```
...
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/ao/quantization/quantize.py:docstring of torch.ao.quantization.quantize.quantize:1: WARNING: duplicate object description of torch.ao.quantization.quantize, other instance in quantization, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py:docstring of torch.nn.parallel.data_parallel.data_parallel:1: WARNING: duplicate object description of torch.nn.parallel.data_parallel, other instance in nn, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/utils/spectral_norm.py:docstring of torch.nn.utils.spectral_norm.spectral_norm:1: WARNING: duplicate object description of torch.nn.utils.spectral_norm, other instance in nn, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:docstring of torch.nn.utils.weight_norm.weight_norm:1: WARNING: duplicate object description of torch.nn.utils.weight_norm, other instance in nn, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:579: WARNING: duplicate object description of torch.nn.parallel.data_parallel, other instance in generated/torch.nn.functional.torch.nn.parallel.data_parallel, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:594: WARNING: duplicate object description of torch.nn.utils.spectral_norm, other instance in generated/torch.nn.utils.spectral_norm, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:595: WARNING: duplicate object description of torch.nn.utils.weight_norm, other instance in generated/torch.nn.utils.weight_norm, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/quantization.rst:1348: WARNING: duplicate object description of torch.ao.quantization.quantize, other instance in generated/torch.ao.quantization.quantize, use :noindex: for one of them
...
```
The fix is just to clean up those duplicated placeholder py:module docs, which were there because these modules didn't have any docs originally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123244 Approved by: https://github.com/andrewor14, https://github.com/malfet	2024-04-05 03:18:53 +00:00
PyTorch MergeBot	16cb5d48dd	Revert "[inductor] Add explicit ops.fma and use it in softmax_backward (#122518 )" This reverts commit 05984e642b16b289f0871d3db9d14426a57b76f0. Reverted https://github.com/pytorch/pytorch/pull/122518 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it starts failing in trunk `05984e642b` ([comment](https://github.com/pytorch/pytorch/pull/122518#issuecomment-2038631010))	2024-04-05 02:09:32 +00:00
sekyondaMeta	535a84c125	Publish PyTorch docs to pytorch/cpp repo (#122895 ) Updating the documents push to go to https://github.com/pytorch/docs repo instead of https://github.com/pytorch/pytorch.github.io as part of updating the PyTorch docs set up. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Co-authored-by: Svetlana Karslioglu <svekars@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122895 Approved by: https://github.com/malfet	2024-04-05 02:00:20 +00:00
Thiago Crepaldi	d7ccde58a7	[ONNX][dynamo_export] Fix ONNX export with print (#123368 ) Partially fixes #123288 Doesn't handle the `method` case, but `print` is a start. The approach is to mimic `torch.export.export` behavior, whatever that is. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123368 Approved by: https://github.com/BowenBao	2024-04-05 01:07:46 +00:00
Jackie (Jiaqi) Xu	4cf5a9c505	[pt2] remove 2 decompose patterns (#123371 ) Summary: https://fb.workplace.com/groups/1075192433118967/permalink/1402410947063779/ Some investigation shows that the large-k pattern degrades performance, so remove those patterns. The mmt pattern does show some gain when compiling a single operator (P1201328502 and P1201328722), but it conflicts with other optimizations in Inductor and results in a slowdown. Test Plan: some results from a benchmark, manually holding the stride
```
import torch
import torch._inductor.config as inductor_config
import triton

inductor_config.trace.enabled = True

m1 = torch.rand(9388864, 2, device="cuda", dtype=torch.bfloat16)
m2 = torch.rand(9388864, 12, device="cuda", dtype=torch.bfloat16)
print(f"m1.stride {m1.stride()}")
print(f"m2.stride {m2.stride()}")

@torch.compile
def fake_mm(a, b):
    return torch.sum(a[:, :, None] * b[:, None, :], dim=0)

tmp = fake_mm(m1, m2)
print(tmp.shape)
s = triton.testing.do_bench(lambda: fake_mm(m1, m2))
print(f"fake mm{s}")

tmp2 = torch.mm(m1.permute(1, 0), m2)
s = triton.testing.do_bench(lambda: torch.mm(m1.permute(1, 0), m2))
print(f"mm{s}")

m3 = m1.permute(1, 0).contiguous()
s = triton.testing.do_bench(lambda: torch.mm(m1.permute(1, 0).contiguous(), m2))
print(f"mm without permute{s}")

result:
fake mm14.968459129333496
mm507.6383972167969
mm without permute0.7466956973075867
```
A single kernel can be sped up from 5ms->3ms {F1477685597} {F1477685813} Differential Revision: D55759235 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123371 Approved by: https://github.com/mengluy0125	2024-04-05 00:54:10 +00:00
Josh Fromm	0c8a165b43	[Export] Improve metadata and output parsing during deserialization (#122793 ) Summary: Deserialization of metadata could encounter a bug where commas are used in valid metadata names. This specifically occurs when a split of a `torch.nn.Sequential` stack is used, but may have other possible triggers. Because the deserialization relies on a comma based string split, such names trigger an error. This change uses a simple regular expression to ignore commas within parentheses to avoid the issue. I add a test that constructs one such problematic sequential stack and show that it can be properly round-tripped with the improved splitting. Similarly, deserialization could fail when outputs are not a tensor type. Although such outputs like None or constants are not very useful, they do show up in graphs and export should be able to support them. This change improves output node parsing and adds a corresponding test. Test Plan: buck test //caffe2/test:test_export -- TestSerialize Differential Revision: D55391674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122793 Approved by: https://github.com/zhxchen17	2024-04-05 00:25:37 +00:00
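The splitting problem described in the commit above can be illustrated with a short sketch. The regular expression here is an assumption chosen for illustration (the commit message does not show the actual pattern), `split_top_level` is a hypothetical helper, and this pattern only handles a single, non-nested level of parentheses.

```python
import re

# Split on commas that are NOT followed by a ")" before the next "(",
# i.e. commas that do not sit inside a (non-nested) pair of parentheses.
_TOP_LEVEL_COMMA = re.compile(r",(?![^(]*\))")


def split_top_level(text):
    """Split on commas, ignoring commas enclosed in one level of parentheses."""
    return _TOP_LEVEL_COMMA.split(text)


naive = "a,('x', 'y'),b".split(",")
print(naive)                               # the naive split breaks the tuple apart
print(split_top_level("a,('x', 'y'),b"))   # the tuple stays intact
```

A plain `str.split(",")` yields four pieces for this input, while the lookahead-based split yields three and keeps the parenthesized name whole.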
Bin Bao	064a650b63	[AOTI][refactor] Improve generate_extern_kernel_out's signature (#123351 ) Summary: Annotate types and make the names more readable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123351 Approved by: https://github.com/chenyang78 ghstack dependencies: #123346	2024-04-04 23:23:50 +00:00
Bin Bao	aa063054ce	[AOTI] Fix the codegen for aten.randint.low_out (#123346 ) Summary: Fixing https://github.com/pytorch/pytorch/issues/123174. There are two problems here, * Incorrectly calling convert_arrayref_tensor_to_tensor on int arguments. Removing relevant code since we don't use ArrayRef when there is a fallback op. * codegen_kwargs generates an argument for the out parameter of ExternKernelOut. The fix is to leave that logic to corresponding wrapper codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123346 Approved by: https://github.com/chenyang78	2024-04-04 23:23:50 +00:00
PyTorch MergeBot	e61d04e467	Revert "[sparse] Add fast semi-structured spasification kernels (#122350 )" This reverts commit c63a7b569133c9d91bde362c68e4f60abd4b619b. Reverted https://github.com/pytorch/pytorch/pull/122350 on behalf of https://github.com/malfet due to This broke rocm builds, which is visible on PR as well ([comment](https://github.com/pytorch/pytorch/pull/122350#issuecomment-2038424125))	2024-04-04 23:15:36 +00:00
Alexander Kurakin	6107cbba1b	doc: `torch.nn.utils.rnn.pad_sequence`: improve the example description (#123183 ) doc: `torch.nn.utils.rnn.pad_sequence`: improve the example description. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123183 Approved by: https://github.com/mikaylagawarecki	2024-04-04 23:05:50 +00:00
Andrew Gu	19f8cf0167	[FSDP2] Used `ReduceOp.AVG` if bf16 reduce-scatter (#123362 ) The motivation is similar to https://github.com/pytorch/pytorch/pull/120919/ -- we want to use only one mul/div kernel instead of one before and after gradient reduction if possible. Because bf16 has the same dynamic range as fp32, the relative error is the same whether we pre and post-divide vs. just post-divide. In other words, the relative error does not depend on the magnitude. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123362 Approved by: https://github.com/kwen2501, https://github.com/wanchaol ghstack dependencies: #122962, #123290	2024-04-04 23:01:18 +00:00
Dmitry Nikolaev	5494b2a8d3	enable test_sampled_addmm_zero_sized_cuda for rocm (#121940 ) Enable test_sampled_addmm_zero_sized_cuda_* for ROCm only; the CUDA issue is still active. The following tests have passed since ROCm 5.6: test_sampled_addmm_zero_sized_cuda_float32, test_sampled_addmm_zero_sized_cuda_float64, test_sampled_addmm_zero_sized_cuda_complex64, test_sampled_addmm_zero_sized_cuda_complex128. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121940 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet	2024-04-04 22:38:29 +00:00
Pian Pawakapan	4b1b4db231	[export] Add stack_trace for non-strict export (#121034 ) This addresses 2 issues with stack_trace metadata: - stack_trace is currently missing from nodes in non-strict export - in strict mode, stack_trace is populated for placeholder nodes, which may not be well-defined (with multiple uses) We filter the call stack during tracing for calls from forward() methods, or ops in `torch.__init__.py` (e.g. sym_size_int, sym_constrain_range, etc.) to populate stack_trace. A node-level check is also added to _export_non_strict(). Pull Request resolved: https://github.com/pytorch/pytorch/pull/121034 Approved by: https://github.com/angelayi	2024-04-04 22:35:33 +00:00
Michael Lazos	512759a3d7	Fix for tensor attribute missing (#123313 ) Tensors would sometimes be realized after we already registered attrs on the root nn module. This ensures all stack values are realized before registering attrs on the root nn module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123313 Approved by: https://github.com/anijain2305	2024-04-04 21:11:04 +00:00
Valentin Andrei	30598c162d	[pytorch][cuda] Optimized softmax forward native CUDA implementation (#122970 ) In Triton's [softmax tutorial](https://triton-lang.org/main/getting-started/tutorials/02-fused-softmax.html), native performance is significantly lower than Triton's. We accelerated the native code as follows: --> Wrote a CUDA kernel `cunn_SoftMaxForwardSmem` for softmax forward that caches the inputs in shared memory. Currently the maximum usable shared memory is 48KB to preserve compatibility with older generation Kepler GPUs but we can increase this. This kernel uses vectorized loads and stores and runs on problem sizes that fit in shared memory and use aligned buffers. --> Modified the default implementation's intra thread block reduction to use warp shuffles as the first step in reduction and use shared memory only to reduce across warps. --> Simplified the `WriteFpropResults` code because the loop unrolling brought no benefits but had a potentially detrimental effect on register usage. We can observe that there is still an advantage in the Triton implementation. We were able to recover the gap by using native `__expf` but we decided to leave `std::exp` to avoid affecting numerical stability.
```
Tests were run on an A100 GPU using the benchmark in the Triton tutorial.

Before
softmax-performance:
          N       Triton  Torch (native)  Torch (jit)
0     256.0   336.946021      595.781814   241.830261
1     384.0   737.741110      762.046555   297.890900
2     512.0   884.128199      860.899863   362.829080
3     640.0   936.228605      901.458039   376.211253
4     768.0  1005.024893      973.306952   384.187594
..      ...          ...             ...          ...
93  12160.0  1336.034308      858.096595   330.642735
94  12288.0  1339.248830      837.047196   331.146707
95  12416.0  1338.877891      839.317673   329.113513
96  12544.0  1335.383669      835.342136   330.067106
97  12672.0  1339.402120      821.690012   329.854051

After
softmax-performance:
          N       Triton  Torch (native)  Torch (jit)
0     256.0   375.833684      602.629893   237.019883
1     384.0   312.572329      739.127852   301.777431
2     512.0   495.546303      863.736375   368.438508
3     640.0   520.953881      884.426455   369.633391
4     768.0   677.374681      975.722054   385.317013
..      ...          ...             ...          ...
93  12160.0  1337.253933     1300.589124   330.655916
94  12288.0  1336.333052     1188.412588   331.116192
95  12416.0  1337.610105     1209.703474   329.232825
96  12544.0  1338.723893     1232.849225   330.003484
97  12672.0  1340.232227     1236.057117   329.925347
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122970 Approved by: https://github.com/malfet	2024-04-04 21:05:23 +00:00
Peter Bell	05984e642b	[inductor] Add explicit ops.fma and use it in softmax_backward (#122518 ) This allows us to generate an fma even when fp-fusion is disabled in the compiler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122518 Approved by: https://github.com/lezcano, https://github.com/Chillee ghstack dependencies: #121924	2024-04-04 20:53:14 +00:00
rzou	d486cb7c1b	Deprecate calling FakeTensor.data_ptr in eager-mode (#123292 ) Today, we error out on FakeTensor.data_ptr under torch.compile. This PR moves to error out on FakeTensor.data_ptr under eager mode to avoid diverging behavior. We do this by adding another bit onto FakeTensor that we'll remove after the deprecation cycle. Test Plan: - tested locally Pull Request resolved: https://github.com/pytorch/pytorch/pull/123292 Approved by: https://github.com/eellison ghstack dependencies: #123261, #123282, #123291	2024-04-04 20:35:24 +00:00
rzou	fd60752786	Turn _allow_unsafe_data_ptr_access into a config option (#123291 ) We're not planning on having this flag around for very long (see deprecation in next PR), so it's better as a config option. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123291 Approved by: https://github.com/eellison ghstack dependencies: #123261, #123282	2024-04-04 20:35:24 +00:00
Andrew Gallagher	de14717819	[triton] Backport https://github.com/openai/triton/pull/3433 (#122470 ) Summary: Pull cache API changes from https://github.com/openai/triton/pull/3433. Among other simplifications, this allows us to cache all files in a "group" atomically, in a single memcache blob, and avoid needing to use other approaches to handle these files coming from different runs. Reviewed By: bertmaher Differential Revision: D55206000 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122470 Approved by: https://github.com/bertmaher	2024-04-04 20:24:28 +00:00
William Wen	5c7e2fd270	[dynamo, 3.12] use pymalloc allocator instead of malloc/free for frames (#123299 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123299 Approved by: https://github.com/jansel ghstack dependencies: #123216	2024-04-04 20:00:54 +00:00
William Wen	d59c5d7353	[dynamo, 3.12] enable dynamo on 3.12, enable most dynamo unittests on 3.12 (#123216 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123216 Approved by: https://github.com/jansel, https://github.com/malfet	2024-04-04 20:00:54 +00:00
Jesse Cai	c63a7b5691	[sparse] Add fast semi-structured spasification kernels (#122350 ) This PR adds fast semi-structured sparsification kernels to PyTorch. These kernels allow for accelerated semi-structured sparsification in PyTorch. The kernels have been added as aten native functions. In particular, three new functions have been added: * `torch._sparse_semi_structured_tile` This function will return the packed representation and metadata for both X and X', as well as the thread masks. Note that this applies 2:4 sparsity in a 4x4 tile instead of a 1x4 strip as usual. * `torch._sparse_semi_structured_apply` This function takes in an input tensor and thread masks from the above function and returns a packed representation and metadata from applying thread masks to the input tensor. * `torch._sparse_semi_structured_apply_dense` This function does the same thing as above but instead of returning the tensor in the sparse representation it returns it in the dense representation. The subclasses have also been updated to add a new `prune_dense_static_sort` classmethod to create sparse tensors with this format. I've added some additional documentation on how to calculate the compressed tensors needed to create a SparseSemiStructuredTensor oneself. To this end, there are two new helper functions added: `sparse_semi_structured_tile` `compute_compressed_swizzled_bitmask` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350 Approved by: https://github.com/cpuhrsch	2024-04-04 19:07:35 +00:00
PyTorch MergeBot	d8717c2d68	Revert "Skip test_artificial_grid_cpp_wrapper (#123211 )" This reverts commit a8b9dcb9af8012fd64d310781c85a6e3055e67cc. Reverted https://github.com/pytorch/pytorch/pull/123211 on behalf of https://github.com/clee2000 due to test_artificial_zgrid is failing internally and the PR to skip #123211 is also failing but for a different reason ([comment](https://github.com/pytorch/pytorch/pull/123211#issuecomment-2037979882))	2024-04-04 18:58:55 +00:00
PyTorch MergeBot	a808559fc6	Revert "[inductor] Fix fresh_inductor_cache() (#122661 )" This reverts commit ba7d396eb73e91c1846ed770f470245ef578a923. Reverted https://github.com/pytorch/pytorch/pull/122661 on behalf of https://github.com/clee2000 due to new test is failing internally ([comment](https://github.com/pytorch/pytorch/pull/122661#issuecomment-2037977934))	2024-04-04 18:55:55 +00:00
Joel Schlosser	721dcaff94	Revert usage of NJT views in SDPA (#123215 ) For internal purposes, this PR reverts the use of real views in SDPA -> autograd.Function "views" (i.e. `ViewBufferFromNested` and `ViewNestedFromBuffer`). This is a temporary fix to get the FIRST model launched and working. Note: this breaks some other Dynamo tests related to SDPA that rely on real views, but the breakage there isn't expected to be likely in a real-world scenario. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123215 Approved by: https://github.com/YuqingJ	2024-04-04 18:45:47 +00:00
Andrew Gu	8b83327cd5	[FSDP] Fixed `summon_full_params` on submodule (#123290 ) This PR fixes https://github.com/pytorch/pytorch/issues/122663. This PR changes `_unshard_params` to directly look for FSDP modules instead of the two steps of first finding the root FSDP modules and then recursing on their submodules. This should address the issue where we call `summon_full_params` on an FSDP module that is _not_ the root FSDP module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123290 Approved by: https://github.com/weifengpy ghstack dependencies: #122962	2024-04-04 18:26:08 +00:00
Ke Wen	fe84155083	[TP][Tests] Replace assertEqual with deepcopy (#123218 ) There were a lot of manual `assertEqual`'s in the tests to make sure `model_tp` was created the same as `model`. `model_tp = copy.deepcopy(model)` should help us rest assured. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123218 Approved by: https://github.com/wanchaol	2024-04-04 18:11:58 +00:00
Richard Barnes	98e5238ad8	[codemod][lowrisk] Remove unused exception parameter from caffe2/caffe2/image/image_input_op.h (#123056 ) Summary: `-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it. This: ``` try { ... } catch (exception& e) { // no use of e } ``` should instead be written as ``` } catch (exception&) { ``` If the code compiles, this is safe to land. Test Plan: Sandcastle Reviewed By: palmje Differential Revision: D55548497 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123056 Approved by: https://github.com/Skylion007	2024-04-04 17:24:43 +00:00
rzou	836a86064c	Ensure torch.library doctests runs under xdoctest (#123282 ) I'm not sure what "TORCH_DOCTEST_LIBRARY" is, but it prevented these tests from running under xdoctest. This PR fixes the docstrings and makes them actually run under xdoctest. Test Plan: - wait for CI - I verified locally that the docstrings are now being tested. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123282 Approved by: https://github.com/williamwen42 ghstack dependencies: #123261	2024-04-04 16:20:42 +00:00
rzou	8f20cf1c71	Update the functionalization error message (#123261 ) Previously, it suggested that a user add a manual functionalization kernel. However, since we have auto_functionalize now, the user's first course of action should be to modify their op into the form that auto_functionalize accepts (this is possible in the majority of custom ops). Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/123261 Approved by: https://github.com/williamwen42	2024-04-04 16:20:42 +00:00
Yue Dong	e0c9764660	Back out "Precompile triton templates (#121998 )" (#123305 ) Summary: We are reverting #121998 because the change plus search-autotune-cache led to significant compilation time increase, causing stuck job detector to trigger and then kill the training job. Test Plan: CI tests Reviewed By: nmacchioni Differential Revision: D55712203 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123305 Approved by: https://github.com/eellison, https://github.com/nmacchioni, https://github.com/xw285cornell	2024-04-04 16:05:10 +00:00
Jean Schmidt	595613d746	[CI] Workaround to the dind-rootless limitation to restore user on build.sh and test.sh (#122922 ) Co-authored-by: Thanh Ha <thanh.ha@linuxfoundation.org> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122922 Approved by: https://github.com/DanilBaibak	2024-04-04 14:20:51 +00:00
Thanh Ha	5ecfe58cfb	Remove ulimit setting for ARC dind-rootless (#122629 ) Since ARC runners use dind-rootless mode, setting the ulimit in the docker run command is not possible, as the dind-rootless container does not have sufficient permissions to do that. This change looks like it was coming from a migration from another CI system so perhaps it's not necessary anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122629 Approved by: https://github.com/jeanschmidt	2024-04-04 14:18:58 +00:00
atalman	26b4ccf9d1	Use numpy 2.0.0rc1 in CI (#123286 ) Bump numpy version to 2.0.0rc1 in CI Related to: https://github.com/pytorch/pytorch/issues/107302 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123286 Approved by: https://github.com/huydhn, https://github.com/kit1980, https://github.com/ZainRizvi	2024-04-04 14:00:19 +00:00
eellison	d9cbd57dfe	Make u/int8 cat inductor fallback cpu-only (#123278 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123278 Approved by: https://github.com/Chillee	2024-04-04 13:54:37 +00:00
Nikita Shulga	b5488cbe64	Use Vectorized Half for eager and compile (#123260 ) Using the implementation added by https://github.com/pytorch/pytorch/pull/122918. Adapt `convert_half_float`/`convert_half_float` to work when better APIs are available. Fix the `Vectorized<Half>::reciprocal()` and `::rsqrt()` implementations to use proper division rather than `vrecpeq_f16`, which computes only an estimate. Please note that this pattern is already present in `Vectorized<Float>` for NEON, see: `05289a278c/aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h (L618-L622)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123260 Approved by: https://github.com/mikekgfb	2024-04-04 13:28:44 +00:00
PyTorch MergeBot	54801e6fd6	Revert "[Distributed] [2/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#122892 )" This reverts commit 0ba16ffd35af3eb56da4892cc5387c5e8ac864bb. Reverted https://github.com/pytorch/pytorch/pull/122892 on behalf of https://github.com/atalman due to broke cuda tests ([comment](https://github.com/pytorch/pytorch/pull/122892#issuecomment-2037207036))	2024-04-04 13:22:22 +00:00
Shunting Zhang	6890333e3d	[inductor] fix tensor overlap detection that cause cudagraphs being disabled (#123327 ) If any graph input has overlapping memory, inductor disables cudagraphs. But the function `complex_memory_overlap` that detects memory overlap can report false positives. E.g. for tensor `rand_strided((8, 1500, 1), (1504, 1, 1), device=self.device)` the function previously reported an overlap. This is caused by the size=1 dimension. The fix is to squeeze before running the detection algorithm. This fixes the perf regression for hf_Whisper and timm_efficientdet when we do padding. For these models, cudagraphs were dynamically disabled when doing padding due to the issue discussed here, causing a perf regression. This may help the dashboard if this is a common thing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123327 Approved by: https://github.com/Chillee	2024-04-04 08:55:00 +00:00
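The false positive described in #123327 can be illustrated with a simplified stride-based check. This is a sketch only: `may_overlap` is a hypothetical helper, not inductor's `complex_memory_overlap`, and it assumes a conservative rule in which each stride, taken in ascending order, must be at least as large as the span of the dimensions before it.

```python
def may_overlap(sizes, strides, squeeze=False):
    """Conservative overlap check (hypothetical sketch, not inductor's code).

    Returns True when the layout *might* alias itself.
    """
    dims = list(zip(sizes, strides))
    if squeeze:
        # A size-1 dimension is never stepped along, so its stride is irrelevant.
        dims = [(n, s) for n, s in dims if n != 1]
    dims.sort(key=lambda d: d[1])  # ascending stride; the sort is stable

    span = 1  # number of elements covered by the dimensions seen so far
    for n, s in dims:
        if s < span:
            return True
        span += (n - 1) * s
    return False


sizes, strides = (8, 1500, 1), (1504, 1, 1)
without_squeeze = may_overlap(sizes, strides)
with_squeeze = may_overlap(sizes, strides, squeeze=True)
really_overlapping = may_overlap((2, 2), (1, 1), squeeze=True)
print(without_squeeze, with_squeeze, really_overlapping)
```

Under these assumptions the trailing size-1 dimension with stride 1 trips the check, and dropping it first removes the false positive while a genuinely aliasing layout is still flagged.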
Huy Do	f00ece024b	Handle wrong workflow name from GitHub (#123301 ) Fixes https://github.com/pytorch/pytorch/issues/122422. From my testing, the problem is that GitHub didn't return the correct workflow name in some cases and used the path to the workflow instead. Take https://github.com/pytorch/pytorch/pull/123104 as an example, the returning name from GH graphql was `.github/workflows/generated-linux-binary-conda-nightly.yml` while the name we had on Rockset was `linux-binary-conda`. The latter was correct, but the mismatch caused mergebot to miss the flaky failures. This is a weird issue because retrying the graphql query eventually returns the correct name. First query: ![Screenshot 2024-04-03 at 15 28 37](https://github.com/pytorch/pytorch/assets/475357/81a8ada4-c241-4e6b-b45d-7a6de1c3a151) After several retries: ![Screenshot 2024-04-03 at 15 31 53](https://github.com/pytorch/pytorch/assets/475357/402c2e8c-f963-45f6-8c10-e1d2f49c5479) Then I could never get the result like the first query again. The fix here is to keep track of the job ID so that we can compare it instead of the `workflow / job` name. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123301 Approved by: https://github.com/clee2000	2024-04-04 07:00:40 +00:00
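The mismatch described in #123301 can be sketched as follows. The record layout, the `id` value, and the job name `build` are all hypothetical placeholders for illustration; only the two workflow names come from the commit message, and this is not mergebot's actual code.

```python
# Hypothetical records for the same CI job as seen by two sources.
gh_job = {
    "id": 123,  # hypothetical job ID
    "workflow": ".github/workflows/generated-linux-binary-conda-nightly.yml",
    "name": "build",  # hypothetical job name
}
rockset_job = {
    "id": 123,
    "workflow": "linux-binary-conda",
    "name": "build",
}


def same_job_by_name(a, b):
    # Compares the `workflow / job` name, which breaks when one source
    # reports the workflow file path instead of the workflow name.
    return (a["workflow"], a["name"]) == (b["workflow"], b["name"])


def same_job_by_id(a, b):
    # Compares the job ID, which does not depend on how the name is reported.
    return a["id"] == b["id"]


matched_by_name = same_job_by_name(gh_job, rockset_job)
matched_by_id = same_job_by_id(gh_job, rockset_job)
print(matched_by_name, matched_by_id)
```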
Wei Wei	dbeb214043	[aot_inductor] Fix issues in pre_grad passes (#123181 ) Summary: Fixed a bug in the `sink_cat_after_pointwise` pass for PT IR. The root cause is the assumption that the input exists in kwargs or args. Differential Revision: D55617545 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123181 Approved by: https://github.com/hl475, https://github.com/khabinov	2024-04-04 05:13:28 +00:00
xinan.lin	eee8413b8d	[Inductor Intel GPU backend Upstream] Enable triton installation for Intel GPU (#122254 ) Following the RFC https://github.com/pytorch/pytorch/issues/114856, Intel GPU Inductor backend depends on triton that functions with Intel GPUs. This PR enabled the triton installation in Intel GPU CI docker build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122254 Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire ghstack dependencies: #121883	2024-04-04 05:04:17 +00:00
Colin Peppler	2a24b54e65	[inductor] simplify expr when looking up size hint (#123140 ) ## Context Suppose we have two symbols: `u0` and `s0` where we know that `u0 = s0`. Now, let's say we tried to look up the size hint for `u0 + 1`. * Before this PR, we would use a fallback hint if one was provided. `3f6acf65fd/torch/_inductor/sizevars.py (L406-L407)` * With this PR, we would try to replace `u0` with `s0` via `simplify()` before using a fallback hint. `3f6acf65fd/torch/_inductor/sizevars.py (L46-L47)` ## Concrete Example A scenario where this is useful is when we're running autotuning benchmarking on bmm with two input nodes: one who has `s0` as the batch size and one who has `u0` as the batch size. During benchmarking, we'll create two example input tensors where the input with `u0` has to use a fallback hint for batch size. This will lead to a mismatch. `e3d80f2fa9/torch/_inductor/select_algorithm.py (L991-L997)` Using the fallback hint (i.e. 8192) leads to a batch size mismatch. ``` # Note: s0 = 7 and u0 = 7 and fallback hint is 8192. LoweringException: ErrorFromChoice: Expected size for first two dimensions of batch2 tensor to be: [7, 30] but got: [8192, 30]. From choice ExternKernelCaller(extern_kernels.bmm) ``` Differential Revision: D55619331 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123140 Approved by: https://github.com/aakhundov	2024-04-04 04:59:59 +00:00
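The lookup order described in #123140 can be sketched in plain Python. This is a toy model under stated assumptions: `size_hint`, the `(symbol, offset)` encoding of `u0 + 1`, and the dictionaries are hypothetical and do not mirror the real code in `torch/_inductor/sizevars.py`; only the values `s0 = 7` and the fallback `8192` come from the commit message.

```python
FALLBACK_HINT = 8192  # fallback value quoted in the commit message


def size_hint(symbol, offset, hints, replacements, simplify_first):
    """Return a hint for the expression `symbol + offset` (toy model)."""
    if simplify_first:
        # Rewrite u0 -> s0 when the two symbols are known to be equal.
        symbol = replacements.get(symbol, symbol)
    if symbol in hints:
        return hints[symbol] + offset
    return FALLBACK_HINT


hints = {"s0": 7}            # only s0 has a known hint
replacements = {"u0": "s0"}  # we know that u0 = s0

before = size_hint("u0", 1, hints, replacements, simplify_first=False)
after = size_hint("u0", 1, hints, replacements, simplify_first=True)
print(before, after)
```

Without the simplification step the unbacked symbol falls through to the fallback; with it, the hint for `s0` is reused.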
Svetlana Karslioglu	5bc6bd3cb8	Remove excessive warnings and rewrite FSDP docstrings (#123281 ) The page at https://pytorch.org/docs/stable/fsdp.html contains a series of warnings and notes that, due to their frequency, may detract from their intended purpose: to highlight crucial information. This PR aims to restructure these notes and warnings into a more coherent narrative, thereby enhancing the readability of the page. Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123281 Approved by: https://github.com/awgu	2024-04-04 04:08:57 +00:00
Animesh Jain	6694628170	[dynamo][guards] Remove workaround after #122858 (#123303 ) Not needed since https://github.com/pytorch/pytorch/pull/122858 has landed Pull Request resolved: https://github.com/pytorch/pytorch/pull/123303 Approved by: https://github.com/mlazos ghstack dependencies: #123285, #123302	2024-04-04 03:52:50 +00:00
Animesh Jain	5b45ec8892	[dynamo][guards] Use DATA_PTR instead of ID_MATCH for tensors (#123302 ) We should use ID_MATCH guards sparingly. When it comes to performance, ID_MATCH is much faster than DATA_PTR for Python guards. However, the difference is very small in C++. So, it's worth just using DATA_PTR_MATCH. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123302 Approved by: https://github.com/mlazos ghstack dependencies: #123285	2024-04-04 03:52:50 +00:00
Animesh Jain	fb7664d5bf	[dynamo][optimizer][guard-overhead] NOT_NONE guard for param.grad instead of TENSOR_MATCH (#123285 ) For optimizers, we do a DATA_PTR match for parameters. For param.grad, we were doing TENSOR_MATCH, but what we really need to guard is whether param.grad is None or not. Therefore, I add a new guard called NOT_NONE. This further improves the guard overhead ![image](https://github.com/pytorch/pytorch/assets/13822661/574598ac-ca71-4e5e-9e75-8774577cd58f) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123285 Approved by: https://github.com/mlazos, https://github.com/jansel	2024-04-04 03:52:47 +00:00
PyTorch MergeBot	63d17d3c90	Revert "Revert usage of NJT views in SDPA (#123215 )" This reverts commit 0fcddb56252c9b4401e8b888eddd4bc4bce3e624. Reverted https://github.com/pytorch/pytorch/pull/123215 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but I think it needs to be skipped on ROCm `0fcddb5625` ([comment](https://github.com/pytorch/pytorch/pull/123215#issuecomment-2036080570))	2024-04-04 02:57:09 +00:00
Sam Larsen	ba7d396eb7	[inductor] Fix fresh_inductor_cache() (#122661 ) Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts. Test Plan: - New unit test - All existing inductor tests will exercise fresh_inductor_cache() Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661 Approved by: https://github.com/oulgen	2024-04-04 02:32:37 +00:00
Tugsbayasgalan Manlaibaatar	1ea6d3a9b4	Fix conv decomp when running to core-aten (#123283 ) Differential Revision: [D55709374](https://our.internmc.facebook.com/intern/diff/D55709374) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123283 Approved by: https://github.com/angelayi	2024-04-04 01:14:09 +00:00
Yifu Wang	c58b0ac7c2	IntraNodeComm primitives for allgather_matmul (#118038 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118038 Approved by: https://github.com/wanchaol	2024-04-04 00:46:08 +00:00
cyy	0ba16ffd35	[Distributed] [2/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#122892 ) This PR continues to fix some clang-tidy warnings in distributed code, following #122884. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122892 Approved by: https://github.com/Skylion007	2024-04-04 00:39:31 +00:00
Nikita Shulga	41e7619875	[EZ] Do not test for an undefined conversion behavior (#123258 ) By limiting `VecConvertTests` subtest cases to positive numbers when converting to unsigned types. What `static_cast<unsigned int>(-3.0f)` does is compiler/architecture specific, as one can observe by running
```cpp
#include <stdint.h>
#include <cstdlib>
#include <iostream>

unsigned int convert(float x) {
  return static_cast<unsigned int>(x);
}

int main(int argc, const char* argv[]) {
  // Parse the input at runtime (defaults to -3.0).
  auto inp = std::atof(argc > 1 ? argv[1] : "-3.0");
  std::cout << "cvt(" << inp << ")=" << convert(inp) << std::endl;
  return 0;
}
```
which on x86 would print `cvt(-3)=4294967293`, but on ARM would convert to `0`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123258 Approved by: https://github.com/atalman	2024-04-03 23:42:25 +00:00
Joel Schlosser	0fcddb5625	Revert usage of NJT views in SDPA (#123215 ) For internal purposes, this PR reverts the use of real views in SDPA -> autograd.Function "views" (i.e. `ViewBufferFromNested` and `ViewNestedFromBuffer`). This is a temporary fix to get the FIRST model launched and working. Note: this breaks some other Dynamo tests related to SDPA that rely on real views, but the breakage there isn't expected to be likely in a real-world scenario. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123215 Approved by: https://github.com/YuqingJ	2024-04-03 23:25:31 +00:00
youkaichao	6e99f73923	[CMake] fix cmake regex to match newly introduced 9.0a architecture (#123243 ) When people build pytorch extensions with cmake, and the GPU supports 9.0a arch as introduced in https://github.com/pytorch/pytorch/pull/110587 , the cmake regex is not updated to recognize the change, leading to cmake breaks like https://github.com/pytorch/pytorch/issues/113948 and https://github.com/pytorch/pytorch/issues/119946 . This PR should fix them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123243 Approved by: https://github.com/malfet	2024-04-03 23:24:26 +00:00
Joona Havukainen	05289a278c	Fix for MPS regression in #122016 and #123178 (#123234 ) Fixes #122016 and #123178. This regression is related to an OS side change that requires a slight adjustment from us on PyTorch side to restore the previous behavior. Additionally we cleared out pre-MacOS13 related workarounds. Before the fix on MacOS 14.4: ``` python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)" tensor([0., 3., 3.], device='mps:0') ``` After the fix: ``` python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)" tensor([0., 1., 3.], device='mps:0') ``` This also fixes complex number initialization and as such makes `nn.functional.rms_norm` pass on MacOS-14+ Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123234 Approved by: https://github.com/malfet, https://github.com/kulinseth	2024-04-03 23:00:57 +00:00
Shivam Raikundalia	4732375042	make RecordFunctionFast take inputs (#123208 ) Summary: RECORD_FUNCTION in C++ and torch.profiler.record_function already support recording inputs. Let's do the same for RecordFunctionFast. Test Plan: Add tests in test_profiler.py that take args and also do not take args so we can support it being an optional parameter Differential Revision: D55648870 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123208 Approved by: https://github.com/davidberard98	2024-04-03 21:58:09 +00:00
Thanh Ha	a5cf9a5800	[CI] Do not install Nvidia drivers in ARC (#122890 ) ARC Runners will provide working Nvidia drivers through the host configuration, so this step is no longer necessary in the workflow, as the ARC container is not able to install packages at the host level. Also simplify the setup-linux condition on whether we are running in ARC, as we can achieve the same result without needing an extra shell step via the hashFiles() function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122890 Approved by: https://github.com/seemethere, https://github.com/jeanschmidt	2024-04-03 21:47:16 +00:00
Michael Lazos	3e2b7e6052	[dynamo][guard overhead] Data ptr guard optimizer state tensors (#122858 ) Stricter (but faster) guarding on optimizer state tensors Pull Request resolved: https://github.com/pytorch/pytorch/pull/122858 Approved by: https://github.com/anijain2305	2024-04-03 21:42:06 +00:00
Menglu Yu	63a0ce89a0	[PT2][Inductor][3/n] Customize pre grad and post grad patterns (#121915 ) Summary: Currently, only the group batch fusion customization is enabled; this change also enables the split cat customization. Test Plan: ``` buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf" --flow_id 524546542 ``` P1196013839 Differential Revision: D54861682 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121915 Approved by: https://github.com/jackiexu1992	2024-04-03 21:37:21 +00:00
Nikita Shulga	7deb842b0d	[MacOS] Default parallel jobs to performance cores (#123038 ) By querying `sysctl hw.perflevel0.physicalcpu` instead of `std::thread::hardware_concurrency()`, which returns the total number of cores, i.e. the sum of performance and efficiency ones. As lots of parallel algorithms in ATen divide the parallel task into even regions, this ends up in faster code execution compared to when all cores are used by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123038 Approved by: https://github.com/albanD	2024-04-03 21:05:17 +00:00
William Wen	9e0838ff27	fix typo in export/test_export.py (#123228 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123228 Approved by: https://github.com/pianpwk	2024-04-03 20:17:22 +00:00
Lucas Pasqualin	620aaaf0cb	[DCP] Adds ability to create a CPU state dict that is both shared and pinned (#122338 ) [DCP] Adds ability to create a CPU state dict that is both shared and pinned, as well as a new utility specific to copying the state dict https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge8d5c17670f16ac4fc8fcb4181cb490c Pull Request resolved: https://github.com/pytorch/pytorch/pull/122338 Approved by: https://github.com/fegin	2024-04-03 20:05:01 +00:00
ydwu4	a4035bea5c	[while_loop] support closures (#123018 ) We add an additional_inputs argument to the HOP while_loop and rename the operands to carried_inputs based on offline discussion with @zou3519 . This allows us to support closures, parameters and buffers. The alternative is to pass the lifted inputs directly to the outputs of body_fn. But since we want body_fn's output not to alias its inputs, we would need to copy the inputs and remove the copies later. This is a bit more work to do. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123018 Approved by: https://github.com/aakhundov ghstack dependencies: #123217	2024-04-03 19:35:15 +00:00
ydwu4	5a66c2d65b	Update pytorch/xla pin (#123217 ) #123018 introduces a necessary bc breaking change and sees a bunch of xla test failures on CI. We made a pr to pytorch/xla to prepare for the breaking change https://github.com/pytorch/xla/pull/6872. We update the pin of pytorch/xla to reflect the change in this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123217 Approved by: https://github.com/clee2000	2024-04-03 19:35:15 +00:00
Luca Wehrstedt	9aed9c8c87	Reduce CPU overhead of copying inputs in CUDAGraph trees via foreach_copy (#123162 ) I noticed that when enabling CUDA graphs in Inductor, most of the CPU time was spent issuing copies from the new inputs to the graph's input tensors. This meant that my workload was still somewhat CPU bound. <img width="1204" alt="Screenshot 2024-03-28 at 14 18 49" src="https://github.com/pytorch/pytorch/assets/120810/9ac2462d-ef46-4051-8b22-e677845ca83e"> I tried to improve this situation by using the new `_foreach_copy_` operator, in order to group all the copies into one operator. There was already a comment in the code indicating that this was a desired optimization. It did indeed improve the situation substantially: <img width="908" alt="Screenshot 2024-03-28 at 14 21 21" src="https://github.com/pytorch/pytorch/assets/120810/67548ac8-2b41-46ba-8588-cea6470301cc"> On device, the situation also improved, with the memcpys being merged into fewer larger kernels: Before: <img width="848" alt="Screenshot 2024-03-28 at 14 24 48" src="https://github.com/pytorch/pytorch/assets/120810/e12e27c4-6d86-40cf-9478-061bc10920d7"> After: <img width="824" alt="Screenshot 2024-03-28 at 14 24 06" src="https://github.com/pytorch/pytorch/assets/120810/a4771b5c-6848-4510-a841-ffa5bba3023f"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123162 Approved by: https://github.com/eellison	2024-04-03 19:34:41 +00:00
Chun Cai	691054eeef	Fix error message of autograd (#123154 ) This PR updates the error message in autograd when an input tensor does not have `requires_grad` set. The original message does not contain the index info, making it hard for users to debug. The error message style is consistent with that on lines 105-109. Co-authored-by: Jeffrey Wan <soulitzer@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123154 Approved by: https://github.com/soulitzer	2024-04-03 19:07:21 +00:00
Yanan Cao (PyTorch)	700917c361	Adjust logging content for TS usage logging (#123133 ) Summary: Remove unused/ignore/export TS logging because they do not represent independent TS usage and lead to overload of scribe. Log tupperware job's oncall information so that we have better attribution of who launched the job. Test Plan: manual testing Reviewed By: davidberard98 Differential Revision: D55610844 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123133 Approved by: https://github.com/clee2000	2024-04-03 18:54:26 +00:00
rzou	9aa1a4d386	Remove mypy-ignore-errors from custom_op_db (#123107 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123107 Approved by: https://github.com/soulitzer ghstack dependencies: #122344	2024-04-03 18:36:17 +00:00
rzou	44c0c0fc0f	Add torch.library.custom_op (#122344 ) This is the entrypoint for defining an opaque/blackbox (e.g. PyTorch will never peek into it) custom op. In this PR, you can specify backend impls and the abstract impl for this op. NB: most of this PR is docstrings, please don't be intimidated by the line count. There are a number of interesting features: - we infer the schema from type hints. In a followup I add the ability to manually specify a schema. - name inference. The user needs to manually specify an op name for now. In a followup we add the ability to automatically infer a name (this is a little tricky). - custom_op registrations can override each other. This makes them more pleasant to work with in environments like colab. - we require that the outputs of the custom_op do not alias any inputs or each other. We enforce this via a runtime check, but can relax this into an opcheck test if it really matters in the future. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/122344 Approved by: https://github.com/ezyang, https://github.com/albanD	2024-04-03 18:36:17 +00:00
Michael Lazos	aa16c0163f	Only update momentum buffers for SGD if momentum is enabled (#122349 ) As title [benchmark](https://gist.github.com/mlazos/1171f035a2392c33778aaa3d7bf24370) Helps compiled vanilla SGD execution time by 2x on certain models with large number of small params (ex. ElectraForQuestionAnswering goes from 1090us -> 554us) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122349 Approved by: https://github.com/janeyx99	2024-04-03 18:29:55 +00:00
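The guard described in #122349 can be sketched in plain Python as follows. This is a minimal illustration, not the code from `torch.optim.SGD`: the function name `sgd_step` is hypothetical, and lists of floats stand in for tensors.

```python
def sgd_step(params, grads, momentum_buffers, lr=0.1, momentum=0.0):
    """Hypothetical plain-Python SGD step; floats stand in for tensors."""
    for i, (p, g) in enumerate(zip(params, grads)):
        # Only read and write the momentum buffer when momentum is enabled,
        # so vanilla SGD does no per-parameter buffer bookkeeping at all.
        if momentum != 0:
            buf = momentum_buffers[i]
            buf = g if buf is None else momentum * buf + g
            momentum_buffers[i] = buf
            g = buf
        params[i] = p - lr * g
    return params, momentum_buffers
```

With `momentum=0.0` the buffers list is left untouched, which is the work the commit avoids for models with many small parameters.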
andrewor14	fe29a8fbea	[quant][be] Simplify fake_quant_per_channel (#123186 ) Summary: We probably don't need `torch._C._AutoDispatchBelowAutograd()`, which is to prevent infinite recursion if the implementation calls itself. Let's remove it and see if anything breaks. The other major change is registering the op to the more general Autograd dispatch key so it can be used on cuda as well. Test Plan: python test/inductor/test_cpu_repro.py -k test_decomposed_fake_quant_per_channel Reviewers: zou3519, bdhirsh Subscribers: zou3519, bdhirsh, jerryzh168, leslie-fang-intel Pull Request resolved: https://github.com/pytorch/pytorch/pull/123186 Approved by: https://github.com/zou3519, https://github.com/leslie-fang-intel	2024-04-03 18:06:45 +00:00
eqy	8b6b179a8a	[cuBLAS][cuBLASLt][FP8] Enforce restrictions on `amax` computation for scaled fp8 gemms (#122821 ) Word from `cuBLAS` is that `amax` computation is unsupported for non-fp8 outputs when the inputs are in fp8, even if it was "silently" executing in the past. CC @tinglvv Pull Request resolved: https://github.com/pytorch/pytorch/pull/122821 Approved by: https://github.com/vkuzo	2024-04-03 17:54:36 +00:00
Andrew M. James	bde1a93bc4	Add lowering for resize, decomp for resize_as. (#122317 ) This has been split off from #121354 as the inplace versions of these methods prove to be rather tricky. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122317 Approved by: https://github.com/peterbell10, https://github.com/lezcano	2024-04-03 17:47:29 +00:00
Sam Larsen	a8b9dcb9af	Skip test_artificial_grid_cpp_wrapper (#123211 ) Summary: This test is actually broken and probably succeeding by mistake because of a cache hit. Forcing a fresh cache or removing the errant setting causes a consistent failure. Disabling for now until we have time to investigate further. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123211 Approved by: https://github.com/desertfire	2024-04-03 17:27:55 +00:00
Tugsbayasgalan Manlaibaatar	8a0436014d	Support map in pre-dispatch functionalization (#121444 ) When we enter map_autograd, we try to trace through fwd/bwd of a map operator that is wrapped in ctx.functionalize wrapper. This forces us to go through PreDispatch functionalization again (only the python part). As a result, it revealed our previous bug where pre-dispatch mode handling doesn't actually manage the local dispatch key set. (If there is no active mode, we need to turn off PreDispatch key). This PR fixes that. Also I shuffled some APIs around so that there is less code duplication as the setting/unsetting logic is quite hard to get it right. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121444 Approved by: https://github.com/bdhirsh	2024-04-03 17:14:41 +00:00
Simon Fan	8ac0f072e6	[aot eager] Support frontend graphs with list arguments (#123212 ) We already support bumpy inputs for 3rd party frontend and compiled backward graph, we should add the behavior to aot_eager too Pull Request resolved: https://github.com/pytorch/pytorch/pull/123212 Approved by: https://github.com/jansel ghstack dependencies: #122691, #122746, #123007	2024-04-03 17:07:52 +00:00
Pearu Peterson	d895192e87	Fix zeros_like on sparse compressed fake tensors (#123084 ) Fixes https://github.com/pytorch/pytorch/pull/117907#issuecomment-2025769663 Adds block compressed sparse tensors support to zeros_like Pull Request resolved: https://github.com/pytorch/pytorch/pull/123084 Approved by: https://github.com/amjames, https://github.com/peterbell10	2024-04-03 16:11:11 +00:00
Kai Londenberg	74b3a7920e	[Inductor Cutlass backend] GEMM size threshold for Cutlass backend usage (#121491 ) * Adds a configurable GEMM size threshold for the usage of Cutlass GEMM Kernels _inductor.config.cutlass_backend_min_gemm_size * During GEMM algorithm choice generation: if no viable choices can be generated using the configured backends, the ATen backend will be used as a fallback backend, even if it is not enabled in _inductor.config.max_autotune_gemm_backends Test plan: CI Additional unit test in test_cutlass_backend.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/121491 Approved by: https://github.com/jansel ghstack dependencies: #121490	2024-04-03 13:34:16 +00:00
Kai Londenberg	f2e67179ee	[Inductor] Make codecache CUDA compilation more robust & flexible (#121490 ) Minor changes which make the CUDA compilation within _inductor/codecache.py more robust and flexible. Test plan: CI Additional test in test_codecache.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/121490 Approved by: https://github.com/jansel	2024-04-03 12:56:48 +00:00
PyTorch MergeBot	25ad90adc0	Revert "Support map in pre-dispatch functionalization (#121444 )" This reverts commit 9288b274611abc904a67d9cb02c837aa2cb769fd. Reverted https://github.com/pytorch/pytorch/pull/121444 on behalf of https://github.com/atalman due to New test test_aot_export_predispatch_map_1 is failing on windows ([comment](https://github.com/pytorch/pytorch/pull/121444#issuecomment-2034526949))	2024-04-03 12:55:23 +00:00
xinan.lin	957b8d5c00	[Inductor Intel GPU backend Upstream] Register general runtime device for Intel GPU (#121883 ) Following the RFC https://github.com/pytorch/pytorch/issues/114856, the Intel GPU Inductor backend uses a device-specific runtime API. To generalize this and reuse the existing general device interface, this PR registers the general device interface for Intel GPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121883 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jansel	2024-04-03 08:34:05 +00:00
Animesh Jain	3eb84b6343	[dynamo][cpp-guards] Init LocalState only when TENSOR_MATCH guard present (#123152 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123152 Approved by: https://github.com/jansel	2024-04-03 08:04:39 +00:00
Tristan Rice	68cffd19f6	DTensor: add ring attention for _scaled_dot_product_flash_attention (#122460 ) Ring attention support for _scaled_dot_product_flash_attention with DTensor. This assumes the query and key/value are sharded along the sequence length dimension. See the tests for example usage with PT Transformer as well as direct usage with _scaled_dot_product_flash_attention. ## Notable caveats * Numerical accuracy: The backwards pass doesn't match numerically with the non-chunked version but the forwards pass does. I assume this is due to accumulated errors. I've added a chunked version that uses autograd to verify that the distributed version matches the chunked version. * nn.Linear has incorrect behavior when running on a sharded tensor of size (bs, heads, seq_len, dim) with `Shard(2)` and does an unnecessary accumulate which requires `Replicate()` on QKV when using `nn.MultiHeadedAttention` to work around the issue. * If enabled, it forces sequence parallelism and doesn't interop with tensor parallelism. 
## SDPA usage ```py with attention_context_parallel(), sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]): dquery = distribute_tensor(query, device_mesh, [Shard(2)]) dkey = distribute_tensor(key, device_mesh, [Shard(2)]) dvalue = distribute_tensor(value, device_mesh, [Shard(2)]) dout: DTensor = torch.nn.functional.scaled_dot_product_attention( dquery, dkey, dvalue, is_causal=is_causal ) out = dout.to_local() ``` ## Transformer usage ```py with attention_context_parallel(), sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]): encoder_layer = nn.TransformerEncoderLayer( d_model=dim, nhead=nheads, dim_feedforward=dim, batch_first=True, ).to(dtype) encoder_layer = parallelize_module( module=encoder_layer, device_mesh=device_mesh, parallelize_plan={ "self_attn": ContextParallel(), }, ) model = nn.TransformerEncoder(encoder_layer, num_layers=num_layers) ``` ## Test plan ``` pytest test/distributed/_tensor/test_attention.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122460 Approved by: https://github.com/drisspg, https://github.com/wanchaol	2024-04-03 06:45:00 +00:00
Ke Wen	f06d77caba	[TP] Improve MLPStacked test (#123199 ) Improve tests per @wanchaol 's suggestions in #122968 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123199 Approved by: https://github.com/wanchaol	2024-04-03 06:14:49 +00:00
Yifu Wang	eb3a34d280	Optimize multi_tensor_apply (take 2) (#119764 ) ### Take 2 The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153: - Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication. - Ensure the optimization is compatible with cuda graph. ### Summary Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops. Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach: - When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments. - Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel. This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`. ### Benchmark (WIP) The only benchmark I've conducted so far on `_foreach_copy_` on a set of sizes that resembles internal workload. I need to benchmarks on more problem sizes. The speedup should vary among problem sizes. However, I believe this PR should not be slower than the previous impl on any problem sizes. The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa). 
Baseline A single iteration in trace: <img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json device ms: 1.111, cpu ms: 7.151 memory bandwidth: 1169.825 GB/s ``` This PR A single iteration in trace: <img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json device ms: 0.892, cpu ms: 0.810 memory bandwidth: 1456.744 GB/s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764 Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar	2024-04-03 05:54:49 +00:00
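The launch strategy described in #119764 can be sketched as a small planning function. This is a simplified model under stated assumptions: the function names are hypothetical, the argument size is treated as a single byte count, and the only figure taken from the commit message is the 4 KB static kernel argument memory. The real implementation lives in CUDA C++ and is not reproduced here.

```python
import math

# "static 4kb kernel argument memory" from the commit message.
STATIC_KERNEL_ARG_BYTES = 4 * 1024


def launches_when_chunking(arg_bytes, limit=STATIC_KERNEL_ARG_BYTES):
    """Previous approach (simplified): split the workload so that each
    kernel launch's arguments fit in the static argument memory."""
    return max(1, math.ceil(arg_bytes / limit))


def single_launch_plan(arg_bytes, limit=STATIC_KERNEL_ARG_BYTES):
    """Approach described in the commit (simplified): always one launch;
    only the way the arguments reach the device changes."""
    if arg_bytes <= limit:
        transfer = "static kernel argument memory"
    else:
        transfer = "page-locked cudaMemcpyAsync"
    return {"launches": 1, "arg_transfer": transfer}
```

For a hypothetical 10,000 bytes of arguments, the chunked plan needs three launches while the single-launch plan needs one, which is the occupancy difference the commit message describes.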
Animesh Jain	d91db70295	[dynamo][cpp-guards] Optimize tensor.grad accessor (#123226 ) For LayoutLM model, reduces C++ guard overhead by 1.48x. These are the numbers ![image](https://github.com/pytorch/pytorch/assets/13822661/25cfc35b-b67d-4903-8403-71fa931dacdd) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123226 Approved by: https://github.com/jansel	2024-04-03 05:32:13 +00:00
Tugsbayasgalan Manlaibaatar	9288b27461	Support map in pre-dispatch functionalization (#121444 ) When we enter map_autograd, we try to trace through fwd/bwd of a map operator that is wrapped in ctx.functionalize wrapper. This forces us to go through PreDispatch functionalization again (only the python part). As a result, it revealed our previous bug where pre-dispatch mode handling doesn't actually manage the local dispatch key set. (If there is no active mode, we need to turn off PreDispatch key). This PR fixes that. Also I shuffled some APIs around so that there is less code duplication as the setting/unsetting logic is quite hard to get it right. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121444 Approved by: https://github.com/bdhirsh	2024-04-03 03:28:14 +00:00
Lei,zhenyuan	15bd81bfaf	expose transformer header in cmake and wheel (#122586 ) expose transformer header in cmake and wheel, some utils functions are used in nested transformer development on IPEX side Pull Request resolved: https://github.com/pytorch/pytorch/pull/122586 Approved by: https://github.com/drisspg, https://github.com/Neilblaze, https://github.com/gujinghui	2024-04-03 02:27:40 +00:00
Andrew Gu	102c676418	[DTensor] Added some more foreach ops (#123214 ) These ops should already work with the existing strategy. We need these for precomputing fp32 -> fp8 casts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123214 Approved by: https://github.com/wz337 ghstack dependencies: #123142	2024-04-03 02:07:45 +00:00
Shivam Raikundalia	15529de901	Remove FlameGraph comment from export_stacks (#123102 ) Summary: In profiler.export_stacks there was a comment suggesting that the export was compatible with FlameGraph even though it isn't. We should remove this so that users are not confused. Test Plan: Removed comment Reviewed By: aaronenyeshi Differential Revision: D55501792 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123102 Approved by: https://github.com/aaronenyeshi	2024-04-03 01:09:13 +00:00
Wang, Eikan	2964b1ef21	Extend XPU merge rules: Add torch/csrc/xpu/, torch/xpu/ and test/xpu (#122856 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122856 Approved by: https://github.com/atalman, https://github.com/malfet	2024-04-03 00:52:08 +00:00
Bin Bao	0c6e8af257	[AOTI][refactor] Update some test cases (#123093 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123093 Approved by: https://github.com/Skylion007, https://github.com/chenyang78	2024-04-03 00:51:11 +00:00
Yifu Wang	0a2e0eb4c0	[functional collective rewrite] support rewriting reduce op for reduce_scatter_tensor (#122834 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122834 Approved by: https://github.com/yf225 ghstack dependencies: #122666	2024-04-03 00:48:24 +00:00
Yifu Wang	f15fd650b7	[funcol] add deprecation warning for the legacy backend (#122666 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122666 Approved by: https://github.com/yf225	2024-04-03 00:27:06 +00:00
Mu-Chu Lee	31aff29b79	Add clone if output is a view from constant. (#123200 ) Summary: For the original clone we did for output, we only clone when the corresponding tensor is a constant. We need this because we have to make sure the constants' ownership is maintained in the Model. However, we haven't included the case where the output is a view of a constant. Test Plan: Included in commit test_aot_inductor::test_return_view_constant Reviewed By: frank-wei, desertfire Differential Revision: D55645636 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123200 Approved by: https://github.com/chenyang78	2024-04-03 00:13:39 +00:00
ydwu4	c77352b5cc	Add torch._library.register_fake_class to fakify torchBind class (#122622 ) This PR only adds abstract class registration logic without touching existing tests so they still trace with real script object. The added tests are only for registration APIs and test error messages. Our design is that the abstract implementation should be in Python. This is much better in terms of usability. But this also has implications for custom op that takes script object as input, which is detailed later in this stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122622 Approved by: https://github.com/zou3519 ghstack dependencies: #122619, #122620, #122621	2024-04-02 23:52:17 +00:00
ydwu4	46c7235406	add tensor queue example (#122621 ) This PR adds a tensor queue example for later use. It doesn't touch any existing logic. It refactors the tests a little bit to avoid importing the library in unittest setUp. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122621 Approved by: https://github.com/zou3519 ghstack dependencies: #122619, #122620	2024-04-02 23:52:17 +00:00
ydwu4	5d6a447357	[torchbind] change to parametrized tests for pre_dispatch (#122620 ) Refactor the tests to make the test more robust. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122620 Approved by: https://github.com/zou3519 ghstack dependencies: #122619	2024-04-02 23:52:14 +00:00
ydwu4	071f23f4f3	[torchbind] redispatch call_torchbind in proxy dispatch mode (#122619 ) This allows proxy_mode to further dispatch to fake tensor mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122619 Approved by: https://github.com/zou3519	2024-04-02 23:52:11 +00:00
BowenBao	e3d80f2fa9	[ONNX] beartype to emit warning instead of error by default (#123205 ) Making exporter more "robust" to advances in beartype tool. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123205 Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi	2024-04-02 23:17:50 +00:00
Zhengxu Chen	b1aca36f4c	[export] Allow legacy IR to be unflattened with weaker submodule ordering. (#123192 ) Summary: In some cases we don't have information from the old IR about submodule ordering; in this case the unflattener should still work in best-effort mode. Differential Revision: D55642005 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123192 Approved by: https://github.com/angelayi	2024-04-02 23:08:55 +00:00
Jane Xu	d7fe0603a1	Move sparse tests to TestOptimRenewed (#123146 ) This is the last of the old TestOptim! With this change, everything will be migrated to use OptimizerInfo. Our sparse support is...well, sparse, and the tests try to best encapsulate which configs actually work. Note that support_sparse is actually just supports sparse grads...we don't test sparse params. 1. This PR fixes a bug in Adagrad multi_tensor with maximize by passing the correct value of maximize (vs False everytime) when sparse values are present. 2. This PR does improve coverage. There used to only be 2 configs each, and now we have the following configs for: Adagrad: ``` python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Adagrad /home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. _torch_pytree._register_pytree_node( {'maximize': True, 'lr': 0.1} {'initial_accumulator_value': 0.1, 'lr': 0.1} <--- this and above are CPU .{'foreach': False, 'lr': 0.1} {'foreach': True, 'lr': 0.1} {'maximize': True, 'foreach': False, 'lr': 0.1} {'maximize': True, 'foreach': True, 'lr': 0.1} {'initial_accumulator_value': 0.1, 'foreach': False, 'lr': 0.1} {'initial_accumulator_value': 0.1, 'foreach': True, 'lr': 0.1} . ---------------------------------------------------------------------- Ran 2 tests in 227.744s OK ``` SGD ``` (pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_SGD /home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. 
_torch_pytree._register_pytree_node( {'dampening': 0.5, 'lr': 0.0048} .{'foreach': False, 'lr': 0.0048} {'foreach': True, 'lr': 0.0048} {'dampening': 0.5, 'foreach': False, 'lr': 0.0048} {'dampening': 0.5, 'foreach': True, 'lr': 0.0048} . ---------------------------------------------------------------------- Ran 2 tests in 112.801s OK ``` SparseAdam ``` (pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Sparse /home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. _torch_pytree._register_pytree_node( {'maximize': True, 'lr': 0.04} .{'maximize': True, 'lr': 0.04} . ---------------------------------------------------------------------- Ran 2 tests in 35.113s OK ``` Fixes #103322. A side quest in this migration was to re-enable and track dynamo issues as they trigger on the optim tests, which will be complete from this PR. New tests may add more things to track in dynamo, but there is now an established system for doing so, and dynamo is either enabled or a bug is tracked for every migrated test in TestOptimRenewed. Next steps: Remove the hyperparameter constraints in common_optimizer.py defined by metadata_for_sparse (other than LR, which seems handpicked for the tests to actually pass). Doing this requires adding more sparse functionality. Add more tests! Maybe add more optimizers! Pull Request resolved: https://github.com/pytorch/pytorch/pull/123146 Approved by: https://github.com/albanD ghstack dependencies: #123134, #123139	2024-04-02 22:51:02 +00:00
Jane Xu	f2838c99a0	Add a tensor lr test for optimizers (#123139 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123139 Approved by: https://github.com/albanD ghstack dependencies: #123134	2024-04-02 22:51:02 +00:00
Jane Xu	cb8fc30e4a	Move LRScheduler integration tests to OptimizerInfo (#123134 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123134 Approved by: https://github.com/albanD	2024-04-02 22:51:02 +00:00
Simon Fan	12e36dc1df	[dynamo] Fix torch._dynamo.disable on flatten_graph_inputs wrapper (#123007 ) Existing `innermost_fn` handling of `functools.wraps` is not ideal, but I'm not sure if there's a good fix. This can manifest for GmWrapper (used to handle list inputs from Dynamo -> AOTAutograd) where we don't call the unflatten wrapper at runtime. Since core parts of Dynamo rely on an attribute check for `_torchdynamo_orig_callable`, I'm adding a test to cover it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123007 Approved by: https://github.com/jansel ghstack dependencies: #122691, #122746	2024-04-02 21:39:44 +00:00
Aidyn-A	71085983ae	[c10d] [NCCL] Fix work handle for coalescing manager (#122849 ) Fixes #122807 The work handle of the coalescing job will be populated: ```python with dist._coalescing_manager(group=pg_nccl, device=device, async_ops=True) as cm: dist.all_reduce(a) dist.all_reduce(b) print(len(cm.works)) # prints 1 cm.wait() # actually waits ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122849 Approved by: https://github.com/kwen2501	2024-04-02 21:25:16 +00:00
Ke Wen	5027ef7e9c	[TP] Add wildcard support (#122968 ) Adding wildcard support for TP's `parallelize_module` API. Example patterns: `layers.*.linear`: any characters `layers.?.linear`: single character `layers.[1-2]`: digit range, matches `layers.1` and `layers.2` Example use case: A model has multiple layers, and we want to parallelize the linear module `lin` inside each layer. ``` model_tp = parallelize_module( model, device_mesh, { "layers.*.lin": ColwiseParallel(), }, ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122968 Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/wanchaol ghstack dependencies: #122919	2024-04-02 21:23:39 +00:00
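The pattern kinds listed in #122968 (any characters, a single character, a digit range) behave like shell-style globs. The sketch below illustrates that matching with the standard library's `fnmatch` module, applied to each dot-separated path component. It is an assumption that this mirrors what `parallelize_module` does internally; the helper `matches` is hypothetical and only demonstrates the pattern semantics.

```python
from fnmatch import fnmatchcase


def matches(pattern, module_path):
    """Hypothetical helper: glob-match a dotted module path one component
    at a time, so a wildcard never spans a '.' separator."""
    pattern_parts = pattern.split(".")
    path_parts = module_path.split(".")
    if len(pattern_parts) != len(path_parts):
        return False
    return all(fnmatchcase(part, pat) for pat, part in zip(pattern_parts, path_parts))


def select(pattern, module_paths):
    """Return the module paths that the pattern selects."""
    return [path for path in module_paths if matches(pattern, path)]
```

For example, `select("layers.?.lin", ["layers.1.lin", "layers.10.lin"])` keeps only `layers.1.lin`, because `?` stands for exactly one character.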
Aart Bik	35f4d70240	[sparse] proper sparse iteration (#123128 ) The branches were in the wrong order (since sparse tensors will also be instances of regular tensors). This puts the branches in the right order. This is a small step towards #117188 @pearu to review (this was split off https://github.com/pytorch/pytorch/pull/117907) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123128 Approved by: https://github.com/pearu, https://github.com/peterbell10	2024-04-02 20:52:48 +00:00
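The branch-ordering problem in #123128 is a general one: when one kind of value also satisfies the check for a more general kind, the specific check has to come first. The sketch below uses hypothetical stand-in classes (`Tensor`, `SparseTensor`) and an `isinstance` test purely for illustration; the actual PyTorch code and its checks are not reproduced here.

```python
class Tensor:
    """Hypothetical stand-in for a regular tensor."""


class SparseTensor(Tensor):
    """Hypothetical stand-in: every sparse tensor is also a Tensor."""


def kind_wrong_order(x):
    # The general check comes first, so the sparse branch is never reached.
    if isinstance(x, Tensor):
        return "dense"
    elif isinstance(x, SparseTensor):
        return "sparse"
    return "other"


def kind_right_order(x):
    # The more specific check comes first, then the general one.
    if isinstance(x, SparseTensor):
        return "sparse"
    elif isinstance(x, Tensor):
        return "dense"
    return "other"
```

With the wrong order, a `SparseTensor()` is silently treated as dense; swapping the branches is the whole fix.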
pbialecki	19c2ed15c0	update submodule onnx==1.16.0 (#123125 ) Fixes #121258 CC @malfet @atalman Pull Request resolved: https://github.com/pytorch/pytorch/pull/123125 Approved by: https://github.com/malfet	2024-04-02 20:41:22 +00:00
Michael Lazos	8244ee00cf	Add fuzzer instructions to pt2 bug template (#123156 ) Adds fuzzer instructions to our issue template Pull Request resolved: https://github.com/pytorch/pytorch/pull/123156 Approved by: https://github.com/eellison, https://github.com/anijain2305	2024-04-02 20:33:01 +00:00
Bin Bao	0ff6155eee	[AOTI] Support module buffer mutation (#123164 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/120424. Because in a forward pass module buffers may be mutated, we need to allow that in AOTI. In addition, this will be a necessary step if we want to extend AOTI to training. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123164 Approved by: https://github.com/digantdesai, https://github.com/malfet, https://github.com/chenyang78, https://github.com/khabinov	2024-04-02 20:25:26 +00:00
Andrew Gu	a9a9ce6d9c	[ez][FSDP2] Removed `_contiguous_orig_stride` (#123142 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123142 Approved by: https://github.com/yifuwang	2024-04-02 20:18:27 +00:00
Lucas Pasqualin	bcb6e5aa72	[DCP] Support partial load (#122829 ) Adds ability to load a subset of keys directly from a checkpoint, avoiding the need to initialize state dict first Differential Revision: [D55441391](https://our.internmc.facebook.com/intern/diff/D55441391/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122829 Approved by: https://github.com/fegin	2024-04-02 19:22:22 +00:00
PyTorch MergeBot	feabb645a7	Revert "Handle transposes in second batch of matrices in bmm (#122194 )" This reverts commit 251ad1232b094d5ea0b641907e03bfd8a2011b61. Reverted https://github.com/pytorch/pytorch/pull/122194 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/122194#issuecomment-2032806360))	2024-04-02 18:49:28 +00:00
eqy	1c61401086	[cuBLAS] Fix typo in `CUDA_VERSION` `ifdef` for explicit workspace allocation (#123114 ) 12.2 is 12020, not 12200, oops... CC @malfet @atalman @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/123114 Approved by: https://github.com/peterbell10	2024-04-02 18:34:27 +00:00
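The typo fixed in #123114 comes from how the `CUDA_VERSION` macro encodes a release: the value is `major * 1000 + minor * 10`. A tiny Python sketch of that arithmetic (the helper name is hypothetical):

```python
def cuda_version_macro(major, minor):
    """Integer value of the CUDA_VERSION macro for a given release."""
    return major * 1000 + minor * 10


def release_from_macro(value):
    """Inverse mapping: macro value back to (major, minor)."""
    return value // 1000, (value % 1000) // 10
```

So CUDA 12.2 is `12020`, while `12200` would decode to a non-existent 12.20 release and the guarded code path would not be enabled for 12.2.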
Boyuan Feng	64d743044d	Add inline constraints to non-strict exported program (#123017 ) Summary: This PR reduces the difference between strict and non-strict exported programs by supporting inline_constraints for non-strict exported programs. Test Plan: CI Differential Revision: D55547830 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123017 Approved by: https://github.com/angelayi	2024-04-02 18:16:16 +00:00
William Wen	d17eea9c0f	[dynamo] fix broken 3.11+ windows build failure (#123104 ) e.g. https://github.com/pytorch/pytorch/actions/runs/8478510063/job/23230951466#step:12:23296 Caused by https://github.com/pytorch/pytorch/pull/122335 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123104 Approved by: https://github.com/atalman	2024-04-02 17:52:14 +00:00
Kulin Seth	251ad1232b	Handle transposes in second batch of matrices in bmm (#122194 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/122194 Approved by: https://github.com/DenisVieriu97	2024-04-02 17:48:35 +00:00
Gao Tianlin	aaef246c74	remove log2 decomposition; add log2 lowering (#123112 ) Same reason as `log10`. `log2` is a core aten op, we should not decompose it. As https://github.com/pytorch/pytorch/pull/110882 suggested, it often maps to a hardware intrinsic; Furthermore, decomposing it will negatively impact the numerical precision of the output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123112 Approved by: https://github.com/peterbell10	2024-04-02 16:16:26 +00:00
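The precision argument in #123112 can be illustrated with the standard library alone. The sketch below compares a direct `log2` with one possible decomposition, `log(x) / log(2)`; that particular decomposition is an assumption chosen for illustration and is not necessarily the one the removed decomposition used.

```python
import math


def log2_decomposed(x):
    """One possible decomposition of log2 in terms of the natural log."""
    return math.log(x) / math.log(2.0)


def max_abs_error_on_powers_of_two(fn, max_exp=60):
    """Largest |fn(2**k) - k| for k = 0..max_exp, where the exact answer is k."""
    return max(abs(fn(2.0 ** k) - k) for k in range(max_exp + 1))


if __name__ == "__main__":
    print("direct log2 error:    ", max_abs_error_on_powers_of_two(math.log2))
    print("decomposed log2 error:", max_abs_error_on_powers_of_two(log2_decomposed))
```

The direct call is exact on powers of two, whereas the decomposed form goes through two rounded logarithms and a division, so it can pick up small rounding errors.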
eellison	5f46312dbb	Reapply "Switch cudagraph backend to cudagraph trees (#121019 )" and "Add Cudagraphs disable checking (#121018 )" (#121864 ) (#122713 ) This reverts commit 92ed8553a65808682aeca59e3cb5823cf2d52839. No longer importing codecache or boxed_nop at top level, both of which caused issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122713 Approved by: https://github.com/anijain2305	2024-04-02 16:11:00 +00:00
soulitzer	638b003cb7	[NJT] .to() properly updates device of offsets (#122797 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122797 Approved by: https://github.com/jbschlosser	2024-04-02 16:07:27 +00:00
lezcano	b27ee6548d	Add a Dynamo deepdive to documentation (#122305 ) This supersedes the previous "Guards Overview" as a more comprehensive approach to most of the main topics within Dynamo. In the future, we could add specific sections for each of the topics discussed here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122305 Approved by: https://github.com/msaroufim	2024-04-02 15:08:08 +00:00
Menglu Yu	c40f386afd	[Inductor][1/n]Split cat customization (#123045 ) Summary: Change the config and revise the group batch fusion in order not to reuse the existing pre_grad and post_grad fusion options Test Plan: # unit test ``` buck2 test @mode/dev-nosan //caffe2/test/inductor:split_cat_fx_passes ``` Test UI: https://www.internalfb.com/intern/testinfra/testrun/17732923560510096 Network: Up: 15MiB Down: 155MiB (reSessionID-6a577a14-1772-42d9-9ae8-bfdc62f406a3) Jobs completed: 267487. Time elapsed: 2:39.7s. Cache hits: 99%. Commands: 104465 (cached: 104457, remote: 8, local: 0) Tests finished: Pass 11. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` buck2 test @mode/dev-nosan //caffe2/test/inductor/fb:split_cat_fx_passes_fb ``` Test UI: https://www.internalfb.com/intern/testinfra/testrun/9007199283031382 Network: Up: 28MiB Down: 177MiB (reSessionID-a3081518-7cba-4c83-b442-c16655ecb2cd) Jobs completed: 183164. Time elapsed: 1:41.4s. Cache hits: 99%. Commands: 75875 (cached: 75862, remote: 12, local: 1) Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` buck2 test @mode/dev-nosan //caffe2/test/inductor:group_batch_fusion ``` Test UI: https://www.internalfb.com/intern/testinfra/testrun/10133099189612276 Network: Up: 1.3MiB Down: 3.1MiB (reSessionID-0d312a2d-e19e-4ba6-9f96-7eb5863734e7) Discovered 9. Pass 0. Fail 0. Fatal 0. Skip 0. Timeout 0 Network: Up: 1.4MiB Down: 3.2MiB (reSessionID-0d312a2d-e19e-4ba6-9f96-7eb5863734e7) Jobs completed: 68. Time elapsed: 2:19.9s. Cache hits: 0%. Commands: 13 (cached: 0, remote: 1, local: 12) Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` buck2 test @mode/dev-nosan //caffe2/test/inductor:perf ``` Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549804623287 Network: Up: 1.5MiB Down: 1.1MiB (reSessionID-8d912a20-fceb-4698-89c3-d28e0708831f) Jobs completed: 164. Time elapsed: 1:42.2s. Cache hits: 0%. Commands: 13 (cached: 0, remote: 1, local: 12) Tests finished: Pass 57. Fail 0. 
Fatal 0. Skip 0. Build failure 0 # local reproduce case 1: with split cat ``` buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf" --flow_id 524546542 ``` optimus parameter sent to the scuba: ``` {'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLL6RBZb-ssXJYcBAMzw0oaKtp80br0LAAAz', 'BatchLayernormFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GH1LAxcxv0Ae_BkFAHVav3K3oosDbr0LAAAz', 'BatchSigmoidPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GNb0jwR-Ukkqns4CAGRmOqucfedDbr0LAAAz', 'normalization_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GHsIQxm-hn3SPrgCAKq1E-HBsoZHbr0LAAAz', 'remove_split_with_size_one_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GOrJORmbMTV_xlQDAOwolqclPsIAbr0LAAAz', 'merge_getitem_cat_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GCqkmRblvVKybGUDACVxkwVIrWxLbr0LAAAz', 'merge_splits_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GCB1QBfko_kVN0wFAKGjSZv4DJULbr0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMwJPRmu4ry88swDAO1gdA5RCKIXbr0LAAAz', 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLXCORnNiKeQFmoDABR93CRKmP8Sbr0LAAAz', 'BatchMulPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GBMIPRnlwQyjSD4BANPuaMhV7MUjbr0LAAAz', 'after_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GJ9KPxkOv4LL8_0DAA65D4kh4JYDbr0LAAAz', 'inductor': Counter({'pattern_matcher_nodes': 2844, 'pattern_matcher_count': 2604, 'normalization_pass': 886, 'remove_split_with_size_one_pass': 748, 'merge_splits_pass': 82, 'merge_getitem_cat_pass': 11, 'scmerge_split_sections_removed': 4, 'batch_aten_mul': 4, 'batch_sigmoid': 2, 'batch_aten_sub': 2, 
'batch_layernorm': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'scmerge_split_removed': 1, 'scmerge_cat_removed': 1, 'batch_aten_add': 1}), 'BatchAddPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GEcvPxmxBj-pd8gCABE1QgB-d6N6br0LAAAz', 'BatchSubPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GEvQxhYomJGj2FMBAEXXAI8Vgzhmbr0LAAAz'} ``` P1202819405 case 2: without split cat ``` buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch --model_type "cmf" --flow_id 524546542 ``` optimus parameter sent to the scuba: ``` {'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GAY7PxmGthuyjSwEAHF_A767YbMkbr0LAAAz', 'BatchLayernormFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLDPtBacXyybEOICAKaGCPatq5oabr0LAAAz', 'BatchSigmoidPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GBu7ORkiDJu42QAEAGmlVTgO_Mpbbr0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GC893BZNl99ftY4BAHm5Z8sM4ptSbr0LAAAz', 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GCAeuRYgzPO5RcsCAPO3Z7tdMNMKbr0LAAAz', 'BatchMulPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GHBIQxm1jlU-xhsFAONkzhh2mgknbr0LAAAz', 'after_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GDoUPhmZ0noiaGMDAJHYuuiwHEAUbr0LAAAz', 'inductor': Counter({'pattern_matcher_nodes': 1189, 'pattern_matcher_count': 757, 'batch_aten_mul': 9, 'batch_aten_sub': 3, 'batch_sigmoid': 2, 'batch_aten_add': 2, 'batch_layernorm': 1, 'batch_linear_post_grad': 1}), 'BatchAddPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GAluthYxi8uxpI4BAIQDzn3OyywUbr0LAAAz', 'BatchSubPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GDjsJhTK5VAcot4CADIcAixghrYibr0LAAAz', 
'PostGradBatchLinearFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GEPfJxfJwktC7wsEAA0QbkqYNuVAbr0LAAAz'} ``` P1202823734 # e2e training_platform:fd4f02cd855f5cc0ccb49317a5a6c8bb with split cat f546646358 without split cat f546647159 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123045 Approved by: https://github.com/jackiexu1992	2024-04-02 14:36:22 +00:00
PyTorch MergeBot	1f503dffb3	Revert "[aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation (#123136 )" This reverts commit 7eadb157bd96a9e641f64cdfa759aa1dfaaa7dd5. Reverted https://github.com/pytorch/pytorch/pull/123136 on behalf of https://github.com/albanD due to broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/123136#issuecomment-2032163699))	2024-04-02 14:17:03 +00:00
Pearu Peterson	72662bf05b	[BE] Add torch.ops.aten._sparse_compressed_tensor_with_dims (#123083 ) Used in https://github.com/pytorch/pytorch/pull/123084 and allows simplifying `empty_like` implementation for sparse compressed tensors (see https://github.com/pytorch/pytorch/pull/121900#issuecomment-2029835473). Pull Request resolved: https://github.com/pytorch/pytorch/pull/123083 Approved by: https://github.com/cpuhrsch	2024-04-02 10:12:21 +00:00
Catherine Lee	f9b2ffa7c4	Forward fix lint after #119727 (#123137 ) After #119727 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123137 Approved by: https://github.com/albanD	2024-04-02 09:35:20 +00:00
Yang Chen	7eadb157bd	[aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation (#123136 ) After we codegen a triton kernel in the triton codegen backend, we cache the generated triton source code in the wrapper to avoid producing multiple triton kernels with the same content. In AOTI compilation flow, this caching mechanism imposes a strong requirement on the codegen that we must generate the same triton source code for the same schedule node in both python and cpp codegen phases. Otherwise, we would end up with a mismatch between the kernel name formed in the cpp codegen and the cuda kernel key produced from the python codegen. Consequently, we would hit a missing-cuda-kernel error. The precomputed symbol replacements saved in V.graph.sizevars can cause such source-code inconsistency related to the code for indexing tensors. For example, let's say in the python codegen phase, we produce "ks2*48" as part of indexing an input for schedule node A while yielding a replacement pair "ks0 -> ks2*48" in the precomputed replacements. In the second cpp codegen phase, we would produce "ks0" for the same indexing code of schedule node A due to the "ks0 -> ks2*48" replacement pair. This PR fixed the issue by clearing precomputed_replacements and inv_precomputed_replacements before cpp wrapper codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123136 Approved by: https://github.com/desertfire	2024-04-02 09:00:05 +00:00
Animesh Jain	969bbf8e82	[dynamo][guards] Skip aliasing guards for optimizers (#123044 ) I am ok if people don't want this PR to be merged. For optimizers, we know that the state dict and param_group have the same parameters. So, I think it's ok to skip TENSOR_MUST_ALIAS guards. Similarly for state tensors, all of them are different. Therefore, we can skip the tensor aliasing guards. With this PR, these are the numbers for Megatron which has 394 parameters <img width="290" alt="image" src="https://github.com/pytorch/pytorch/assets/13822661/0ce75dc6-4299-46bb-bf3c-7989ebc7cfc4"> C++ numbers jump a lot because of 2 reasons 1) We are now not doing INCREF/DECREF for a large number of tensors. 2) For python guards, we can expect higher numbers but that requires some more plumbing because the Python tensor guards are all collapsed into one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123044 Approved by: https://github.com/jansel, https://github.com/mlazos	2024-04-02 08:51:00 +00:00
Scott Roy	1d52c2d985	Add vec256_half_neon (#122918 ) Summary: Add `vec256_half_neon.h` Currently not used anywhere. Differential Revision: D55429392 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122918 Approved by: https://github.com/mikekgfb	2024-04-02 04:02:57 +00:00
Shuqiang Zhang	7a934e4031	[c10d] dump on any exception (timeout + nccl error) (#123023 ) Summary: Existing flight recorder dumping logic is: dump only on timeout, but not on NCCL error. This resulted in the faulty ranks missing dumps when NCCL error happens. So in this PR, we revise the logic of dump such that records are dumped when any exception is detected. Exception could be 1. NCCL async errors. 2. watchdog timeout Also the existing code tends to mix the logic of flight recorder dump and desync debug, which is not desirable. We dump the desync debug report only when timeout is detected. Test Plan: Added a new unit test to trigger nccl error and dump, and make sure the dump is triggered by the error. Also existing dump on timeout tests should still pass. sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (84bf9d4c)]$ python test/distributed/test_c10d_nccl.py NcclErrorDumpTest NCCL version 2.19.3+cuda12.0 [E329 19:15:11.775879730 ProcessGroupNCCL.cpp:565] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=10, NumelOut=10, Timeout(ms)=10000) ran for 10028 milliseconds before timing out. [E329 19:15:11.777459894 ProcessGroupNCCL.cpp:1561] [PG 0 Rank 0] Exception hit in NCCL work: 2 [E329 19:15:12.660717323 ProcessGroupNCCL.cpp:1332] [PG 0 Rank 0] Received a timeout signal from this local rank and will start to dump the debug info. Last enqueued NCCL work: 2, last completed NCCL work: 1. [E329 19:15:12.660932242 ProcessGroupNCCL.cpp:1167] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info. [E329 19:15:12.661192990 ProcessGroupNCCL.cpp:1174] [PG 0 Rank 0] ProcessGroupNCCL dumping nccl trace to /tmp/tmp06psqil3/trace_0 [F329 19:15:12.661485601 ProcessGroupNCCL.cpp:1185] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog detected a collective timeout from the local rank. 
This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc. We tried our best to dump the debug info into the storage to help you debug the issue. Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/123023 Approved by: https://github.com/wconstab	2024-04-02 03:16:54 +00:00
willfengg	f1c4d0fb2c	[dynamo] show inlining reasons from trace_rules (#123014 ) show specific inlining reasons with ``TORCH_LOGS="+dynamo" TORCHDYNAMO_VERBOSE=1`` * before, ``INLINING <code...>, inlined according trace_rules.lookup`` * after, ``INLINING <code...> inlined according trace_rules.lookup MOD_INLINELIST`` this can distinguish between inlining by default or by MOD_INLINELIST (specific rule) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123014 Approved by: https://github.com/jansel ghstack dependencies: #123013	2024-04-02 03:04:22 +00:00
Ke Wen	0a038cf0cf	[TP] Avoid splitting path twice (#122919 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122919 Approved by: https://github.com/awgu, https://github.com/wanchaol	2024-04-02 02:06:11 +00:00
Jane Xu	9d9d2af786	[BE] Move tests using functional API to OptimizerInfo (#122822 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122822 Approved by: https://github.com/albanD	2024-04-02 01:35:59 +00:00
Xu Zhao	597f479643	Add torchbench on-demand test workflow (#122624 ) When Torchbench discovers a regression, the PR author would like to know if their fix can pass the test, e.g., https://github.com/pytorch/pytorch/issues/122575 We are adding an on-demand ciflow to test Torchbench models if that is required by the user. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122624 Approved by: https://github.com/huydhn	2024-04-02 01:11:01 +00:00
Merlin Lüdicke	fdc281f258	[inductor] lower min SM requirement for gemm autotuning to 68 (#123121 ) Lower the minimum number of CUDA SMs required for GEMM autotuning from V100 to 3080 level, allowing some high-end consumer GPUs to benefit as well. Fixes #109489 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123121 Approved by: https://github.com/jansel	2024-04-02 00:28:59 +00:00
Shunting Zhang	12ced0f986	make user defined triton kernel work with new ASTSource.make_ir API (#123124 ) User defined triton kernel calls `ASTSource.make_ir`. Triton recently added an extra required argument to this API, which makes the call in PyTorch's user-defined triton kernel code fail. This PR makes PyTorch work with both the old and new versions of the API. Test: ``` python test/inductor/test_aot_inductor.py -k test_triton_kernel_equal_to_1_arg_abi_compatible_cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123124 Approved by: https://github.com/oulgen, https://github.com/jansel ghstack dependencies: #123076	2024-04-02 00:18:43 +00:00
Yang Chen	bc65c98588	[AOTI] enabled a couple of tests for CPUs (#122992 ) Looks like some tests already work for CPUs so we enable those. Also added links to the relevant issues for the skipped cpu tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122992 Approved by: https://github.com/desertfire	2024-04-01 23:40:53 +00:00
Peter Bell	09c72eaa3f	[inductor] Remove identity from ops.scan (#119727 ) Currently scan has an `init` argument which must be the identity of the combine function. This isn't strictly necessary if we are more careful about keeping track of the first element and avoid combining it with anything. This does additionally require that there are no active load masks, since we can't do the `where_cond` any more. However, this shouldn't be possible anyway since scans are always realized and only fused via the scheduler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119727 Approved by: https://github.com/lezcano	2024-04-01 22:47:26 +00:00
Yakai Wang	4d5cdc2e1e	Fix empty_like bug for sparse tensors. (#121900 ) Fixes #121671 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121900 Approved by: https://github.com/pearu	2024-04-01 22:40:38 +00:00
Catherine Lee	891994fd1b	Update dynamo test failures list (#123111 ) After #122728 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123111 Approved by: https://github.com/janeyx99, https://github.com/zou3519	2024-04-01 22:25:02 +00:00
Will Feng	489f4a063b	Revert "Preserve unbacked SymInt on SymNode (#120816 )" (#122988 ) This reverts commit 476585b190b16f6b27369679f7e19df9e2d8f073. I did a bisect and this seems to be the cause of compile time regression in cudagraphs_dynamic test suite between 03/23 and 03/24: ![image](https://github.com/pytorch/pytorch/assets/4063635/21394e06-4906-4690-b5a2-7d16cc475843) Particularly BERT_pytorch and hf_T5 seem to have ~50% compile time regression. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122988 Approved by: https://github.com/eellison	2024-04-01 22:11:09 +00:00
eellison	8b49782ba6	[Inductor] require channels last output for channels last input for max_pool2d_backward (#122749 ) Previously we fell back on max_pool2d_with_indices_backward for channels last. Turns out this was slow because we were inferring a contiguous output for channels last inputs. Fixing the layout and lowering gives a 1-2% TIMM win. It will also unblock saving the indices as int8 kernel offsets since we now lower channels last output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122749 Approved by: https://github.com/Chillee, https://github.com/amjames, https://github.com/jansel, https://github.com/shunting314	2024-04-01 22:02:00 +00:00
willfengg	d765e223ac	[dynamo][PT2D] avoid skipping dynamo_resume_* in torch/testing/_internal (#123013 ) this PR ensures ``dynamo_resume_`` survives ``trace_rules.py``. As a ground truth, modules defined outside of ``pytorch/torch`` folders can survive ``trace_rules.py`` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123013 Approved by: https://github.com/jansel	2024-04-01 21:12:48 +00:00
Animesh Jain	5d0ac887b9	[dynamo][higher order ops] Make the subgraph sourceless (#123071 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123071 Approved by: https://github.com/jansel, https://github.com/zou3519 ghstack dependencies: #123046, #123058, #123059	2024-04-01 21:09:41 +00:00
Animesh Jain	69fa28f483	[dynamo][cpp-guards] Enable a few tests to prevent frequent regressions (#123059 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123059 Approved by: https://github.com/jansel ghstack dependencies: #123046, #123058	2024-04-01 21:09:41 +00:00
Animesh Jain	234287aa16	[dynamo][cpp-guards] DUAL_LEVEL guard (#123058 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123058 Approved by: https://github.com/jansel ghstack dependencies: #123046	2024-04-01 21:09:38 +00:00
Animesh Jain	ffd1e4e9ba	[dynamo][cpp-guards] Always Reset relational guards (#123046 ) Reset guard at the end of RootGuardManager, even if the result is true. Earlier we reset only when result was False. But this causes extra bookkeeping in each guard. This PR gives a small improvement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123046 Approved by: https://github.com/jansel	2024-04-01 21:09:35 +00:00
Chirag Pandya	c4cbad4106	Fix broken test (#123034 ) Summary: Prior commit #122921 broke one unit test. I renamed log->logger for consistency but forgot to make a similar change in this one unit test. Test Plan: Test passes after fix ``` [cpio@devvm17556.vll0 /data/users/cpio/fbsource (134660074\|remote/fbcode/warm)]$ buck2 test '@fbcode//mode/opt' fbcode//caffe2/test/distributed/elastic/multiprocessing:tail_log_test -- --exact 'caffe2/test/distributed/elastic/multiprocessing:tail_log_test - test_tail_logfile_error_in_tail_fn (tail_log_test.TailLogTest)' File changed: fbcode//caffe2/test/distributed/elastic/multiprocessing/tail_log_test.py Buck UI: https://www.internalfb.com/buck2/19aeef9f-1d93-4505-975b-ecb205f3aad9 Test UI: https://www.internalfb.com/intern/testinfra/testrun/5348024782558243 Network: Up: 11KiB Down: 15KiB (reSessionID-2b0989aa-3fe5-4e9a-943e-36625a0c4969) Jobs completed: 7. Time elapsed: 8.0s. Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1) Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0 [cpio@devvm17556.vll0 /data/users/cpio/fbsource (134660074\|remote/fbcode/warm)]$ ``` Reviewers: wanchaol Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/123034 Approved by: https://github.com/fegin, https://github.com/wanchaol	2024-04-01 20:47:47 +00:00
Shunting Zhang	f461444be8	[inductor] make inductor work with new triton kernel launch API (#123076 ) Triton changed its kernel launch API recently. Adapt inductor side call site to make it work with both old and new triton APIs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123076 Approved by: https://github.com/desertfire, https://github.com/jansel	2024-04-01 20:32:52 +00:00
Xinya Zhang	76a87e33a0	Remove cuda dependencies when building AOTriton (#122982 ) Downloading CUDA sometimes fails and breaks the build process, but AOTriton does not need these packages for its own Triton fork. This commit comments out the related downloading scripts. The actual changes from Triton can be found at: `9b73a543a5` Fixes the following building error ``` [2/6] cd /var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python && /opt/conda/envs/py_3.8/bin/cmake -E env VIRTUAL_ENV=/var/lib/jenkins/workspace/build/aotriton/build/venv PATH="/var/lib/jenkins/workspace/build/aotriton/build/venv/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.8/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" TRITON_BUILD_DIR=/var/lib/jenkins/workspace/build/aotriton/build/triton_build python setup.py develop FAILED: CMakeFiles/aotriton_venv_triton /var/lib/jenkins/.local/lib/python3.8/site-packages/triton/_C/libtriton.so /var/lib/jenkins/workspace/build/aotriton/build/CMakeFiles/aotriton_venv_triton cd /var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python && /opt/conda/envs/py_3.8/bin/cmake -E env VIRTUAL_ENV=/var/lib/jenkins/workspace/build/aotriton/build/venv PATH="/var/lib/jenkins/workspace/build/aotriton/build/venv/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.8/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" TRITON_BUILD_DIR=/var/lib/jenkins/workspace/build/aotriton/build/triton_build python setup.py develop downloading and extracting https://conda.anaconda.org/nvidia/label/cuda-12.1.1/linux-64/cuda-nvcc-12.1.105-0.tar.bz2 ... downloading and extracting https://conda.anaconda.org/nvidia/label/cuda-12.1.1/linux-64/cuda-cuobjdump-12.1.111-0.tar.bz2 ... 
Traceback (most recent call last): File "/var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python/setup.py", line 325, in <module> download_and_copy( File "/var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python/setup.py", line 151, in download_and_copy ftpstream = urllib.request.urlopen(url) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.12/urllib/request.py", line 215, in urlopen return opener.open(url, data, timeout) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.12/urllib/request.py", line 521, in open response = meth(req, response) ^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.12/urllib/request.py", line 630, in http_response response = self.parent.error( ^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.12/urllib/request.py", line 559, in error return self._call_chain(*args) ^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.12/urllib/request.py", line 492, in _call_chain result = func(*args) ^^^^^^^^^^^ File "/opt/conda/lib/python3.12/urllib/request.py", line 639, in http_error_default raise HTTPError(req.full_url, code, msg, hdrs, fp) urllib.error.HTTPError: HTTP Error 524: ninja: build stopped: subcommand failed. ``` Example of failed build log: https://github.com/pytorch/pytorch/actions/runs/8483953034/job/23245996425 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122982 Approved by: https://github.com/jansel	2024-04-01 17:50:35 +00:00
Richard Barnes	c422bce131	[codemod] Fix some namespace issues in caffe2 (#121847 ) Summary: Removes `using namespace` from a header file. Having `using namespace` in a header file is always a bad idea. A previous raft of diffs provided appropriate qualifications to everything that relied on this `using namespace`, so it is now safe to remove it in this separate diff. Helps us enable `-Wheader-hygiene`. Test Plan: Sandcastle Reviewed By: dmm-fb Differential Revision: D54838298 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121847 Approved by: https://github.com/Skylion007	2024-04-01 17:45:16 +00:00
Catherine Lee	533c1b6c49	Disable vulkan logsoftmax test (#123103 ) Ex https://github.com/pytorch/pytorch/actions/runs/8509797936/job/23306567177 The failure was only surfaced after #122845 (the bug fix to surface cpp test failures) so I don't know when it started Pull Request resolved: https://github.com/pytorch/pytorch/pull/123103 Approved by: https://github.com/kit1980	2024-04-01 17:41:59 +00:00
Wanchao Liang	d7a274e1b0	[dtensor] switch aten.t to use op strategy (#122950 ) as titled Pull Request resolved: https://github.com/pytorch/pytorch/pull/122950 Approved by: https://github.com/awgu, https://github.com/tianyu-l ghstack dependencies: #122929, #122949	2024-04-01 17:39:43 +00:00
Wanchao Liang	9e1447dad6	[dtensor] make sure expected input spec have correct tensor meta (#122949 ) as titled. Previously we could return an expected input spec that was shared by multiple args. This is not OK since different args might have different tensor metas; it only worked before because redistribute becomes a no-op in these cases. This PR fixes it by making each expected input spec shallow clone the corresponding input metadata Pull Request resolved: https://github.com/pytorch/pytorch/pull/122949 Approved by: https://github.com/tianyu-l ghstack dependencies: #122929	2024-04-01 17:39:42 +00:00
Wanchao Liang	afee5bea92	[dtensor] refactor schema suggestions in output sharding (#122929 ) This PR refactors the schema_suggestions in OutputSharding to be a single OpSchema instead of a list of schemas. In practice we only ever have one, and the multiple-resharding case has also moved to OpStrategy, so nothing needs it to be a list Pull Request resolved: https://github.com/pytorch/pytorch/pull/122929 Approved by: https://github.com/tianyu-l	2024-04-01 17:39:39 +00:00
Zhengxu Chen	b4c810491e	[export] Temporarily block mutating ops in quant tests. (#122863 ) Summary: After we migrate to torch.export, we won't see ops like add_ and mul_ due to functionalization. We are rolling out pre dispatch export, so for now we just skip those mutating ops in tests. Test Plan: buck run mode/opt caffe2/test/quantization:test_quantization Reviewed By: tugsbayasgalan Differential Revision: D55442019 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122863 Approved by: https://github.com/clee2000	2024-04-01 16:41:13 +00:00
Jiong Gong	526ca5f28e	[vec] fix compile warning in vec_n.h (#123090 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123090 Approved by: https://github.com/lezcano	2024-04-01 15:55:27 +00:00
Guilherme Leobas	9ff2a9dcdd	[dynamo] Skip leaf check on `assert_metadata_eq` if grad tensor level is `-2` (#122728 ) When fakifying a grad tracking tensor, if the level is -2 (sentinel value) we can just unwrap the grad tensor and return a fake version of it. In this PR, we update `assert_metadata_eq` to not compare whether the grad tensor and the unwrapped one are leaves, as this may not always be true. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122728 Approved by: https://github.com/zou3519	2024-04-01 15:38:16 +00:00
Peter Bell	03439d4c1c	[inductor] Lower divide by constant as multiplication by reciprocal (#121924 ) Fixes #101039 This lowers division by a constant value to be multiplication by reciprocal. The same optimization is applied in eager mode on CUDA: `0636c11811/aten/src/ATen/native/cuda/BinaryDivTrueKernel.cu (L36-L38)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121924 Approved by: https://github.com/lezcano	2024-04-01 14:37:37 +00:00
Peter Bell	6939279a17	[dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098 ) Fixes #114844 In the linked issue we have ``` compiled_module = torch.compile(module) compiled_module.x = ... compiled_module(...) # Mutates self.x ``` Where since the module mutates `self.x` you would expect `compiled_module.x` to be updated but actually `compiled_module.x = ...` sets an attribute "x" on the `OptimizedModule` object while the forward method of the module mutates `module.x`. This gives the expected behavior by forwarding `compiled_module.__setattr__` down to `module.__setattr__`. There is already a corresponding `__getattr__` so now `compiled_module.x` becomes an alias for `module.x`. Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098 Approved by: https://github.com/ezyang, https://github.com/lezcano	2024-04-01 14:30:44 +00:00
PyTorch UpdateBot	dd8a24b8b7	[xla hash update] update the pinned xla hash (#123078 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123078 Approved by: https://github.com/pytorchbot	2024-04-01 11:17:02 +00:00
Mu-Chu Lee	4b725e1619	[AOTInductor] Support quantized linear on CPU with fbgemm (#123069 ) Summary: Added support for quantized linear on CPU with fbgemm. Specifically, for torch.ops.quantized.linear_unpacked_dynamic_fp16, we decompose it into two steps, pack weight, and fbgemm's qlinear with packed weight. Test Plan: Included in commit. test_aot_inductor::test_quantized_linear Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D55577959](https://our.internmc.facebook.com/intern/diff/D55577959) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123069 Approved by: https://github.com/hl475	2024-04-01 09:15:05 +00:00
Weizhuo Zhang	6b1f13ea2f	Add skip models by device in Dynamo Test (#122591 ) Fix skip logic in `runner.py`. Add a per-device skip list for the dynamo benchmark runner `runner.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122591 Approved by: https://github.com/chuanqi129, https://github.com/desertfire, https://github.com/jgong5	2024-04-01 03:16:32 +00:00
chunyuan	8b7da5b791	Inductor cpp wrapper: fix dtype of ShapeAsConstantBuffer (#122297 ) For `at::scalar_tensor` the default dtype will be `float` ([link to scalar_tensor](`0d8e960f74/aten/src/ATen/native/TensorFactories.cpp (L856)`), [link to default dtype](`0d8e960f74/c10/core/TensorOptions.h (L551)`)) if we don't set the `dtype` value. However, the input scalar value is not necessarily a `float` value. With `torch::tensor(x)`, the dtype of the tensor will be decided according to the dtype of the scalar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122297 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-04-01 01:32:41 +00:00
Jason Ansel	781e8d2201	[dynamo] Support __next__ on UserDefinedObjectVariable (#122565 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122565 Approved by: https://github.com/yanboliang	2024-03-31 19:00:03 +00:00
Nikita Shulga	5fc0f52bf0	[BE] Use modern C++ in ATen tests (#123031 ) `std::is_same<A, B>::value` -> `std::is_same_v<A, B>` `std::is_floating_point<T>::value` -> `std::is_floating_point_v<T>` And use constexpr instead of defining two mutually exclusive templates Pull Request resolved: https://github.com/pytorch/pytorch/pull/123031 Approved by: https://github.com/Skylion007	2024-03-31 16:07:38 +00:00
Bin Bao	fa6178d246	[CI] Updated expected result files after https://github.com/pytorch/pytorch/pull/122846 (#123035 ) Summary: Before https://github.com/pytorch/pytorch/pull/122846, pyhpc_isoneutral_mixing in AOTI inference run segfaults so its result was not logged in the expected result file. Now it does show as fail_to_run instead of None. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123035 Approved by: https://github.com/chenyang78	2024-03-31 13:56:00 +00:00
albanD	6c2f36c984	Upgrade submodule pybind to 2.12.0 (#122899 ) To fix https://github.com/pytorch/pytorch/issues/122056 Building with NP 2.0 allows me to run locally with both NP 2.0 and 1.26. Any other test we should run @rgommers ? FYI @Skylion007 @atalman Pull Request resolved: https://github.com/pytorch/pytorch/pull/122899 Approved by: https://github.com/Skylion007	2024-03-31 11:29:40 +00:00
cyy	6d8bb0e984	[Distributed] [1/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#122884 ) This PR fixes some clang-tidy warnings in distributed code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122884 Approved by: https://github.com/kwen2501	2024-03-31 09:06:35 +00:00
haozhe.zhu	a52e89b6f7	[inductor]re-enable cpu reduction ut (#122289 ) Re-enable these two UTs. Both pass locally, and the CI status for this PR shows whether they pass there. For background on why they were disabled, see https://github.com/pytorch/pytorch/issues/93542 and https://github.com/pytorch/pytorch/issues/87157. After https://github.com/pytorch/pytorch/pull/115620, the reduction order should be deterministic. However, the order may not exactly match the reference path (`aten`). We can set a larger tolerance if they still fail in CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122289 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-03-31 08:33:14 +00:00
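The point about reduction order in the commit above can be illustrated without PyTorch. The following is a minimal sketch in plain Python; the helper `reduce_sum`, the input values, and the tolerance are illustrative assumptions and are not taken from the re-enabled tests.

```python
import math


def reduce_sum(values):
    """Sum the values strictly in the order given (one deterministic reduction order)."""
    total = 0.0
    for v in values:
        total += v
    return total


# Illustrative data only.
values = [0.1, 0.2, 0.3]

forward = reduce_sum(values)                   # one reduction order
backward = reduce_sum(list(reversed(values)))  # a different, equally deterministic order

# Each order is reproducible, but the two results differ in the last bits,
# so an exact comparison fails while a tolerance-based comparison passes.
exact_match = forward == backward
close_match = math.isclose(forward, backward, rel_tol=1e-9)
```

Both orders are deterministic on their own, which is why a reference comparison needs a tolerance rather than exact equality.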
Xu Han	56451cd49d	Enable x86 CPU vectorization on windows [submodule sleef] (#118980 ) Enable VEC on Windows OS. 1. Fix some type definition gaps between Windows and Linux. 2. Fix some operators not supported on Windows, such as [], /. 3. Enable static sleef library build on Windows. 4. Disable unsupported function overloading on MSVC. 5. Upgrade submodule sleef lib, which fixed a build issue on Windows. 6. Fixed bazel build issues. 7. Fix test app not linking to sleef on Windows. Note: If the rebuild fails after pulling this PR, please sync the `sleef` submodule by running: ```cmd git submodule sync git submodule update --init --recursive ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980 Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet	2024-03-31 03:07:32 +00:00
wz337	2b1ba0ceae	[DeviceMesh] Cache and reuse sliced result (#122975 ) Fixes #118849 Add a map for parent_to_child_mappings in _mesh_resources so we can cache and reuse submesh slicing result so that we can avoid recreating submesh and the underlying sub pg repeatedly, which could lead to funky behaviors. We will follow up with reusing pg from the parent_mesh during submesh creation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122975 Approved by: https://github.com/wanchaol	2024-03-30 23:56:55 +00:00
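The caching idea described in the commit above can be sketched as a nested dictionary keyed by parent and dimension name. This is a hypothetical stand-in written in plain Python: the class `MeshResources`, the method `get_or_create_child`, and the tuple used as a placeholder submesh are assumptions for illustration and do not reproduce the actual `_mesh_resources` code.

```python
class MeshResources:
    """Hypothetical registry that caches one child mesh per (parent, dim name)."""

    def __init__(self):
        # parent mesh id -> {dim name -> child mesh}
        self.parent_to_child_mapping = {}
        # Counts how many times a child was actually constructed.
        self.creations = 0

    def get_or_create_child(self, parent_id, dim_name):
        children = self.parent_to_child_mapping.setdefault(parent_id, {})
        if dim_name not in children:
            # Placeholder for the expensive step (creating the submesh and
            # its underlying process group).
            self.creations += 1
            children[dim_name] = ("submesh", parent_id, dim_name)
        return children[dim_name]


resources = MeshResources()
first = resources.get_or_create_child("mesh_2d", "tp")
second = resources.get_or_create_child("mesh_2d", "tp")
```

Repeated slicing with the same key returns the cached object, so the expensive construction runs once.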
Nikita Shulga	35c493f2cf	[CPP Extension] Escape include paths (#122974 ) By using `shlex.quote` on Linux/Mac and `_nt_quote_args` on Windows. Test it by adding a non-existent path with spaces and a single quote. TODO: Fix double quotes on Windows (will require touching `_nt_quote_args`, so will leave it for another day). Fixes https://github.com/pytorch/pytorch/issues/122476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122974 Approved by: https://github.com/Skylion007	2024-03-30 21:58:29 +00:00
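The POSIX half of the escaping described above can be sketched with the standard library alone. The helper name `quote_include_flags` and the example paths are assumptions for illustration; the Windows path via `_nt_quote_args` is not shown.

```python
import shlex


def quote_include_flags(include_dirs):
    """Build -I flags whose paths survive being pasted into a POSIX shell command."""
    return [f"-I{shlex.quote(d)}" for d in include_dirs]


# A path with spaces and a single quote, similar in spirit to the test described above.
include_dirs = ["/tmp/non existent/it's here", "/usr/include"]
flags = quote_include_flags(include_dirs)

# Joining the flags into one command string and splitting it again the way a
# POSIX shell would recovers exactly one argument per include directory.
round_trip = shlex.split(" ".join(flags))
```

`shlex.quote` leaves already-safe paths such as `/usr/include` unchanged, so only paths that need escaping are altered.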
drisspg	557e7c9c16	Add some type hints to functions and update a few spelling mistakes (#123015 ) # Summary While working on this PR: https://github.com/pytorch/pytorch/pull/121845 I found that these type hints made my ide/ noob experience easier to reason about Pull Request resolved: https://github.com/pytorch/pytorch/pull/123015 Approved by: https://github.com/Skylion007	2024-03-30 21:15:01 +00:00
Shawn Xu	e203aa9fab	[FSDP] [easy] fix HSDP validation error msg (#123019 ) Summary: This would otherwise yield > ValueError: ('Manual wrapping with ShardingStrategy.HYBRID_SHARD', 'requires explicit specification of process group or device_mesh.') which is odd. Remove the extra trailing commas. Test Plan: CI Differential Revision: D55549851 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123019 Approved by: https://github.com/Skylion007	2024-03-30 18:12:34 +00:00
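The odd message quoted above is what Python produces when an exception receives two string arguments instead of one. The sketch below shows one plausible shape of the bug and the fix; the function names are hypothetical and the exact original call site is not shown in the commit message.

```python
def buggy_error():
    # The comma after the first literal makes the two strings separate
    # arguments, so the exception carries a 2-tuple and prints like a tuple.
    return ValueError(
        "Manual wrapping with ShardingStrategy.HYBRID_SHARD",
        "requires explicit specification of process group or device_mesh.",
    )


def fixed_error():
    # Without the comma, adjacent string literals are concatenated into a
    # single message (note the space added at the end of the first literal).
    return ValueError(
        "Manual wrapping with ShardingStrategy.HYBRID_SHARD "
        "requires explicit specification of process group or device_mesh."
    )


buggy_text = str(buggy_error())
fixed_text = str(fixed_error())
```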
Shunting Zhang	ec58f1f74e	[inductor] make mask_rcnn inference work in max-autotune mode (#123008 ) Inference for the vision_maskrcnn model fails when max-autotune is enabled. Repro: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --accuracy --inference --bfloat16 --backend inductor --only vision_maskrcnn ``` It turns out that the MA code receives an empty input tensor for convolution, and some places in MA-related code do not handle this corner case properly. This PR fixes that handling, and the accuracy test above now passes. Regarding why the input tensor is empty, I think it's probably because no objects are detected in the input images (random data?). Pull Request resolved: https://github.com/pytorch/pytorch/pull/123008 Approved by: https://github.com/jansel	2024-03-30 16:39:57 +00:00
PyTorch MergeBot	5e878be101	Revert "Enable x86 CPU vectorization on windows [submodule sleef] (#118980 )" This reverts commit d94db5f6ee0af745c0d17cc6c87f695baa2b3b5f. Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/atalman due to Breaks internal build ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-2028084839))	2024-03-30 14:20:54 +00:00
Yu, Guangye	b8550f527f	Support gpu trace on XPU (#121795 ) # Motivation Support GPU trace on XPU backend. Add GPU trace to xpu runtime. It is beneficial to generalize the device caching allocator in the next step. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121795 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD ghstack dependencies: #121794	2024-03-30 13:07:53 +00:00
Yu, Guangye	eb7adc3ae0	Refactor gpu trace to be device-agnostic (#121794 ) # Motivation Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic and can be shared among each device backend. # Solution move `_cuda_trace.py` to `_gpu_trace.py`, which lets each device backend own its callbacks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794 Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui	2024-03-30 13:04:38 +00:00
Menglu Yu	99f8f77de9	[Inductor] Fix AFOC QPS Regression. (#122944 ) Summary: Recently, we observed a ~8% qps regression for the AFOC model. After digging into the problem, I found it was introduced by D55272024, where split node normalization was skipped for call_method split nodes, while our pattern detection is based on the assumption that all split nodes have been normalized to call_function nodes. More context: https://docs.google.com/document/d/19h-fu2BqdUXMaSqbd7c0-Qe00ic7quUN-emJqH_1-SA/edit Test Plan: # unit test ``` buck2 test @mode/dev-nosan //caffe2/test/inductor:split_cat_fx_passes ``` Buck UI: https://www.internalfb.com/buck2/0792d406-3d64-4b9c-95cc-15fb0cc76a96 Test UI: https://www.internalfb.com/intern/testinfra/testrun/11258999096315690 Network: Up: 113KiB Down: 535KiB (reSessionID-6132c09b-2ce7-4e89-b61d-d6c6142630cc) Jobs completed: 26. Time elapsed: 1:25.6s. Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2) Tests finished: Pass 10. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` buck2 test @mode/dev-nosan //caffe2/test/inductor:group_batch_fusion ``` Test UI: https://www.internalfb.com/intern/testinfra/testrun/13792273886410433 Network: Up: 1.3MiB Down: 960KiB (reSessionID-0bea8575-f163-4c5d-b201-69e05806af98) Jobs completed: 68. Time elapsed: 2:47.2s. Cache hits: 0%. Commands: 13 (cached: 0, remote: 1, local: 12) Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0 # local reproduce ``` buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "afoc" --flow_id 545665840 ``` Now the merge_splits_pass is conducted. 
``` 'inductor': Counter({'pattern_matcher_nodes': 1614, 'pattern_matcher_count': 1566, 'normalization_pass': 645, 'remove_split_with_size_one_pass': 629, 'batch_aten_mul': 13, 'scmerge_split_sections_removed': 11, 'scmerge_cat_removed': 5, 'scmerge_cat_added': 4, 'merge_splits_pass': 3, 'merge_getitem_cat_pass': 2, 'scmerge_split_removed': 2, 'batch_linear_post_grad': 2, 'batch_aten_sub': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1})} ``` # e2e baseline: f545633808 before_fix: f545665840 After_fix: f546227494 proposal: Differential Revision: D55513494 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122944 Approved by: https://github.com/jackiexu1992	2024-03-30 07:34:41 +00:00
Xia, Weiwen	2cd3ef4777	Check scale dtype for fake_quantize_per_channel_affine_cachemask (#120987 ) Fixes #120903 Scale for fake quant is assumed FP32 but not checked. If scales of double dtype are passed in, an internal error is raised: `TORCH_INTERNAL_ASSERT(!needs_dynamic_casting<func_t>::check(iter));` in aten/src/ATen/native/cpu/Loops.h This PR adds a check of scale dtype. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120987 Approved by: https://github.com/jgong5, https://github.com/jerryzh168	2024-03-30 07:32:32 +00:00
wz337	07f0ff6ed7	[DCP][FSDP2][Test] Add_adamW to test_train_parity_2d_transformer_checkpoint_resume (#122002 ) Want to add the option of AdamW here, as currently this is the only test for 2D. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122002 Approved by: https://github.com/awgu, https://github.com/fegin	2024-03-30 07:28:41 +00:00
angelayi	ed457c7dbe	[export] Add torch_fn (#122693 ) This PR adds a new metadata, `torch_fn` which is meant to replace `source_fn_stack` as `source_fn_stack` is not entirely well defined between strict/nonstrict. Previous discussion [here](https://docs.google.com/document/d/1sPmmsmh6rZFWH03QBOe49MaXrQkP8SxoG8AOMb-pFk4/edit#heading=h.anmx9qknhvm). `torch_fn` represents the torch function that a particular aten operator came from. For example, `torch.nn.Linear` goes down to the `torch.nn.functional.linear` at the `__torch_function__` layer, and then `aten.t/aten.addmm` in the `__torch_dispatch__` layer. So the nodes `aten.t/aten.addmm` will now have the `torch_fn` metadata containing the `torch.nn.functional.linear`. The `torch_fn` metadata is a tuple of 2 strings: a unique identifier for each torch function call, and the actual torch function `f"{fn.__class__}.{fn.__name__}"`. The purpose of the first value is to distinguish between 2 consecutive calls to the same function. For example, if we had 2 calls to `torch.nn.Linear`, the nodes and corresponding metadata would look something like: ``` aten.t - ("linear_1", "builtin_function_or_method.linear"), aten.addmm - ("linear_1", "builtin_function_or_method.linear"), aten.t - ("linear_2", "builtin_function_or_method.linear"), aten.addmm - ("linear_2", "builtin_function_or_method.linear"), ``` Higher order ops -- currently we can get the torch_fn metadata for nodes within the HOO's subgraph, but after retracing, this becomes the `(cond, higher_order_op.cond)` :( This is because `fx_traceback.set_current_meta` points to the cond node in the toplevel graph, rather than the original node in the subgraph. I think this is because `fx.Interpreter` does not go into the cond subgraphs. (will discuss with Yidi more ab this) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122693 Approved by: https://github.com/tugsbayasgalan	2024-03-30 06:47:15 +00:00
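The two-string `torch_fn` tuple described above can be mimicked in plain Python. This is a sketch under stated assumptions: the helper `make_torch_fn_tag` and its module-level counter are hypothetical, the built-in `len` stands in for a torch function, and `type(fn).__name__` is used (rather than `fn.__class__` verbatim) so the second string matches the form shown in the commit's example.

```python
from collections import Counter

# Hypothetical per-function call counter used to build unique identifiers.
_call_counts = Counter()


def make_torch_fn_tag(fn):
    """Return (unique call id, "<class name>.<function name>") for one call to fn."""
    name = fn.__name__
    _call_counts[name] += 1
    unique_id = f"{name}_{_call_counts[name]}"
    return (unique_id, f"{type(fn).__name__}.{name}")


# Two consecutive calls to the same function get different first elements,
# which is what lets nodes from the two calls be told apart.
first_call = make_torch_fn_tag(len)
second_call = make_torch_fn_tag(len)
```

Every node produced by one call would share that call's tuple, so grouping nodes by the first element recovers the original call boundaries.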
Jason Ansel	3a9eead4ab	[inductor] Don't compile MultiKernelCall in a subprocess (#123010 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123010 Approved by: https://github.com/shunting314 ghstack dependencies: #123009	2024-03-30 05:46:09 +00:00
Jason Ansel	6c0911f1d9	[inductor] Skip cudagraphs warning on CPU (#123009 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123009 Approved by: https://github.com/shunting314	2024-03-30 05:46:09 +00:00
PyTorch UpdateBot	0b7a156f68	[executorch hash update] update the pinned executorch hash (#122662 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122662 Approved by: https://github.com/pytorchbot	2024-03-30 05:18:53 +00:00
Colin Peppler	c66a44ea79	[AOTInductor] Support many outputs aliasing the same tensor (#122846 ) fixes https://github.com/pytorch/pytorch/issues/122826 # Problem When the model returns multiple outputs which alias the same tensor, we get a SEGFAULT. Because we try to release the same buffer twice. ``` def forward(x): x_out = x + 1 contig = x_out.contiguous() # alias of same tensor as x_out return x_out, contig run_impl() { output_handles[0] = buf0.release(); output_handles[1] = buf0.release(); # SEGFAULT } # if we try to workaround this by assign aliases without creating a new tensor, # then, we'll get a double free error during handle clean-up. output_handles[1] = output_handles[0]; # assign without creating a new tensor ... alloc_tensors_by_stealing_from_handles(){ aoti_torch_delete_tensor_object(handles[0]); aoti_torch_delete_tensor_object(handles[1]); # Double free } ``` # Solution ~~Instead, we use the first `output_handle` that shares the same tensor and alias it.~~ ``` output_handles[0] = buf0.release(); aoti_torch_alias_tensor(output_handles[0], &output_handles[1]); # No SEGFAULT & No double free! ``` A simpler approach is to figure out which handles are duplicate. Then we simply copy all duplicate except the last one. The last one will use `std::move` and free the tensor owned by the model instance. ``` output_handles[0] = buf0.release(); output_handles[1] = output_handles[0]; ``` Differential Revision: [D55455344](https://our.internmc.facebook.com/intern/diff/D55455344) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122846 Approved by: https://github.com/desertfire, https://github.com/chenyang78, https://github.com/jingsh	2024-03-30 04:41:17 +00:00
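The "copy every duplicate except the last, then move the last" rule stated in the prose above can be sketched as a small planning step. The function `plan_output_handles` and the string labels are hypothetical; this models only the bookkeeping, not the generated C++ wrapper code.

```python
def plan_output_handles(output_buffers):
    """For each output position, choose "copy" or "move".

    A buffer that appears several times is copied at every position except
    its last one, where ownership is moved (released) exactly once.
    """
    # Later occurrences overwrite earlier ones, leaving the last index per buffer.
    last_index = {name: i for i, name in enumerate(output_buffers)}
    return [
        "move" if last_index[name] == i else "copy"
        for i, name in enumerate(output_buffers)
    ]


# Two outputs alias buf0; buf1 is returned once.
plan = plan_output_handles(["buf0", "buf0", "buf1"])
```

Because each buffer is moved exactly once, no buffer is released twice, which is the failure mode the commit describes.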
Shunting Zhang	aaba3a87b1	tune down batch-size for res2net to avoid OOM (#122977 ) The batch size for this model was previously 64. Later we changed it to 256, which caused OOM in the cudagraphs setting. This PR tunes the batch size down to 128. Logs from my local run: ``` cuda,res2net101_26w_4s,128,1.603578,110.273572,335.263494,1.042566,11.469964,11.001666,807,2,7,6,0,0 cuda,res2net101_26w_4s,256,1.714980,207.986155,344.013071,1.058278,22.260176,21.034332,807,2,7,6,0,0 ``` The log shows that torch.compile uses 11GB for batch size 128 and 21GB for batch size 256. I guess the benchmark script has extra overhead that causes the model to OOM at batch size 256 in the dashboard run. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122977 Approved by: https://github.com/Chillee	2024-03-30 03:54:53 +00:00
Guilherme Leobas	5a06b8ebfd	Remove skipIfTorchDynamo from TestComposability in test_eager_transforms.py (#121830 ) Fixes: https://github.com/pytorch/pytorch/issues/96559 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121830 Approved by: https://github.com/zou3519 ghstack dependencies: #121410, #121665	2024-03-30 01:55:04 +00:00
Yu, Guangye	3d3d4e1cd5	export XPUStream to doc (#121398 ) # Motivation We would like to export XPUStream to public [doc](https://pytorch.org/cppdocs/api/library_root.html). The detailed documentation can help users understand and utilize XPU more effectively. # Additional Context A detailed XPUStream API and usage should be documented to public doc, like cuda's [doc](https://github.com/pytorch/pytorch/blob/main/docs/cpp/source/notes/tensor_cuda_stream.rst). Pull Request resolved: https://github.com/pytorch/pytorch/pull/121398 Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/albanD	2024-03-30 00:36:26 +00:00
Yu, Guangye	f4ff063c33	Add attributes to xpu device prop (#121898 ) # Motivation Add some attributes to `XPUDeviceProp` and expose them via `torch.xpu.get_device_properties` and `torch.xpu.get_device_capability`. They can be used in `torch.compile` or directly passed to triton to generate more optimized code based on device properties. # Additional Context expose the following attributes to `torch.xpu.get_device_properties`: - `has_fp16` (newly added) - `has_fp64` (newly added) - `has_atomic64` (newly added) - `driver_version` - `vendor` - `version` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121898 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet, https://github.com/albanD, https://github.com/atalman	2024-03-30 00:25:39 +00:00
Catherine Lee	b5bef9bbfd	Fix cpp tests not running + failing to surface (#122845 ) The comment in the code should have the information Pull Request resolved: https://github.com/pytorch/pytorch/pull/122845 Approved by: https://github.com/huydhn	2024-03-29 22:41:45 +00:00
Shuqiang Zhang	4282bb8b07	[c10d] add the source rank which detects the timeout (#122850 ) Summary: When a rank detects a timeout from tcpstore and triggers the dump. It's good to have more info about the source rank which detects the collective timeout locally. We just need to put the source rank as the value in the kvstore Test Plan: In unit test, we triggered the timeout on rank 0 and rank 1 should get the timeout signal from store and log the correct source rank: ``` (sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (34d27652)]$ python test/distributed/test_c10d_nccl.py NCCLTraceTestTimeoutDumpOnStuckRanks NCCL version 2.19.3+cuda12.0 [rank0]:[E327 17:04:16.986381360 ProcessGroupNCCL.cpp:565] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=12, NumelOut=12, Timeout(ms)=1000) ran for 1099 milliseconds before timing out. [rank0]:[E327 17:04:16.988036373 ProcessGroupNCCL.cpp:1582] [PG 0 Rank 0] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1. [rank0]:[E327 17:04:16.182548526 ProcessGroupNCCL.cpp:1346] [PG 0 Rank 0] Received a timeout signal from this local rank and will start to dump the debug info. Last enqueued NCCL work: 2, last completed NCCL work: 1. [rank0]:[E327 17:04:16.247574460 ProcessGroupNCCL.cpp:1167] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info. [rank1]:[E327 17:04:16.273332178 ProcessGroupNCCL.cpp:1346] [PG 0 Rank 1] Received a global timeout from another rank 0, and will start to dump the debug info. Last enqueued NCCL work: 1, last completed NCCL work: 1. [rank1]:[E327 17:04:16.273565177 ProcessGroupNCCL.cpp:1167] [PG 0 Rank 1] ProcessGroupNCCL preparing to dump debug info. [rank1]:[F327 17:04:16.274256512 ProcessGroupNCCL.cpp:1185] [PG 0 Rank 1] [PG 0 Rank 1] ProcessGroupNCCL's watchdog detected a collective timeout from another rank 0 and notified the current rank. 
This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc. We tried our best to dump the debug info into the storage to help you debug the issue. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122850 Approved by: https://github.com/wconstab	2024-03-29 22:22:37 +00:00
Catherine Lee	d7d77a152c	[ez] Increase slow grad check shards 4 to 6 (#122631 ) They take almost 4 hours to run completely for one shard Pull Request resolved: https://github.com/pytorch/pytorch/pull/122631 Approved by: https://github.com/huydhn	2024-03-29 21:49:27 +00:00
Jiong Gong	ea33adf6c2	[vec] test VecMask in vec_test_all_types (#122878 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122878 Approved by: https://github.com/malfet ghstack dependencies: #119979, #122869	2024-03-29 21:48:29 +00:00
Jiong Gong	c9b32c9caa	[vec] test at::vec::convert in vec_test_all_types (#122869 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122869 Approved by: https://github.com/malfet ghstack dependencies: #119979	2024-03-29 21:48:29 +00:00
Jiong Gong	6f4ed57b8a	[inductor][cpp] unified the vectorized conversion with `at::vec::convert` for all data types (#119979 ) This PR unified the vectorized conversion with `at::vec::convert` for all vectorized data types. The intrinsics implementations are implemented as a specialization and moved to their own arch-specific files. The vectorized conversion logic in cpp Inductor is simplified. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119979 Approved by: https://github.com/jansel, https://github.com/malfet	2024-03-29 21:48:29 +00:00
Arun Pa	05e54536fb	[CI] Removed tests for torch.utils.tensorboard.summary.hparams (#122556 ) Partially addresses #122160 In the module `torch.utils.tensorboard.summary`, the `hparams` method does not depend on any utilities from pytorch as it uses only the utilities from `tensorboard`. Thus, I think it will be safe to delete the test for `hparams` method as it does not depend on pytorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122556 Approved by: https://github.com/huydhn	2024-03-29 21:44:02 +00:00
Angela Yi	482d8bf1ea	[aoti] Change aot_compile callsites (#122225 ) Summary: Replacing `torch._export.aot_compile` callsites with ``` ep = torch.export._trace._export(.., predispatch=True) # Traces the given program into predispatch IR so_path = torch._inductor.aot_compile_ep(ep, ...) # Takes an exported program and compiles it into a .so ``` This allows us to explicitly split up the export step from AOTInductor. We can later modify tests to do `export + serialize + deserialize + inductor` to mimic internal production use cases better. Test Plan: CI Differential Revision: D54808612 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122225 Approved by: https://github.com/SherlockNoMad, https://github.com/khabinov	2024-03-29 21:34:20 +00:00
Michael Lazos	267145c5d0	Enable full state checking (#122971 ) Fixes https://github.com/pytorch/pytorch/issues/115679 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122971 Approved by: https://github.com/anijain2305	2024-03-29 21:24:57 +00:00
Nikita Shulga	4d6cb7bca0	Use Q-NEON register to compute the dot product (#122952 ) Make transposed gemv a bit faster Pull Request resolved: https://github.com/pytorch/pytorch/pull/122952 Approved by: https://github.com/kimishpatel ghstack dependencies: #122951	2024-03-29 21:09:08 +00:00
Kurt Mohler	73e362756b	Avoid COW materialize in conv forward ops (#122748 ) Part of #97856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122748 Approved by: https://github.com/ezyang ghstack dependencies: #122720	2024-03-29 20:34:19 +00:00
cyy	7423092227	[TorchGen] [2/N] Remove unused variables and simplify dictionary iterations (#122585 ) This PR continues to remove unused variables and simplifies dictionary iterations from TorchGen scripts, following #122576. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122585 Approved by: https://github.com/ezyang	2024-03-29 20:34:11 +00:00
Edward Z. Yang	57a9a64e10	[BE] Give a different error message when evaluating an integer. (#122938 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122938 Approved by: https://github.com/Skylion007	2024-03-29 19:14:15 +00:00
Edward Z. Yang	3178ba0dc9	Don't use sympy Float functions, use an opaque one with no reasoning (#122823 ) Sympy simplifications don't obey floating point semantics, so don't use Sympy for this. Keep them as is, only evaluate with the reference implementations when all arguments are known. This may end up getting subsumed by some other changes later, but I wanted to understand if this was easy and it seems to be easy. This doesn't actually depend on the earlier diffs on the stack and I can detach it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122823 Approved by: https://github.com/lezcano	2024-03-29 19:13:55 +00:00
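The claim above that algebraic simplification does not obey floating point semantics can be checked with ordinary Python floats. This is a small illustration with values chosen for the purpose; it does not use Sympy or any PyTorch code.

```python
x = 1.0
big = 1e16

# Over the real numbers, (x + big) - big simplifies to x.
algebraic_result = x

# In IEEE 754 double precision, x + big rounds back to big because adjacent
# doubles near 1e16 are 2 apart, so the subtraction yields 0.0.
floating_result = (x + big) - big

simplification_is_safe = algebraic_result == floating_result
```

A symbolic rewrite that replaced the float expression with `x` would therefore change the computed value, which is the motivation for keeping such functions opaque.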
Catherine Lee	ae0cf1f98d	[TD][ez] Set pytest cache bucket default to gha-artifacts (#122901 ) After https://github.com/pytorch/pytorch/pull/121907/files Example failure: https://github.com/pytorch/pytorch/actions/runs/8473386479/job/23217733984#step:5:130 ``` usage: pytest_cache.py [-h] (--upload \| --download) --cache_dir CACHE_DIR --pr_identifier PR_IDENTIFIER --job_identifier JOB_IDENTIFIER [--sha SHA] [--test_config TEST_CONFIG] [--shard SHARD] [--repo REPO] [--temp_dir TEMP_DIR] [--bucket BUCKET] pytest_cache.py: error: argument --bucket: expected one argument ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122901 Approved by: https://github.com/huydhn	2024-03-29 18:52:58 +00:00
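The `expected one argument` failure quoted above is standard `argparse` behavior when a flag is passed with no value. The sketch below reproduces only a subset of the flags from the usage text; the helpers `build_parser` and `parse_or_none` are hypothetical, and placing the default inside the parser is an assumption, since the commit message does not say where the default was set.

```python
import argparse


def build_parser():
    """Hypothetical parser with a subset of the flags shown in the usage text."""
    parser = argparse.ArgumentParser(prog="pytest_cache.py")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--upload", action="store_true")
    mode.add_argument("--download", action="store_true")
    # Assumed default; used whenever --bucket is not passed at all.
    parser.add_argument("--bucket", default="gha-artifacts")
    return parser


def parse_or_none(argv):
    """Return parsed args, or None if argparse rejects the command line."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        return None


omitted = parse_or_none(["--upload"])                    # flag omitted: default applies
explicit = parse_or_none(["--upload", "--bucket", "b"])  # explicit value wins
missing_value = parse_or_none(["--upload", "--bucket"])  # flag without a value is rejected
```

The third case shows that a parser default alone does not help if the caller still emits a bare `--bucket`; the value has to be non-empty by the time the command line is built.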
Animesh Jain	99d939f51f	[dynamo] Bugfix for HASATTR guard (#122947 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122947 Approved by: https://github.com/jansel ghstack dependencies: #122828	2024-03-29 18:50:33 +00:00
AyaseNana	0a7162f898	Fix svd_lowrank parameter `M` (#122681 ) ISSUE: #122699 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122681 Approved by: https://github.com/lezcano	2024-03-29 18:06:38 +00:00
Mikayla Gawarecki	487b6d40ec	Add RMSNorm module (#121364 ) Similar to `dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L51)` The implementation here is not optimized and we welcome pull requests to improve this - Use `normalized_shape` instead of singular integer `dim` to be aligned with the `nn.LayerNorm` implementation - Remove the [upcast to float and downcast ](`dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L73)`) Differential Revision: [D55485840](https://our.internmc.facebook.com/intern/diff/D55485840) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121364 Approved by: https://github.com/albanD	2024-03-29 18:05:28 +00:00
Andrew Gu	3243be7c3a	[FSDP2] Removed `wrapSwapTensorsTest` since no longer needed (#122962 ) We do not need to set the flag after https://github.com/pytorch/pytorch/pull/122755. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122962 Approved by: https://github.com/mikaylagawarecki	2024-03-29 17:53:18 +00:00
PyTorch MergeBot	a236fa9f06	Revert "[aoti] clear precomputed symbol replacements before cpp wrapper compilation (#122882 )" This reverts commit 384de46395234e793a319325e5c9d20a60407a64. Reverted https://github.com/pytorch/pytorch/pull/122882 on behalf of https://github.com/jithunnair-amd due to broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/122882#issuecomment-2027544640))	2024-03-29 17:52:39 +00:00
Jason Ansel	2a137f7af1	[dynamo] Support hasattr on UserDefinedClassVariable (#122564 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122564 Approved by: https://github.com/anijain2305	2024-03-29 17:34:14 +00:00
Wenqi Li	772e142e70	[dynamo] Delay cuda device registration (#122795 ) the module-level `torch.cuda.device_count` calls are delayed until reading the registered devices. Fixes #122085 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122795 Approved by: https://github.com/ezyang	2024-03-29 17:22:18 +00:00
Bradley Davis	315bd951e4	Add inductor fx pass unit test for shape propagation (#122897 ) Summary: Pre-grad fx passes expect information from shape propagation to be present. D55221119 ensured that `pass_execution_and_save` invokes shape propagation, and this diff adds a covering unit test to prevent regression. Test Plan: New UT passes locally. Differential Revision: D55440240 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122897 Approved by: https://github.com/khabinov, https://github.com/Skylion007	2024-03-29 16:44:22 +00:00
Xinya Zhang	b83c94339e	Fix performance regression and memory storage handling of Flash Attention on ROCM (#122857 ) This PR fixes the two major issues that were discovered after the initial merge of PR #121561: 1. The Flash Attention support added by PR #121561 has severe performance regressions on regular shapes (power of two head dimensions and sequence lengths) compared with PR #115981. Its performance is worse than the math backend and it only has numerical stability advantages. This PR fixes this problem. 2. There is a flaw in the memory storage handling in PR #121561, which does not copy the gradients back to the designated output tensor. This PR removes the deprecated `TensorStorageSanitizer` class, which is unnecessary due to the more flexible backward kernel shipped by PR #121561. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122857 Approved by: https://github.com/jeffdaily, https://github.com/drisspg	2024-03-29 16:37:24 +00:00
Nikita Shulga	d8b69de73b	[EZ] Run fp16 torch.mm/torch.mv across CPU threads (#122951 ) This significantly speeds up real-world applications, such as LLMs. Before this change, llama2-7b fp16 inference ran at 1.5 tokens per sec; after it, it runs at almost 6 tokens per sec. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122951 Approved by: https://github.com/ezyang	2024-03-29 16:14:59 +00:00
cyy	fb90b4d4b2	[TorchGen] Use std::optional in generated code (#121454 ) This PR changes TorchGen to generate std::optional. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121454 Approved by: https://github.com/ezyang	2024-03-29 14:11:09 +00:00
Bin Bao	375a8041ed	[AOTI][refactor] Improve logging (#122932 ) Summary: Improve some logging msgs, and change a data type to remove a compile time warning. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122932 Approved by: https://github.com/chenyang78	2024-03-29 14:02:23 +00:00
cyy	769d1909f0	Enable clang-tidy warnings of aten/src/ATen/functorch (#122933 ) Enable clang-tidy in aten/src/ATen/functorch, following #122779. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122933 Approved by: https://github.com/ezyang	2024-03-29 14:01:28 +00:00
vfdev-5	38946bff51	Added DispatchKey.CompositeImplicitAutograd to all upsample_nearest*.default decompositions (#122782 ) Related to https://github.com/pytorch/pytorch/pull/117632#issuecomment-2021321172 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122782 Approved by: https://github.com/ezyang	2024-03-29 13:55:25 +00:00
vfdev-5	b524a404e0	Fixed support for uint8 in upsample bicubic2d decomposition (#120411 ) Superseeds https://github.com/pytorch/pytorch/pull/104248 Description: - Fixed support for uint8 for upsample bicubic2d decomposition (on `main` results are wrong, so we can tolerate the slowdown) - Added missing clamp(0, 1) for xscale and yscale - slowdown for f32 on cpu. PR on nodes fusion on CPU: https://github.com/pytorch/pytorch/pull/120077 can help for upsampling cases with align corners = true - the slowdown mainly due to the added clamp op and also partially reduced when using torch.stack in weights computation on cpu. - Removed lowering implementation Benchmarks: ``` [-------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cpu --------------------------------------------------------------------------------------------------------------------------------------------------------] \| Eager (2.4.0a0+git0c61c20) PR \| Compiled (2.4.0a0+git0c61c20) PR \| Compiled (2.4.0a0+git069270d) Nightly \| speed-up PR vs Nightly \| Eager (2.4.0a0+git069270d) Nightly 1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (256, 256) \| 613.029 (+-1.590) \| 5477.608 (+-9.027) \| 3060.314 (+-12.368) \| 0.559 (+-0.000) \| 608.735 (+-6.336) Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) \| 610.176 (+-1.428) \| 5718.503 (+-11.203) \| 3424.022 (+-12.836) \| 0.599 (+-0.000) \| 604.781 (+-6.229) Input (1, 3, 
500, 400), torch.uint8, torch.channels_last \| mode: bicubic, align_corners: True, antialias: False, osize: (256, 256) \| 325.001 (+-0.840) \| 6183.029 (+-10.893) \| 3275.032 (+-7.625) \| 0.530 (+-0.000) \| 325.693 (+-1.067) Input (1, 3, 500, 400), torch.uint8, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) \| 325.855 (+-1.108) \| 6391.394 (+-11.552) \| 3533.410 (+-7.666) \| 0.553 (+-0.000) \| 325.838 (+-1.457) Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (200, 300) \| 2521.533 (+-14.857) \| 5025.217 (+-13.415) \| 2814.304 (+-6.742) \| 0.560 (+-0.000) \| 2520.308 (+-10.796) Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) \| 2531.204 (+-12.534) \| 5294.925 (+-11.994) \| 3147.590 (+-6.808) \| 0.594 (+-0.000) \| 2521.228 (+-11.732) Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last \| mode: bicubic, align_corners: True, antialias: False, osize: (200, 300) \| 758.352 (+-10.362) \| 5639.912 (+-14.495) \| 3014.123 (+-8.799) \| 0.534 (+-0.000) \| 756.114 (+-4.792) Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) \| 758.712 (+-5.781) \| 5927.541 (+-9.982) \| 3249.555 (+-7.226) \| 0.548 (+-0.000) \| 757.719 (+-5.653) Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (600, 700) \| 1524.469 (+-12.860) \| 34321.641 (+-80.310) \| 19373.714 (+-56.351) \| 0.564 (+-0.000) \| 1518.082 (+-49.653) Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) \| 1521.746 (+-13.780) \| 35949.711 (+-81.010) \| 21782.366 (+-68.938) \| 0.606 (+-0.000) \| 1467.911 (+-15.901) Input (1, 3, 300, 400), torch.uint8, torch.channels_last 
\| mode: bicubic, align_corners: True, antialias: False, osize: (600, 700) \| 712.311 (+-5.361) \| 38826.510 (+-92.267) \| 20762.314 (+-59.303) \| 0.535 (+-0.000) \| 712.669 (+-4.673) Input (1, 3, 300, 400), torch.uint8, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) \| 715.060 (+-4.757) \| 40269.353 (+-92.543) \| 22402.114 (+-81.574) \| 0.556 (+-0.000) \| 716.001 (+-8.945) Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (256, 256) \| 2331.889 (+-29.159) \| 21541.096 (+-72.346) \| 12181.194 (+-45.288) \| 0.565 (+-0.000) \| 2304.864 (+-21.351) Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) \| 2333.697 (+-10.066) \| 22514.154 (+-57.798) \| 21709.449 (+-98.307) \| 0.964 (+-0.000) \| 2302.141 (+-13.041) Input (4, 3, 500, 400), torch.uint8, torch.channels_last \| mode: bicubic, align_corners: True, antialias: False, osize: (256, 256) \| 1198.768 (+-5.364) \| 37652.371 (+-101.644) \| 42740.413 (+-98.571) \| 1.135 (+-0.000) \| 1197.104 (+-7.225) Input (4, 3, 500, 400), torch.uint8, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) \| 1196.851 (+-5.118) \| 39678.341 (+-173.750) \| 46807.738 (+-92.744) \| 1.180 (+-0.000) \| 1189.322 (+-5.681) Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (200, 300) \| 10020.978 (+-54.855) \| 19955.290 (+-71.891) \| 11420.521 (+-53.179) \| 0.572 (+-0.000) \| 9999.583 (+-61.230) Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) \| 10066.441 (+-62.700) \| 21058.334 (+-183.414) \| 19986.577 (+-65.304) \| 0.949 (+-0.000) \| 10018.672 (+-59.188) Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last \| mode: bicubic, 
align_corners: True, antialias: False, osize: (200, 300) \| 3171.135 (+-14.635) \| 19687.864 (+-54.320) \| 23313.699 (+-57.391) \| 1.184 (+-0.000) \| 3182.191 (+-17.686) Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) \| 3181.314 (+-13.784) \| 20224.468 (+-50.827) \| 30541.963 (+-381.385) \| 1.510 (+-0.000) \| 3183.578 (+-16.203) Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (600, 700) \| 5879.450 (+-31.551) \| 136918.555 (+-480.320) \| 77723.568 (+-331.766) \| 0.568 (+-0.000) \| 5726.061 (+-87.517) Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) \| 5882.869 (+-30.325) \| 143378.094 (+-513.842) \| 137244.074 (+-4827.730) \| 0.957 (+-0.000) \| 5727.679 (+-22.164) Input (4, 3, 300, 400), torch.uint8, torch.channels_last \| mode: bicubic, align_corners: True, antialias: False, osize: (600, 700) \| 2674.937 (+-45.003) \| 244829.360 (+-1930.579) \| 271283.073 (+-2243.245) \| 1.108 (+-0.000) \| 2676.054 (+-24.632) Input (4, 3, 300, 400), torch.uint8, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) \| 2676.217 (+-16.601) \| 248658.668 (+-2904.952) \| 296514.520 (+-2983.281) \| 1.192 (+-0.000) \| 2682.844 (+-19.886) Input (1, 3, 500, 400), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (256, 256) \| 1768.437 (+-6.294) \| 2934.013 (+-28.870) \| 2520.649 (+-6.797) \| 0.859 (+-0.000) \| 1759.292 (+-5.097) Input (1, 3, 500, 400), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) \| 1748.660 (+-5.550) \| 3271.104 (+-7.557) \| 2891.306 (+-7.632) \| 0.884 (+-0.000) \| 1746.341 (+-5.845) Input (1, 3, 500, 400), torch.float32, torch.channels_last \| mode: bicubic, 
align_corners: True, antialias: False, osize: (256, 256) \| 2813.150 (+-6.656) \| 3258.973 (+-7.543) \| 2766.286 (+-6.473) \| 0.849 (+-0.000) \| 2805.077 (+-7.611) Input (1, 3, 500, 400), torch.float32, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) \| 2812.102 (+-8.211) \| 3568.780 (+-9.018) \| 3125.870 (+-7.324) \| 0.876 (+-0.000) \| 2834.178 (+-9.034) Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (200, 300) \| 1687.975 (+-9.527) \| 2752.085 (+-9.627) \| 2373.274 (+-7.888) \| 0.862 (+-0.000) \| 1698.782 (+-8.098) Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) \| 1696.606 (+-8.678) \| 3056.317 (+-13.303) \| 2699.160 (+-10.638) \| 0.883 (+-0.000) \| 1684.942 (+-10.519) Input (1, 3, 1200, 1300), torch.float32, torch.channels_last \| mode: bicubic, align_corners: True, antialias: False, osize: (200, 300) \| 2613.491 (+-9.769) \| 3176.493 (+-13.366) \| 2730.193 (+-9.573) \| 0.859 (+-0.000) \| 2625.085 (+-9.943) Input (1, 3, 1200, 1300), torch.float32, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) \| 2614.946 (+-34.129) \| 3465.398 (+-11.165) \| 3044.396 (+-11.447) \| 0.879 (+-0.000) \| 2627.355 (+-9.608) Input (1, 3, 300, 400), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (600, 700) \| 10784.549 (+-58.181) \| 18292.452 (+-59.344) \| 15909.922 (+-49.864) \| 0.870 (+-0.000) \| 10837.656 (+-51.947) Input (1, 3, 300, 400), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) \| 10786.513 (+-52.308) \| 20449.038 (+-56.204) \| 18295.997 (+-54.522) \| 0.895 (+-0.000) \| 10843.751 (+-44.781) Input (1, 3, 300, 400), torch.float32, torch.channels_last \| mode: bicubic, align_corners: True, 
antialias: False, osize: (600, 700) \| 17532.699 (+-64.807) \| 20425.699 (+-80.271) \| 17517.040 (+-79.705) \| 0.858 (+-0.000) \| 17595.597 (+-61.870) Input (1, 3, 300, 400), torch.float32, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) \| 17530.816 (+-55.131) \| 22450.080 (+-92.899) \| 19827.828 (+-77.649) \| 0.883 (+-0.000) \| 17615.934 (+-71.716) Input (4, 3, 500, 400), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (256, 256) \| 6875.484 (+-40.543) \| 11569.509 (+-62.462) \| 10053.350 (+-208.136) \| 0.869 (+-0.000) \| 6864.501 (+-46.747) Input (4, 3, 500, 400), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) \| 6843.126 (+-44.498) \| 12915.236 (+-60.654) \| 25335.058 (+-382.640) \| 1.962 (+-0.000) \| 6899.002 (+-46.861) Input (4, 3, 500, 400), torch.float32, torch.channels_last \| mode: bicubic, align_corners: True, antialias: False, osize: (256, 256) \| 11103.418 (+-51.318) \| 28834.389 (+-78.395) \| 37405.463 (+-581.646) \| 1.297 (+-0.000) \| 11223.012 (+-60.709) Input (4, 3, 500, 400), torch.float32, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (256, 256) \| 11092.994 (+-70.835) \| 36597.023 (+-118.988) \| 45761.267 (+-85.051) \| 1.250 (+-0.000) \| 11104.014 (+-61.288) Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (200, 300) \| 7106.791 (+-63.666) \| 11191.071 (+-45.402) \| 9786.037 (+-75.781) \| 0.874 (+-0.000) \| 7129.419 (+-77.674) Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) \| 7146.519 (+-28.376) \| 12443.571 (+-39.425) \| 20147.067 (+-74.771) \| 1.619 (+-0.000) \| 7179.622 (+-64.847) Input (4, 3, 1200, 1300), torch.float32, torch.channels_last \| mode: bicubic, 
align_corners: True, antialias: False, osize: (200, 300) \| 10533.849 (+-44.227) \| 34814.909 (+-138.127) \| 42803.001 (+-114.326) \| 1.229 (+-0.000) \| 10644.039 (+-59.681) Input (4, 3, 1200, 1300), torch.float32, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (200, 300) \| 10548.910 (+-44.221) \| 42876.940 (+-146.959) \| 49711.443 (+-139.276) \| 1.159 (+-0.000) \| 10652.375 (+-44.174) Input (4, 3, 300, 400), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (600, 700) \| 42814.521 (+-103.198) \| 73100.489 (+-435.262) \| 63587.659 (+-134.266) \| 0.870 (+-0.000) \| 43208.921 (+-195.287) Input (4, 3, 300, 400), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) \| 42812.373 (+-103.870) \| 81769.160 (+-373.369) \| 175159.813 (+-2028.558) \| 2.142 (+-0.000) \| 43007.691 (+-96.358) Input (4, 3, 300, 400), torch.float32, torch.channels_last \| mode: bicubic, align_corners: True, antialias: False, osize: (600, 700) \| 69955.505 (+-373.373) \| 215248.616 (+-2040.775) \| 267511.246 (+-2094.161) \| 1.243 (+-0.000) \| 70382.679 (+-594.941) Input (4, 3, 300, 400), torch.float32, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (600, 700) \| 69852.157 (+-490.076) \| 242841.484 (+-19645.513) \| 317931.678 (+-2016.498) \| 1.309 (+-0.000) \| 70074.819 (+-352.919) Times are in microseconds (us). 
[-------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cuda ---------------------------------------------------------------------------------------------------------------------------------------------------------] \| Eager (2.4.0a0+git0c61c20) PR \| Compiled (2.4.0a0+git0c61c20) PR \| Compiled (2.4.0a0+git069270d) Nightly \| speed-up PR vs Nightly \| Eager (2.4.0a0+git069270d) Nightly 1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345) \| 97.727 (+-0.018) \| 97.765 (+-0.025) \| 97.773 (+-0.027) \| 1.000 (+-0.000) \| 97.905 (+-0.040) Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345) \| 97.615 (+-0.066) \| 97.332 (+-0.032) \| 97.950 (+-0.026) \| 1.006 (+-0.000) \| 97.690 (+-0.062) Input (1, 3, 2345, 2456), torch.float32, torch.channels_last \| mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345) \| 100.635 (+-0.033) \| 125.883 (+-0.020) \| 102.499 (+-0.116) \| 0.814 (+-0.000) \| 101.103 (+-0.027) Input (1, 3, 2345, 2456), torch.float32, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345) \| 100.898 (+-0.036) \| 109.717 (+-0.336) \| 102.558 (+-0.120) \| 0.935 (+-0.000) \| 101.642 (+-0.105) Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345) \| 462.853 (+-0.028) \| 382.475 (+-0.047) \| 382.472 (+-0.033) 
\| 1.000 (+-0.000) \| 462.188 (+-0.014) Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345) \| 462.783 (+-0.021) \| 382.806 (+-0.037) \| 382.563 (+-0.043) \| 0.999 (+-0.000) \| 462.089 (+-0.028) Input (4, 3, 2345, 2456), torch.float32, torch.channels_last \| mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345) \| 466.721 (+-0.022) \| 384.438 (+-0.027) \| 384.886 (+-0.037) \| 1.001 (+-0.000) \| 467.014 (+-0.025) Input (4, 3, 2345, 2456), torch.float32, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345) \| 466.993 (+-0.032) \| 384.212 (+-0.009) \| 383.946 (+-0.029) \| 0.999 (+-0.000) \| 466.575 (+-0.020) Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456) \| 190.070 (+-0.082) \| 209.353 (+-1.096) \| 202.870 (+-0.888) \| 0.969 (+-0.000) \| 189.371 (+-0.164) Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456) \| 190.021 (+-0.018) \| 210.504 (+-0.456) \| 201.814 (+-0.770) \| 0.959 (+-0.000) \| 189.314 (+-0.036) Input (1, 3, 1234, 1345), torch.float32, torch.channels_last \| mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456) \| 188.860 (+-0.207) \| 336.635 (+-0.023) \| 252.026 (+-0.510) \| 0.749 (+-0.000) \| 188.860 (+-0.170) Input (1, 3, 1234, 1345), torch.float32, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456) \| 188.725 (+-0.214) \| 276.329 (+-0.563) \| 251.439 (+-0.524) \| 0.910 (+-0.000) \| 188.776 (+-0.189) Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456) \| 781.879 (+-0.086) \| 836.389 (+-7.177) \| 816.483 (+-6.626) \| 0.976 (+-0.000) \| 781.362 (+-0.106) Input (4, 
3, 1234, 1345), torch.float32, torch.contiguous_format \| mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456) \| 781.824 (+-0.099) \| 840.406 (+-7.111) \| 807.530 (+-6.514) \| 0.961 (+-0.000) \| 781.307 (+-0.129) Input (4, 3, 1234, 1345), torch.float32, torch.channels_last \| mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456) \| 769.290 (+-0.309) \| 675.498 (+-1.537) \| 688.171 (+-4.326) \| 1.019 (+-0.000) \| 769.830 (+-0.222) Input (4, 3, 1234, 1345), torch.float32, torch.channels_last \| mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456) \| 769.240 (+-0.179) \| 675.800 (+-1.113) \| 673.176 (+-1.740) \| 0.996 (+-0.000) \| 769.935 (+-0.171) Times are in microseconds (us). ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/120411 Approved by: https://github.com/lezcano	2024-03-29 13:15:25 +00:00
Xu Han	d94db5f6ee	Enable x86 CPU vectorization on windows [submodule sleef] (#118980 ) Enable VEC on Windows OS. 1. Fix some type definition gaps between Windows and Linux. 2. Fix some operators not supported on Windows, such as [], /. 3. Enable static sleef library build on Windows. 4. Disable unsupported function overloading on MSVC. 5. Upgrade submodule sleef lib, which fixed a build issue on Windows. 6. Fix bazel build issues. 7. Fix test app not linking to sleef on Windows. Note: If the rebuild fails after pulling this PR, please sync the `sleef` submodule by running: ```cmd git submodule sync git submodule update --init --recursive ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980 Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet	2024-03-29 07:28:31 +00:00
willfengg	35c56f85fd	[dynamo][pt2d] avoid skipping modules from torch/testing/_internal (#122851 ) Dynamo skips user-defined modules from `torch/testing/_internal` (e.g. MLP, Transformer). This PR adds `torch/testing/_internal/...` to `manual_torch_name_rule_map`. It ensures FSDP CI + torch.compile are meaningfully tested. The unit test shows frame count = 0 before and frame count > 0 after: ```pytest test/dynamo/test_trace_rules.py -k test_module_survive_skip_files``` Some FSDP unit tests actually start to compile modules with this change. Add a triton availability check or disable tests for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122851 Approved by: https://github.com/jansel	2024-03-29 06:42:06 +00:00
Edward Z. Yang	10bdf64427	Properly pexpr the actual sympy.Expression, don't repr it. (#122893 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122893 Approved by: https://github.com/albanD, https://github.com/desertfire, https://github.com/jansel	2024-03-29 06:40:19 +00:00
chilli	ed37fbdf60	made gpt_fast benchmark run faster (#122872 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122872 Approved by: https://github.com/msaroufim, https://github.com/yifuwang ghstack dependencies: #122848	2024-03-29 03:49:19 +00:00
chilli	b9c9f037d1	Added some checkpointing tests (#122848 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122848 Approved by: https://github.com/anijain2305	2024-03-29 03:49:19 +00:00
Chirag Pandya	b6201a60c5	[BE] minor logging cleanup in distributed (#122921 ) Summary: Minor logging cleanup in distributed library 1. Don't use "f" formatted strings - address linter issues. 2. Nits: Make use of unused `e` (error) in a few logs. 3. Change info->debug as asked in issue #113545 4. Nit: rename log -> logger in a few files for consistency 5. Fix a linter error. Test Plan: 1. Local build passes. 2. Linter is happy. Reviewers: wanchaol Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921 Approved by: https://github.com/wanchaol	2024-03-29 03:34:01 +00:00
albanD	6a45809580	Simplify forward AD missing support error (#122639 ) This thing about jit decomposition confuses users greatly and I'm not sure what it adds. So removing it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122639 Approved by: https://github.com/soulitzer	2024-03-29 02:11:46 +00:00
Tugsbayasgalan Manlaibaatar	76d8020e62	Add tests for pre_dispatch + run_decomp flow and taskify failures (#122508 ) Differential Revision: [D55448616](https://our.internmc.facebook.com/intern/diff/D55448616) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122508 Approved by: https://github.com/angelayi, https://github.com/zhxchen17	2024-03-29 01:47:07 +00:00
cyy	f041df8530	Fix order conditioning of norm kernel (#122874 ) NormOneOps is not executed due to an incorrect comparison, this PR fixes it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122874 Approved by: https://github.com/Skylion007	2024-03-29 00:28:13 +00:00
PyTorch MergeBot	6b8205d3de	Revert "Support map in pre-dispatch functionalization (#121444 )" This reverts commit 079feea3379c021a330dbfac7668a5fc8fccc3bd. Reverted https://github.com/pytorch/pytorch/pull/121444 on behalf of https://github.com/clee2000 due to sorry windows failure seems related `079feea337` https://github.com/pytorch/pytorch/actions/runs/8474191301/job/23220791555. PR got force merged before windows job finished ([comment](https://github.com/pytorch/pytorch/pull/121444#issuecomment-2026323614))	2024-03-28 23:42:26 +00:00
Michael Lazos	16771747c2	Add tensor step and capturable support to rprop (#122261 ) Towards fixing https://github.com/pytorch/pytorch/issues/115679. Fixes Rprop step update while compiling. Also adds capturable support + testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122261 Approved by: https://github.com/janeyx99	2024-03-28 23:31:18 +00:00
Joel Schlosser	e63e013c3b	Skip use_count() debug assert for _nested_get_offsets() (#122917 ) This broke [internal tests](https://www.internalfb.com/intern/test/844425064039866/) that run with unset `NDEBUG`. It wasn't initially caught because we don't test with unset `NDEBUG` in OSS CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122917 Approved by: https://github.com/soulitzer ghstack dependencies: #122902	2024-03-28 23:19:17 +00:00
Joel Schlosser	6fc5ad931c	Use zeros for NJT dummy to avoid messing with randomness (#122902 ) Use of randomness was breaking vmap. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122902 Approved by: https://github.com/vmoens, https://github.com/zou3519	2024-03-28 22:09:31 +00:00
Guilherme Leobas	f476d707fd	Remove previous grad impl. in torch dynamo (#122215 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122215 Approved by: https://github.com/zou3519	2024-03-28 22:00:23 +00:00
Tugsbayasgalan Manlaibaatar	079feea337	Support map in pre-dispatch functionalization (#121444 ) When we enter map_autograd, we try to trace through fwd/bwd of a map operator that is wrapped in ctx.functionalize wrapper. This forces us to go through PreDispatch functionalization again (only the python part). As a result, it revealed a previous bug where pre-dispatch mode handling doesn't actually manage the local dispatch key set. (If there is no active mode, we need to turn off the PreDispatch key.) This PR fixes that. Also, I shuffled some APIs around so that there is less code duplication, as the setting/unsetting logic is quite hard to get right. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121444 Approved by: https://github.com/bdhirsh	2024-03-28 21:56:36 +00:00
Xia, Weiwen	481c9bb1fc	Upgrade submodule oneDNN to v3.3.6 (#122164 ) As the title. Including issue fixes for aarch64: - https://github.com/oneapi-src/oneDNN/pull/1831 - https://github.com/oneapi-src/oneDNN/pull/1834 --- ## Validation results (on Intel CPU + Linux) Static quantization with Inductor on CV models Quant method \| Geomean throughput ratio (v3.3.6/baseline) -- \| -- ptq \| 0.982937 ptq (cpp wrapper) \| 0.978384 qat \| 0.978828 Torchbench cpu userbenchmark with Inductor Items \| Perf Geomean Ratio (v3.3.6/baseline) -- \| -- eager_throughtput_bf16_infer \| 1.00x eager_throughtput_fp32_infer \| 1.00x jit_llga_throughtput_amp_bf16 \| 1.01x jit_llga_throughtput_fp32 \| 1.00x eager_throughtput_fx_int8 \| 1.00x eager_throughtput_bf16_train \| 1.46x eager_throughtput_fp32_train \| 1.41x Dynamo benchmarks tests Precision \| Shape \| Wrapper \| Thread \| Eager old/new GEOMEAN \| Inductor old/new GEOMEAN -- \| -- \| -- \| -- \| -- \| -- Float32 \| Static \| Default \| Multiple \| 1.003836812 \| 1.003425 Float32 \| Static \| Default \| Single \| 1.000181451 \| 0.999611 Float32 \| Dynamic \| Default \| Multiple \| 1.003980183 \| 1.006563 Float32 \| Dynamic \| Default \| Single \| 1.000076939 \| 0.999969 AMP \| Static \| Default \| Multiple \| 0.996824772 \| 0.998715 AMP \| Static \| Default \| Single \| 0.996402574 \| 1.001483 AMP \| Dynamic \| Default \| Multiple \| 0.994919866 \| 1.000467 AMP \| Dynamic \| Default \| Single \| 0.9962054 \| 1.000767 (on Aarch64) https://github.com/pytorch/pytorch/pull/122164#issuecomment-2007912919 --- Pull Request resolved: https://github.com/pytorch/pytorch/pull/122164 Approved by: https://github.com/snadampal, https://github.com/malfet, https://github.com/atalman	2024-03-28 21:36:27 +00:00
Andrew Gu	3924d2189c	[FSDP2] Simplified `_move_states_to_device` (#122907 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122907 Approved by: https://github.com/Skylion007	2024-03-28 21:22:59 +00:00
PyTorch MergeBot	3beb9d85a6	Revert "Add non strict inline constraints and runtime assertions to non-strict exported program (#122722 )" This reverts commit b693fff5d72b249d39436ced577a88d3b866bbba. Reverted https://github.com/pytorch/pytorch/pull/122722 on behalf of https://github.com/BoyuanFeng due to This breaks torchrec.distributed.tests.test_pt2.TestPt2: test_kjt__getitem__ ([comment](https://github.com/pytorch/pytorch/pull/122722#issuecomment-2026078351))	2024-03-28 20:42:35 +00:00
Andrew Gu	8852b09abc	[FSDP2] Used `_chunk_cat` for reduce-scatter copy-in (#122888 ) This PR uses `_chunk_cat` to fuse padding gradients on dim-0, chunking into `world_size` chunks, and copying them into the reduce-scatter input. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122888 Approved by: https://github.com/yifuwang, https://github.com/BoyuanFeng, https://github.com/weifengpy ghstack dependencies: #122726, #122847	2024-03-28 20:35:45 +00:00
PyTorch MergeBot	8df99732a4	Revert "Workaround dind-rootless volumes mount as root (#122787 )" This reverts commit 84dc76156a0b8a73e56d80c3947ed9dd03c5ac5e. Reverted https://github.com/pytorch/pytorch/pull/122787 on behalf of https://github.com/zxiiro due to This broke rocm tests ([comment](https://github.com/pytorch/pytorch/pull/122787#issuecomment-2026022659))	2024-03-28 20:10:19 +00:00
Zhengxu Chen	dacc73669c	[export] Make quantizer compatible with the standard nn_module_stack. (#122819 ) Summary: When we migrate to torch.export, we won't put L['self'] as the prefix for all the fqn in nn_module_stack. This diff adds the branch to handle the new case. Test Plan: buck test mode/opt caffe2/test/quantization:test_quantization -- -r set_module_name Differential Revision: D55436617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122819 Approved by: https://github.com/tugsbayasgalan	2024-03-28 19:36:46 +00:00
Yang Chen	384de46395	[aoti] clear precomputed symbol replacements before cpp wrapper compilation (#122882 ) After we codegen a triton kernel in the triton codegen backend, we cache the generated triton source code in the wrapper to avoid producing multiple triton kernels with the same content. In the AOTI compilation flow, this caching mechanism imposes a strong requirement on the codegen: we must generate the same triton source code for the same schedule node in both the python and cpp codegen phases. Otherwise, we would end up with a mismatch between the kernel name formed in the cpp codegen and the cuda kernel key produced from the python codegen. Consequently, we would hit a missing-cuda-kernel error. The precomputed symbol replacements saved in V.graph.sizevars can cause such source-code inconsistency in the code for indexing tensors. For example, let's say that in the python codegen phase we produce "ks2*48" as part of indexing an input for schedule node A while yielding a replacement pair "ks0 -> ks2*48" in the precomputed replacements. In the second, cpp codegen phase, we would produce "ks0" for the same indexing code of schedule node A due to the "ks0 -> ks2*48" replacement pair. This PR fixed the issue by clearing precomputed_replacements and inv_precomputed_replacements before cpp wrapper codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122882 Approved by: https://github.com/desertfire	2024-03-28 19:06:29 +00:00
Shubhraprakash Das	646dd1ab8d	Rewrite quantized conv transpose2d for vulkan (#122547 ) Summary: Vulkan rewrite so that quantized transpose 2d ops can run in a model Test Plan: Run vulkan api test: # buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output" # buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_api_test_binAppleMac Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc [==========] Running 418 tests from 1 test suite. [----------] Global test environment set-up. [----------] 418 tests from VulkanAPITest .... [----------] Global test environment tear-down [==========] 418 tests from 1 test suite ran. (4510 ms total) [ PASSED ] 417 tests. [ SKIPPED ] 1 test, listed below: [ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log YOU HAVE 9 DISABLED TESTS Run quantized vulkan api test: Note that the quantized linear tests are failing but all the convolution tests still pass. Linear failures are being debugged. # buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output" # buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc [==========] Running 86 tests from 1 test suite. [----------] Global test environment set-up. [----------] 86 tests from VulkanAPITest ... [ PASSED ] 77 tests. 
[ FAILED ] 9 tests, listed below: [ FAILED ] VulkanAPITest.linear_2d_flat [ FAILED ] VulkanAPITest.linear_2d_small [ FAILED ] VulkanAPITest.linear_2d_large [ FAILED ] VulkanAPITest.linear_3d_flat [ FAILED ] VulkanAPITest.linear_3d_small [ FAILED ] VulkanAPITest.linear_3d_large [ FAILED ] VulkanAPITest.linear_4d_flat [ FAILED ] VulkanAPITest.linear_4d_small [ FAILED ] VulkanAPITest.linear_4d_large 9 FAILED TESTS YOU HAVE 8 DISABLED TESTS # Run CUNET quantized model on hibiki board. Reviewed By: manuelcandales Differential Revision: D52344263 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122547 Approved by: https://github.com/manuelcandales, https://github.com/copyrightly, https://github.com/yipjustin	2024-03-28 18:51:44 +00:00
Guilherme Leobas	71b5b7e081	Let dynamo trace some functions in functorch.deprecated.* namespace (#121665 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121665 Approved by: https://github.com/zou3519 ghstack dependencies: #121410	2024-03-28 18:50:43 +00:00
Mu-Chu Lee	966ae943df	Add wrapper for fbgemm quantization operations (#122763 ) Summary: We add wrappers for fbgemm's packing so we can pass it through PT2 to the lowering phase of AOTInductor. Test Plan: Included in commit. test_quantized_ops::test_wrapped_fbgemm_linear_fp16 Differential Revision: [D55433204](https://our.internmc.facebook.com/intern/diff/D55433204) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122763 Approved by: https://github.com/jerryzh168 ghstack dependencies: #122762	2024-03-28 18:41:18 +00:00
Edward Z. Yang	e296722e0e	Z3 validation: Lift operators later when we actually run with Z3 (#122791 ) Previously, we lifted operators putting them into the FX graph, limiting the applicability of the FX graph for only Z3. Now, we lift operators when we are interpreting, which means I can use the graph for other things. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122791 Approved by: https://github.com/Chillee, https://github.com/lezcano	2024-03-28 18:31:30 +00:00
rzou	3d2d7ba19d	Delete torch.autograd.function.traceable APIs (#122817 ) We deprecated them in 2.3 with plans to delete in 2.4. Very few OSS repos use this flag at all and it also does nothing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122817 Approved by: https://github.com/albanD	2024-03-28 18:24:15 +00:00
Mu-Chu Lee	a3b30851c5	Add quantized.linear_unpacked_dynamic_fp16 (#122762 ) Summary: We add a new op quantized.linear_unpacked_dynamic_fp16, which is essentially linear_dynamic_fp16 with different (unpacked) weight/bias format. This op does packing on the fly for each call with standard at::Tensor weight & bias. Test Plan: Included in commit. test_quantized_op::test_unpacked_qlinear_dynamic_fp16 Differential Revision: [D55433203](https://our.internmc.facebook.com/intern/diff/D55433203) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122762 Approved by: https://github.com/jerryzh168	2024-03-28 18:02:27 +00:00
David Berard	59f6393209	[docs] Update PT2+Profiler docs (#122272 ) Document: * Torch-Compiled Region * What to expect in kernels inside a torch-compiled region For review, see https://docs-preview.pytorch.org/pytorch/pytorch/122272/torch.compiler_profiling_torch_compile.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/122272 Approved by: https://github.com/aaronenyeshi	2024-03-28 17:52:28 +00:00
Mu-Chu Lee	091a24495b	[AOTInductor] Support use_runtime_constant_folding for CPU. (#122563 ) Summary: We allow CPU to use the config use_runtime_constant_folding. Changes include 1. Rearrange USE_CUDA flags. Add CPU sections that consume memory directly. 2. Codegen changes to accommodate cpp fusions for CPU only. Specifically, we shouldn't generate 2 headers that would cause re-declaration. Test Plan: Activate tests that were deactivated for CPU before. Reviewed By: khabinov Differential Revision: D55234300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122563 Approved by: https://github.com/chenyang78	2024-03-28 17:49:05 +00:00
Hector Yuen	8a33a77fd1	Back out "Added a check in register_lowering to avoid decomposed ops (#117632 )" (#122709 ) Summary: Original commit changeset: ebda663a196b Original Phabricator Diff: D55271788 Test Plan: Some models are failing torch compile with this, retrying the tests Reviewed By: colinchan15 Differential Revision: D55374457 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122709 Approved by: https://github.com/huydhn	2024-03-28 17:46:57 +00:00
Menglu Yu	4670dcc94c	[Inductor]Fix a couple of broken unit tests (#122714 ) Summary: Titled Test Plan: ``` buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion ``` Buck UI: https://www.internalfb.com/buck2/ad05a43c-cb4a-443e-8904-b4d53e4f4b1e Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798909218388 Network: Up: 107KiB Down: 28KiB (reSessionID-d7146e4f-773a-46ea-9852-f10f59302479) Jobs completed: 24. Time elapsed: 1:49.3s. Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2) Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor/fb:split_cat_fx_passes_fb ``` Buck UI: https://www.internalfb.com/buck2/82dbf3b0-c747-4c07-98b8-53b69afa3157 Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900267699118 Network: Up: 1.4GiB Down: 2.3GiB (reSessionID-0bd22c6d-5dfe-4b4a-bc24-705eadac884b) Jobs completed: 252570. Time elapsed: 7:25.2s. Cache hits: 95%. Commands: 123778 (cached: 117999, remote: 2779, local: 3000) Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0 Differential Revision: D55378009 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122714 Approved by: https://github.com/SherlockNoMad	2024-03-28 17:44:30 +00:00
Zhicheng Yan	07f94df1a6	[torch quantization]fix HistogramObserver OOM when (self.max_val - self.min_val) is too small (#122659 ) Differential Revision: D55347133 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122659 Approved by: https://github.com/jerryzh168	2024-03-28 17:41:21 +00:00
Nicolas Macchioni	d65b9dff73	[AMD] turn off triton memcache for amd devices (#122560 ) Summary: triton memcache is not supported on amd devices yet and causes torch.compile to fail Created from CodeHub with https://fburl.com/edit-in-codehub Test Plan: ci Sandcastle run Differential Revision: D55285655 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122560 Approved by: https://github.com/jansel	2024-03-28 17:38:21 +00:00
Tugsbayasgalan Manlaibaatar	d9a08de9a4	Add Opinfo entries for HOP testing (#122265 ) In this PR, we add a systematic way to test that all HOPs are exportable, as the export team has been running into various bugs related to newly added HOPs due to lack of tests. We do this by creating: - hop_db -> a list of HOP OpInfo tests which are then used inside various flows including export functionalities: [aot-export, pre-dispatch export, retrace, and ser/der] For now, we also create an allowlist so that people can bypass the failures. But we should discourage people from doing that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122265 Approved by: https://github.com/ydwu4, https://github.com/zou3519	2024-03-28 17:36:43 +00:00
uvos	0bfa9f4758	[ROCm][ATen][Native] Fix kernel cache selecting kernels for incorrect architectures (#121401 ) Fixes #120794 Torch creates a cache of compiled kernels at $HOME/.cache/torch/kernels. The names used to save and select the cached kernels use cuda_major and cuda_minor to identify the GPU architecture for which the GPU kernels were compiled. On ROCm this is insufficient, as on ROCm cudaDeviceProp cuda_major and cuda_minor are mapped to hipDeviceProp_t::major and hipDeviceProp_t::minor, which correspond to the first and second number of the LLVM target corresponding to the architecture in question: GFX1030 is major = 10, minor = 3 GFX1032 is major = 10, minor = 3 GFX900 is major = 9, minor = 0 GFX906 is major = 9, minor = 0 GFX908 is major = 9, minor = 0 Thus it can be seen that hipDeviceProp_t::major and hipDeviceProp_t::minor are insufficient to uniquely identify the ROCm architecture. This causes the ROCm runtime to raise an error when an operation uses a cached kernel that was first cached on an architecture with the same hipDeviceProp_t::major and hipDeviceProp_t::minor but a different LLVM target. The solution provided in this PR is to replace the use of hipDeviceProp_t::major, hipDeviceProp_t::minor with hipDeviceProp_t::gcnArchName when PyTorch is compiled for ROCm, which contains a string identical to the LLVM target of the architecture in question. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121401 Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/malfet	2024-03-28 17:24:31 +00:00
Menglu Yu	9693797491	[PT2][Inductor][Observability] Improve the optimus scuba log (#122361 ) Summary: Titled Test Plan: ``` buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion ``` Test UI: https://www.internalfb.com/intern/testinfra/testrun/18014398535709463 Network: Up: 113KiB Down: 480KiB (reSessionID-1d2e3558-15b5-4a4e-8c5d-10c983afb389) Discovered 9. Pass 0. Fail 0. Fatal 0. Skip 0. Timeout 0 Command: test. Remaining: 9/24. Cache hits: 0%. Time elapsed: 44.3s Command: test. Remaining: 9/24. Cache hits: 0%. Time elapsed: 44.4s Command: test. Remaining: 9/24. Cache hits: 0%. Time elapsed: 44.5s Network: Up: 117KiB Down: 507KiB (reSessionID-1d2e3558-15b5-4a4e-8c5d-10c983afb389) Jobs completed: 24. Time elapsed: 1:48.3s. Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2) Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` buck2 test mode/dev-nosan //caffe2/test/inductor:split_cat_fx_passes ``` Test UI: https://www.internalfb.com/intern/testinfra/testrun/16044073698893554 Network: Up: 120KiB Down: 60KiB (reSessionID-57f2c21b-3f4e-462b-9e5b-fe3dd15f6b7d) Jobs completed: 28. Time elapsed: 1:47.5s. Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2) Tests finished: Pass 11. Fail 0. Fatal 0. Skip 0. 
Build failure 0 optimus_scuba_log: ``` {'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GIbj2haUwKx69H8BAKXdGqXZSpoybr0LAAAz', 'group_batch_fusion_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GFqhiRYcJ_C4JFoDABKPTsfpzjJ_br0LAAAz', 'normalization_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GIvswhaiAVyipcoGAJZ5sUi8Bb5qbr0LAAAz', 'remove_split_with_size_one_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GFneTxcVBPaqVuwCADCiI4q1mEwlbr0LAAAz', 'merge_getitem_cat_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GJc0Phn87ljuMO0CADBPGqqehKp2br0LAAAz', 'merge_splits_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLWB_BbvLyT7D_0DABmygDYPDjJ_br0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GO6eQBeIj6oV3o4JAFLzQ3ECMTIrbr0LAAAz', 'inductor_pre_grad': Counter({'pattern_matcher_nodes': 2006, 'pattern_matcher_count': 1806, 'normalization_pass': 861, 'remove_split_with_size_one_pass': 748, 'merge_splits_pass': 82, 'merge_getitem_cat_pass': 11, 'scmerge_split_sections_removed': 4, 'batch_layernorm': 1, 'batch_sigmoid': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'scmerge_split_removed': 1, 'scmerge_cat_removed': 1}), 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMoKmxYg6AUeQ40KAMDaJ4EVDwYmbr0LAAAz', 'group_batch_fusion_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GHIvQxkrV1PMBggEACv7786a2bE8br0LAAAz', 'after_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GIpBNxXupQTHWx8BALSiVrKgDbtfbr0LAAAz', 'inductor_post_grad': Counter({'pattern_matcher_nodes': 2093, 'pattern_matcher_count': 1893, 'normalization_pass': 861, 'remove_split_with_size_one_pass': 748, 'merge_splits_pass': 82, 'merge_getitem_cat_pass': 11, 
'scmerge_split_sections_removed': 4, 'batch_layernorm': 1, 'batch_sigmoid': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'scmerge_split_removed': 1, 'scmerge_cat_removed': 1, 'batch_aten_mul': 1})} ``` Differential Revision: D55107000 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122361 Approved by: https://github.com/jackiexu1992	2024-03-28 17:13:32 +00:00
Hongtao Yu	049d68d8bb	[inductor][Autotune] Add matrix_instr_nonkdim to triton_meta (#122852 ) Summary: Previous work `https://github.com/pytorch/pytorch/pull/120742` to enable `matrix_instr_nonkdim` only dealt with the autotuner benchmarking, but failed to enable the parameter in Triton meta for real runs. `matrix_instr_nonkdim` needs to be visible to the compiler driver to set up the optimization pipeline, so it's unlike other kernel parameters such as `BLOCK_N` that can be just set inside the kernel itself. Test Plan: P1201466917 triton_heuristics.template( num_stages=1, num_warps=4, triton_meta={'signature': {0: 'fp32', 1: 'fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())], 'matrix_instr_nonkdim': 16}, inductor_meta={'kernel_name': 'triton_tem_fused_mm_0', 'backend_hash': None}, ) Perf : Before: 1.693ms 0.134GB 79.28GB/s After: 1.577ms 0.134GB 85.12GB/s Differential Revision: D55456401 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122852 Approved by: https://github.com/xw285cornell	2024-03-28 16:58:38 +00:00
fzyzcjy	1e8d4b389b	Super tiny fix typo (#122881 ) "CustoType" -> "CustomType" Pull Request resolved: https://github.com/pytorch/pytorch/pull/122881 Approved by: https://github.com/awgu	2024-03-28 16:13:25 +00:00
PyTorch MergeBot	958dbb876c	Revert "`_foreach_copy` with different src/dst dtypes (#121717 )" This reverts commit da2a9a05127c2b44e447e734d99e727d856cb36f. Reverted https://github.com/pytorch/pytorch/pull/121717 on behalf of https://github.com/janeyx99 due to Causing IMAs on V100s internally :C ([comment](https://github.com/pytorch/pytorch/pull/121717#issuecomment-2025553295))	2024-03-28 15:54:40 +00:00
PyTorch MergeBot	8698121636	Revert "Add RMSNorm module (#121364 )" This reverts commit a7306de0dc96cda8b698d19680a88d27aa45a31d. Reverted https://github.com/pytorch/pytorch/pull/121364 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/121364#issuecomment-2025502007))	2024-03-28 15:31:10 +00:00
PyTorch MergeBot	8007d9a34a	Revert "[fx] Preserve Fx graph node order in partitioner across runs (#115621 )" This reverts commit f2c1060de3cdddbfefcab11e547211993d0f9cfa. Reverted https://github.com/pytorch/pytorch/pull/115621 on behalf of https://github.com/atalman due to Broke internal executorch test ([comment](https://github.com/pytorch/pytorch/pull/115621#issuecomment-2025496296))	2024-03-28 15:28:02 +00:00
Andrew Gu	9208df45cb	Fixed increasing CPU overhead of `RemovableHandle.__init__` (#122847 ) For some reason, if we construct `class Handle(RemovableHandle)` inside `register_multi_grad_hook`, then over time, the call to `RemovableHandle.__init__` slows down more and more (when we have GC disabled). Perhaps this is related to the class attribute `next_id: int = 0`. Python experts: please let me know if you have thoughts 😅 I am open to any suggestions on how we should deal with this `Handle` class. For now, I changed it to a private `_MultiHandle`. <details> <summary> Experiment Script </summary> ``` import gc import time import torch NUM_TENSORS = int(5e4) ts = [torch.empty(1, requires_grad=True) for _ in range(NUM_TENSORS)] def hook(grad) -> None: return gc.disable() times = [] for i, t in enumerate(ts): start_time = time.time() torch.autograd.graph.register_multi_grad_hook([t], hook) end_time = time.time() times.append(end_time - start_time) print([f"{t * 1e6:.3f} us" for t in times[1:6]]) # print first few times print([f"{t * 1e6:.3f} us" for t in times[-5:]]) # print last few times times = [] for i, t in enumerate(ts): start_time = time.time() t.register_hook(hook) end_time = time.time() times.append(end_time - start_time) print([f"{t * 1e6:.3f} us" for t in times[1:6]]) # print first few times print([f"{t * 1e6:.3f} us" for t in times[-5:]]) # print last few times ``` </details> <details> <summary> Results </summary> Before fix: ``` ['23.603 us', '19.550 us', '15.497 us', '12.875 us', '13.828 us'] ['327.110 us', '341.177 us', '329.733 us', '332.832 us', '341.177 us'] ['318.050 us', '315.189 us', '319.719 us', '311.613 us', '308.990 us'] ['374.317 us', '394.821 us', '350.714 us', '337.362 us', '331.402 us'] ``` Calling `register_multi_grad_hook` makes both itself and `register_hook` slower (in fact, any call to `RemovableHandle.__init__`).
After fix: ``` ['13.590 us', '9.060 us', '12.875 us', '7.153 us', '8.583 us'] ['4.530 us', '5.245 us', '6.437 us', '4.768 us', '5.007 us'] ['2.623 us', '1.907 us', '1.431 us', '1.669 us', '1.192 us'] ['1.431 us', '1.431 us', '1.192 us', '1.192 us', '1.431 us'] ``` </details> Update: from @soulitzer > Your suspicion about next_id is right. I think what is happening is that whenever a class attribute is set, it needs to invalidate some cached data for the subclasses one-by-one. `eefff682f0/Objects/typeobject.c (L845)` And this PR fixes the issue by avoiding creating many subclasses dynamically. Changing next_id to something like List[int] or incrementing a global instead also fixes this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122847 Approved by: https://github.com/soulitzer ghstack dependencies: #122726	2024-03-28 15:24:12 +00:00
PyTorch MergeBot	4290a57e9c	Revert "[NJT] .to() properly updates device of offsets (#122797 )" This reverts commit 3e7fd45b409966440c54f5e370885b4b2a388a01. Reverted https://github.com/pytorch/pytorch/pull/122797 on behalf of https://github.com/jeffdaily due to Sorry for reverting your change but it is failing CUDA and ROCm jobs in trunk. Please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/122797#issuecomment-2025473181))	2024-03-28 15:17:45 +00:00
cyy	d6aed1b692	Fix clang-tidy warnings of aten/src/ATen/functorch (#122779 ) This PR fixes some performance related clang-tidy warnings of aten/src/ATen/functorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122779 Approved by: https://github.com/ezyang	2024-03-28 15:15:06 +00:00
PyTorch MergeBot	6e1c81c687	Revert "Let dynamo trace some functions in functorch.deprecated.* namespace (#121665 )" This reverts commit f9eab9ca92c603e671e7714669758a81ce8d7111. Reverted https://github.com/pytorch/pytorch/pull/121665 on behalf of https://github.com/guilhermeleobas due to revert PR ([comment](https://github.com/pytorch/pytorch/pull/121665#issuecomment-2025460500))	2024-03-28 15:11:51 +00:00
Guilherme Leobas	f9eab9ca92	Let dynamo trace some functions in functorch.deprecated.* namespace (#121665 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121665 Approved by: https://github.com/zou3519 ghstack dependencies: #121410	2024-03-28 15:07:18 +00:00
Simon Fan	f178d996a8	[dynamo] Fix traceback generation on runtime errors (#122746 ) Fixes `During handling of the above exception, another exception occurred: [...] torch._dynamo.exc.Unsupported: generator`. traceback.format_exc uses generators which isn't supported by dynamo yet. <details> <summary>current error message</summary> ``` ====================================================================== ERROR: test_custom_fn_saved_tensors (__main__.TestCompiledAutograd) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 307, in __call__ return super(self.cls, obj).__call__(args, kwargs) # type: ignore[misc] File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl return self._call_impl(args, *kwargs) File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1537, in _call_impl return forward_call(args, *kwargs) File "<eval_with_key>.0", line 4, in forward def forward(self, inputs, sizes, hooks): IndexError: list index out of range During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/xmfan/core/pytorch/torch/testing/_internal/common_utils.py", line 2741, in wrapper method(args, *kwargs) File "/home/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py", line 499, in test_custom_fn_saved_tensors self.check_output_and_recompiles(fn, 1) File "/home/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py", line 61, in check_output_and_recompiles actual = list(opt_fn()) File "/home/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py", line 495, in fn loss.backward() File "/home/xmfan/core/pytorch/torch/_tensor.py", line 534, in backward torch.autograd.backward( File "/home/xmfan/core/pytorch/torch/autograd/__init__.py", line 267, in backward _engine_run_backward( File "/home/xmfan/core/pytorch/torch/autograd/graph.py", line 766, in 
_engine_run_backward return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl return self._call_impl(args, *kwargs) File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1537, in _call_impl return forward_call(args, *kwargs) File "/home/xmfan/core/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn res = fn(args, *kwargs) File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 741, in call_wrapped return self._wrapped_call(self, args, *kwargs) File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 315, in __call__ _WrappedCall._generate_error_message(topmost_framesummary), File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 289, in _generate_error_message tb_repr = get_traceback() File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 288, in get_traceback return traceback.format_exc() File "/home/xmfan/.conda/envs/benchmarks/lib/python3.10/traceback.py", line 183, in format_exc return "".join(format_exception(sys.exc_info(), limit=limit, chain=chain)) File "/home/xmfan/.conda/envs/benchmarks/lib/python3.10/traceback.py", line 136, in format_exception return list(te.format(chain=chain)) File "/home/xmfan/core/pytorch/torch/_dynamo/convert_frame.py", line 941, in catch_errors return callback(frame, cache_entry, hooks, frame_state, skip=1) File "/home/xmfan/core/pytorch/torch/_dynamo/convert_frame.py", line 348, in _convert_frame_assert unimplemented("generator") File "/home/xmfan/core/pytorch/torch/_dynamo/exc.py", line 199, in unimplemented raise Unsupported(msg) torch._dynamo.exc.Unsupported: generator ``` </details> With this change, we get back the descriptive error message: <details> <summary>post-fix error message</summary> ``` Traceback (most recent call last): File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 307, in __call__ return super(self.cls, 
obj).__call__(args, kwargs) # type: ignore[misc] File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl return self._call_impl(args, *kwargs) File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1537, in _call_impl return forward_call(args, **kwargs) File "<eval_with_key>.0", line 4, in forward def forward(self, inputs, sizes, hooks): IndexError: list index out of range Call using an FX-traced Module, line 4 of the traced Module's generated forward function: def forward(self, inputs, sizes, hooks): ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE getitem = inputs[0] getitem_1 = inputs[1]; inputs = None ``` </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122746 Approved by: https://github.com/jansel, https://github.com/anijain2305 ghstack dependencies: #122691	2024-03-28 14:40:54 +00:00
Simon Fan	1d96791661	[dynamo] Fix list proxy to list element proxy source propagation (#122691 ) Currently, when we create proxies for a list's elements in wrap_fx_proxy_cls, we create them using the same source as the list's, e.g. `LocalSource(inputs)` instead of `GetItemSource(LocalSource(inputs), index=i)`. This results in invalid guards when the tensors it contains become dynamic, and the guard system thinks the list is a tensor: ``` Malformed guard: L['sizes'][0] == L['inputs'].size()[0] Malformed guard: 2 <= L['inputs'].size()[0] Traceback [...] AttributeError: 'list' object has no attribute 'size' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122691 Approved by: https://github.com/jansel, https://github.com/anijain2305	2024-03-28 14:40:54 +00:00
Edward Z. Yang	0284bca99b	Don't cache device_count if we haven't initialized CUDA yet (#122815 ) Before initializing CUDA, it can change by modifying CUDA_VISIBLE_DEVICES Fixes https://github.com/pytorch/pytorch/issues/122085 Fixes https://github.com/pytorch/pytorch/issues/38616 Fixes https://github.com/pytorch/pytorch/issues/110000 Fixes https://github.com/pytorch/pytorch/issues/110971 Fixes https://github.com/pytorch/pytorch/issues/95073 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122815 Approved by: https://github.com/albanD	2024-03-28 13:23:45 +00:00
Thanh Ha	84dc76156a	Workaround dind-rootless volumes mount as root (#122787 ) In ARC Runners we are using dind-rootless to run docker-in-docker and in rootless mode volume mounts always mount as root but are mapped to the local `runner` user in ARC. This causes the build.sh and test.sh scripts to fail because they run as the `jenkins` user and expect to be able to write to the workspace path that's being mounted. Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>	2024-03-28 09:06:40 -04:00
cyy	d1da9cc654	[ClangTidy] Disable misc-include-cleaner (#122855 ) misc-include-cleaner was introduced in clang-tidy-17 as a way to check missing and unused includes. However, there are lots of transitive headers in PyTorch and it would take enormous efforts to add related annotations to them in order to direct this checker. For this reason, it's better to disable it now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122855 Approved by: https://github.com/cpuhrsch	2024-03-28 10:10:43 +00:00
Edward Z. Yang	8c8e4e31f2	Some improvements to nonzero post guard_size_oblivious (#122156 ) Prompted by https://github.com/pytorch/pytorch/pull/121571 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122156 Approved by: https://github.com/jansel	2024-03-28 03:53:16 +00:00
Michael Lazos	caa57e4fcd	Add tensor step and capturable support to rmsprop (#122264 ) Towards fixing https://github.com/pytorch/pytorch/issues/115679 Fixes RMSprop step update while compiling Adds capturable support to RMSprop Pull Request resolved: https://github.com/pytorch/pytorch/pull/122264 Approved by: https://github.com/janeyx99	2024-03-28 03:39:28 +00:00
PyTorch UpdateBot	927bc4b558	[vision hash update] update the pinned vision hash (#122754 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122754 Approved by: https://github.com/pytorchbot	2024-03-28 03:27:07 +00:00
PyTorch UpdateBot	c10352a406	[audio hash update] update the pinned audio hash (#122584 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122584 Approved by: https://github.com/pytorchbot	2024-03-28 03:26:21 +00:00
Jason Ansel	235f24fc66	[inductor] Add FileLock around V.debug.copy (#122665 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122665 Approved by: https://github.com/ezyang	2024-03-28 03:17:33 +00:00
Kurt Mohler	1b5ccdb0f0	Avoid COW materialize in more forward ops (#122720 ) Affected ops: * ormqr * lerp * multinomial * bernoulli * histogram * searchsorted * log_softmax * jiterator ops * dropout * _segment_reduce Part of #97856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122720 Approved by: https://github.com/ezyang	2024-03-28 03:02:13 +00:00
Animesh Jain	60f3c092d4	[dynamo] Config option to Inline builtin nn module forward (#122725 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122725 Approved by: https://github.com/jansel ghstack dependencies: #122646, #122647, #122716, #122769, #122818	2024-03-28 03:01:27 +00:00
Animesh Jain	d4317becce	[dynamo][easy] Force recompilation in a test (#122818 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122818 Approved by: https://github.com/williamwen42 ghstack dependencies: #122646, #122647, #122716, #122769	2024-03-28 03:01:27 +00:00
chilli	52b1d2a73d	Increase timm batch sizes to make less overhead-bound and less noisy (#122581 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122581 Approved by: https://github.com/ezyang ghstack dependencies: #122686, #122688, #121692, #122841	2024-03-28 02:34:32 +00:00
Tristan Rice	e6ee8322d7	nn.Module: use swap_tensors for Tensor subclasses (#122755 ) This fixes a bug when casting a module that has DTensor parameters. The old behavior will swap the .data field of the Tensor subclass which is incorrect behavior when dealing with tensor subclasses that may have multiple child tensors. This uses the `swap_tensors` method to swap all of the tensors not just the .data field. Test plan: ``` pytest test/distributed/_tensor/test_api.py -k 'test_distribute_module_casting' python test/distributed/fsdp/test_wrap.py -k test_auto_wrap_smoke_test_cuda_init_mode1_cpu_offload0_use_device_id_True ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122755 Approved by: https://github.com/wanchaol, https://github.com/mikaylagawarecki	2024-03-28 02:03:09 +00:00
soulitzer	3e7fd45b40	[NJT] .to() properly updates device of offsets (#122797 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122797 Approved by: https://github.com/jbschlosser	2024-03-28 00:56:23 +00:00
Boyuan Feng	574a8ccf10	Remove several `expectedFailureNonStrict` (#122802 ) This PR removes several `expectedFailureNonStrict` from `test_export.py`, where the error messages from strict and non-strict export differ a bit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122802 Approved by: https://github.com/ydwu4	2024-03-28 00:42:49 +00:00
Xinya Zhang	12116aee68	Add Flash Attention support on ROCM (#121561 ) This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton) - [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`). * MI300X is supported. More architectures will be added once Triton supports them. - [x] Only supports power of two sequence lengths. * Now it supports arbitrary sequence lengths - [ ] No support for varlen APIs. * varlen API will be supported in a future release of AOTriton - [x] Only supports head dimension 16,32,64,128. * Now it supports arbitrary head dimension <= 256 - [x] Performance is still being optimized. * Kernel is selected according to autotune information from Triton. Other improvements from AOTriton include * Allow more flexible Tensor storage layout * More flexible API This is a more extensive fix to #112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561 Approved by: https://github.com/huydhn	2024-03-28 00:27:38 +00:00
Animesh Jain	8d676a6e8e	[dynamo][cpp-guards] Bugfix for size/strides for tensor match (#122828 ) This got missed because CPP guard manager is not ON by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122828 Approved by: https://github.com/mlazos, https://github.com/jansel	2024-03-28 00:16:49 +00:00
Aidyn-A	66510c641f	[c10d][NCCL] Refactor coalesced storage (#122651 ) The `coalescedDevice_` and `coalescedComms_` are used inefficiently and, in the case of consecutive coalescing comms, can cause a read-before-write condition. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122651 Approved by: https://github.com/kwen2501, https://github.com/eqy	2024-03-27 23:56:02 +00:00
Mikayla Gawarecki	cc12668053	Fix swap_tensors path in _apply for modules that inherit from RNNBase (RNN, GRU, LSTM) (#122800 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122800 Approved by: https://github.com/albanD	2024-03-27 23:34:16 +00:00
chilli	0348773655	Forward fix for subtly breaking AC with compile in the case of stacked (#122841 ) checkpoint layers separated by recomputable op Pull Request resolved: https://github.com/pytorch/pytorch/pull/122841 Approved by: https://github.com/anijain2305 ghstack dependencies: #122686, #122688, #121692	2024-03-27 23:23:04 +00:00
Aaron Orenstein	a8b7480f0d	fix dynamo.explain examples (#122745 ) `dynamo.explain()` was updated to return a structure but the docs weren't updated to match. - Update the docs to use the new API - Remove some dead code left when `explain` was updated. - Drive-by: Fix some `nopython` uses that I noticed - Drive-by: I noticed an ignored error coming from CleanupHook on shutdown - make it check the global before setting it. Fixes #122573 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122745 Approved by: https://github.com/jansel	2024-03-27 22:53:27 +00:00
chilli	a54ea7bbd8	Made several changes to min-cut partitioner that allow it to recompute more things (#121692 ) Perf results <img width="862" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/8d44e633-8941-46a6-8e7d-806330a8c890"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121692 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #122686, #122688	2024-03-27 22:45:52 +00:00
PyTorch MergeBot	bef01c7c2b	Revert "Optimize multi_tensor_apply (take 2) (#119764 )" This reverts commit fe41ba47652ca73569453bddb43605c77bb85184. Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2024105399))	2024-03-27 22:42:07 +00:00
Oguz Ulgen	222dfc4282	[Inductor] Run pattern matcher over the original graph (#122519 ) Differential Revision: [D55429070](https://our.internmc.facebook.com/intern/diff/D55429070) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122519 Approved by: https://github.com/jansel	2024-03-27 22:09:36 +00:00
zdevito	530e13cf3d	Revert "[c10d] disable compute_duration by default (#122138 )" (#122539 ) This reverts commit bf18e967b4abc90c27ad460680497d8f5ec55962. It is stacked after a fix to elapsed_time that will resolve the memory issues that required the introduction of this flag. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122539 Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang ghstack dependencies: #122538	2024-03-27 21:53:28 +00:00
Guilherme Leobas	933d3a7829	Allow dynamo to inline through "hessian" (#121410 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121410 Approved by: https://github.com/zou3519	2024-03-27 21:39:37 +00:00
Mikayla Gawarecki	a7306de0dc	Add RMSNorm module (#121364 ) Similar to `dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L51)` The implementation here is not optimized and we welcome pull requests to improve this - Use `normalized_shape` instead of singular integer `dim` to be aligned with the `nn.LayerNorm` implementation - Remove the [upcast to float and downcast ](`dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L73)`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121364 Approved by: https://github.com/albanD	2024-03-27 21:39:30 +00:00
Boyuan Feng	b693fff5d7	Add non strict inline constraints and runtime assertions to non-strict exported program (#122722 ) This PR reduces the difference between strict and non-strict exported program by - Support `inline_constraints` for non-strict exported program - Add runtime assertions for range constraints to non-strict exported program After this PR, the following unit tests are no longer `expectedFailureNonStrict`: - test_automatic_constrain_size - test_export_with_inline_constraints - test_redundant_asserts - test_constrain_size_with_constrain_value Pull Request resolved: https://github.com/pytorch/pytorch/pull/122722 Approved by: https://github.com/pianpwk	2024-03-27 21:20:03 +00:00
William Wen	abe4a0e9eb	[dynamo] pop result of print reordering (#122744 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122744 Approved by: https://github.com/jansel ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738, #122739, #122740, #122741, #122742, #122743	2024-03-27 20:39:39 +00:00
William Wen	76fe0faadd	[dynamo, 3.12] add END_SEND (#122743 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122743 Approved by: https://github.com/jansel ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738, #122739, #122740, #122741, #122742	2024-03-27 20:39:39 +00:00
William Wen	c5d372dafc	[dynamo, 3.12] trace through __mro__ attribute access (#122742 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122742 Approved by: https://github.com/jansel ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738, #122739, #122740, #122741	2024-03-27 20:39:39 +00:00
William Wen	71d40ff861	[dynamo, 3.12] fix typing variable tracing (#122741 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122741 Approved by: https://github.com/jansel ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738, #122739, #122740	2024-03-27 20:39:39 +00:00
William Wen	5d0a792d5f	[dynamo, 3.12] fix some tests (#122740 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122740 Approved by: https://github.com/jansel ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738, #122739	2024-03-27 20:39:39 +00:00
William Wen	a9704848d1	[dynamo, 3.12] add CALL_INTRINSIC_1 (#122739 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122739 Approved by: https://github.com/jansel ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738	2024-03-27 20:39:39 +00:00
William Wen	8e5a4248a3	[dynamo, 3.12] add LOAD_SUPER_ATTR (#122738 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122738 Approved by: https://github.com/jansel ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737	2024-03-27 20:39:39 +00:00
William Wen	8cd7bb7422	[dynamo, 3.12] add LOAD_FAST variants (#122737 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122737 Approved by: https://github.com/jansel ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530	2024-03-27 20:39:39 +00:00
William Wen	a9b27bbbe9	[dynamo, 3.12] update jump instructions (#122530 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122530 Approved by: https://github.com/jansel ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456	2024-03-27 20:39:39 +00:00
William Wen	f44f16ebd5	[dynamo, 3.12] add END_FOR (#122456 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122456 Approved by: https://github.com/jansel ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455	2024-03-27 20:39:39 +00:00
William Wen	bcdd0c6f59	[dynamo, 3.12] add BINARY/STORE_SLICE (#122455 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122455 Approved by: https://github.com/jansel ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449	2024-03-27 20:39:39 +00:00
William Wen	7b13228038	[dynamo, 3.12] fix DICT_VERSION C++ guards (#122449 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122449 Approved by: https://github.com/jansel ghstack dependencies: #122146, #122335, #122354, #122355, #122356	2024-03-27 20:39:39 +00:00
William Wen	01547960bc	[dynamo, 3.12] remove LOAD_METHOD, update LOAD_ATTR (#122356 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122356 Approved by: https://github.com/jansel ghstack dependencies: #122146, #122335, #122354, #122355	2024-03-27 20:39:39 +00:00
William Wen	8ba26f4aa5	[dynamo, 3.12] support RETURN_CONST (#122355 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122355 Approved by: https://github.com/jansel ghstack dependencies: #122146, #122335, #122354	2024-03-27 20:39:39 +00:00
William Wen	3a67c86f72	[dynamo, 3.12] remove references to PRECALL instruction in 3.12 (#122354 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122354 Approved by: https://github.com/jansel ghstack dependencies: #122146, #122335	2024-03-27 20:39:39 +00:00
William Wen	35382f0573	[dynamo, 3.12] Use CPython internal _PyOpcode_Caches instead of hardcoding (#122335 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122335 Approved by: https://github.com/jansel ghstack dependencies: #122146	2024-03-27 20:39:39 +00:00
William Wen	2564f6cf0e	[dynamo, 3.12] Allocate Dynamo shadow frames by mimicking CPython (#122146 ) Python 3.12 changed a few things with how `_PyInterpreterFrame`s are allocated and freed: - Frames are now required to be placed on the Python frame stack. In 3.11, we could allocate frames anywhere in memory. In 3.12, we now need to use `THP_PyThreadState_BumpFramePointerSlow`/`push_chunk`/`allocate_chunk`. This method of allocating/freeing frames is also compatible with 3.11. - The eval frame function is now responsible for clearing the frame (see https://docs.python.org/3/whatsnew/changelog.html#id128, the point about "...which now clear the frame.") Pull Request resolved: https://github.com/pytorch/pytorch/pull/122146 Approved by: https://github.com/jansel	2024-03-27 20:39:39 +00:00
blegouix	ccfc87b199	include scheduler_on_plateau in optim.h (#121722 ) Fixes #121593 Co-authored-by: Jane Xu <janeyx@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121722 Approved by: https://github.com/albanD	2024-03-27 19:45:25 +00:00
Animesh Jain	ceff2205e9	[dynamo][cpp-guards] Bugfix to pass on correct example_value (#122769 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122769 Approved by: https://github.com/jansel ghstack dependencies: #122646, #122647, #122716	2024-03-27 19:40:46 +00:00
Animesh Jain	7281c5afdc	[dynamo][fbcode][torchrec] Selectively inline torchrec/distributed/types.py (#122716 ) Manually verified for the internal model. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122716 Approved by: https://github.com/jansel ghstack dependencies: #122646, #122647	2024-03-27 19:40:46 +00:00
Animesh Jain	5b42c41b19	[dynamo][improve-guard-overhead] Skip TENSOR_MATCH guards on parameters for optimizers (#122647 ) 1.32x guard overhead reduction (1.092 vs 0.827 ms) for MegatronBertForCausalLM with 394 params. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122647 Approved by: https://github.com/jansel, https://github.com/mlazos ghstack dependencies: #122646	2024-03-27 19:40:43 +00:00
Animesh Jain	c108696228	[dynamo][guards-cpp-refactor][easy] Env variable to turn on cpp manager (#122646 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122646 Approved by: https://github.com/jansel	2024-03-27 19:40:37 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	1b9c7e41bb	Remove .data call in LSTM as it is not necessary (#122733 ) Summary: Title Test Plan: CI Differential Revision: D55392057 Functional pre-dispatch tracing chokes on the LSTM .data call today. While we need to fix it, this call seems unnecessary here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122733 Approved by: https://github.com/mikaylagawarecki, https://github.com/albanD	2024-03-27 19:08:22 +00:00
Andrew Gu	1d6fc0d4de	Fixed `_infer_device_type` warning in `checkpoint` (#122726 ) Previously, we were checking `len(device_types)` where `device_types` is a `list`. This meant that if there were multiple inputs, we would see something like `device_types = ["cuda", "cuda"]` and a false positive warning. We should check `len(set(device_types))`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122726 Approved by: https://github.com/soulitzer	2024-03-27 18:38:42 +00:00
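The difference between the two checks described in #122726 can be sketched in plain Python. This is a minimal illustration, not the code in `torch.utils.checkpoint`; the helper name `has_mixed_device_types` is hypothetical.

```python
def has_mixed_device_types(device_types):
    """Return True only when the inputs span more than one distinct device type."""
    # Deduplicate first: ["cuda", "cuda"] is a single device type.
    return len(set(device_types)) > 1


same_device_inputs = ["cuda", "cuda"]

# The old-style check counts list entries, so two CUDA inputs look "mixed".
print(len(same_device_inputs) > 1)

# The set-based check counts distinct device types instead.
print(has_mixed_device_types(same_device_inputs))
print(has_mixed_device_types(["cuda", "cpu"]))
```

Counting distinct values rather than list entries is what removes the false-positive warning for multiple inputs on the same device.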
Lucas Pasqualin	37e3c8f33f	[DCP] Supporting resolve_bytes in LoadPlanner (#122700 ) 1. Supporting resolve_bytes, similar to resolve_tensor. 2. This will allow us to load the bytes directly into the user-provided ioBytes buffer. This essentially mirrors the existing pattern we have for tensors, where the user is expected to follow some version of: ``` 1. resolve_tensor 2. copy to target tensor 3. commit_tensor ``` Differential Revision: [D55259699](https://our.internmc.facebook.com/intern/diff/D55259699/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122700 Approved by: https://github.com/Skylion007, https://github.com/wz337, https://github.com/pradeepfn	2024-03-27 17:43:32 +00:00
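The resolve-then-copy pattern that #122700 describes for bytes can be sketched with the standard library alone. `ToyLoadPlanner` and `load_bytes` below are hypothetical stand-ins written for illustration; they are not the `torch.distributed.checkpoint` API, whose exact signatures are not given in the commit message.

```python
import io


class ToyLoadPlanner:
    """Hypothetical stand-in for a planner; NOT the real DCP LoadPlanner."""

    def __init__(self, targets):
        # Maps a key to a buffer owned by the user.
        self._targets = targets

    def resolve_bytes(self, key):
        # Hand back the user's own buffer rather than allocating a new one.
        return self._targets[key]


def load_bytes(planner, key, payload):
    target = planner.resolve_bytes(key)  # 1. resolve the destination
    target.seek(0)
    target.truncate()                    # drop any stale contents
    target.write(payload)                # 2. copy into the user's buffer
    target.seek(0)                       # leave the buffer ready to read


user_buffer = io.BytesIO()
planner = ToyLoadPlanner({"extra_state": user_buffer})
load_bytes(planner, "extra_state", b"checkpoint-bytes")
print(user_buffer.read())
```

The point of the sketch is only that the loaded bytes end up in the object the user supplied, mirroring how the tensor path copies into a target tensor.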
eellison	cd51496f8b	add a couple debug options (#121033 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121033 Approved by: https://github.com/ezyang	2024-03-27 17:24:43 +00:00
Jerry Zhang	5af839f86d	[quant][pt2e] Enable observer sharing between different quantization specs (#122734 ) Summary: Right now we don't insert additional observers (share observers) if qspec.dtype and qspec.is_dynamic match exactly. Since the fixed qparams quantization spec and the derived quantization spec do not have an is_dynamic field currently, observer sharing does not happen between them and a quantization spec. In this PR we fix the issue by adding is_dynamic to all quantization specs. Note: SharedQuantizationSpec should probably be its own type in the future TODO later: (1). move all these fields (dtype, is_dynamic, quant_min, quant_max etc.) to QuantizationSpecBase, (2). make SharedQuantizationSpec a separate type (3). add quant_min/quant_max in observer sharing checking in pt2e/prepare.py Test Plan: python test/test_quantization.py -k test_fixed_qparams_qspec_observer_dedup Differential Revision: [D55396546](https://our.internmc.facebook.com/intern/diff/D55396546) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122734 Approved by: https://github.com/andrewor14	2024-03-27 16:45:19 +00:00
PyTorch MergeBot	b63f6f78dc	Revert "[Inductor] Run pattern matcher over the original graph (#122519 )" This reverts commit 1f5fcb4e203eb343e8c53f6444015c98e8f68d60. Reverted https://github.com/pytorch/pytorch/pull/122519 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/122519#issuecomment-2023022311))	2024-03-27 15:13:26 +00:00
Catherine Lee	f3b82a4dc2	[xla hash update] update the pinned xla hash (#122628 ) Originally made this PR since xla was failing, but the PR that changed the pin got reverted, so this is just a normal update now The old pin was ~2 weeks old? Currently XLA is broken https://github.com/pytorch/pytorch/actions/runs/8438508272/job/23115239444 Co-authored-by: Andrey Talman <atalman@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122628 Approved by: https://github.com/malfet, https://github.com/JackCaoG	2024-03-27 15:09:42 +00:00
PyTorch MergeBot	f140309e9c	Revert "Only update momentum buffers for SGD if momentum is enabled (#122349 )" This reverts commit a333b080c16a3a6bbb057b4fbaaec4a4e14615dd. Reverted https://github.com/pytorch/pytorch/pull/122349 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/122349#issuecomment-2023001467))	2024-03-27 15:04:52 +00:00
PyTorch MergeBot	70c3deef2d	Revert "[xla hash update] update the pinned xla hash (#122628 )" This reverts commit 04399a30913fd04c2120420b671cd432659d56e6. Reverted https://github.com/pytorch/pytorch/pull/122628 on behalf of https://github.com/atalman due to Need revert and then reland ([comment](https://github.com/pytorch/pytorch/pull/122628#issuecomment-2022995857))	2024-03-27 15:01:33 +00:00
Joel Schlosser	eb5381da66	Skip storage check debug assert in view codegen when output is a subclass instance (#122718 ) Before the fix, this assert blows up in DEBUG mode for views where the input (base) is a dense tensor and the output (view) is a subclass instance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122718 Approved by: https://github.com/soulitzer	2024-03-27 14:39:51 +00:00
Jiong Gong	105381ea11	[inductor][cpp] simplify CppVecKernelChecker (remove bool/int8 load as mask and load as float flags) (#119734 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119734 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel ghstack dependencies: #119654, #119655	2024-03-27 11:20:35 +00:00
Jiong Gong	49121603ab	[inductor][cpp] support vectorized indirect indexing (#119655 ) This PR adds the vectorized indirect indexing so that we can further simplify the `CppVecKernelChecker` (done in the later PR #119734) and remove the check that throws `CppVecUnsupportedError`. A boundary assertion check is added on vectorized indices and via the new `indirect_assert` method on `Kernel` - the base implementation is for scalar indices, overridden in `CppVecKernel` for vectorized indices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119655 Approved by: https://github.com/jansel ghstack dependencies: #119654	2024-03-27 10:25:45 +00:00
Oguz Ulgen	a697d972b1	Fix torchbench errors (#122735 ) Summary: It looks like this target has stopped working, let's fix it. Test Plan: ``` buck2 run mode/opt //caffe2/benchmarks/dynamo/:test ``` now works Differential Revision: D55389546 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122735 Approved by: https://github.com/xmfan	2024-03-27 06:59:16 +00:00
Jiong Gong	367ec62ae3	[inductor][cpp] generalize vector mask for dtypes (#119654 ) Vectorized boolean values in CPU Inductor were modeled with `Vectorized<float>` which cannot work for operations with other data types. This PR generalizes it with the new `VecMask` template class that can work for masks on any vectorized data types. The intrinsics implementation in `cpp_prefix.h` for mask conversion, cast and masked load are now implemented as the specialization for `VecMask` and moved to corresponding header files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119654 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2024-03-27 05:33:53 +00:00
kareem	f2c1060de3	[fx] Preserve Fx graph node order in partitioner across runs (#115621 ) The partitioner generates a different graph on recompilation in each run. Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621 Approved by: https://github.com/ezyang	2024-03-27 02:20:37 +00:00
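The commit message for #115621 does not say how node order is preserved. One common way to get run-to-run stable ordering in Python, shown here purely as an assumption, is to deduplicate with an insertion-ordered `dict` instead of a `set`, whose iteration order for strings can change between interpreter runs because of hash randomization. The node names are made up.

```python
# Hypothetical node names, in the order a graph traversal produced them.
visited = ["conv", "relu", "add", "mul", "conv", "add"]

# set() removes duplicates, but its iteration order for strings may differ
# from one interpreter run to the next (string hashing is randomized).
unordered_unique = set(visited)

# dict.fromkeys() also removes duplicates and keeps first-seen order,
# so the result is identical on every run.
ordered_unique = list(dict.fromkeys(visited))

print(sorted(unordered_unique))
print(ordered_unique)
```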
eellison	d1104d76aa	[Easy] Fix freezing bug with mismatched bias sizes (#122724 ) Fix for https://github.com/pytorch/pytorch/issues/121231 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122724 Approved by: https://github.com/davidberard98	2024-03-27 01:41:00 +00:00
Frank Lin	249e65b92d	Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 ) See #113541 The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality. cc @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068 Approved by: https://github.com/ezyang, https://github.com/eqy, https://github.com/xuzhao9	2024-03-27 01:14:38 +00:00
Yifu Wang	fe41ba4765	Optimize multi_tensor_apply (take 2) (#119764 ) ### Take 2 The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153: - Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication. - Ensure the optimization is compatible with cuda graph. ### Summary Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops. Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach: - When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments. - Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel. This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`. ### Benchmark (WIP) The only benchmark I've conducted so far is on `_foreach_copy_`, with a set of sizes that resembles an internal workload. I need to benchmark more problem sizes. The speedup should vary among problem sizes. However, I believe this PR should not be slower than the previous impl on any problem size. The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa). 
Baseline A single iteration in trace: <img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json device ms: 1.111, cpu ms: 7.151 memory bandwidth: 1169.825 GB/s ``` This PR A single iteration in trace: <img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json device ms: 0.892, cpu ms: 0.810 memory bandwidth: 1456.744 GB/s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764 Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar	2024-03-27 00:51:30 +00:00
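As a quick worked check of the benchmark figures quoted in #119764, the ratios between the baseline and the PR can be computed directly. The numbers are copied from the commit message; the variable names are illustrative only.

```python
# Figures quoted in the commit message for a single iteration.
baseline = {"device_ms": 1.111, "cpu_ms": 7.151, "bandwidth_gb_per_s": 1169.825}
this_pr = {"device_ms": 0.892, "cpu_ms": 0.810, "bandwidth_gb_per_s": 1456.744}

# Time ratios: baseline / new, so values above 1 mean the PR is faster.
device_speedup = baseline["device_ms"] / this_pr["device_ms"]
cpu_speedup = baseline["cpu_ms"] / this_pr["cpu_ms"]

# Bandwidth ratio: new / baseline, so values above 1 mean more throughput.
bandwidth_gain = this_pr["bandwidth_gb_per_s"] / baseline["bandwidth_gb_per_s"]

print(f"device time speedup: {device_speedup:.2f}x")
print(f"cpu time speedup:    {cpu_speedup:.2f}x")
print(f"bandwidth gain:      {bandwidth_gain:.2f}x")
```

The device-time speedup and the bandwidth gain agree to within rounding, which is expected for a memory-bound copy, while the CPU-side time drops by a much larger factor.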
chilli	67a4d6d6cb	Stopped TORCH_COMPILE_DEBUG from printing out a bunch of logs (#122688 ) @ezyang suggests using TORCH_TRACE for dumping out all intermediate logs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122688 Approved by: https://github.com/ezyang, https://github.com/mlazos ghstack dependencies: #122686	2024-03-27 00:24:40 +00:00
chilli	602c2af9e3	Cleaned up/fixed get_args after_aot repro (#122686 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122686 Approved by: https://github.com/ezyang	2024-03-27 00:24:40 +00:00
rzou	c81c9ba472	Disallow {FakeTensor,FunctionalTensor}.data_ptr (#122514 ) This PR: - disallows FakeTensor.data_ptr when it is called inside PT2 or fx tracing. - disallows FunctionalTensor.data_ptr (python FunctionalTensor is only used in PT2) The motivation behind this is that the leading cause of segfaults when using custom ops with PT2 is calling .data_ptr on FunctionalTensor or FakeTensor. This change is BC-breaking. If your code broke as a result of this, it's because there was a bug in it (these .data_ptr should never be accessed!). You can either fix the bug (recommended) or get the previous behavior back with: ``` from torch._subclasses.fake_tensor import FakeTensor from torch._subclasses.functional_tensor import FunctionalTensor data_ptr = 0 if isinstance(tensor, (FakeTensor, FunctionalTensor)) else tensor.data_ptr() ``` Test Plan: - existing tests Differential Revision: [D55366199](https://our.internmc.facebook.com/intern/diff/D55366199) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122514 Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/yifuwang, https://github.com/kurtamohler	2024-03-26 23:55:42 +00:00
Catherine Lee	04399a3091	[xla hash update] update the pinned xla hash (#122628 ) Originally made this PR since xla was failing, but the PR that changed the pin got reverted, so this is just a normal update now The old pin was ~2 weeks old? Currently XLA is broken https://github.com/pytorch/pytorch/actions/runs/8438508272/job/23115239444 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122628 Approved by: https://github.com/malfet, https://github.com/JackCaoG	2024-03-26 23:51:38 +00:00
Joel Schlosser	07b618e2d4	Graph break cleanly in Dynamo for module parametrization (#121041 ) Fixes #118795 This is a graph breaking partial fix for #120914. We still need -actual- module parametrization tracing support, but at least it doesn't blow up hard now. Background: Module parametrization injects a property as the module parameter attribute that calls a `nn.Module` whose forward takes in a module parameter and returns a reparametrized module parameter. Example: ``` class MyParametrization(nn.Module): def forward(self, X): # This reparametrization just negates the original parameter value return -X m = nn.Linear(...) p = MyParametrization() register_parametrization(m, "weight", p) # Accessing the "weight" attribute will invoke p's forward() on m's original weight and return the output as the new weight. # m.weight here is now an injected property that does the above instead of an actual Parameter. # This property is defined in torch/nn/utils/parametrize.py. m.weight # NB: Parametrization changes the module type (e.g. torch.nn.utils.parametrize.ParametrizedLinear) print(type(m)) ``` Problem 1: Dynamo has special tracing rules for things in `torch.nn`. Parametrizing a module changes the type of the module and the parametrized attribute, so now these rules wrongly affect tracing here. To fix this: * For parametrized modules, call `convert_to_unspecialized()` to restart analysis where Dynamo starts inlining the module. Problem 2: The issue seen in #118795 is that Dynamo will see a dynamically constructed tensor when `m.weight` is called and introduce that to its `tensor_weakref_to_sizes_strides` cache during fake-ification. This tensor is also made to be a graph input, since it's a module parameter. When guards are created for this module parameter input, the logic calls `m.weight` again and tries to look the result up in the cache, but this is a different tensor now, giving the `KeyError` symptom. 
To fix this: * Replace Dynamo's `tensor_weakref_to_sizes_strides` cache with an `input_source_to_sizes_strides` cache. * This cache was originally introduced in #100128. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121041 Approved by: https://github.com/anijain2305	2024-03-26 23:44:51 +00:00
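Problem 2 in #121041 can be reproduced in miniature without PyTorch. The sketch below uses hypothetical stand-in classes and keys the first cache on `id()` rather than on a weak reference, which is a simplification of what Dynamo does; the failure mode (a property that returns a fresh object on every access defeats an object-keyed cache) is the same.

```python
class FakeParam:
    """Hypothetical stand-in for a tensor."""

    def __init__(self, value):
        self.value = value


class ParametrizedModule:
    """Hypothetical module whose `weight` is a property recomputed on access."""

    def __init__(self, value):
        self._original = FakeParam(value)

    @property
    def weight(self):
        # A brand-new object is built every time the attribute is read.
        return FakeParam(-self._original.value)


m = ParametrizedModule(3.0)

# Cache keyed on the object itself: populated from the first access.
first = m.weight
by_object = {id(first): "sizes and strides"}

# The second access yields a different object, so the lookup misses.
second = m.weight
print(id(second) in by_object)

# Cache keyed on where the value came from: stable across accesses.
by_source = {"m.weight": "sizes and strides"}
print("m.weight" in by_source)
```

Keying on the input source sidesteps the identity problem because the source string does not change when the property rebuilds its result.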
Mu-Chu Lee	2367d0dacd	[AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562 ) (#122690 ) Summary: During tracing, some constants (tensor_constant{idx}) are generated internally. Those constants are neither parameters nor buffers, and users have zero control over them. To accommodate this, we should allow users to not pass in those internally generated constants but still be able to update the constants in the model. Test Plan: Included in commit. ``` build/bin/test_aot_inductor ``` Reviewed By: zoranzhao Differential Revision: D55354548 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122690 Approved by: https://github.com/khabinov	2024-03-26 23:25:15 +00:00
Yifu Wang	09cb42ce29	[dynamo] delete graph_out_{n} after restoring local vars (#122658 ) At graph breaks, we create a graph_out_{n} symbol to hold the graph output and use it to restore the local vars. In addition to their own symbols, the local vars are kept alive by the symbol we created. This means that if the graph break is the last usage of one of the symbols, the symbol would still be kept alive upon graph resumption. This PR: delete the graph_out_{n} symbol after restoring local vars so the lifetime of the local vars is governed by themselves. ## Example Problem Tensor `b`'s last usage is in the graph break. However, it won't be deallocated until `bar()` completes. In the orignal issue report by @Yuzhen11, `b` is a large tensor and `bar()` is an expensive computation. ```python import torch def foo(a): return torch.mm(a, a) @torch._dynamo.disable() def graph_break_fn(a): ret = a.bfloat16() return ret def bar(c): return torch.mm(c, c) def fn(a): b = foo(a) c = graph_break_fn(b) # del b return bar(c) fn_compiled = torch.compile(fn, backend="eager") a = torch.randn(10000, 10000, device="cuda", requires_grad=True) fn_compiled(a).sum().backward() ``` Bytecode before this PR: ``` ORIGINAL BYTECODE fn /home/yifu/microbench/del2.py line 18 19 0 LOAD_GLOBAL 0 (foo) 2 LOAD_FAST 0 (a) 4 CALL_FUNCTION 1 6 STORE_FAST 1 (b) 20 8 LOAD_GLOBAL 1 (graph_break_fn) 10 LOAD_FAST 1 (b) 12 CALL_FUNCTION 1 14 STORE_FAST 2 (c) 22 16 LOAD_GLOBAL 2 (bar) 18 LOAD_FAST 2 (c) 20 CALL_FUNCTION 1 22 RETURN_VALUE MODIFIED BYTECODE fn /home/yifu/microbench/del2.py line 18 18 0 LOAD_GLOBAL 3 (__compiled_fn_0) 2 LOAD_FAST 0 (a) 4 CALL_FUNCTION 1 6 STORE_FAST 3 (graph_out_0) 8 LOAD_GLOBAL 1 (graph_break_fn) 10 LOAD_FAST 3 (graph_out_0) 12 LOAD_CONST 1 (0) 14 BINARY_SUBSCR 20 16 CALL_FUNCTION 1 18 LOAD_GLOBAL 4 (__resume_at_14_1) 20 ROT_TWO 22 CALL_FUNCTION 1 24 RETURN_VALUE ORIGINAL BYTECODE torch_dynamo_resume_in_fn_at_20 /home/yifu/microbench/del2.py line 20 20 0 LOAD_FAST 0 (___stack0) 2 
JUMP_ABSOLUTE 9 (to 18) 4 LOAD_GLOBAL 0 (foo) 6 LOAD_FAST 1 (a) 8 CALL_FUNCTION 1 10 STORE_FAST 2 (b) 12 LOAD_GLOBAL 1 (graph_break_fn) 14 LOAD_FAST 2 (b) 16 CALL_FUNCTION 1 >> 18 STORE_FAST 3 (c) 22 20 LOAD_GLOBAL 2 (bar) 22 LOAD_FAST 3 (c) 24 CALL_FUNCTION 1 26 RETURN_VALUE MODIFIED BYTECODE torch_dynamo_resume_in_fn_at_20 /home/yifu/microbench/del2.py line 20 20 0 LOAD_GLOBAL 3 (__compiled_fn_2) 2 LOAD_FAST 0 (___stack0) 4 CALL_FUNCTION 1 6 UNPACK_SEQUENCE 1 8 RETURN_VALUE ``` Bytecode after this PR: ``` ORIGINAL BYTECODE fn /home/yifu/microbench/del2.py line 18 19 0 LOAD_GLOBAL 0 (foo) 2 LOAD_FAST 0 (a) 4 CALL_FUNCTION 1 6 STORE_FAST 1 (b) 20 8 LOAD_GLOBAL 1 (graph_break_fn) 10 LOAD_FAST 1 (b) 12 CALL_FUNCTION 1 14 STORE_FAST 2 (c) 22 16 LOAD_GLOBAL 2 (bar) 18 LOAD_FAST 2 (c) 20 CALL_FUNCTION 1 22 RETURN_VALUE MODIFIED BYTECODE fn /home/yifu/microbench/del2.py line 18 18 0 LOAD_GLOBAL 3 (__compiled_fn_0) 2 LOAD_FAST 0 (a) 4 CALL_FUNCTION 1 6 STORE_FAST 3 (graph_out_0) 8 LOAD_GLOBAL 1 (graph_break_fn) 10 LOAD_FAST 3 (graph_out_0) 12 LOAD_CONST 1 (0) 14 BINARY_SUBSCR 16 DELETE_FAST 3 (graph_out_0) 20 18 CALL_FUNCTION 1 20 LOAD_GLOBAL 4 (__resume_at_14_1) 22 ROT_TWO 24 CALL_FUNCTION 1 26 RETURN_VALUE ORIGINAL BYTECODE torch_dynamo_resume_in_fn_at_20 /home/yifu/microbench/del2.py line 20 20 0 LOAD_FAST 0 (___stack0) 2 JUMP_ABSOLUTE 9 (to 18) 4 LOAD_GLOBAL 0 (foo) 6 LOAD_FAST 1 (a) 8 CALL_FUNCTION 1 10 STORE_FAST 2 (b) 12 LOAD_GLOBAL 1 (graph_break_fn) 14 LOAD_FAST 2 (b) 16 CALL_FUNCTION 1 >> 18 STORE_FAST 3 (c) 22 20 LOAD_GLOBAL 2 (bar) 22 LOAD_FAST 3 (c) 24 CALL_FUNCTION 1 26 RETURN_VALUE MODIFIED BYTECODE torch_dynamo_resume_in_fn_at_20 /home/yifu/microbench/del2.py line 20 20 0 LOAD_GLOBAL 3 (__compiled_fn_2) 2 LOAD_FAST 0 (___stack0) 4 CALL_FUNCTION 1 6 UNPACK_SEQUENCE 1 8 RETURN_VALUE ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122658 Approved by: https://github.com/jansel, https://github.com/anijain2305	2024-03-26 22:49:05 +00:00
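The lifetime issue fixed by #122658 can be illustrated with plain CPython reference counting, no Dynamo required. In this sketch `Big` is a hypothetical stand-in for the large tensor `b`, and `graph_out_0` plays the role of the extra symbol that holds the graph output; the result relies on CPython freeing objects as soon as their last reference goes away.

```python
import weakref


class Big:
    """Hypothetical stand-in for a large tensor."""


def still_alive_after_break(delete_graph_out):
    b = Big()
    probe = weakref.ref(b)      # lets us observe whether b has been freed
    graph_out_0 = (b,)          # extra symbol that also references b
    del b                       # last use of the local variable itself
    if delete_graph_out:
        del graph_out_0         # what the added DELETE_FAST achieves
    # At this point the expensive continuation (bar) would run.
    return probe() is not None


print(still_alive_after_break(delete_graph_out=False))
print(still_alive_after_break(delete_graph_out=True))
```

Without the delete, the container keeps the object alive for the rest of the frame; with it, the object's lifetime is governed by the local variable alone.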
eellison	df724153c1	Add option to skip cudagraphing on dynamic shape graphs (#122520 ) This was requested internally. Differential Revision: [D55264528](https://our.internmc.facebook.com/intern/diff/D55264528) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122520 Approved by: https://github.com/mlazos, https://github.com/shunting314	2024-03-26 21:49:21 +00:00
Nikita Shulga	e229ec6886	[NEON] Speedup float16 convert (#122702 ) By using `vcvt_f16_f32` and back According to [benchmark_convert.py](`d3279637ca`) this makes float32 to float16 tensor conversion roughly 3 times faster: time to convert 4096x4096 float32 tensor drops from 5.23 msec to 1.66 msec on M2 Pro Test plan: run `vector_test_all_types` + CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/122702 Approved by: https://github.com/kimishpatel	2024-03-26 21:48:12 +00:00
Joel Schlosser	6767c04fde	Forward fix for broken internal tests related to NJT view dummy (#122704 ) (internal link) [example test breakage](https://www.internalfb.com/intern/test/562950061753019?ref_report_id=0) Symptom: `type stub not overridden` for SymInt. The global NJT dummy relies on `SymInt.__mul__()` in its constructor. Lazily constructing the dummy avoids the race. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122704 Approved by: https://github.com/soulitzer	2024-03-26 21:22:12 +00:00
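The lazy-construction idea in #122704 follows a general pattern: defer building a global until first use, so that nothing its constructor depends on has to be ready at import time. The sketch below is generic Python with hypothetical names (`Dummy`, `get_dummy`); it is not the NJT code.

```python
import functools

construction_log = []


class Dummy:
    """Hypothetical object whose constructor depends on late-registered behavior."""

    def __init__(self):
        construction_log.append("constructed")


@functools.lru_cache(maxsize=None)
def get_dummy():
    # Built on the first call only; later calls return the cached instance.
    return Dummy()


# Nothing has been constructed just by defining the accessor.
print(construction_log)

a = get_dummy()
b = get_dummy()
print(a is b, construction_log)
```

An eagerly created module-level `Dummy()` would run its constructor during import, before any later setup had a chance to happen; the accessor moves that work to the first real use.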
Nikita Shulga	291848bf30	[Build] Fix AVX detection logic (#122708 ) `CXX_AVX[2\|512]_FOUND` flags should indicate whether compiler supports generating code for given instruction set, rather than whether host machine can run the generated code. This fixes a weird problem that surfaced after https://github.com/pytorch/pytorch/pull/122503 when builder can sometimes be dispatched to an old CPU architecture, that can not run AVX512 instructions, but can compile for those just fine Pull Request resolved: https://github.com/pytorch/pytorch/pull/122708 Approved by: https://github.com/jeanschmidt	2024-03-26 20:37:35 +00:00
Yifu Wang	3bede14fa7	Don't create world pg variable out of thin air when rewriting c10d collectives (#122561 ) Fixes https://github.com/pytorch/pytorch/issues/122404 Previously, when rewriting c10d collectives, if the group argument is unspecified or None, we create a world pg variable out of thin air and pass it to the rewrite target. The approach was problematic, as it assumes the symbol `torch` is available in the scope (see #122404). After #120560, dynamo can now trace dist.group.WORLD. If the group argument is unspecified, we can just set it with dist.group.WORLD in the rewrite target. Testing pytest test/distributed/test_inductor_collectives.py -k test_dynamo_rewrite_dist_allreduce Also verified with the repro provided in #122404 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122561 Approved by: https://github.com/wconstab ghstack dependencies: #120560	2024-03-26 20:12:08 +00:00
Edward Z. Yang	852111e1c2	[TORCH_TRACE] Record stack when no compile context is available (#122644 ) This will help me track down those annoying unknown compile products. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122644 Approved by: https://github.com/jamesjwu	2024-03-26 19:30:52 +00:00
PyTorch MergeBot	f631586084	Revert "[dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098 )" This reverts commit b6982bf2b25d2d3ba5d82488a39721d6013a838f. Reverted https://github.com/pytorch/pytorch/pull/122098 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/122098#issuecomment-2021233604))	2024-03-26 18:54:17 +00:00
Bin Bao	537cd66e73	[Inductor] Support custom op in JIT with cpp wrapper (#122554 ) Summary: To call custom ops in an ABI-compatible way requires doing boxed call with varargs across C shim. In the JIT mode, we can get around it by calling into Python. https://gist.github.com/desertfire/be2a65b0a9b47780bb716b53ac2cd2b3 is an example of generated code. Differential Revision: [D55326556](https://our.internmc.facebook.com/intern/diff/D55326556) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122554 Approved by: https://github.com/jansel, https://github.com/chenyang78	2024-03-26 18:48:45 +00:00
Oguz Ulgen	e61aaab725	Log autotune time in scuba (#122637 ) Summary: This diff * Refactors triton and autotune caches to be child classes of the original memcache based cache infra * Swaps scuba table for autotune * Adds autotune time spent/saved to scuba table Test Plan: Local testing using: ``` buck run mode/opt fbcode//caffe2/test/inductor/:max_autotune -- -r test_max_autotune_remote_caching_dynamic_False ``` and ``` TORCH_INDUCTOR_AUTOTUNE_REMOTE_CACHE=1 buck2 run mode/opt //scripts/oulgen:runner ``` Differential Revision: D55332620 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122637 Approved by: https://github.com/jamesjwu	2024-03-26 17:51:33 +00:00
Oguz Ulgen	1f5fcb4e20	[Inductor] Run pattern matcher over the original graph (#122519 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122519 Approved by: https://github.com/jansel	2024-03-26 17:30:32 +00:00
wz337	8cfbdc0451	[Easy][DCP] Fix small typo in assert (#122633 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122633 Approved by: https://github.com/awgu, https://github.com/wconstab	2024-03-26 16:46:12 +00:00
Wang, Eikan	30a579dba3	Add XPU ATen merge rule (#122484 ) Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122484 Approved by: https://github.com/huydhn, https://github.com/malfet	2024-03-26 16:20:48 +00:00
FEI	e08cbc0d41	update comment of test_invalid_last_dim_stride in test_transformers.py (#122679 ) Fixes #122594 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122679 Approved by: https://github.com/mikaylagawarecki	2024-03-26 15:40:24 +00:00
Catherine Lee	8bad7b63c8	[ez] Add more files to trigger inductor (#122669 ) To catch https://github.com/pytorch/pytorch/pull/122562/files Pull Request resolved: https://github.com/pytorch/pytorch/pull/122669 Approved by: https://github.com/desertfire	2024-03-26 15:19:30 +00:00
Jean Schmidt	9b90c5e2a1	[CI] Switch pull job linux-jammy-py3_8-gcc11-build to use ARC with runner groups (#122503 ) title says it all... Pull Request resolved: https://github.com/pytorch/pytorch/pull/122503 Approved by: https://github.com/atalman	2024-03-26 14:38:12 +00:00
Edward Z. Yang	85845a29db	Refactor ShapeEnvSettings so it's directly on ShapeEnv (#122310 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122310 Approved by: https://github.com/masnesral, https://github.com/lezcano	2024-03-26 14:16:33 +00:00
Edward Z. Yang	7e176ebb47	Log compilation_metrics to TORCH_TRACE (#122638 ) It's not technically needed as you can get it from Scuba too, but it's more convenient for tlparse to get at it this way. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122638 Approved by: https://github.com/albanD	2024-03-26 14:10:55 +00:00
Guilherme Leobas	99c822c0ba	Let dynamo inline through jacfwd (#121254 ) Similar to #121146, changes are simple and don't require any fancy modification to the codebase. Moved a few entries on trace_rules.py and added tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121254 Approved by: https://github.com/zou3519 ghstack dependencies: #120338	2024-03-26 12:43:30 +00:00
haozhe.zhu	2b4173e0de	[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardTanh with int8-mix-bf16 (#122374 ) Summary Enable the fusion pattern of `QConv2d -> hardtanh` lowering for int8-mixed-bf16 case. Test Plan ``` python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardtanh_int8_mixed_bf16_cpu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122374 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5 ghstack dependencies: #122266, #122267, #122268, #122373	2024-03-26 08:12:41 +00:00
haozhe.zhu	293579363c	[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardSwish with int8-mix-bf16 (#122373 ) Summary Enable the fusion pattern of `QConv2d -> hardswish` lowering for int8-mixed-bf16 case. Test Plan ``` python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardswish_int8_mixed_bf16_cpu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122373 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5 ghstack dependencies: #122266, #122267, #122268	2024-03-26 08:09:35 +00:00
haozhe.zhu	caf9c23310	[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op SiLU (#122268 ) Summary Enable the fusion pattern of `QConv2d -> silu` lowering to `swish` as `QConv2d` post operator. Test Plan ``` python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_cpu python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_int8_mixed_bf16_cpu python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_silu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122268 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5 ghstack dependencies: #122266, #122267	2024-03-26 08:07:06 +00:00
Jacob Szwejbka	41d24df08f	[export] hack skip index_put_ in dce (#122683 ) Summary: Ideally we should do what's in the TODO. This is a stopgap to unblock llama capture. Test Plan: capturing llama and using pt2e to quantize it Differential Revision: D55354487 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122683 Approved by: https://github.com/kimishpatel	2024-03-26 08:05:06 +00:00
haozhe.zhu	e0329cba8a	[Quant] [PT2] Add SiLU into X86InductorQuantizer Conv2d Unary Annotation (#122267 ) Summary Add `SiLU` into X86InductorQuantizer Conv2d Unary Annotation TestPlan ``` python -m pytest test_x86inductor_quantizer.py -k test_conv2d_unary python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_unary ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122267 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5 ghstack dependencies: #122266	2024-03-26 08:03:42 +00:00
Mu-Chu Lee	b7089937dc	Disable test (test_mm_plus_mm2_cuda_cuda_wrapper) (#122682 ) Summary: The test is unstable at the moment. We need to make sure both the ATen and Triton kernels work before reactivating the test. Test Plan: Disabling test Pull Request resolved: https://github.com/pytorch/pytorch/pull/122682 Approved by: https://github.com/clee2000	2024-03-26 07:14:35 +00:00
Wang, Eikan	f8eeae7aaa	Enable CPP wrapper codegen registration (#121296 ) Extend code gen registration for `CppWrapper`. With this PR, a new backend can register its specific `CppWrapper` at runtime. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121296 Approved by: https://github.com/jansel, https://github.com/desertfire	2024-03-26 06:51:03 +00:00
Jason Ansel	d1f58eaaf5	[inductor] Fix bug with freezing + split_cat passes (#122544 ) Fixes #122380 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122544 Approved by: https://github.com/eellison	2024-03-26 06:12:57 +00:00
Edward Z. Yang	268b0cc714	Do not run CUDA lazy init if it is triggered with fake mode on. (#122636 ) Partially fixes https://github.com/pytorch/pytorch/issues/122109 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122636 Approved by: https://github.com/zou3519	2024-03-26 05:43:59 +00:00
Nikita Shulga	dd3f2cb53a	[Inductor] Add NEON ISA support on arm64 Macs (#122217 ) This started as a re-land of https://github.com/pytorch/pytorch/pull/105590 but focusing on enabling it on MacOS, but quickly turned into landing very limited platform-specific acceleration at this time (I.e. this PR does not add any NEON accelerated code at all, just enables vectorized compilation for the existing abstractions) Enabling the test harness, uncovered number of latent issues in CPU inductor that were fixed in the following PRS: - https://github.com/pytorch/pytorch/pull/122511 - https://github.com/pytorch/pytorch/pull/122513 - https://github.com/pytorch/pytorch/pull/122580 - https://github.com/pytorch/pytorch/pull/122608 Following was added/changed to enable vectorization code to work on MacOS - Added VecNEON class to `_inductor/codecache.py` that is supported on all AppleSilicon Macs - Added `Vectorized::loadu_one_fourth` to `vec_base.h`, and limit it to 8-bit types - Change 64-bit integral types mapping to `int64_t`/`uint64_t` to align with the rest of the code, as on MacOS, `int64_t` is a `long long` rather than `long` (see https://github.com/pytorch/pytorch/pull/118149 for more details) See table below for perf changes with and without torch.compile using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on M2 Pro: \| dtype \| Eager \| Compile (before) \| Compile (after) \| \| ------ \| ------ \| --------- \| --------- \| \| bfloat16 \| 120 tokens/sec \| 130 tokens/sec \| 156 tokens/sec \| \| float32 \| 158 tokens/sec \| 140 tokens/sec \| 236 tokens/sec \| \| float16 \| 235 tokens/sec \| 81 tokens/sec \| 58 tokens/sec \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/122217 Approved by: https://github.com/jansel	2024-03-26 05:07:30 +00:00
Michael Lazos	a333b080c1	Only update momentum buffers for SGD if momentum is enabled (#122349 ) As title [benchmark](https://gist.github.com/mlazos/1171f035a2392c33778aaa3d7bf24370) Helps compiled vanilla SGD execution time by 2x on certain models with large number of small params (ex. ElectraForQuestionAnswering goes from 1090us -> 554us) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122349 Approved by: https://github.com/janeyx99	2024-03-26 04:19:39 +00:00
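The idea behind #122349 can be illustrated with a minimal pure-Python sketch that does not need PyTorch. The function name `sgd_step` and the list-based state are hypothetical and chosen only for illustration; this is not the optimizer's actual code. The point it shows is that when `momentum` is zero, the momentum-buffer bookkeeping is skipped entirely.

```python
from typing import List, Optional


def sgd_step(
    params: List[float],
    grads: List[float],
    bufs: List[Optional[float]],
    lr: float,
    momentum: float = 0.0,
) -> None:
    """Apply one in-place SGD update to plain Python floats (illustrative only)."""
    for i, grad in enumerate(grads):
        if momentum != 0.0:
            # Momentum enabled: create the buffer on first use, else update it.
            buf = bufs[i]
            buf = grad if buf is None else momentum * buf + grad
            bufs[i] = buf
            grad = buf
        # When momentum == 0.0 the branch above is skipped and bufs is untouched.
        params[i] -= lr * grad


params = [1.0, 2.0]
bufs: List[Optional[float]] = [None, None]
sgd_step(params, [0.5, 0.5], bufs, lr=0.1)  # vanilla SGD: bufs stays [None, None]
```

With many small parameters, skipping the per-parameter buffer work is where the saving reported in the commit would plausibly come from.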
Josh Fromm	0c47f8028e	Keep example_inputs when saving and loading ExportedProgram (#122618 ) Summary: `torch.export` is a powerful tool for creating a structured and shareable package from arbitrary pytorch code. One great use case of `torch.export` is sharing models or subgraphs in a way that allows results to be easily replicated. However, in the current implementation of `export`, the `example_inputs` field is thrown out. When trying to replicate bugs, benchmarks, or behaviors, losing the original input shapes and values makes the process much messier. This change adds saving and loading for the `example_inputs` attribute of an `ExportedProgram` when using `torch.export.save` and `torch.export.load`. This simple addition makes `ExportedProgram`s a fantastic tool for performance and accuracy replication. For example, with this change we enable the following workflow: ``` # Script to create a reproducible accuracy issue with my model. kwargs = {"fastmath_mode": True} exp_program = export(my_model, sample_inputs, kwargs) result = exp_program.module()(sample_inputs, kwargs) # Uh-oh, I don't like that result, let's send the module to a colleague to take a look. torch.export.save(exp_program, "my_model.pt2") ``` My colleague can then easily reproduce my results like so: ``` # Script to load and reproduce results from a saved ExportedProgram. loaded_program = torch.export.load("my_model.pt2") # The following line is enabled by this Diff, we pull out the arguments # and options that caused the issue. args, kwargs = loaded_program.example_inputs reproduced_result = loaded_program.module()(args, **kwargs) # Oh I see what happened here, let's fix it. ``` Being able to share exact inputs and arguments makes `ExportedProgram`s much more clean and powerful with little downside. The main potential issue with this change is that it does slightly increase the size of saved programs. However, the size of inputs will be much smaller than parameters in most cases. 
I am curious to hear discussion on saved file size though. The deserialization of `example_inputs` is currently implemented as `Optional`. Although this won't affect users of `export.save` and `export.load`, it does give backwards compatibility to any direct users of `serialize` and `deserialize`. Test Plan: This diff includes a new test which exercises the save / load flow with multiple args and kwargs. ``` buck test //caffe2/test:test_export -- TestSerialize ``` Differential Revision: D55294614 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122618 Approved by: https://github.com/zhxchen17	2024-03-26 03:32:44 +00:00
Tianyu Liu	47e8d60627	[dtensor] add op support for view_as_complex and view_as_real (#122569 ) This PR will unblock DTensor computations for [rotary embeddings](https://github.com/meta-llama/llama/blob/main/llama/model.py#L132) used in LLaMa training. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122569 Approved by: https://github.com/wanchaol ghstack dependencies: #122541	2024-03-26 03:32:04 +00:00
Edward Z. Yang	1af6fc5e03	Remove top-level DisableFuncTorch; clearing interpreter stack should work. (#122610 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122610 Approved by: https://github.com/zou3519 ghstack dependencies: #122202	2024-03-26 03:08:22 +00:00
Edward Z. Yang	f42818321b	Restore DILL_AVAILABLE for backwards compat with torchdata (#122616 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122616 Approved by: https://github.com/peterbell10	2024-03-26 02:18:51 +00:00
PyTorch MergeBot	55f36d1ada	Revert "[AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562 )" This reverts commit 57a3d00b0659e4ac37c4a35a36c71f710e89197a. Reverted https://github.com/pytorch/pytorch/pull/122562 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/122562#issuecomment-2019262415))	2024-03-26 02:18:19 +00:00
Tianyu Liu	4e0b5d59fa	[dtensor] add backward support for scaled dot product attention (flash-attention) (#122541 ) As titled, as a followup to the forward part #120298. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122541 Approved by: https://github.com/wanchaol	2024-03-26 01:50:24 +00:00
chentianyi16	83ad8e01b1	fix the problem that cpu_fallback for aten::triu_indices on custom device crashed (#121306 ) Fixes #121289 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121306 Approved by: https://github.com/ezyang	2024-03-26 01:29:45 +00:00
Kurt Mohler	5e66bf5f42	Avoid COW materialize in nn.functional forward ops (3) (#122443 ) Affected ops: * repeat * unfold * logsigmoid * pixel_shuffle/unshuffle * remaining norm ops Pull Request resolved: https://github.com/pytorch/pytorch/pull/122443 Approved by: https://github.com/ezyang	2024-03-26 00:56:57 +00:00
Peter Bell	b6982bf2b2	[dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098 ) Fixes #114844 In the linked issue we have ``` compiled_module = torch.compile(module) compiled_module.x = ... compiled_module(...) # Mutates self.x ``` Where since the module mutates `self.x` you would expect `compiled_module.x` to be updated but actually `compiled_module.x = ...` sets an attribute "x" on the `OptimizedModule` object while the forward method of the module mutates `module.x`. This gives the expected behavior by forwarding `compiled_module.__setattr__` down to `module.__setattr__`. There is already a corresponding `__getattr__` so now `compiled_module.x` becomes an alias for `module.x`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098 Approved by: https://github.com/ezyang, https://github.com/lezcano	2024-03-26 00:52:12 +00:00
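The aliasing behavior described in #122098 can be sketched in plain Python, without PyTorch. `Inner` and `Wrapper` are hypothetical stand-ins for the wrapped module and for `OptimizedModule`; the real classes are more involved, so treat this only as an illustration of forwarding `__setattr__` alongside an existing `__getattr__`.

```python
class Inner:
    """Hypothetical stand-in for the wrapped module."""

    def __init__(self) -> None:
        self.x = 0

    def forward(self) -> int:
        self.x += 1  # mutates the inner object's attribute
        return self.x


class Wrapper:
    """Hypothetical stand-in for a wrapper such as OptimizedModule."""

    def __init__(self, inner: Inner) -> None:
        # Bypass our own __setattr__ so the wrapper keeps its own reference.
        object.__setattr__(self, "_inner", inner)

    def __getattr__(self, name: str):
        # Called only when normal lookup fails; delegate reads to the inner object.
        return getattr(self._inner, name)

    def __setattr__(self, name: str, value) -> None:
        # Delegate writes too, so wrapper.x and inner.x are the same attribute.
        setattr(self._inner, name, value)

    def __call__(self) -> int:
        return self._inner.forward()


inner = Inner()
wrapped = Wrapper(inner)
wrapped.x = 10  # lands on inner.x, not on the wrapper itself
wrapped()       # forward() increments inner.x
```

Without the `__setattr__` override, `wrapped.x = 10` would create a separate attribute on the wrapper that shadows `inner.x` on reads, which is the mismatch the commit message describes.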
Nikita Shulga	eda279c997	[CpuInductor] Implement masked_load for integral types (#122608 ) Use `if constexpr` to separate float vs integral masked load for avx512 Discovered while looking at `test_comprehensive_fft_ihfft2_cpu_int64` on non-AVX512 capable CPUs where (5, 6, 7) shape were big enough to start a vectorized loop Added `test_pad_cast` regression test Fixes https://github.com/pytorch/pytorch/issues/122606 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122608 Approved by: https://github.com/jansel ghstack dependencies: #122607	2024-03-25 22:44:54 +00:00
Mu-Chu Lee	57a3d00b06	[AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562 ) Summary: During tracing, some constants (tensor_constant{idx}) are being generated internally. Those constants are neither parameters nor buffers, and users have zero control over them. To accommodate this, we should allow users not passing in those constants generated internally but still be able the constants in the model. Test Plan: Included in commit. ``` build/bin/test_aot_inductor ``` Differential Revision: D55286634 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122562 Approved by: https://github.com/chenyang78, https://github.com/khabinov	2024-03-25 22:05:20 +00:00
eellison	ebde6c72cb	Precompile triton templates (#121998 ) Before this PR we were not precompiling triton templates in parallel. Compilation would occur during benchmarking. Triton benchmarking templates were emitted as : ``` @triton.jit def triton_mm(arg_A, arg_B, out_ptr0): ``` In order to precompile we need to give the full kernel specification, as we do when we emit the template in the final output code generation. ``` @triton_heuristics.template( num_stages=3, num_warps=8, triton_meta={'signature': {0: 'fp32', 1: 'fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]}, inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': 'cdeecfeccd31ad7810f96b5752194b1c2406d0a81e39a6ca09c8ee150baae183'}, ) @triton.jit def triton_mm(arg_A, arg_B, out_ptr0): ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121998 Approved by: https://github.com/jansel	2024-03-25 21:33:36 +00:00
IvanKobzarev	9b095c3fe6	[dynamo] Config to not emit runtime asserts (#122603 ) Repeat of https://github.com/pytorch/pytorch/pull/122406, which was squashed and merged by mistake. Differential Revision: [D55312394](https://our.internmc.facebook.com/intern/diff/D55312394) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122603 Approved by: https://github.com/ezyang	2024-03-25 21:17:44 +00:00
PyTorch UpdateBot	1f67da5105	[executorch hash update] update the pinned executorch hash (#122152 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122152 Approved by: https://github.com/pytorchbot	2024-03-25 20:56:34 +00:00
Peter Y Yeh	46a76cfef5	[ROCm] Fix test_trace_rule_update.py (#121524 ) Add the missing torch APIs to the trace rules and ignore APIs that have a manual trace rule. This PR fixes test/dynamo/test_trace_rule_update; the failure may be related to https://github.com/pytorch/pytorch/pull/121142 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121524 Approved by: https://github.com/jansel, https://github.com/pruthvistony	2024-03-25 20:53:24 +00:00
Guilherme Leobas	bc7f3859b3	Update jvp to support symbolic execution. (#120338 ) Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions. List of changes: - Update `_has_same_storage_numel` to use `sym_nbytes` - Symintify `_efficientzerotensor_meta` - Introduce `empty_generic_symint` with the first argument `size` as symbolic integer - Update gen_variable_type.py script to call the symint version of zeros_fn function (zeros_symint / _efficientzerotensor_symint) - Update `has_same_meta` to call `sym_*` functions Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338 Approved by: https://github.com/soulitzer	2024-03-25 20:50:12 +00:00
hongxyan	1c1268b6e9	seg-fault of "basic_string::_M_construct null not valid" fix for getNcclErrorDetailStr (#121905 ) When testing all-reduce with an alternative rccl replacement backend, my test script crashed. After debugging, I found that `ncclGetLastError(NULL)` returns null, and that constructing a `std::string` from that return value fails with the exception `basic_string::_M_construct null not valid`. This pull request fixes this edge condition so that the program exits gracefully with useful information. Test: Before the fix, my test script exited like below: ``` File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2051, in all_reduce work = group.allreduce([tensor], opts) RuntimeError: basic_string::_M_construct null not valid ``` After this fix, my test script exited with a useful message like: ``` [rank0]: File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce [rank0]: work = group.allreduce([tensor], opts) [rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:272, internal error - please report this issue to the NCCL developers, NCCL version 0.4.2 [rank0]: ncclInternalError: Internal check failed. [rank0]: Last error: Unknown NCCL Error ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121905 Approved by: https://github.com/wconstab	2024-03-25 20:49:34 +00:00
Edward Z. Yang	05bbcae5bb	Refactor functorch meta conversion (#122202 ) At a high level, the goal of this refactor was to make it so that `MetaConverter.__call__` has a straightforward code structure in three steps: (1) check if we support doing meta conversion, (2) describe the tensor into MetaTensorDesc, (3) call `meta_tensor` on MetaTensorDesc. However, this is not so easy to do, because there is a big pile of special cases for functional tensor inside `__call__`. The primary complication is handling the ambient functionalization state: specifically, the functorch dynamic layer stack and the Python functionalization dispatch. The old code demands that meta tensor conversion happen with this state disabled. But I discovered that when I reconstruct functorch tensors it demands that the functorch layers be active; in fact a batch tensor will have a pointer to the internal functorch layer. I had some discussion with Richard Zou about what code structure here makes sense. In particular, one of the goals of the refactor here is that I can inflate MetaTensorDesc from an entirely different process, which may not have all of the functorch layers activated at the time we do reconstruction. So it seems to me that we should make it explicit in MetaTensorDesc that there was some functorch layer active at the time the functorch tensor was serialized, so that we could potentially know we need to reconstruct these layers on the other side. This is NOT implemented yet, but there are some notes about how it could potentially proceed. But the important thing here is we SHOULD disable everything when we run `meta_tensor`, and internally be responsible for restoring the stack. Actually, the necessary infra bits in functorch don't exist to do this, so I added some simple implementations in pyfunctorch.py. 
The rest is splitting up the manipulations on tensor (we do things like sync the real tensor before describing it; Describer is responsible for this now) and I also tried to simplify the not supported condition, based on my best understanding of what the old thicket of conditions was doing. You may notice that the internal meta_tensor handling of functional tensor is inconsistent with surrounding code: this is because I exactly replicated the old reconstruction behavior; a further refactor would be to rationalize this. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122202 Approved by: https://github.com/zou3519	2024-03-25 20:47:21 +00:00
Adnan Akhundov	9223b2cb31	Pop codegened parent graph from wrapper in GraphLowering (#122469 ) Summary: Previously, we kept a reference to `V.graph` in the `codegened_graph_stack` of the wrapper. Memory regression analysis of https://github.com/pytorch/pytorch/issues/121887 shows that this has led to a slightly higher memory utilization during lowering of the `llama_v2_7b_16h` model. Here we refactor the code to pop the parent subgraph from the `codegened_graph_stack` when codegen-ing is done. Fixes https://github.com/pytorch/pytorch/issues/121887. Test Plan: CI, also see https://github.com/pytorch/pytorch/issues/121887#issuecomment-2014209104. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122469 Approved by: https://github.com/eellison	2024-03-25 20:27:59 +00:00
PyTorch MergeBot	b2c496ba24	Revert "[TorchGen] Add mutable parameter to valuetype_type function in api/cpp.py (#121415 )" This reverts commit c1fe09dc37358d8121f119d66e9e8c8d57035158. Reverted https://github.com/pytorch/pytorch/pull/121415 on behalf of https://github.com/ezyang due to I think this needs to be reverted to after https://github.com/pytorch/pytorch/pull/120076 revert ([comment](https://github.com/pytorch/pytorch/pull/121415#issuecomment-2018828813))	2024-03-25 20:14:40 +00:00
Catherine Lee	f84e3bf36d	[ez] Fix XLA auto hash updates (#122630 ) The xla pin is located in .github/ci_commit_pins not .ci/docker/ci_commit_pins Pull Request resolved: https://github.com/pytorch/pytorch/pull/122630 Approved by: https://github.com/huydhn	2024-03-25 19:45:56 +00:00
Nikita Shulga	9d1de31634	[BE][CPUInductor] Use C++17 helper templates (#122607 ) Such as `std::is_same_v`, `std::is_integral_v` and the C++14 one, `std::enable_if_t` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122607 Approved by: https://github.com/jansel, https://github.com/Skylion007	2024-03-25 19:01:44 +00:00
Ashwin Hari	2d4197c9b7	add case for creating storage on ort (#122446 ) Fixes #122445 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122446 Approved by: https://github.com/mikaylagawarecki	2024-03-25 18:59:20 +00:00
Jason Ansel	2db7d874a9	[inductor] Improve error message for shape errors in slice_scatter (#122543 ) Fixes #122291 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122543 Approved by: https://github.com/shunting314	2024-03-25 18:57:16 +00:00
PyTorch MergeBot	db506762d1	Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076 )" This reverts commit a52b4e22571507abc35c2d47de138497190d2e0a. Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/atalman due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-2018680656))	2024-03-25 18:52:05 +00:00
zdevito	c7bf5871ce	CUDAEvent::elapsed_time could accidentally initialize a non-used GPU (#122538 ) This sets the device before calling cudaEventElapsedTime to avoid the case where the "cudaGetCurrentDevice" device would be initialized even though neither event is on that device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122538 Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab	2024-03-25 17:49:50 +00:00
Kurt Mohler	198927170d	Avoid COW materialize in nn.functional forward ops (2) (#121992 ) Affected ops: * dropout * embedding * embedding_bag * multi_head_attention_forward * grid_sample * ctc_loss * nll_loss * pdist Pull Request resolved: https://github.com/pytorch/pytorch/pull/121992 Approved by: https://github.com/ezyang ghstack dependencies: #122437, #121991	2024-03-25 17:31:19 +00:00
Kurt Mohler	55becf02bc	Avoid COW materialize in nn.functional forward ops (1) (#121991 ) Affected ops: * Remaining norm ops * pad * margin_loss ops * fractional_max_pool * linear * prelu * rrelu * scaled_dot_product_attention * logsigmoid * threshold * binary_cross_entropy * gelu Pull Request resolved: https://github.com/pytorch/pytorch/pull/121991 Approved by: https://github.com/ezyang ghstack dependencies: #122437	2024-03-25 17:31:19 +00:00
Nikita Shulga	4c70ab26ef	[MPS] Enable `index_select` for complex types (#122590 ) Surprisingly, as of MacOS-14.14 MPS `gatherWithUpdatesTensor:indicesTensor:axis:batchDimensions:name:` still does not support complex types, so emulate them by using `at::view_as_real` trick Fixes https://github.com/pytorch/pytorch/issues/122427 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122590 Approved by: https://github.com/Skylion007	2024-03-25 16:57:35 +00:00
yan-yhy	e6a37eeb06	run some cuda testcases on other devices if available. (#122182 ) Lets users run some CUDA test cases on other devices by setting an environment variable, for example to test performance on custom devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122182 Approved by: https://github.com/ezyang	2024-03-25 16:40:03 +00:00
Catherine Lee	70ac13b876	[ez][TD] Hide errors in llm retrieval job (#122615 ) The new ghstack does not have a base on main anymore, so finding the base for ghstacked PRs is harder. Something similar to https://github.com/pytorch/pytorch/pull/122214 might be needed, but then I'm worried about tokens. Either way, this is a quick workaround to hide these errors for ghstack users. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122615 Approved by: https://github.com/huydhn	2024-03-25 16:35:00 +00:00
Edward Z. Yang	47a9725de9	Implement prefer_deferred_runtime_asserts_over_guards (#122090 ) Fixes https://github.com/pytorch/pytorch/issues/121749 As promised, it is pretty easy. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122090 Approved by: https://github.com/lezcano	2024-03-25 16:31:16 +00:00
Zola	e49a38973f	Update DimOrDims typing in torch.sparse (#122471 ) I noticed the typing of the `torch.sparse.sum`'s `dim` parameter wasn't allowing an int tuple as input and tracked the issue to this type. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122471 Approved by: https://github.com/soulitzer	2024-03-25 16:25:56 +00:00
Jason Ansel	06f22537ca	[dynamo] Suppress warning about torch.autograd.Function() (#122566 ) PR #120577 got reverted due to issues in fbcode. This hides the warning that PR was trying to fix until we can debug the fbcode issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122566 Approved by: https://github.com/yanboliang	2024-03-25 16:18:43 +00:00
Zhengxu Chen	0465a90b00	[export][reland] Fix unflattened submodule ordering. (#122341 ) (#122507 ) Summary: Make sure the order of submodules is the same as the original eager module. bypass-github-export-checks Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_unflatten_submodule_ordering Differential Revision: D55251277 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122507 Approved by: https://github.com/tugsbayasgalan	2024-03-25 15:22:01 +00:00
Lucas Pasqualin	11dfa72153	[BE] Remove unnecessary state dict update. (#122528 ) From what I can see, the following is a redundant/unnecessary setting of a dict element. Differential Revision: [D55191396](https://our.internmc.facebook.com/intern/diff/D55191396/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122528 Approved by: https://github.com/Skylion007	2024-03-25 15:21:44 +00:00
sanchitintel	5152945441	GPT2 SDPA inference pattern-matching for Inductor-CPU (#121866 ) ### Summary With this PR, SDPA pattern of GPT2 is being mapped to `torch.nn.functional.scaled_dot_product_attention`. While GPT2 supports both a causal mask & an attention mask, this PR considers the case of attention mask being absent. TorchBench inference workload for GPT2 also doesn't use an attention-mask. This pattern's replacement is being disabled for CUDA because [CUDA AOT Inductor](https://github.com/pytorch/pytorch/actions/runs/8319111885/job/22762567770) CI job's `GPT2ForSequenceClassification` accuracy test failed, although all other trunk CUDA Inductor CI checks had passed. Created #122429 to track that particular issue. ### CPU performance data with TorchBench \|MODEL \|BATCH SIZE \| DTYPE \| BEFORE: Speedup over eager-mode with the default Inductor implementation \| AFTER: Speedup over eager mode with SDPA op mapped\| Perf boost = (AFTER - BEFORE)/BEFORE * 100\| \|--------------------------\|-------------\|---------\|-----------------------------\|--------------------------\|------------\| \|hf_GPT2\| 1 \| FP32 \| 1.522x \| 1.791x\| 17.67%\| \|hf_GPT2\| 1 \| BF16 (AMP) \| 1.795x \| 2.387x\| 32.98%\| \|hf_GPT2\| 2 \| FP32 \| 1.313x \|1.629x \| 19.3%\| \|hf_GPT2\|2\| BF16 (AMP) \| 1.556x \| 1.924x \| 23.65%\| \|hf_GPT2_large\| 1 \| FP32 \| 1.380x \|1.585x \| 12.93%\| \|hf_GPT2_large\| 1 \| BF16 (AMP) \| 1.208x \| 1.567x \| 22.91%\| \|hf_GPT2_large\| 2 \| FP32 \| 1.188x \| 1.490x \| 25.42%\| \|hf_GPT2_large\|2\| BF16 (AMP) \| 0.991x \| 1.575x \| 58.93%\| Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen Sapphire Rapids) 48 physical cores were used. Intel OpenMP & libtcmalloc were preloaded. 
Example command - ``` OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 -C 0-47 python benchmarks/dynamo/torchbench.py --performance --inference --inductor --float32 -dcpu --only hf_GPT2_large --freezing --batch-size 1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121866 Approved by: https://github.com/Valentine233, https://github.com/jgong5, https://github.com/desertfire	2024-03-25 15:04:03 +00:00
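The "Perf boost" column in the table above follows the formula given in its header, (AFTER - BEFORE)/BEFORE * 100. A minimal Python sketch that reproduces the column from the speedups listed in the commit message; the function name `perf_boost` and the row layout are illustrative assumptions, not part of the PR:

```python
def perf_boost(before: float, after: float) -> float:
    """Relative improvement in percent: (AFTER - BEFORE) / BEFORE * 100."""
    return (after - before) / before * 100.0

# (model, batch size, dtype, BEFORE speedup, AFTER speedup), copied from the table
rows = [
    ("hf_GPT2", 1, "FP32", 1.522, 1.791),
    ("hf_GPT2", 1, "BF16 (AMP)", 1.795, 2.387),
    ("hf_GPT2_large", 2, "BF16 (AMP)", 0.991, 1.575),
]

for model, batch, dtype, before, after in rows:
    print(f"{model} bs={batch} {dtype}: {perf_boost(before, after):.2f}%")
```

Note that the boost is relative to the default Inductor speedup, not to eager mode, so a row can show a large boost even when the BEFORE speedup was below 1.0x.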
PyTorch MergeBot	4dc09d6aa4	Revert "Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 )" This reverts commit e9dcda5cba92884be6432cf65a777b8ed708e3d6. Reverted https://github.com/pytorch/pytorch/pull/114068 on behalf of https://github.com/ezyang due to memory leak in another ci ([comment](https://github.com/pytorch/pytorch/pull/114068#issuecomment-2018044527))	2024-03-25 13:49:04 +00:00
cyy	b9d6f8cc18	Fix clang-tidy warnings in aten/src/ATen/core/*.cpp (#122572 ) This PR fixes clang-tidy warnings in aten/src/ATen/core/*.cpp. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122572 Approved by: https://github.com/ezyang	2024-03-25 13:46:24 +00:00
Edward Z. Yang	1e404c9b12	Remove redundant query to tensor_to_context (#122278 ) from_real_tensor will query it again, so this query is strictly dominated. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122278 Approved by: https://github.com/eellison ghstack dependencies: #122044, #122270, #122271	2024-03-25 13:16:21 +00:00
Edward Z. Yang	49b81af45f	Delete dead memoized_only kwarg in FakeTensor (#122271 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122271 Approved by: https://github.com/eellison ghstack dependencies: #122044, #122270	2024-03-25 13:16:21 +00:00
Edward Z. Yang	f32ce4e28e	Delete FakeTensorConverter.__call__ in favor of from_real_tensor (#122270 ) It's annoying grepping for `__call__` call-sites so they're all explicit now. I'd do this to MetaConverter too but that one is way more public, a lot more sites. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122270 Approved by: https://github.com/eellison ghstack dependencies: #122044	2024-03-25 13:16:13 +00:00
Jason Ansel	069270db60	[dynamo] Fix list comparison ops (#122559 ) Fixes #122376 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122559 Approved by: https://github.com/Skylion007	2024-03-25 07:03:23 +00:00
Edward Z. Yang	5891c5b3a6	Factor meta conversion through serializable MetaTensorDesc (#122044 ) Fixes https://github.com/pytorch/pytorch/issues/121085 This PR is pretty involved so pay attention to this description. At a high level, the refactor is intended to be mechanical: anywhere in MetaConverter where previously we took a Tensor as argument, we now take a MetaTensorDesc, which contains all of the information that we would have queried off of the Tensor, but placed into a separate data structure which we can serialize or use to recreate a fake tensor in a separate fake tensor mode in exact fidelity to the original. However, this transformation is not always entirely mechanical. Here is what you need to pay attention to: - The memo table from real Tensor -> meta/fake Tensor is now broken into two memo tables: real Tensor -> stable int id -> meta/fake Tensor. The stable int id is needed so that when we do serialization, we know when tensors/storages alias each other and can ensure we preserve this aliasing upon deserialization. The way I have implemented this changes the weak reference behavior. Previously, when either the real Tensor OR the meta/fake Tensor went dead, we would remove the entry from the memo table. Now, this only removes entries from one of the two memo tables. This semantically makes sense, because the user may have held on to the stable int id out of band, and may expect a real Tensor to continue to be numbered consistently / expect to be able to lookup a meta/fake tensor from this id. If this is unacceptable, it may be possible to rejigger the memo tables so that we have real Tensor -> stable int id and real Tensor -> meta/fake Tensor, but TBH I find the new implementation a lot simpler, and arranging the memo tables in this way means that I have to muck around with the real tensor to save to the memo table; in the current implementation, I never pass the Tensor to meta_tensor function AT ALL, which means it is impossible to accidentally depend on it. 
- When I fill in the fields of MetaTensorDesc in describe_tensor, I need to be careful not to poke fields when they are not valid. Previously, preconditions were implicitly checked via the conditional structure ("is this sparse? is this nested?") that is tested before we start reading attributes. This structure has to be replicated in describe_tensor, and I have almost assuredly gotten it wrong on my first try (I'll be grinding through it on CI; a careful audit will help too, by auditing that I've tested all the same conditionals that the original access was guarded by.) - I originally submitted https://github.com/pytorch/pytorch/pull/121821 for the symbolic shapes change, but it turned out the way I did it there didn't actually work so well for this PR. I ended up just inlining the symbolic shapes allocation logic into MetaConverter (look for calls to maybe_specialize_sym_int_with_hint), maybe there is a better way to structure it, but what I really want is to just read sizes/strides/offset directly off of MetaTensorDesc; I don't want another intermediate data structure. - Some fields aren't serializable. These are documented as "NOT serializable". ctx/type should morally be serializable and I just need to setup a contract with subclasses to let them be serialized. The fake_mode is used solely to test if we are refakefying with a pre-existing ShapeEnv and we want to reuse the SymInt directly--serializing this case is hopeless but I am kind of hoping after this refactor we do not need this at all. view_func is not serializable because it's a bound C implemented method. Joel has promised me that this is not too difficult to actually expose as a true data structure, but this is the edgiest of edge cases and there is no reason to deal with it right now. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044 Approved by: https://github.com/eellison	2024-03-25 06:21:17 +00:00
Nikita Shulga	cf06189a2d	[CPPInductor] Fix another out-of-bounds access (#122580 ) Not sure what was the idea behind `{self.tiling_factor}*sizeof(float)/sizeof({DTYPE_TO_CPP[dtype]})` size calculation (perhaps copy-n-paste error during the refactor made by https://github.com/pytorch/pytorch/pull/97626 ) , but `Vectorized::store(ptr, tiling_factor)` needs at least `tiling_factor` elements, not `tiling_factor/2` (which would be the case with the original calculation if data type is 64-bit value such as int64) Discovered while trying to enable aarch64 vectorized inductor. Minimal reproducer (reproducible on ARMv8 or any x86_64 machine that does not support AVX512): ```python import torch def do_ds(x, y): return torch.diagonal_scatter(x, y) x=torch.ones(10, 10, dtype=torch.int64) y=torch.tensor([ 1, 2, -8, 8, 5, 5, -7, -8, 7, 0]) dsc = torch.compile(do_ds) assert torch.allclose(torch.diagonal_scatter(x, y), dsc(x, y)) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122580 Approved by: https://github.com/Skylion007, https://github.com/jansel	2024-03-25 04:49:20 +00:00
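The size arithmetic described in the commit above can be checked with a few lines of Python. This is a sketch of the arithmetic only, not the Inductor code: `tiling_factor = 16`, the byte sizes, and both helper names are assumptions chosen for illustration.

```python
# Illustrative only: tiling_factor and the element sizes are assumed values,
# not read from Inductor.
SIZEOF_FLOAT = 4  # bytes

def old_buffer_elems(tiling_factor: int, dtype_size: int) -> int:
    """Element count given by tiling_factor * sizeof(float) / sizeof(dtype)."""
    return tiling_factor * SIZEOF_FLOAT // dtype_size

def required_elems(tiling_factor: int) -> int:
    """A store of tiling_factor elements needs at least that many slots."""
    return tiling_factor

tiling_factor = 16
for name, size in [("float32", 4), ("int64", 8)]:
    have = old_buffer_elems(tiling_factor, size)
    need = required_elems(tiling_factor)
    print(f"{name}: buffer holds {have}, store writes {need}, too small: {have < need}")
```

For a 4-byte type the two counts agree, which may be why the original expression went unnoticed; for an 8-byte type the buffer is half the size the store needs.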
Edward Z. Yang	deeeaded1f	Add metas for randint/rand factory functions out overload (#122375 ) Fixes https://github.com/pytorch/pytorch/issues/121897 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122375 Approved by: https://github.com/lezcano	2024-03-25 04:01:38 +00:00
cyy	a01d35c7f6	[TorchGen] Remove unused variables (#122576 ) This PR removes some unused Python variables from TorchGen scripts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122576 Approved by: https://github.com/Skylion007	2024-03-25 03:31:41 +00:00
Nikita Shulga	e75ecd5618	[BE][veclib] Use `is_same_v`/`enable_if_t` (#122533 ) `enable_if_t` helper is part of C++14; `is_same_v` helper is part of C++17 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122533 Approved by: https://github.com/Skylion007	2024-03-24 20:57:41 +00:00
Alexander Grund	14e348b7ad	Handle JIT test failure when the GPU is newer than the CUDA compiler or vice versa (#122400 ) The test may fail because it either uses target flags newer than the GPU, resulting in failures loading the compiled binary, or targets a GPU for which CUDA has no support yet/anymore Pull Request resolved: https://github.com/pytorch/pytorch/pull/122400 Approved by: https://github.com/ezyang	2024-03-24 13:58:06 +00:00
Yifu Wang	36188360dd	[dynamo] support torch.distributed.{group.WORLD, GroupMember.WORLD, distributed_c10d._get_default_group} (#120560 ) Fixes https://github.com/pytorch/pytorch/issues/120431 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120560 Approved by: https://github.com/wconstab	2024-03-24 11:13:05 +00:00
Jason Ansel	3e4a4bea12	[dynamo] Graph break on SymNode control flow (#122546 ) Fixes #111918 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122546 Approved by: https://github.com/ezyang	2024-03-24 07:22:02 +00:00
Honglin Zhu	adeedc060f	[Inductor] Fix unbacked symbol in stride when using item() (#122298 ) Fixes #122296 Test: python test/inductor/test_torchinductor_dynamic_shapes.py -k test_item_unbacked_stride_nobreak_cuda Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122298 Approved by: https://github.com/ezyang	2024-03-24 06:27:15 +00:00
cyy	c1fe09dc37	[TorchGen] Add mutable parameter to valuetype_type function in api/cpp.py (#121415 ) This PR is a follow-up of #120076, it moves std::optional<Generator> detection logic into ```valuetype_type``` of api/cpp.py by adding the mutable parameter, which facilitates future value type changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121415 Approved by: https://github.com/ezyang	2024-03-24 06:11:08 +00:00
Kurt Mohler	ca9606f809	Update COW OpInfo test to include kwargs and expected materialization (#122437 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122437 Approved by: https://github.com/ezyang	2024-03-24 06:07:30 +00:00
Alexander Grund	9d4218c23e	Handle JIT test failure when the GPU is newer than the CUDA compiler (#122402 ) The test uses the CUDA compute capabilities of the current device to compile an extension. If nvcc is older than the device, it will fail with a message like "Unsupported gpu architecture 'compute_80'" resulting in a `RuntimeError: Error building extension 'cudaext_archflags'` ultimately failing the test. This checks for this case and allows execution to continue Fixes https://github.com/pytorch/pytorch/issues/51950 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122402 Approved by: https://github.com/ezyang	2024-03-24 05:36:24 +00:00
cyy	808a035658	[Dynamo][4/N] Enable clang-tidy coverage on torch/csrc/dynamo/* (#122534 ) This PR enables clang-tidy coverage on torch/csrc/dynamo/* and also contains other small improvements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122534 Approved by: https://github.com/Skylion007	2024-03-24 05:26:32 +00:00
PyTorch UpdateBot	f0d461beac	[vision hash update] update the pinned vision hash (#122536 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122536 Approved by: https://github.com/pytorchbot	2024-03-24 03:42:21 +00:00
Jason Ansel	5f7e71c411	[dynamo] Add HASATTR guard for UserDefinedObject attrs (#122555 ) Fixes #111522 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122555 Approved by: https://github.com/Skylion007	2024-03-24 03:41:58 +00:00
Jason Ansel	07d037674f	[inductor] Fix issue with randint + symbolic shapes (#122428 ) Fixes #122405 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122428 Approved by: https://github.com/ezyang	2024-03-24 03:41:13 +00:00
Edward Z. Yang	476585b190	Preserve unbacked SymInt on SymNode (#120816 ) Previously, when we applied a replacement, a SymInt that was previously an unbacked SymInt would then transmute into whatever we replaced it into (e.g., a constant). This has a major downside: we often look at SymInts associated with FX nodes (e.g., the meta of x.item() return) to find out where the unbacked SymInt was allocated. If we replace it, we no longer can find out where, e.g., u1 was allocated! But we need to know this so we can generate deferred runtime asserts like u1 == s0. To solve this problem, I have a special mode for replace, resolve_unbacked=False, which lets you disable substitutions on unbacked SymInts. When reporting node.expr, we preferentially avoid applying unbacked SymInt substitutions. To understand if we might accidentally reapply the substitution later, before we have reached the deferred runtime assert, we must study the calls to simplify() in ShapeEnv. My audit turns up these sites: * `produce_guards`: this is fine, deferred runtime asserts never show up here, we must NOT have unbacked SymInts show up here. Similarly `get_nontrivial_guards`. * `_maybe_evaluate_static`: this is fine, we are using this to determine if it is necessary to produce a guard/runtime assert. We don't want to reissue a runtime assert if we've already asserted on it, and replacements can help us understand if this has occurred. * `_simplify_floor_div`: this is a legitimate bug, it needs to be `resolve_unbacked=False` * `_refine_ranges`: this is fine, a refined range doesn't affect what runtime asserts we issue * `_update_divisible`: this updates the `self.divisible` set, which specifies when we can simplify away divisibility constraints. Since this affects replacements only, it won't cause us to oversimplify a user provided expression. 
There are some situations where we DO want to always apply the substitution, specifically when we have the duplicate symbol problem (we retrace an item call and get u0 and u1 which refer to the same thing.) I don't want two symbols in this case, so a special `rename_unbacked_to` is provided which sets up the unconditional renaming. Along the way, I make a refinement to `_update_var_to_range`: if you update a var range for a size-like unbacked SymInt, you are now no longer allowed to set its lower bound below 2. This is because if you could, then our size oblivious tests for it would be inconsistent. Actually, I think there is still some inconsistency, because if you assert `u0 == 0` we will still end up with this in deferred runtime asserts, and we will then use this to simplify these statements to be True everywhere else. Maybe we should forbid this kind of refinement; not done in this PR. Fixes https://github.com/pytorch/pytorch/issues/119689 Fixes https://github.com/pytorch/pytorch/issues/118385 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120816 Approved by: https://github.com/lezcano	2024-03-24 02:56:16 +00:00
cyy	a52b4e2257	Change ATEN generator argument type to const std::optional<Generator>& (#120076 ) This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076 Approved by: https://github.com/malfet	2024-03-24 02:12:08 +00:00
Edward Z. Yang	788638fcdc	Suggest TORCHDYNAMO_EXTENDED_DEBUG_ envvars when appropriate (#122473 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122473 Approved by: https://github.com/lezcano	2024-03-24 01:02:20 +00:00
vfdev-5	cdc7f0fd3b	Fixed failing pyhpc_equation_of_state due to cpp nodes fusion with compatible ranges (#122420 ) Fixes #122283 Description: PR https://github.com/pytorch/pytorch/pull/120077 introduced cpp nodes fusion with compatible ranges with an assumption that all scheduler nodes inside the fused nodes are the same, however, it appeared that snodes can have different indexing expressions. This PR fixes the incorrect assumption. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122420 Approved by: https://github.com/lezcano	2024-03-24 00:40:31 +00:00
Nikita Shulga	4758837930	[BE] Do not use `importlib.load_module` (#122542 ) To get rid of the annoying ``` <frozen importlib._bootstrap>:283: DeprecationWarning: the load_module() method is deprecated and slated for removal in Python 3.12; use exec_module() instead ``` using recipe from https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly Pull Request resolved: https://github.com/pytorch/pytorch/pull/122542 Approved by: https://github.com/jansel, https://github.com/desertfire	2024-03-23 17:22:26 +00:00
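The recipe referenced in the commit above (from the Python `importlib` documentation) replaces `load_module()` with `spec_from_file_location`, `module_from_spec`, and `exec_module`. A minimal, self-contained sketch follows; the helper name `load_source` and the temporary module `demo_mod` are illustrative assumptions and do not come from the PR:

```python
import importlib.util
import sys
import tempfile
from pathlib import Path

def load_source(module_name: str, path: str):
    """Import a module from a file path without the deprecated load_module()."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {module_name} from {path}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so the module can be found during its own import.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module

# Write a throwaway source file and import it by path.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "demo_mod.py"
    src.write_text("ANSWER = 42\n")
    demo = load_source("demo_mod", str(src))
    print(demo.ANSWER)
```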
Nikita Shulga	bf40e3f880	[EZ][BE] Add missing `acosh` op to vec256_float_neon.h (#122513 ) As base class has it `ed15370aab/aten/src/ATen/cpu/vec/vec_base.h (L367-L369)` Discovered while attempting to enable Inductor vectorization on ARM platform Pull Request resolved: https://github.com/pytorch/pytorch/pull/122513 Approved by: https://github.com/Skylion007	2024-03-23 14:18:02 +00:00
Pearu Peterson	a39e638707	Update bsr_dense_addmm kernel parameters for sizes 3 x 2 ^ N (#122506 ) As in the title. The speed-ups for a particular set of input sizes range from about 7 to 85 % depending on the used BSR tensor block sizes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122506 Approved by: https://github.com/cpuhrsch	2024-03-23 11:54:33 +00:00
Alexander Grund	8a209344c9	Fix access to uninitialized memory in VSX vector functions for quantized values (#122399 ) Similar to https://github.com/pytorch/pytorch/pull/89833 those functions may access uninitialized memory leading to undefined behavior/results. Initialize with zeros as done before. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122399 Approved by: https://github.com/ezyang	2024-03-23 06:11:30 +00:00
Guang Yang	c677221798	remove torchao dependency (#122524 ) Test Plan: CI ``` buck2 run mode/dev-nosan mode/inplace executorch/examples/models/llama2:export_llama -- -c ~/llama/ultra_new_checkpoint.pt -p ~/llama/params.json -kv -E 8,8 -d fp32 --pt2e_quantize "xnnpack_dynamic" -2 ``` ``` buck run //executorch/backends/xnnpack/test:test_xnnpack_ops -- executorch.backends.xnnpack.test.ops.linear.TestLinear.test_qd8_fp32_per_token_weight_per_channel_group_int4 ``` Differential Revision: D55263008 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122524 Approved by: https://github.com/jerryzh168	2024-03-23 03:18:43 +00:00
Nikita Shulga	19d27a13ea	[CPUInductor] Fix out-of-bounds read/write in cvt_int64_to_[fp32\|int32] (#122511 ) Discovered while debugging regressions in enabling vectorization on ARM platform Without this change `test_div2_cpu` will fail with invalid values on non-x86 CPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/122511 Approved by: https://github.com/peterbell10, https://github.com/jansel	2024-03-23 01:45:07 +00:00
chilli	4d8a3f8bb3	changed aliasing checks to properly recurse for computing last usage (#122444 ) Fixes https://github.com/pytorch/pytorch/issues/122457 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122444 Approved by: https://github.com/yifuwang, https://github.com/jansel ghstack dependencies: #121624, #122474	2024-03-23 01:43:21 +00:00
Matthew Haddock	50036ec781	[Inductor] Add a test for creating a cpu inductor-> triton backend (#122396 ) Summary: Currently there is a test for adding a backend in test/inductor/test_extension_backend.py for a cpp backend with a new device. However there is no such test for the Triton backend; it should be possible for a user to create and register their own ExtensionWrapperCodegen and ExtensionScheduling for another non-CUDA device and be able to generate Triton code. For simplicity I have chosen to use a CPU device, as I think it's plausible someone might want to create a CPU Triton backend. Unfortunately the generation and running of the code is quite tightly coupled so I've had to use a mocked function to extract the code before running. Suggestions are welcome for better ways to do this. This is a stepping off point for some additional PRs to make the Triton code path less CUDA specific, as currently there would be no way to test this avenue. Test plan: ``` frames [('total', 1), ('ok', 1)] stats [('calls_captured', 3), ('unique_graphs', 1)] inductor [('intermediate_hooks', 1)] aot_autograd [('total', 1), ('ok', 1)] . ---------------------------------------------------------------------- Ran 1 test in 0.394s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122396 Approved by: https://github.com/jansel	2024-03-23 01:14:57 +00:00
Xinfeng	41d69ff324	Add a shape inference tool (#120097 ) Summary: Add a shape inference tool that helps to infer each node shape of a given graph module. 1. Given an fx graph and an example input (it doesn't need to be an accurate input that can be forwarded, but should have valid dims and data structures), `infer shape` creates an input of symbolic shape 2. Shape prop this symbolic input can catch runtime or value exceptions. 3. These errors are constraints for symbol values, and the constraint solver `infer symbolic values` helps us figure out specific values for each symbol. 4. Finally, we run the shape propagation based on input tensor to get tensor shapes for all nodes in the FX traced module. Test Plan: ### 1. Test `infer symbol values` Command: ``` buck2 test mode/opt //caffe2/test:fx_experimental -- test_infer_symbol_values ``` ### 2. Test `infer shape` Command: ``` buck2 test mode/opt //caffe2/test:fx_experimental -- test_infer_symbol_values ``` Inferred shape result like: P897560514 Differential Revision: D53593702 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120097 Approved by: https://github.com/yf225	2024-03-23 00:23:24 +00:00
Alexander Grund	29bca8547b	Fix failing test_cpu_repro without vectorization support (#117262 ) At least the following tests fail when there is no supported vector ISA: test_lowp_fp_neg_abs test_non_contiguous_index_with_constant_stride test_scalar_mul_bfloat16 test_transpose_non_contiguous test_transpose_sum2d_cpu_only test_transpose_sum_outer test_transpose_vertical_sum_cpu_only test_vertical_sum_cpu_only Those tests assert `metrics.generated_cpp_vec_kernel_count` is nonzero which is never the case without a supported vector ISA, e.g. on PPC and maybe on AArch. Skip those tests with a new decorator and use the simpler one where an equivalent is already used Some usages of `metrics.generated_cpp_vec_kernel_count` were guarded by a check instead of skipping the test. I tried to apply that instead of a skip where the test looked similar enough to where that was previously done. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117262 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-03-23 00:03:55 +00:00
angelayi	a84f1d3def	[effects] Fix backwards handling (#122346 ) I didn't previously test the `.backwards()` call, and when testing on #122348 I realized we were missing some token handling in some places. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122346 Approved by: https://github.com/zou3519	2024-03-22 23:31:52 +00:00
Brian Hirsh	e7fa3f7812	AOTDispatch: allow subclasses to correct when we guess metadata of tangents incorrectly (#118670 ) This PR is enough to fix https://github.com/pytorch/pytorch/issues/118600. More description of the problem is in the issue, but the high-level problem is similar to the "tangents might be non-contiguous" problem that we handle today, via forcing all tangents to be contiguous. There, the problem was something like: "We guessed the tangent strides incorrectly, because strides on the runtime tangents were different from strides on the forward outputs, which we used to generate tangents" Here, the problem is similar: "We guessed the tangent tensor subclass's metadata incorrectly, because the runtime tangent was a subclass with different metadata than the forward output subclass". This happened in an internal DTensor issue, where the metadata in question was the `placements` (shard vs. replicate vs. Partial). One option is to solve this problem via backward guards. This is needed to unblock internal though, so I figured handling this similarly to how we handle non-contiguous tangents would be reasonable. I did this by: (1) Assert that the metadata on subclass tangents is the same as what we guessed, and if not raise a loud error (2) In the error message, provide the name of an optional method that the subclass must implement to handle this case: `def __force_same_metadata__(self, metadata_tensor):`: If the forward output had a `Replicate()` placement, but the runtime tangent had a `Shard(1)` placement, this method allows a subclass to take the tangent and "convert" it to one with a `Replicate()` placement. `__force_standard_metadata__(self)`: One issue is that there is another placement called `_Partial`, and its semantics are such that DTensor is unable to convert a DTensor with some placement type into another DTensor with a `_Partial` placement. 
`__force_standard_metadata__` is now called on all (fake) subclass forward outs at trace-time to generate tangents, and gives subclasses a chance to "fix" any outputs with metadata that they cannot convert to later. Morally, this is similar to the fact that we force a `contiguous()` call on all tangents at trace-time. I'm interested in thoughts/feedback! Two new dunder methods on traceable subclasses is definitely a contentious change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118670 Approved by: https://github.com/ezyang	2024-03-22 23:16:08 +00:00
Aaron Orenstein	f7b8d8e249	Support for sapling scm (#122072 ) We can use Sapling (hg) with the pytorch repo but there are a couple minor issues to teach our scripting to be happier with having either a git or hg repo. This change fixes some issues in: - setup.py - lintrunner Pull Request resolved: https://github.com/pytorch/pytorch/pull/122072 Approved by: https://github.com/ezyang	2024-03-22 22:59:16 +00:00
cyy	482f6c4693	[Dynamo][3/N] Fix clang-tidy warnings in torch/csrc/dynamo/* (#122392 ) This PR continues to clean clang-tidy warnings in torch/csrc/dynamo/*, following #122362 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122392 Approved by: https://github.com/ezyang	2024-03-22 22:57:41 +00:00
Pian Pawakapan	3f99306452	[export] Remove from_export flag (#122500 ) Summary: The flag from_export was incorrectly included in a previous diff (https://www.internalfb.com/diff/D54314379) - it was intended for helping with ExportedProgram verification, but was no longer needed in the final implementation. Test Plan: Changes no functionality, test/export already covers everything Differential Revision: D55205857 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122500 Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17	2024-03-22 22:55:14 +00:00
Catherine Lee	03184a82dd	[TD] TD on ASAN PR jobs (#122332 ) Low impact CPU jobs Pull Request resolved: https://github.com/pytorch/pytorch/pull/122332 Approved by: https://github.com/huydhn	2024-03-22 22:32:51 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	271cc687de	Audit retraceability errors and fix some ez ones (#122461 ) Summary: Title Test Plan: CI Differential Revision: D55227094 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122461 Approved by: https://github.com/zhxchen17	2024-03-22 21:31:51 +00:00
Thiago Crepaldi	29132c2e47	Prevent dup initializers when ONNXProgram.save is called many times (#122435 ) Fixes https://github.com/pytorch/pytorch/issues/122351 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122435 Approved by: https://github.com/titaiwangms ghstack dependencies: #122196, #122230	2024-03-22 21:03:15 +00:00
Guilherme Leobas	4eaa000acc	Teach dynamo about torch.func.jvp (#119926 ) List of changes: - Replace JVP_NESTING by torch._C._functorch.maybe_current_level() - Remove all increment nesting functions from wrap_fx_proxy_cls - fwAD.make_dual receives the dual_level as keyword argument - Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926 Approved by: https://github.com/zou3519	2024-03-22 20:25:47 +00:00
PyTorch MergeBot	3795ebe925	Revert "[Inductor] Make codecache CUDA compilation more robust & flexible (#121490 )" This reverts commit 6bbd697306851b785b51b4d0545c1ef9365ddaa6. Reverted https://github.com/pytorch/pytorch/pull/121490 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing on ROCm, i.e. `700c92e1b9` ([comment](https://github.com/pytorch/pytorch/pull/121490#issuecomment-2015829464))	2024-03-22 20:11:47 +00:00
PyTorch MergeBot	97d3bf71b9	Revert "[Inductor Cutlass backend] GEMM size threshold for Cutlass backend usage (#121491 )" This reverts commit 700c92e1b9cb6fae2610d08e5a960273c4dd1697. Reverted https://github.com/pytorch/pytorch/pull/121491 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing on ROCm, i.e. `700c92e1b9` ([comment](https://github.com/pytorch/pytorch/pull/121490#issuecomment-2015829464))	2024-03-22 20:11:47 +00:00
David Berard	8013c4409f	[inductor] config to control whether we assume inputs are aligned (#122158 ) Motivation: https://github.com/pytorch/pytorch/issues/112771 Summary: Inductor generates triton that assumes that inputs are going to be 16-byte aligned. If the inputs aren't aligned, Inductor clones the inputs. This PR introduces a config option to not do this: when assume_aligned_inputs=False, Inductor will _not_ pass inputs as being divisible_by_16, and Inductor will not make clones. This can generate code that might be a bit slower, but this tradeoff can be worth it in some scenarios where you might otherwise make a lot of clones. Ideally, we could do this on a per-tensor basis. But this would be a lot of work, and attempts to add guards on storage offsets to do this automatically have run into issues: recompilations and excessive time to generate/check guards. Tests https://github.com/pytorch/pytorch/pull/122159 flips this to False. It didn't run through all errors, but the ones we see are all expected failures: divisible_by_16 changes; triton kernel caching fails if we call the same triton kernel multiple times (this makes sense because the first call will have unaligned inputs, but subsequent calls have aligned inputs); and some xfailed tests start passing. Alternatives/RFC: * Is this the right thing to do with cudagraphs? * Elias and Jason mentioned that we probably still want to make clones if we're dealing with unaligned inputs to matmuls. Is this something we should add in this config option? (In the use case I'm targeting, it seems like we don't need this optimization right now) Differential Revision: [D55079094](https://our.internmc.facebook.com/intern/diff/D55079094) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122158 Approved by: https://github.com/ezyang	2024-03-22 20:03:38 +00:00
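The alignment rule in #122158 can be illustrated without Inductor. The following is a minimal sketch, assuming alignment is judged from the byte address of a view's first element; `is_16_byte_aligned`, `should_clone`, and the example address are hypothetical names and values for illustration, not Inductor's code. Only the 16-byte figure and the `assume_aligned_inputs` name come from the commit message.

```python
ALIGNMENT_BYTES = 16  # alignment that the generated Triton code assumes


def is_16_byte_aligned(base_address: int, storage_offset: int, itemsize: int) -> bool:
    """True if the first element of the view starts on a 16-byte boundary."""
    return (base_address + storage_offset * itemsize) % ALIGNMENT_BYTES == 0


def should_clone(aligned: bool, assume_aligned_inputs: bool) -> bool:
    """Hypothetical decision rule paraphrasing the commit message.

    When alignment is assumed, an unaligned input is cloned into fresh,
    aligned memory. When it is not assumed, no clone is made and the
    kernel simply cannot rely on divisibility by 16.
    """
    return assume_aligned_inputs and not aligned


base = 0x1000  # hypothetical allocation address, itself 16-byte aligned
view_a = is_16_byte_aligned(base, storage_offset=4, itemsize=4)  # starts 16 bytes in
view_b = is_16_byte_aligned(base, storage_offset=1, itemsize=4)  # starts 4 bytes in

print(view_a, view_b)
print(should_clone(view_b, assume_aligned_inputs=True))
print(should_clone(view_b, assume_aligned_inputs=False))
```

A float32 view that starts one element into an aligned buffer is the typical unaligned case: it is what a slice such as `x[1:]` would produce.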
Peter Bell	5790096059	[dynamo] Remove uses of `raise unimplemented` (#122136 ) `unimplemented` is a function that raises an error, so `raise unimplemented(...)` never reaches the `raise`. Another related issue is that `raise unimplemented(...) from e` doesn't attach the exception cause correctly. I fix this by adding a `from_exc` argument to `unimplemented`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122136 Approved by: https://github.com/lezcano	2024-03-22 19:29:58 +00:00
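The pitfall described in #122136 can be reproduced in plain Python. The sketch below is a minimal illustration, not dynamo's actual code: `Unsupported`, `unimplemented_old`, and this version of `unimplemented` are hypothetical stand-ins, and only the `from_exc` keyword name is taken from the commit message.

```python
class Unsupported(RuntimeError):
    """Hypothetical stand-in for the error type that the helper raises."""


def unimplemented_old(msg):
    # Raises by itself, so an outer `raise unimplemented_old(...)` never runs.
    raise Unsupported(msg)


_NOTHING = object()  # sentinel: distinguishes "no cause given" from None


def unimplemented(msg, *, from_exc=_NOTHING):
    # Assumed shape of the fix: the helper attaches the cause itself.
    if from_exc is not _NOTHING:
        raise Unsupported(msg) from from_exc
    raise Unsupported(msg)


def cause_with_old_pattern():
    try:
        try:
            raise KeyError("missing")
        except KeyError as e:
            # The call raises before `raise ... from e` can set __cause__.
            raise unimplemented_old("not supported") from e
    except Unsupported as err:
        return err.__cause__


def cause_with_from_exc():
    try:
        try:
            raise KeyError("missing")
        except KeyError as e:
            unimplemented("not supported", from_exc=e)
    except Unsupported as err:
        return err.__cause__


print(cause_with_old_pattern())     # None: the cause was never attached
print(repr(cause_with_from_exc()))  # the original KeyError
```

In the old pattern the `KeyError` still shows up as implicit context (`__context__`), but it is not recorded as the explicit cause, which is the distinction the commit message points at.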
angelayi	ed15370aab	[aoti] Add handling of ir.Constants in promote_constants (#122419 ) This issue popped up when enabling predispatch IR on the benchmarks (https://github.com/pytorch/pytorch/pull/122225) On the following model: ``` class M(torch.nn.Module): def __init__(self, device): super().__init__() self.device = device def forward(self, x): t = torch.tensor(x.size(-1), device=self.device, dtype=torch.float) t = torch.sqrt(t * 3) return x * t ``` We get the following error: ``` ====================================================================== ERROR: test_constant_abi_compatible_cuda (__main__.AOTInductorTestABICompatibleCuda) ---------------------------------------------------------------------- Traceback (most recent call last): File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 2741, in wrapper method(args, kwargs) File "/data/users/angelayi/pytorch/test/inductor/test_torchinductor.py", line 9232, in new_test return value(self) File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor.py", line 922, in test_constant self.check_model(M(self.device), (torch.randn(5, 5, device=self.device),)) File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor.py", line 91, in check_model actual = AOTIRunnerUtil.run( File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor_utils.py", line 102, in run so_path = AOTIRunnerUtil.compile( File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor_utils.py", line 40, in compile so_path = torch._inductor.aot_compile_ep( File "/data/users/angelayi/pytorch/torch/_inductor/__init__.py", line 150, in aot_compile_ep return compile_fx_aot( File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1005, in compile_fx_aot compiled_lib_path = compile_fx( File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in 
inner return func(args, *kwds) File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1111, in compile_fx return compile_fx( File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1145, in compile_fx return compile_fx( File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1336, in compile_fx return inference_compiler(unlifted_gm, example_inputs_) File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper r = func(args, *kwargs) File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1266, in fw_compiler_base return inner_compile( File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/data/users/angelayi/pytorch/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper inner_compiled_fn = compiler_fn(gm, example_inputs) File "/data/users/angelayi/pytorch/torch/_inductor/debug.py", line 304, in inner return fn(args, *kwargs) File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper r = func(args, *kwargs) File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 447, in compile_fx_inner compiled_graph = fx_codegen_and_compile( File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 707, in fx_codegen_and_compile graph.run(example_inputs) File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper r = func(args, kwargs) File 
"/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 612, in run return super().run(args) File "/data/users/angelayi/pytorch/torch/fx/interpreter.py", line 145, in run self.env[node] = self.run_node(node) File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 957, in run_node result = super().run_node(n) File "/data/users/angelayi/pytorch/torch/fx/interpreter.py", line 202, in run_node return getattr(self, n.op)(n.target, args, kwargs) File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 819, in call_function raise LoweringException(e, target, args, kwargs).with_traceback( File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 816, in call_function out = lowerings[target](args, kwargs) File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 298, in wrapped out = decomp_fn(args, **kwargs) File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 5340, in mul return make_pointwise(fn)(a, b) File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 409, in inner inputs = promote_constants(inputs, override_return_dtype) File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 373, in promote_constants ex = next(x for x in inputs if isinstance(x, (TensorBox, ExpandView))) torch._inductor.exc.LoweringException: StopIteration: target: aten.mul.Tensor args[0]: Constant(value=5.0, dtype=torch.float32, device=device(type='cuda', index=0)) args[1]: 3 ``` So I added an additional casing in `promote_constants` to handle the ir.Constants and now it works! Although please let me know if this is the wrong approach. Here's a paste of the full run with the inductor logs: P1198927007 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122419 Approved by: https://github.com/eellison, https://github.com/desertfire, https://github.com/chenyang78	2024-03-22 18:39:36 +00:00
cyy	52e9049ffa	Remove unused variables (#122496 ) This PR removes several unused variables in the code base. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122496 Approved by: https://github.com/ezyang	2024-03-22 18:04:09 +00:00
liqunfu	bbe846f430	Add symbolic_opset19.py and symbolic_opset20.py to support opset 19/20, extend opset 18 support (#118828 ) Start to fix https://github.com/pytorch/pytorch/issues/114801 Co-authored-by: Thiago Crepaldi <thiagofc@microsoft.com> Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118828 Approved by: https://github.com/thiagocrepaldi	2024-03-22 18:01:33 +00:00
Lucas Pasqualin	34d33df056	[DCP] Check if pg exists in async before checking for cpu PG (#122316 ) Check if pg exists in async before checking for cpu PG in async save path. This PR enables using async_save even if PG is not initialized. Differential Revision: [D54868689](https://our.internmc.facebook.com/intern/diff/D54868689/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D54868689/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/122316 Approved by: https://github.com/shuqiangzhang, https://github.com/XilunWu	2024-03-22 18:01:11 +00:00
Kefei Lu	400cc518fc	pt2 dper passes: run shape prop before each pass (#122451 ) Summary: Most passes rely on shape info. We need to run shape prop after each pass Reviewed By: frank-wei Differential Revision: D55221119 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122451 Approved by: https://github.com/frank-wei	2024-03-22 17:57:25 +00:00
Shunting Zhang	152fa9ecc2	skip moondream for training (#122483 ) The model shows up as a failed model on the training dashboard, but it is not implemented for training (at least for now): `2196021e9b/torchbenchmark/models/moondream/__init__.py (L6)` Skip it in the dashboard. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122483 Approved by: https://github.com/eellison	2024-03-22 17:35:52 +00:00
Shunting Zhang	a3d4eaf253	[inductor] device guard for max autotune benchmark (#122479 ) Internal users reported max-autotune failures when tensors are not on device 0. It turns out that we may take tensors that live on, say, device 6 and run the kernel on them from device 0. This PR enforces that we do benchmarking for max-autotune on the correct device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122479 Approved by: https://github.com/xintwfb, https://github.com/Chillee	2024-03-22 17:27:53 +00:00
Yulun Wang	3db64c1955	[NCCL PG] Enable ncclCommDevIdxMap unconditionally (#122049 ) Differential Revision: D54993977 ### Summary The initial purpose of ncclCommDevIdxMap is to support NCCL zero copy algorithms. Therefore, it is only enabled (with its values filled) if useTensorRegisterAllocatorHook_ is set to true. However, now we rely on it to support dumping NCCL information in a single PG. So we need it to be always available, regardless of whether we enabled useTensorRegisterAllocatorHook_. Move the code of filling ncclCommDevIdxMap out of if (useTensorRegisterAllocatorHook_) statement. ### Test Plan See diff Pull Request resolved: https://github.com/pytorch/pytorch/pull/122049 Approved by: https://github.com/shuqiangzhang	2024-03-22 17:10:33 +00:00
wz337	f305c96cac	[DCP] Add bytesIO object to test_e2e_save_and_load (#122112 ) Added a TestTrainstate that includes BytesIO checkpoint. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122112 Approved by: https://github.com/LucasLLC	2024-03-22 16:57:13 +00:00
Yang Chen	86082f1fdc	[aot_inductor] added runtime checks for input/output tensors in debug compile mode (#122047 ) This PR added runtime checks to guard the dtypes and shapes of input/output tensors. Currently, we enable these only for debug compilation (i.e. aot_inductor.debug_compile is True) in abi_compatible mode. Differential Revision: [D54993148](https://our.internmc.facebook.com/intern/diff/D54993148) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122047 Approved by: https://github.com/desertfire	2024-03-22 16:40:33 +00:00
vfdev-5	90a13c3c5b	Added a check in register_lowering to avoid decomposed ops (#117632 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/117632 Approved by: https://github.com/lezcano	2024-03-22 16:38:31 +00:00
Gagan Jain	9347a79f1c	[Watchdog Timer] Clear timer for already terminated process (#122324 ) Summary: Handle the case where a worker process is terminated without releasing its timer request, which leads to the process being reaped at timer expiry. This change removes the non-existent process when clearing timers. Test Plan: unit tests Differential Revision: D55099773 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122324 Approved by: https://github.com/d4l3k	2024-03-22 15:48:03 +00:00
kungyork	018f5e2c32	Fix unused variable warning in `int4mm.cu` (#122286 ) Fix the following warning during compilation: ``` /home/pytorch/aten/src/ATen/native/cuda/int4mm.cu: In function ‘at::Tensor at::native::_weight_int4pack_mm_cuda(const at::Tensor&, const at::Tensor&, int64_t, const at::Tensor&)’: /home/pytorch/aten/src/ATen/native/cuda/int4mm.cu:871:6: warning: variable ‘stream’ set but not used [-Wunused-but-set-variable] 871 | auto stream = at::cuda::getCurrentCUDAStream(); | ^~~~~~ /home/pytorch/aten/src/ATen/native/cuda/int4mm.cu: In function ‘at::Tensor at::native::_convert_weight_to_int4pack_cuda(const at::Tensor&, int64_t)’: /home/pytorch/aten/src/ATen/native/cuda/int4mm.cu:1044:6: warning: variable ‘stream’ set but not used [-Wunused-but-set-variable] 1044 | auto stream = at::cuda::getCurrentCUDAStream(); | ^~~~~~ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122286 Approved by: https://github.com/soulitzer	2024-03-22 15:46:18 +00:00
Zhengxu Chen	7fd14ebb52	[export] Use randomized inputs to examples. (#122424 ) Summary: As titled: replace all torch.ones with randn. Test Plan: CI Reviewed By: tugsbayasgalan Differential Revision: D55206441 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122424 Approved by: https://github.com/tugsbayasgalan	2024-03-22 15:32:28 +00:00
PyTorch MergeBot	60bc29aa0b	Revert "[Quant] [PT2] Add SiLU into X86InductorQuantizer Conv2d Unary Annotation (#122267 )" This reverts commit 2c6eeb26d3f61fba352ad51fd8653120937a20f3. Reverted https://github.com/pytorch/pytorch/pull/122267 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))	2024-03-22 15:04:30 +00:00
PyTorch MergeBot	b30b396d05	Revert "[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op SiLU (#122268 )" This reverts commit 99f0fec7d0873d627e8c7f2dec65818d725424b0. Reverted https://github.com/pytorch/pytorch/pull/122268 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))	2024-03-22 15:04:30 +00:00
PyTorch MergeBot	777ac511cc	Revert "[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardSwish with int8-mix-bf16 (#122373 )" This reverts commit 783fd89ff1cf401e484c20d14b16823abf20d87d. Reverted https://github.com/pytorch/pytorch/pull/122373 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))	2024-03-22 15:04:30 +00:00
PyTorch MergeBot	dbedc6bb7c	Revert "[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardTanh with int8-mix-bf16 (#122374 )" This reverts commit 23a6d74f9352e0afb37750fee300d077c4ba9393. Reverted https://github.com/pytorch/pytorch/pull/122374 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))	2024-03-22 15:04:30 +00:00
PyTorch MergeBot	02fee6caec	Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076 )" This reverts commit ecbe82b9cec75324b7efb58e1d9cae6b35b71bdc. Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/jeanschmidt due to Reverting in order to check if this will fix XLA trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-2015272644))	2024-03-22 14:53:45 +00:00
Joel Schlosser	e6986e4317	Public API for NJT construction from jagged components (#121518 ) This PR introduces `torch.nested.nested_tensor_from_jagged(values, offsets=None, lengths=None, jagged_dim=1)` (bikeshedding welcome). This is intended to be the main entrypoint for getting an NJT from the `(values, offsets, lengths)` components. The returned NJT is a view of the `values` component. Note that `torch.nested.nested_tensor()` / `torch.nested.as_nested_tensor()` already exist for constructing an NJT from a list of tensors. TODO: * Some doc formatting; suggestions welcome there * Tests / examples using `jagged_dim != 1` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121518 Approved by: https://github.com/cpuhrsch ghstack dependencies: #113279, #113280	2024-03-22 14:48:22 +00:00
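To make the `(values, offsets, lengths)` layout from #121518 concrete, here is a pure-Python sketch of how those components describe variable-length rows. `split_jagged` is a hypothetical helper for illustration only; it is not the `torch.nested` implementation, it works on plain lists rather than tensors, and it assumes that `offsets` holds one more entry than the number of rows when `lengths` is omitted.

```python
from typing import List, Optional, Sequence


def split_jagged(
    values: Sequence[float],
    offsets: Sequence[int],
    lengths: Optional[Sequence[int]] = None,
) -> List[List[float]]:
    """Slice a flat `values` buffer into rows.

    Without `lengths`, row i spans values[offsets[i]:offsets[i + 1]].
    With `lengths`, row i spans values[offsets[i]:offsets[i] + lengths[i]],
    which allows gaps between rows. Any offsets beyond len(lengths) are
    ignored because zip stops at the shorter sequence.
    """
    if lengths is None:
        return [list(values[s:e]) for s, e in zip(offsets[:-1], offsets[1:])]
    return [list(values[s:s + n]) for s, n in zip(offsets, lengths)]


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rows = split_jagged(values, offsets=[0, 2, 3, 6])
print(rows)  # [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]

# With explicit lengths, rows may skip elements of `values`.
holey = split_jagged(values, offsets=[0, 2, 3], lengths=[1, 1, 2])
print(holey)  # [[1.0], [3.0], [4.0, 5.0]]
```

The sketch copies each row into a new list, whereas the commit message states that the returned NJT is a view of `values`.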
Brian Hirsh	65c37fe05a	AOTAutograd: ensure traced tangent subclass metadata takes non-contiguous outputs into account (#118669 ) Fixes https://github.com/pytorch/pytorch/issues/118596. The issue was as follows: (1) Whenever AOTAutograd sees an output that is non-contiguous, that it needs a tangent for, it forces the tangent that it generates to be contiguous during tracing (2) However: if this tangent is a subclass, we need to generate code to flatten/unflatten the subclass at runtime. (3) To do so, we use the metadata stashed here: https://github.com/pytorch/pytorch/blob/main/torch/_functorch/_aot_autograd/schemas.py#L231 (4) However, this metadata was wrong - it was generated by inspecting the tangent, before we made the tangent contiguous. The fix in this PR basically moves the logic that makes `traced_tangents` contiguous earlier, to the time that we first generate `ViewAndMutationMetadata` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118669 Approved by: https://github.com/zou3519 ghstack dependencies: #118803, #119947	2024-03-22 14:42:27 +00:00
Brian Hirsh	09be5800c8	dynamo: support placement kwargs for DTensor.to_local() (#119947 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119947 Approved by: https://github.com/wanchaol, https://github.com/yoyoyocmu ghstack dependencies: #118803	2024-03-22 14:42:27 +00:00
Brian Hirsh	2e44b12dd4	dynamo: handle DTensor.device_mesh.device_type (#118803 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118803 Approved by: https://github.com/wanchaol, https://github.com/yanboliang	2024-03-22 14:42:22 +00:00
andrewor14	ea8e0c75c7	[quant][pt2] Fix create FQ with FixedQParamsQSpec (#122104 ) Summary: Before we just returned a _PartialWrapper object when using FixedQParamsQuantizationSpec in QAT. This is wrong and we should return a FQ object instead. Differential Revision: [D55021106](https://our.internmc.facebook.com/intern/diff/D55021106) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122104 Approved by: https://github.com/jerryzh168	2024-03-22 14:23:05 +00:00
andrewor14	6e6891e843	[jit] Fix _batch_norm_with_update shape function (#122430 ) Summary: We used `native_batch_norm`'s shape function before, but the schemas are actually different. We need to create new shape functions for `_batch_norm_with_update` specifically. Test Plan: buck2 test '@fbcode//mode/opt-tsan' fbcode//caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - TestShapeGraphLinting.Basic' Reviewers: bdhirsh, davidberard98, eellison Differential Revision: [D55211182](https://our.internmc.facebook.com/intern/diff/D55211182) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122430 Approved by: https://github.com/eellison, https://github.com/bdhirsh	2024-03-22 14:21:57 +00:00
haozhe.zhu	23a6d74f93	[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardTanh with int8-mix-bf16 (#122374 ) Summary Enable the fusion pattern of `QConv2d -> hardtanh` lowering for int8-mixed-bf16 case. Test Plan ``` python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardtanh_int8_mixed_bf16_cpu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122374 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5 ghstack dependencies: #122266, #122267, #122268, #122373	2024-03-22 13:13:14 +00:00
PyTorch MergeBot	f65373e278	Revert "Factor meta conversion through serializable MetaTensorDesc (#122044 )" This reverts commit e2d89e970480d7e5b10a77928442d8caf94e0e84. Reverted https://github.com/pytorch/pytorch/pull/122044 on behalf of https://github.com/jeanschmidt due to Seems that some landrace caused this PR to break lint ([comment](https://github.com/pytorch/pytorch/pull/122044#issuecomment-2015025490))	2024-03-22 12:46:21 +00:00
Kai Londenberg	700c92e1b9	[Inductor Cutlass backend] GEMM size threshold for Cutlass backend usage (#121491 ) * Adds a configurable GEMM size threshold for the usage of Cutlass GEMM Kernels _inductor.config.cutlass_backend_min_gemm_size * During GEMM algorithm choice generation: if no viable choices can be generated using the configured backends, the ATen backend will be used as a fallback backend, even if it is not enabled in _inductor.config.max_autotune_gemm_backends Test plan: CI Additional unit test in test_cutlass_backend.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/121491 Approved by: https://github.com/jansel ghstack dependencies: #121490	2024-03-22 10:58:43 +00:00
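The selection rule described in #121491 can be sketched as follows. This is a minimal illustration under stated assumptions: `gemm_backend_choices` is a hypothetical function, backends are represented as plain strings, and "GEMM size" is assumed here to be the product `m * n * k`, which the commit message does not specify.

```python
from typing import List, Sequence


def gemm_backend_choices(
    m: int,
    n: int,
    k: int,
    enabled_backends: Sequence[str],
    cutlass_min_gemm_size: int,
) -> List[str]:
    """Return the backends that may generate choices for an (m, n, k) GEMM."""
    size = m * n * k  # assumption: how the commit measures size is not stated
    choices = []
    for backend in enabled_backends:
        if backend == "CUTLASS" and size < cutlass_min_gemm_size:
            continue  # problem too small for Cutlass kernels
        choices.append(backend)
    if not choices:
        # Fallback from the commit message: use ATen even if it is not enabled.
        choices.append("ATEN")
    return choices


print(gemm_backend_choices(8, 8, 8, ["CUTLASS"], cutlass_min_gemm_size=4096))
print(gemm_backend_choices(1024, 1024, 1024, ["CUTLASS"], cutlass_min_gemm_size=4096))
```

The first call falls back to ATen because the only enabled backend is filtered out by the threshold; the second keeps Cutlass.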
chilli	d34514f8db	Renamed mutationlayout/aliasedlayout (#122474 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122474 Approved by: https://github.com/jansel ghstack dependencies: #121624	2024-03-22 08:32:14 +00:00
chilli	eca30df846	Added load_args to repro (#121624 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121624 Approved by: https://github.com/ezyang	2024-03-22 08:32:14 +00:00
haozhe.zhu	783fd89ff1	[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardSwish with int8-mix-bf16 (#122373 ) Summary Enable the fusion pattern of `QConv2d -> hardswish` lowering for int8-mixed-bf16 case. Test Plan ``` python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardswish_int8_mixed_bf16_cpu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122373 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5 ghstack dependencies: #122266, #122267, #122268	2024-03-22 08:17:57 +00:00
haozhe.zhu	99f0fec7d0	[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op SiLU (#122268 ) Summary Enable the fusion pattern of `QConv2d -> silu` lowering to `swish` as `QConv2d` post operator. Test Plan ``` python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_cpu python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_int8_mixed_bf16_cpu python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_silu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122268 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5 ghstack dependencies: #122266, #122267	2024-03-22 08:15:28 +00:00
Jason Ansel	bb75313f0a	[dynamo] Optimize handling of BINARY_OP (#122465 ) This saves ~0.1s on https://dev-discuss.pytorch.org/t/a-torchdynamo-trace-time-ablation-study/1961 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122465 Approved by: https://github.com/oulgen	2024-03-22 08:14:58 +00:00
haozhe.zhu	2c6eeb26d3	[Quant] [PT2] Add SiLU into X86InductorQuantizer Conv2d Unary Annotation (#122267 ) Summary Add `SiLU` into X86InductorQuantizer Conv2d Unary Annotation Test Plan ``` python -m pytest test_x86inductor_quantizer.py -k test_conv2d_unary python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_unary ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122267 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5 ghstack dependencies: #122266	2024-03-22 08:12:23 +00:00
Kai Londenberg	6bbd697306	[Inductor] Make codecache CUDA compilation more robust & flexible (#121490 ) Minor changes which make the CUDA compilation within _inductor/codecache.py more robust and flexible. Test plan: CI Additional test in test_codecache.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/121490 Approved by: https://github.com/jansel	2024-03-22 08:12:11 +00:00
haozhe.zhu	a337ee0a3a	[Quant] Enable QConv2d with silu post op (#122266 ) Summary Enable QConv2d implementation with post op `silu` Test Plan ``` python -m pytest test_quantized_op.py -k test_qconv2d_silu_pt2e ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122266 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5	2024-03-22 07:58:45 +00:00
arunppsg	b78e8c0d37	remove duplicate method run_subtests (#122421 ) Fixes #121654 I have removed the duplicate test `run_subtests` from `common_dtensor.py` and `common_fsdp.py` and moved it to `common_distributed.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122421 Approved by: https://github.com/soulitzer	2024-03-22 07:00:49 +00:00
Tobias Ringwald	6ba85cfc2a	Fixed memory leak in Python dispatcher w.r.t. THPDevice. (#122439 ) Fixes the memory leak reported in #122417. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122439 Approved by: https://github.com/soulitzer	2024-03-22 06:44:12 +00:00
Oguz Ulgen	3600778ede	Do not create a new node if no normalization is needed (#122330 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122330 Approved by: https://github.com/jansel	2024-03-22 05:51:28 +00:00
Edward Z. Yang	e2d89e9704	Factor meta conversion through serializable MetaTensorDesc (#122044 ) Fixes https://github.com/pytorch/pytorch/issues/121085 This PR is pretty involved so pay attention to this description. At a high level, the refactor is intended to be mechanical: anywhere in MetaConverter where previously we took a Tensor as argument, we now take a MetaTensorDesc, which contains all of the information that we would have queried off of the Tensor, but placed into a separate data structure which we can serialize or use to recreate a fake tensor in a separate fake tensor mode in exact fidelity to the original. However, this transformation is not always entirely mechanical. Here is what you need to pay attention to: - The memo table from real Tensor -> meta/fake Tensor is now broken into two memo tables: real Tensor -> stable int id -> meta/fake Tensor. The stable int id is needed so that when we do serialization, we know when tensors/storages alias each other and can ensure we preserve this aliasing upon deserialization. The way I have implemented it changes the weak reference behavior. Previously, when either the real Tensor OR the meta/fake Tensor went dead, we would remove the entry from the memo table. Now, this only removes entries from one of the two memo tables. This semantically makes sense, because the user may have held on to the stable int id out of band, and may expect a real Tensor to continue to be numbered consistently / expect to be able to lookup a meta/fake tensor from this id. If this is unacceptable, it may be possible to rejigger the memo tables so that we have real Tensor -> stable int id and real Tensor -> meta/fake Tensor, but TBH I find the new implementation a lot simpler, and arranging the memo tables in this way means that I have to muck around with the real tensor to save to the memo table; in the current implementation, I never pass the Tensor to meta_tensor function AT ALL, which means it is impossible to accidentally depend on it. 
- When I fill in the fields of MetaTensorDesc in describe_tensor, I need to be careful not to poke fields when they are not valid. Previously, preconditions were implicitly checked via the conditional structure ("is this sparse? is this nested?") that is tested before we start reading attributes. This structure has to be replicated in describe_tensor, and I have almost assuredly gotten it wrong on my first try (I'll be grinding through it on CI; a careful audit will help too, by auditing that I've tested all the same conditionals that the original access was guarded by.) - I originally submitted https://github.com/pytorch/pytorch/pull/121821 for the symbolic shapes change, but it turned out the way I did it there didn't actually work so well for this PR. I ended up just inlining the symbolic shapes allocation logic into MetaConverter (look for calls to maybe_specialize_sym_int_with_hint), maybe there is a better way to structure it, but what I really want is to just read sizes/strides/offset directly off of MetaTensorDesc; I don't want another intermediate data structure. - Some fields aren't serializable. These are documented as "NOT serializable". ctx/type should morally be serializable and I just need to set up a contract with subclasses to let them be serialized. The fake_mode is used solely to test if we are refakefying with a pre-existing ShapeEnv and we want to reuse the SymInt directly--serializing this case is hopeless but I am kind of hoping after this refactor we do not need this at all. view_func is not serializable because it's a bound C implemented method. Joel has promised me that this is not too difficult to actually expose as a true data structure, but this is the edgiest of edge cases and there is no reason to deal with it right now. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044 Approved by: https://github.com/eellison ghstack dependencies: #122018	2024-03-22 03:56:34 +00:00
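The two-table memo scheme described in #122044 (real Tensor -> stable int id -> meta/fake Tensor) can be sketched with the standard `weakref` module. This is an illustration under assumptions, not the MetaConverter code: `Real`, `Fake`, and `TwoStageMemo` are hypothetical names, and plain Python objects stand in for tensors.

```python
import gc
import itertools
import weakref


class Real:
    """Hypothetical stand-in for a real tensor (must be weak-referenceable)."""


class Fake:
    """Hypothetical stand-in for a meta/fake tensor."""

    def __init__(self, desc_id):
        self.desc_id = desc_id


class TwoStageMemo:
    def __init__(self):
        self._next_id = itertools.count()
        # Table 1: real object -> stable int id; entry dies with the real object.
        self.real_to_id = weakref.WeakKeyDictionary()
        # Table 2: stable int id -> fake object; entry dies with the fake object.
        self.id_to_fake = weakref.WeakValueDictionary()

    def get_id(self, real):
        if real not in self.real_to_id:
            self.real_to_id[real] = next(self._next_id)
        return self.real_to_id[real]

    def get_fake(self, real):
        desc_id = self.get_id(real)
        fake = self.id_to_fake.get(desc_id)
        if fake is None:
            # The fake is built from the id alone, never from `real`.
            fake = Fake(desc_id)
            self.id_to_fake[desc_id] = fake
        return fake


memo = TwoStageMemo()
real = Real()
fake = memo.get_fake(real)
stable_id = memo.get_id(real)
same_fake = memo.get_fake(real) is fake  # repeated lookups hit the memo

# Dropping the fake clears only the second table; the id stays stable.
del fake
gc.collect()
fake_entry_gone = stable_id not in memo.id_to_fake
id_still_stable = memo.get_id(real) == stable_id
print(same_fake, fake_entry_gone, id_still_stable)
```

As in the commit message, the death of one side removes an entry from only one of the two tables, so a caller that kept the stable int id can still rely on the real object being numbered consistently.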
cyy	ecbe82b9ce	Change ATEN generator argument type to const std::optional<Generator>& (#120076 ) This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076 Approved by: https://github.com/malfet	2024-03-22 03:49:31 +00:00
PyTorch UpdateBot	ef0d470eb3	[vision hash update] update the pinned vision hash (#122453 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122453 Approved by: https://github.com/pytorchbot	2024-03-22 03:37:11 +00:00
angelayi	fb57d1699b	[export] Fix handling output in remove_effect_tokens_pass (#122357 ) Added handling for updating the output_spec in the graph signature if the result of a with_effects call is an output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122357 Approved by: https://github.com/zhxchen17	2024-03-22 03:35:59 +00:00
Feng Yuan	09eb07bee8	Introduce XPU implementation for PyTorch ATen operators (#120891 ) As a follow-up to #114835 and #119682, we add limited ATen operators implementation for XPU. With this PR, the blocking issue for oneDNN operations and Inductor XPU backend will be resolved as the two components depend on these operations to support their basic features, respectively. The added ATen operators include: - `copy_`, `_to_copy`, `_copy_from_and_resize`, `clone` - `view`, `view_as_real`, `view_as_complex`, - `as_strided`, `_reshape_alias`, `resize_`, `resize_as_`, - `add`/`add_`, `sub`/`sub_`, `mul`/`mul_`, `div`/`div_`, `abs`, - `empty`, `empty_strided`, - `fill_`, `zeros_`. Co-authored-by: Wang, Eikan <eikan.wang@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120891 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/atalman	2024-03-22 03:31:04 +00:00
Adnan Akhundov	e419011471	[inductor] Add torch.while_loop support to JIT Inductor (#122069 ) Summary: `torch.while_loop` HOP support is added to JIT Inductor. The test coverage is limited due to the functionality constraints of the upstream `torch.while_loop` op in Dynamo / Export. When those are lifted, we'll add more tests (see TODO-s in the test file). AOT Inductor support will be added in a follow-up PR. Test Plan: ``` $ python test/inductor/test_control_flow.py ... ---------------------------------------------------------------------- Ran 38 tests in 159.387s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122069 Approved by: https://github.com/jansel, https://github.com/eellison	2024-03-22 02:45:27 +00:00
PyTorch MergeBot	5e0440edb4	Revert "Optimize multi_tensor_apply (take 2) (#119764 )" This reverts commit 0b68a28c87df2c6eb2cf530be4659b5a2f8a95b0. Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm job in trunk `0b68a28c87`. Please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2014190124))	2024-03-22 02:18:28 +00:00
Joel Schlosser	470b44c048	Support for torch.nested.as_nested_tensor(t) (#113280 ) This PR adds support for tensor inputs to `as_nested_tensor()`. The tensor is treated as a batch of consistently-sized constituents. It utilizes `_nested_view_from_values_offsets()` to return a real view that allows for propagating gradients into inputs. Co-authored-by: voznesenskym <voznesenskym@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113280 Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer ghstack dependencies: #113279	2024-03-22 02:12:37 +00:00
Joel Schlosser	cd6bfc7965	Proper view support for jagged layout NestedTensor (#113279 ) This PR: * Introduces an ATen op for creating true jagged views from a dense values buffer * `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)` * This op is implemented on the Python side using torch.library so we can return a subclass instance * `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer` * The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()` * `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view * Introduces an ATen op for accessing the `values` component of an NT via a view * `_nested_get_values(nt)` * Removes the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively. * Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly * Similarly, avoid `buffer_from_jagged()`, preferring `values()` * Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack) With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling. Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922) Co-authored-by: voznesenskym <voznesenskym@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279 Approved by: https://github.com/ezyang	2024-03-22 02:12:36 +00:00
Flavio Sales Truzzi	bde22835c6	[PT2] - Guard oblivious on meta registrations (#122216 ) Summary: ``` [trainer0\|0]:Potential framework code culprit (scroll up for full backtrace): [trainer0\|0]: File "/mnt/xarfuse/uid-539346/56d4bb3d-seed-nspid4026531836_cgpid183208940-ns-4026531840/torch/_meta_registrations.py", line 5043, in scatter_gather_dtype_check [trainer0\|0]: if index.numel() != 0: ``` Test Plan: General CI. Reviewed By: ezyang Differential Revision: D54689183 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122216 Approved by: https://github.com/ezyang	2024-03-22 01:36:03 +00:00
BowenBao	4f93b3d958	[Dort] Reduce excessive warning to info (#122442 ) No need to warn when an op can be exported. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122442 Approved by: https://github.com/thiagocrepaldi	2024-03-22 01:09:33 +00:00
Flavio Sales Truzzi	a001b4b048	Inductor: Don't clamp views when the views come from split_with_sizes (#122149 ) Summary: Fixes #122126 `split_with_sizes` doesn't need clamping. Test Plan: Added test + CI Differential Revision: D55043320 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122149 Approved by: https://github.com/ezyang	2024-03-22 00:55:36 +00:00
Zhengxu Chen	b1fa0ce4aa	[export] build the infra to rollout predispatch export. (#122326 ) Test Plan: fbcode:caffe2/test/quantization:test_quantization fbcode:bolt/nn/executorch/backends/tests:qnn_test fbcode:on_device_ai/helios/compiler_tests/... fbcode:pyspeech/tests:pyspeech_utils_test_oss fbcode:caffe2/test:quantization_pt2e_qat fbcode:on_device_ai/Assistant/Jarvis/tests:test_custom_ops fbcode:modai/test:test_modai fbcode:executorch/exir/backend/test:test_partitioner Differential Revision: D55133846 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122326 Approved by: https://github.com/tugsbayasgalan	2024-03-22 00:55:10 +00:00
Huy Do	4b535906aa	Better handle test-config labels on PR (#122155 ) I have some minor fixes in the scripts to 1. Fix the bug where the empty test matrix was confusingly printed as unstable https://github.com/pytorch/pytorch/pull/121381#issuecomment-2004558588 1. Replace `print` with `logging.info` 1. Remove the hardcoded `VALID_TEST_CONFIG_LABELS` list. It's out of date and not many people use this feature besides `test-config/default`, so why bother. The behavior here is simpler now: 1. If the PR has some `test-config/*` labels, they will be applied 1. If the PR has none of them, all test configs are applied 1. Add log for the previous 2 cases to avoid confusion ### Testing ``` python filter_test_configs.py --workflow "Mac MPS" --job-name "macos-12-py3-arm64 / build" --event-name "push" --schedule "" --branch "" --tag "ciflow/mps/121381" \ --pr-number 121065 \ --test-matrix "{ include: [ { config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-stable" }, { config: "mps", shard: 1, num_shards: 1, runner: "macos-m2-14" }, ]} ``` Also running on this PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/122155 Approved by: https://github.com/clee2000	2024-03-21 23:20:52 +00:00
PyTorch MergeBot	bce640709c	Revert "Precompile triton templates (#121998 )" This reverts commit b8df2f0ca530ebe01fa079c891c170a1f4b22823. Reverted https://github.com/pytorch/pytorch/pull/121998 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is causing all ROCm trunk job to fail `b8df2f0ca5` ([comment](https://github.com/pytorch/pytorch/pull/121998#issuecomment-2014003037))	2024-03-21 23:05:59 +00:00
Thiago Crepaldi	c4486d3e88	Allow fake models to run with ONNXProgram.__call__ (#122230 ) In order for a fake model to run using the ONNXProgram.__call__ interface, we need to save the model to disk along with external data before executing the model. This is what this PR implements. An alternative is for ONNXProgram.__call__ to detect that the model was exported with fake mode and explicitly raise an exception when ONNXProgram.__call__ is executed. The exception message would instruct the user to call ONNXProgram.save and manually execute the model using the ONNX runtime of choice. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122230 Approved by: https://github.com/BowenBao ghstack dependencies: #122196	2024-03-21 22:28:05 +00:00
drisspg	4ba51bb2c4	Add keys used for templated attention impls (#122423 ) # Summary Mypy will complain that these attributes don't exist for this PR: https://github.com/pytorch/pytorch/pull/121845/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/122423 Approved by: https://github.com/bdhirsh	2024-03-21 22:16:53 +00:00
PyTorch MergeBot	224beecee6	Revert "Proper view support for jagged layout NestedTensor (#113279 )" This reverts commit 5855c490f09a028bfdfefea8b93c9833eb55dc5c. Reverted https://github.com/pytorch/pytorch/pull/113279 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/113279#issuecomment-2013899762))	2024-03-21 22:03:01 +00:00
PyTorch MergeBot	12e7602cf9	Revert "Support for torch.nested.as_nested_tensor(t) (#113280 )" This reverts commit 17c9c7026521be1c194cae278b76ac8e8f7d145b. Reverted https://github.com/pytorch/pytorch/pull/113280 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/113280#issuecomment-2013893099))	2024-03-21 22:00:44 +00:00
PyTorch MergeBot	816db3bd29	Revert "Public API for NJT construction from jagged components (#121518 )" This reverts commit d4dff9cf5e7b734a8621b571e8f5a761dc43e1e0. Reverted https://github.com/pytorch/pytorch/pull/121518 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/121518#issuecomment-2013879641))	2024-03-21 21:56:29 +00:00
Peter Bell	48afb5c325	[inductor] Use python constants in IndexPropagation (#122031 ) In the next PR I have the IR `ops.neg(ops.constant(0.0, torch.float32))` which should be folded to `ops.constant(-0.0, torch.float32)` but it seems that `sympy.Float(-0.0)` doesn't respect the sign of the zero and so we instead get a positive zero constant. Here, I work around this by doing the constant folding with python arithmetic which does respect signed zeros. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122031 Approved by: https://github.com/lezcano	2024-03-21 21:53:22 +00:00
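Note on #122031 above: the signed-zero behavior that the message relies on can be checked in plain Python. This is a minimal sketch using only the standard library; `fold_neg` is a hypothetical helper for illustration and is not the Inductor code.

```python
import math


def fold_neg(value: float) -> float:
    """Constant-fold a negation with ordinary Python float arithmetic."""
    return -value


folded = fold_neg(0.0)

# -0.0 and 0.0 compare equal, so the comparison alone cannot tell them apart.
compares_equal = folded == 0.0

# copysign transfers the sign bit of `folded` onto 1.0, exposing the sign of zero.
sign = math.copysign(1.0, folded)
```

Because equality hides the sign, a test for this kind of folding has to inspect the sign bit (for example via `math.copysign`) rather than compare against `0.0`.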
angelayi	99055ae165	[aoti] Fix compilation bug for buffer mutations (#121688 ) I realized there's a bug when unlifting buffer mutations in AOTI. However there seems to be a bug during tracing where AOTI mutates the buffer. I didn't take the time to investigate, so I left it as a TODO for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121688 Approved by: https://github.com/chenyang78, https://github.com/bdhirsh	2024-03-21 21:51:32 +00:00
rzou	332456c44d	triton_kernel_wrap shouldn't access FakeTensor.data_ptr (#122418 ) The comment suggests that we need to replace all FakeTensors with real tensors. `torch.empty` doesn't actually return a real Tensor because FakeTensorMode is active! We disable torch dispatch so that torch.empty actually returns a real Tensor. The motivation for this PR is that we're trying to ban FakeTensor.data_ptr (or at least warn on it) in torch.compile. See the next PR up in the stack Test Plan: - Existing tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122418 Approved by: https://github.com/oulgen	2024-03-21 21:48:07 +00:00
rzou	621fdc9db8	infer_schema can add alias annotations when passed a list of mutated args (#122343 ) Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/122343 Approved by: https://github.com/ezyang ghstack dependencies: #122319, #122320	2024-03-21 21:39:07 +00:00
rzou	639d6201b4	Expand the types infer_schema can infer (#122320 ) This PR allows it to infer: - None return as () - List[Tensor] as Tensor[] Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/122320 Approved by: https://github.com/ezyang, https://github.com/soulitzer ghstack dependencies: #122319	2024-03-21 21:39:07 +00:00
rzou	0dd78f1828	Add standalone tests for infer_schema (#122319 ) We're gonna reuse this helper in the new python custom ops API. Given a function with type annotations, `infer_schema(fun)` returns an inferred schema. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/122319 Approved by: https://github.com/ezyang, https://github.com/soulitzer	2024-03-21 21:39:04 +00:00
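Note on the `infer_schema` stack above (#122319, #122320): the idea of turning type annotations into a schema string can be sketched as follows. Everything here is an assumption for illustration: `toy_infer_schema`, the stand-in `Tensor` class, and the type table are hypothetical and are not the torch helper. Only the two mappings "None return as ()" and "List[Tensor] as Tensor[]" come from the commit message.

```python
import inspect
import typing
from typing import List


class Tensor:
    """Stand-in for torch.Tensor so the sketch runs without torch."""


# Hypothetical annotation -> schema type table (assumed, not torch's).
_TYPE_NAMES = {
    Tensor: "Tensor",
    List[Tensor]: "Tensor[]",
    int: "int",
    float: "float",
    bool: "bool",
}


def toy_infer_schema(fn) -> str:
    """Build a schema-like string from a fully annotated function."""
    hints = typing.get_type_hints(fn)  # a `None` annotation becomes NoneType
    params = []
    for name in inspect.signature(fn).parameters:
        # Raises KeyError for unannotated or unsupported parameters.
        params.append(f"{_TYPE_NAMES[hints[name]]} {name}")
    ret = hints["return"]
    ret_str = "()" if ret is type(None) else _TYPE_NAMES[ret]
    return f"({', '.join(params)}) -> {ret_str}"


def scale_all(x: Tensor, ys: List[Tensor], alpha: float) -> None:
    ...


schema = toy_infer_schema(scale_all)
```

The sketch deliberately fails on unannotated parameters instead of guessing a type.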
William Wen	23524710e6	[dynamo] use proxies to nn.Module in dynamo generated GraphModules (#120756 ) Fixes remaining refleaks found when debugging https://github.com/pytorch/pytorch/issues/119607, tests added in https://github.com/pytorch/pytorch/pull/120657. Also fixes some tests that xfail: https://github.com/pytorch/pytorch/issues/120631 (not entirely sure why), but introduced tests now fail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120756 Approved by: https://github.com/jansel	2024-03-21 21:23:12 +00:00
Kai Londenberg	2cd0a5d516	[Inductor] Fix for WrapperCodeGen.statically_known_int_or_none (#121808 ) There's obviously a small typo in WrapperCodeGen.statically_known_int_or_none, where the return value of a call to V.graph._shape_env._maybe_evaluate_static is being discarded. This fix changes that to work how it was likely intended to. Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/121808 Approved by: https://github.com/lezcano, https://github.com/jansel, https://github.com/aakhundov	2024-03-21 21:15:32 +00:00
PyTorch MergeBot	968c4c4154	Revert "Refactor gpu trace to be device-agnostic (#121794 )" This reverts commit 74deacbf31d032a2659dc1633dc3e5248921d466. Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks ROCm jobs in trunk `74deacbf31`, please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013674083))	2024-03-21 20:33:17 +00:00
PyTorch MergeBot	13afbcfc85	Revert "Support gpu trace on XPU (#121795 )" This reverts commit 91ead3eae4cd6cbf50fe7a7b4a2f9f35302bc9b2. Reverted https://github.com/pytorch/pytorch/pull/121795 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks ROCm jobs in trunk `74deacbf31`, please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013674083))	2024-03-21 20:33:16 +00:00
PyTorch MergeBot	182bb0f2ca	Revert "Introduce XPU implementation for PyTorch ATen operators (#120891 )" This reverts commit 148a8de6397be6e4b4ca1508b03b82d117bfb03c. Reverted https://github.com/pytorch/pytorch/pull/120891 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert it to resolve a conflict in trunk https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013434523. Please help reland the change after ([comment](https://github.com/pytorch/pytorch/pull/120891#issuecomment-2013668563))	2024-03-21 20:30:20 +00:00
Bin Bao	628dcde136	[AOTI] Disable stack allocation when there is a fallback op (#122367 ) Summary: Stack allocation is disabled when there is an aten fallback op, see `c84f81b395/torch/_inductor/codegen/cpp_wrapper_cpu.py (L974)`. But we need to do the same when there is a custom op fallback. Test Plan: CI Reviewed By: mikekgfb Differential Revision: D55149369 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122367 Approved by: https://github.com/mikekgfb	2024-03-21 20:02:33 +00:00
ydwu4	af9b71c82f	fix typo in while_loop_test (#122416 ) As titled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122416 Approved by: https://github.com/angelayi	2024-03-21 19:42:08 +00:00
Yifu Wang	d131cbc44f	Fuse the input -> p2p buffer copy into one-shot all-reduce kernel when the input is small (#121213 ) This improves the gpt-fast llama2 70B 8xH100 (non-standard) TP benchmark from 86 tok/s to 88 tok/s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121213 Approved by: https://github.com/Chillee	2024-03-21 18:25:57 +00:00
Abhishek Jindal	765c3fc138	fix breaking changes for ONNX Runtime Training (#122000 ) Fixes breaking changes for ONNX Runtime Training. PR https://github.com/pytorch/pytorch/pull/121102 introduced incompatibility with ORT training because of change in parameter type. Creating a PR to add previous parameter types and verified that it works with ORT training. Error with current scenario: ``` site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/aten_op_executor/aten_op_executor.cc:60:40: error: invalid conversion from ‘const DLManagedTensor’ to ‘DLManagedTensor’ [-fpermissive] at::Tensor tensor = at::fromDLPack(dlpack); site-packages/torch/include/ATen/DLConvertor.h:15:46: note: initializing argument 1 of ‘at::Tensor at::fromDLPack(DLManagedTensor)’ TORCH_API Tensor fromDLPack(DLManagedTensor src); ``` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122000 Approved by: https://github.com/malfet	2024-03-21 18:10:22 +00:00
Edward Z. Yang	c2651a7f0e	Make check_is_size clamp to sys.maxsize - 1, so sys.maxsize comparison returns False (#122372 ) Partially fixes https://github.com/pytorch/pytorch/issues/113002 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122372 Approved by: https://github.com/lezcano ghstack dependencies: #122370	2024-03-21 17:14:42 +00:00
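Note on #122372 above: the effect of clamping an upper bound to `sys.maxsize - 1` can be illustrated with a toy value range. This is a sketch under stated assumptions: `ToyRange` and `toy_size_range` are hypothetical names, and the real change lives in PyTorch's symbolic shapes machinery, which is not reproduced here.

```python
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyRange:
    """Inclusive integer interval [lower, upper] known for a symbol."""

    lower: int
    upper: int

    def statically_eq(self, value: int) -> Optional[bool]:
        """Return False if `value` lies outside the range, else None (unknown)."""
        if value < self.lower or value > self.upper:
            return False
        return None


def toy_size_range() -> ToyRange:
    # Assumed model of the clamp: sizes are non-negative and at most maxsize - 1.
    return ToyRange(0, sys.maxsize - 1)


size = toy_size_range()
eq_maxsize = size.statically_eq(sys.maxsize)
eq_small = size.statically_eq(3)
```

With the upper bound one below `sys.maxsize`, an equality test against `sys.maxsize` is decided statically, while a test against an ordinary value such as 3 stays undecided.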
Edward Z. Yang	780f70b728	Make expected stride test in torch._prims_common size oblivious (#122370 ) Partially addresses https://github.com/pytorch/pytorch/issues/113002 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122370 Approved by: https://github.com/lezcano	2024-03-21 17:14:42 +00:00
PyTorch MergeBot	25bf5f7e61	Revert "Enable x86 CPU vectorization on windows [submodule sleef] (#118980 )" This reverts commit aa74a8b9e5b34eaa700a64064818adc7a12942ca. Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/huydhn due to Sorry for revert your change one more time but the hard part is that it breaks lot of internal builds ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-2013043364))	2024-03-21 17:07:17 +00:00
eellison	b8df2f0ca5	Precompile triton templates (#121998 ) Before this PR we were not precompiling triton templates in parallel. Compilation would occur during benchmarking. Triton benchmarking templates were emitted as : ``` @triton.jit def triton_mm(arg_A, arg_B, out_ptr0): ``` In order to precompile we need to give the full kernel specification, as we do when we emit the template in the final output code generation. ``` @triton_heuristics.template( num_stages=3, num_warps=8, triton_meta={'signature': {0: 'fp32', 1: 'fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]}, inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': 'cdeecfeccd31ad7810f96b5752194b1c2406d0a81e39a6ca09c8ee150baae183'}, ) @triton.jit def triton_mm(arg_A, arg_B, out_ptr0): ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121998 Approved by: https://github.com/jansel ghstack dependencies: #121996, #120275, #121997	2024-03-21 17:04:53 +00:00
Sahdev Zala	17175cdbc7	[Docs] Add extended debugging options for troubleshooting (#122028 ) Fixes #120889 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122028 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-03-21 17:00:45 +00:00
Pian Pawakapan	c20bc18d59	[export] allow static constraints in dynamic_shapes (#121860 ) This PR allows users to specify int values for dimensions in dynamic_shapes as well as None, for example: ``` class Foo(torch.nn.Module): def forward(self, x, y, z): ... foo = Foo() inputs = (torch.randn(4, 6), torch.randn(5, 4), torch.randn(3, 3)) for dynamic_shapes in [ None, ((4, 6), (5, 4), (3, 3)), ((None, 6), None, {0: 3, 1: 3}) ]: _ = export(foo, inputs, dynamic_shapes=dynamic_shapes) ``` All of the above should produce the same ExportedProgram. This is done by temporarily creating a static dim constraint during analysis, where vr.lower == vr.upper. These constraints are then deleted during _process_constraints(), and do not show up in the final ExportedProgram's range_constraints. Additionally, export() will also fail if the shapes are mis-specified, for example: ``` _ = export(foo, inputs, dynamic_shapes=((5, None), None, None)) ``` leads to `torch._dynamo.exc.UserError: Static shape constraint of 5 does not match input size of 4, for L['x'].size()[0]` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121860 Approved by: https://github.com/avikchaudhuri	2024-03-21 16:59:59 +00:00
Christian Puhrsch	16935de961	Support alias for NestedTensorCPU/CUDA (#117711 ) Fixes #ISSUE_NUMBER Co-authored-by: Vincent Moens <vmoens@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/117711 Approved by: https://github.com/albanD	2024-03-21 16:05:52 +00:00
Feng Yuan	148a8de639	Introduce XPU implementation for PyTorch ATen operators (#120891 ) As a follow-up to #114835 and #119682, we add limited ATen operators implementation for XPU. With this PR, the blocking issue for oneDNN operations and Inductor XPU backend will be resolved as the two components depend on these operations to support their basic features, respectively. The added ATen operators include: - `copy_`, `_to_copy`, `_copy_from_and_resize`, `clone` - `view`, `view_as_real`, `view_as_complex`, - `as_strided`, `_reshape_alias`, `resize_`, `resize_as_`, - `add`/`add_`, `sub`/`sub_`, `mul`/`mul_`, `div`/`div_`, `abs`, - `empty`, `empty_strided`, - `fill_`, `zeros_`. Co-authored-by: Wang, Eikan <eikan.wang@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120891 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/atalman	2024-03-21 15:42:20 +00:00
Thiago Crepaldi	204fd69ca6	Make ONNXProgram.model_proto and disk file the same (#122196 ) Currently, the in-memory ONNX program model proto does not contain the initializers saved into the disk version. This PR changes this behavior, so that both versions are identical. This is important for running models with fake tensors from ONNXProgram.model_proto directly, without a file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122196 Approved by: https://github.com/BowenBao	2024-03-21 15:29:31 +00:00
Nikita Shulga	f9996ed764	[BE] Enable torch inductor tests running on MacOS (#122360 ) The original idea was to limit the testing to just x86 Macs, but right now it will be skipped on all Apple Silicon ones, as all of them have a Metal-capable GPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122360 Approved by: https://github.com/jansel	2024-03-21 14:47:05 +00:00
Adnan Akhundov	456b112dca	[inductor] Support non-Tensor predicate in torch.cond (#122378 ) Summary: Previously, we only supported torch.Tensor boolean scalar predicate in `torch.cond` in Inductor. This PR adds support for SymBool and Python bool predicate, to match the `torch.cond` [semantics](https://pytorch.org/docs/stable/generated/torch.cond.html) in Dynamo / Export. Test Plan: ``` $ python test/inductor/test_control_flow.py ... ---------------------------------------------------------------------- Ran 34 tests in 56.980s OK $ python test/inductor/test_aot_inductor.py -k test_cond ... ---------------------------------------------------------------------- Ran 54 tests in 460.093s OK (skipped=4) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122378 Approved by: https://github.com/jansel, https://github.com/chenyang78	2024-03-21 14:35:01 +00:00
Yifu Wang	0b68a28c87	Optimize multi_tensor_apply (take 2) (#119764 ) ### Take 2 The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153: - Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication. - Ensure the optimization is compatible with cuda graph. ### Summary Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops. Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach: - When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments. - Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel. This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`. ### Benchmark (WIP) The only benchmark I've conducted so far on `_foreach_copy_` on a set of sizes that resembles internal workload. I need to benchmarks on more problem sizes. The speedup should vary among problem sizes. However, I believe this PR should not be slower than the previous impl on any problem sizes. The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa). 
Baseline A single iteration in trace: <img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json device ms: 1.111, cpu ms: 7.151 memory bandwidth: 1169.825 GB/s ``` This PR A single iteration in trace: <img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json device ms: 0.892, cpu ms: 0.810 memory bandwidth: 1456.744 GB/s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764 Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar	2024-03-21 11:53:31 +00:00
PyTorch MergeBot	0d8e960f74	Revert "[Sparsity] add support for H100 compute capability 9.x (#121768 )" This reverts commit 91fdaa1b416ab8ac8be30f3c3428751e236657cd. Reverted https://github.com/pytorch/pytorch/pull/121768 on behalf of https://github.com/jeanschmidt due to Agreed on reverting and fixing rocm tests ([comment](https://github.com/pytorch/pytorch/pull/121768#issuecomment-2011893826))	2024-03-21 10:42:08 +00:00
cyy	7f8bb1de83	[Dynamo][2/N] Fix clang-tidy warnings in torch/csrc/dynamo/* (#122362 ) This PR continues to clean clang-tidy warnings in torch/csrc/dynamo/*, following #122259 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122362 Approved by: https://github.com/ezyang	2024-03-21 09:41:41 +00:00
Shuqiang Zhang	ea1cd31b50	[c10d] Log the target of FR dump (#122345 ) Summary: It would be useful to log the destination of the trace dump in either manifold or local file for the users to quickly locate the dump Test Plan: Modified unit tests Differential Revision: D54972069 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122345 Approved by: https://github.com/wconstab	2024-03-21 08:03:05 +00:00
Michael Lazos	365e89a591	Add tensor step to adadelta (#122252 ) Towards fixing https://github.com/pytorch/pytorch/issues/115679 Fixes Adadelta step update while compiling Pull Request resolved: https://github.com/pytorch/pytorch/pull/122252 Approved by: https://github.com/janeyx99	2024-03-21 07:28:47 +00:00
drisspg	7fa1be506b	Add an option to sdpa benchmark to specify backend (#122368 ) # Summary Adds the ability to specify sdpa backend Pull Request resolved: https://github.com/pytorch/pytorch/pull/122368 Approved by: https://github.com/cpuhrsch	2024-03-21 07:00:40 +00:00
eellison	18c164ef7c	[Inductor] Match insignficiant strides on outputs (#122239 ) Fix for https://github.com/pytorch/pytorch/issues/116433 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122239 Approved by: https://github.com/Chillee	2024-03-21 05:35:59 +00:00
Kurt Mohler	b915877deb	Support numpy array in `Tensor.__eq__` (#122249 ) When the `other` arg of `Tensor.__eq__` is a numpy array, it is converted to a PyTorch tensor view of the numpy array, which is then given as the `other` arg to a `Tensor.eq` call Fixes #119965 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122249 Approved by: https://github.com/ezyang	2024-03-21 04:55:01 +00:00
Shuqiang Zhang	bf18e967b4	[c10d] disable compute_duration by default (#122138 ) Summary: Compute duration would invoke additional cuda overhead and possibly GPU mem increase and possible hang, so we want to disable it by default and enable it only when needed, or at least when timing is enabled. Test Plan: Test with existing unit test Pull Request resolved: https://github.com/pytorch/pytorch/pull/122138 Approved by: https://github.com/wconstab	2024-03-21 04:45:37 +00:00
Bert Maher	ea6f67853e	[inductor fbcode] Add python include paths for Python.h (#122363 ) Summary: We're getting errors that Python.h is not found because we didn't have the proper include path set up for it. bypass-github-export-checks Test Plan: I can only get this to show up in Bento: N5106134 Reviewed By: hl475, chenyang78 Differential Revision: D55133110 Co-authored-by: Bert Maher <bertrand@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122363 Approved by: https://github.com/bertmaher	2024-03-21 04:32:17 +00:00
Joel Schlosser	d4dff9cf5e	Public API for NJT construction from jagged components (#121518 ) This PR introduces `torch.nested.nested_tensor_from_jagged(values, offsets=None, lengths=None, jagged_dim=1)` (bikeshedding welcome). This is intended to be the main entrypoint for getting an NJT from the `(values, offsets, lengths)` components. The returned NJT is a view of the `values` component. Note that `torch.nested.nested_tensor()` / `torch.nested.as_nested_tensor()` already exist for constructing an NJT from a list of tensors. TODO: * Some doc formatting; suggestions welcome there * Tests / examples using `jagged_dim != 1` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121518 Approved by: https://github.com/cpuhrsch ghstack dependencies: #113280	2024-03-21 04:14:17 +00:00
Joel Schlosser	17c9c70265	Support for torch.nested.as_nested_tensor(t) (#113280 ) This PR adds support for tensor inputs to `as_nested_tensor()`. The tensor is treated as a batch of consistently-sized constituents. It utilizes `_nested_view_from_values_offsets()` to return a real view that allows for propagating gradients into inputs. Co-authored-by: voznesenskym <voznesenskym@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113280 Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer	2024-03-21 04:13:55 +00:00
titaiwangms	77bed8f7f2	[ONNX] model_type flag is only supported under `SKIP_XFAIL_SUBTESTS` (#122336 ) Fixes #120918 To address the confusion that developers usually have on which list to put xfail and skip. This PR provides guidance that `model_type` and `matcher` specified xfail/skip should go to `SKIP_XFAIL_SUBTESTS`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122336 Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi	2024-03-21 04:10:32 +00:00
PyTorch UpdateBot	cc0cadaf4c	[vision hash update] update the pinned vision hash (#122154 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122154 Approved by: https://github.com/pytorchbot	2024-03-21 03:59:12 +00:00
PyTorch UpdateBot	61f69c7fc4	[audio hash update] update the pinned audio hash (#122153 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122153 Approved by: https://github.com/pytorchbot	2024-03-21 03:53:24 +00:00
Adnan Akhundov	885fb9742d	Handle special kwargs in user-written Triton kernel calls (#122280 ) Summary: Special kwargs like `num_warps`, `num_stages`, and `num_ctas` can be passed to the Triton kernel call as kwargs. These kwargs are handled in a special way, not being passed to the underlying kernel function directly. In this PR, we move those special kwargs from `kwargs` of the `TritonKernelVariable` in dynamo to `Autotuner`'s `Config` instances (either already existing or newly created for this purpose). As a result, the special kwargs can be codegened correctly as a part of `Config`, not as direct arguments to the kernel `.run`. Test Plan: ``` python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_kwargs ... ---------------------------------------------------------------------- Ran 6 tests in 6.783s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122280 Approved by: https://github.com/oulgen	2024-03-21 03:34:07 +00:00
titaiwang	3e6fdea390	[ONNX] Fix list dtype finding bug in dispatcher (#122327 ) Fixes #122166 Before this PR, the dispatcher assumed that the first input would provide a reasonable dtype, but `aten::index` reveals cases with `None` at the front of the inputs. The PR addresses this by taking the dtype from the first non-None input. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122327 Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi	2024-03-21 02:54:58 +00:00
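The selection rule described in #122327 can be sketched in plain Python. This is only an illustration of "take the dtype from the first non-None input", not the ONNX dispatcher's actual code: `FakeInput` and `first_dtype` are hypothetical names introduced here, and dtypes are modelled as plain strings.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FakeInput:
    """Hypothetical stand-in for a tensor-like argument that carries a dtype."""
    dtype: str


def first_dtype(inputs: Sequence[Optional[FakeInput]]) -> Optional[str]:
    """Return the dtype of the first input that is not None, or None if all are None."""
    for item in inputs:
        if item is not None:
            return item.dtype
    return None


# An aten::index-style argument list: leading None entries, then a real index.
indices = [None, None, FakeInput("int64")]
print(first_dtype(indices))
```

Looking only at `inputs[0]` would fail on `indices` above, which is the situation the commit message describes.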
Sherlock Huang	ae913175c3	Fix GraphModuleDeserializer (#122342 ) Summary: self.constants is used in self.deserialize_signature() Test Plan: CI Differential Revision: D55152971 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122342 Approved by: https://github.com/zhxchen17	2024-03-21 02:27:39 +00:00
Frank Lin	e9dcda5cba	Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 ) See #113541 The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality. cc @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068 Approved by: https://github.com/ezyang	2024-03-21 01:57:08 +00:00
Yu, Guangye	91ead3eae4	Support gpu trace on XPU (#121795 ) # Motivation Support GPU trace on XPU backend. Add GPU trace to xpu runtime. It is beneficial to generalize the device caching allocator in the next step. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121795 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD ghstack dependencies: #121794	2024-03-21 01:56:42 +00:00
Yu, Guangye	74deacbf31	Refactor gpu trace to be device-agnostic (#121794 ) # Motivation Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic and can be shared among the device backends. # Solution Move `_cuda_trace.py` to `_gpu_trace.py`, so that each device backend owns its own callbacks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794 Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui	2024-03-21 01:52:58 +00:00
Jing Shan	57734202c6	[HSTU][TGIF] Provide an API to check whether running in torch_dispatch mode (#122339 ) Summary: We provide an `is_in_torch_dispatch_mode` API returning `bool` to determine whether the program is running in torch dispatch mode or not. Test Plan: - OSS CI - Tested by publishing hstu models with this diff and the following diffs D54964288, D54964702, D54969677, D55025489; runtime errors are no longer raised during publish Differential Revision: D55091453 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122339 Approved by: https://github.com/jiayisuse	2024-03-21 01:37:23 +00:00
JackCaoG	e38d60bc07	Remove some stale xla dynamo backend (#122128 ) `torchxla_trace_once` and `aot_torchxla_trivial` should be removed. In our internal torchbench daily runs (hopefully the dashboard can be open-sourced soon), the `openxla` backend has a much higher passing rate than, and similar performance to, `openxla_eval` (the non-aot-autograd backend). We still use `openxla_eval` in the llama2 example, but I think we should move users to the `openxla` backend going forward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122128 Approved by: https://github.com/alanwaketan, https://github.com/jansel	2024-03-21 01:13:50 +00:00
Michael Lazos	c20cf97366	Move some cudagraphs checks into C++ (#122251 ) Based off of https://github.com/pytorch/pytorch/pull/111094 This + cpp guards improves TIMM geomean optimizer performance by about 20% Pull Request resolved: https://github.com/pytorch/pytorch/pull/122251 Approved by: https://github.com/eellison	2024-03-21 01:02:23 +00:00
James Pang	be5863de39	Remove usage of deprecated volatile (#122231 ) Summary: When building our iOS app, we get a compile error about the deprecated `volatile` keyword. This diff attempts to fix it by replacing the usage of the deprecated `volatile` keyword with `atomic` as suggested by malfet Test Plan: Successfully built the iOS app that previously had a compile error Differential Revision: D55090518 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122231 Approved by: https://github.com/malfet	2024-03-21 00:55:16 +00:00
Animesh Jain	1686e2d1e4	[symbolic shapes][compile-time] Minor compile time optimization in has_free_symbols (#122144 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122144 Approved by: https://github.com/lezcano ghstack dependencies: #120726	2024-03-21 00:48:57 +00:00
cyy	c2eedb7f8a	[Dynamo][1/N] Fix clang-tidy warnings in torch/csrc/dynamo/* (#122259 ) This PR begins a series of works to ensure dynamo C++ code is clang-tidy clean. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122259 Approved by: https://github.com/ezyang	2024-03-21 00:43:25 +00:00
PyTorch MergeBot	c80601f35a	Revert "Avoid COW materialize in conv, log sigmoid, repeat, group_norm, batch_norm (#121537 )" This reverts commit a2a88f39ee991f471f2a2c54571886d70f5cd2e6. Reverted https://github.com/pytorch/pytorch/pull/121537 on behalf of https://github.com/kurtamohler due to flaky CI failures ([comment](https://github.com/pytorch/pytorch/pull/121537#issuecomment-2010937226))	2024-03-21 00:03:30 +00:00
eqy	d5b5012dc4	[CUDA] Raise `softmax_forward_64bit_indexing` GPU memory requirement (#116075 ) printing `torch.cuda.memory_summary()` shows ~41GiB reserved at the end of this test, not sure how it was passing previously on CUDA. CC @ptrblck @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/116075 Approved by: https://github.com/ptrblck, https://github.com/malfet	2024-03-21 00:03:17 +00:00
Joel Schlosser	5855c490f0	Proper view support for jagged layout NestedTensor (#113279 ) This PR: * Introduces an ATen op for creating true jagged views from a dense values buffer * `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)` * This op is implemented on the Python side using torch.library so we can return a subclass instance * `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer` * The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()` * `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view * Introduces an ATen op for accessing the `values` component of an NT via a view * `_nested_get_values(nt)` * Removes the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively. * Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly * Similarly, avoid `buffer_from_jagged()`, preferring `values()` * Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack) With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling. Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922) Co-authored-by: voznesenskym <voznesenskym@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279 Approved by: https://github.com/ezyang	2024-03-20 23:45:34 +00:00
min-jean-cho	057892f4be	[CPU] optimize Lp norm for 1-dimensional vector (#122143 ) Fixes https://github.com/pytorch/pytorch/issues/120229 - Optimize vector norm by simplifying vector norm formula for 1-dimensional vector. - Vector norm formula for 1-dimensional vector simplifies to `abs(x)`. See below for proof. - Next step, we can similarly optimize matrix norm (`torch.linalg.matrix_norm`) for 1 x 1 matrix. - Additionally, avoids overflow in power, `abs(x) ** p` for large `p` or `x`, for 1-dimensional vector. ### Performance Avg Latency (ms) of `torch.norm` and `torch.linalg.vector_norm` for `torch.norm(torch.randn(2**18, 1), ord, -1)` `torch.linalg.vector_norm(torch.randn(2**18, 1), ord, -1)` Tested on 28 physical cores/socket, 1 socket on Skylake.

| op | input shape | dim | ord | baseline (master), avg latency (ms) | optimized (7102f1ef372b248414d36cbd0c51a546b6b6a41a), avg latency (ms) | speedup ratio (baseline/optimized) |
|---|---|---|---|---|---|---|
| torch.norm | (2**18, 1) | -1 | fro | 34.3755531 | 0.0125408 | 2741.094 |
| | | | inf | 34.0952635 | 0.0122237 | 2789.271 |
| | | | -inf | 34.3674493 | 0.0120759 | 2845.953 |
| | | | 0 | 34.1004515 | 0.0175261 | 1945.69 |
| | | | 1 | 34.1688442 | 0.0121593 | 2810.089 |
| | | | -1 | 33.949492 | 0.0120282 | 2822.487 |
| | | | 2 | 34.3669581 | 0.0120401 | 2854.366 |
| | | | -2 | 33.9252067 | 0.0121069 | 2802.139 |
| torch.linalg.vector_norm | (2**18, 1) | -1 | inf | 34.090879 | 0.0095105 | 3584.545 |
| | | | -inf | 34.3708754 | 0.0099111 | 3467.931 |
| | | | 0 | 34.0880775 | 0.0141716 | 2405.38 |
| | | | 1 | 34.1392851 | 0.0093174 | 3664.036 |
| | | | -1 | 33.925395 | 0.0092483 | 3668.302 |
| | | | 2 | 34.3854165 | 0.0092459 | 3719.002 |
| | | | -2 | 33.932972 | 0.0093007 | 3648.429 |

### Proof <details> <summary>For those interested :)</summary> <img width="382" alt="1_dim_vector_norm_proof1" src="https://github.com/pytorch/pytorch/assets/93151422/59b1e00b-8fcd-47cb-877d-d31403b5195b"> <img width="432" alt="1_dim_vector_norm_proof2" src="https://github.com/pytorch/pytorch/assets/93151422/236bea15-2dd5-480b-9871-58b2e3b24322"> </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122143 Approved by: https://github.com/lezcano	2024-03-20 23:20:25 +00:00
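The simplification in #122143 can be checked numerically without PyTorch. The sketch below is an illustration only, not the ATen kernel: `p_norm` and `p_norm_single` are hypothetical helpers, and the check is restricted to finite, nonzero `p`, where a one-element vector `[x]` has norm `(|x|^p)^(1/p) = |x|`.

```python
import math


def p_norm(values, p):
    """Naive p-norm for finite nonzero p: (sum of |v|**p) ** (1/p)."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def p_norm_single(x, p):
    """Shortcut for a one-element vector: the norm is |x| for finite nonzero p."""
    return abs(x)


x = -3.5
for p in (1, 2, 3, -1, -2):
    assert math.isclose(p_norm([x], p), p_norm_single(x, p))

# The shortcut also avoids the intermediate power, which can overflow.
big = 1e200
try:
    p_norm([big], 2)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed, p_norm_single(big, 2))
```

Skipping the power and the root is what removes both the cost and the overflow risk mentioned in the commit message.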
Xu Han	aa74a8b9e5	Enable x86 CPU vectorization on windows [submodule sleef] (#118980 ) Enable VEC on Windows OS. 1. Fix some type definition gaps between Windows and Linux. 2. Fix some operators not supported on Windows, such as [], /. 3. Enable static sleef library build on Windows. 4. Disable unsupported function overloading on MSVC. 5. Upgrade submodule sleef lib, which fixes a build issue on Windows. 6. Fix bazel build issues. 7. Fix test app not linking to sleef on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980 Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet	2024-03-20 22:41:13 +00:00
Thiago Crepaldi	666d6291af	Cast checkpoint weights to match model parameter's dtype (#122100 ) Fixes #121986 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122100 Approved by: https://github.com/BowenBao	2024-03-20 22:01:40 +00:00
ydwu4	2289fa5f5a	[while_loop] fix mode not on stack error (#122323 ) Fixes https://github.com/pytorch/pytorch/issues/121453. This is caused by missing `with mode` in FakeTensor mode. Test Plan: add new tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122323 Approved by: https://github.com/tugsbayasgalan ghstack dependencies: #122244	2024-03-20 21:17:33 +00:00
Pritam Damania	512251c8f3	Use tree_map to get device ids and device types for activation checkpointing (#121462 ) `get_device_states` doesn't recursively look into nested lists/dicts to find tensors. As a result, activation checkpointing for such inputs silently produces incorrect results, because `get_device_states` returns an empty result and therefore no RNG state is saved here: https://github.com/pytorch/pytorch/blob/main/torch/utils/checkpoint.py#L188 since `fwd_device_states` is empty. Fixed this by using `tree_map` for both `get_device_states` and `_infer_device_type`. Also added appropriate unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121462 Approved by: https://github.com/soulitzer	2024-03-20 21:09:21 +00:00
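The difference between a shallow scan and a recursive one can be shown with a small stand-alone sketch. It deliberately does not use PyTorch or `tree_map`: it substitutes a hand-written recursive walk, and `FakeTensor`, `shallow_devices`, and `nested_devices` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; it only records a device string."""
    device: str


def shallow_devices(args):
    """Inspect top-level arguments only, so tensors inside containers are missed."""
    return {a.device for a in args if isinstance(a, FakeTensor)}


def nested_devices(obj):
    """Walk lists, tuples and dicts recursively and collect every device found."""
    if isinstance(obj, FakeTensor):
        return {obj.device}
    if isinstance(obj, dict):
        obj = list(obj.values())
    if isinstance(obj, (list, tuple)):
        found = set()
        for item in obj:
            found |= nested_devices(item)
        return found
    return set()


args = ([FakeTensor("cuda:0")], {"x": FakeTensor("cuda:1")}, 3)
print(shallow_devices(args))         # empty: both tensors are nested
print(sorted(nested_devices(args)))  # both devices are found
```

An empty result from the shallow scan corresponds to the empty `fwd_device_states` described above, which is why no RNG state was saved.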
cyy	1dd1899fd6	Add missing throw of std::runtime_error in dynamo/guards.cpp (#122306 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122306 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2024-03-20 20:50:01 +00:00
Menglu Yu	d2a8d3864c	[PT2][Inductor] Change the log for the group batch fusion (#122245 ) Summary: Instead of using "batch_fusion" and "group_fusion" to log, we use the specific pass name to log, which could better summarize the hit of each pattern as well as debug Test Plan: ``` buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion ``` Differential Revision: D55103303 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122245 Approved by: https://github.com/jackiexu1992	2024-03-20 20:45:37 +00:00
ydwu4	61ff41f0ca	[while_loop] disable closure capturing and manually set the inputs. (#122244 ) For the while_loop operator, it's important to keep the output ordering consistent with the input ordering. Previously, we were using set_graph_inputs="automatic", which doesn't respect such ordering. This PR changes it to "manual" and respects the original user inputs' ordering. We disable closures for body and cond fn as they require some additional design. This PR is just to stop the bleeding. Repro:
```python
import torch
from torch._higher_order_ops.while_loop import while_loop
from torch._functorch.aot_autograd import aot_export_module

class Nested(torch.nn.Module):
    def forward(self, ci, cj, a, b):
        def cond_fn(i1, j1, x1, y1):
            return i1 > 0

        def body_fn(i1, j1, x1, y1):
            def cond_fn_nested(i2, j2, x2, y2):
                return j2 > 0

            def body_fn_nested(i2, j2, x2, y2):
                return i2.clone(), j2 - 1, x2 + 3.14, y2 - 2.71

            i1, j1, x1, y1 = while_loop(
                cond_fn_nested, body_fn_nested, [i1, j1, x1, y1]
            )
            return i1 - 1, j1.clone(), x1 * 2, y1 / 2

        return while_loop(cond_fn, body_fn, (ci, cj, a, b))

nested = Nested()
torch.compile(nested, backend="eager", fullgraph=True)(torch.tensor(2), torch.tensor(2), torch.randn(2, 2), torch.randn(2, 2))
```
Test plan: add new test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122244 Approved by: https://github.com/aakhundov	2024-03-20 20:14:35 +00:00
Boyuan Feng	2f6e8e84c5	Fix `_chunk_cat.out` issue (#122076 ) # PR Vectors allocated inside `get_chunk_cat_metadata()` are out of local scope when used in `_chunk_cat_out_cuda_contiguous()`. This PR fixes the issue by returning vectors from `get_chunk_cat_metadata`. This PR also added a few unit tests to cover more edge cases. # Tests This PR is tested with the following command and no error shows. So the flaky test error should be resolved. - `PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test/test_ops.py -v -k test_out__chunk_cat_cuda_float32` - `PYTORCH_NO_CUDA_MEMORY_CACHING=1 python test/test_ops.py -v -k test_out__chunk_cat_cuda_float32 --repeat 1500` Fixes #122026 Fixes #121950 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122076 Approved by: https://github.com/yifuwang	2024-03-20 20:01:38 +00:00
Jacob Szwejbka	c84f81b395	[export] add pass to remove auto functionalized hop (#122246 ) Summary: Adds a pass that blindly removes the functionalize hop without considering whether it's safe. Useful for ExecuTorch today and other use cases that have additional logic that can reason about when this pass is safe to use Test Plan: added unit test Differential Revision: D55103867 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122246 Approved by: https://github.com/angelayi	2024-03-20 19:31:52 +00:00
Jing Shan	d813474363	[Pytorch] auto format _python_dispatch file (#122226 ) Summary: Auto format the _python_dispatch file, to make D55091453 easier to review Test Plan: `arc lint` Differential Revision: D55091454 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122226 Approved by: https://github.com/aakhundov	2024-03-20 19:28:39 +00:00
Jean Schmidt	821ad56ea6	[CI] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930 ) Introduce the changes needed to enable ARC to run the build for linux-jammy-py3.8-gcc11 Depends on: * https://github.com/pytorch/pytorch/pull/121908 * https://github.com/pytorch/pytorch/pull/121907 * Force docker to update credentials: https://github.com/pytorch/test-infra/pull/4991 * Add permissions to role to access ECR: `acc0154aa0` * Add permissions to the role to access relevant S3 bucket: `496b0422c3` ## Reasoning for introducing a new `_linux-build-rg.yml` The old-style `runs-on` definition accepts a string; the new-style `runs-on` requires an object in the format:
```
--- old
...
  runs-on: "linux.2xlarge"
...
--- new
...
  runs-on:
    group: "running-group"
...
```
In other words, to specify a group the format of the yaml needs to be changed. Unfortunately, there is no way to accomplish this change using any trick in the book that I am aware of. This is due to the fact that GH Actions YAML files are not templatable and support only minimal functions / replacements. A few examples that did not work: * [`e234f25` (#119544)](`e234f25ba1 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76)`) * [`087de4a` (#119544)](`087de4ad8b (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76)`) * [`f03512e` (#119544)](`f03512e344 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76)`) * [`67581fb` (#119544)](`67581fb737 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76)`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121930 Approved by: https://github.com/seemethere	2024-03-20 19:06:10 +00:00
Luoshang Pan	91fdaa1b41	[Sparsity] add support for H100 compute capability 9.x (#121768 ) Summary: as title Test Plan: buck test mode/opt //caffe2/test/... Differential Revision: D54792168 @diff-train-skip-merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/121768 Approved by: https://github.com/SherlockNoMad	2024-03-20 19:00:54 +00:00
Zhengxu Chen	d1e8b97387	[export] Log module hierarchy. (#121970 ) Summary: We can also log the module hierarchy in the following format: ``` :ToplevelModule sparse:SparshArch dense:DenseArch ``` So that we can have more information recorded about model's identity. Test Plan: CI Differential Revision: D54921097 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121970 Approved by: https://github.com/angelayi	2024-03-20 18:59:42 +00:00
PyTorch MergeBot	0696db8202	Revert "Teach dynamo about torch.func.jvp (#119926 )" This reverts commit 17489784b635187316c6c856c5fe6b6a28d8a15a. Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/peterbell10 due to broken mac jobs on main ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2010327997))	2024-03-20 18:34:43 +00:00
eellison	1d13c82559	Precompile in background (#121997 ) Precompile benchmarking choices in parallel, and then wait on those choices prior to benchmarking. In the case of deferred templates, we only wait on those choices in the scheduler, to allow multiple separate lowerings to compile in parallel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121997 Approved by: https://github.com/jansel ghstack dependencies: #121996, #120275	2024-03-20 18:34:12 +00:00
PyTorch MergeBot	65eb22158e	Revert "Update jvp to support symbolic execution. (#120338 )" This reverts commit afc4c9382ff8b55da848ef40b4a17a92fb3d2ad6. Reverted https://github.com/pytorch/pytorch/pull/120338 on behalf of https://github.com/huydhn due to Broke dynamo tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/120338#issuecomment-2010276712))	2024-03-20 18:04:53 +00:00
amoskvic	072935917b	Update cuda_to_hip_mappings.py (#122110 ) Added one datatype mapping (cuda_bf16.h), and a number of cub/hipcub mappings. Note: the missing mappings were discovered when hipifying the Mamba model's (https://github.com/state-spaces/mamba) forward kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122110 Approved by: https://github.com/jithunnair-amd, https://github.com/Skylion007	2024-03-20 17:17:53 +00:00
Catherine Lee	334f7e43f9	[TD] Remove credentials requirement for retrieval (#122279 ) Made the bucket readable by public https://s3.console.aws.amazon.com/s3/buckets/target-determinator-assets?region=us-east-1&bucketType=general&tab=permissions The only jobs that matter here are the retrieval and td jobs, which were both successful Pull Request resolved: https://github.com/pytorch/pytorch/pull/122279 Approved by: https://github.com/huydhn	2024-03-20 15:55:46 +00:00
Adnan Akhundov	2e02e1efad	Skip nonzero unbacked SymInt memo in inference mode (#122147 ) Summary: In `torch.inference_mode()`, fake tensors don't have `_version`s. This breaks unbacked SymInt memoization in `torch.nonzero` tracing. Here we disable the latter in inference mode. Fixes https://github.com/pytorch/pytorch/issues/122127 Test Plan: ``` $ python test/inductor/test_unbacked_symints.py -k test_nonzero_in_inference_mode ... ---------------------------------------------------------------------- Ran 2 tests in 14.060s OK ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/122147 Approved by: https://github.com/ezyang	2024-03-20 14:44:55 +00:00
PyTorch MergeBot	15a8185cd3	Revert "Enable x86 CPU vectorization on windows [submodule sleef] (#118980 )" This reverts commit 2b060983809e5fe8706acd085fff67b6a27bfb5f. Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/zou3519 due to This caused build failures for 2+ pytorch devs, so we're reverting it to be safe ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-2009661069))	2024-03-20 14:10:12 +00:00
PyTorch MergeBot	06db0a9f78	Revert "Upgrade submodule sleef to fix build warning (#122168 )" This reverts commit eec8b252b70b2489aee7281d336eb9c32dd85483. Reverted https://github.com/pytorch/pytorch/pull/122168 on behalf of https://github.com/zou3519 due to trying to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/122168#issuecomment-2009653474))	2024-03-20 14:05:58 +00:00
IvanKobzarev	8a94005d46	[dynamo][runtime_asserts] Ignore failures on sorting sympy relations (#122205 ) Differential Revision: [D55075500](https://our.internmc.facebook.com/intern/diff/D55075500) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122205 Approved by: https://github.com/ezyang	2024-03-20 13:25:37 +00:00
Guilherme Leobas	afc4c9382f	Update jvp to support symbolic execution. (#120338 ) Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions. List of changes: - Update `_has_same_storage_numel` to use `sym_nbytes` - Symintify `_efficientzerotensor_meta` - Introduce `empty_generic_symint` with the first argument `size` as symbolic integer - Update gen_variable_type.py script to call the symint version of zeros_fn function (zeros_symint / _efficientzerotensor_symint) - Update `has_same_meta` to call `sym_*` functions Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338 Approved by: https://github.com/soulitzer ghstack dependencies: #119926	2024-03-20 13:09:19 +00:00
Guilherme Leobas	17489784b6	Teach dynamo about torch.func.jvp (#119926 ) List of changes: - Replace JVP_NESTING by torch._C._functorch.maybe_current_level() - Remove all increment nesting functions from wrap_fx_proxy_cls - fwAD.make_dual receives the dual_level as keyword argument - Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926 Approved by: https://github.com/zou3519	2024-03-20 13:09:19 +00:00
Valentine233	eb1d6ed9f9	[Inductor] fix addmm fusion check (#121953 ) Fixes #121253. To avoid functional issue, disable pattern match for `addmm` when `beta!=1 or 0` or `alpha!=1`, as either `mkl_linear` or `mkldnn_linear` doesn't accept `beta` or `alpha` as parameters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121953 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2024-03-20 09:22:51 +00:00
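One reading of the guard in #121953 is a simple predicate on the scaling factors. The sketch below assumes that "`beta!=1 or 0`" means "`beta` is neither 1 nor 0"; `can_fuse_addmm` is a hypothetical name and this is not Inductor's pattern-matcher code.

```python
def can_fuse_addmm(alpha: float = 1.0, beta: float = 1.0) -> bool:
    """Allow rewriting addmm to a linear call only when no scaling would be dropped.

    addmm computes beta * input + alpha * (mat1 @ mat2). A linear kernel that
    takes no alpha/beta arguments can only reproduce that when alpha == 1 and
    beta is 1 (plain bias add) or 0 (bias ignored).
    """
    return alpha == 1 and beta in (0, 1)


for alpha, beta in [(1, 1), (1, 0), (1, 0.5), (2, 1)]:
    print(alpha, beta, can_fuse_addmm(alpha, beta))
```

Rejecting the other combinations keeps the unfused `addmm`, which still applies `alpha` and `beta` correctly.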
Xilun Wu	ee6ce31b1d	[BE][fix] fix test_tp_random_state and add it to periodic test list (#122248 ) fix #122184 . Add the test to periodic test so that we can capture the error at CI in future. Test: `pytest test/distributed/tensor/parallel/test_tp_random_state.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122248 Approved by: https://github.com/wanchaol	2024-03-20 08:24:14 +00:00
Huy Do	a1d02b423c	XFAIL detectron2_maskrcnn_r_101_c4 CPU inductor accuracy (#122263 ) This starts to fail in trunk after the stack https://github.com/pytorch/pytorch/pull/122066 lands Pull Request resolved: https://github.com/pytorch/pytorch/pull/122263 Approved by: https://github.com/jansel	2024-03-20 08:03:34 +00:00
Jason Ansel	477d154ffd	[dynamo] Add missing _nonvar_fields annotations (#122219 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122219 Approved by: https://github.com/anijain2305 ghstack dependencies: #122218	2024-03-20 07:53:18 +00:00
Jason Ansel	46bf37b3f7	[dynamo] Replace VariableTracker.apply with visit/realize_all (#122218 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122218 Approved by: https://github.com/anijain2305	2024-03-20 07:53:18 +00:00
Jason Ansel	a0db2e4237	[dynamo] Fixed handling of ImportError (#122222 ) Fixes #122088 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122222 Approved by: https://github.com/anijain2305	2024-03-20 07:52:01 +00:00
Pian Pawakapan	7832efb242	[export] skip nn_module_stack verifier for non-fx.GraphModule modules (#122210 ) Downstream users of torch.export may have different module classes (e.g. LoweredBackendModule), which cannot be checked for metadata in the same way. Add lines to skip this for non-fx.GraphModule modules. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122210 Approved by: https://github.com/angelayi, https://github.com/zhxchen17	2024-03-20 07:40:48 +00:00
Wei Lu	7d2b2dec4b	[Pytoch][Vulkan] Register `run_conv1d_context` (#122172 ) Summary: We have rewritten `conv1d` as `create_conv1d_context` and `run_conv1d_context` to enable prepack of `weight` and `bias`. We have registered `create_conv1d_context` but not `run_conv1d_context`. We add the registration in this diff. Test Plan: ``` [luwei@devbig439.ftw3 /data/users/luwei/fbsource (f89a7de33)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="conv1d" Using additional configuration options from /home/luwei/.buckconfig.d/experiments_from_buck_start Recommended: For faster builds try buck2: replace 'buck' with 'buck2' NOTE: buck-out/ has changed: look for files in fbsource/buck-out/v2/ 'buck2 build --show-output //xplat/caffe2:pt_vulkan_api_test_bin' will print the new output paths. If you are building in fbsource//xplat and have questions, post in 'Cross Platform Dev Discussions': https://fb.workplace.com/groups/xplat.qa Targets matching .buckconfig buck2.supported_projects: {'//xplat/caffe2:pt_vulkan_api_test_bin': '//xplat'} To suppress this warning: touch ~/.config/.dont_hint_buck2 Building: finished in 0.1 sec (100%) 394/394 jobs, 0/394 updated Total time: 0.2 sec BUILD SUCCEEDED Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc Note: Google Test filter = conv1d [==========] Running 2 tests from 1 test suite. [----------] Global test environment set-up. [----------] 2 tests from VulkanAPITest [ RUN ] VulkanAPITest.conv1d_simple [ OK ] VulkanAPITest.conv1d_simple (208 ms) [ RUN ] VulkanAPITest.conv1d [ OK ] VulkanAPITest.conv1d (81 ms) [----------] 2 tests from VulkanAPITest (289 ms total) [----------] Global test environment tear-down [==========] 2 tests from 1 test suite ran. (289 ms total) [ PASSED ] 2 tests. ``` full test result ``` ... 
[----------] 427 tests from VulkanAPITest (22583 ms total) [----------] Global test environment tear-down [==========] 427 tests from 1 test suite ran. (22583 ms total) [ PASSED ] 426 tests. [ SKIPPED ] 1 test, listed below: [ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log YOU HAVE 11 DISABLED TESTS ``` Differential Revision: D55052816 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122172 Approved by: https://github.com/nathanaelsee	2024-03-20 07:36:23 +00:00
Yifu Wang	e7141d117f	[IntraNodeComm] refactor rendezvous into a separate method for better code organization and error handling (#120968 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120968 Approved by: https://github.com/wanchaol	2024-03-20 06:54:25 +00:00
cyy	9f572b99a6	[Clang-tidy header][29/N] Enable clang-tidy warnings in aten/src/ATen/core/.h (#122190 ) This PR enables clang-tidy in `aten/src/ATen/core/.h`, which ends the series of patches beginning from #122015. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122190 Approved by: https://github.com/Skylion007	2024-03-20 06:17:37 +00:00
Wanchao Liang	11e64b4ba8	[dtensor] aten.cat to use stack strategy approach (#122209 ) This PR switches aten.cat to a strategy approach similar to that of aten.stack, as the two ops share similar semantics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122209 Approved by: https://github.com/wz337	2024-03-20 04:19:25 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	5b7ceab650	Support auto_functionalize in pre-dispatch (#122177 ) Summary: Title Test Plan: CI Differential Revision: D55042061 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122177 Approved by: https://github.com/zou3519	2024-03-20 04:17:58 +00:00
Huy Do	dc89d8b74a	Fix broken lint after #116876 (#122253 ) Trivial fixes, so let's do this instead of reverting the change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122253 Approved by: https://github.com/clee2000	2024-03-20 04:09:00 +00:00
Catherine Lee	de950039fc	Use .get in xml parsing (#122103 ) Check that the `classname` attribute actually exists. #122017 I expect this route to happen very rarely. At a certain point, we should just remove this parsing altogether since everything uses pytest now... Pull Request resolved: https://github.com/pytorch/pytorch/pull/122103 Approved by: https://github.com/huydhn	2024-03-20 04:07:49 +00:00
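The `.get` change described in #122103 can be illustrated with the standard library's `xml.etree.ElementTree`. This is a minimal sketch, not the PR's code: the sample report and the `classnames` helper are hypothetical, and it assumes a JUnit-style file in which some `testcase` elements lack a `classname` attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical JUnit-style report: the second testcase has no classname.
SAMPLE = """
<testsuite>
  <testcase classname="TestFoo" name="test_a"/>
  <testcase name="test_b"/>
</testsuite>
""".strip()


def classnames(xml_text):
    """Return the classname of each testcase, or None where it is missing."""
    root = ET.fromstring(xml_text)
    # Element.get returns None (or a supplied default) for a missing
    # attribute, whereas case.attrib["classname"] would raise KeyError.
    return [case.get("classname") for case in root.iter("testcase")]


print(classnames(SAMPLE))
```

Callers can then skip or default the `None` entries instead of crashing on a malformed report.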
Shan19900305	6662627c89	Add APIs for custom device using TensorIteratorBase. (#120792 ) 1) add operand and get_dim_names API; 2) set will_resize to true when output tensor is undefined; 3) add abs_stub for dummy device and calculate on cpu device; 4) support dummy device copy with stride; Pull Request resolved: https://github.com/pytorch/pytorch/pull/120792 Approved by: https://github.com/ezyang	2024-03-20 03:51:09 +00:00
Zhengxu Chen	f8565c4a28	[sigmoid] Clean up serialization API. (#122102 ) Summary: Entirely remove the old serializer code to avoid further confusion and code bloat. Test Plan: CI Reviewed By: SherlockNoMad Differential Revision: D54857118 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122102 Approved by: https://github.com/tugsbayasgalan	2024-03-20 03:45:36 +00:00
Valentine233	1f8177dedf	[Inductor][CPU] fix flash attention last_stride!=1 issue (#122083 ) Fixes #121174. Conv converts the input of sdpa to channel last, resulting in accuracy issue. Ensure the layout in lowering. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122083 Approved by: https://github.com/eellison, https://github.com/jgong5	2024-03-20 02:22:33 +00:00
cyy	55310e58a9	Use constexpr for index variables (#122178 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122178 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2024-03-20 02:20:17 +00:00
Xu Han	eec8b252b7	Upgrade submodule sleef to fix build warning (#122168 ) Subsequent PR to https://github.com/pytorch/pytorch/pull/118980, fix sleef build warning. submodule sleef, include this sleef PR: https://github.com/shibatch/sleef/pull/514 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122168 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-03-20 02:14:56 +00:00
eellison	cbbed46377	Defer selection of triton template (#120275 ) Our prior approach to epilogue fusion was to select from a choice from a set of triton templates and extern calls based on benchmarking inputs, then unconditionally fuse epilogues. This can be sub-optimal in following ways: - We select an extern kernel, however an epilogue like relu() exists such that choosing a triton template + relu would have been faster - We select a triton template, epilogue fuse, and register spilling occurs causing it to be slower than not epilogue fusing. In this PR we wait to select either the Triton Template or Extern Kernel based on benchmarking results from the kernel itself and its epilogue. As soon as a successful fusion occurs where a fused Triton Template + epilogue is faster than the unfused choice we finalize the MultiTemplateBuffer as a specific template. If no fusion occurs we'll finalize the MultiTemplateBuffer after fusion. Note: if there are multiple epilogue fusions (not super likely), even though we select a template after the first fusion, we will still benchmark to see if subsequent epilogue are worth fusing. We could potentially defer choosing template in this case in a follow up at expense of compile time. Gives 4% HF training win, 10% TIMM inference win. Increases compilation time which I will be trying to address more in follow up prs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120275 Approved by: https://github.com/jansel ghstack dependencies: #121996	2024-03-20 01:40:33 +00:00
PyTorch MergeBot	e5e0685f61	Revert "[dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098 )" This reverts commit 88ebdbc97c103271766203df6662240e95a09b42. Reverted https://github.com/pytorch/pytorch/pull/122098 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the distributed failure looks legit as it is also failing in trunk `88ebdbc97c` ([comment](https://github.com/pytorch/pytorch/pull/122098#issuecomment-2008483316))	2024-03-20 01:12:24 +00:00
Valentine233	19d6004b97	add int8 woq mm pattern matcher (#120985 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120985 Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/eellison	2024-03-20 00:23:41 +00:00
Huy Do	6fefc52a2b	Set py3.x build-environment name consistently (#122247 ) https://github.com/pytorch/pytorch/pull/122157 checks for the Python version using `"$BUILD_ENVIRONMENT" != py3.8`, but some build environment uses a different style with `py3_8` instead causing numpy 2.x to be installed there wrongly, i.e. `03b987fe3f` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122247 Approved by: https://github.com/malfet	2024-03-19 23:56:38 +00:00
Richard Barnes	6c659bbc36	[codemod][lowrisk] Remove unused exception parameter from caffe2/c10/mobile/CPUCachingAllocator.cpp (#116875 ) Summary: `-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it. This: ``` try { ... } catch (exception& e) { // no use of e } ``` should instead be written as ``` } catch (exception&) { ``` If the code compiles, this is safe to land. Test Plan: Sandcastle Reviewed By: kimishpatel, palmje Pull Request resolved: https://github.com/pytorch/pytorch/pull/116875 Approved by: https://github.com/Skylion007	2024-03-19 23:52:09 +00:00
Richard Barnes	6b95dc8884	[codemod][lowrisk] Remove unused exception parameter from caffe2/torch/csrc/jit/frontend/lexer.cpp (#116876 ) Summary: `-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it. This: ``` try { ... } catch (exception& e) { // no use of e } ``` should instead be written as ``` } catch (exception&) { ``` If the code compiles, this is safe to land. Test Plan: Sandcastle Reviewed By: palmje Pull Request resolved: https://github.com/pytorch/pytorch/pull/116876 Approved by: https://github.com/Skylion007	2024-03-19 23:51:26 +00:00
feifan	d0153ca755	use make_storage_impl to create storages for COWStorage. (#121896 ) Thanks to https://github.com/pytorch/pytorch/pull/118459, `make_storage_impl` uses the function registered for other backends to create the StorageImpl. `make_storage_impl` completely supersedes `make_intrusive<StorageImpl>`, so it makes sense to replace `make_intrusive<StorageImpl>` with `make_storage_impl` when creating storage in COW. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121896 Approved by: https://github.com/ezyang	2024-03-19 23:40:15 +00:00
Shan19900305	4aaf25bc38	delete useless cast_outputs call in unary_op_impl_float_out (#120486 ) The cast_outputs function is only used for the CPU device, and it is already called in the cpu_xxx_vec functions, such as cpu_kernel_vec. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120486 Approved by: https://github.com/ezyang	2024-03-19 23:37:06 +00:00
Richard Barnes	2980779d0b	[codemod] Remove unused variables in caffe2/caffe2/experiments/operators/tt_pad_op.h (#120177 ) Summary: LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Reviewed By: palmje Pull Request resolved: https://github.com/pytorch/pytorch/pull/120177 Approved by: https://github.com/Skylion007	2024-03-19 23:36:52 +00:00
Edward Z. Yang	2239b55cd1	Add some more sanity asserts to checkPoolLiveAllocations (#122223 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122223 Approved by: https://github.com/eellison	2024-03-19 23:26:19 +00:00
Gonçalo Rua	139647d317	Fix #83241 : torch.nn.TripletMarginLoss allowed margin less or equal to 0 (#121978 ) Documentation states that the parameter margin of torch.nn.TripletMarginLoss is greater than 0, however any value was being accepted. Also fixed torch.nn.TripletMarginWithDistanceLoss which had the same problem. Added error test input for the new ValueError. Fixes #83241 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121978 Approved by: https://github.com/mikaylagawarecki	2024-03-19 23:19:11 +00:00
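The margin check described in #121978 can be sketched in plain Python. This is an illustrative stand-in, not `torch.nn.TripletMarginLoss` itself: the `triplet_margin_loss` function below is hypothetical, uses plain Euclidean distance on coordinate tuples, and omits the p-norm and epsilon handling of the real module.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Compute max(d(a, p) - d(a, n) + margin, 0) with Euclidean distance."""
    if margin <= 0:
        # Mirrors the documented requirement that margin be greater than 0.
        raise ValueError(f"margin must be greater than 0, got {margin}")
    d_ap = math.dist(anchor, positive)  # distance anchor -> positive
    d_an = math.dist(anchor, negative)  # distance anchor -> negative
    return max(d_ap - d_an + margin, 0.0)


print(triplet_margin_loss((0, 0), (3, 4), (0, 1)))
```

Validating once at construction or call time turns a silently accepted bad hyperparameter into an immediate error.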
Richard Barnes	a843bbdb21	[codemod] Remove unused variables in caffe2/caffe2/opt/nql/graphmatcher.cc (#118116 ) Summary: LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Reviewed By: malfet, dmm-fb Differential Revision: D52981072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118116 Approved by: https://github.com/Skylion007	2024-03-19 22:45:43 +00:00
Richard Barnes	f05af9e377	[codemod] Remove unused variables in caffe2/caffe2/opt/nql/ast.h (#120176 ) Summary: LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Reviewed By: palmje Differential Revision: D53779579 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120176 Approved by: https://github.com/Skylion007	2024-03-19 22:44:51 +00:00
Nikita Shulga	03b987fe3f	[CI] Test that NumPy-2.X builds are backward compatible with 1.X (#122157 ) By compiling PyTorch against the 2.x RC, but running all the tests with NumPy-1.X. This has no effect on binary builds. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122157 Approved by: https://github.com/atalman	2024-03-19 22:40:26 +00:00
Richard Barnes	f8becb626f	[codemod] Remove unused variables in caffe2/caffe2/contrib/fakelowp/spatial_batch_norm_fp16_fake_op.h (#120178 ) Summary: LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Reviewed By: palmje Differential Revision: D53779549 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120178 Approved by: https://github.com/Skylion007	2024-03-19 22:36:38 +00:00
Richard Barnes	94eb940a02	[codemod] Remove unused variables in caffe2/caffe2/operators/softmax_op_cudnn.cc (#121995 ) Summary: LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Reviewed By: palmje Differential Revision: D54931224 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121995 Approved by: https://github.com/Skylion007	2024-03-19 22:35:58 +00:00
Richard Barnes	a6aa3afa77	[codemod] Remove unused variables in caffe2/caffe2/video/video_decoder.cc (#122151 ) Summary: LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Differential Revision: D54378401 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122151 Approved by: https://github.com/Skylion007	2024-03-19 22:34:17 +00:00
Richard Barnes	a80c60ad8f	[codemod] Remove unused variables in caffe2/caffe2/operators/conv_op_cudnn.cc (#122161 ) Summary: LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Pull Request resolved: https://github.com/pytorch/pytorch/pull/122161 Approved by: https://github.com/Skylion007	2024-03-19 22:33:19 +00:00
Richard Barnes	02f436da6d	[codemod][bugfix] Fix addressing bug in caffe2/caffe2/video/video_input_op.h (#121856 ) Summary: # Diff Specific The signature of `copyFrom` is ``` void Tensor::CopyFrom(const Tensor& src, bool async) { ``` so the `&context` always evaluated to true. I could dig around to see if anyone cares about what the flag should actually be, but this is old code in caffe2, so I've just used `true` and we'll keep using whatever behaviour we've been using since 2019 or so when this was written. # General A bug in this code was identified by `-Waddress`, which we are working to enable globally. This diff fixes the bug. There are a few types of fixes it might employ: The bug could be `const_char_array == "hello"` which compares two addresses and therefore is almost always false. This is fixed with `const_char_array == std::string_view("hello")` because `string_view` has an `==` operator that makes an appropriate comparison. The bug could be `if(name_of_func)` which always returns true because the function always has an address. Likely you meant to call the function here! - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Pull Request resolved: https://github.com/pytorch/pytorch/pull/121856 Approved by: https://github.com/Skylion007	2024-03-19 22:28:06 +00:00
Shunting Zhang	1c4887d52b	fix dlrm accuracy test in max-autotune (#122012 ) torchrec_dlrm training fails the accuracy check when max-autotune is enabled. I found there is no real issue in PT2; we fail to get fp64 reference results for the accuracy check. In max-autotune mode the numerics may change a bit and cause the cosine similarity check to fail. Using an fp64 baseline is more reliable and makes the test pass. The reason we were not using an fp64 baseline earlier is that torchrec uses a dataclass [Batch](`99e6e669b5/torchrec/datasets/utils.py (L28)`) to represent the input. We use pytree to cast the model and inputs to fp64, and pytree cannot look into a dataclass. My fix is to convert the dataclass to a namedtuple to be more pytree friendly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122012 Approved by: https://github.com/jansel, https://github.com/eellison	2024-03-19 22:23:42 +00:00
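The dataclass-versus-namedtuple point in #122012 can be shown with a toy tree map. This is a sketch under stated assumptions, not `torch.utils._pytree`: the `tree_map` function, the `BatchDC` and `BatchNT` types, and the `to_float` cast are all hypothetical stand-ins, with `float` playing the role of an fp64 cast.

```python
from collections import namedtuple
from dataclasses import dataclass


def tree_map(fn, tree):
    """Toy pytree map: recurses into dicts, lists, tuples and namedtuples."""
    if isinstance(tree, dict):
        return {k: tree_map(fn, v) for k, v in tree.items()}
    if isinstance(tree, tuple) and hasattr(tree, "_fields"):  # namedtuple
        return type(tree)(*(tree_map(fn, v) for v in tree))
    if isinstance(tree, (list, tuple)):
        return type(tree)(tree_map(fn, v) for v in tree)
    return fn(tree)  # anything else, including a dataclass, is a leaf


@dataclass
class BatchDC:  # hypothetical dataclass-style batch
    values: list


BatchNT = namedtuple("BatchNT", ["values"])  # hypothetical namedtuple batch


def to_float(x):
    # Stand-in for an fp64 cast; non-numeric leaves pass through untouched.
    return float(x) if isinstance(x, int) else x


dc = tree_map(to_float, BatchDC([1, 2]))  # opaque leaf: values stay ints
nt = tree_map(to_float, BatchNT([1, 2]))  # traversed: values become floats
print(dc, nt)
```

Because the dataclass instance is handed to `to_float` as a single opaque leaf, its contents are never cast, which is the failure mode the commit message describes.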
PyTorch MergeBot	c71554b944	Revert "[aot_inductor][easy] enable test_triton_kernel_multi_output_arg (#122052 )" This reverts commit 206da97b8b61f51041f67de68e68e9a1875589ab. Reverted https://github.com/pytorch/pytorch/pull/122052 on behalf of https://github.com/huydhn due to Although this look fixed on OSS, it is still failing internally. I have added the reproducible buck command in the diff D55046262 ([comment](https://github.com/pytorch/pytorch/pull/122052#issuecomment-2008253185))	2024-03-19 22:22:12 +00:00
Edward Z. Yang	7678be4667	Replace numel with sym_numel in is_int_or_symint (#122145 ) Fixes https://github.com/pytorch/pytorch/issues/122124 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122145 Approved by: https://github.com/Skylion007	2024-03-19 21:58:43 +00:00
David Yan	6915a5be70	Increase numel limit to 2^63 for replicatepad1d (#122199 ) Summary: As title Test Plan: ``` CUDA_VISIBLE_DEVICES=5 buck2 test mode/opt //caffe2/test:test_nn_cuda -- test_replicatepad_64bit_indexing ``` Also benchmarked in N5106027 ``` device_ms, cpu_ms, gb/device_ms*1000 # before changes 11.058772478103638 18.912256770000006 735.4118906278957 # after changes 10.621162576675415 18.58972748 765.7121070725207 ``` Differential Revision: D55030372 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122199 Approved by: https://github.com/ezyang	2024-03-19 21:55:34 +00:00
Nikita Shulga	b12d297b44	[AARCH64] Hide FP16 scalar arithmetic behind proper feature flag (#122204 ) On Apple Silicon: ``` % sysctl machdep.cpu.brand_string; clang -dM -E - < /dev/null|grep __ARM_FEATURE_FP16 machdep.cpu.brand_string: Apple M1 #define __ARM_FEATURE_FP16_FML 1 #define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1 #define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1 ``` On Graviton2 with respective `-march` flag: ``` # ./cpuinfo/build/cpu-info |grep Microarch -A1; gcc -dM -E - -march=armv8.2-a+fp16 </dev/null | grep __ARM_FEATURE_FP16 Microarchitectures: 8x Neoverse N1 #define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1 #define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1 ``` Test Plan: CI Reviewed By: dimitribouche Differential Revision: D55033347 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122204 Approved by: https://github.com/huydhn	2024-03-19 21:18:09 +00:00
Jerry Zhang	901ba2be86	[quant][pt2e] Add support for conv transpose + bn + {relu} weights fusion in PTQ (#122046 ) Summary: also added some utils in xnnpack_quantizer_utils.py * annotate_conv_tranpsose_bn_relu and annotate_conv_transpose_bn -> this is for QAT * annotate_conv_transpose_relu conv_transpose + bn weights fusion is performed automatically and can not be disabled currently we can add support to allow disable this fusion later if needed Test Plan: python test/test_quantization.py -k test_conv_transpose_bn_fusion Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/122046 Approved by: https://github.com/andrewor14	2024-03-19 21:00:57 +00:00
zdevito	bc1fef113d	Respect TORCH_DISABLE_ADDR2LINE in symbolizer (#121359 ) If TORCH_DISABLE_ADDR2LINE is set, the symbolizer will instead give the filename of the shared library as the filename, the offset in that library as the linenumber, and use dladdr to get the function name if possible. This is much faster than using addr2line, and the symbols can be later resolved offline using addr2line if desired. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121359 Approved by: https://github.com/aaronenyeshi	2024-03-19 20:50:26 +00:00
Hoa Dinh	7718a1cd4f	T159183991: Error: EXC_SOFTWARE / SIGABRT at IGPyTorchFramework:-[MPSImageWrapperTrampoline endSynchronization:] (MPSImageWrapper.mm<line_num>):cpp_exception_clas (#122132 ) Summary: Prevent crash by not throwing a C++ exception. Test Plan: spongebobsandcastle Reviewed By: SS-JIA Differential Revision: D55036050 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122132 Approved by: https://github.com/SS-JIA	2024-03-19 20:01:33 +00:00
Oguz Ulgen	c0b2e56c8f	Support triton.language.dtype with torch.compile -- Second Attempt (#122141 ) This PR is the second attempt at supporting `triton.language.dtype`, now instead of putting it on the graph, we put it on the side table since it is a constant. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122141 Approved by: https://github.com/jansel ghstack dependencies: #122140	2024-03-19 19:40:52 +00:00
Oguz Ulgen	58a805da71	[UserDefinedTriton] Move constant args out of the fx graph (#122140 ) @ezyang mentioned that we should not put constant args on the graph. Especially when there are args that would be trickier to put on the graph. E.g. next PR needs `triton.language.dtype` as an argument on the graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122140 Approved by: https://github.com/jansel	2024-03-19 19:40:52 +00:00
Pian Pawakapan	c5ffebebab	[export] allow Dim(1,2) for export dynamic shapes (v2 after revert) (#121910 ) Creating this after [PR](https://github.com/pytorch/pytorch/pull/121642) got reverted. Current dynamic shapes implementation fixes lower range of Dims to be 2 for analysis, but allows 0/1 shapes during runtime. This leads to failures when initializing Dim(1,2). This PR sets the lower bound to 0, and avoids erroring out when conflicting with the generated (2, maxsize) constraint during analysis. Also resolves a derived dim constraints issue with the following code: ``` class Bar(torch.nn.Module): def forward(self, x, y): return x + y[1:] dx = Dim("dx", min=1, max=3) ep = export( Bar(), (torch.randn(2, 2), torch.randn(3, 2)), dynamic_shapes=({0: dx, 1: None}, {0: dx+1, 1: None}) ) print(ep.range_constraints) ``` In main: ``` {s0: ValueRanges(lower=2, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=3, upper=4, is_bool=False)} ``` This PR: ``` {s0: ValueRanges(lower=1, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=2, upper=4, is_bool=False)} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121910 Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17	2024-03-19 19:08:05 +00:00
PyTorch MergeBot	d56ab7b020	Revert "[torch export][serialize] create a more compact stacktrace format for serialization (#121675 )" This reverts commit eae89138d891d0310483c4d86dcb69b16de0a6b5. Reverted https://github.com/pytorch/pytorch/pull/121675 on behalf of https://github.com/jeanschmidt due to It seems that this PR broke lint jobs, I am reverting to confirm if this is the case ([comment](https://github.com/pytorch/pytorch/pull/121675#issuecomment-2007919486))	2024-03-19 19:02:09 +00:00
PyTorch MergeBot	36e5c1dcab	Revert "Teach dynamo about torch.func.jvp (#119926 )" This reverts commit edd04b7c16cc6715411119bb7db234a9df59065f. Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/jeanschmidt due to lots of breakages in pull jobs, checking if reverting this one will help ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2007915919))	2024-03-19 18:59:46 +00:00
PyTorch MergeBot	88999674a0	Revert "Update jvp to support symbolic execution. (#120338 )" This reverts commit 39877abee2c3ad1956013d467b0f6e86cd20acfb. Reverted https://github.com/pytorch/pytorch/pull/120338 on behalf of https://github.com/jeanschmidt due to lots of breakages in pull jobs, checking if reverting this one will help ([comment](https://github.com/pytorch/pytorch/pull/120338#issuecomment-2007898831))	2024-03-19 18:50:12 +00:00
Richard Barnes	e0d57001ef	[codemod] Remove unused variables in caffe2/caffe2/experiments/operators/fully_connected_op_prune.h (#122165 ) Summary: LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance. This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Reviewed By: dmm-fb Differential Revision: D54380402 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122165 Approved by: https://github.com/Skylion007	2024-03-19 18:41:16 +00:00
Natalia Gimelshein	6bd2d12bc7	release gil in prepareProfiler (#121949 ) Initializing profiler while holding gil can lead to deadlocks, as it makes some presumably synchronizing cuda calls Pull Request resolved: https://github.com/pytorch/pytorch/pull/121949 Approved by: https://github.com/aaronenyeshi	2024-03-19 18:05:21 +00:00
Flavio Sales Truzzi	7fb2d69282	[PT2] - Fix cat backwards wrapping on symints (#121527 ) Summary: Wrapping was comparing Symint and ints forcing a guard. Rewrite it with TORCH_GUARD_SIZE_OBLIVIOUS ``` [trainer0\|0]: File "<invalid>", line 0, in THPEngine_run_backward(_object, _object, _object) [trainer0\|0]: File "<invalid>", line 0, in torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&, std::vector<at::Tensor, std::allocator<at::Tensor>> const&, bool, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&) [trainer0\|0]: File "<invalid>", line 0, in torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&, std::vector<at::Tensor, std::allocator<at::Tensor>> const&, bool, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&) [trainer0\|0]: File "<invalid>", line 0, in torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&) [trainer0\|0]: File "<invalid>", line 0, in torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&) [trainer0\|0]: File "<invalid>", line 0, in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) [trainer0\|0]: File "<invalid>", line 0, in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) [trainer0\|0]: File "<invalid>", line 0, in torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor>>&&) [trainer0\|0]: File "<invalid>", line 0, in 
torch::autograd::generated::CatBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor>>&&) [trainer0\|0]: File "<invalid>", line 0, in torch::autograd::generated::details::cat_tensors_backward(at::Tensor const&, std::vector<std::vector<c10::SymInt, std::allocator<c10::SymInt>>, std::allocator<std::vector<c10::SymInt, std::allocator<c10::SymInt>>>> const&, std::vector<c10::ScalarType, std::allocator<c10::ScalarType>> const&, long) [trainer0\|0]: File "<invalid>", line 0, in c10::operator==(c10::SymInt const&, int) [trainer0\|0]: File "<invalid>", line 0, in c10::SymBool::guard_bool(char const, long) const [trainer0\|0]: File "<invalid>", line 0, in torch::impl::PythonSymNodeImpl::guard_bool(char const, long) ``` Test Plan: Regular CI Differential Revision: D54667300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121527 Approved by: https://github.com/ezyang	2024-03-19 18:03:02 +00:00
Huy Do	8de4d86479	Back out "[fx] Preserve Fx graph node order in partitioner across runs (#115621 )" (#122113 ) Summary: Original commit changeset: 6578f47abfdb Original Phabricator Diff: D54913931 Differential Revision: D55027171 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122113 Approved by: https://github.com/osalpekar	2024-03-19 18:00:37 +00:00
Wenting Wang	eae89138d8	[torch export][serialize] create a more compact stacktrace format for serialization (#121675 ) Summary: - we want fx nodes' stack trace format to be backward compatible and same as before in the program we export - however in the serialized format, we would want to show a more compact stack_trace format, otherwise the nodes attributes are dominated by stack traces - the diff implements the minimal in serialization process to dedupe node stack traces by resorting to a fileinfo_list and a filename_to_abbrev map, so we can use index to represent filenames, use lineno to represent lines. Test Plan: # llm base on D54497918 ``` buck2 run @//mode/dev-nosan fbcode//executorch/examples/models/llama2:export_llama -- -c ~/stories110M.pt -p ~/params.json ``` set up breakpoint after serialization/deserialization - serialize ``` (Pdb) v_meta = [n.meta for n in exported_program.graph_module.graph.nodes] (Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(v_meta)).number 1193647450 (Pdb) json_program = json.dumps(_dataclass_to_dict(serialized_graph.co_fileinfo_ordered_list),cls=EnumEncoder) (Pdb) json_bytes = json_program.encode('utf-8') (Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(json_bytes)).number 1193604333 (Pdb) sys.getsizeof(json_bytes) 3846 (Pdb) compressed_bytes = zstd.ZstdCompressor().compress(json_bytes) (Pdb) sys.getsizeof(compressed_bytes) 1139 ``` in P1193647450 (before serialization), search for `stack_trace` in P1193604333 (after serialization), search for `stack_trace` and `co_fileinfo_ordered_list` [note: didn't do compression in this diff since the size is pretty small and it adds complexity if we do compression] - deserialize ``` (Pdb) v_meta = [n.meta for n in deserialized_exported_program.graph_module.graph.nodes] (Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, 
content=str(v_meta)).number 1193629435 ``` in P1193629435, search for `stack_trace` # ads Differential Revision: D54654443 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121675 Approved by: https://github.com/angelayi	2024-03-19 17:58:12 +00:00
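The deduplication idea in #121675 (a file list plus indices and line numbers) can be sketched as follows. This is only an illustration of the general technique: the `compact_frames` and `expand_frames` helpers and the `(filename, lineno)` frame shape are hypothetical and do not reproduce the PR's actual serialized schema.

```python
def compact_frames(frames):
    """Replace repeated filenames with indices into a shared file list.

    frames: list of (filename, lineno) pairs.
    Returns (file_list, compact) where compact holds (file_index, lineno).
    """
    file_list, index_of, compact = [], {}, []
    for filename, lineno in frames:
        if filename not in index_of:
            index_of[filename] = len(file_list)  # first time we see this file
            file_list.append(filename)
        compact.append((index_of[filename], lineno))
    return file_list, compact


def expand_frames(file_list, compact):
    """Inverse of compact_frames: recover the original (filename, lineno) pairs."""
    return [(file_list[i], lineno) for i, lineno in compact]


frames = [("model.py", 10), ("layers.py", 42), ("model.py", 17)]
files, packed = compact_frames(frames)
print(files, packed)
```

Each long path is stored once, so the per-node metadata shrinks to small integer pairs while remaining losslessly recoverable.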
eqy	271b12c790	[Functorch] Bump tolerances for `test_per_sample_grads_embeddingnet_mechanism_functional_call_cuda` (#122014 ) the `rtol` was indeed a problem on Grace Hopper Pull Request resolved: https://github.com/pytorch/pytorch/pull/122014 Approved by: https://github.com/zou3519	2024-03-19 17:52:39 +00:00
Yanan Cao (PyTorch)	ba9a1d96a4	Add scuba logging for TorchScript usage (#121936 ) Summary: Infra to log live usage of TorchScript internally Test Plan: manually tested Differential Revision: D54923510 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121936 Approved by: https://github.com/zhxchen17	2024-03-19 17:38:27 +00:00
Catherine Lee	4819da60ab	[TD] Add LLM retrieval + heuristic (#121836 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/121836 Approved by: https://github.com/osalpekar	2024-03-19 17:31:47 +00:00
Jackie (Jiaqi) Xu	cec0fd6f2f	[pt2] add symbolic shape support for decompose mm and expose max_block to user config (#121440 ) Summary: 1) As described in https://fb.workplace.com/groups/1075192433118967/permalink/1381918665779674/ As a follow up, we can increase max_block["y"] to solve the issue 2) add symbolic shape support for decompose mm pass. I did not find a good way to compare a SymInt with an int, so when there is a symbolic shape, I assume it is a "large" dim. Test Plan: Without change block: aps-pt2-7c23cea900 increase y_block: aps-pt2_dynamic_shape-25a027423c Differential Revision: D54525453 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121440 Approved by: https://github.com/mengluy0125, https://github.com/Yuzhen11	2024-03-19 17:31:16 +00:00
PyTorch MergeBot	764eae9c4e	Revert "Add Flash Attention support on ROCM (#121561 )" This reverts commit a37e22de7059d06b75e4602f0568c3154076718a. Reverted https://github.com/pytorch/pytorch/pull/121561 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this needs more work to be able to land in fbcode because https://github.com/ROCm/aotriton is not available there atm. We are working to reland this change before 2.3 release ([comment](https://github.com/pytorch/pytorch/pull/121561#issuecomment-2007717091))	2024-03-19 17:14:28 +00:00
Peter Bell	88ebdbc97c	[dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098 ) Fixes #114844 In the linked issue we have ``` compiled_module = torch.compile(module) compiled_module.x = ... compiled_module(...) # Mutates self.x ``` Where since the module mutates `self.x` you would expect `compiled_module.x` to be updated but actually `compiled_module.x = ...` sets an attribute "x" on the `OptimizedModule` object while the forward method of the module mutates `module.x`. This gives the expected behavior by forwarding `compiled_module.__setattr__` down to `module.__setattr__`. There is already a corresponding `__getattr__` so now `compiled_module.x` becomes an alias for `module.x`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098 Approved by: https://github.com/ezyang, https://github.com/lezcano	2024-03-19 16:51:43 +00:00
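The aliasing behavior described in the commit above can be illustrated without PyTorch. The sketch below is an assumption-laden stand-in: `Wrapped` and `ForwardingWrapper` are hypothetical classes and are not the real `OptimizedModule`. They only show how forwarding both `__getattr__` and `__setattr__` makes the wrapper's attributes aliases of the wrapped object's.

```python
class Wrapped:
    """Stand-in for a user module whose call mutates ``self.x``."""

    def __init__(self):
        self.x = 0

    def __call__(self):
        self.x += 1
        return self.x


class ForwardingWrapper:
    """Hypothetical wrapper; NOT the real OptimizedModule implementation."""

    def __init__(self, inner):
        # Bypass our own __setattr__ so the wrapper itself can hold `inner`.
        object.__setattr__(self, "_inner", inner)

    def __getattr__(self, name):
        # Only invoked when normal lookup fails, so `_inner` resolves normally.
        return getattr(self._inner, name)

    def __setattr__(self, name, value):
        # Forward every assignment to the wrapped object.
        setattr(self._inner, name, value)

    def __call__(self, *args, **kwargs):
        return self._inner(*args, **kwargs)
```

Without the `__setattr__` override, `wrapper.x = ...` would create a separate attribute on the wrapper that the wrapped object's mutations never update, which is the mismatch the linked issue reports.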
Michael Lazos	2164b7f746	Flatten/Unflatten micro optimization in proxy_tensor.py (#121993 ) Lowers compile time by 1s across all suites on average Pull Request resolved: https://github.com/pytorch/pytorch/pull/121993 Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/zou3519	2024-03-19 16:49:28 +00:00
drisspg	42624bceb6	Fixes nan with large bf16 values (#122135 ) Fixes #121558 Performance on main: ``` Markdown +------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+ \| batch_size \| num_heads \| q_seq_len \| kv_seq_len \| embed_dim \| is_causal \| dtype \| forward_time \| backward_time \| +------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+ \| 1 \| 16 \| 128 \| 128 \| 2048 \| True \| torch.bfloat16 \| 12.608132004970683 \| 65.90210803551601 \| \| 1 \| 16 \| 128 \| 128 \| 2048 \| False \| torch.bfloat16 \| 11.75877740024589 \| 64.83824399765581 \| \| 1 \| 16 \| 256 \| 256 \| 2048 \| True \| torch.bfloat16 \| 16.465420153690506 \| 67.6770955324173 \| \| 1 \| 16 \| 256 \| 256 \| 2048 \| False \| torch.bfloat16 \| 17.398148600477725 \| 68.19829455344006 \| \| 1 \| 16 \| 512 \| 512 \| 2048 \| True \| torch.bfloat16 \| 29.053532000398263 \| 99.58901099162175 \| \| 1 \| 16 \| 512 \| 512 \| 2048 \| False \| torch.bfloat16 \| 27.826815698063 \| 98.05690299253911 \| \| 1 \| 16 \| 1024 \| 1024 \| 2048 \| True \| torch.bfloat16 \| 49.89655229728669 \| 178.24282555375248 \| \| 1 \| 16 \| 1024 \| 1024 \| 2048 \| False \| torch.bfloat16 \| 48.840098950313404 \| 174.5950729819015 \| \| 1 \| 16 \| 4096 \| 2048 \| 2048 \| True \| torch.bfloat16 \| 505.66218036692584 \| 1865.9265094902366 \| \| 1 \| 16 \| 4096 \| 2048 \| 2048 \| False \| torch.bfloat16 \| 295.0534054543823 \| 967.3831606050952 \| \| 1 \| 32 \| 128 \| 128 \| 2048 \| True \| torch.bfloat16 \| 11.496030446141958 \| 55.11070846114308 \| \| 1 \| 32 \| 128 \| 128 \| 2048 \| False \| torch.bfloat16 \| 11.47399884648621 \| 55.452342028729625 \| \| 1 \| 32 \| 256 \| 256 \| 2048 \| True \| torch.bfloat16 \| 13.216444296995178 \| 55.14447903260589 \| \| 1 \| 32 \| 256 \| 256 \| 2048 \| False \| torch.bfloat16 \| 12.763233599252999 \| 55.142355500720434 \| \| 1 \| 32 
\| 512 \| 512 \| 2048 \| True \| torch.bfloat16 \| 19.409965351223946 \| 74.9107634765096 \| \| 1 \| 32 \| 512 \| 512 \| 2048 \| False \| torch.bfloat16 \| 19.02470579952933 \| 74.84168506925926 \| \| 1 \| 32 \| 1024 \| 1024 \| 2048 \| True \| torch.bfloat16 \| 46.37695319834165 \| 172.19150450546294 \| \| 1 \| 32 \| 1024 \| 1024 \| 2048 \| False \| torch.bfloat16 \| 45.225963747361675 \| 185.19691249821335 \| \| 1 \| 32 \| 4096 \| 2048 \| 2048 \| True \| torch.bfloat16 \| 634.3090848531574 \| 2249.057865119539 \| \| 1 \| 32 \| 4096 \| 2048 \| 2048 \| False \| torch.bfloat16 \| 320.47313248040155 \| 1053.0515247955916 \| \| 4 \| 16 \| 128 \| 128 \| 2048 \| True \| torch.bfloat16 \| 13.448987301671878 \| 63.63581650657579 \| \| 4 \| 16 \| 128 \| 128 \| 2048 \| False \| torch.bfloat16 \| 12.509283400140703 \| 63.059300999157124 \| \| 4 \| 16 \| 256 \| 256 \| 2048 \| True \| torch.bfloat16 \| 19.71098779467866 \| 105.55780201684684 \| \| 4 \| 16 \| 256 \| 256 \| 2048 \| False \| torch.bfloat16 \| 18.264925852417946 \| 105.12311349157244 \| \| 4 \| 16 \| 512 \| 512 \| 2048 \| True \| torch.bfloat16 \| 45.218703348655254 \| 222.87272597895935 \| \| 4 \| 16 \| 512 \| 512 \| 2048 \| False \| torch.bfloat16 \| 43.55393464793451 \| 230.63290398567915 \| \| 4 \| 16 \| 1024 \| 1024 \| 2048 \| True \| torch.bfloat16 \| 134.02968645095825 \| 514.6893998607993 \| \| 4 \| 16 \| 1024 \| 1024 \| 2048 \| False \| torch.bfloat16 \| 157.13709802366793 \| 624.5892751030624 \| \| 4 \| 16 \| 4096 \| 2048 \| 2048 \| True \| torch.bfloat16 \| 1776.7079547047617 \| 6353.551096981391 \| \| 4 \| 16 \| 4096 \| 2048 \| 2048 \| False \| torch.bfloat16 \| 1143.6000745743513 \| 3811.8767354171723 \| \| 4 \| 32 \| 128 \| 128 \| 2048 \| True \| torch.bfloat16 \| 11.717129248427227 \| 55.35991647047922 \| \| 4 \| 32 \| 128 \| 128 \| 2048 \| False \| torch.bfloat16 \| 11.746983398916198 \| 55.76716404175386 \| \| 4 \| 32 \| 256 \| 256 \| 2048 \| True \| torch.bfloat16 \| 17.255573300644752 \| 
106.47456656442955 \| \| 4 \| 32 \| 256 \| 256 \| 2048 \| False \| torch.bfloat16 \| 16.46409669774584 \| 108.07770595420152 \| \| 4 \| 32 \| 512 \| 512 \| 2048 \| True \| torch.bfloat16 \| 46.63354124641045 \| 213.74862996162847 \| \| 4 \| 32 \| 512 \| 512 \| 2048 \| False \| torch.bfloat16 \| 47.01801469782367 \| 240.78139301855117 \| \| 4 \| 32 \| 1024 \| 1024 \| 2048 \| True \| torch.bfloat16 \| 127.76448752265424 \| 508.08745552785695 \| \| 4 \| 32 \| 1024 \| 1024 \| 2048 \| False \| torch.bfloat16 \| 168.6308984644711 \| 667.2996102133766 \| \| 4 \| 32 \| 4096 \| 2048 \| 2048 \| True \| torch.bfloat16 \| 2268.1598202325404 \| 7727.2648515645415 \| \| 4 \| 32 \| 4096 \| 2048 \| 2048 \| False \| torch.bfloat16 \| 1242.8469699807465 \| 4161.965740495361 \| \| 8 \| 16 \| 128 \| 128 \| 2048 \| True \| torch.bfloat16 \| 14.340955897932872 \| 93.72280450770633 \| \| 8 \| 16 \| 128 \| 128 \| 2048 \| False \| torch.bfloat16 \| 13.25262250029482 \| 93.2030284893699 \| \| 8 \| 16 \| 256 \| 256 \| 2048 \| True \| torch.bfloat16 \| 27.598425600444898 \| 183.23776399483904 \| \| 8 \| 16 \| 256 \| 256 \| 2048 \| False \| torch.bfloat16 \| 26.362583553418514 \| 183.51862096460536 \| \| 8 \| 16 \| 512 \| 512 \| 2048 \| True \| torch.bfloat16 \| 84.52303148806094 \| 383.50319798337296 \| \| 8 \| 16 \| 512 \| 512 \| 2048 \| False \| torch.bfloat16 \| 89.41743348259479 \| 432.5502900755964 \| \| 8 \| 16 \| 1024 \| 1024 \| 2048 \| True \| torch.bfloat16 \| 217.76640450116247 \| 943.9354750793427 \| \| 8 \| 16 \| 1024 \| 1024 \| 2048 \| False \| torch.bfloat16 \| 303.0781910638325 \| 1225.4394043702632 \| \| 8 \| 16 \| 4096 \| 2048 \| 2048 \| True \| torch.bfloat16 \| 3470.8542854059488 \| 12194.579601055011 \| \| 8 \| 16 \| 4096 \| 2048 \| 2048 \| False \| torch.bfloat16 \| 2268.1174043100327 \| 7608.0941944383085 \| \| 8 \| 32 \| 128 \| 128 \| 2048 \| True \| torch.bfloat16 \| 12.289720651460811 \| 95.88620596332476 \| \| 8 \| 32 \| 128 \| 128 \| 2048 \| False \| torch.bfloat16 
\| 11.618648946750909 \| 95.56685149436818 \| \| 8 \| 32 \| 256 \| 256 \| 2048 \| True \| torch.bfloat16 \| 31.567946751601994 \| 180.62468653079122 \| \| 8 \| 32 \| 256 \| 256 \| 2048 \| False \| torch.bfloat16 \| 28.611703700153157 \| 189.4215695792809 \| \| 8 \| 32 \| 512 \| 512 \| 2048 \| True \| torch.bfloat16 \| 84.11306998459621 \| 385.25596749968827 \| \| 8 \| 32 \| 512 \| 512 \| 2048 \| False \| torch.bfloat16 \| 93.82540901424363 \| 455.77428903197875 \| \| 8 \| 32 \| 1024 \| 1024 \| 2048 \| True \| torch.bfloat16 \| 226.80530551588163 \| 965.8026450779289 \| \| 8 \| 32 \| 1024 \| 1024 \| 2048 \| False \| torch.bfloat16 \| 327.4116570246406 \| 1312.5067745568228 \| \| 8 \| 32 \| 4096 \| 2048 \| 2048 \| True \| torch.bfloat16 \| 4445.5064804060385 \| 15020.768146496266 \| \| 8 \| 32 \| 4096 \| 2048 \| 2048 \| False \| torch.bfloat16 \| 2433.0302356975153 \| 8300.016750581563 \| +------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+ ``` Performance on this branch: ```Markdown +------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+ \| batch_size \| num_heads \| q_seq_len \| kv_seq_len \| embed_dim \| is_causal \| dtype \| forward_time \| backward_time \| +------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+ \| 1 \| 16 \| 128 \| 128 \| 2048 \| True \| torch.bfloat16 \| 12.783618393586949 \| 65.59692794689909 \| \| 1 \| 16 \| 128 \| 128 \| 2048 \| False \| torch.bfloat16 \| 12.064015300711617 \| 56.99719698168337 \| \| 1 \| 16 \| 256 \| 256 \| 2048 \| True \| torch.bfloat16 \| 16.629025398287922 \| 68.65267595276237 \| \| 1 \| 16 \| 256 \| 256 \| 2048 \| False \| torch.bfloat16 \| 17.462356004398313 \| 68.35797848179936 \| \| 1 \| 16 \| 512 \| 512 \| 2048 \| True \| torch.bfloat16 \| 29.5476081490051 \| 101.22994752600789 \| 
\| 1 \| 16 \| 512 \| 512 \| 2048 \| False \| torch.bfloat16 \| 28.395320149138573 \| 98.62275794148445 \| \| 1 \| 16 \| 1024 \| 1024 \| 2048 \| True \| torch.bfloat16 \| 50.50016101449728 \| 181.4357690163888 \| \| 1 \| 16 \| 1024 \| 1024 \| 2048 \| False \| torch.bfloat16 \| 49.450615647947416 \| 175.86063902126625 \| \| 1 \| 16 \| 4096 \| 2048 \| 2048 \| True \| torch.bfloat16 \| 506.06461532879626 \| 1866.0613044630736 \| \| 1 \| 16 \| 4096 \| 2048 \| 2048 \| False \| torch.bfloat16 \| 299.9336270149797 \| 976.4662646921353 \| \| 1 \| 32 \| 128 \| 128 \| 2048 \| True \| torch.bfloat16 \| 11.45752210286446 \| 58.79682704107836 \| \| 1 \| 32 \| 128 \| 128 \| 2048 \| False \| torch.bfloat16 \| 11.407129396684468 \| 58.14061599085107 \| \| 1 \| 32 \| 256 \| 256 \| 2048 \| True \| torch.bfloat16 \| 13.822759891627355 \| 56.56979401828722 \| \| 1 \| 32 \| 256 \| 256 \| 2048 \| False \| torch.bfloat16 \| 13.39154909946956 \| 56.7130644340068 \| \| 1 \| 32 \| 512 \| 512 \| 2048 \| True \| torch.bfloat16 \| 20.282494352431968 \| 77.29688903782517 \| \| 1 \| 32 \| 512 \| 512 \| 2048 \| False \| torch.bfloat16 \| 19.899454596452415 \| 75.4446149803698 \| \| 1 \| 32 \| 1024 \| 1024 \| 2048 \| True \| torch.bfloat16 \| 48.494275606935844 \| 177.5322465109639 \| \| 1 \| 32 \| 1024 \| 1024 \| 2048 \| False \| torch.bfloat16 \| 46.84524350450374 \| 189.1778860008344 \| \| 1 \| 32 \| 4096 \| 2048 \| 2048 \| True \| torch.bfloat16 \| 635.1026654010639 \| 2248.0451600858937 \| \| 1 \| 32 \| 4096 \| 2048 \| 2048 \| False \| torch.bfloat16 \| 335.1591735263355 \| 1080.4320796160027 \| \| 4 \| 16 \| 128 \| 128 \| 2048 \| True \| torch.bfloat16 \| 13.63953539985232 \| 65.50709309522063 \| \| 4 \| 16 \| 128 \| 128 \| 2048 \| False \| torch.bfloat16 \| 12.858113402035087 \| 63.021871959790595 \| \| 4 \| 16 \| 256 \| 256 \| 2048 \| True \| torch.bfloat16 \| 19.98318645055406 \| 105.87883047992364 \| \| 4 \| 16 \| 256 \| 256 \| 2048 \| False \| torch.bfloat16 \| 18.619045056402683 \| 
104.90188701078296 \| \| 4 \| 16 \| 512 \| 512 \| 2048 \| True \| torch.bfloat16 \| 45.91175540117546 \| 226.00732848513871 \| \| 4 \| 16 \| 512 \| 512 \| 2048 \| False \| torch.bfloat16 \| 44.39614630537107 \| 232.39317198749632 \| \| 4 \| 16 \| 1024 \| 1024 \| 2048 \| True \| torch.bfloat16 \| 135.5409600073472 \| 522.7949097752571 \| \| 4 \| 16 \| 1024 \| 1024 \| 2048 \| False \| torch.bfloat16 \| 158.79383607534692 \| 628.5856699105352 \| \| 4 \| 16 \| 4096 \| 2048 \| 2048 \| True \| torch.bfloat16 \| 1775.9978299727663 \| 6343.203847063706 \| \| 4 \| 16 \| 4096 \| 2048 \| 2048 \| False \| torch.bfloat16 \| 1160.680354805663 \| 3842.235009651631 \| \| 4 \| 32 \| 128 \| 128 \| 2048 \| True \| torch.bfloat16 \| 11.553713708417488 \| 65.50691701704638 \| \| 4 \| 32 \| 128 \| 128 \| 2048 \| False \| torch.bfloat16 \| 11.486379051348194 \| 56.9980075233616 \| \| 4 \| 32 \| 256 \| 256 \| 2048 \| True \| torch.bfloat16 \| 17.56585600087419 \| 107.89892700267956 \| \| 4 \| 32 \| 256 \| 256 \| 2048 \| False \| torch.bfloat16 \| 16.828144202008843 \| 109.05519902007653 \| \| 4 \| 32 \| 512 \| 512 \| 2048 \| True \| torch.bfloat16 \| 48.23235589428805 \| 217.8974545095116 \| \| 4 \| 32 \| 512 \| 512 \| 2048 \| False \| torch.bfloat16 \| 49.09284680034033 \| 244.73925953498107 \| \| 4 \| 32 \| 1024 \| 1024 \| 2048 \| True \| torch.bfloat16 \| 134.77827049791813 \| 522.7259948151186 \| \| 4 \| 32 \| 1024 \| 1024 \| 2048 \| False \| torch.bfloat16 \| 176.60772847011688 \| 681.5171707421541 \| \| 4 \| 32 \| 4096 \| 2048 \| 2048 \| True \| torch.bfloat16 \| 2267.821540008299 \| 7720.425300067291 \| \| 4 \| 32 \| 4096 \| 2048 \| 2048 \| False \| torch.bfloat16 \| 1295.3941145678982 \| 4272.425139788538 \| \| 8 \| 16 \| 128 \| 128 \| 2048 \| True \| torch.bfloat16 \| 14.514714101096615 \| 94.2192979855463 \| \| 8 \| 16 \| 128 \| 128 \| 2048 \| False \| torch.bfloat16 \| 13.553097198018804 \| 93.244242540095 \| \| 8 \| 16 \| 256 \| 256 \| 2048 \| True \| torch.bfloat16 \| 
27.95821905019693 \| 185.0469880155288 \| \| 8 \| 16 \| 256 \| 256 \| 2048 \| False \| torch.bfloat16 \| 26.709681446664035 \| 184.22623950755226 \| \| 8 \| 16 \| 512 \| 512 \| 2048 \| True \| torch.bfloat16 \| 85.85420495364815 \| 388.3417735341937 \| \| 8 \| 16 \| 512 \| 512 \| 2048 \| False \| torch.bfloat16 \| 89.97473795898259 \| 434.4228169647977 \| \| 8 \| 16 \| 1024 \| 1024 \| 2048 \| True \| torch.bfloat16 \| 220.6919804448262 \| 958.9654899900779 \| \| 8 \| 16 \| 1024 \| 1024 \| 2048 \| False \| torch.bfloat16 \| 306.55586952343583 \| 1233.2170095760375 \| \| 8 \| 16 \| 4096 \| 2048 \| 2048 \| True \| torch.bfloat16 \| 3470.7326447824016 \| 12183.611298678443 \| \| 8 \| 16 \| 4096 \| 2048 \| 2048 \| False \| torch.bfloat16 \| 2299.064100370742 \| 7669.618452200666 \| \| 8 \| 32 \| 128 \| 128 \| 2048 \| True \| torch.bfloat16 \| 12.427107692928985 \| 96.96270158747211 \| \| 8 \| 32 \| 128 \| 128 \| 2048 \| False \| torch.bfloat16 \| 11.856995843118057 \| 96.38117247959599 \| \| 8 \| 32 \| 256 \| 256 \| 2048 \| True \| torch.bfloat16 \| 32.9956392000895 \| 182.52741603646427 \| \| 8 \| 32 \| 256 \| 256 \| 2048 \| False \| torch.bfloat16 \| 29.397601098753512 \| 191.0755339777097 \| \| 8 \| 32 \| 512 \| 512 \| 2048 \| True \| torch.bfloat16 \| 89.06024845782667 \| 392.2585004474967 \| \| 8 \| 32 \| 512 \| 512 \| 2048 \| False \| torch.bfloat16 \| 97.78487798757851 \| 462.07307645818213 \| \| 8 \| 32 \| 1024 \| 1024 \| 2048 \| True \| torch.bfloat16 \| 240.521906001959 \| 992.4693452194335 \| \| 8 \| 32 \| 1024 \| 1024 \| 2048 \| False \| torch.bfloat16 \| 341.98952303268015 \| 1339.2950996058062 \| \| 8 \| 32 \| 4096 \| 2048 \| 2048 \| True \| torch.bfloat16 \| 4445.311005110853 \| 15001.030603889374 \| \| 8 \| 32 \| 4096 \| 2048 \| 2048 \| False \| torch.bfloat16 \| 2535.9767401823774 \| 8528.990152990447 \| +------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+ ``` ``` 
{'avg_forward_time_nan_fix': 399.7900972732653, 'avg_backward_time_nan_fix': 1409.652114014413, 'avg_forward_time_main_branch': 394.6807206988645, 'avg_backward_time_main_branch': 1399.4055472857629, 'geo_mean_nan_fix': 150.95049601244946, 'geo_mean_main_branch': 148.3381648508822} ``` The y-axis label in the screenshot is wrong (the unit is microseconds), but the relative comparison still holds. Screenshot: https://github.com/pytorch/pytorch/assets/32754868/ca278c15-b815-4535-bdcd-07e522055466 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122135 Approved by: https://github.com/cpuhrsch	2024-03-19 16:32:00 +00:00
rzou	e26280ad8b	Fix typing for autograd.Function with ctx-less forward (#122167 ) Previously, typing an autograd.Function like the following would lead to a mypy error (mypy expects the first argument of forward to be named `ctx`). This PR fixes that by deleting the ctx arg. ```py class MySin(torch.autograd.Function): @staticmethod def forward(x: torch.Tensor) -> torch.Tensor: return x.sin() @staticmethod def setup_context(*args, **kwargs): pass @staticmethod def backward(ctx, grad): if grad.stride(0) > 1: return grad.sin() return grad.cos() ``` Test Plan: - tested locally (I don't know how to put up a test in CI for this). Pull Request resolved: https://github.com/pytorch/pytorch/pull/122167 Approved by: https://github.com/soulitzer	2024-03-19 16:15:23 +00:00
PyTorch MergeBot	f9ed1c432d	Revert "Refactor gpu trace to be device-agnostic (#121794 )" This reverts commit 0ff1109e2688b8c841c9dd0eeecfba16f027b049. Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/jeanschmidt due to Reverting to see if rocm trunk errors are related ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2007519408))	2024-03-19 15:40:26 +00:00
Jason Ansel	c05bf0037d	[dynamo] Remove copy_graphstate/restore_graphstate (#122067 ) Some dead code cleanup. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122067 Approved by: https://github.com/oulgen	2024-03-19 15:37:53 +00:00
PyTorch MergeBot	7673cb534a	Revert "Skip nonzero unbacked SymInt memo in inference mode (#122147 )" This reverts commit 5e2687391229cee6e4dc0214f9208b4ecbe058c1. Reverted https://github.com/pytorch/pytorch/pull/122147 on behalf of https://github.com/jeanschmidt due to Reverting to see if trunk error in inductor are related ([comment](https://github.com/pytorch/pytorch/pull/122147#issuecomment-2007513000))	2024-03-19 15:37:24 +00:00
cyy	6c01c25319	[Clang-tidy header][28/N] Fix clang-tidy warnings in aten/src/ATen/core/*.h (#122175 ) This PR fixes various clang-tidy warnings on aten/src/ATen/core/*.h, following https://github.com/pytorch/pytorch/pull/122023 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122175 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-03-19 14:08:54 +00:00
Stephen Jia	6c50308801	[ATen-Vulkan][EZ] Small fixes: fix gpu size calculation and Half scalartype ctype mapping (#122096 ) Summary: ## Context Some small fixes to the ATen-Vulkan backend. The first is that GPU sizes for a 4 dimensional tensor with width packing had a small bug: ``` case 4: switch (memory_layout) { case api::GPUMemoryLayout::TENSOR_WIDTH_PACKED: gpu_sizes.at(0) = sizes.at(0); gpu_sizes.at(1) = sizes.at(1); // should be gpu_sizes.at(2) == sizes.at(2) gpu_sizes.at(2) = sizes.at(3); gpu_sizes.at(3) = api::utils::align_up(sizes.at(3), INT64_C(4)); break; ``` This was fixed by simplifying the logic of GPU size calculation for texture storage. The second was to modify the ctype mapping of the `api::kHalf` scalar type to be `float` instead of `unsigned short`. This is because GLSL does not natively support `float16`, so even with a FP16 texture type CPU/GPU transfer shaders will have to read from and write to `float` buffers. In the future, we will look into integrating [VK_KHR_shader_float16_int8](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_KHR_shader_float16_int8.html) into ATen-Vulkan to allow for 16 bit and 8 bit types to be referenced explicitly. Test Plan: CI Differential Revision: D55018171 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122096 Approved by: https://github.com/jorgep31415	2024-03-19 13:27:27 +00:00
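The intended size mapping for a width-packed 4-dimensional tensor, as indicated by the inline comment in the commit above, can be sketched in Python. This is an illustration only: the commit states that the actual fix simplified the size calculation, so the helpers `align_up` and `width_packed_gpu_sizes` below are hypothetical and merely show which dimension should be padded.

```python
from typing import List, Sequence


def align_up(value: int, multiple: int) -> int:
    """Round `value` up to the nearest multiple of `multiple`."""
    return (value + multiple - 1) // multiple * multiple


def width_packed_gpu_sizes(sizes: Sequence[int]) -> List[int]:
    """Hypothetical helper: pad only the width (last) dim to a multiple of 4."""
    if len(sizes) != 4:
        raise ValueError("this sketch only handles 4-dimensional sizes")
    n, c, h, w = sizes
    # The buggy snippet copied sizes[3] into slot 2; slot 2 should keep sizes[2].
    return [n, c, h, align_up(w, 4)]
```

For sizes `(2, 3, 5, 7)` the height stays 5 and only the width is rounded up, to 8.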
Guilherme Leobas	39877abee2	Update jvp to support symbolic execution. (#120338 ) Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions. List of changes: - Update `_has_same_storage_numel` to use `sym_nbytes` - Symintify `_efficientzerotensor_meta` - Introduce `empty_generic_symint` with the first argument `size` as symbolic integer - Update gen_variable_type.py script to call the symint version of zeros_fn function (zeros_symint / _efficientzerotensor_symint) - Update `has_same_meta` to call `sym_*` functions Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338 Approved by: https://github.com/soulitzer ghstack dependencies: #119926	2024-03-19 13:06:42 +00:00
Guilherme Leobas	edd04b7c16	Teach dynamo about torch.func.jvp (#119926 ) List of changes: - Replace JVP_NESTING by torch._C._functorch.maybe_current_level() - Remove all increment nesting functions from wrap_fx_proxy_cls - fwAD.make_dual receives the dual_level as keyword argument - Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926 Approved by: https://github.com/zou3519	2024-03-19 13:06:42 +00:00
Xuehai Pan	6b5259e507	[lint] bump lint dependency PyYAML to 6.0.1 to support Python 3.12 (#122022 ) [PyYAML 6.0.0](https://pypi.org/project/PyYAML/6.0) was released 2.5 years ago and it is not installable with Python 3.12. This PR bumps the version of [PyYAML to 6.0.1](https://pypi.org/project/PyYAML/6.0.1) in `lintrunner` configuration. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122022 Approved by: https://github.com/Skylion007	2024-03-19 12:23:49 +00:00
Xia, Weiwen	8168338063	Add CPU implementation for `torch._int_mm` (s8s8->s32) (#121792 ) Fixes #121647 Description: Currently, the op `torch._int_mm` supports only CUDA devices. This PR adds a CPU implementation for it. Besides the request from the issue, this op may also be useful for planned CPU implementations of [LLM.int8()](https://arxiv.org/abs/2208.07339) in [Bitsandbytes](https://github.com/TimDettmers/bitsandbytes). The implementation prefers mkldnn (oneDNN) kernels. If mkldnn is not available, a reference implementation with nested for loops is used. Test plan: `python test/test_linalg.py -k test__int_mm_cpu` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121792 Approved by: https://github.com/jgong5, https://github.com/lezcano	2024-03-19 08:44:33 +00:00
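A nested-loop reference of the kind the commit above mentions can be sketched in pure Python. This is not the ATen kernel: `int_mm_reference` is a hypothetical name, matrices are plain lists of lists, and because Python integers do not overflow the sketch does not model int32 wraparound.

```python
from typing import List

Matrix = List[List[int]]


def int_mm_reference(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (m x k) by a (k x n) integer matrix with three nested loops."""
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions do not match")
    n = len(b[0])
    out = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0  # wide accumulator, standing in for the s32 output type
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
    return out
```

The key point of the s8s8->s32 signature is that products of 8-bit inputs are summed in a wider accumulator, so extreme inputs such as -128 and 127 do not overflow the way they would in 8 bits.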
Tugsbayasgalan (Tugsuu) Manlaibaatar	0d845f7b07	Fix auto_functionalize (#121990 ) Differential Revision: D54964130 When we re-export, the auto_functionalize HOP will be in the graph. Therefore, we need to implement a proper functionalization rule for it. Since the content inside auto_functionalize is guaranteed to be functional, it is ok to just fall through it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121990 Approved by: https://github.com/ydwu4, https://github.com/zou3519	2024-03-19 07:11:11 +00:00
Kurt Mohler	a2a88f39ee	Avoid COW materialize in conv, log sigmoid, repeat, group_norm, batch_norm (#121537 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121537 Approved by: https://github.com/ezyang	2024-03-19 06:15:00 +00:00
Yu, Guangye	0ff1109e26	Refactor gpu trace to be device-agnostic (#121794 ) # Motivation Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic and shareable across device backends. # Solution Move `_cuda_trace.py` to `_gpu_trace.py`, so that each device backend owns its own callbacks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794 Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui	2024-03-19 06:02:28 +00:00
Han, Xu	09ce76809c	Improve compiler detection on MacOS (#121406 ) By relying on `is_apple_clang` helper function rather than on compiler name (as `gcc` is clang on MacOS): ``` % which gcc; gcc -v /usr/bin/gcc Apple clang version 15.0.0 (clang-1500.3.9.4) Target: arm64-apple-darwin23.3.0 Thread model: posix InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin ``` But ``` % /opt/homebrew/bin/gcc-13 -v Using built-in specs. COLLECT_GCC=/opt/homebrew/bin/gcc-13 COLLECT_LTO_WRAPPER=/opt/homebrew/Cellar/gcc/13.2.0/bin/../libexec/gcc/aarch64-apple-darwin23/13/lto-wrapper Target: aarch64-apple-darwin23 Configured with: ../configure --prefix=/opt/homebrew/opt/gcc --libdir=/opt/homebrew/opt/gcc/lib/gcc/current --disable-nls --enable-checking=release --with-gcc-major-version-only --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-13 --with-gmp=/opt/homebrew/opt/gmp --with-mpfr=/opt/homebrew/opt/mpfr --with-mpc=/opt/homebrew/opt/libmpc --with-isl=/opt/homebrew/opt/isl --with-zstd=/opt/homebrew/opt/zstd --with-pkgversion='Homebrew GCC 13.2.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --with-system-zlib --build=aarch64-apple-darwin23 --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk Thread model: posix Supported LTO compression algorithms: zlib zstd gcc version 13.2.0 (Homebrew GCC 13.2.0) ``` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121406 Approved by: https://github.com/malfet, https://github.com/jansel	2024-03-19 05:32:08 +00:00
FEI	8499767e96	add sdpa choice for DeviceType::PrivateUse1 (#121409 ) Fixes #116854 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121409 Approved by: https://github.com/drisspg	2024-03-19 05:08:46 +00:00
Jason Ansel	5bc7f7f977	[dynamo] Make tx.next_instruction lazy (#122066 ) Improves benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py from 2.5s to 2.4s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122066 Approved by: https://github.com/oulgen, https://github.com/anijain2305 ghstack dependencies: #122039, #122043, #122055, #122058, #122060, #122063	2024-03-19 04:23:30 +00:00
Jason Ansel	153a01833b	[dynamo] Optimize SourcelessBuilder (#122063 ) Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py` from 2.7s to 2.5s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122063 Approved by: https://github.com/anijain2305 ghstack dependencies: #122039, #122043, #122055, #122058, #122060	2024-03-19 04:23:30 +00:00
Jason Ansel	8082adcf65	[dynamo] Only rename a proxy once (#122060 ) Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py` from 3.9s to 2.7s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122060 Approved by: https://github.com/oulgen ghstack dependencies: #122039, #122043, #122055, #122058	2024-03-19 04:23:27 +00:00
Jason Ansel	2bec55c5f9	[dynamo] Remove VariableTracker.parents_tracker (#122058 ) This is leftover from mutable variable tracker days and no longer needed. Improves benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py from 4.2s to 3.9s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122058 Approved by: https://github.com/oulgen, https://github.com/anijain2305 ghstack dependencies: #122039, #122043, #122055	2024-03-19 04:23:24 +00:00
Jason Ansel	3c706bf483	[dynamo] Optimize BuiltinVariable (#122055 ) Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py` from 5.1s to 4.2s (compared to 2 PRs ago). This works by precomputing (and caching) the parts of `BuiltinVariable.call_function` that don't depend on the values of args/kwargs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122055 Approved by: https://github.com/oulgen, https://github.com/anijain2305 ghstack dependencies: #122039, #122043	2024-03-19 04:23:20 +00:00
Jason Ansel	07caea5c12	[dynamo] Refactor COMPARE_OP and comparison builtins (#122043 ) This removes the duplicate handling of comparison ops between symbolic_convert and builtin and refactors the handling to use the binop infrastructure. This change regresses overheads a bit, but this is fixed in the next PR. New test skips are variants of `type(e) is np.ndarray` previously falling back to eager. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122043 Approved by: https://github.com/anijain2305 ghstack dependencies: #122039	2024-03-19 04:23:17 +00:00
Jason Ansel	769ff86b91	[dynamo] Optimize COMPARE_OP (#122039 ) Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py` from 5.6s to 5.1s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122039 Approved by: https://github.com/Skylion007, https://github.com/anijain2305	2024-03-19 04:23:14 +00:00
cyy	e1706bba3b	[Clang-tidy header][27/N] Fix clang-tidy warnings in aten/src/ATen/core/*.h (#122023 ) This PR fixes various clang-tidy warnings on aten/src/ATen/core/*.h, following #122015 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122023 Approved by: https://github.com/ezyang	2024-03-19 03:26:15 +00:00
Adnan Akhundov	5e26873912	Skip nonzero unbacked SymInt memo in inference mode (#122147 ) Summary: In `torch.inference_mode()`, fake tensors don't have `_version`s. This breaks unbacked SymInt memoization in `torch.nonzero` tracing. Here we disable the latter in inference mode. Test Plan: ``` $ python test/inductor/test_unbacked_symints.py -k test_nonzero_in_inference_mode ... ---------------------------------------------------------------------- Ran 2 tests in 14.060s OK ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/122147 Approved by: https://github.com/ezyang	2024-03-19 03:20:33 +00:00
Animesh Jain	8860c625ea	[dynamo][guards-cpp-refactor] Integrate cpp guard manager with CheckFnManager (#120726 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120726 Approved by: https://github.com/jansel	2024-03-19 03:11:31 +00:00
Animesh Jain	f84d560236	[dynamo] Raise accumulated cache size limit (#122130 ) Fixes #114511 This was raised by IBM folks, where an LLM compile was failing because it had more than 64 layers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122130 Approved by: https://github.com/Chillee, https://github.com/jansel ghstack dependencies: #121954, #122005	2024-03-19 02:35:48 +00:00
Animesh Jain	7084528eb9	[dynamo][model_output] Do not include none for CustomizedDictVariable (#122005 ) Fixes https://github.com/pytorch/pytorch/issues/120923 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122005 Approved by: https://github.com/weifengpy, https://github.com/jansel ghstack dependencies: #121954	2024-03-19 02:35:48 +00:00
Xu Han	2b06098380	Enable x86 CPU vectorization on windows [submodule sleef] (#118980 ) Enable VEC on Windows OS. 1. Fix some type definition gaps between Windows and Linux. 2. Fix some operators not supported on Windows, such as [], /. 3. Enable static sleef library build on Windows. 4. Disable unsupported function overloading on MSVC. 5. Upgrade submodule sleef lib, which fixes a build issue on Windows. 6. Fix bazel build issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980 Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet	2024-03-19 02:22:04 +00:00
Sam Larsen	6502c888cf	Enable fx graph cache in torch_test.py when using PYTORCH_TEST_WITH_INDUCTOR=1 (#122010 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122010 Approved by: https://github.com/eellison	2024-03-19 02:17:10 +00:00
Jason Ansel	18d94d7165	Make FX nodes sortable (#122071 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122071 Approved by: https://github.com/oulgen	2024-03-19 01:40:56 +00:00
Max Ren	1f4d4d3b78	[fx] preserver partiioner order fix (#122111 ) Summary: Previous implementation seems to introduce a key value of {"node": none}. This causes an error in logging later on because we extract the name from the "node" but it is a string instead of a torch.fx.node This seems to cause tests to pass. Test Plan: CI ExecuTorch CI: buck test mode/dev-nosan //executorch/backends/xnnpack/test:test_xnnpack_models Reviewed By: larryliu0820 Differential Revision: D55026133 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122111 Approved by: https://github.com/mikekgfb	2024-03-19 01:00:44 +00:00
Nikita Shulga	34f36a28df	[MPS] Fwd-fix for clamp regression (#122148 ) Forward fix for regressions introduced by https://github.com/pytorch/pytorch/pull/121381 as we failed to run MPS CI twice on it - Do not call `minimumWithNaNPropagationWithPrimaryTensor` for integral tensors as it will crash with ``` /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSKernelDAG.mm:805: failed assertion `Error getting visible function: (null) Function isNaN_i16_i8 was not found in the library' ``` - Change the order of max and min call as it's apparently important for consistency, as `min(max(a, b), c)` might not be equal to `max(min(a, c), b)` if `c` is not always less than or equal to `b` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122148 Approved by: https://github.com/huydhn	2024-03-19 00:52:45 +00:00
Nathan	ae983d2d6e	Fix typo in sparse.rst (#121826 ) Change word "on" to "one" when talking in the third person. Fixes #121770 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121826 Approved by: https://github.com/janeyx99	2024-03-19 00:17:19 +00:00
rzou	e6cf3e90a5	[AOTAutograd / Functionalization] Fix incorrect expand_inverse (#122114 ) This is a rebase of https://github.com/pytorch/pytorch/pull/114538, originally submitted by @jon-chuang. Fixes #114302 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122114 Approved by: https://github.com/bdhirsh	2024-03-18 22:52:57 +00:00
eellison	ba69dc6675	[Easy] add option to print compilation time (#121996 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121996 Approved by: https://github.com/davidberard98	2024-03-18 22:42:41 +00:00
Nikita Shulga	2ab8b34433	Error out in case of in-source builds (#122037 ) Such builds could not succeed, as arch-specific ATen dispatch mechanism will create temporary files that will be added to the build system with every rebuild, which will result in build failures Fixes https://github.com/pytorch/pytorch/issues/121507 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122037 Approved by: https://github.com/PaliC, https://github.com/kit1980	2024-03-18 21:48:18 +00:00
Guilherme Leobas	e6a461119a	[functorch] Add batch rule for linalg.lu_unpack (#121811 ) Fixes: https://github.com/pytorch/pytorch/issues/119998 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121811 Approved by: https://github.com/peterbell10, https://github.com/zou3519	2024-03-18 21:24:16 +00:00
andrewor14	773ae817f7	Batch Norm Consolidation (#116092 ) Summary: This commit simplifies the existing decomposition hierarchy of batch norm ops by adding a single, backend agnostic op: `batch_norm_with_update`. The existing hierarchy looks like: ``` aten.batch_norm -> aten._batch_norm_impl_index -> [ aten.native_batch_norm -> aten._native_batch_norm_legit (export only) -> _batch_norm_legit_cpu/cuda (kernels, export only) -> _batch_norm_cpu/cuda (kernels) ] OR [ aten.cudnn_batch_norm ] OR [ aten.miopen_batch_norm ] ``` Aside from complexity, an important problem with the above decomposition hierarchy is cuda numerics in export flows. We observed significantly worse convergence when training a mobilenetv2-like model when using the `_batch_norm_cuda` kernel instead of the `cudnn_batch_norm` kernel. This means users who export their models on CPU first then move the models to cuda later may silently see worse accuracies even when cudnn is installed, because they are using the worse kernel. This issue is summarized in https://github.com/pytorch/pytorch/issues/111384. Instead, the new hierarchy proposed by consolidating existing batch norm ops will look like: ``` aten.batch_norm -> aten.batch_norm_with_update -> [ _batch_norm_cpu (kernel) ] OR [ _batch_norm_cuda (kernel) ] OR [ cudnn_batch_norm (kernel) ] OR [ miopen_batch_norm (kernel) ] ``` The new op `batch_norm_with_update` hides backend implementation details and automatically picks the right kernel based on what is installed. This commit also adds the following variants to this op: ``` batch_norm_with_update_functional batch_norm_with_update.out batch_norm_no_update batch_norm_no_update.out batch_norm_backward ``` Note that this commit only adds this op and its variants, but does not actually change the decomps to produce these ops in the graph. This will be done after the 2 week FC window, and the ops used in the old stack are planned to be removed after the 6 month BC window. 
Test Plan: `OpInfo` tests for `batch_norm_with_update`. Reviewers: albanD, bdhirsh Subscribers: albanD, bdhirsh, supriyar Tasks: https://github.com/pytorch/pytorch/issues/111384 Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279) Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092 Approved by: https://github.com/bdhirsh, https://github.com/albanD	2024-03-18 21:01:30 +00:00
Sam Larsen	a17cd226d6	[inductor] Enable FX graph caching on another round of inductor tests (#121994 ) Summary: Enabling caching for these tests was blocked by https://github.com/pytorch/pytorch/pull/121686 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121994 Approved by: https://github.com/eellison	2024-03-18 20:55:18 +00:00
Oguz Ulgen	7c5e29ae71	Back out "Support `triton.language.dtype` with `torch.compile` (#121690 )" (#122108 ) Summary: Some hard to deal with package import/export related problems. Let's revert and start with a clean slate. Test Plan: CI Differential Revision: D55024877 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122108 Approved by: https://github.com/ezyang	2024-03-18 20:50:28 +00:00
Simon Fan	685ace3834	[compiled autograd] add dynamo segfault test (#122004 ) To catch issues like https://github.com/pytorch/pytorch/issues/121862 in CI. This passes because we reverted the PRs, and https://github.com/pytorch/pytorch/pull/121870 confirms that this test can catch it Pull Request resolved: https://github.com/pytorch/pytorch/pull/122004 Approved by: https://github.com/eellison	2024-03-18 20:07:15 +00:00
Roger Lam	40acc84aaf	Fix torch.clamp in MPS to handle NaN correctly (#121381 ) Fixes #120899 So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers. https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381 Approved by: https://github.com/malfet	2024-03-18 19:38:15 +00:00
Dheeraj Peri	0a1b3be216	chore: add unit test to verify split_by_tags output_type (#121262 ) Add a test case as per https://github.com/pytorch/pytorch/pull/120361#issuecomment-1979163324 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121262 Approved by: https://github.com/atalman	2024-03-18 19:19:26 +00:00
PyTorch MergeBot	676a77177e	Revert "[BE] Migrate pull.yml to use S3 pytorch-ci-artifacts bucket for linux-jammy-py3_8-gcc11 and docs builds/tests (#121908 )" This reverts commit 4cbf963894e78d1cfedffe4f829740dc99163caa. Reverted https://github.com/pytorch/pytorch/pull/121908 on behalf of https://github.com/jeanschmidt due to this is due to OIDC can't work on forked PR due to token write permissions can't be shared ([comment](https://github.com/pytorch/pytorch/pull/121908#issuecomment-2004707582))	2024-03-18 19:03:11 +00:00
James Wu	df1cdaedeb	Log restart reasons and extra compile time in CompilationMetrics (#121827 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121827 Approved by: https://github.com/ezyang, https://github.com/yanboliang	2024-03-18 18:59:25 +00:00
Edward Z. Yang	74c09a757b	Simplify Storage meta conversion with PyObject preservation (#122018 ) Thanks to https://github.com/pytorch/pytorch/pull/109039 we can rely on finalizers on Storage PyObject to handle removal from dict. Irritatingly, we still have to attach finalizer, because we don't have a weak key AND value dict (only one or the other). Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122018 Approved by: https://github.com/eellison, https://github.com/kurtamohler	2024-03-18 18:55:58 +00:00
Kunal Bhalla	32410f80ec	[Caffe2 CPU tests] Update CMakeLists.txt (#119643 ) I was trying to build PyTorch with USE_GLOG=ON (so we could get better timestamps around the nccl logging) and ran into this error ``` [1/7] Linking CXX executable bin/verify_api_visibility FAILED: bin/verify_api_visibility : && /opt/rh/gcc-toolset-11/root/usr/bin/c++ -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O2 -g -DNDEBUG -rdynamic -Wl,--no-as-needed caffe2/CMakeFiles/verify_api_visibility.dir/__/aten/src/ATen/test/verify_api_visibility.cpp.o -o bin/verify_api_visibility -L/lib/intel64 -L/lib/intel64_win -L/lib/win-x64 -Wl,-rpath,/lib/intel64:/lib/intel64_win:/lib/win-x64:/usr/local/cuda/lib64:/root/conda/lib:/mnt/code/pytorch/build/lib: lib/libgtest_main.a -Wl,--no-as-needed,"/mnt/code/pytorch/build/lib/libtorch.so" -Wl,--as-needed -Wl,--no-as-needed,"/mnt/code/pytorch/build/lib/libtorch_cpu.so" -Wl,--as-needed lib/libprotobuf.a /root/conda/lib/libmkl_intel_lp64.so /root/conda/lib/libmkl_gnu_thread.so /root/conda/lib/libmkl_core.so -fopenmp /usr/lib64/libpthread.so -lm /usr/lib64/libdl.so -Wl,--no-as-needed,"/mnt/code/pytorch/build/lib/libtorch_cuda.so" 
-Wl,--as-needed lib/libc10_cuda.so lib/libc10.so /root/conda/lib/libglog.so.0.4.0 /root/conda/lib/libgflags.so.2.2.2 -lpthread /usr/local/cuda/lib64/libcudart.so /usr/local/cuda/lib64/libnvToolsExt.so lib/libgtest.a -pthread && /root/conda/bin/cmake -E __run_co_compile --lwyu="ldd;-u;-r" --source=bin/verify_api_visibility && : /opt/rh/gcc-toolset-11/root/usr/bin/ld: /mnt/code/pytorch/build/lib/libtorch.so: undefined reference to symbol '_ZTVN10__cxxabiv117__class_type_infoE@@CXXABI_1.3' /opt/rh/gcc-toolset-11/root/usr/bin/ld: /usr/lib64/libstdc++.so.6: error adding symbols: DSO missing from command line collect2: error: ld returned 1 exit status ``` Adding stdc++ explicitly to the list of libraries to link seems to fix the build, and I was able to get a working build of PyTorch. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/119643 Approved by: https://github.com/zdevito	2024-03-18 18:35:32 +00:00
Jason Ansel	5d52b163d1	[dynamo] Optimize load/store/const op handling (#122038 ) Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py` from 6.7s to 5.6s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122038 Approved by: https://github.com/Skylion007 ghstack dependencies: #122032, #122033, #122034, #122035	2024-03-18 18:08:06 +00:00
Jason Ansel	4034873a31	[dynamo] Optimize builtin handling (#122035 ) Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py` from 7.3s to 6.7s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122035 Approved by: https://github.com/Skylion007 ghstack dependencies: #122032, #122033, #122034	2024-03-18 18:08:06 +00:00
Jason Ansel	6ca0323615	[dynamo] Optimize VariableTracker.__post_init__ (#122034 ) Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py` from 8.6s to 7.3s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122034 Approved by: https://github.com/Skylion007 ghstack dependencies: #122032, #122033	2024-03-18 18:08:06 +00:00
Jason Ansel	115c9c6d6b	Remove __getattribute__ on autograd.Function (#122033 ) Improves `benchmarks/dynamo/microbenchmarks/overheads.py` from 38.7us to 34.3us. See #122029 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122033 Approved by: https://github.com/zou3519, https://github.com/soulitzer ghstack dependencies: #122032	2024-03-18 18:08:06 +00:00
Jason Ansel	5a10b56083	[dynamo] Small microbenchmark changes (#122032 ) Used to generate numbers in #122029 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122032 Approved by: https://github.com/yanboliang	2024-03-18 18:08:06 +00:00
Catherine Lee	1a58e9d357	[TD] LLM indexer to run daily (#121835 ) Run indexer daily Run indexer in docker container Pull Request resolved: https://github.com/pytorch/pytorch/pull/121835 Approved by: https://github.com/osalpekar, https://github.com/malfet	2024-03-18 16:34:01 +00:00
PyTorch MergeBot	ceb1910bad	Revert "[BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930 )" This reverts commit 11b36e163df66196d24fbded4b37ef8f8c032640. Reverted https://github.com/pytorch/pytorch/pull/121930 on behalf of https://github.com/huydhn due to New action is breaking current ci in not rebased PRs ([comment](https://github.com/pytorch/pytorch/pull/121930#issuecomment-2004393980))	2024-03-18 16:33:23 +00:00
Jean Schmidt	11b36e163d	[BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930 ) Introduce changes related to enable ARC to run on build for linux-jammy-py3.8-gcc11 Depends on: * https://github.com/pytorch/pytorch/pull/121908 * https://github.com/pytorch/pytorch/pull/121907 * Force docker to update credentials: https://github.com/pytorch/test-infra/pull/4991 * Add permissions to role to access ECR: `acc0154aa0` * Add permissions to the role to access relevant S3 bucket: `496b0422c3` ## Reasoning for introducing a new `_linux-build-rg.yml` The old style `runs-on` definition accepts a string, while the new style `runs-on` requires an object in the format: ``` --- old ... runs-on: "linux.2xlarge" ... --- new ... runs-on: group: "running-group" ... ``` In other words, to specify a group the format of the yaml needs to be changed. Unfortunately, there is no way to accomplish this change using any trick in the book that I am aware of. This is due to the fact that GH actions yaml are not templatable and support minimal functions / replacements. A few examples that did not work: * [`e234f25` (#119544)](`e234f25ba1 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76)`) * [`087de4a` (#119544)](`087de4ad8b (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76)`) * [`f03512e` (#119544)](`f03512e344 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76)`) * [`67581fb` (#119544)](`67581fb737 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76)`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121930 Approved by: https://github.com/seemethere	2024-03-18 15:40:43 +00:00
Masaki Kozuki	c4d24b5b7f	special-case cuda array interface of zero size (#121458 ) Fixes #98133 retry of #98134 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121458 Approved by: https://github.com/bdice, https://github.com/ptrblck, https://github.com/mikaylagawarecki	2024-03-18 15:21:38 +00:00
chunyuan	f7908d9fa8	enable reshape+linear+reshape fusion for dynamic shapes (#121116 ) reshape+linear+reshape fusion for dynamic shapes has been disabled in https://github.com/pytorch/pytorch/pull/107123. Re-enable it by comparing the symbolic values in case of dynamic shapes. This will improve the performance for dynamic shape cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121116 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-03-18 14:46:27 +00:00
chunyuan	f2f8eeea94	Inductor: fix Conv output stride for dynamic shapes (#121400 ) Fixes https://github.com/pytorch/pytorch/issues/120873. Fixes the output stride of Conv in the case of dynamic shapes. The previous logic in inductor assumed that the output stride of Conv is always channels last while it is actually contiguous if `dynamic_shapes and is_contiguous_storage_and_layout(x)`. ### Static shape In static shape cases, since weight is prepacked (`weight_t.is_mkldnn()` will be `true`), we'll always force output to be channels last in the Conv kernel, thus it's fine to have the assumption in Inductor that the output stride of Conv is always channels last. `96ed37ac13/aten/src/ATen/native/mkldnn/Conv.cpp (L357-L358)` ### Dynamic shape In dynamic shape cases, we won't do weight prepack for Conv; in this case, the Conv kernel decides the output layout based on the input and weight layout. `96ed37ac13/torch/_inductor/fx_passes/mkldnn_fusion.py (L1024-L1025)` For input with `channels = 1`, like a tensor of size `(s0, 1, 28, 28)` and stride `(784, 784, 28, 1)`, in Inductor, with `req_stride_order` in channels last order, the `require_stride_order` on `x` of such size and stride won't change the stride of the tensor, since the stride for dimensions of size 1 is ignored `96ed37ac13/torch/_inductor/ir.py (L5451)` In the Conv kernel, however, such a tensor is considered a contiguous tensor instead of a channels last tensor, thus the output of the Conv kernel will be in contiguous format. `96ed37ac13/aten/src/ATen/native/ConvUtils.h (L396-L404)` To align with the behavior of the Conv kernel, we set the output_stride in such a case to be contiguous instead of channels last. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121400 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-03-18 10:56:58 +00:00
Yang Chen	206da97b8b	[aot_inductor][easy] enable test_triton_kernel_multi_output_arg (#122052 ) looks like we already support aoti_torch_cuda_sort in C shim. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122052 Approved by: https://github.com/oulgen	2024-03-18 09:14:35 +00:00
Oguz Ulgen	65ccac6f17	Fix triton import time cycles (#122059 ) Summary: `has_triton` causes some import time cycles. Let's use `has_triton_package`, which is enough. Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//fblearner/flow/projects/model_processing/pytorch_model_export_utils/logical_transformations/tests:filter_inference_feature_metadata_test -- --exact 'fblearner/flow/projects/model_processing/pytorch_model_export_utils/logical_transformations/tests:filter_inference_feature_metadata_test - test_collect_features_from_graph_module_nodes (fblearner.flow.projects.model_processing.pytorch_model_export_utils.logical_transformations.tests.filter_inference_feature_metadata_test.FilterInferenceFromFeatureMetadataTest)' ``` now passes Differential Revision: D55001430 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122059 Approved by: https://github.com/aakhundov	2024-03-18 05:50:32 +00:00
PyTorch UpdateBot	bc9d054260	[executorch hash update] update the pinned executorch hash (#122061 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122061 Approved by: https://github.com/pytorchbot	2024-03-18 05:02:27 +00:00
PyTorch UpdateBot	7380585d97	[vision hash update] update the pinned vision hash (#122062 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122062 Approved by: https://github.com/pytorchbot	2024-03-18 03:41:50 +00:00
Oguz Ulgen	e39aedfcc5	Fix fx graph triton import bug (#122041 ) Summary: Unless we register triton to be a special import, FX graph import mechanism imports it as `from fx-generated._0 import triton as triton` which is obviously broken. Test Plan: I could not figure out how to write a test for this but ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//tgif/lib/tests/gpu_tests:lowering_pass_test -- -r test_default_ait_lowering_multi_hardwares ``` now passes Differential Revision: D54990782 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122041 Approved by: https://github.com/aakhundov	2024-03-17 22:48:51 +00:00
zhangsanfeng2022	5030913d6a	[test] Delete variables that have been declared but not referenced di… (#121964 ) Delete variables that have been declared but not referenced in aten/src/ATen/test/cuda_distributions_test.cu Pull Request resolved: https://github.com/pytorch/pytorch/pull/121964 Approved by: https://github.com/janeyx99	2024-03-17 09:45:05 +00:00
cyy	d9460758df	[Clang-tidy header][26/N] Fix clang-tidy warnings in aten/src/ATen/core/.h (#122015 ) This PR fixes various clang-tidy warnings on aten/src/ATen/core/.h Pull Request resolved: https://github.com/pytorch/pytorch/pull/122015 Approved by: https://github.com/ezyang	2024-03-17 07:56:45 +00:00
Animesh Jain	c568b84794	[dynamo][guards] Move backend match to eval_frame (#121954 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121954 Approved by: https://github.com/jansel	2024-03-17 06:52:10 +00:00
PyTorch UpdateBot	fc504d719f	[executorch hash update] update the pinned executorch hash (#122036 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122036 Approved by: https://github.com/pytorchbot	2024-03-17 04:56:37 +00:00
Edward Z. Yang	6f74b76072	Move get_unwrapped outside of disable_functorch (#121849 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121849 Approved by: https://github.com/albanD	2024-03-16 22:25:07 +00:00
Pian Pawakapan	3bd38928ba	[export] Improve consistency for nn_module_stack metadata, add checks to _trace.py (#120661 ) We would like to improve consistency for nn_module_stack metadata in torch.export. This PR ensures that all tests in test/export/test_export.py have the following constraints: - Remove nn_module_stack for all placeholder & output nodes, for all modules and submodules - Ensure nn_module_stack is present for all other node types for the top-level module (there is still an issue with torch.cond submodules having empty fields) - Add these checks to _export() in _trace.py (we would add this in the Verifier, but downstream apps construct ExportedPrograms separate from _export(), and metadata may not be maintained there) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120661 Approved by: https://github.com/avikchaudhuri	2024-03-16 21:44:52 +00:00
blzheng	6d9588a12b	[inductor] disable linear weight prepacking pass on double (#121478 ) Fix #121175 Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121478 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-03-16 13:24:21 +00:00
dujinhang	9990d1bc22	Add 'profiler/python' to the package.' (#121892 ) Fixes #ISSUE_NUMBER expose the `py_symbolize` interface for use. thank you Pull Request resolved: https://github.com/pytorch/pytorch/pull/121892 Approved by: https://github.com/zdevito	2024-03-16 11:11:26 +00:00
Huy Do	5f601a41e0	Pin protobuf to 3.20.2 on macOS (#121918 ) The newer protobuf 5.26.0, released on March 13th, is causing failures with `test_hparams_*` from `test_tensorboard` in which the stringified metadata is wrong when escaping double quotes. For example, `3bc2bb6781`. This looks like an upstream issue from Tensorboard where it doesn't work with this brand new protobuf version https://github.com/tensorflow/tensorboard/blob/master/tensorboard/pip_package/requirements.txt#L29 The package has been pinned on Docker https://github.com/pytorch/pytorch/blob/main/.ci/docker/requirements-ci.txt#L155, so it should be pinned on macOS too. We want to eventually just have one requirements.txt file. Fixes https://github.com/pytorch/pytorch/issues/122008 Fixes https://github.com/pytorch/pytorch/issues/121927 Fixes https://github.com/pytorch/pytorch/issues/121946 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121918 Approved by: https://github.com/kit1980	2024-03-16 09:48:05 +00:00
PyTorch UpdateBot	4d9d5fe540	[executorch hash update] update the pinned executorch hash (#122009 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122009 Approved by: https://github.com/pytorchbot	2024-03-16 04:46:45 +00:00
Jason Ansel	4d92928fe2	[dynamo] Add tests for fake FSDP (#121610 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121610 Approved by: https://github.com/yanboliang ghstack dependencies: #121735, #120965	2024-03-16 04:29:59 +00:00
Jason Ansel	0b7d9711d4	[dynamo] Add support for nn.Parameter constructor (part 2) (#120965 ) This handles the case where the tensor isn't an input. The changes to dynamo tests are cases where we would previously fall back to eager. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120965 Approved by: https://github.com/yanboliang ghstack dependencies: #121735	2024-03-16 04:29:58 +00:00
Jason Ansel	040b925753	[Compiled Autograd] Reorder accumulate grad nodes (#121735 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121735 Approved by: https://github.com/xmfan	2024-03-16 04:29:56 +00:00
PyTorch UpdateBot	f0b9a8344a	[vision hash update] update the pinned vision hash (#121177 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121177 Approved by: https://github.com/pytorchbot	2024-03-16 03:25:08 +00:00
Andrew Gu	b94691700e	[FSDP] Avoided CPU sync in `clip_grad_norm_` (#122001 ) Copying a scalar 0 tensor on CPU to GPU or constructing a scalar 0 tensor on GPU requires a CPU sync with the GPU. Let us avoid doing ops that involve it. `FSDP.clip_grad_norm_` already first checks if all parameters are not sharded and calls into `nn.utils.clip_grad_norm_`, so at the point of the code changes, there are guaranteed to be some sharded parameters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122001 Approved by: https://github.com/wanchaol	2024-03-16 03:01:49 +00:00
PaliC	7bc91d5dc2	[mergebot][BE] If we don't have any required checks, don't run required checks (#121921 ) This PR addresses the issue identified in #121920. The existing problem is that all tests are deemed mandatory if none are selected as required. This behavior is particularly noticeable during a force merge operation. In the context of a force merge, it may not be necessary to execute any tests which are not required (imo). However, this proposed change could be seen as controversial, hence it has been separated from the main update for further discussion and review. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121921 Approved by: https://github.com/huydhn ghstack dependencies: #121920	2024-03-16 01:35:21 +00:00
soulitzer	2b71b21a3f	Don't use Proxy torch function in the sym size calls (#121981 ) Fixes #ISSUE_NUMBER Changes from https://github.com/pytorch/pytorch/pull/121938 + adds test @bypass-github-pytorch-ci-checks Pull Request resolved: https://github.com/pytorch/pytorch/pull/121981 Approved by: https://github.com/davidberard98	2024-03-16 01:20:26 +00:00
Jane Xu	37e563276b	Document complex optimizer semantic behavior (#121667 ) <img width="817" alt="image" src="https://github.com/pytorch/pytorch/assets/31798555/565b389d-3e86-4767-9fcb-fe075b50aefe"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121667 Approved by: https://github.com/albanD	2024-03-16 00:43:47 +00:00
Sam Larsen	12662900f9	[inductor] FX graph cache: Fix bug handling constants (#121925 ) Summary: During key calculation for FX graph caching: Rather than specialize on "small" vs. "large" tensor constants (i.e., inlined vs. not inlined), always hash on the tensor value. Doing so avoids the complication of trying to later attach the constant values as attributes to an already-compiled module. Instead, different constants will cause an FX graph cache miss and we'll just compile. Test Plan: New unit test Pull Request resolved: https://github.com/pytorch/pytorch/pull/121925 Approved by: https://github.com/eellison	2024-03-16 00:11:51 +00:00
cyy	6b0f61891f	[Clang-tidy header][25/N] Fix clang-tidy warnings and enable clang-tidy on c10/cuda/*.{cpp,h} (#121952 ) This PR enables clang-tidy to code in c10/cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121952 Approved by: https://github.com/Skylion007	2024-03-16 00:09:54 +00:00
PyTorch MergeBot	0cc60a05da	Revert "Fix torch.clamp in MPS to handle NaN correctly (#121381 )" This reverts commit ca80d07ac71c1bfc9b13c3281a713fed89f15e0f. Reverted https://github.com/pytorch/pytorch/pull/121381 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think its test is failing in trunk https://github.com/pytorch/pytorch/actions/runs/8302739752/job/22725865151#step:7:644, we should have ciflow/mps to run the test on PR. Please take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121381#issuecomment-2000685856))	2024-03-15 23:53:05 +00:00
PyTorch MergeBot	07ec3356b9	Revert "Force upsample to be float32 (#121324 )" This reverts commit 2770e3addd9f05101705f0fef85a163e0034b8a5. Reverted https://github.com/pytorch/pytorch/pull/121324 on behalf of https://github.com/huydhn due to I think it is better to revert and reland this next week `2770e3addd` ([comment](https://github.com/pytorch/pytorch/pull/121324#issuecomment-2000617536))	2024-03-15 23:20:01 +00:00
Andrew Gu	256c0ec1e5	[docs] Added comment on replicate -> partial for `_NormPartial` (#121976 ) Add a version of https://github.com/pytorch/pytorch/pull/121945#discussion_r1525697167 as a comment in the code Pull Request resolved: https://github.com/pytorch/pytorch/pull/121976 Approved by: https://github.com/wanchaol ghstack dependencies: #121747, #121869, #121945	2024-03-15 23:04:06 +00:00
PyTorch MergeBot	b717aa6f36	Revert "[BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930 )" This reverts commit 2c33e3a372c077badc561b4aad4997e52c03610a. Reverted https://github.com/pytorch/pytorch/pull/121930 on behalf of https://github.com/huydhn due to I am seeing lots of inductor jobs failing after this change `2c33e3a372`. They look unrelated, but this change updates the Docker image so maybe something sneaks in. I will try to revert this to see if it helps and will reland the change after ([comment](https://github.com/pytorch/pytorch/pull/121930#issuecomment-2000547641))	2024-03-15 22:05:21 +00:00
Roger Lam	ca80d07ac7	Fix torch.clamp in MPS to handle NaN correctly (#121381 ) Fixes #120899 So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers. https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381 Approved by: https://github.com/malfet	2024-03-15 21:54:50 +00:00
Shuqiang Zhang	26aaabb979	[c10d] initialize lastEnqueuedSeq_ and lastCompletedSeq_ (#121980 ) Summary: These two uninitialized numbers were being logged as very large or negative values, which is confusing, so we need to initialize them. Now -1 indicates that the number is invalid, or that no work has been completed or enqueued yet. 0 could be a legitimate seq id. Test Plan: Build Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/121980 Approved by: https://github.com/xw285cornell, https://github.com/wconstab, https://github.com/kwen2501, https://github.com/XilunWu	2024-03-15 21:45:15 +00:00
Wenting Wang	dfc5e9325d	format caffe2/torch/_export/serde/serialize.py (#121670 ) Summary: black caffe2/torch/_export/serde/serialize.py Test Plan: tests Differential Revision: D54654847 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121670 Approved by: https://github.com/angelayi	2024-03-15 21:30:16 +00:00
Tugsbayasgalan Manlaibaatar	53d2188df9	Update get_aten_graph_module (#121937 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121937 Approved by: https://github.com/andrewor14	2024-03-15 20:35:55 +00:00
Aidyn-A	af86d67d61	[Doc][NVTX] Add documentation for nvtx.range (#121699 ) The context manager `torch.cuda.nvtx.range` has been around for about 4 years (see #42925). Unfortunately, it was never documented and as a consequence users are just unaware of it (see #121663). Pull Request resolved: https://github.com/pytorch/pytorch/pull/121699 Approved by: https://github.com/janeyx99	2024-03-15 20:26:44 +00:00
wz337	b92daff6e9	[DTensor] Enable ASGD foreach optimizer and add the associated unit test (#121942 ) Enable ASGD foreach optimizer and add DTensor optimizer unit test for ASGD. Note that we need to investigate why when using ASGD we need higher atol and rtol when comparing model parameters. Listing it as a TODO now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121942 Approved by: https://github.com/wanchaol	2024-03-15 20:21:27 +00:00
Andrew Gu	f4dd2fda51	[DTensor] Supported 2D `clip_grad_norm_` (#121945 ) This PR adds support for 2D `clip_grad_norm_` (`foreach=True`). - This PR changes `OpSchema.args_spec` to use pytree if the runtime schema info specifies it. - This PR includes a unit test for 2D FSDP2 + SP with `clip_grad_norm_` enabled, which serves as a complete numerics test for 2D. Note: With this PR patched, 2-way SP + 4-way FSDP matches 8-way FSDP numerics on Llama-7B (doubling local batch size for the 2-way SP run). Pull Request resolved: https://github.com/pytorch/pytorch/pull/121945 Approved by: https://github.com/wanchaol ghstack dependencies: #121747, #121869	2024-03-15 20:11:24 +00:00
Jean Schmidt	2c33e3a372	[BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930 ) Introduce changes to enable ARC to run the build for linux-jammy-py3.8-gcc11 Depends on: * https://github.com/pytorch/pytorch/pull/121908 * https://github.com/pytorch/pytorch/pull/121907 * Force docker to update credentials: https://github.com/pytorch/test-infra/pull/4991 * Add permissions to role to access ECR: `acc0154aa0` * Add permissions to the role to access relevant S3 bucket: `496b0422c3` ## Reasoning for introducing a new `_linux-build-rg.yml` The old-style `runs-on` definition accepts a string, while the new-style `runs-on` requires an object in the format: ``` --- old ... runs-on: "linux.2xlarge" ... --- new ... runs-on: group: "running-group" ... ``` In other words, to specify a group the format of the yaml needs to be changed. Unfortunately, there is no way to accomplish this change using any trick in the book that I am aware of. This is due to the fact that GH actions yaml are not templatable and support minimal functions / replacements. A few examples that did not work: * [`e234f25` (#119544)](`e234f25ba1 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76)`) * [`087de4a` (#119544)](`087de4ad8b (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76)`) * [`f03512e` (#119544)](`f03512e344 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76)`) * [`67581fb` (#119544)](`67581fb737 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76)`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121930 Approved by: https://github.com/seemethere	2024-03-15 20:09:50 +00:00
Sam Larsen	6f4fa8e9a1	[inductor] FX graph cache: simplify "current callable" logic (#121903 ) Summary: The handling of the current_callable and compiled_artifact fields in the CompiledFxGraph object is unnecessarily complicated and confusing. We can simplify by storing only the callable. That field is not serializable, so the caching approach is to store a path to the generated artifact and reload from disk on a cache hit. We can just reload inline in the FX cache hit path. This change has the added benefit that it makes it easier to fallback to a "cache miss" if the path somehow doesn't exist. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121903 Approved by: https://github.com/eellison	2024-03-15 20:00:08 +00:00
lezcano	d0d09f5977	Fix torch.compile links (#121824 ) Fixes https://github.com/pytorch/pytorch.github.io/issues/1567 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121824 Approved by: https://github.com/svekars, https://github.com/peterbell10, https://github.com/malfet ghstack dependencies: #121823	2024-03-15 19:49:37 +00:00
lezcano	8a5a377190	Move doc links to point to main (#121823 ) The previous links were pointing to an outdated branch Command: `find . -type f -exec sed -i "s:docs/main:docs/master:g" {} + ` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121823 Approved by: https://github.com/albanD, https://github.com/malfet	2024-03-15 19:49:37 +00:00
Sam Larsen	535bc71d03	Enable FX graph caching in another batch of inductor tests (#121697 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121697 Approved by: https://github.com/eellison	2024-03-15 19:38:51 +00:00
Todd Fiala	3ee319c49c	Fall back to eager mode when viewing with differing bitwidths (#120998 ) (#121786 ) The inductor lowering code for viewing a tensor as a type with a different bitwidth currently doesn't generate valid triton code. This change looks for a source and destination dtype and, if different sizes, falls back to the eager mode aten implementation. Prior to this change, this condition would throw an exception. Fixes #120998. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121786 Approved by: https://github.com/peterbell10, https://github.com/bertmaher	2024-03-15 19:33:30 +00:00
Isuru Fernando	409b1a6081	Add lowering for cummax, cummin (#120429 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120429 Approved by: https://github.com/peterbell10	2024-03-15 19:04:38 +00:00
Animesh Jain	d04faf4531	[dynamo][compile-time] Remove preserve rng state per op (#121923 ) We already have one globally - `02bb2180f4/torch/_dynamo/convert_frame.py (L477)` I don't think we need per op. Saves ~2 seconds on this benchmark ~~~ def fn(x): for _ in range(10000): x = torch.ops.aten.sin(x) return x ~~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/121923 Approved by: https://github.com/jansel	2024-03-15 18:24:46 +00:00
Isuru Fernando	67ec870234	Fix FakeTensorUpdater logic for updating fake tensors (#116168 ) Fixes https://github.com/pytorch/pytorch/issues/114464 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116168 Approved by: https://github.com/peterbell10	2024-03-15 18:22:24 +00:00
drewd789	239d87af5e	combine loops so fn_name correct in error message (#121601 ) The error message shown when input aliasing is detected in `while_loop_func` may not have the correct `fn_name`, as it is set only in the previous for loop. This change merges the two loops so that `fn_name` has the correct value. No Issue Number for this minor change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121601 Approved by: https://github.com/albanD	2024-03-15 17:14:56 +00:00
atalman	39fdde7f84	[release] Increase version 2.3.0->2.4.0 (#121974 ) Branch cut for 2.3.0 completed hence advance main version to 2.4.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121974 Approved by: https://github.com/jeanschmidt	2024-03-15 17:09:33 +00:00
Shengbao Zheng	565d1e28ab	update kineto submodule commit id (#121843 ) Summary: Update kineto submodule commit id so that pytorch profiler can pick up kineto changes from https://github.com/pytorch/kineto/pull/880 Test Plan: CI Differential Revision: D54828357 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121843 Approved by: https://github.com/aaronenyeshi	2024-03-15 16:55:25 +00:00
chilli	3c3d7455a3	Disable inductor (default) and inductor (dynamic) by default in the perf run launcher (#121914 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121914 Approved by: https://github.com/desertfire	2024-03-15 16:46:24 +00:00
angelayi	ef25d83a62	[export] Add serialization support for tokens (#121552 ) Differential Revision: [D54906766](https://our.internmc.facebook.com/intern/diff/D54906766) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121552 Approved by: https://github.com/zhxchen17	2024-03-15 16:15:11 +00:00
Wei (Will) Feng	014f91a9d9	[FSDP2] implement HSDP (#121569 ) Support HSDP in per-parameter sharding FSDP: https://github.com/pytorch/pytorch/issues/121023 HSDP is a hybrid of FSDP and DDP: reduce-scatter grads intra-node (FSDP), and all-reduce grads inter-node (DDP). For the unit test, we are testing 2 + 2 GPUs in a single node: ``pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp`` In profiler traces, the allreduce overlaps with the next reduce-scatter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121569 Approved by: https://github.com/awgu	2024-03-15 10:00:18 +00:00
Jean Schmidt	4cbf963894	[BE] Migrate pull.yml to use S3 pytorch-ci-artifacts bucket for linux-jammy-py3_8-gcc11 and docs builds/tests (#121908 ) Switch to use LF S3 bucket for pull on linux-jammy-py3_9-gcc and docs jobs. This is required to migrate to ARC and move to use LF resources. Depends on https://github.com/pytorch/pytorch/pull/121907 Follow up issue https://github.com/pytorch/pytorch/issues/121919 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121908 Approved by: https://github.com/malfet	2024-03-15 09:09:53 +00:00
bhack	2770e3addd	Force upsample to be float32 (#121324 ) Fixes #121072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121324 Approved by: https://github.com/ezyang	2024-03-15 07:50:45 +00:00
Simon Fan	e25054b248	[compiled autograd] free stack objects before calling compiled graph (#121707 ) Moved compilation code into _compiled_autograd_impl, frees stack allocated objects e.g. AutogradCompilerCall Pull Request resolved: https://github.com/pytorch/pytorch/pull/121707 Approved by: https://github.com/jansel	2024-03-15 07:12:38 +00:00
Jason Ansel	5a2b4fc8f0	[dynamo] Convert invalid args into graph breaks (#121784 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121784 Approved by: https://github.com/yanboliang	2024-03-15 06:51:27 +00:00
Gao Tianlin	fc33bbf827	better support set_default_dtype(torch.float16), update doc (#121730 ) 1. Fixes #121300 2. Previously, calling `torch.tensor([2j])` after `torch.set_default_dtype(torch.float16)` will cause a runtime error. This PR also fixes it and enables test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121730 Approved by: https://github.com/peterbell10	2024-03-15 06:48:42 +00:00
PyTorch UpdateBot	8fdd8125b6	[executorch hash update] update the pinned executorch hash (#121871 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121871 Approved by: https://github.com/pytorchbot	2024-03-15 05:25:36 +00:00
cyy	fb10e13000	[Clang-tidy header][24/N] Fix clang-tidy warnings on c10/cuda/*.{cpp,h} (#120781 ) This PR begins to clean clang-tidy warnings of code in c10/cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120781 Approved by: https://github.com/ezyang	2024-03-15 05:03:22 +00:00
Tristan Rice	e4fda049c2	DTensor: add comm tests to test_tp_examples (#121669 ) This adds some basic comm tests to test_tp_examples. This validates that the expected distributed calls are being made for `test_transformer_training`. Fixes #121649 Test plan: ``` pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121669 Approved by: https://github.com/wanchaol	2024-03-15 03:37:48 +00:00
wz337	02083f5452	[DCP][DSD] Add AdamW to distributed state dict unit tests (#121774 ) Thanks @fegin for removing the fsdp root module check in DCP to unblock test updates. https://github.com/pytorch/pytorch/pull/121544 This PR adds "optimzer_class" as a kwarg for the subtests of the following tests to add AdamW as an option. - test_fsdp - test_compiled_fsdp - test_fsdp2 - test_ddp - test_fsdp_ddp - test_cpu_offload_full_state_dict In addition, we temporarily remove the two _verify_osd_by_load calls in _test_save_load, as state dict loading seems to affect parameters. Created issue https://github.com/pytorch/pytorch/issues/121186 to keep track. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121774 Approved by: https://github.com/Skylion007 ghstack dependencies: #121773	2024-03-15 03:33:33 +00:00
PaliC	efbeefbb84	[executorch] Make trymerge force merges actually work with executorch (#121920 ) This PR addresses an issue with the trymerge function for executorch, which currently uses Facebook CLA instead of Easy CLA. This bug has been patched in #121921. However, the patch is potentially controversial, and we still want to verify Facebook CLA if it exists. Therefore, this PR includes Facebook CLA in our set of mandatory checks. Additionally, this PR removes Facebook CLA from one of the mocks. This change is necessary because the specific PR used for testing fails due to the presence of Facebook CLA in the mock. ## Testing: We run `find_matching_merge_rule(pr = GitHubPR("pytorch", "executorch", 2326), skip_mandatory_checks=True, skip_internal_checks=True)` to check if things work https://pastebin.com/HHSFp2Gw Pull Request resolved: https://github.com/pytorch/pytorch/pull/121920 Approved by: https://github.com/huydhn	2024-03-15 03:21:44 +00:00
Animesh Jain	a623666066	[dynamo][compile-time] Make output_graph new_var linear (#121858 ) Fixes https://github.com/pytorch/pytorch/issues/121679 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121858 Approved by: https://github.com/jansel	2024-03-15 03:20:04 +00:00
haozhe.zhu	3bc2bb6781	use two pass reduction for deterministic reduction order (#115620 ) ## Motivation Address the [non-deterministic reduction order](https://github.com/pytorch/pytorch/issues/93542#issuecomment-1411294181) issue for `omp parallel reduction`. ## Latest update on 1.15: `55d81901bc`. Do not reduce to arr in loops. Instead, reduce to a local scalar and write it to arr after the local reduction is done. This allows the compiler to keep the reduction variable in a register instead of reading/writing it from memory. If the `working set` of `loop body` is quite large, `read/write from register/memory` will have a large gap. ``` vaddss (%xmm0, %xmm11, %xmm11) -> accumulate in register %xmm0 vaddssl ((%rdx, %rdi, 4), %xmm0, %xmm0) -> accumulate in memory address (%rdx, %rdi, 4) ``` Example code: ``` tmp0_acc_arr[64]; #pragma omp parallel num_threads(64) { auto tid = omp_get_thread_num(); #pragma omp for for(...){ .... tmp0_acc_arr[tid] = tmp0_acc_arr[tid] + tmp_x; // array access always goes through memory } } ``` will be changed to ``` tmp0_acc_arr[64]; #pragma omp parallel num_threads(64) { auto tid = omp_get_thread_num(); auto tmp0_acc_local = 0; #pragma omp for for(...){ .... tmp0_acc_local = tmp0_acc_local + tmp_x; } tmp0_acc_arr[tid] = tmp0_acc_local; } ``` ## Descriptions Follow aten in using `two pass reduction` with `omp parallel` for deterministic reduction order. 
`9c3ae37fc4/aten/src/ATen/Parallel-inl.h (L39)` `9c3ae37fc4/aten/src/ATen/native/TensorIteratorReduce.cpp (L24)` ``` float tmp_acc0 = 0; at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0); // init reduction buffer per thread float tmp_acc0_arr[64]; at::vec::Vectorized<float> tmp_acc0_vec_arr[64]; for (int tid = 0; tid < 64; tid++) { tmp_acc0_arr[tid] = 0; tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0); } #pragma omp parallel num_threads(64) { int tid = omp_get_thread_num(); #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(3964928L); x0+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0)); auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0)); auto tmp2 = tmp0 - tmp1; auto tmp3 = tmp2 * tmp2; // reduce to per thread buffers tmp_acc0_vec_arr[tid] = tmp_acc0_vec_arr[tid] + tmp3; } } // second pass reduce for (int tid = 0; tid < 64; tid++) { tmp_acc0 = tmp_acc0 + tmp_acc0_arr[tid]; tmp_acc0_vec = tmp_acc0_vec + tmp_acc0_vec_arr[tid]; } tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec); out_ptr0[static_cast<long>(0L)] = static_cast<float>(tmp_acc0); ``` ## Test results I test this PR with dynamo benchmark on 32-core ICX system, Result (avg speed up): \| \| before this PR \| after this PR \| \| ---- \| ---- \| ---- \| \| torchbench \| 1.303 \| 1.301 \| \| hugginface \| 1.346 \| 1.343 \| \| timms \| 1.971 \| 1.970 \| ``` export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1" export KMP_AFFINITY=granularity=fine,compact,1,0 export KMP_BLOCKTIME=1 multi_threads_test() { CORES=$(lscpu \| grep Core \| awk '{print $4}') export 
OMP_NUM_THREADS=$CORES end_core=$(expr $CORES - 1) numactl -C 0-${end_core} --membind=0 python benchmarks/dynamo/${SUITE}.py --${SCENARIO} --${DT} -dcpu -n50 --no-skip --dashboard --only "${MODEL}" ${Channels_extra} ${BS_extra} ${Shape_extra} ${Mode_extra} ${Wrapper_extra} ${Flag_extra} --timeout 9000 --backend=inductor --output=${LOG_BASE}/${SUITE}.csv } SCENARIO=performance DT=float32 export TORCHINDUCTOR_FREEZING=1 Flag_extra="--freezing" Mode_extra="--inference" for suite in timm_models huggingface torchbench do export SUITE=$suite echo $SUITE export LOG_BASE=`date +%m%d%H%M%S` mkdir $LOG_BASE multi_threads_test done ``` System info ``` ubuntu@ip-172-31-18-205:~/hz/pytorch$ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 46 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 64 On-line CPU(s) list: 0-63 Vendor ID: GenuineIntel Model name: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz CPU family: 6 Model: 106 Thread(s) per core: 2 Core(s) per socket: 32 Socket(s): 1 Stepping: 6 BogoMIPS: 5800.00 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic mo vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xs aveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities Virtualization features: Hypervisor vendor: KVM Virtualization type: full Caches (sum of all): L1d: 1.5 MiB (32 instances) L1i: 1 MiB (32 instances) L2: 40 MiB 
(32 instances) L3: 54 MiB (1 instance) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-63 Vulnerabilities: Gather data sampling: Unknown: Dependent on hypervisor status Itlb multihit: Not affected L1tf: Not affected Mds: Not affected Meltdown: Not affected Mmio stale data: Mitigation; Clear CPU buffers; SMT Host state unknown Retbleed: Not affected Spec rstack overflow: Not affected Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence Srbds: Not affected Tsx async abort: Not affected ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/115620 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-03-15 02:03:10 +00:00
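The two-pass scheme described in #115620 can be sketched in plain Python. This is only an illustration of the idea, not the code Inductor generates: the function name `two_pass_sum`, the fixed contiguous chunking per thread id, and the shuffled loop that stands in for threads finishing in arbitrary order are all assumptions made for the example.

```python
import math
import random


def two_pass_sum(values, num_threads=4):
    """Sum `values` so the result does not depend on thread completion order.

    Assumption: each (simulated) thread id owns a fixed, contiguous chunk,
    similar to a static loop schedule.
    """
    chunk = (len(values) + num_threads - 1) // num_threads
    partial = [0.0] * num_threads  # one reduction slot per thread

    # First pass: each "thread" reduces its own chunk into a local scalar
    # and writes it to its slot once. The shuffle simulates threads
    # finishing in an arbitrary order.
    order = list(range(num_threads))
    random.shuffle(order)
    for tid in order:
        local = 0.0
        for v in values[tid * chunk:(tid + 1) * chunk]:
            local += v
        partial[tid] = local

    # Second pass: combine the per-thread slots in a fixed order (by tid),
    # so floating-point rounding is identical on every run.
    total = 0.0
    for tid in range(num_threads):
        total += partial[tid]
    return total


values = [0.1 * i for i in range(1000)]
results = {two_pass_sum(values) for _ in range(20)}
print(len(results))  # a single distinct value across all runs
```

Because floating-point addition is not associative, combining partial results in whatever order threads happen to finish can change the low bits of the sum; fixing both the chunk ownership and the order of the second pass removes that source of variation.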
PyTorch MergeBot	0cd094a4fd	Revert "[aoti] Fix compilation bug for buffer mutations (#121688 )" This reverts commit 9f314d4aa82169ee552ae2a8ad701bd0441a12b7. Reverted https://github.com/pytorch/pytorch/pull/121688 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/121688#issuecomment-1998740094))	2024-03-15 01:34:04 +00:00
Yifu Wang	01d7c948e2	Make torch/_inductor/comms.py recognize native funcol IRs as collective IRs (#118498 ) ### Summary As title. After this PR, Inductor should recognize native funcol IRs as collectives wherever the existing funcol IRs are recognized as collectives. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118498 Approved by: https://github.com/wanchaol	2024-03-15 01:24:36 +00:00
Jason Ansel	60ccf81490	[dynamo] Refactor update_block_stack into a separate function (#121810 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121810 Approved by: https://github.com/williamwen42 ghstack dependencies: #121790	2024-03-15 01:01:05 +00:00
Jason Ansel	1e9a7df8fe	[dynamo] Compile time optimizations in tx.step() (#121790 ) `python benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py` - Before: `symbolic_convert_overhead_stress_test: 10.7s` - After: `symbolic_convert_overhead_stress_test: 8.6s` `tx.step()` is a small part of that benchmark, so likely the speedup in that isolated function is larger than the top line. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121790 Approved by: https://github.com/oulgen	2024-03-15 01:01:05 +00:00
João Gouveia	1afa8e0985	Fix #83153 : torch.nn.hardtanh allowed min_val to be greater than max_val (#121627 ) Fixes #83153 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121627 Approved by: https://github.com/albanD	2024-03-15 00:57:45 +00:00
Wanchao Liang	710446b1eb	[dtensor] refactor and generalize stack strategy (#121869 ) This PR rewrites the stack strategy to be more general. The follow pattern for stack/cat-like strategies needs to be smarter, i.e. it should be able to identify: 1. PR, PP, RP -> follow PP 2. RR, SR, RS -> follow SS So this PR refactors how the follow strategy should work, and makes sure we start by following the strategy that incurs the lowest cost, i.e. for multiple PR, RP placements, we should be able to further delay the pending sum reductions Pull Request resolved: https://github.com/pytorch/pytorch/pull/121869 Approved by: https://github.com/awgu	2024-03-15 00:34:25 +00:00
Animesh Jain	92ed8553a6	Revert "Switch cudagraph backend to cudagraph trees (#121019 )" and "Add Cudagraphs disable checking (#121018 )" (#121864 ) This reverts commit 9373ad0bb87b364375a468c296d2daef0e8817d7. Revert "Add Cudagraphs disable checking (#121018)" This reverts commit 4af0e634bf02309583dfe3b5c3421442fda5ec7e. Causes compilation time increase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121864 Approved by: https://github.com/eellison	2024-03-15 00:03:09 +00:00
Scott Wolchok	d604ab81a2	[PyTorch] Fix static runtime sigrid_hash precomputed multiplier pass (#120851 ) This pass was broken. Differential Revision: [D54336561](https://our.internmc.facebook.com/intern/diff/D54336561/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D54336561/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/120851 Approved by: https://github.com/houseroad	2024-03-15 00:02:38 +00:00
David Berard	cceabe873f	[jit] ClassType hashing: hash on compilation_unit as well (#121928 ) Following up on #121874 - it turns out that in our case, we're seeing repeated class names that are from different compilation units. Our previous hash function wasn't considering the compilation unit, leading to hash collisions (and then exponential memory usage in the number of copies of this class name) Differential Revision: [D54916455](https://our.internmc.facebook.com/intern/diff/D54916455) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121928 Approved by: https://github.com/eellison ghstack dependencies: #121874	2024-03-14 23:16:08 +00:00
David Berard	2d9cee20a2	[jit] AliasDB type hash - don't always return 0 (#121874 ) This hash was missing an assignment, so for almost all types it was returning "0". c10::flat_hash_map turns out to have really bad behavior with a terrible hash like this, nearly exponential in memory usage. Differential Revision: [D54916424](https://our.internmc.facebook.com/intern/diff/D54916424) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121874 Approved by: https://github.com/eellison	2024-03-14 23:16:08 +00:00
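The effect of a hash that almost always returns 0 (#121874) can be illustrated with a small Python sketch. It uses Python's built-in `dict`, not `c10::flat_hash_map`, so it only shows the collision behavior in terms of key comparisons; the memory blow-up described in the commit is specific to that container. The `Key` class and `count_comparisons` helper are hypothetical names for this example.

```python
class Key:
    """Hash-table key whose hash is either constant or derived from its name."""

    eq_calls = 0  # counts equality checks triggered by hash collisions

    def __init__(self, name, constant_hash):
        self.name = name
        self.constant_hash = constant_hash

    def __hash__(self):
        # A hash that always returns 0 puts every key in the same probe chain.
        return 0 if self.constant_hash else hash(self.name)

    def __eq__(self, other):
        Key.eq_calls += 1
        return self.name == other.name


def count_comparisons(n, constant_hash):
    """Insert n distinct keys and report how many equality checks were needed."""
    Key.eq_calls = 0
    table = {}
    for i in range(n):
        table[Key(f"type{i}", constant_hash)] = i
    return Key.eq_calls


bad = count_comparisons(200, constant_hash=True)
good = count_comparisons(200, constant_hash=False)
print(bad, good)  # the constant hash needs far more comparisons
```

With a constant hash, every insertion has to be compared against the keys already stored, so the work grows quadratically with the number of keys, whereas a well-distributed hash rarely triggers a comparison at all.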
Tugsbayasgalan (Tugsuu) Manlaibaatar	57b20c51b9	Don't record autograd state ops while torch.compile in pre-dispatch export (#121736 ) Summary: Refer to OSS PR for details Test Plan: CI Differential Revision: D54812833 In pre-dispatch export, we have a special proxy torch mode where we intercept the torch._C._set_grad_enabled op to correctly capture the user's intention on train/eval. However, this is a bit problematic when we are tracing torch.cond during export, as it calls torch.compile internally. As a result, we end up capturing unwanted autograd context manager calls that are happening inside dynamo framework code because the top level tracer is still active. We fix it by turning off this proxy torch mode. We can still capture autograd ops inside cond branches because dynamo will translate them into HOP for us, so we don't have to intercept with special proxy mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121736 Approved by: https://github.com/anijain2305, https://github.com/ydwu4	2024-03-14 23:06:10 +00:00
Bin Bao	bd7beef529	[Inductor] Update the cpp_wrapper entry function signature (#121745 ) Summary: Update the entry function to use AtenTensorHandle instead of at::Tensor. This makes the compilation of the generated cpp wrapper code much faster: test_cpu_cpp_wrapper.py from 35 min to 21 min, and test_cuda_cpp_wrapper.py from 21 min to 14 min. Differential Revision: [D54818715](https://our.internmc.facebook.com/intern/diff/D54818715) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121745 Approved by: https://github.com/chenyang78, https://github.com/jansel ghstack dependencies: #121523, #121743, #121744	2024-03-14 22:23:00 +00:00
Bin Bao	8be80706b4	[AOTI] Add pybind for tensor_converter util functions (#121744 ) Differential Revision: [D54818716](https://our.internmc.facebook.com/intern/diff/D54818716) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121744 Approved by: https://github.com/chenyang78 ghstack dependencies: #121523, #121743	2024-03-14 22:20:51 +00:00
Bin Bao	46493ee9b5	[AOTI][refactor] Update tensor_converter util functions (#121743 ) Summary: Update the signature of unsafe_alloc_new_handles_from_tensors and alloc_tensors_by_stealing_from_handles. This is a preparation step towards adding pybind for these two functions, which will be used by cpp_wrapper JIT Inductor. Differential Revision: [D54818717](https://our.internmc.facebook.com/intern/diff/D54818717) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121743 Approved by: https://github.com/chenyang78 ghstack dependencies: #121523	2024-03-14 22:17:54 +00:00
David Berard	3df1b3b0ad	[jit] support getattr/hasattr on NamedTuple (#121863 ) getattr is already supported on objects and, for the most part, on NamedTuples. The only remaining gap seems to be that hasattr only accepted objects, not NamedTuples. This PR adds support, and adds some basic tests. Differential Revision: [D54888612](https://our.internmc.facebook.com/intern/diff/D54888612) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121863 Approved by: https://github.com/eellison	2024-03-14 22:07:28 +00:00
Bin Bao	818b14025a	[AOTI][refactor] Remove is_legacy_abi_kernel and abi_compatible_kernel (#121523 ) Summary: is_legacy_abi_kernel was used for _scaled_dot_product_flash_attention fallback. It is only needed for C shim kernel name matching now, and the name matching is done with a direct string comparison. Also consolidate the fallback cpp kernel naming logic in CppWrapperCpu. Differential Revision: [D54727789](https://our.internmc.facebook.com/intern/diff/D54727789) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121523 Approved by: https://github.com/chenyang78	2024-03-14 22:05:38 +00:00
Yanbo Liang	43e243180b	Add gpt-fast as a static benchmark (#121886 ) Run: ``` python benchmarks/gpt_fast/benchmark.py ``` It generates a csv file ```gpt_fast_benchmark.csv``` with content like: ``` name,mode,target,actual,percentage Llama-2-7b-chat-hf,bfloat16,104,103.458618,99.48% Llama-2-7b-chat-hf,int8,155,158.964615,102.56% Mixtral-8x7B-v0.1,int8,97,99.760132,102.85% ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121886 Approved by: https://github.com/Chillee	2024-03-14 21:46:59 +00:00
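The sample CSV from #121886 can be read back with the standard library as a quick check. The commit message does not state how the `percentage` column is computed; the sketch below assumes it is `actual / target`, which matches the three sample rows. The helper name `recompute_percentages` is hypothetical.

```python
import csv
import io

# Sample rows copied from the commit message above.
CSV_TEXT = """name,mode,target,actual,percentage
Llama-2-7b-chat-hf,bfloat16,104,103.458618,99.48%
Llama-2-7b-chat-hf,int8,155,158.964615,102.56%
Mixtral-8x7B-v0.1,int8,97,99.760132,102.85%
"""


def recompute_percentages(text):
    """Return (reported, recomputed) percentage strings for each row.

    Assumption: percentage = actual / target, formatted with two decimals.
    """
    pairs = []
    for row in csv.DictReader(io.StringIO(text)):
        ratio = float(row["actual"]) / float(row["target"])
        pairs.append((row["percentage"], f"{ratio:.2%}"))
    return pairs


for reported, recomputed in recompute_percentages(CSV_TEXT):
    print(reported, recomputed)
```

Under that assumption, a value above 100% means the measured number exceeded the target for that model and mode.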
chentianyi16	0e68eb1505	Add privateuseone flags for c10::EventFlag (#121118 ) Fixes #117341 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121118 Approved by: https://github.com/albanD	2024-03-14 20:07:12 +00:00
angelayi	9f314d4aa8	[aoti] Fix compilation bug for buffer mutations (#121688 ) I realized there's a bug when unlifting buffer mutations in AOTI. However, there seems to be a bug during tracing where AOTI mutates the buffer. I didn't take the time to investigate, so I left it as a TODO for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121688 Approved by: https://github.com/chenyang78	2024-03-14 19:35:26 +00:00
Sherlock Huang	0636c11811	[AOTInductor] Include build cmds at the end of wrapper file (#121872 ) Summary: For easier debugging, include build commands at the end of codegen wrapper. {F1468438991} Test Plan: CI Differential Revision: D54882164 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121872 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-03-14 18:41:17 +00:00
Zhengxu Chen	c409292197	[sigmoid] Use deserializer from oss. (#121839 ) Summary: Old path: thrift -> thrift deserializer -> graph module. new path: thrift -> python dataclass -> oss deserializer -> graph_module Test Plan: CI buck2 test mode/dev-nosan caffe2/test/inductor/fb:test_aot_inductor_pt2_inference Reviewed By: SherlockNoMad Differential Revision: D54855251 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121839 Approved by: https://github.com/angelayi	2024-03-14 18:38:58 +00:00
Bin Bao	499136a4dd	[Inductor] Fix a dynamic shape problem when lowering diagonal (#121881 ) Summary: when computing the diagonal size, we need to use correct symbolic min/max function. Differential Revision: [D54884899](https://our.internmc.facebook.com/intern/diff/D54884899) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121881 Approved by: https://github.com/aakhundov	2024-03-14 18:36:37 +00:00
Angela Yi	5b1642516f	[with_effects] Skip over profiler.record_function_exit (#121829 ) Summary: tldr: User calls to `torch.autograd.profiler.record_function` fail when tracing with non-strict pre-dispatch export due to an effect token failure, so the solution is to skip over these operators 😅 Some user code contains calls to a `torch.autograd.profiler.record_function` context, like https://fburl.com/code/uesgknbq and https://fburl.com/code/iogbnsfw, which is used for adding user-defined events into the profiler. Currently these function calls will be skipped/removed in dynamo (https://fburl.com/code/fkf7qmai) but non-strict pre-dispatch export will hit these operators during tracing. However, it seems that although these operators get hit by the dispatcher, they don't actually show up in the final graph (maybe they get DCE-d). However, an issue comes up with a recent change with effect tokens (D54639390) which creates tokens if it sees a ScriptObject during tracing. The operator `torch.ops.profiler.record_function_exit` takes in a ScriptObject, so the effect tokens framework now tries to add an effect token to this operator, but results in the following error: (https://www.internalfb.com/intern/everpaste/?handle=GI-hvBknzj2ZxYkBABNzdztDxJVAbsIXAAAB, P1195258619) The reason is that this operator only gets hit during pre-dispatch, not post-dispatch tracing. During pre-dispatch tracing, we first trace using post-dispatch to collect metadata needed for functionalization, and then we do pre-dispatch tracing to construct the graph. The metadata collection phase is also when we determine what operators need effect tokens and create those tokens. However, since the operator only shows up in pre-dispatch tracing, we do not create any tokens.
During the actual pre-dispatch tracing to create the graph, we then run into this operator and try to get a token, but none exist, causing an error :( This PR just blocks the record_function operator from being looked at by the effect tokens framework. But a proper fix might be to have functionalization run on the pre-dispatch graph or have the operator also show up in the post-dispatch graph. But since in the PT2 stack dynamo just gets rid of this operator so that it won't show up anywhere downstream, I think we can also just ignore this operator. Test Plan: Fixed test for P1195258619 Differential Revision: D54857444 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121829 Approved by: https://github.com/BoyuanFeng, https://github.com/tugsbayasgalan	2024-03-14 18:09:43 +00:00
Sherlock Huang	f1f7c5c31e	[ez] Document for add_var_to_val (#121850 ) Summary: Add doc for ShapeEnv.add_var_to_val Test Plan: doc only change Reviewed By: izaitsevfb Differential Revision: D54872335 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121850 Approved by: https://github.com/izaitsevfb	2024-03-14 18:01:09 +00:00
Jean Schmidt	4c3a052acf	[BE] Add S3 bucket argument to number of workflows (#121907 ) Namely, it adds the `s3-bucket` argument (with the default value set to `gha-artifacts`) to the following workflows: - _docs - _linux-test workflows - download-build-artifacts - pytest-cache-download - upload-test-artifacts This is a prerequisite for migrating to other S3 buckets for asset storage, and one of the required steps for migrating to ARC and moving our assets away from our S3 to the Linux Foundation S3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121907 Approved by: https://github.com/malfet	2024-03-14 17:57:05 +00:00
Andrew Gu	38d7d366b9	[FSDP2] Added 2D DCP save/load test (#121747 ) To prepare for FSDP2 + TP/SP in torchtrain, we should verify that we can resume training correctly with DCP save/load. For loading into a new model/optimizer instance, torchtrain uses lightweight `ModelWrapper` and `OptimizerWrapper`. In the added unit test, we use `get_optimizer_state_dict` directly to show the minimal requirement for correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121747 Approved by: https://github.com/wz337	2024-03-14 17:24:17 +00:00
Shuqiang Zhang	443444dc7f	[c10d] Add generic scuba logging capability into c10d (#121859 ) Summary: This diff tries to periodically (e.g., every 30s) log critical collective progress status to a Scuba table, starting from a few metrics such as the last enqueued seq id. With the Scuba table, it is our hope that we can easily detect the straggler of a PG, e.g., the rank that has not progressed its seq_ for X seconds while other ranks in the same PG have a larger seq_. The implementation needs to make sure that Scuba will be used only for FB internal use cases. For OSS, we still provide a generic logger data struct and logger that can be easily extended. If users do not register the logger, nothing will be logged. Test Plan: Re-use the existing unit test for fb side of operations, such as test_register_and_dump in test_c10d_manifold and change the dump period to a very small number, e.g., 1ms, verified that the logs are correctly shown in the Scuba table: https://fburl.com/scuba/c10d_work_update/9trhwnmy Reviewed By: wconstab Differential Revision: D54556219 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121859 Approved by: https://github.com/wconstab	2024-03-14 16:03:45 +00:00
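The straggler rule described in the commit above (a rank whose seq has not progressed for X seconds while other ranks in the same PG have a larger seq) can be sketched as follows. This is an illustration only: `RankStatus`, `find_stragglers`, and their fields are hypothetical names, not the c10d logger data struct added by the PR.

```python
from dataclasses import dataclass


@dataclass
class RankStatus:
    """Hypothetical per-rank record, as it might be read back from a log table."""
    rank: int
    last_enqueued_seq: int
    last_progress_time: float  # seconds, time the seq last advanced


def find_stragglers(statuses, now, stall_seconds):
    """Return ranks that are behind the group's max seq and have been stalled.

    A rank is flagged when (a) some other rank has a larger seq and
    (b) its own seq has not advanced for at least `stall_seconds`.
    """
    if not statuses:
        return []
    max_seq = max(s.last_enqueued_seq for s in statuses)
    return [
        s.rank
        for s in statuses
        if s.last_enqueued_seq < max_seq
        and now - s.last_progress_time >= stall_seconds
    ]


if __name__ == "__main__":
    group = [
        RankStatus(rank=0, last_enqueued_seq=42, last_progress_time=100.0),
        RankStatus(rank=1, last_enqueued_seq=37, last_progress_time=55.0),
        RankStatus(rank=2, last_enqueued_seq=41, last_progress_time=99.0),
    ]
    print(find_stragglers(group, now=101.0, stall_seconds=30.0))
```

Rank 2 is behind but still progressing, so only the rank that is both behind and stalled gets reported.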
Aleksandar Samardžić	83f8e51404	Add CUTLASS kernel as choice for (u)int8/(b)float16 mixed MM autotuning (#119986 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119986 Approved by: https://github.com/kadeng ghstack dependencies: #119685	2024-03-14 16:03:10 +00:00
Jeff Daily	be0bdf111c	relax tol for flaky nansum_out_dtype_cuda_float32 test (#121550 ) TestReductionsCUDA.test_nansum_out_dtype_cuda_float32 would fail or pass depending on the random inputs. Observed by ROCm internal QA testing. But the same problematic random inputs break the test for CUDA, verified on V100. There is precedent in another test within the same file to relax tolerance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121550 Approved by: https://github.com/albanD	2024-03-14 15:28:45 +00:00
Andrey Talman	7e13b5ba29	Checkout release branch rather than commit_hash when building triton release (#115379 ) (#121901 ) Cherry pick of https://github.com/pytorch/pytorch/pull/115379 from Release 2.2 that should be applied to main and Release 2.3 as well Pull Request resolved: https://github.com/pytorch/pytorch/pull/121901 Approved by: https://github.com/DanilBaibak, https://github.com/jeanschmidt	2024-03-14 14:42:29 +00:00
andoorve	956059fa2e	[Fix] Fixed behaviour for the conversion of complex tensors to bool (#121803 ) Fixes #120875 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121803 Approved by: https://github.com/lezcano	2024-03-14 13:35:15 +00:00
Aleksandar Samardžić	1251f0fa31	Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119685 Approved by: https://github.com/cpuhrsch, https://github.com/kadeng	2024-03-14 13:25:23 +00:00
Nikita Shulga	38d9bb5abc	Make PyTorch compilable against upcoming Numpy-2.0 (#121880 ) Test plan: ``` % python -c "import torch;import numpy;print(numpy.__version__, torch.tensor(numpy.arange(3, 10)))" 2.1.0.dev0+git20240312.9de8a80 tensor([3, 4, 5, 6, 7, 8, 9]) % python -c "import torch;print(torch.rand(3, 3).numpy())" [[0.0931946 0.44874293 0.8480404 ] [0.93877375 0.10188377 0.67375803] [0.02520031 0.89019287 0.5691561 ]] ``` Fixes https://github.com/pytorch/pytorch/issues/121798 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121880 Approved by: https://github.com/albanD	2024-03-14 05:36:50 +00:00
Nikita Shulga	b4c53aa0ec	Do not compile FP16 arith internally (#121844 ) Also, decorate unused args with `C10_UNUSED` to fix linter warnings Test Plan: `buck2 build -c fbcode.arch=aarch64 //caffe2:ATen-cpu` Differential Revision: D54870507 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121844 Approved by: https://github.com/osalpekar	2024-03-14 05:19:02 +00:00
Adnan Akhundov	3eb322ff29	Handle transitive replacements in Triton kernel mutation analysis (#121867 ) Summary: Previously, we didn't handle transitive replacements in MLIR walk-based function info mining in the Triton kernel mutation analysis pass. As a result, for the TTIR below: ``` tt.func private @cumsum__fp32S1_16S__1cconstexpr_1__2cconstexpr_False_(%arg0: tensor<1x16xf32> loc("...":296:0)) -> tensor<1x16xf32> attributes {noinline = false} { %0 = "tt.scan"(%arg0) <{axis = 1 : i32, reverse = false}> ({ ^bb0(%arg1: f32 loc(unknown), %arg2: f32 loc(unknown)): %1 = tt.call @_sum_combine__fp32_fp32__(%arg1, %arg2) : (f32, f32) -> f32 loc(#loc16) tt.scan.return %1 : f32 loc(#loc16) }) : (tensor<1x16xf32>) -> tensor<1x16xf32> loc(#loc16) tt.return %0 : tensor<1x16xf32> loc(#loc18) } loc(#loc15) ``` the mined function dict looked like this: ``` {Intermediate(idx=25): [Op(name='tt.call', fn_call_name='_sum_combine__fp32_fp32__', args=[Intermediate(idx=26), Intermediate(idx=26)])], Intermediate(idx=27): [Op(name='tt.scan.return', fn_call_name=None, args=[Intermediate(idx=25)])], Intermediate(idx=-4): [Op(name='tt.return', fn_call_name=None, args=[Intermediate(idx=27)])]} ``` whereas it should look like this (note the `Param(idx=0)` arguments of the `tt.call`): ``` {Intermediate(idx=25): [Op(name='tt.call', fn_call_name='_sum_combine__fp32_fp32__', args=[Param(idx=0), Param(idx=0)])], Intermediate(idx=27): [Op(name='tt.scan.return', fn_call_name=None, args=[Intermediate(idx=25)])], Intermediate(idx=-4): [Op(name='tt.return', fn_call_name=None, args=[Intermediate(idx=27)])]} ``` This is fixed in the PR. Test Plan: ``` $ python test/inductor/test_triton_kernels.py -k test_cumsum . ---------------------------------------------------------------------- Ran 1 test in 1.771s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121867 Approved by: https://github.com/oulgen	2024-03-14 04:06:37 +00:00
Sam Larsen	4cd503c1f3	Enable FX graph cache for a batch of inductor tests (#121696 ) Summary: Get more FX graph cache coverage by enabling it for these unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/121696 Approved by: https://github.com/eellison	2024-03-14 03:39:59 +00:00
Michael Lazos	15abc56bd5	Graph break on step closure in optimizer (#121777 ) Fixes https://github.com/pytorch/pytorch/issues/116494 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121777 Approved by: https://github.com/yanboliang	2024-03-14 03:18:23 +00:00
Andre Eid	f85f58bf86	Fix quantized linear vulkan tests (#120960 ) Summary: Fixed quantized linear vulkan tests by using an old pack_biases function. Test Plan: Vulkan quantized api tests buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 ... ... ... [ RUN ] VulkanAPITest.linear_2d_flat [ OK ] VulkanAPITest.linear_2d_flat (5 ms) [ RUN ] VulkanAPITest.linear_2d_small [ OK ] VulkanAPITest.linear_2d_small (0 ms) [ RUN ] VulkanAPITest.linear_2d_large [ OK ] VulkanAPITest.linear_2d_large (4 ms) [ RUN ] VulkanAPITest.linear_3d_flat [ OK ] VulkanAPITest.linear_3d_flat (2 ms) [ RUN ] VulkanAPITest.linear_3d_small [ OK ] VulkanAPITest.linear_3d_small (1 ms) [ RUN ] VulkanAPITest.linear_3d_large [ OK ] VulkanAPITest.linear_3d_large (1 ms) [ RUN ] VulkanAPITest.linear_4d_flat [ OK ] VulkanAPITest.linear_4d_flat (1 ms) [ RUN ] VulkanAPITest.linear_4d_small [ OK ] VulkanAPITest.linear_4d_small (1 ms) [ RUN ] VulkanAPITest.linear_4d_large [ OK ] VulkanAPITest.linear_4d_large (2 ms) ... ... [----------] 85 tests from VulkanAPITest (1704 ms total) [----------] Global test environment tear-down [==========] 85 tests from 1 test suite ran. (1704 ms total) [ PASSED ] 85 tests. YOU HAVE 8 DISABLED TESTS Vulkan api tests buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 [----------] Global test environment tear-down [==========] 426 tests from 1 test suite ran. (4997 ms total) [ PASSED ] 423 tests.
[ SKIPPED ] 1 test, listed below: [ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log [ FAILED ] 2 tests, listed below: [ FAILED ] VulkanAPITest.log_softmax_underflow [ FAILED ] VulkanAPITest.log_softmax Differential Revision: D54396367 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120960 Approved by: https://github.com/yipjustin	2024-03-14 02:23:00 +00:00
Le-Zheng	a37caa6ed3	[Quant][Inductor] Enable quantization linear pattern fusion with int8_mixed_bf16 for gelu (#116004 ) Summary Enable QLinear Unary pattern for gelu with int8_mix_bf16 Test plan python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_gelu_int8_mixed_bf16 Co-authored-by: leslie-fang-intel <leslie.fang@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116004 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel ghstack dependencies: #114853, #114854	2024-03-14 01:52:12 +00:00
Le-Zheng	43d68e9c8f	[Quant][Inductor] Enable quantization linear pattern fusion for gelu inside inductor (#114854 ) Summary Enable QLinear Unary pattern for gelu with int8 Test plan python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_gelu_cpu Co-authored-by: leslie-fang-intel <leslie.fang@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/114854 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5 ghstack dependencies: #114853	2024-03-14 01:49:14 +00:00
Le-Zheng	25e00545bb	[Quant][PT2E] Enable linear and linear-unary post-op gelu quant recipe for x86 inductor quantizer (#114853 ) Summary Add Gelu for linear-unary post-op quantization recipe to x86 inductor quantizer. Test plan python -m pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_linear_unary_gelu python test/test_quantization.py -k test_linear_unary_with_quantizer_api Co-authored-by: leslie-fang-intel <leslie.fang@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/114853 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jerryzh168	2024-03-14 01:46:35 +00:00
Oguz Ulgen	a04e7fca8e	Use memcache versioning for autotune remote cache (#121748 ) Summary: Internal training platform doesn't get updated very frequently, so let's use versioning for memcache. Test Plan: existing tests Differential Revision: D54818197 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121748 Approved by: https://github.com/aakhundov, https://github.com/jansel	2024-03-14 00:36:10 +00:00
Will Constable	7e076c75bd	[C10D] Fix coalescedCollective op Flight Recording (#120430 ) Also noticed and filed https://github.com/pytorch/pytorch/issues/120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120430 Approved by: https://github.com/kwen2501	2024-03-13 23:55:00 +00:00
PyTorch MergeBot	bf7ac4ddf7	Revert "[export] allow Dim(1,2) for export dynamic shapes (#121642 )" This reverts commit a8dcbf2749f2081f939621db2d38fd15ab7e34a3. Reverted https://github.com/pytorch/pytorch/pull/121642 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/121642#issuecomment-1996121710))	2024-03-13 23:51:20 +00:00
drisspg	3e02a7efcd	Only FA2 doesn't support attn-mask (#121825 ) Fixes #121783 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121825 Approved by: https://github.com/cpuhrsch	2024-03-13 23:03:39 +00:00
Pian Pawakapan	a8dcbf2749	[export] allow Dim(1,2) for export dynamic shapes (#121642 ) Current dynamic shapes implementation fixes lower range of Dims to be 2 for analysis, but allows 0/1 shapes during runtime. This leads to failures when initializing Dim(1,2). This PR sets the lower bound to 0, and avoids erroring out when conflicting with the generated (2, maxsize) constraint during analysis. Also resolves a derived dim constraints issue with the following code: ``` class Bar(torch.nn.Module): def forward(self, x, y): return x + y[1:] dx = Dim("dx", min=1, max=3) ep = export( Bar(), (torch.randn(2, 2), torch.randn(3, 2)), dynamic_shapes=({0: dx, 1: None}, {0: dx+1, 1: None}) ) print(ep.range_constraints) ``` In main: ``` {s0: ValueRanges(lower=2, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=3, upper=4, is_bool=False)} ``` This PR: ``` {s0: ValueRanges(lower=1, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=2, upper=4, is_bool=False)} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121642 Approved by: https://github.com/avikchaudhuri	2024-03-13 22:59:07 +00:00
PyTorch MergeBot	70c6f542f2	Revert "[dynamo] Convert invalid args into graph breaks (#121784 )" This reverts commit 0df39480f6a74c9094555e8a61a8c8bb01716d4e. Reverted https://github.com/pytorch/pytorch/pull/121784 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks ONNX test in trunk `0c1ac4484d` ([comment](https://github.com/pytorch/pytorch/pull/121784#issuecomment-1995979435))	2024-03-13 22:12:43 +00:00
Boyuan Feng	aaff8d274a	CUDA fast path for `_chunk_cat()` (#120678 ) This PR provides CUDA fast path implementation for ATen Op `_chunk_cat` (#121081). Performance on a production benchmark: - Float16 in, Float16 out: 249 -> 500 - BFloat16 in, BFloat16 out: 248 -> 500 - BFloat16 in, Float32 out: 126 -> 278 - Float32 in, Float32 out: 153 -> 260 - Float64 in, Float64 out: 79 -> 132 - int8 in, int8 out: 332 -> 908 - int16 in, int16 out: 250 -> 489 - int32 in, int32 out: 153 -> 260 - int64 in, int64 out: 79 -> 132 Unit: Billion elements per second. Hardware: H100. Baseline: [Existing FSDP implementation](`7b3febdca7/torch/distributed/_composable/fsdp/_fsdp_collectives.py (L176)`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120678 Approved by: https://github.com/yifuwang	2024-03-13 22:02:06 +00:00
Manuel Candales	c53e3f57b5	allow fp16 in quant/dequant decompositions (#121738 ) Test Plan: ``` buck2 run mode/dev-nosan mode/inplace executorch/examples/models/llama2:export_llama -- -c ~/llama/ultra_new_checkpoint.pt -p ~/llama/params.json -kv -E 8,8 -d fp16 --pt2e_quantize "xnnpack_dynamic" -2 ``` Reviewed By: kirklandsign Differential Revision: D54785950 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121738 Approved by: https://github.com/jerryzh168	2024-03-13 21:45:08 +00:00
Chien-Chin Huang	c7193f4099	[DDP][PT2D][2D] Enable DDP + TP and add test for compiled DDP + TP (#120479 ) This PR enables DDP + TP using a TP internal API. This should not be the final implementation. A more sound implementation is to inline the TP internal API in DDP. In other words, DDP needs to be aware of DTensor so that we can support 2D state_dict. This PR adds a compiled DDP + TP test to ensure the new compiled DDP fusion doesn't break TP all_reduce. TODOs - [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass. - [x] Add unit tests to ensure the fusion doesn't break DDP + TP. - [ ] Group different PG and data type of all_reduces. - [ ] Mixed precision support and tests - [ ] Implement the fusions with Inductor IR. - [ ] Add auto bucketing based on Inductor profiling. Differential Revision: [D54105050](https://our.internmc.facebook.com/intern/diff/D54105050/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120479 Approved by: https://github.com/wz337 ghstack dependencies: #113209	2024-03-13 21:41:22 +00:00
Sherlock Huang	dd568f4207	[Export, AOTInductor] Populate ShapeEnv's var_to_val during deserialization (#121759 ) Summary: Deserialization didn't populate ShapeEnv's `var_to_val` field properly, and AOTInductor relies on this field to compile dynamic shapes properly. As a result, AOTI failed to compile a deserialized ExportedProgram. Test Plan: buck2 test mode/dev-nosan caffe2/test/inductor/fb:test_aot_inductor_pt2_inference Differential Revision: D54559494 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121759 Approved by: https://github.com/avikchaudhuri	2024-03-13 21:28:25 +00:00
PyTorch MergeBot	a2a4693c1b	Revert "Init CUDA instead of faking memory stats (#121698 )" This reverts commit 2460f0b1c7bb6e088aca1f6e9bb62c834053d71b. Reverted https://github.com/pytorch/pytorch/pull/121698 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks inductor CPU tests `5b90074540` ([comment](https://github.com/pytorch/pytorch/pull/121698#issuecomment-1995868090))	2024-03-13 21:23:42 +00:00
PyTorch MergeBot	45a835cef2	Revert "[compiled autograd] free stack objects before calling compiled graph (#121707 )" This reverts commit 5b90074540577267c29f5f784be123ee54f6491d. Reverted https://github.com/pytorch/pytorch/pull/121707 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks inductor CPU tests `5b90074540` ([comment](https://github.com/pytorch/pytorch/pull/121698#issuecomment-1995868090))	2024-03-13 21:23:42 +00:00
Simon Fan	8b1b61bc70	[compiled autograd] support custom ops backed by c++ autograd::Function (#120681 ) - Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm - Include files more granularly to avoid namespace pollution and circular imports limitations: - requires user to audit their code and opt-in their custom autograd::Function via autograd::Function::is_traceable and maybe additional compiled_args + apply_with_saved implementation. this was the only way I can think of for soundness - will throw if we can't hash the saved_data i.e. for any non implemented type other than list and dict in at::IValue::hash `b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)` - can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function, yet that has an identical implementation, is called. this case seems extremely unlikely, and the only alternative to hash collision i can think of is compiling with reflection - tensors not saved via save_variables are not lifted, and are specialized on TensorImpl*'s hash (treated as a memory address). if needed, we can lift them. Differential Revision: [D54818488](https://our.internmc.facebook.com/intern/diff/D54818488) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681 Approved by: https://github.com/jansel	2024-03-13 21:13:21 +00:00
Oguz Ulgen	58ff55aac5	Add support for tt.scan to triton kernel mutation analysis (#121828 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121828 Approved by: https://github.com/aakhundov, https://github.com/Skylion007	2024-03-13 20:37:56 +00:00
Chien-Chin Huang	8e6d572b4e	[DDP][PT2D] Allreduce fusion fx pass using concat and all_reduce_coalesced (#113209 ) Differential Revision: [D49858057](https://our.internmc.facebook.com/intern/diff/D49858057/) TL;DR This PR implements 2 different DDP all_reduce fusions in Inductor post_grad fx passes. The two fusions are 1) fusion with concat op and 2) fusion with all_reduce_coalesced. When DDP detects that the Python reducer is being used, DDP will automatically turn on the fusion. This PR does not invent any algorithm and simply reflects the bucket size users set for DDP. Implementation Details Fusion with concat op The idea of this fusion is to use a concat op to concatenate all the gradients into one tensor and perform one `all_reduce`. After the `wait` op of the `all_reduce`, splitting and reshaping will also be performed to get the individual gradients. Because DDP needs to perform gradient scaling, the benefit of using this fusion is that we can perform the gradient scaling over the concatenated buffer. Fusion with `all_reduce_coalesced` The idea of this fusion is to use the `all_reduce_coalesced` op to directly perform the `all_reduce` over multiple buffers. This avoids the copy overhead but may not achieve the best NCCL performance. In addition, because there are multiple buffers, we cannot do one simple gradient scaling but have to rely on `foreach_div` to do the gradient scaling. Limitations Current fusions do not distinguish `all_reduce` generated by different DDP modules. This is okay if all DDP instances use the same PG and data type. Support for multiple DDP instances with different PGs and data types will come in later PRs. TODOs - [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass. - [ ] Add unit tests to ensure the fusion doesn't break DDP + TP. - [ ] Group different PG and data type of `all_reduce`s. - [ ] Mixed precision support and tests - [ ] Implement the fusions with Inductor IR.
- [ ] Add auto bucketing based on Inductor profiling. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113209 Approved by: https://github.com/yf225	2024-03-13 20:37:09 +00:00
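The two fusion strategies described in the commit above can be contrasted with a pure-Python simulation. This is a sketch under stated assumptions, not the Inductor fx pass: ranks are simulated in-process, gradients are flat lists of floats, `all_reduce_sum` is a hypothetical stand-in for a real collective, and gradient scaling is assumed to mean dividing by the world size after the reduction.

```python
def all_reduce_sum(buffers):
    """Simulated all_reduce: element-wise sum of one buffer per rank."""
    return [sum(values) for values in zip(*buffers)]


def fuse_with_concat(per_rank_grads):
    """Concatenate each rank's gradients, reduce once, scale once, split back."""
    world_size = len(per_rank_grads)
    sizes = [len(g) for g in per_rank_grads[0]]
    # 1) One flat buffer per rank (this is the copy the coalesced variant avoids).
    flat = [[x for grad in grads for x in grad] for grads in per_rank_grads]
    # 2) A single all_reduce over the concatenated buffer.
    reduced = all_reduce_sum(flat)
    # 3) A single scaling pass over the concatenated buffer.
    reduced = [x / world_size for x in reduced]
    # 4) Split back into individual gradients.
    out, offset = [], 0
    for n in sizes:
        out.append(reduced[offset:offset + n])
        offset += n
    return out


def fuse_coalesced(per_rank_grads):
    """Reduce each buffer without copying, then scale buffer by buffer."""
    world_size = len(per_rank_grads)
    num_grads = len(per_rank_grads[0])
    reduced = [
        all_reduce_sum([grads[i] for grads in per_rank_grads])
        for i in range(num_grads)
    ]
    # Stand-in for a foreach-style division over the list of buffers.
    return [[x / world_size for x in buf] for buf in reduced]


if __name__ == "__main__":
    grads = [
        [[1.0, 2.0], [3.0]],  # rank 0: two gradients
        [[3.0, 4.0], [5.0]],  # rank 1: two gradients
    ]
    print(fuse_with_concat(grads))
    print(fuse_coalesced(grads))
```

Both variants produce the same averaged gradients; they differ in whether an extra copy is made and in how many buffers the scaling step touches.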
Boyuan Feng	0c1ac4484d	Support `call_method` in DDPOptimizer (#121771 ) This PR fixes Issue #111279. While #111279 reported the issue with `MultiheadAttention`, a minimal reproduction would be: ```python class ToyModel(nn.Module): def __init__(self,): super().__init__() self.linear = nn.Linear(128, 10) def forward(self, x: torch.Tensor) -> torch.Tensor: return self.linear.forward(x) # Error # return self.linear(x) # OK ``` Dynamo treats `self.linear(x)` as `call_module` while treating `self.linear.forward(x)` as a [`get_attr` and a `call_method`](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/variables/nn_module.py#L358-L378). However, existing DDPOptimizer assumes, for a `get_attr` node, `getattr(gm, node.target)` gives a tensor with the `requires_grad` attribute. Existing DDPOptimizer also does not support `call_method` nodes. This PR adds support for `call_method` and check on `get_attr`. It also checks if a module's parameters have been added to a bucket to support multiple method calls from the same module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121771 Approved by: https://github.com/yf225	2024-03-13 20:03:15 +00:00
Jason Ansel	0df39480f6	[dynamo] Convert invalid args into graph breaks (#121784 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121784 Approved by: https://github.com/yanboliang ghstack dependencies: #121615, #121616	2024-03-13 20:02:33 +00:00
Simon Fan	5b90074540	[compiled autograd] free stack objects before calling compiled graph (#121707 ) Moved compilation code into _compiled_autograd_impl, frees stack allocated objects e.g. AutogradCompilerCall Pull Request resolved: https://github.com/pytorch/pytorch/pull/121707 Approved by: https://github.com/jansel ghstack dependencies: #121698	2024-03-13 19:31:44 +00:00
Simon Fan	2460f0b1c7	Init CUDA instead of faking memory stats (#121698 ) This is very confusing when checking for memory usage and allocations are only happening using C API. We should change it to a warning/error or just init cuda. Codepaths that run on non-CUDA environments shouldn't call into these functions in the first place Pull Request resolved: https://github.com/pytorch/pytorch/pull/121698 Approved by: https://github.com/jansel	2024-03-13 19:31:44 +00:00
Sam Larsen	cd949d133e	Support setUpClass & tearDownClass with instantiate_device_type_tests() (#121686 ) Summary: instantiate_device_type_tests() creates dynamic test case classes that derive from a "template class". By default, the test harness will call the setUpClass() and tearDownClass() methods defined by the template class (if the template class defines them). We can explicitly create these methods in the dynamic class and arrange to call those methods in both base classes. That allows us to support setUpClass & tearDownClass test classes used with instantiate_device_type_tests(). Pull Request resolved: https://github.com/pytorch/pytorch/pull/121686 Approved by: https://github.com/ezyang, https://github.com/eellison	2024-03-13 18:28:42 +00:00
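The idea in the commit above (explicitly creating `setUpClass` and `tearDownClass` in the dynamic class and calling those methods in both base classes) can be sketched with plain `unittest`. This is an illustration, not the code of `instantiate_device_type_tests()`: `make_dynamic_class`, `DeviceBase`, `TemplateCase`, and the call order are hypothetical.

```python
import io
import unittest

calls = []  # records the order in which class-level hooks run


class DeviceBase(unittest.TestCase):
    """Hypothetical device-specific base class."""

    @classmethod
    def setUpClass(cls):
        calls.append("device setUpClass")

    @classmethod
    def tearDownClass(cls):
        calls.append("device tearDownClass")


class TemplateCase(unittest.TestCase):
    """Hypothetical template class written by the test author."""

    @classmethod
    def setUpClass(cls):
        calls.append("template setUpClass")

    @classmethod
    def tearDownClass(cls):
        calls.append("template tearDownClass")

    def test_something(self):
        self.assertTrue(True)


def make_dynamic_class(name, base, template):
    """Create a class deriving from both bases whose hooks call into both."""

    def set_up_class(cls):
        # __func__ unwraps the bound classmethod so it runs against `cls`.
        base.setUpClass.__func__(cls)
        template.setUpClass.__func__(cls)

    def tear_down_class(cls):
        template.tearDownClass.__func__(cls)
        base.tearDownClass.__func__(cls)

    return type(name, (base, template), {
        "setUpClass": classmethod(set_up_class),
        "tearDownClass": classmethod(tear_down_class),
    })


def run_demo():
    """Run the dynamic class once and return the recorded hook order."""
    calls.clear()
    dynamic = make_dynamic_class("TemplateCaseCPU", DeviceBase, TemplateCase)
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(dynamic)
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return list(calls), result.wasSuccessful()


if __name__ == "__main__":
    print(run_demo())
```

Without the explicit methods, normal attribute lookup would find only the first `setUpClass` in the MRO, so one of the two base classes would be skipped unless it cooperatively called `super()`.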
Isuru Fernando	ffabb25c48	Count the number of entries directly in avg_pool2d lowering (#121429 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121429 Approved by: https://github.com/peterbell10 ghstack dependencies: #116085	2024-03-13 18:19:47 +00:00
Isuru Fernando	a19a05fd1d	Add lowering for avg_pool{1, 3}d (#116085 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116085 Approved by: https://github.com/peterbell10	2024-03-13 18:19:47 +00:00
Catherine Lee	79fac48bb3	Use pytorch bot's labeler (#121762 ) Change corresponds to https://github.com/pytorch/test-infra/pull/4995 Testing (very light) in https://github.com/malfet/deleteme/pull/81 Should help with https://github.com/pytorch/test-infra/issues/4950 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121762 Approved by: https://github.com/huydhn	2024-03-13 17:16:49 +00:00
Michael Lazos	05df03ec1b	Allow custom attributes for torch function subclasses (#121693 ) Added custom attribute access with test Pull Request resolved: https://github.com/pytorch/pytorch/pull/121693 Approved by: https://github.com/anijain2305	2024-03-13 17:01:57 +00:00
Edward Z. Yang	92a2b214f8	Make translation validation more user friendly (#120880 ) Two main changes: - Don't rethrow the exception when we fail in TV, just throw the entire thing and trust the user will inspect logs / backtrace to see we failed in TV - Don't add an event to the TV logs until we've confirmed that the event actually runs without erroring. This prevents us from recording events that, e.g., fail because of a guard on data-dependent size, and then failing in TV. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120880 Approved by: https://github.com/lezcano, https://github.com/ysiraichi	2024-03-13 15:21:59 +00:00
Edward Z. Yang	b1d5998956	Upgrade to tlparse 0.3.7 (#121772 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121772 Approved by: https://github.com/Skylion007	2024-03-13 15:21:20 +00:00
Nikita Shulga	5498804ec2	[MPS] Fix naive matmul for BFloat16 (#121731 ) Will only work on MacOS14 or newer, so compile the shader with `MTLLanguageVersion_3_1` when appropriate Fixes https://github.com/pytorch/pytorch/issues/121583 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121731 Approved by: https://github.com/albanD	2024-03-13 14:34:03 +00:00
Jason Ansel	559ca13b3f	[dynamo] Refactor TorchInGraphFunctionVariable for compile time (#121616 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121616 Approved by: https://github.com/oulgen ghstack dependencies: #121615	2024-03-13 14:21:21 +00:00
PyTorch MergeBot	51cf57c6c6	Revert "Include torch warn in each error in cudnn/Conv_v8.cpp (#120719 )" This reverts commit 5fd7f5c4e336c2c3041e10529990c620cc8cf9a5. Reverted https://github.com/pytorch/pytorch/pull/120719 on behalf of https://github.com/janeyx99 due to sorry but am reverting as this prints unwanted warnings even when an exception is not thrown ([comment](https://github.com/pytorch/pytorch/pull/120719#issuecomment-1994491826))	2024-03-13 14:09:38 +00:00
IvanKobzarev	a157a0d00d	[constraints] Fix scalar type for constraint_range to Long (#121752 ) Differential Revision: [D54822125](https://our.internmc.facebook.com/intern/diff/D54822125) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121752 Approved by: https://github.com/ezyang	2024-03-13 11:11:09 +00:00
Avik Chaudhuri	7fe0cc53e9	make _process_dynamic_shapes an implementation detail (#121713 ) Summary: `_process_dynamic_shapes` converts new dynamic shapes to old constraints, but in the future may not need to do so. Preparing for that future. Test Plan: CI Differential Revision: D54780374 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121713 Approved by: https://github.com/tugsbayasgalan	2024-03-13 08:33:00 +00:00
Shubhraprakash Das	5088e4956e	Add quantized conv transpose2d op (#120151 ) Test Plan: Run vulkan api test: # buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output" # buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_api_test_binAppleMac Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc [==========] Running 418 tests from 1 test suite. [----------] Global test environment set-up. [----------] 418 tests from VulkanAPITest .... [----------] Global test environment tear-down [==========] 418 tests from 1 test suite ran. (4510 ms total) [ PASSED ] 417 tests. [ SKIPPED ] 1 test, listed below: [ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log YOU HAVE 9 DISABLED TESTS Run quantized vulkan api test: Note the linear quantized are failing but all the convolution tests still pass. Linear failures are being debugged. # buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output" # buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc [==========] Running 86 tests from 1 test suite. [----------] Global test environment set-up. [----------] 86 tests from VulkanAPITest ... [ PASSED ] 77 tests. 
[ FAILED ] 9 tests, listed below: [ FAILED ] VulkanAPITest.linear_2d_flat [ FAILED ] VulkanAPITest.linear_2d_small [ FAILED ] VulkanAPITest.linear_2d_large [ FAILED ] VulkanAPITest.linear_3d_flat [ FAILED ] VulkanAPITest.linear_3d_small [ FAILED ] VulkanAPITest.linear_3d_large [ FAILED ] VulkanAPITest.linear_4d_flat [ FAILED ] VulkanAPITest.linear_4d_small [ FAILED ] VulkanAPITest.linear_4d_large 9 FAILED TESTS YOU HAVE 8 DISABLED TESTS Differential Revision: D52344261 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120151 Approved by: https://github.com/yipjustin	2024-03-13 08:09:57 +00:00
Iris Zhang (PyTorch)	e99fa0042c	Back out "[DeviceMesh] Add support for nD slicing (#119752 )" (#121763 ) Summary: Original commit changeset: e52b8809c8d8 Original Phabricator Diff: D54778906 We have to backout this diff. D54778906 seems to be causing test failures for APF blocking trunk health and hence release. Just starting to look at the issue. T182209248 Test Plan: Sandcastle Reviewed By: satgera Differential Revision: D54825114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121763 Approved by: https://github.com/osalpekar	2024-03-13 07:22:08 +00:00
Gao Tianlin	be33d31ae2	add std::ostream& operator<< for BFloat16 in BFloat16.h (#121302 ) This PR moves `operator<<` of `BFloat16` to `BFloat16.h`. Previously, this function was in `TensorDataContainer.h`, so printing a `BFloat16` variable with `std::cout` while debugging required including `TensorDataContainer.h`. This is inconvenient and counterintuitive. Other dtypes, such as `Half`, define their `operator<<` in the headers where they are defined, such as `Half.h`. Therefore, I think it makes more sense to move `operator<<` of `BFloat16` to `BFloat16.h`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121302 Approved by: https://github.com/ezyang	2024-03-13 06:47:34 +00:00
wz337	5986552ebe	[nit][DCP][DSD] Remove variables not being used in test_state_dict.py #121204 (#121773 ) Replacing https://github.com/pytorch/pytorch/pull/121204 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121773 Approved by: https://github.com/Skylion007	2024-03-13 06:35:04 +00:00
Masaki Kozuki	da2a9a0512	`_foreach_copy` with different src/dst dtypes (#121717 ) Fixes #115171 ``` torch.version.git_version = '6bff6372a922fe72be5335c6844c10e2687b967d', torch.cuda.get_device_name() = 'NVIDIA RTX 6000 Ada Generation' [------------------ foreach copy - self: torch.float32 - shape: (512, 512) -----------------] \| src: torch.float32 \| src: torch.float16 \| src: torch.bfloat16 1 threads: ---------------------------------------------------------------------------------- num_tensors: 32 \| 14.2 \| 12.6 \| 12.7 num_tensors: 256 \| 688.0 \| 510.3 \| 514.0 num_tensors: 1024 \| 2768.0 \| 2053.3 \| 2047.7 Times are in microseconds (us). [------------------ foreach copy - self: torch.float16 - shape: (512, 512) -----------------] \| src: torch.float32 \| src: torch.float16 \| src: torch.bfloat16 1 threads: ---------------------------------------------------------------------------------- num_tensors: 32 \| 10.0 \| 8.9 \| 8.8 num_tensors: 256 \| 497.6 \| 344.3 \| 348.3 num_tensors: 1024 \| 1991.9 \| 1392.0 \| 1389.0 Times are in microseconds (us). [----------------- foreach copy - self: torch.bfloat16 - shape: (512, 512) -----------------] \| src: torch.float32 \| src: torch.float16 \| src: torch.bfloat16 1 threads: ---------------------------------------------------------------------------------- num_tensors: 32 \| 10.0 \| 8.8 \| 8.8 num_tensors: 256 \| 497.5 \| 344.5 \| 348.0 num_tensors: 1024 \| 1993.2 \| 1390.4 \| 1387.5 Times are in microseconds (us). [------------------ foreach copy - self: torch.float32 - shape: (515, 515) -----------------] \| src: torch.float32 \| src: torch.float16 \| src: torch.bfloat16 1 threads: ---------------------------------------------------------------------------------- num_tensors: 32 \| 19.0 \| 17.9 \| 18.1 num_tensors: 256 \| 707.2 \| 540.2 \| 543.1 num_tensors: 1024 \| 2900.6 \| 2156.6 \| 2159.2 Times are in microseconds (us). 
[------------------ foreach copy - self: torch.float16 - shape: (515, 515) -----------------] \| src: torch.float32 \| src: torch.float16 \| src: torch.bfloat16 1 threads: ---------------------------------------------------------------------------------- num_tensors: 32 \| 13.8 \| 13.7 \| 13.1 num_tensors: 256 \| 513.2 \| 352.6 \| 350.4 num_tensors: 1024 \| 2047.6 \| 1404.4 \| 1400.4 Times are in microseconds (us). [----------------- foreach copy - self: torch.bfloat16 - shape: (515, 515) -----------------] \| src: torch.float32 \| src: torch.float16 \| src: torch.bfloat16 1 threads: ---------------------------------------------------------------------------------- num_tensors: 32 \| 13.6 \| 12.8 \| 14.2 num_tensors: 256 \| 511.9 \| 351.8 \| 350.6 num_tensors: 1024 \| 2045.4 \| 1402.2 \| 1401.4 Times are in microseconds (us). ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121717 Approved by: https://github.com/janeyx99	2024-03-13 05:42:28 +00:00
Jason Ansel	a13dd92d88	[dynamo] Minor compile time optimizations in torch.py (#121615 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121615 Approved by: https://github.com/oulgen	2024-03-13 05:36:22 +00:00
PyTorch UpdateBot	d619be57c0	[executorch hash update] update the pinned executorch hash (#121056 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121056 Approved by: https://github.com/pytorchbot	2024-03-13 04:54:16 +00:00
Peter Bell	0c1d59b72f	CI: Fix flaky artifact upload step (#121733 ) This PR changes the upload artifact step of the wheels and conda build to write each matrix entry to a different file. This is because updating the same file from multiple jobs can be flaky as is warned in the docs for upload-artifact > Warning: Be careful when uploading to the same artifact via multiple jobs as artifacts may become corrupted. When uploading a file with an identical name and path in multiple jobs, uploads may fail with 503 errors due to conflicting uploads happening at the same time. Ensure uploads to identical locations to not interfere with each other. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121733 Approved by: https://github.com/huydhn ghstack dependencies: #121268	2024-03-13 04:42:52 +00:00
Peter Bell	52ed35bb64	[inductor] Update triton pin (#121268 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121268 Approved by: https://github.com/oulgen, https://github.com/malfet	2024-03-13 04:42:52 +00:00
Nikita Shulga	07330ff7b6	[MPS][BE] Define `_compute_tolerances` (#121754 ) Right now logic is mostly duplicated between `test_output_match` and `test_output_gradient_match` So move tolerance definition logic into a shared `_compute_tolerances` function and only keep differences (for example, grad checks are completely skipped for `torch.unique`) in the respective test functions. Also, increase tolerance for `pow` and `__rpow__` only on MacOS-13.3 or older and remove GRAD xfaillist for those Pull Request resolved: https://github.com/pytorch/pytorch/pull/121754 Approved by: https://github.com/albanD	2024-03-13 04:08:06 +00:00
karll	f83392b677	cublasLt workspace warning info is misleading, the unit of measuremen… (#121073 ) cublasLt workspace warning info is misleading, the unit of measurement should be KiB instead of bytes Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/121073 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-03-13 03:37:40 +00:00
blorange-amd	e755dab0d1	[ROCm] Enable several test_unary_ufuncs UTs on ROCm (#121104 ) Enabled: test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small_atan_cuda_complex64 test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small_atan_cuda_complex128 test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal_atan_cuda_complex128 test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small__refs_atan_cuda_complex64 test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small__refs_atan_cuda_complex128 test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal__refs_atan_cuda_complex128 test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal_atanh_cuda_complex128 test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal__refs_atanh_cuda_complex128 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121104 Approved by: https://github.com/jeffdaily, https://github.com/ezyang	2024-03-13 03:34:22 +00:00
Mu-Chu Lee	f24ae66abf	[AOTInductor] Skip tests on RoCM for duplicate_constant_folding (#121750 ) Summary: Skip AMD tests for duplicated kernels in constant folding Test Plan: Diff is test Differential Revision: D54820804 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121750 Approved by: https://github.com/huydhn	2024-03-13 03:21:21 +00:00
Adnan Akhundov	9f235971f0	Gate tt.reduce Triton mutation tests on Triton version (#121753 ) Summary: The goal is to make the `test_argmax` and `test_reduce_sum` to work both before and after https://github.com/openai/triton/pull/3191 is included into the Triton pin. This is important to make those tests work during the Triton pin update process both in OSS and internally. Test Plan: ``` $ python test/inductor/test_triton_kernels.py -k test_reduce_sum -k test_argmax .. ---------------------------------------------------------------------- Ran 2 tests in 1.906s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121753 Approved by: https://github.com/Skylion007	2024-03-13 01:43:02 +00:00
Yanan Cao	7d05c4c093	Remove error anti-pattern when dealing with dynamic shape output (#121681 ) There are cases where capture_dynamic_output_shape_ops=True and we will still see DynamicOutputShapeException. For example, when an op doesn't have a meta kernel implemented to return the correct dynamic shape output. If we blindly give users instructions to set capture_dynamic_output_shape_ops to True, users would try it and see no change. As witnessed in this issue: https://github.com/pytorch/pytorch/issues/121036#issuecomment-1985221435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121681 Approved by: https://github.com/tugsbayasgalan	2024-03-13 00:45:23 +00:00
PyTorch MergeBot	9df0dca7f6	Revert "[ Inductor ] Shape padding honors output stride preservation (#120797 )" This reverts commit 57fc35a3af09f7657b2be593a1046f0ac2dd50ab. Reverted https://github.com/pytorch/pytorch/pull/120797 on behalf of https://github.com/williamwen42 due to perf regression on dashboard ([comment](https://github.com/pytorch/pytorch/pull/120797#issuecomment-1992857428))	2024-03-13 00:43:34 +00:00
Wenting Wang	02bb2180f4	[torch export] replace traceback.extract_stack with CapturedTraceback.extract (#121449 ) Summary: with a simple bench in TestDeserializer.test_basic function: ``` time_start = time.time() for i in range(1000): self.check_graph(MyModule(), inputs) warnings.warn(f"time_taken: {time.time() - time_start}") ``` and forcing FakeTensorConfig.debug to True, record_stack_traces to True, logging level to debug, it shows that the changed code is consistently around 20 seconds faster (~90s vs originally ~110s) Test Plan: test passed, see summary compared debug trace before and after: - exactly the same for fake tensor and proxy callsite https://www.internalfb.com/intern/diffing/?paste_number=1189883685 - slightly different for the user frame in proxy node https://www.internalfb.com/intern/diffing/?paste_number=1189884347 Differential Revision: D54237017 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121449 Approved by: https://github.com/angelayi	2024-03-13 00:19:05 +00:00
Peter Bell	7a53dedb07	CI: Specify libc and libstdcxx versions in conda environments (#121556 ) Without this we get mismatches between the GLIBC and GLIBCXX ABI used by conda packages vs pytorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121556 Approved by: https://github.com/isuruf, https://github.com/malfet	2024-03-13 00:12:54 +00:00
Oguz Ulgen	68be750e17	Cleanup some exception handling in triton mutation tracking (#121739 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121739 Approved by: https://github.com/Skylion007 ghstack dependencies: #121690	2024-03-13 00:02:36 +00:00
Matthias Reso	a9274c9a2c	Fix aoti doc to avoid cannot bind non-const lvalue reference error (#121672 ) This PR corrects the example in the AOTInductor example which currently fails with: ``` /home/ubuntu/test/inference.cpp:21:62: error: cannot bind non-const lvalue reference of type ‘std::vector<at::Tensor>&’ to an rvalue of type ‘std::vector<at::Tensor>’ 21 \| std::cout << runner.run({torch::randn({2, 10}, at::kCPU)})[0] << std::endl; \| ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121672 Approved by: https://github.com/desertfire	2024-03-12 23:43:40 +00:00
Oguz Ulgen	79ee6bbde3	Support `triton.language.dtype` with `torch.compile` (#121690 ) Putting this PR as an RFC since I have resorted to some horrible hacks in order to make this work. ``` (Pdb) p triton.language.float32 triton.language.fp32 (Pdb) p str(triton.language.float32) 'fp32' (Pdb) p repr(triton.language.float32) 'triton.language.fp32' ``` This means that we need to "rewrite" them for fx graph and inductor execution. This PR allows Mamba2 to work with `torch.compile`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121690 Approved by: https://github.com/Skylion007	2024-03-12 23:21:46 +00:00
Animesh Jain	22bb24986d	[dynamo][guards] Use lazy variable tracker for func defaults (#121388 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121388 Approved by: https://github.com/jansel	2024-03-12 22:48:48 +00:00
kareem	519151a062	[fx] Preserve Fx graph node order in partitioner across runs (#115621 ) Fixes #ISSUE_NUMBER partitioner generates different graph in recompilation on each run Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621 Approved by: https://github.com/izaitsevfb	2024-03-12 22:18:43 +00:00
atalman	a95ceb51a2	Release fix pinning slow-tests.json (#121746 ) Apply release changes script adds version to SLOW_TESTS_FILE which should not change Test: ``` SLOW_VER=test sed -i -e s#/slow-tests.json#"/slow-tests.json?versionId=${SLOW_VER}"# tools/stats/import_test_stats.py ``` Output: ``` SLOW_TESTS_FILE = ".pytorch-slow-tests.json" ... url = "https://ossci-metrics.s3.amazonaws.com/slow-tests.json?versionId=test" ``` related to: https://github.com/pytorch/pytorch/pull/121726 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121746 Approved by: https://github.com/huydhn	2024-03-12 22:04:55 +00:00
Kai Londenberg	a5ec45f2ec	[Inductor Cutlass backend] Move tests to separate file (#121489 ) Move Cutlass backend related tests to test/inductor/test_cutlass_backend.py - no changes to the tests themselves. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121489 Approved by: https://github.com/jansel	2024-03-12 21:59:48 +00:00
Bryant Biggs	844bfbbd2e	feat: Update Dockerfile default versions for Python, OS, and CUDA arch list (#121560 ) - Update Dockerfile default versions for Python, OS, and CUDA arch list - Python 3.8 is EOL later this year, the `docker.Makefile` has 3.10 as default - `docker.Makefile` is using 22.04 so this just aligns that - The GPU feature list is quite dated, most of those architectures are long past EOL and we aren't getting the newer cards (A100, H100) into that list until now https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list Pull Request resolved: https://github.com/pytorch/pytorch/pull/121560 Approved by: https://github.com/seemethere, https://github.com/Neilblaze, https://github.com/atalman, https://github.com/malfet	2024-03-12 21:43:26 +00:00
Zihua Wu	d62bdb087d	[Profiler] add missing field device_resource_id (#121480 ) Fixes #121479 Co-authored-by: Aaron Shi <enye.shi@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121480 Approved by: https://github.com/aaronenyeshi	2024-03-12 21:42:53 +00:00
Tugsbayasgalan Manlaibaatar	5478a4e348	Don't run non-strict for test case that doesn't need non-strict (#121710 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121710 Approved by: https://github.com/BoyuanFeng ghstack dependencies: #121652, #121678, #121687	2024-03-12 21:32:33 +00:00
PyTorch MergeBot	5b506c8bce	Revert "[dynamo][guards] Use lazy variable tracker for func defaults (#121388 )" This reverts commit 04a5d6e8d3f09ee6741484bcfea022228f747b09. Reverted https://github.com/pytorch/pytorch/pull/121388 on behalf of https://github.com/osalpekar due to causing executorch model-test failures internally. See [D54707529](https://www.internalfb.com/diff/D54707529) ([comment](https://github.com/pytorch/pytorch/pull/121388#issuecomment-1992619251))	2024-03-12 21:31:18 +00:00
Shunting Zhang	522d972924	[eazy] add more log when accuracy check fail (#121656 ) Add these logs to debug the regression of the accuracy test for the dm_nfnet_f0 model in training. With the extra logging when the accuracy check fails, we can verify whether it was close to succeeding. If so, that indicates there is no real issue, just flakiness, and we can probably tune the tolerance to fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121656 Approved by: https://github.com/jansel, https://github.com/Skylion007	2024-03-12 20:58:20 +00:00
Michael Ranieri	f50c652422	avoid aten dispatch shadowing type with variable (#121659 ) Summary: `DECLARE_DISPATCH` is shadowing the variable data with the data type: `extern TORCH_API struct name name` -> `extern TORCH_API struct gemm_stub gemm_stub` for instance. This is probably dangerous behavior to rely on, as the compiler needs to always resolve to type and/or data based on context. Previous macro fails with VS2022. Test Plan: `buck2 build arvr/mode/win/vs2022/cpp20/opt //xplat/caffe2:aten_pow_ovrsource` Differential Revision: D54699849 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121659 Approved by: https://github.com/albanD	2024-03-12 20:50:47 +00:00
Manuel Candales	6d8a7d6e58	[pytorch] optional zero points on dequantize per channel (#121724 ) Summary: X-link: https://github.com/pytorch/executorch/pull/2364 bypass-github-export-checks Test Plan: sandcastle Reviewed By: mikekgfb Differential Revision: D54709217 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121724 Approved by: https://github.com/mikekgfb	2024-03-12 19:54:11 +00:00
Colin Peppler	a6149eba12	[easy] Refactor MultiOutput.codegen_list_tuple_access to use subclass type checks (#121662 ) Summary: # Why? Right now I'm running into a case where `itype` is `torch.fx.immutable_collections.immutable_list`, which is a subclass of `list`. However, currently we're checking the concrete types (i.e. `list`) and `immutable_list` isn't explicitly supported here. Thus, we use a runtime check that looks at the subclass so we can support subclasses -- such as immutable_list -- as well. Test Plan: ci Differential Revision: D54764829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121662 Approved by: https://github.com/aakhundov	2024-03-12 19:27:56 +00:00
Tugsbayasgalan Manlaibaatar	90e886aa6c	Sanity check for non-strict (#121687 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121687 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #121652, #121678	2024-03-12 18:21:32 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	443e241cc5	Don't cache predispatch kernels (#121712 ) Summary: Title Test Plan: CI Differential Revision: D54791087 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121712 Approved by: https://github.com/ydwu4	2024-03-12 18:05:59 +00:00
Wanchao Liang	a26480a4d1	[dtensor] move early return check into redistribute autograd function (#121653 ) This PR fixed the bug of redistribute to move early return check into the redistribute autograd function, so that even though we redistribute the same placement, the grad_placements from the `to_local` call might be different, the redistribute backward still need to happen Pull Request resolved: https://github.com/pytorch/pytorch/pull/121653 Approved by: https://github.com/awgu	2024-03-12 17:37:30 +00:00
atalman	00a53b58dd	Refactor release only changes to two step execution (#121728 ) Refactor release only changes to two step execution. 1. Step ``tag-docker-images.sh`` . Tags latest docker images for current release. This step takes about 30min to complete. This step may fail due to space issues on the local host or http connection when pulling image. Hence should be rerun if failed. 2. Apply release only changes ``apply-release-changes.sh`` prepares a PR with release only changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121728 Approved by: https://github.com/jeanschmidt	2024-03-12 17:22:22 +00:00
Animesh Jain	4e63d9065a	[dynamo] Delete record replay tests as they are not maintained (#121705 ) Fixes https://github.com/pytorch/pytorch/issues/115518 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121705 Approved by: https://github.com/mlazos	2024-03-12 17:16:34 +00:00
Animesh Jain	cd1751b14f	[dynamo] Measure Dynamo cache latency lookup (#121604 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121604 Approved by: https://github.com/jansel ghstack dependencies: #121614, #121622	2024-03-12 17:09:11 +00:00
Animesh Jain	22489bfe70	[dynamo][guards-cpp-refactor] Directly call root guard manager in eval_frame (#121622 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121622 Approved by: https://github.com/jansel ghstack dependencies: #121614	2024-03-12 17:09:11 +00:00
Animesh Jain	2348e8e4e7	[dynamo][guards-cpp-refactor] Simplify DYNAMIC_INDICES guard (#121614 ) Use NO_HASATTR guard for the common part. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121614 Approved by: https://github.com/jansel	2024-03-12 17:08:56 +00:00
PyTorch MergeBot	0398dc9e8e	Revert "[DCP] Makes fsspec public (#121508 )" This reverts commit d482614fec5fb9bccb49bf4ee4ab561e872c0f50. Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))	2024-03-12 17:02:43 +00:00
Edward Z. Yang	b84f94f6a3	Restore timestamps on C++ logs without glog (#121384 ) It looks like it was commented out because the original implementation was not sufficiently portable. I had to do some rewrites to the innards to make it more portable. No Windows nanoseconds support because I'm lazy. I tested by running `build/bin/TCPStoreTest` and observing the log messages there. I am actually not sure how to look at the log messages from Python though. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121384 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-03-12 17:01:32 +00:00
Igor Sugak	704e15307e	[caffe2] replace refernces to np.asscalar (#121332 ) (#121545 ) Summary: `np.asscalar` was deprecated and removed in a recent Numpy. It used to be implemented the following way, and the recommended alternative is to call `item()` directly: ```python def asscalar(a): return a.item() ``` This fixes all of the references. Test Plan: visual inspection and automated tests Differential Revision: D54697760 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121545 Approved by: https://github.com/malfet	2024-03-12 16:58:47 +00:00
angelayi	d1715c3adb	[export] Update error message for set_grad (#121666 ) Context: https://fb.workplace.com/groups/222849770514616/posts/381979051268353/?comment_id=383334957799429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121666 Approved by: https://github.com/ydwu4	2024-03-12 16:41:45 +00:00
Jason Ansel	3c8c7e2a46	[dynamo] Tweak naming for module hook bw_state (#121609 ) Some minor changes not related to the other PRs in the stack Pull Request resolved: https://github.com/pytorch/pytorch/pull/121609 Approved by: https://github.com/yanboliang	2024-03-12 16:27:56 +00:00
Chien-Chin Huang	7a68e0a3e8	[DCP][state_dict] Remove the check of FSDP has root (#121544 ) Root may not exist due to FSDP lazy initialization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121544 Approved by: https://github.com/Skylion007 ghstack dependencies: #121273, #121276, #121290	2024-03-12 15:43:19 +00:00
Andrew Gu	85dc254364	[DTensor] Moved `Transformer` sharding to staticmethod (#121660 ) To support FSDP + TP/SP unit tests, let us factor out the canonical TP/SP sharding of `Transformer` to a staticmethod that can be called by other unit tests. Test Plan: ``` pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121660 Approved by: https://github.com/wanchaol, https://github.com/yifuwang ghstack dependencies: #121360, #121357	2024-03-12 15:08:57 +00:00
Stephen Jia	cc51e100f5	[ET-VK] Enable Dynamic shape support via tensor virtual and physical resizing (#121598 ) Summary: ## Context This changeset lays the foundations for supporting dynamic shapes in the ExecuTorch Vulkan delegate via allowing Tensors to be resized in one of two ways: 1. Discarding underlying `vkImage` or `vkBuffer` and reallocating a new `vkImage` or `vkBuffer` with updated sizes. This method is intended to be used when the current `vkImage` or `vkBuffer` is not large enough to contain the new sizes. 2. Update the tensor's size metadata without reallocating any new resources. This allows shaders to interpret the underlying `vkImage` or `vkBuffer` as if it were smaller than it actually is, and allows command buffers to be preserved when sizes are changed. Test Plan: Check CI. Tests have also been added to `vulkan_compute_api_test` that test the two methods of tensor resizing. Differential Revision: D54728401 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121598 Approved by: https://github.com/jorgep31415	2024-03-12 14:32:00 +00:00
Howard Huang	2a99e6f299	Update error message (#121644 ) Summary: We don't want people to move to NCCL exp without explicit opt in. It seems that sparse allreduce was accidentally called and people were confused whether they should use NCCL exp instead. Update the error message to explicitly say that sparse_allreduce is not supported. Test Plan: sandcastle Differential Revision: D54759307 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121644 Approved by: https://github.com/awgu	2024-03-12 13:04:21 +00:00
kausik	edf22f3a48	Modify signature of dequantize ops for decomposed quantized Tensor (#119173 ) (#121450 ) Summary: X-link: https://github.com/pytorch/executorch/pull/2308 Note: The initial purpose of this PR is to draw suggestion and feedback regarding better alternative, if any. At present, dequantize op for decomposed quantized Tensor representation e.g. dequantize_per_tensor() assumes the output dtype as torch.float and hence, it does not have the output dtype in its operator argument list. However, this op signature becomes unusable when the assumption breaks. Because, in case the output dtype is different from torch.float, there is no way to specify the same during dequantization. This change is aimed at generalizing the signature of dequantize op like dequantize_per_tensor() for wider use-cases where the output dtype can be different from torch.float and needs to passed during dequantization. The proposal is to use an additional argument named 'output_dtype' to solve the problem. However, we would also like to have suggestion and feedback regarding any better alternative that can be used instead. cc jerryzh168 jianyuh raghuramank100 jamesr66a vkuzo jgong5 Xia-Weiwen leslie-fang-intel Reviewed By: digantdesai Differential Revision: D53590486 Pulled By: manuelcandales Co-authored-by: kausik <kmaiti@habana.ai> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121450 Approved by: https://github.com/jerryzh168	2024-03-12 12:36:31 +00:00
Adnan Akhundov	06d2392003	Support tt.reduce in Triton kernel analysis pass (#121706 ) Summary: Previously, we bailed out of the Triton kernel analysis pass when seeing a `tt.reduce` op. In this PR, we support the op and don't bail out anymore. Test Plan: This is a bit tricky, as the extension is added to the MLIR walk-based analysis code path which is active only on when the MLIR bindings added in https://github.com/openai/triton/pull/3191 are available. So for now I've run the `test_argmax` and `test_reduce_sum` manually with a newer Triton version than the current pin. When pin updates, we'll make those tests official (left a TODO comment). Pull Request resolved: https://github.com/pytorch/pytorch/pull/121706 Approved by: https://github.com/jansel	2024-03-12 11:38:28 +00:00
Animesh Jain	78b4793c96	[dynamo][compile-time] Caching VTs to reduce compile-time (#121031 ) Reduces the `torch.compile(backend="eager")` for this code ~~~ def fn(x): for _ in range(10000): # x = torch.sin(x) x = torch.ops.aten.sin(x) # x = sin(x) return x ~~~ From 18 seconds to 12 seconds. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121031 Approved by: https://github.com/jansel	2024-03-12 09:19:50 +00:00
Tugsbayasgalan Manlaibaatar	52ad2b682c	Generate predispatch tests (#121678 ) In this PR, we create another dynamic test class for TestExport tests that basically serializes/deserializes pre-dispatch IR. I encountered 4 additional failures. Three of them are due to different operators showing up in the graph, and only one is a legitimate failure, which is tracked by another task internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121678 Approved by: https://github.com/angelayi ghstack dependencies: #121652	2024-03-12 08:34:50 +00:00
Dmitry Nikolaev	656134c38f	[ROCm] enable complex128 in test_addmm_sizes_all_sparse_csr for rocm for trivial (k,n,m) cases (#120504 ) This PR enables `test_addmm_sizes_all_sparse_csr_k__n__m_*_cuda_complex128` for ROCm for trivial cases (m or n or k = 0) CUSPARSE_SPMM_COMPLEX128_SUPPORTED is also used for `test_addmm_all_sparse_csr` and `test_sparse_matmul` and both of them are skipped for ROCm by `@skipIfRocm` or `@skipCUDAIf(not _check_cusparse_spgemm_available())` Pull Request resolved: https://github.com/pytorch/pytorch/pull/120504 Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang	2024-03-12 07:29:57 +00:00

7502 changed files with 284196 additions and 600305 deletions

1

.bazelignore

 @ -1,3 +1,4 @@
 # We do not use this library in our Bazel build. It contains an
 # infinitely recursing symlink that makes Bazel very unhappy.
 third_party/ittapi/
 third_party/opentelemetry-cpp

5

.ci/docker/aotriton_version.txt Normal file

 @ -0,0 +1,5 @@
 .6b
 manylinux_2_17
 rocm6
 b5df8c8123f90cba3ede7e971e6fbc6040d506
 db6ecbc915893ff967abd6e1b43bd5f54949868873be60dc802086c3863e648

									
										138

.ci/docker/build.sh
									
				@ -84,16 +84,16 @@ fi

				# CMake 3.18 is needed to support CUDA17 language variant

				CMAKE_VERSION=3.18.5

				_UCX_COMMIT=00bcc6bb18fc282eb160623b4c0d300147f579af

				_UCC_COMMIT=7cb07a76ccedad7e56ceb136b865eb9319c258ea

				_UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb

				_UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b

				# It's annoying to rename jobs every time you want to rewrite a

				# configuration, so we hardcode everything here rather than do it

				# from scratch

				case "$image" in

				  pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=8

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.4.0

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				@ -105,9 +105,23 @@ case "$image" in

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks)

				  pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=8

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.4.0

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				@ -120,9 +134,54 @@ case "$image" in

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc9)

				  pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn9-py3.12-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.4.0

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9)

				    CUDA_VERSION=11.8.0

				    CUDNN_VERSION=8

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				@ -134,9 +193,37 @@ case "$image" in

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9)

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.4.0

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=8

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.4.0

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				@ -204,7 +291,7 @@ case "$image" in

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    ROCM_VERSION=5.7

				    ROCM_VERSION=6.0

				    NINJA_VERSION=1.9.0

				    CONDA_CMAKE=yes

				    TRITON=yes

				@ -215,7 +302,7 @@ case "$image" in

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    ROCM_VERSION=6.0

				    ROCM_VERSION=6.1

				    NINJA_VERSION=1.9.0

				    CONDA_CMAKE=yes

				    TRITON=yes

				@ -226,9 +313,10 @@ case "$image" in

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    BASEKIT_VERSION=2024.0.0-49522

				    XPU_VERSION=0.5

				    NINJA_VERSION=1.9.0

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				    pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks)

				    ANACONDA_PYTHON_VERSION=3.8

				@ -242,10 +330,10 @@ case "$image" in

				    DOCS=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-cuda11.8-cudnn8-py3.8-clang12)

				  pytorch-linux-jammy-cuda11.8-cudnn9-py3.8-clang12)

				    ANACONDA_PYTHON_VERSION=3.8

				    CUDA_VERSION=11.8

				    CUDNN_VERSION=8

				    CUDNN_VERSION=9

				    CLANG_VERSION=12

				    PROTOBUF=yes

				    DB=yes

				@ -292,7 +380,7 @@ case "$image" in

				    ANACONDA_PYTHON_VERSION=3.9

				    CONDA_CMAKE=yes

				    ;;

				  pytorch-linux-jammy-cuda11.8-cudnn8-py3.9-linter)

				  pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter)

				    ANACONDA_PYTHON_VERSION=3.9

				    CUDA_VERSION=11.8

				    CONDA_CMAKE=yes

				@ -305,6 +393,12 @@ case "$image" in

				    DB=yes

				    VISION=yes

				    CONDA_CMAKE=yes

				    # snadampal: skipping sccache due to the following issue

				    # https://github.com/pytorch/pytorch/issues/121559

				    SKIP_SCCACHE_INSTALL=yes

				    # snadampal: skipping llvm src build install because the current version

				    # from pytorch/llvm:9.0.1 is x86 specific

				    SKIP_LLVM_SRC_BUILD_INSTALL=yes

				    ;;

				  *)

				    # Catch-all for builds that are not hardcoded.

				@ -353,13 +447,13 @@ tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')

				#when using cudnn version 8 install it separately from cuda

				if [[ "$image" == *cuda*  && ${OS} == "ubuntu" ]]; then

				  IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-cudnn${CUDNN_VERSION}-devel-ubuntu${UBUNTU_VERSION}"

				  if [[ ${CUDNN_VERSION} == 8 ]]; then

				  if [[ ${CUDNN_VERSION} == 9 ]]; then

				    IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}"

				  fi

				fi

				# Build image

				DOCKER_BUILDKIT=1 docker build \

				docker build \

				       --no-cache \

				       --progress=plain \

				       --build-arg "BUILD_ENVIRONMENT=${image}" \

				@ -396,14 +490,16 @@ DOCKER_BUILDKIT=1 docker build \

				       --build-arg "DOCS=${DOCS}" \

				       --build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \

				       --build-arg "EXECUTORCH=${EXECUTORCH}" \

				       --build-arg "BASEKIT_VERSION=${BASEKIT_VERSION}" \

				       --build-arg "XPU_VERSION=${XPU_VERSION}" \

				       --build-arg "ACL=${ACL:-}" \

				       --build-arg "SKIP_SCCACHE_INSTALL=${SKIP_SCCACHE_INSTALL:-}" \

				       --build-arg "SKIP_LLVM_SRC_BUILD_INSTALL=${SKIP_LLVM_SRC_BUILD_INSTALL:-}" \

				       -f $(dirname ${DOCKERFILE})/Dockerfile \

				       -t "$tmp_tag" \

				       "$@" \

				       .

				# NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,

				# NVIDIA dockers for RC releases use tag names like `11.0-cudnn9-devel-ubuntu18.04-rc`,

				# for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could

				# find the correct image. As a result, here we have to replace the

				#   "$UBUNTU_VERSION" == "18.04-rc"
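The base-image choice in the last build.sh hunk can be sketched as a small function. This is a minimal illustration, not code from the repository: `select_base_image` is a hypothetical name, and the sketch assumes that the `== 9` branch is the added line and that it exists because cuDNN 9 is installed separately by `install_cudnn.sh`. The `*cuda*` and `${OS} == "ubuntu"` guards from the original are omitted.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of build.sh): picks the NVIDIA base image tag.
# Arguments: CUDA version, cuDNN major version, Ubuntu version.
select_base_image() {
  local cuda_version="$1" cudnn_version="$2" ubuntu_version="$3"
  if [[ "${cudnn_version}" == 9 ]]; then
    # Assumption: cuDNN 9 is installed later, so start from the plain CUDA devel image.
    echo "nvidia/cuda:${cuda_version}-devel-ubuntu${ubuntu_version}"
  else
    # Otherwise use the image that already bundles cuDNN.
    echo "nvidia/cuda:${cuda_version}-cudnn${cudnn_version}-devel-ubuntu${ubuntu_version}"
  fi
}

select_base_image 12.4.0 9 20.04
```

Keeping the decision in one function makes it easy to check which tag a given `CUDA_VERSION`/`CUDNN_VERSION` pair would pull before running `docker build`.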

									
										12

.ci/docker/centos-rocm/Dockerfile
									
				@ -62,7 +62,7 @@ RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi

				RUN rm install_db.sh

				ENV INSTALLED_DB ${DB}

				# (optional) Install vision packages like OpenCV and ffmpeg

				# (optional) Install vision packages like OpenCV

				ARG VISION

				COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./

				RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi

				@ -77,6 +77,9 @@ RUN rm install_rocm.sh

				COPY ./common/install_rocm_magma.sh install_rocm_magma.sh

				RUN bash ./install_rocm_magma.sh

				RUN rm install_rocm_magma.sh

				COPY ./common/install_amdsmi.sh install_amdsmi.sh

				RUN bash ./install_amdsmi.sh

				RUN rm install_amdsmi.sh

				ENV PATH /opt/rocm/bin:$PATH

				ENV PATH /opt/rocm/hcc/bin:$PATH

				ENV PATH /opt/rocm/hip/bin:$PATH

				@ -110,6 +113,13 @@ COPY triton_version.txt triton_version.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt

				# Install AOTriton (Early fail)

				COPY ./aotriton_version.txt aotriton_version.txt

				COPY ./common/common_utils.sh common_utils.sh

				COPY ./common/install_aotriton.sh install_aotriton.sh

				RUN ["/bin/bash", "-c", "./install_aotriton.sh /opt/rocm && rm -rf install_aotriton.sh aotriton_version.txt common_utils.sh"]

				ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton

				# Install ccache/sccache (do this last, so we get priority in PATH)

				COPY ./common/install_cache.sh install_cache.sh

				ENV PATH /opt/cache/bin:$PATH

2

.ci/docker/ci_commit_pins/executorch.txt

 @ -1 +1 @@
 e2a8f9548aecb62a68e264607174a7d207ed2929
 d4b3e5cc607e97afdba79dc90f8ef968142f347c

2

.ci/docker/ci_commit_pins/triton-rocm.txt

 @ -1 +1 @@
 a22a91d04c2b4a029a69a198eac390089c3e891
 cbe5045a6898c9a925f01435c8277b2fe6afcc

1

.ci/docker/ci_commit_pins/triton-xpu.txt Normal file

 @ -0,0 +1 @@
 b8c64f64c18d8cac598b3adb355c21e7439c21de

2

.ci/docker/ci_commit_pins/triton.txt

 @ -1 +1 @@
 a9bc1a36470eefafe0e2ab2503b8698f1e89e7e3
 fff310c891f5a92d55445adf8cc9d29df5841e

									
										2

.ci/docker/common/install_acl.sh
									
				@ -1,6 +1,6 @@

				set -euo pipefail

				readonly version=v23.08

				readonly version=v24.04

				readonly src_host=https://review.mlplatform.org/ml

				readonly src_repo=ComputeLibrary

									
										5

.ci/docker/common/install_amdsmi.sh
									
										Normal file
									
				@ -0,0 +1,5 @@

				#!/bin/bash

				set -ex

				cd /opt/rocm/share/amd_smi && pip install .

									
										23

.ci/docker/common/install_aotriton.sh
									
										Executable file
									
				@ -0,0 +1,23 @@

				#!/bin/bash

				set -ex

				source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"

				TARBALL='aotriton.tar.bz2'

				# This read command always returns with exit code 1

				read -d "\n" VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < aotriton_version.txt || true

				ARCH=$(uname -m)

				AOTRITON_INSTALL_PREFIX="$1"

				AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}.tar.bz2"

				cd "${AOTRITON_INSTALL_PREFIX}"

				# Must use -L to follow redirects

				curl -L --retry 3 -o "${TARBALL}" "${AOTRITON_URL}"

				ACTUAL_SHA256=$(sha256sum "${TARBALL}" | cut -d " " -f 1)

				if [ "${SHA256}" != "${ACTUAL_SHA256}" ]; then

				  echo -n "Error: The SHA256 of downloaded tarball is ${ACTUAL_SHA256},"

				  echo " which does not match the expected value ${SHA256}."

				  exit

				fi

				tar xf "${TARBALL}" && rm -rf "${TARBALL}"
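The two techniques in install_aotriton.sh (reading a multi-line version file into several variables, then verifying a checksum) can be sketched in isolation. The function names `read_version_file` and `verify_sha256` are hypothetical. The sketch uses `read -d ''`, which reads to end of input and therefore returns non-zero even on success, much like the `read -d "\n"` call above. It also returns a non-zero status on a mismatch: a bare `exit` reuses the status of the last command, which after a successful `echo` is 0.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the pattern; not the repository's code.

# Read five whitespace-separated fields (one per line in the file) into globals.
read_version_file() {
  # -d '' reads up to EOF; read then reports failure although the
  # variables are populated, so the status is deliberately ignored.
  read -r -d '' VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < "$1" || true
}

# Compare a file's SHA256 digest with the expected value.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual=$(sha256sum "${file}" | cut -d ' ' -f 1)
  if [ "${expected}" != "${actual}" ]; then
    echo "Error: SHA256 of ${file} is ${actual}, expected ${expected}." >&2
    return 1  # explicit non-zero status so callers can fail early
  fi
}
```

A caller would typically write `verify_sha256 "${TARBALL}" "${SHA256}" || exit 1` so that a corrupted download stops the image build.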

									
										3

.ci/docker/common/install_base.sh
									
				@ -3,7 +3,7 @@

				set -ex

				install_ubuntu() {

				  # NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,

				  # NVIDIA dockers for RC releases use tag names like `11.0-cudnn9-devel-ubuntu18.04-rc`,

				  # for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could

				  # find the correct image. As a result, here we have to check for

				  #   "$UBUNTU_VERSION" == "18.04"*

				@ -113,7 +113,6 @@ install_centos() {

				    glibc-devel \

				    glibc-headers \

				    glog-devel \

				    hiredis-devel \

				    libstdc++-devel \

				    libsndfile-devel \

				    make \

									
										24

.ci/docker/common/install_conda.sh
									
				@ -57,8 +57,21 @@ fi

				  # Uncomment the below when resolved to track the latest conda update

				  # as_jenkins conda update -y -n base conda

				  if [[ $(uname -m) == "aarch64" ]]; then

				    export SYSROOT_DEP="sysroot_linux-aarch64=2.17"

				  else

				    export SYSROOT_DEP="sysroot_linux-64=2.17"

				  fi

				  # Install correct Python version

				  as_jenkins conda create -n py_$ANACONDA_PYTHON_VERSION -y python="$ANACONDA_PYTHON_VERSION"

				  # Also ensure sysroot is using a modern GLIBC to match system compilers

				  as_jenkins conda create -n py_$ANACONDA_PYTHON_VERSION -y\

				             python="$ANACONDA_PYTHON_VERSION" \

				             ${SYSROOT_DEP}

				  # libstdcxx from conda default channels are too old, we need GLIBCXX_3.4.30

				  # which is provided in libstdcxx 12 and up.

				  conda_install libstdcxx-ng=12.3.0 -c conda-forge

				  # Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README

				  if [[ $(uname -m) == "aarch64" ]]; then

				@ -110,14 +123,5 @@ fi

				    pip_install -r /opt/conda/requirements-docs.txt

				  fi

				  # HACK HACK HACK

				  # gcc-9 for ubuntu-18.04 from http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu

				  # Pulls llibstdc++6 13.1.0-8ubuntu1~18.04 which is too new for conda

				  # So remove libstdc++6.so.3.29 installed by https://anaconda.org/anaconda/libstdcxx-ng/files?version=11.2.0

				  # Same is true for gcc-12 from Ubuntu-22.04

				  if grep -e [12][82].04.[623] /etc/issue >/dev/null; then

				    rm /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/lib/libstdc++.so.6

				  fi

				  popd

				fi
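The architecture-dependent sysroot selection from the first install_conda.sh hunk can be written as a function that takes the machine name as an argument, which makes it checkable without an aarch64 host. `sysroot_dep` is a hypothetical name; the package strings are copied from the hunk.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns the conda sysroot package spec for a machine type.
# Defaults to the current machine when no argument is given.
sysroot_dep() {
  local machine="${1:-$(uname -m)}"
  if [[ "${machine}" == "aarch64" ]]; then
    echo "sysroot_linux-aarch64=2.17"
  else
    echo "sysroot_linux-64=2.17"
  fi
}

sysroot_dep aarch64
```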

									
										14

.ci/docker/common/install_cudnn.sh
									
				@ -1,20 +1,18 @@

				#!/bin/bash

				if [[ ${CUDNN_VERSION} == 8 ]]; then

				if [[ -n "${CUDNN_VERSION}" ]]; then

				    # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				    mkdir tmp_cudnn

				    pushd tmp_cudnn

				    if [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-8.9.2.26_cuda12-archive"

				        curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz

				    elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-8.7.0.84_cuda11-archive"

				        curl --retry 3 -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/${CUDNN_NAME}.tar.xz

				    if [[ ${CUDA_VERSION:0:2} == "12" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda12-archive"

				    elif [[ ${CUDA_VERSION:0:2} == "11" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda11-archive"

				    else

				        print "Unsupported CUDA version ${CUDA_VERSION}"

				        exit 1

				    fi

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz

				    tar xf ${CUDNN_NAME}.tar.xz

				    cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/

				    cp -a ${CUDNN_NAME}/lib/* /usr/local/cuda/lib64/
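The new selection logic keys on the CUDA major version only. A minimal sketch follows, assuming the `9.1.0.70` archive names are the added lines; `cudnn_archive_name` and `cudnn_archive_url` are hypothetical names. Bash has no `print` builtin, so the sketch reports the unsupported case with `echo` on stderr.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: map a CUDA version to the cuDNN archive name and URL.
cudnn_archive_name() {
  local cuda_version="$1"
  case "${cuda_version:0:2}" in   # first two characters, e.g. "12" from "12.4.0"
    12) echo "cudnn-linux-x86_64-9.1.0.70_cuda12-archive" ;;
    11) echo "cudnn-linux-x86_64-9.1.0.70_cuda11-archive" ;;
    *)
      echo "Unsupported CUDA version ${cuda_version}" >&2
      return 1
      ;;
  esac
}

# Build the download URL used by the single curl call in the hunk.
cudnn_archive_url() {
  echo "https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/$1.tar.xz"
}

cudnn_archive_url "$(cudnn_archive_name 12.4.0)"
```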

									
										11

.ci/docker/common/install_cusparselt.sh
									
				@ -5,9 +5,14 @@ set -ex

				# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				mkdir tmp_cusparselt && cd tmp_cusparselt

				if [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then

				    CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.5.2.1-archive"

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz

				if [[ ${CUDA_VERSION:0:4} =~ ^12\.[1-4]$ ]]; then

				    arch_path='sbsa'

				    export TARGETARCH=${TARGETARCH:-$(uname -m)}

				    if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then

				        arch_path='x86_64'

				    fi

				    CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.5.2.1-archive"

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz

				elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then

				    CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.4.0.7-archive"

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
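The architecture mapping and the CUDA version match from the 12.x branch can be exercised separately. This is a sketch with hypothetical function names (`cusparselt_arch_path`, `is_cuda_12_1_to_12_4`); unlike the hunk, it quotes the variable in both `[ ... ]` tests.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a target architecture to the cuSPARSELt path component.
# Anything other than amd64/x86_64 falls back to 'sbsa', as in the hunk above.
cusparselt_arch_path() {
  local target_arch="${1:-$(uname -m)}"
  if [ "${target_arch}" = 'amd64' ] || [ "${target_arch}" = 'x86_64' ]; then
    echo 'x86_64'
  else
    echo 'sbsa'
  fi
}

# Hypothetical helper: true when the first four characters are 12.1 through 12.4.
is_cuda_12_1_to_12_4() {
  [[ "${1:0:4}" =~ ^12\.[1-4]$ ]]
}

cusparselt_arch_path aarch64
```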

									
										11

.ci/docker/common/install_db.sh
									
				@ -4,11 +4,6 @@ set -ex

				install_ubuntu() {

				  apt-get update

				  apt-get install -y --no-install-recommends \

				          libhiredis-dev \

				          libleveldb-dev \

				          liblmdb-dev \

				          libsnappy-dev

				  # Cleanup

				  apt-get autoclean && apt-get clean

				@ -20,12 +15,6 @@ install_centos() {

				  # See http://fedoraproject.org/wiki/EPEL

				  yum --enablerepo=extras install -y epel-release

				  yum install -y \

				      hiredis-devel \

				      leveldb-devel \

				      lmdb-devel \

				      snappy-devel

				  # Cleanup

				  yum clean all

				  rm -rf /var/cache/yum

									
										8

.ci/docker/common/install_onnx.sh
									
				@ -30,15 +30,15 @@ pip_install \

				pip_install coloredlogs packaging

				pip_install onnxruntime==1.17.0

				pip_install onnx==1.15.0

				pip_install onnxruntime==1.18

				pip_install onnx==1.16.0

				# pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@3e869ef8ccf19b5ebd21c10d3e9c267c9a9fa729" --no-deps

				pip_install onnxscript==0.1.0.dev20240301 --no-deps

				pip_install onnxscript==0.1.0.dev20240523 --no-deps

				# Cache the transformers model to be used later by ONNX tests. We need to run the transformers

				# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

				IMPORT_SCRIPT_FILENAME="/tmp/onnx_import_script.py"

				as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2");' > "${IMPORT_SCRIPT_FILENAME}"

				as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3");' > "${IMPORT_SCRIPT_FILENAME}"

				# Need a PyTorch version for transformers to work

				pip_install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu

									
										3

.ci/docker/common/install_protobuf.sh
									
				@ -11,7 +11,8 @@ mkdir -p $pb_dir

				ln -s /usr/lib64 "$pb_dir/lib64"

				curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" --retry 3

				tar -xvz -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz

				tar -xvz --no-same-owner -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz

				NPROC=$[$(nproc) - 2]

				pushd "$pb_dir" && ./configure && make -j${NPROC} && make -j${NPROC} check && sudo make -j${NRPOC} install && sudo ldconfig

				popd

									
										122

.ci/docker/common/install_rocm.sh
									
				@ -6,9 +6,6 @@ ver() {

				    printf "%3d%03d%03d%03d" $(echo "$1" | tr '.' ' ');

				}

				# Map ROCm version to AMDGPU version

				declare -A AMDGPU_VERSIONS=( ["5.0"]="21.50" ["5.1.1"]="22.10.1" ["5.2"]="22.20" )

				install_ubuntu() {

				    apt-get update

				    if [[ $UBUNTU_VERSION == 18.04 ]]; then

				@ -26,31 +23,14 @@ install_ubuntu() {

				    apt-get install -y libc++1

				    apt-get install -y libc++abi1

				    if [[ $(ver $ROCM_VERSION) -ge $(ver 4.5) ]]; then

				        # Add amdgpu repository

				        UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`

				        local amdgpu_baseurl

				        if [[ $(ver $ROCM_VERSION) -ge $(ver 5.3) ]]; then

				          amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu"

				        else

				          amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/ubuntu"

				        fi

				        echo "deb [arch=amd64] ${amdgpu_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list

				    fi

				    ROCM_REPO="ubuntu"

				    if [[ $(ver $ROCM_VERSION) -lt $(ver 4.2) ]]; then

				        ROCM_REPO="xenial"

				    fi

				    if [[ $(ver $ROCM_VERSION) -ge $(ver 5.3) ]]; then

				        ROCM_REPO="${UBUNTU_VERSION_NAME}"

				    fi

				    # Add amdgpu repository

				    UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`

				    echo "deb [arch=amd64] https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list

				    # Add rocm repository

				    wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -

				    local rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}"

				    echo "deb [arch=amd64] ${rocm_baseurl} ${ROCM_REPO} main" > /etc/apt/sources.list.d/rocm.list

				    echo "deb [arch=amd64] ${rocm_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/rocm.list

				    apt-get update --allow-insecure-repositories

				    DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \

				@ -59,34 +39,28 @@ install_ubuntu() {

				                   rocm-libs \

				                   rccl \

				                   rocprofiler-dev \

				                   roctracer-dev

				                   roctracer-dev \

				                   amd-smi-lib

				    if [[ $(ver $ROCM_VERSION) -ge $(ver 6.1) ]]; then

				        DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated rocm-llvm-dev

				    fi

				    # precompiled miopen kernels added in ROCm 3.5, renamed in ROCm 5.5

				    # search for all unversioned packages

				    # if search fails it will abort this script; use true to avoid case where search fails

				    if [[ $(ver $ROCM_VERSION) -ge $(ver 5.5) ]]; then

				        MIOPENHIPGFX=$(apt-cache search --names-only miopen-hip-gfx | awk '{print $1}' | grep -F -v . || true)

				        if [[ "x${MIOPENHIPGFX}" = x ]]; then

				          echo "miopen-hip-gfx package not available" && exit 1

				        else

				          DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated ${MIOPENHIPGFX}

				        fi

				    MIOPENHIPGFX=$(apt-cache search --names-only miopen-hip-gfx | awk '{print $1}' | grep -F -v . || true)

				    if [[ "x${MIOPENHIPGFX}" = x ]]; then

				      echo "miopen-hip-gfx package not available" && exit 1

				    else

				        MIOPENKERNELS=$(apt-cache search --names-only miopenkernels | awk '{print $1}' | grep -F -v . || true)

				        if [[ "x${MIOPENKERNELS}" = x ]]; then

				          echo "miopenkernels package not available" && exit 1

				        else

				          DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated ${MIOPENKERNELS}

				        fi

				      DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated ${MIOPENHIPGFX}

				    fi

				    # ROCm 6.0 had a regression where journal_mode was enabled on the kdb files resulting in permission errors at runtime

				    if [[ $(ver $ROCM_VERSION) -ge $(ver 6.0) ]]; then

				        for kdb in /opt/rocm/share/miopen/db/*.kdb

				        do

				            sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"

				        done

				    fi

				    for kdb in /opt/rocm/share/miopen/db/*.kdb

				    do

				        sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"

				    done

				    # Cleanup

				    apt-get autoclean && apt-get clean

				@ -103,25 +77,19 @@ install_centos() {

				  yum install -y epel-release

				  yum install -y dkms kernel-headers-`uname -r` kernel-devel-`uname -r`

				  if [[ $(ver $ROCM_VERSION) -ge $(ver 4.5) ]]; then

				      # Add amdgpu repository

				      local amdgpu_baseurl

				      if [[ $OS_VERSION == 9 ]]; then

				          amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/rhel/9.0/main/x86_64"

				      else

				        if [[ $(ver $ROCM_VERSION) -ge $(ver 5.3) ]]; then

				          amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/rhel/7.9/main/x86_64"

				        else

				          amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/rhel/7.9/main/x86_64"

				        fi

				      fi

				      echo "[AMDGPU]" > /etc/yum.repos.d/amdgpu.repo

				      echo "name=AMDGPU" >> /etc/yum.repos.d/amdgpu.repo

				      echo "baseurl=${amdgpu_baseurl}" >> /etc/yum.repos.d/amdgpu.repo

				      echo "enabled=1" >> /etc/yum.repos.d/amdgpu.repo

				      echo "gpgcheck=1" >> /etc/yum.repos.d/amdgpu.repo

				      echo "gpgkey=http://repo.radeon.com/rocm/rocm.gpg.key" >> /etc/yum.repos.d/amdgpu.repo

				  # Add amdgpu repository

				  local amdgpu_baseurl

				  if [[ $OS_VERSION == 9 ]]; then

				      amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/rhel/9.0/main/x86_64"

				  else

				      amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/rhel/7.9/main/x86_64"

				  fi

				  echo "[AMDGPU]" > /etc/yum.repos.d/amdgpu.repo

				  echo "name=AMDGPU" >> /etc/yum.repos.d/amdgpu.repo

				  echo "baseurl=${amdgpu_baseurl}" >> /etc/yum.repos.d/amdgpu.repo

				  echo "enabled=1" >> /etc/yum.repos.d/amdgpu.repo

				  echo "gpgcheck=1" >> /etc/yum.repos.d/amdgpu.repo

				  echo "gpgkey=http://repo.radeon.com/rocm/rocm.gpg.key" >> /etc/yum.repos.d/amdgpu.repo

				  local rocm_baseurl="http://repo.radeon.com/rocm/yum/${ROCM_VERSION}"

				  echo "[ROCm]" > /etc/yum.repos.d/rocm.repo

				@ -139,33 +107,23 @@ install_centos() {

				                   rocm-libs \

				                   rccl \

				                   rocprofiler-dev \

				                   roctracer-dev

				                   roctracer-dev \

				                   amd-smi-lib

				  # precompiled miopen kernels; search for all unversioned packages

				  # if search fails it will abort this script; use true to avoid case where search fails

				  if [[ $(ver $ROCM_VERSION) -ge $(ver 5.5) ]]; then

				      MIOPENHIPGFX=$(yum -q search miopen-hip-gfx | grep miopen-hip-gfx | awk '{print $1}'| grep -F kdb. || true)

				      if [[ "x${MIOPENHIPGFX}" = x ]]; then

				        echo "miopen-hip-gfx package not available" && exit 1

				      else

				        yum install -y ${MIOPENHIPGFX}

				      fi

				  MIOPENHIPGFX=$(yum -q search miopen-hip-gfx | grep miopen-hip-gfx | awk '{print $1}'| grep -F kdb. || true)

				  if [[ "x${MIOPENHIPGFX}" = x ]]; then

				    echo "miopen-hip-gfx package not available" && exit 1

				  else

				      MIOPENKERNELS=$(yum -q search miopenkernels | grep miopenkernels- | awk '{print $1}'| grep -F kdb. || true)

				      if [[ "x${MIOPENKERNELS}" = x ]]; then

				        echo "miopenkernels package not available" && exit 1

				      else

				        yum install -y ${MIOPENKERNELS}

				      fi

				    yum install -y ${MIOPENHIPGFX}

				  fi

				  # ROCm 6.0 had a regression where journal_mode was enabled on the kdb files resulting in permission errors at runtime

				  if [[ $(ver $ROCM_VERSION) -ge $(ver 6.0) ]]; then

				      for kdb in /opt/rocm/share/miopen/db/*.kdb

				      do

				          sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"

				      done

				  fi

				  for kdb in /opt/rocm/share/miopen/db/*.kdb

				  do

				      sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"

				  done

				  # Cleanup

				  yum clean all
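The remaining version gate (`$(ver $ROCM_VERSION) -ge $(ver 6.1)`) relies on the `ver` helper at the top of install_rocm.sh, which pads each dotted component to three digits so that versions compare as plain integers. The sketch below copies `ver` from the hunk and adds a hypothetical wrapper, `version_ge`, to show the comparison.

```shell
#!/usr/bin/env bash
# ver is copied from install_rocm.sh: "6.1" becomes "  6001000000".
# Missing components are formatted as 0 by printf.
ver() {
    printf "%3d%03d%03d%03d" $(echo "$1" | tr '.' ' ');
}

# Hypothetical wrapper: succeeds when version $1 >= version $2.
version_ge() {
    [[ $(ver "$1") -ge $(ver "$2") ]]
}

if version_ge 6.1 6.0; then
    echo "6.1 is at least 6.0"
fi
```

Because each component is zero-padded, `5.10` sorts after `5.9`, which a plain string comparison would get wrong. Components with a leading zero such as `08` would be rejected by `printf %d` as invalid octal, so the helper assumes versions are written without them.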

									
										5

.ci/docker/common/install_triton.sh
									
				@ -13,8 +13,11 @@ conda_reinstall() {

				}

				if [ -n "${ROCM_VERSION}" ]; then

				  TRITON_REPO="https://github.com/ROCmSoftwarePlatform/triton"

				  TRITON_REPO="https://github.com/openai/triton"

				  TRITON_TEXT_FILE="triton-rocm"

				elif [ -n "${XPU_VERSION}" ]; then

				  TRITON_REPO="https://github.com/intel/intel-xpu-backend-for-triton"

				  TRITON_TEXT_FILE="triton-xpu"

				else

				  TRITON_REPO="https://github.com/openai/triton"

				  TRITON_TEXT_FILE="triton"
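The three-way backend selection can be sketched as a function that prints the repository and pin-file name. `triton_source` is a hypothetical name, and the sketch assumes that the second of the two `TRITON_REPO` lines in the ROCm branch (`openai/triton`) is the added one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "<repo-url> <pin-file-stem>" for the active backend.
# Arguments: ROCM_VERSION and XPU_VERSION (either may be empty).
triton_source() {
  local rocm_version="${1:-}" xpu_version="${2:-}"
  if [ -n "${rocm_version}" ]; then
    # Assumption: ROCm builds now use the upstream repository with a ROCm-specific pin.
    echo "https://github.com/openai/triton triton-rocm"
  elif [ -n "${xpu_version}" ]; then
    echo "https://github.com/intel/intel-xpu-backend-for-triton triton-xpu"
  else
    echo "https://github.com/openai/triton triton"
  fi
}

triton_source "" 0.5
```

The pin-file stem corresponds to the files under `.ci/docker/ci_commit_pins/` listed earlier in this diff (`triton.txt`, `triton-rocm.txt`, `triton-xpu.txt`).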

									
										6

.ci/docker/common/install_vision.sh
									
				@ -5,8 +5,7 @@ set -ex

				install_ubuntu() {

				  apt-get update

				  apt-get install -y --no-install-recommends \

				          libopencv-dev \

				          libavcodec-dev

				          libopencv-dev

				  # Cleanup

				  apt-get autoclean && apt-get clean

				@ -19,8 +18,7 @@ install_centos() {

				  yum --enablerepo=extras install -y epel-release

				  yum install -y \

				      opencv-devel \

				      ffmpeg-devel

				      opencv-devel

				  # Cleanup

				  yum clean all

									
										25

.ci/docker/common/install_xpu.sh
									
				@ -3,10 +3,7 @@ set -xe

				# Intel® software for general purpose GPU capabilities.

				# Refer to https://dgpu-docs.intel.com/releases/stable_647_21_20230714.html

				# Intel® oneAPI Base Toolkit (version 2024.0.0) has been updated to include functional and security updates.

				# Refer to https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html

				# Refer to https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html

				# Users should update to the latest version as it becomes available

				@ -17,14 +14,16 @@ function install_ubuntu() {

				    # Set up the repository. To do this, download the key to the system keyring

				    wget -qO - https://repositories.intel.com/gpu/intel-graphics.key \

				        | gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg

				    wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \

				        | gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null

				    wget -qO - https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \

				        | gpg --dearmor --output /usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg

				    # Add the signed entry to APT sources and configure the APT client to use the Intel repository

				    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/production/2328 unified" \

				    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] \

				        https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" \

				        | tee /etc/apt/sources.list.d/intel-gpu-jammy.list

				    echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \

				        | tee /etc/apt/sources.list.d/oneAPI.list

				    echo "deb [signed-by=/usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg] \

				        https://apt.repos.intel.com/intel-for-pytorch-gpu-dev all main" \

				        | tee /etc/apt/sources.list.d/intel-for-pytorch-gpu-dev.list

				    # Update the packages list and repository index

				    apt-get update

				@ -40,11 +39,11 @@ function install_ubuntu() {

				        mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo

				    # Development Packages

				    apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev

				    # Install Intel® oneAPI Base Toolkit

				    if [ -n "$BASEKIT_VERSION" ]; then

				        apt-get install intel-basekit=$BASEKIT_VERSION -y

				    # Install Intel Support Packages

				    if [ -n "$XPU_VERSION" ]; then

				        apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION}

				    else

				        apt-get install intel-basekit -y

				        apt-get install -y intel-for-pytorch-gpu-dev

				    fi

				    # Cleanup

31

.ci/docker/requirements-ci.txt


 @ -85,10 +85,10 @@ librosa>=0.6.2 ; python_version < "3.11"
 #Pinned versions:
 #test that import:
 mypy==1.8.0
 mypy==1.9.0
 # Pin MyPy version because new errors are likely to appear with each release
 #Description: linter
 #Pinned versions: 1.8.0
 #Pinned versions: 1.9.0
 #test that import: test_typing.py, test_type_hints.py
 networkx==2.8.8
 @ -134,9 +134,9 @@ opt-einsum==3.3
 #Pinned versions: 3.3
 #test that import: test_linalg.py
 optree==0.9.1
 optree==0.11.0
 #Description: A library for tree manipulation
 #Pinned versions: 0.9.1
 #Pinned versions: 0.11.0
 #test that import: test_vmap.py, test_aotdispatch.py, test_dynamic_shapes.py,
 #test_pytree.py, test_ops.py, test_control_flow.py, test_modules.py,
 #common_utils.py, test_eager_transforms.py, test_python_dispatch.py,
 @ -147,9 +147,9 @@ optree==0.9.1
 #test_pointwise_ops.py, test_dtensor_ops.py, test_torchinductor.py, test_fx.py,
 #test_fake_tensor.py, test_mps.py
 pillow==10.2.0
 pillow==10.3.0
 #Description:  Python Imaging Library fork
 #Pinned versions: 10.2.0
 #Pinned versions: 10.3.0
 #test that import:
 protobuf==3.20.2
 @ -228,12 +228,11 @@ scikit-image==0.20.0 ; python_version >= "3.10"
 #Pinned versions: 0.20.3
 #test that import:
 scipy==1.6.3 ; python_version < "3.10"
 scipy==1.8.1 ; python_version == "3.10"
 scipy==1.10.1 ; python_version == "3.11"
 scipy==1.10.1 ; python_version <= "3.11"
 scipy==1.12.0 ; python_version == "3.12"
 # Pin SciPy because of failing distribution tests (see #60347)
 #Description: scientific python
 #Pinned versions: 1.6.3
 #Pinned versions: 1.10.1
 #test that import: test_unary_ufuncs.py, test_torch.py,test_tensor_creation_ops.py
 #test_spectral_ops.py, test_sparse_csr.py, test_reductions.py,test_nn.py
 #test_linalg.py, test_binary_ufuncs.py
 @ -264,10 +263,10 @@ unittest-xml-reporting<=3.2.0,>=2.0.0
 #Pinned versions:
 #test that import:
 #wheel not found on aarch64, and source build requires rust
 lintrunner==0.10.7 ; platform_machine == "x86_64"
 #lintrunner is supported on aarch64-linux only from 0.12.4 version
 lintrunner==0.12.5
 #Description: all about linters!
 #Pinned versions: 0.10.7
 #Pinned versions: 0.12.5
 #test that import:
 rockset==1.0.3
 @ -280,9 +279,9 @@ ghstack==0.8.0
 #Pinned versions: 0.8.0
 #test that import:
 jinja2==3.1.3
 jinja2==3.1.4
 #Description: jinja2 template engine
 #Pinned versions: 3.1.3
 #Pinned versions: 3.1.4
 #test that import:
 pytest-cpp==2.3.0
 @ -311,3 +310,5 @@ lxml==5.0.0.
 #Description: This is a requirement of unittest-xml-reporting
 # Python-3.9 binaries
 PyGithub==2.3.0

									
										5

.ci/docker/ubuntu-cuda/Dockerfile
									
												
				@ -56,7 +56,7 @@ RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi

				RUN rm install_db.sh

				ENV INSTALLED_DB ${DB}

				# (optional) Install vision packages like OpenCV and ffmpeg

				# (optional) Install vision packages like OpenCV

				ARG VISION

				COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./

				RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi

				@ -139,7 +139,7 @@ COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm

				ARG CUDNN_VERSION

				ARG CUDA_VERSION

				COPY ./common/install_cudnn.sh install_cudnn.sh

				RUN if [ "${CUDNN_VERSION}" -eq 8 ]; then bash install_cudnn.sh; fi

				RUN if [ -n "${CUDNN_VERSION}" ]; then bash install_cudnn.sh; fi

				RUN rm install_cudnn.sh

				# Install CUSPARSELT

				@ -152,6 +152,7 @@ RUN rm install_cusparselt.sh

				RUN if [ -h /usr/local/cuda-11.6/cuda-11.6 ]; then rm /usr/local/cuda-11.6/cuda-11.6; fi

				RUN if [ -h /usr/local/cuda-11.7/cuda-11.7 ]; then rm /usr/local/cuda-11.7/cuda-11.7; fi

				RUN if [ -h /usr/local/cuda-12.1/cuda-12.1 ]; then rm /usr/local/cuda-12.1/cuda-12.1; fi

				RUN if [ -h /usr/local/cuda-12.4/cuda-12.4 ]; then rm /usr/local/cuda-12.4/cuda-12.4; fi

				USER jenkins

				CMD ["bash"]
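Note on the cuDNN guard in the hunk above: it appears in two forms, an integer comparison (`-eq 8`) and a non-empty test (`-n`). The sketch below is only an illustration of how those two tests behave for an empty value, for 8, and for another version; the function names `guard_eq_8` and `guard_non_empty` are hypothetical and do not exist in the Dockerfile.

```shell
# Hypothetical helpers; they only mirror the two guard forms shown in the hunk.

# Integer comparison: true only for the value 8. With an empty value,
# `[` reports an invalid-integer error and returns a non-zero status.
guard_eq_8() {
  if [ "$1" -eq 8 ] 2>/dev/null; then echo "install"; else echo "skip"; fi
}

# Non-empty test: true for any version string.
guard_non_empty() {
  if [ -n "$1" ]; then echo "install"; else echo "skip"; fi
}

for version in "" 8 9; do
  echo "CUDNN_VERSION='${version}': -eq 8 -> $(guard_eq_8 "$version"), -n -> $(guard_non_empty "$version")"
done
```

Under the `-n` form, any non-empty CUDNN_VERSION triggers install_cudnn.sh, so version handling would have to live inside that script.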

									
										14

.ci/docker/ubuntu-rocm/Dockerfile
									
												
				@ -53,7 +53,7 @@ RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi

				RUN rm install_db.sh

				ENV INSTALLED_DB ${DB}

				# (optional) Install vision packages like OpenCV and ffmpeg

				# (optional) Install vision packages like OpenCV

				ARG VISION

				COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./

				RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi

				@ -78,6 +78,11 @@ ENV MAGMA_HOME /opt/rocm/magma

				ENV LANG C.UTF-8

				ENV LC_ALL C.UTF-8

				# Install amdsmi

				COPY ./common/install_amdsmi.sh install_amdsmi.sh

				RUN bash ./install_amdsmi.sh

				RUN rm install_amdsmi.sh

				# (optional) Install non-default CMake version

				ARG CMAKE_VERSION

				COPY ./common/install_cmake.sh install_cmake.sh

				@ -100,6 +105,13 @@ COPY triton_version.txt triton_version.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt

				# Install AOTriton

				COPY ./aotriton_version.txt aotriton_version.txt

				COPY ./common/common_utils.sh common_utils.sh

				COPY ./common/install_aotriton.sh install_aotriton.sh

				RUN ["/bin/bash", "-c", "./install_aotriton.sh /opt/rocm && rm -rf install_aotriton.sh aotriton_version.txt common_utils.sh"]

				ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton

				# Install ccache/sccache (do this last, so we get priority in PATH)

				COPY ./common/install_cache.sh install_cache.sh

				ENV PATH /opt/cache/bin:$PATH

									
										18

.ci/docker/ubuntu-xpu/Dockerfile
									
												
				@ -61,15 +61,20 @@ COPY ci_commit_pins/timm.txt timm.txt

				RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi

				RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt

				# Install XPU Dependencies

				ARG XPU_VERSION

				COPY ./common/install_xpu.sh install_xpu.sh

				RUN bash ./install_xpu.sh && rm install_xpu.sh

				ARG TRITON

				# Install triton, this needs to be done before sccache because the latter will

				# try to reach out to S3, which docker build runners don't have access

				COPY ./common/install_triton.sh install_triton.sh

				COPY ./common/common_utils.sh common_utils.sh

				# TODO: will add triton xpu commit

				COPY ci_commit_pins/triton.txt triton.txt

				COPY ci_commit_pins/triton-xpu.txt triton-xpu.txt

				COPY triton_version.txt triton_version.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton.txt

				RUN rm install_triton.sh common_utils.sh triton-xpu.txt triton_version.txt

				# (optional) Install database packages like LMDB and LevelDB

				ARG DB

				@ -78,18 +83,13 @@ RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi

				RUN rm install_db.sh

				ENV INSTALLED_DB ${DB}

				# (optional) Install vision packages like OpenCV and ffmpeg

				# (optional) Install vision packages like OpenCV

				ARG VISION

				COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./

				RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi

				RUN rm install_vision.sh cache_vision_models.sh common_utils.sh

				ENV INSTALLED_VISION ${VISION}

				# Install XPU Dependencies

				ARG BASEKIT_VERSION

				COPY ./common/install_xpu.sh install_xpu.sh

				RUN bash ./install_xpu.sh && rm install_xpu.sh

				# (optional) Install non-default CMake version

				ARG CMAKE_VERSION

				COPY ./common/install_cmake.sh install_cmake.sh

									
										8

.ci/docker/ubuntu/Dockerfile
									
												
				@ -80,7 +80,7 @@ RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi

				RUN rm install_db.sh

				ENV INSTALLED_DB ${DB}

				# (optional) Install vision packages like OpenCV and ffmpeg

				# (optional) Install vision packages like OpenCV

				ARG VISION

				COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./

				RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi

				@ -169,9 +169,11 @@ RUN rm install_acl.sh

				ENV INSTALLED_ACL ${ACL}

				# Install ccache/sccache (do this last, so we get priority in PATH)

				ARG SKIP_SCCACHE_INSTALL

				COPY ./common/install_cache.sh install_cache.sh

				ENV PATH /opt/cache/bin:$PATH

				RUN bash ./install_cache.sh && rm install_cache.sh

				RUN if [ -z "${SKIP_SCCACHE_INSTALL}" ]; then bash ./install_cache.sh; fi

				RUN rm install_cache.sh

				# Add jni.h for java host build

				COPY ./common/install_jni.sh install_jni.sh

				@ -188,7 +190,9 @@ ARG BUILD_ENVIRONMENT

				ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}

				# Install LLVM dev version (Defined in the pytorch/builder github repository)

				ARG SKIP_LLVM_SRC_BUILD_INSTALL

				COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm

				RUN if [ -n "${SKIP_LLVM_SRC_BUILD_INSTALL}" ]; then set -eu; rm -rf /opt/llvm; fi

				# AWS specific CUDA build guidance

				ENV TORCH_CUDA_ARCH_LIST Maxwell
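Note on the two build arguments in this Dockerfile diff: they are tested in opposite directions. `-z "${SKIP_SCCACHE_INSTALL}"` runs install_cache.sh unless the flag is set, while `-n "${SKIP_LLVM_SRC_BUILD_INSTALL}"` removes /opt/llvm only when the flag is set. A minimal sketch of that logic, using a hypothetical `plan_steps` function that only prints what each guard would do:

```shell
# Hypothetical function; it reports what the two RUN guards would do
# for the given flag values. $1 = SKIP_SCCACHE_INSTALL, $2 = SKIP_LLVM_SRC_BUILD_INSTALL.
plan_steps() {
  if [ -z "$1" ]; then echo "run install_cache.sh"; else echo "skip install_cache.sh"; fi
  if [ -n "$2" ]; then echo "remove /opt/llvm"; else echo "keep /opt/llvm"; fi
}

plan_steps "" ""     # defaults: sccache is installed and /opt/llvm stays
plan_steps "1" "1"   # both opt-outs set
```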

									
										4

.ci/onnx/common.sh
									
												
				@ -1,5 +1,9 @@

				#!/bin/bash

				set -ex

				source "$(dirname "${BASH_SOURCE[0]}")/../pytorch/common_utils.sh"

				LOCAL_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)

				ROOT_DIR=$(cd "$LOCAL_DIR"/../.. && pwd)

				TEST_DIR="$ROOT_DIR/test"

									
										14

.ci/onnx/test.sh
									
												
				@ -3,6 +3,20 @@

				# shellcheck source=./common.sh

				source "$(dirname "${BASH_SOURCE[0]}")/common.sh"

				# Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)

				WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")

				cleanup_workspace() {

				  echo "sudo may print the following warning message that can be ignored. The chown command will still run."

				  echo "    sudo: setrlimit(RLIMIT_STACK): Operation not permitted"

				  echo "For more details refer to https://github.com/sudo-project/sudo/issues/42"

				  sudo chown -R "$WORKSPACE_ORIGINAL_OWNER_ID" /var/lib/jenkins/workspace

				}

				# Disable shellcheck SC2064 as we want to parse the original owner immediately.

				# shellcheck disable=SC2064

				trap_add cleanup_workspace EXIT

				sudo chown -R jenkins /var/lib/jenkins/workspace

				git config --global --add safe.directory /var/lib/jenkins/workspace

				if [[ "$BUILD_ENVIRONMENT" == *onnx* ]]; then

				  # TODO: This can be removed later once vision is also part of the Docker image

				  pip install -q --user --no-use-pep517 "git+https://github.com/pytorch/vision.git@$(cat .github/ci_commit_pins/vision.txt)"
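Note on the workspace-ownership workaround in this diff: the pattern is to record the original owner first, register a handler that restores it on exit, and only then change ownership. The sketch below shows that ordering under stated assumptions: a temporary directory stands in for /var/lib/jenkins/workspace, plain `trap` stands in for the `trap_add` helper (which comes from common_utils.sh and is not reproduced here), and no `sudo` is used, so the `chown` is effectively a no-op.

```shell
# Stand-in for /var/lib/jenkins/workspace so the sketch runs without sudo.
WORKSPACE="$(mktemp -d)"

# Capture the owner before anything changes it (GNU stat syntax, as in the script).
ORIGINAL_OWNER_ID="$(stat -c '%u' "$WORKSPACE")"

restore_owner() {
  # Runs when the shell exits, whether the job succeeded or failed.
  chown -R "$ORIGINAL_OWNER_ID" "$WORKSPACE"
  echo "restored owner $ORIGINAL_OWNER_ID on $WORKSPACE"
  rm -rf "$WORKSPACE"   # cleanup of the stand-in directory only; not part of the original script
}

# Plain `trap` replaces any existing EXIT handler, unlike an appending helper.
trap restore_owner EXIT

echo "working in $WORKSPACE as uid $(id -u)"
```

Registering the handler before the ownership change means an early failure still restores the original owner.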

									
										66

.ci/pytorch/build.sh
									
												
				@ -44,15 +44,7 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then

				  fi

				fi

				if [[ ${BUILD_ENVIRONMENT} == *"caffe2"* ]]; then

				  echo "Caffe2 build is ON"

				  export BUILD_CAFFE2=ON

				fi

				if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then

				  export ATEN_THREADING=TBB

				  export USE_TBB=1

				elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then

				if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then

				  export ATEN_THREADING=NATIVE

				fi

				@ -81,7 +73,22 @@ if ! which conda; then

				    export USE_MKLDNN=0

				  fi

				else

				  export CMAKE_PREFIX_PATH=/opt/conda

				  # CMAKE_PREFIX_PATH precedences

				  # 1. $CONDA_PREFIX, if defined. This follows the pytorch official build instructions.

				  # 2. /opt/conda/envs/py_${ANACONDA_PYTHON_VERSION}, if ANACONDA_PYTHON_VERSION defined.

				  #    This is for CI, which defines ANACONDA_PYTHON_VERSION but not CONDA_PREFIX.

				  # 3. $(conda info --base). The fallback value of pytorch official build

				  #    instructions actually refers to this.

				  #    Commonly this is /opt/conda/

				  if [[ -v CONDA_PREFIX ]]; then

				    export CMAKE_PREFIX_PATH=${CONDA_PREFIX}

				  elif [[ -v ANACONDA_PYTHON_VERSION ]]; then

				    export CMAKE_PREFIX_PATH="/opt/conda/envs/py_${ANACONDA_PYTHON_VERSION}"

				  else

				    # already checked by `! which conda`

				    CMAKE_PREFIX_PATH="$(conda info --base)"

				    export CMAKE_PREFIX_PATH

				  fi

				  # Workaround required for MKL library linkage

				  # https://github.com/pytorch/pytorch/issues/119557

				@ -223,6 +230,24 @@ if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]

				  export BUILD_STATIC_RUNTIME_BENCHMARK=ON

				fi

				# Do not change workspace permissions for ROCm CI jobs

				# as it can leave workspace with bad permissions for cancelled jobs

				if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then

				  # Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)

				  WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")

				  cleanup_workspace() {

				    echo "sudo may print the following warning message that can be ignored. The chown command will still run."

				    echo "    sudo: setrlimit(RLIMIT_STACK): Operation not permitted"

				    echo "For more details refer to https://github.com/sudo-project/sudo/issues/42"

				    sudo chown -R "$WORKSPACE_ORIGINAL_OWNER_ID" /var/lib/jenkins/workspace

				  }

				  # Disable shellcheck SC2064 as we want to parse the original owner immediately.

				  # shellcheck disable=SC2064

				  trap_add cleanup_workspace EXIT

				  sudo chown -R jenkins /var/lib/jenkins/workspace

				  git config --global --add safe.directory /var/lib/jenkins/workspace

				fi

				if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then

				  set -e

				@ -248,15 +273,22 @@ else

				  ( ! get_exit_code python setup.py clean bad_argument )

				  if [[ "$BUILD_ENVIRONMENT" != *libtorch* ]]; then

				    # rocm builds fail when WERROR=1

				    # XLA test build fails when WERROR=1

				    # set only when building other architectures

				    # or building non-XLA tests.

				    if [[ "$BUILD_ENVIRONMENT" != *rocm*  &&

				          "$BUILD_ENVIRONMENT" != *xla* ]]; then

				      if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then

				        # Install numpy-2.0 release candidate for builds

				        # Which should be backward compatible with Numpy-1.X

				        python -mpip install --pre numpy==2.0.0rc1

				      fi

				      WERROR=1 python setup.py bdist_wheel

				    else

				      if [[ "$BUILD_ENVIRONMENT" == *xla* ]]; then

				        source .ci/pytorch/install_cache_xla.sh

				      fi

				      python setup.py bdist_wheel

				    fi

				    pip_install_whl "$(echo dist/*.whl)"

				@ -298,7 +330,7 @@ else

				    SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"

				    mkdir -p "$CUSTOM_OP_BUILD"

				    pushd "$CUSTOM_OP_BUILD"

				    cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPYTHON_EXECUTABLE="$(which python)" \

				    cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPython_EXECUTABLE="$(which python)" \

				          -DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"

				    make VERBOSE=1

				    popd

				@ -311,7 +343,7 @@ else

				    SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"

				    mkdir -p "$JIT_HOOK_BUILD"

				    pushd "$JIT_HOOK_BUILD"

				    cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPYTHON_EXECUTABLE="$(which python)" \

				    cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPython_EXECUTABLE="$(which python)" \

				          -DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"

				    make VERBOSE=1

				    popd

				@ -323,7 +355,7 @@ else

				    python --version

				    mkdir -p "$CUSTOM_BACKEND_BUILD"

				    pushd "$CUSTOM_BACKEND_BUILD"

				    cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPYTHON_EXECUTABLE="$(which python)" \

				    cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPython_EXECUTABLE="$(which python)" \

				          -DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"

				    make VERBOSE=1

				    popd

				@ -354,4 +386,8 @@ if [[ "$BUILD_ENVIRONMENT" != *libtorch* && "$BUILD_ENVIRONMENT" != *bazel* ]];

				  python tools/stats/export_test_times.py

				fi

				print_sccache_stats

				# snadampal: skipping it till sccache support added for aarch64

				# https://github.com/pytorch/pytorch/issues/121559

				if [[ "$BUILD_ENVIRONMENT" != *aarch64* ]]; then

				  print_sccache_stats

				fi
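Note on the CMAKE_PREFIX_PATH precedence described in the build.sh comments above ($CONDA_PREFIX, then the per-version conda env, then the conda base directory): the sketch below restates it as a function that can be run without conda. The name `choose_cmake_prefix` and its three positional parameters are hypothetical. It also simplifies one detail: it treats empty strings as "not defined", whereas the script's `[[ -v VAR ]]` test accepts a variable that is set to an empty value.

```shell
# Hypothetical helper: prints the prefix that the precedence rules would select.
# $1 = CONDA_PREFIX value, $2 = ANACONDA_PYTHON_VERSION value, $3 = conda base dir.
choose_cmake_prefix() {
  if [ -n "$1" ]; then
    echo "$1"                          # 1. an active conda environment wins
  elif [ -n "$2" ]; then
    echo "/opt/conda/envs/py_$2"       # 2. CI case: version known, no CONDA_PREFIX
  else
    echo "$3"                          # 3. fallback: `conda info --base` in the script
  fi
}

choose_cmake_prefix "/home/me/miniconda3/envs/dev" "3.10" "/opt/conda"
choose_cmake_prefix "" "3.10" "/opt/conda"
choose_cmake_prefix "" "" "/opt/conda"
```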

									
										2

.ci/pytorch/common_utils.sh
									
												
				@ -159,7 +159,7 @@ function install_torchvision() {

				}

				function install_tlparse() {

				  pip_install --user "tlparse==0.3.5"

				  pip_install --user "tlparse==0.3.7"

				  PATH="$(python -m site --user-base)/bin:$PATH"

				}

									
										2

.ci/pytorch/docs-test.sh
									
												
				@ -6,4 +6,4 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh"

				echo "Testing pytorch docs"

				cd docs

				make doctest

				TERM=vt100 make doctest

									
										37

.ci/pytorch/install_cache_xla.sh
									
										Executable file
									
												
				@ -0,0 +1,37 @@

				#!/bin/bash

				# Script for installing sccache on the xla build job, which uses xla's docker

				# image and doesn't have sccache installed on it.  This is mostly copied from

				# .ci/docker/install_cache.sh.  Changes are: removing checks that will always

				# return the same thing, ex checks for rocm, CUDA, and changing the path

				# where sccache is installed, and not changing /etc/environment.

				set -ex

				install_binary() {

				  echo "Downloading sccache binary from S3 repo"

				  curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /tmp/cache/bin/sccache

				}

				mkdir -p /tmp/cache/bin

				mkdir -p /tmp/cache/lib

				export PATH="/tmp/cache/bin:$PATH"

				install_binary

				chmod a+x /tmp/cache/bin/sccache

				function write_sccache_stub() {

				  # Unset LD_PRELOAD for ps because of asan + ps issues

				  # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90589

				  # shellcheck disable=SC2086

				  # shellcheck disable=SC2059

				  printf "#!/bin/sh\nif [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then\n  exec sccache $(which $1) \"\$@\"\nelse\n  exec $(which $1) \"\$@\"\nfi" > "/tmp/cache/bin/$1"

				  chmod a+x "/tmp/cache/bin/$1"

				}

				write_sccache_stub cc

				write_sccache_stub c++

				write_sccache_stub gcc

				write_sccache_stub g++

				write_sccache_stub clang

				write_sccache_stub clang++
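Note on `write_sccache_stub`: the one-line `printf` is easier to follow when the generated stub is written out with a heredoc. The sketch below is a rough equivalent, not the script's code. `STUB_DIR` is a hypothetical stand-in for /tmp/cache/bin, `sh` is wrapped only because it exists on every system, and the stub is printed rather than executed, so sccache does not need to be installed. Unlike the original, the `$(...)` inside the stub is quoted.

```shell
# Hypothetical stand-in for /tmp/cache/bin.
STUB_DIR="$(mktemp -d)"

write_stub() {
  tool="$1"
  real_path="$(command -v "$tool")"
  # Unquoted heredoc: $real_path expands now; escaped \$ forms stay literal in the stub.
  cat > "$STUB_DIR/$tool" <<EOF
#!/bin/sh
# If the parent process is not sccache, re-enter through sccache;
# otherwise run the real tool so the wrapper does not call itself forever.
if [ "\$(env -u LD_PRELOAD ps -p \$PPID -o comm=)" != sccache ]; then
  exec sccache $real_path "\$@"
else
  exec $real_path "\$@"
fi
EOF
  chmod a+x "$STUB_DIR/$tool"
}

write_stub sh
cat "$STUB_DIR/sh"
```

The parent-process check matters because the stub shadows the real compiler on PATH: when sccache itself invokes the compiler name, the stub must run the real binary instead of calling sccache again.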

									
										8

.ci/pytorch/multigpu-test.sh
									
												
				@ -18,6 +18,7 @@ time python test/run_test.py --verbose -i distributed/test_c10d_gloo

				time python test/run_test.py --verbose -i distributed/test_c10d_nccl

				time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo

				time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl

				time python test/run_test.py --verbose -i distributed/test_cuda_p2p

				time python test/run_test.py --verbose -i distributed/test_store

				time python test/run_test.py --verbose -i distributed/test_pg_wrapper

				time python test/run_test.py --verbose -i distributed/rpc/cuda/test_tensorpipe_agent

				@ -45,6 +46,13 @@ time python test/run_test.py --verbose -i distributed/test_device_mesh

				time python test/run_test.py --verbose -i distributed/tensor/parallel/test_ddp_2d_parallel

				time python test/run_test.py --verbose -i distributed/tensor/parallel/test_fsdp_2d_parallel

				time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples

				time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state

				# FSDP2 tests

				time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh

				# Pipelining composability tests

				time python test/run_test.py --verbose -i distributed/pipelining/test_composability.py

				# Other tests

				time python test/run_test.py --verbose -i test_cuda_primary_ctx

									
										6

.ci/pytorch/perf_test/compare_with_baseline.py
									
												
				@ -59,16 +59,16 @@ print("sample mean: ", sample_mean)

				print("sample sigma: ", sample_sigma)

				if math.isnan(sample_mean):

				    raise Exception("""Error: sample mean is NaN""")

				    raise Exception("""Error: sample mean is NaN""")  # noqa: TRY002

				elif math.isnan(sample_sigma):

				    raise Exception("""Error: sample sigma is NaN""")

				    raise Exception("""Error: sample sigma is NaN""")  # noqa: TRY002

				z_value = (sample_mean - mean) / sigma

				print("z-value: ", z_value)

				if z_value >= 3:

				    raise Exception(

				    raise Exception(  # noqa: TRY002

				        f"""\n

				z-value >= 3, there is high chance of perf regression.\n

				To reproduce this regression, run

									
										9

.ci/pytorch/python_doc_push_script.sh
									
												
				@ -26,8 +26,8 @@ echo "error: python_doc_push_script.sh: version (arg2) not specified"

				fi

				# Argument 1: Where to copy the built documentation to

				# (pytorch.github.io/$install_path)

				install_path="${1:-${DOCS_INSTALL_PATH:-docs/${DOCS_VERSION}}}"

				# (pytorch_docs/$install_path)

				install_path="${1:-${DOCS_INSTALL_PATH:-${DOCS_VERSION}}}"

				if [ -z "$install_path" ]; then

				echo "error: python_doc_push_script.sh: install_path (arg1) not specified"

				  exit 1

				@ -68,8 +68,8 @@ build_docs () {

				}

				git clone https://github.com/pytorch/pytorch.github.io -b "$branch" --depth 1

				pushd pytorch.github.io

				git clone https://github.com/pytorch/docs pytorch_docs -b "$branch" --depth 1

				pushd pytorch_docs

				export LC_ALL=C

				export PATH=/opt/conda/bin:$PATH

				@ -105,6 +105,7 @@ if [ "$is_main_doc" = true ]; then

				    echo undocumented objects found:

				    cat build/coverage/python.txt

				    echo "Make sure you've updated relevant .rsts in docs/source!"

				    echo "You can reproduce locally by running 'cd docs && make coverage && cat build/coverage/python.txt'"

				    exit 1

				  fi

				else
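Note on the `install_path` default in this diff: the nested `${1:-${DOCS_INSTALL_PATH:-${DOCS_VERSION}}}` expansion resolves in order, taking the first argument if non-empty, then DOCS_INSTALL_PATH, then DOCS_VERSION. The sketch below wraps the same expansion in a hypothetical `resolve_install_path` function so each case can be tried; the values "main", "nightly", and "2.3" are made up for illustration.

```shell
# Hypothetical wrapper around the same nested expansion used for install_path.
resolve_install_path() {
  echo "${1:-${DOCS_INSTALL_PATH:-${DOCS_VERSION}}}"
}

unset DOCS_INSTALL_PATH
DOCS_VERSION="main"
resolve_install_path          # prints "main": falls through to DOCS_VERSION

DOCS_INSTALL_PATH="nightly"
resolve_install_path          # prints "nightly": DOCS_INSTALL_PATH wins over DOCS_VERSION

resolve_install_path "2.3"    # prints "2.3": an explicit first argument wins over both
```

Because `:-` treats an empty value like an unset one, passing an empty first argument also falls through to the environment variables.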

									
										208

.ci/pytorch/test.sh
									
												
				@ -6,6 +6,27 @@

				set -ex

				# shellcheck source=./common.sh

				source "$(dirname "${BASH_SOURCE[0]}")/common.sh"

				# Do not change workspace permissions for ROCm CI jobs

				# as it can leave workspace with bad permissions for cancelled jobs

				if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then

				  # Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)

				  WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")

				  cleanup_workspace() {

				    echo "sudo may print the following warning message that can be ignored. The chown command will still run."

				    echo "    sudo: setrlimit(RLIMIT_STACK): Operation not permitted"

				    echo "For more details refer to https://github.com/sudo-project/sudo/issues/42"

				    sudo chown -R "$WORKSPACE_ORIGINAL_OWNER_ID" /var/lib/jenkins/workspace

				  }

				  # Disable shellcheck SC2064 as we want to parse the original owner immediately.

				  # shellcheck disable=SC2064

				  trap_add cleanup_workspace EXIT

				  sudo chown -R jenkins /var/lib/jenkins/workspace

				  git config --global --add safe.directory /var/lib/jenkins/workspace

				fi

				echo "Environment variables:"

				env

				@ -90,9 +111,6 @@ if [[ -n $TESTS_TO_INCLUDE ]]; then

				  INCLUDE_CLAUSE="--include $TESTS_TO_INCLUDE"

				fi

				# shellcheck source=./common.sh

				source "$(dirname "${BASH_SOURCE[0]}")/common.sh"

				echo "Environment variables"

				env

				@ -163,6 +181,11 @@ if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then

				  export PATH="$HOME/.local/bin:$PATH"

				fi

				if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then

				  # TODO: revisit this once the CI is stabilized on aarch64 linux

				  export VALGRIND=OFF

				fi

				install_tlparse

				# DANGER WILL ROBINSON.  The LD_PRELOAD here could cause you problems

				@ -211,8 +234,6 @@ if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then

				    export LD_PRELOAD=/usr/lib/llvm-15/lib/clang/15.0.7/lib/linux/libclang_rt.asan-x86_64.so

				    # Disable valgrind for asan

				    export VALGRIND=OFF

				    # Increase stack size, because ASAN red zones use more stack

				    ulimit -s 81920

				    (cd test && python -c "import torch; print(torch.__version__, torch.version.git_version)")

				    echo "The next four invocations are expected to crash; if they don't that means ASAN/UBSAN is misconfigured"

				@ -243,6 +264,18 @@ elif [[ $TEST_CONFIG == 'nogpu_AVX512' ]]; then

				  export ATEN_CPU_CAPABILITY=avx2

				fi

				# temp workarounds for https://github.com/pytorch/pytorch/issues/126692, remove when fixed

				if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then

				  pushd test

				  CUDA_VERSION=$(python -c "import torch; print(torch.version.cuda)")

				  if [ "$CUDA_VERSION" == "12.4" ]; then

				    ISCUDA124="cu124"

				  else

				    ISCUDA124=""

				  fi

				  popd

				fi

				test_python_legacy_jit() {

				  time python test/run_test.py --include test_jit_legacy test_jit_fuser_legacy --verbose

				  assert_git_not_dirty

				@ -289,19 +322,24 @@ test_dynamo_shard() {

				test_inductor_distributed() {

				  # Smuggle a few multi-gpu tests here so that we don't have to request another large node

				  echo "Testing multi_gpu tests in test_torchinductor"

				  pytest test/inductor/test_torchinductor.py -k test_multi_gpu

				  pytest test/inductor/test_aot_inductor.py -k test_non_default_cuda_device

				  pytest test/inductor/test_aot_inductor.py -k test_replicate_on_devices

				  pytest test/distributed/test_c10d_functional_native.py

				  pytest test/distributed/_tensor/test_dtensor_compile.py

				  pytest test/distributed/tensor/parallel/test_fsdp_2d_parallel.py

				  pytest test/distributed/_composable/fsdp/test_fully_shard_comm.py

				  pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group

				  pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing

				  pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_mlp

				  pytest test/distributed/_composable/fsdp/test_fully_shard_frozen.py

				  pytest test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype

				  pytest test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype

				  python test/run_test.py -i inductor/test_torchinductor.py -k test_multi_gpu --verbose

				  python test/run_test.py -i inductor/test_aot_inductor.py -k test_non_default_cuda_device --verbose

				  python test/run_test.py -i inductor/test_aot_inductor.py -k test_replicate_on_devices --verbose

				  python test/run_test.py -i distributed/test_c10d_functional_native.py --verbose

				  python test/run_test.py -i distributed/_tensor/test_dtensor_compile.py --verbose

				  python test/run_test.py -i distributed/tensor/parallel/test_fsdp_2d_parallel.py --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_comm.py --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_mlp --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_transformer_checkpoint_resume --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_gradient_accumulation --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_frozen.py --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py -k test_clip_grad_norm_2d --verbose

				  python test/run_test.py -i distributed/fsdp/test_fsdp_tp_integration.py -k test_fsdp_tp_integration --verbose

				  # This runs on both single-gpu and multi-gpu instances. It should be smart about skipping tests that aren't supported

				  # when the required number of GPUs isn't available

				@ -313,22 +351,32 @@ test_inductor() {

				  python tools/dynamo/verify_dynamo.py

				  python test/run_test.py --inductor --include test_modules test_ops test_ops_gradients test_torch --verbose

				  # Do not add --inductor for the following inductor unit tests, otherwise we will fail because of nested dynamo state

				  python test/run_test.py --include inductor/test_torchinductor inductor/test_torchinductor_opinfo --verbose

				  python test/run_test.py --include inductor/test_torchinductor inductor/test_torchinductor_opinfo inductor/test_aot_inductor --verbose

				  # docker build uses bdist_wheel which does not work with test_aot_inductor

				  # TODO: need a faster way to build

				  if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then

				      BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop

				      CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aot_inductor

				      CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference

				  fi

				}

				test_inductor_cpp_wrapper_abi_compatible() {

				  export TORCHINDUCTOR_ABI_COMPATIBLE=1

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  mkdir -p "$TEST_REPORTS_DIR"

				  echo "Testing Inductor cpp wrapper mode with TORCHINDUCTOR_ABI_COMPATIBLE=1"

				  # cpu stack allocation causes segfault and needs more investigation

				  TORCHINDUCTOR_STACK_ALLOCATION=0 python test/run_test.py --include inductor/test_cpu_cpp_wrapper

				  PYTORCH_TESTING_DEVICE_ONLY_FOR="" python test/run_test.py --include inductor/test_cpu_cpp_wrapper

				  python test/run_test.py --include inductor/test_cuda_cpp_wrapper

				  TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \

				    --training --inductor --disable-cudagraphs --only vit_base_patch16_224 \

				    --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv"

				  python benchmarks/dynamo/check_accuracy.py \

				    --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \

				    --expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/inductor_timm_training.csv"

				}

				# "Global" flags for inductor benchmarking controlled by TEST_CONFIG

				@ -432,6 +480,17 @@ test_perf_for_dashboard() {

				            "${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" \

				            --output "$TEST_REPORTS_DIR/${backend}_max_autotune_${suite}_${dtype}_${mode}_cuda_${target}.csv"

				      fi

				      if [[ "$DASHBOARD_TAG" == *cudagraphs_low_precision-true* ]] && [[ "$mode" == "inference" ]]; then

				        # TODO: This has a new dtype called quant and the benchmarks script needs to be updated to support this.

				        # The tentative command is as follows. It doesn't work now, but it's ok because we only need mock data

				        # to fill the dashboard.

				        python "benchmarks/dynamo/$suite.py" \

				          "${target_flag[@]}" --"$mode" --quant --backend "$backend" "$@" \

				          --output "$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_cuda_${target}.csv" || true

				        # Copy cudagraph results as mock data, easiest choice?

				        cp "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_cuda_${target}.csv" \

				          "$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_cuda_${target}.csv"

				      fi

				    done

				  done

				}

				@ -479,13 +538,18 @@ test_single_dynamo_benchmark() {

				      --output "$TEST_REPORTS_DIR/${name}_${suite}.csv"

				    python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/${TEST_CONFIG}_${name}.csv"

				    python benchmarks/dynamo/check_graph_breaks.py \

				      --actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/${TEST_CONFIG}_${name}.csv"

				  fi

				}

				test_inductor_micro_benchmark() {

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  python benchmarks/gpt_fast/benchmark.py --output "${TEST_REPORTS_DIR}/gpt_fast_benchmark.csv"

				}

				test_dynamo_benchmark() {

				  # Usage: test_dynamo_benchmark huggingface 0

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				@ -501,7 +565,11 @@ test_dynamo_benchmark() {

				    test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"

				  else

				    if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then

				      test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 "$@"

				      if [[ "${TEST_CONFIG}" == *freezing* ]]; then

				        test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 --freezing "$@"

				      else

				        test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 "$@"

				      fi

				    elif [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then

				      test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"

				    else

				@ -515,12 +583,16 @@ test_inductor_torchbench_smoketest_perf() {

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  mkdir -p "$TEST_REPORTS_DIR"

				  # smoke test the cpp_wrapper mode

				  TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy --bfloat16 \

				    --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_smoketest.csv"

				  # Test some models in the cpp wrapper mode

				  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \

				    --bfloat16 --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \

				    --bfloat16 --inference --inductor --only llama --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \

				    --bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				  python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_smoketest.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"

				    --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \

				    --expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/inductor_torchbench_inference.csv"

				  python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \

				    --batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \

				@ -535,7 +607,13 @@ test_inductor_torchbench_smoketest_perf() {

				  # https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314,

				  # and thus we lower its threshold to reduce flakiness. If this continues to be a problem,

				  # we switch to use some other model.

				  python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.9

				  # Use 4.7 for cuda 12.4, change back to 4.9 after fixing https://github.com/pytorch/pytorch/issues/126692

				  if [ "$CUDA_VERSION" == "12.4" ]; then

				    THRESHOLD=4.7

				  else

				    THRESHOLD=4.9

				  fi

				  python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t $THRESHOLD

				  # Check memory compression ratio for a few models

				  for test in hf_Albert timm_vision_transformer; do

				@ -547,6 +625,15 @@ test_inductor_torchbench_smoketest_perf() {

				      "$TEST_REPORTS_DIR/inductor_training_smoketest_$test.csv" \

				      --expected benchmarks/dynamo/expected_ci_perf_inductor_torchbench.csv

				  done

				  # Perform some "warm-start" runs for a few huggingface models.

				  for test in AlbertForQuestionAnswering AllenaiLongformerBase DistilBertForMaskedLM DistillGPT2 GoogleFnet YituTechConvBert; do

				    python benchmarks/dynamo/huggingface.py --accuracy --training --amp --inductor --device cuda --warm-start-latency \

				      --only $test --output "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv"

				    python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/${ISCUDA124}/inductor_huggingface_training.csv"

				  done

				}

				test_inductor_torchbench_cpu_smoketest_perf(){

				@ -593,6 +680,12 @@ test_inductor_torchbench_cpu_smoketest_perf(){

				  done

				}

				test_torchbench_gcp_smoketest(){

				  pushd "${TORCHBENCHPATH}"

				  python test.py -v

				  popd

				}

				test_python_gloo_with_tls() {

				  source "$(dirname "${BASH_SOURCE[0]}")/run_glootls_test.sh"

				  assert_git_not_dirty

				@ -624,7 +717,6 @@ test_aten() {

				  ${SUDO} ln -sf "$TORCH_LIB_DIR"/libmkldnn* "$TEST_BASE_DIR"

				  ${SUDO} ln -sf "$TORCH_LIB_DIR"/libnccl* "$TEST_BASE_DIR"

				  ${SUDO} ln -sf "$TORCH_LIB_DIR"/libtorch* "$TEST_BASE_DIR"

				  ${SUDO} ln -sf "$TORCH_LIB_DIR"/libtbb* "$TEST_BASE_DIR"

				  ls "$TEST_BASE_DIR"

				  aten/tools/run_tests.sh "$TEST_BASE_DIR"

				@ -649,21 +741,6 @@ test_without_numpy() {

				  popd

				}

				# pytorch extensions require including torch/extension.h which includes all.h

				# which includes utils.h which includes Parallel.h.

				# So you can call for instance parallel_for() from your extension,

				# but the compilation will fail because of Parallel.h has only declarations

				# and definitions are conditionally included Parallel.h(see last lines of Parallel.h).

				# I tried to solve it #39612 and #39881 by including Config.h into Parallel.h

				# But if Pytorch is built with TBB it provides Config.h

				# that has AT_PARALLEL_NATIVE_TBB=1(see #3961 or #39881) and it means that if you include

				# torch/extension.h which transitively includes Parallel.h

				# which transitively includes tbb.h which is not available!

				if [[ "${BUILD_ENVIRONMENT}" == *tbb* ]]; then

				  sudo mkdir -p /usr/include/tbb

				  sudo cp -r "$PWD"/third_party/tbb/include/tbb/* /usr/include/tbb

				fi

				test_libtorch() {

				  local SHARD="$1"

				@ -677,7 +754,6 @@ test_libtorch() {

				    ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR"

				    ln -sf "$TORCH_LIB_DIR"/libshm* "$TORCH_BIN_DIR"

				    ln -sf "$TORCH_LIB_DIR"/libtorch* "$TORCH_BIN_DIR"

				    ln -sf "$TORCH_LIB_DIR"/libtbb* "$TORCH_BIN_DIR"

				    ln -sf "$TORCH_LIB_DIR"/libnvfuser* "$TORCH_BIN_DIR"

				    export CPP_TESTS_DIR="${TORCH_BIN_DIR}"

				@ -814,7 +890,6 @@ test_rpc() {

				  # test reporting process to function as expected.

				  ln -sf "$TORCH_LIB_DIR"/libtorch* "$TORCH_BIN_DIR"

				  ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR"

				  ln -sf "$TORCH_LIB_DIR"/libtbb* "$TORCH_BIN_DIR"

				  CPP_TESTS_DIR="${TORCH_BIN_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_cpp_rpc

				}

				@ -1116,11 +1191,33 @@ test_executorch() {

				  assert_git_not_dirty

				}

				test_linux_aarch64(){

				  python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \

				       test_transformers test_multiprocessing test_numpy_interop --verbose

				  # Dynamo tests

				  python test/run_test.py --include dynamo/test_compile dynamo/test_backends dynamo/test_comptime dynamo/test_config \

				       dynamo/test_functions dynamo/test_fx_passes_pre_grad dynamo/test_interop dynamo/test_model_output dynamo/test_modules \

				       dynamo/test_optimizers dynamo/test_recompile_ux dynamo/test_recompiles --verbose

				  # Inductor tests

				  python test/run_test.py --include inductor/test_torchinductor inductor/test_benchmark_fusion inductor/test_codecache \

				       inductor/test_config inductor/test_control_flow inductor/test_coordinate_descent_tuner inductor/test_fx_fusion \

				       inductor/test_group_batch_fusion inductor/test_inductor_freezing inductor/test_inductor_utils \

				       inductor/test_inplacing_pass inductor/test_kernel_benchmark inductor/test_layout_optim \

				       inductor/test_max_autotune inductor/test_memory_planning inductor/test_metrics inductor/test_multi_kernel inductor/test_pad_mm \

				       inductor/test_pattern_matcher inductor/test_perf inductor/test_profiler inductor/test_select_algorithm inductor/test_smoke \

				       inductor/test_split_cat_fx_passes inductor/test_standalone_compile inductor/test_torchinductor \

				       inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes --verbose

				}

				if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then

				  (cd test && python -c "import torch; print(torch.__config__.show())")

				  (cd test && python -c "import torch; print(torch.__config__.parallel_info())")

				fi

				if [[ "${TEST_CONFIG}" == *backward* ]]; then

				if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then

				  test_linux_aarch64

				elif [[ "${TEST_CONFIG}" == *backward* ]]; then

				  test_forward_backward_compatibility

				  # Do NOT add tests after bc check tests, see its comment.

				elif [[ "${TEST_CONFIG}" == *xla* ]]; then

				@ -1145,6 +1242,8 @@ elif [[ "$TEST_CONFIG" == deploy ]]; then

				  test_torch_deploy

				elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then

				  test_inductor_distributed

				elif [[ "${TEST_CONFIG}" == *inductor-micro-benchmark* ]]; then

				  test_inductor_micro_benchmark

				elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then

				  install_torchvision

				  id=$((SHARD_NUMBER-1))

				@ -1172,6 +1271,9 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then

				      llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \

				      shufflenet_v2_x1_0 hf_GPT2

				    PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf

				  elif [[ "${TEST_CONFIG}" == *torchbench_gcp_smoketest* ]]; then

				    checkout_install_torchbench

				    TORCHBENCHPATH=$(pwd)/torchbench test_torchbench_gcp_smoketest

				  else

				    checkout_install_torchbench

				    # Do this after checkout_install_torchbench to ensure we clobber any

				@ -1195,6 +1297,10 @@ elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHAR

				elif [[ "${TEST_CONFIG}" == *dynamo* && $SHARD_NUMBER -gt 1 && $NUM_TEST_SHARDS -gt 1 ]]; then

				  install_torchvision

				  test_dynamo_shard "${SHARD_NUMBER}"

				elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then

				  install_torchvision

				  test_python_shard "$SHARD_NUMBER"

				  test_aten

				elif [[ "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then

				  test_without_numpy

				  install_torchvision

				@ -1224,10 +1330,6 @@ elif [[ "${BUILD_ENVIRONMENT}" == *-mobile-lightweight-dispatch* ]]; then

				  test_libtorch

				elif [[ "${TEST_CONFIG}" = docs_test ]]; then

				  test_docs_test

				elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then

				  install_torchvision

				  test_python

				  test_aten

				elif [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then

				  install_torchvision

				  test_python
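The test.sh hunk for test_inductor_torchbench_smoketest_perf replaces the fixed `-t 4.9` with a threshold chosen from `CUDA_VERSION` (4.7 for CUDA 12.4, 4.9 otherwise). A minimal sketch of that selection in isolation follows. The helper name `pick_threshold` is an assumption made for illustration and does not exist in test.sh; the sketch uses the POSIX `=` comparison rather than the `==` seen in the diff.

```shell
# Hypothetical helper (not part of test.sh): print the perf threshold
# that corresponds to a given CUDA version string.
pick_threshold() {
  cuda_version="${1:-}"
  if [ "$cuda_version" = "12.4" ]; then
    echo "4.7"   # relaxed value the diff uses for CUDA 12.4
  else
    echo "4.9"   # default value used for every other version
  fi
}

# An unset CUDA_VERSION falls through to the default branch.
THRESHOLD="$(pick_threshold "${CUDA_VERSION:-}")"
echo "Using threshold: $THRESHOLD"
```

Keeping the choice in one place means the later `check_perf_csv.py ... -t $THRESHOLD` call does not need to know why the value differs per CUDA version.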

									
										37

.ci/pytorch/win-test-helpers/build_pytorch.bat
									
												
				@ -17,22 +17,22 @@ set PATH=C:\Program Files\CMake\bin;C:\Program Files\7-Zip;C:\ProgramData\chocol

				set INSTALLER_DIR=%SCRIPT_HELPERS_DIR%\installation-helpers

				call %INSTALLER_DIR%\install_magma.bat

				if errorlevel 1 exit /b

				if not errorlevel 0 exit /b

				if errorlevel 1 goto fail

				if not errorlevel 0 goto fail

				call %INSTALLER_DIR%\install_sccache.bat

				if errorlevel 1 exit /b

				if not errorlevel 0 exit /b

				if errorlevel 1 goto fail

				if not errorlevel 0 goto fail

				:: Miniconda has been installed as part of the Windows AMI with all the dependencies.

				:: We just need to activate it here

				call %INSTALLER_DIR%\activate_miniconda3.bat

				if errorlevel 1 exit /b

				if not errorlevel 0 exit /b

				if errorlevel 1 goto fail

				if not errorlevel 0 goto fail

				call pip install mkl-include==2021.4.0 mkl-devel==2021.4.0

				if errorlevel 1 exit /b

				if not errorlevel 0 exit /b

				if errorlevel 1 goto fail

				if not errorlevel 0 goto fail

				:: Override VS env here

				pushd .

				@ -41,8 +41,8 @@ if "%VC_VERSION%" == "" (

				) else (

				    call "C:\Program Files (x86)\Microsoft Visual Studio\%VC_YEAR%\%VC_PRODUCT%\VC\Auxiliary\Build\vcvarsall.bat" x64 -vcvars_ver=%VC_VERSION%

				)

				if errorlevel 1 exit /b

				if not errorlevel 0 exit /b

				if errorlevel 1 goto fail

				if not errorlevel 0 goto fail

				@echo on

				popd

				@ -52,12 +52,12 @@ set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION%

				if x%CUDA_VERSION:.=%==x%CUDA_VERSION% (

				    echo CUDA version %CUDA_VERSION% format isn't correct, which doesn't contain '.'

				    exit /b 1

				    goto fail

				)

				rem version transformer, for example 10.1 to 10_1.

				if x%CUDA_VERSION:.=%==x%CUDA_VERSION% (

				    echo CUDA version %CUDA_VERSION% format isn't correct, which doesn't contain '.'

				    exit /b 1

				    goto fail

				)

				set VERSION_SUFFIX=%CUDA_VERSION:.=_%

				set CUDA_PATH_V%VERSION_SUFFIX%=%CUDA_PATH%

				@ -101,8 +101,8 @@ if "%USE_CUDA%"=="1" (

				  :: CMake requires a single command as CUDA_NVCC_EXECUTABLE, so we push the wrappers

				  :: randomtemp.exe and sccache.exe into a batch file which CMake invokes.

				  curl -kL https://github.com/peterjc123/randomtemp-rust/releases/download/v0.4/randomtemp.exe --output %TMP_DIR_WIN%\bin\randomtemp.exe

				  if errorlevel 1 exit /b

				  if not errorlevel 0 exit /b

				  if errorlevel 1 goto fail

				  if not errorlevel 0 goto fail

				  echo @"%TMP_DIR_WIN%\bin\randomtemp.exe" "%TMP_DIR_WIN%\bin\sccache.exe" "%CUDA_PATH%\bin\nvcc.exe" %%* > "%TMP_DIR%/bin/nvcc.bat"

				  cat %TMP_DIR%/bin/nvcc.bat

				  set CUDA_NVCC_EXECUTABLE=%TMP_DIR%/bin/nvcc.bat

				@ -114,8 +114,8 @@ if "%USE_CUDA%"=="1" (

				set

				python setup.py bdist_wheel

				if errorlevel 1 exit /b

				if not errorlevel 0 exit /b

				if errorlevel 1 goto fail

				if not errorlevel 0 goto fail

				sccache --show-stats

				python -c "import os, glob; os.system('python -mpip install --no-index --no-deps ' + glob.glob('dist/*.whl')[0])"

				(

				@ -135,3 +135,8 @@ python -c "import os, glob; os.system('python -mpip install --no-index --no-deps

				sccache --show-stats --stats-format json | jq .stats > sccache-stats-%BUILD_ENVIRONMENT%-%OUR_GITHUB_JOB_ID%.json

				sccache --stop-server

				exit /b 0

				:fail

				exit /b 1

									
										9

.circleci/scripts/binary_linux_test.sh
									
												
				@ -96,8 +96,13 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then

				    conda install \${EXTRA_CONDA_FLAGS} -y "\$pkg" --offline

				  )

				elif [[ "$PACKAGE_TYPE" != libtorch ]]; then

				  pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"

				  retry pip install -q numpy protobuf typing-extensions

				  if [[ "\$BUILD_ENVIRONMENT" != *s390x* ]]; then

				    pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"

				    retry pip install -q numpy protobuf typing-extensions

				  else

				    pip install "\$pkg"

				    retry pip install -q numpy protobuf typing-extensions

				  fi

				fi

				if [[ "$PACKAGE_TYPE" == libtorch ]]; then

				  pkg="\$(ls /final_pkgs/*-latest.zip)"

									
										4

.circleci/scripts/binary_populate_env.sh
									
												
				@ -76,8 +76,8 @@ TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)

				# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for all the wheel builds, hence append TRITON_CONSTRAINT

				if [[ "$PACKAGE_TYPE" =~ .*wheel.* &&  -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then

				  # Only linux Python < 3.12 are supported wheels for triton

				  TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.12'"

				  # Only linux Python < 3.13 are supported wheels for triton

				  TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.13'"

				  TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"

				  if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then

				      TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)
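The binary_populate_env.sh hunk above widens the Triton constraint from `python_version < '3.12'` to `python_version < '3.13'`. The sketch below shows how the two strings combine into a single requirement with a PEP 508 style environment marker. The version number is a made-up placeholder; the real script reads it from `.ci/docker/triton_version.txt`.

```shell
# Placeholder value for illustration only; the real script reads this
# from .ci/docker/triton_version.txt.
TRITON_VERSION="3.0.0"

# Marker copied from the new side of the diff: Linux x86_64, Python < 3.13.
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.13'"

# "name==version; marker" is the requirement form that gets appended.
TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
echo "$TRITON_REQUIREMENT"
```

On platforms where the marker evaluates to false, an installer that honors environment markers would be expected to skip the Triton dependency rather than fail.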

4

.clang-tidy


 @ -36,6 +36,7 @@ hicpp-exception-baseclass,
 hicpp-avoid-goto,
 misc-*,
 -misc-const-correctness,
 -misc-include-cleaner,
 -misc-use-anonymous-namespace,
 -misc-unused-parameters,
 -misc-no-recursion,
 @ -60,6 +61,7 @@ readability-simplify-subscript-expr,
 readability-string-compare,
 '
 HeaderFilterRegex: '^(aten/|c10/|torch/).*$'
 AnalyzeTemporaryDtors: false
 WarningsAsErrors: '*'
 CheckOptions:
   misc-header-include-cycle.IgnoredFilesList: 'format.h;ivalue.h;custom_class.h;Dict.h;List.h'
 ...

1

.flake8


 @ -54,6 +54,7 @@ per-file-ignores =
     torch/ao/quantization/fx/_decomposed.py: TOR901
     torch/distributed/_functional_collectives.py: TOR901
     torch/distributed/_spmd/data_parallel.py: TOR901
     torch/distributed/_tensor/_collective_utils.py: TOR901
 optional-ascii-coding = True
 exclude =
     ./.git,

1

.gitattributes vendored


 @ -4,3 +4,4 @@
 .github/generated-* linguist-generated=true
 .github/scripts/gql_mocks.json linguist-generated=true
 third_party/LICENSES_BUNDLED.txt linguist-generated=true
 tools/build/bazel/requirements.txt linguist-generated=true

									
										15

.github/ISSUE_TEMPLATE/pt2-bug-report.yml
									
										vendored
									
												
				@ -8,7 +8,18 @@ body:

				      value: >

				        #### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the

				        existing and past issues](https://github.com/pytorch/pytorch/issues)

				        It's likely that your bug will be resolved by checking our FAQ or troubleshooting guide [documentation](https://pytorch.org/docs/master/dynamo/index.html)

				        It's likely that your bug will be resolved by checking our FAQ or troubleshooting guide [documentation](https://pytorch.org/docs/main/dynamo/index.html)

				        Note: if you're submitting an issue that you generated from a fuzzer, please do the following:

				        - Ensure rtol/atol are at default tolerances

				        - Don't compare indices of max/min etc., because that avoids the above requirement

				        - If comparing eager and torch.compile at fp16/bf16, you should use fp32 as baseline

				        If the above requirements are met, add the label "topic: fuzzer" to your issue.

				  - type: textarea

				    attributes:

				      label: 🐛 Describe the bug

				@ -33,7 +44,7 @@ body:

				      label: Minified repro

				      description: |

				        Please run the minifier on your example and paste the minified code below

				        Learn more here https://pytorch.org/docs/master/compile/troubleshooting.html

				        Learn more here https://pytorch.org/docs/main/torch.compiler_troubleshooting.html

				      placeholder: |

				        env TORCHDYNAMO_REPRO_AFTER="aot" python your_model.py

				        or

									
										31

.github/actionlint.yaml
									
										vendored
									
												
				@ -1,9 +1,12 @@

				self-hosted-runner:

				  labels:

				    # GitHub hosted x86 Linux runners

				    - linux.20_04.4x

				    - linux.20_04.16x

				    - linux.large

				    # Repo-specific LF hosted ARC runners

				    - linux.large.arc

				    # Organization-wide AWS Linux Runners

				    - linux.large

				    - linux.2xlarge

				    - linux.4xlarge

				    - linux.12xlarge

				@ -13,16 +16,34 @@ self-hosted-runner:

				    - linux.8xlarge.nvidia.gpu

				    - linux.16xlarge.nvidia.gpu

				    - linux.g5.4xlarge.nvidia.gpu

				    # Organization-wide AWS Linux Runners on Linux Foundation account

				    - lf.linux.large

				    - lf.linux.2xlarge

				    - lf.linux.4xlarge

				    - lf.linux.12xlarge

				    - lf.linux.24xlarge

				    - lf.linux.arm64.2xlarge

				    - lf.linux.4xlarge.nvidia.gpu

				    - lf.linux.8xlarge.nvidia.gpu

				    - lf.linux.16xlarge.nvidia.gpu

				    - lf.linux.g5.4xlarge.nvidia.gpu

				    # Repo-specific IBM hosted S390x runner

				    - linux.s390x

				    # Organization wide AWS Windows runners

				    - windows.4xlarge.nonephemeral

				    - windows.8xlarge.nvidia.gpu

				    - windows.8xlarge.nvidia.gpu.nonephemeral

				    - windows.g5.4xlarge.nvidia.gpu

				    - bm-runner

				    # Organization-wide AMD hosted MI300 runners

				    - linux.rocm.gpu

				    # Repo-specific Apple hosted  runners

				    - macos-m1-ultra

				    - macos-m2-14

				    # Org wise AWS `mac2.metal` runners (2020 Mac mini hardware powered by Apple silicon M1 processors)

				    - macos-m1-stable

				    - macos-m1-13

				    - macos-12-xl

				    - macos-12

				    - macos12.3-m1

				    - macos-m1-14

				    # GitHub-hosted MacOS runners

				    - macos-latest-xlarge

				    - macos-13-xlarge

				    - macos-14-xlarge

									
										11

.github/actions/download-build-artifacts/action.yml
									
										vendored
									
												
				@ -9,6 +9,10 @@ inputs:

				  use-gha:

				    description: If set to any value, use GHA to download the artifact. Otherwise use s3.

				    required: false

				  s3-bucket:

				    description: S3 bucket to download builds

				    required: false

				    default: "gha-artifacts"

				runs:

				  using: composite

				@ -18,9 +22,10 @@ runs:

				      uses: seemethere/download-artifact-s3@v4

				      with:

				        name: ${{ inputs.name }}

				        s3-bucket: ${{ inputs.s3-bucket }}

				    - name: Download PyTorch Build Artifacts from GHA

				      if: inputs.use-gha

				      if: ${{ inputs.use-gha }}

				      uses: actions/download-artifact@v3

				      with:

				        name: ${{ inputs.name }}

				@ -29,6 +34,10 @@ runs:

				      shell: bash

				      run: unzip -o artifacts.zip

				    - name: Remove artifacts.zip

				      shell: bash

				      run: rm artifacts.zip

				    - name: Output disk space left

				      shell: bash

				      run: df -H

									
										14

.github/actions/filter-test-configs/action.yml
									
										vendored
									
												
				@ -13,6 +13,13 @@ inputs:

				    required: true

				    type: string

				    description: JSON description of what test configs to run.

				  selected-test-configs:

				    required: false

				    type: string

				    description: |

				      A comma-separated list of test configurations from the test matrix to keep.

				      An empty list means every configuration is kept by default.

				    default: ""

				  job-name:

				    type: string

				    required: false

				@ -40,6 +47,9 @@ outputs:

				  ci-no-td:

				    description: True if ci-no-td label was on PR or [ci-no-td] in PR body.

				    value: ${{ steps.filter.outputs.ci-no-td }}

				  ci-td-distributed:

				    description: True if ci-td-distributed label was on PR or [ci-td-distributed] in PR body.

				    value: ${{ steps.filter.outputs.ci-td-distributed }}

				runs:

				  using: composite

				@ -56,7 +66,8 @@ runs:

				        command: |

				          set -eux

				          # PyYAML 6.0 doesn't work with MacOS x86 anymore

				          python3 -m pip install requests==2.26.0 pyyaml==6.0.1

				          # This must run on Python-3.7 (AmazonLinux2) so can't use request=3.32.2

				          python3 -m pip install requests==2.27.1 pyyaml==6.0.1

				    - name: Parse ref

				      id: parse-ref

				@ -123,6 +134,7 @@ runs:

				          --workflow "${GITHUB_WORKFLOW}" \

				          --job-name "${JOB_NAME}" \

				          --test-matrix "${{ inputs.test-matrix }}" \

				          --selected-test-configs "${{ inputs.selected-test-configs }}" \

				          --pr-number "${PR_NUMBER}" \

				          --tag "${TAG}" \

				          --event-name "${EVENT_NAME}" \
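The new `selected-test-configs` input is described as a comma-separated list of configurations to keep, where an empty list keeps everything. The sketch below illustrates only that rule. The function name `keep_config` and the sample configuration names are assumptions, and the filtering script the action calls may implement this differently.

```shell
# Hypothetical helper: succeed if config $1 should be kept given the
# comma-separated selection $2. An empty selection keeps every config.
keep_config() {
  config="$1"
  selected="$2"
  [ -z "$selected" ] && return 0
  # Wrap both sides in commas so only whole names can match.
  case ",$selected," in
    *",$config,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Sample configuration names, chosen only for this illustration.
for cfg in default distributed inductor; do
  if keep_config "$cfg" "default,inductor"; then
    echo "keep $cfg"
  fi
done
```

The sketch does not trim whitespace, so a selection written as `default, inductor` would not match `inductor`.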

									
										207

.github/actions/linux-build/action.yml
									
										vendored
									
										Normal file
									
												
				@ -0,0 +1,207 @@

				name: linux-build

				inputs:

				  build-environment:

				    required: true

				    description: Top-level label for what's being built/tested.

				  docker-image-name:

				    required: true

				    description: Name of the base docker image to build with.

				  build-generates-artifacts:

				    required: false

				    default: "true"

				    description: If set, upload generated build artifacts.

				  build-with-debug:

				    required: false

				    default: "false"

				    description: If set, build in debug mode.

				  sync-tag:

				    required: false

				    default: ""

				    description: |

				      If this is set, our linter will use this to make sure that every other

				      job with the same `sync-tag` is identical.

				  cuda-arch-list:

				    required: false

				    default: "5.2"

				    description: |

				      List of CUDA architectures CI build should target.

				  runner:

				    required: false

				    default: "linux.2xlarge"

				    description: Runner label to select worker type

				  test-matrix:

				    required: false

				    type: string

				    description: |

				      An optional JSON description of what test configs to run later on. This

				      is moved here from the Linux test workflow so that we can apply filter

				      logic using test-config labels earlier and skip unnecessary builds

				  s3-bucket:

				    description: S3 bucket to download artifact

				    required: false

				    default: "gha-artifacts"

				  aws-role-to-assume:

				    description: role to assume for downloading artifacts

				    required: false

				    default: ""

				  GITHUB_TOKEN:

				    description: GitHub token

				    required: true

				  HUGGING_FACE_HUB_TOKEN:

				    description: Hugging Face Hub token

				    required: false

				    default: ""

				outputs:

				  docker-image:

				    value: ${{ steps.calculate-docker-image.outputs.docker-image }}

				    description: The docker image containing the built PyTorch.

				  test-matrix:

				    value: ${{ steps.filter.outputs.test-matrix }}

				    description: An optional JSON description of what test configs to run later on.

				runs:

				  using: composite

				  steps:

				    - name: Setup Linux

				      uses: ./.github/actions/setup-linux

				    - name: configure aws credentials

				      uses: aws-actions/configure-aws-credentials@v3

				      if: ${{ inputs.aws-role-to-assume != '' }}

				      with:

				        role-to-assume: ${{ inputs.aws-role-to-assume }}

				        role-session-name: gha-linux-build

				        role-duration-seconds: 10800

				        aws-region: us-east-1

				    - name: Calculate docker image

				      id: calculate-docker-image

				      uses: pytorch/test-infra/.github/actions/calculate-docker-image@main

				      with:

				        docker-image-name: ${{ inputs.docker-image-name }}

				    - name: Use following to pull public copy of the image

				      id: print-ghcr-mirror

				      env:

				        ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}

				      shell: bash

				      run: |

				        tag=${ECR_DOCKER_IMAGE##*/}

				        echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"

				    - name: Pull docker image

				      uses: pytorch/test-infra/.github/actions/pull-docker-image@main

				      with:

				        docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}

				    - name: Parse ref

				      id: parse-ref

				      shell: bash

				      run: .github/scripts/parse_ref.py

				    - name: Get workflow job id

				      id: get-job-id

				      uses: ./.github/actions/get-workflow-job-id

				      if: always()

				      with:

				        github-token: ${{ inputs.GITHUB_TOKEN }}

				    # Apply the filter logic to the build step too if the test-config label is already there

				    - name: Select all requested test configurations (if the test matrix is available)

				      id: filter

				      uses: ./.github/actions/filter-test-configs

				      with:

				        github-token: ${{ inputs.GITHUB_TOKEN }}

				        test-matrix: ${{ inputs.test-matrix }}

				        job-name: ${{ steps.get-job-id.outputs.job-name }}

				    - name: Download pytest cache

				      uses: ./.github/actions/pytest-cache-download

				      continue-on-error: true

				      with:

				        cache_dir: .pytest_cache

				        job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}

				        s3_bucket: ${{ inputs.s3-bucket }}

				    - name: Build

				      if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == ''

				      id: build

				      env:

				        BUILD_ENVIRONMENT: ${{ inputs.build-environment }}

				        BRANCH: ${{ steps.parse-ref.outputs.branch }}

				        # TODO duplicated

				        AWS_DEFAULT_REGION: us-east-1

				        PR_NUMBER: ${{ github.event.pull_request.number }}

				        SHA1: ${{ github.event.pull_request.head.sha || github.sha }}

				        SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2

				        SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}

				        XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla

				        PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }}

				        TORCH_CUDA_ARCH_LIST: ${{ inputs.cuda-arch-list }}

				        DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}

				        XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}

				        DEBUG: ${{ inputs.build-with-debug == 'true' && '1' || '0' }}

				        OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}

				        HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}

				      shell: bash

				      run: |

				        # detached container should get cleaned up by teardown_ec2_linux

				        container_name=$(docker run \

				          -e BUILD_ENVIRONMENT \

				          -e MAX_JOBS="$(nproc --ignore=2)" \

				          -e AWS_DEFAULT_REGION \

				          -e PR_NUMBER \

				          -e SHA1 \

				          -e BRANCH \

				          -e SCCACHE_BUCKET \

				          -e SCCACHE_S3_KEY_PREFIX \

				          -e XLA_CUDA \

				          -e XLA_CLANG_CACHE_S3_BUCKET_NAME \

				          -e SKIP_SCCACHE_INITIALIZATION=1 \

				          -e TORCH_CUDA_ARCH_LIST \

				          -e PR_LABELS \

				          -e OUR_GITHUB_JOB_ID \

				          -e HUGGING_FACE_HUB_TOKEN \

				          --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \

				          --security-opt seccomp=unconfined \

				          --cap-add=SYS_PTRACE \

				          --tty \

				          --detach \

				          --user jenkins \

				          -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \

				          -w /var/lib/jenkins/workspace \

				          "${DOCKER_IMAGE}"

				        )

				        docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'

				    - name: Archive artifacts into zip

				      if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'

				      shell: bash

				      run: |

				        zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .additional_ci_files

				    - name: Store PyTorch Build Artifacts on S3

				      uses: seemethere/upload-artifact-s3@v5

				      if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'

				      with:

				        name: ${{ inputs.build-environment }}

				        retention-days: 14

				        if-no-files-found: error

				        path: artifacts.zip

				        s3-bucket: ${{ inputs.s3-bucket }}

				    - name: Upload sccache stats

				      if: steps.build.outcome != 'skipped'

				      uses: seemethere/upload-artifact-s3@v5

				      with:

				        s3-prefix: |

				          ${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact

				        retention-days: 365

				        if-no-files-found: warn

				        path: sccache-stats-*.json

				        s3-bucket: ${{ inputs.s3-bucket }}

				    - name: Teardown Linux

				      uses: pytorch/test-infra/.github/actions/teardown-linux@main

				      if: always()
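The `print-ghcr-mirror` step above derives the public mirror reference with two Bash parameter expansions. A minimal sketch of the same transformation, runnable outside CI, may make the expansions easier to follow. The function name `ghcr_mirror_ref` and the sample registry host and image name are hypothetical and are not part of the action.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the two expansions in the print-ghcr-mirror step.
ghcr_mirror_ref() {
  local ecr_image=$1
  # Strip everything up to and including the last '/', leaving "name:tag".
  local tag=${ecr_image##*/}
  # Replace the first ':' with '-' so "name:tag" becomes a single tag string.
  echo "ghcr.io/pytorch/ci-image:${tag/:/-}"
}

# The registry host and image name below are made-up sample values.
ghcr_mirror_ref "registry.example.com/pytorch/example-image:abc123"
```

Note that `${tag/:/-}` is a Bash extension, which is consistent with the step declaring `shell: bash`.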

									
										384

.github/actions/linux-test/action.yml
									
												
				@ -0,0 +1,384 @@

				name: linux-test

				inputs:

				  build-environment:

				    required: true

				    type: string

				    description: Top-level label for what's being built/tested.

				  test-matrix:

				    required: true

				    type: string

				    description: JSON description of what test configs to run.

				  docker-image:

				    required: true

				    type: string

				    description: Docker image to run in.

				  sync-tag:

				    required: false

				    type: string

				    default: ""

				    description: |

				      If this is set, our linter will use this to make sure that every other

				      job with the same `sync-tag` is identical.

				  use-gha:

				    required: false

				    type: string

				    default: ""

				    description: If set to any value, upload to GHA. Otherwise upload to S3.

				  dashboard-tag:

				    required: false

				    type: string

				    default: ""

				  s3-bucket:

				    description: S3 bucket to download artifact

				    required: false

				    type: string

				    default: "gha-artifacts"

				  aws-role-to-assume:

				    description: role to assume for downloading artifacts

				    required: false

				    type: string

				    default: ""

				  HUGGING_FACE_HUB_TOKEN:

				    description: |

				      HF Auth token to avoid rate limits when downloading models or datasets from hub

				    required: false

				    default: ""

				  GITHUB_TOKEN:

				    description: GitHub token

				    required: true

				#env:

				#  GIT_DEFAULT_BRANCH: ${{ inputs.default_branch }}

				runs:

				  using: composite

				  steps:

				    - name: Setup Linux

				      uses: ./.github/actions/setup-linux

				    - name: configure aws credentials

				      if: ${{ inputs.aws-role-to-assume != '' }}

				      uses: aws-actions/configure-aws-credentials@v3

				      with:

				        role-to-assume: ${{ inputs.aws-role-to-assume }}

				        role-session-name: gha-linux-test

				        aws-region: us-east-1

				    - name: Calculate docker image

				      id: calculate-docker-image

				      uses: pytorch/test-infra/.github/actions/calculate-docker-image@main

				      with:

				        docker-image-name: ${{ inputs.docker-image }}

				    - name: Use following to pull public copy of the image

				      id: print-ghcr-mirror

				      env:

				        ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}

				      shell: bash

				      run: |

				        tag=${ECR_DOCKER_IMAGE##*/}

				        echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"

				    - name: Pull docker image

				      uses: pytorch/test-infra/.github/actions/pull-docker-image@main

				      with:

				        docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}

				    - name: Check if in an ARC runner

				      shell: bash

				      id: check_arc_runner

				      run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"

				    - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG

				      id: install-nvidia-driver

				      uses: pytorch/test-infra/.github/actions/setup-nvidia@main

				      if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}

				    - name: Lock NVIDIA A100 40GB Frequency

				      shell: bash

				      run: |

				        sudo nvidia-smi -pm 1

				        sudo nvidia-smi -ac 1215,1410

				        nvidia-smi

				      if: contains(matrix.runner, 'a100')

				    - name: Start monitoring script

				      id: monitor-script

				      shell: bash

				      continue-on-error: true

				      run: |

				        python3 -m pip install psutil==5.9.1 nvidia-ml-py==11.525.84

				        python3 -m tools.stats.monitor > usage_log.txt 2>&1 &

				        echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"

				    - name: Download build artifacts

				      uses: ./.github/actions/download-build-artifacts

				      with:

				        name: ${{ inputs.build-environment }}

				        s3-bucket: ${{ inputs.s3-bucket }}

				    - name: Download TD artifacts

				      continue-on-error: true

				      uses: ./.github/actions/download-td-artifacts

				    - name: Parse ref

				      id: parse-ref

				      shell: bash

				      run: .github/scripts/parse_ref.py

				    - name: Get workflow job id

				      id: get-job-id

				      uses: ./.github/actions/get-workflow-job-id

				      if: always()

				      with:

				        github-token: ${{ inputs.GITHUB_TOKEN }}

				    - name: Check for keep-going label and re-enabled test issues

				      # This uses the filter-test-configs action because it conveniently

				      # checks for labels and re-enabled test issues.  It does not actually do

				      # any filtering.  All filtering is done in the build step.

				      id: keep-going

				      uses: ./.github/actions/filter-test-configs

				      with:

				        github-token: ${{ inputs.GITHUB_TOKEN }}

				        test-matrix: ${{ inputs.test-matrix }}

				        job-name: ${{ steps.get-job-id.outputs.job-name }}

				    - name: Test

				      id: test

				      env:

				        BUILD_ENVIRONMENT: ${{ inputs.build-environment }}

				        PR_NUMBER: ${{ github.event.pull_request.number }}

				        GITHUB_REPOSITORY: ${{ github.repository }}

				        GITHUB_WORKFLOW: ${{ github.workflow }}

				        GITHUB_JOB: ${{ github.job }}

				        GITHUB_RUN_ID: ${{ github.run_id }}

				        GITHUB_RUN_NUMBER: ${{ github.run_number }}

				        GITHUB_RUN_ATTEMPT: ${{ github.run_attempt }}

				        JOB_ID: ${{ steps.get-job-id.outputs.job-id }}

				        JOB_NAME: ${{ steps.get-job-id.outputs.job-name }}

				        BRANCH: ${{ steps.parse-ref.outputs.branch }}

				        SHA1: ${{ github.event.pull_request.head.sha || github.sha }}

				        BASE_SHA: ${{ github.event.pull_request.base.sha || github.sha }}

				        TEST_CONFIG: ${{ matrix.config }}

				        SHARD_NUMBER: ${{ matrix.shard }}

				        NUM_TEST_SHARDS: ${{ matrix.num_shards }}

				        REENABLED_ISSUES: ${{ steps.keep-going.outputs.reenabled-issues }}

				        CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}

				        VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}

				        NO_TEST_TIMEOUT: ${{ steps.keep-going.outputs.ci-no-test-timeout }}

				        NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}

				        TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}

				        SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2

				        SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}

				        SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}

				        DOCKER_IMAGE: ${{ inputs.docker-image }}

				        XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}

				        XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla

				        PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: ${{ matrix.mem_leak_check && '1' || '0' }}

				        PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }}

				        DASHBOARD_TAG: ${{ inputs.dashboard-tag }}

				        HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}

				      shell: bash

				      run: |

				        set -x

				        if [[ $TEST_CONFIG == 'multigpu' ]]; then

				          TEST_COMMAND=.ci/pytorch/multigpu-test.sh

				        elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then

				          TEST_COMMAND=.ci/onnx/test.sh

				        else

				          TEST_COMMAND=.ci/pytorch/test.sh

				        fi

				        # detached container should get cleaned up by teardown_ec2_linux

				        # TODO: Stop building test binaries as part of the build phase

				        # Used for GPU_FLAG since that doesn't play nice

				        # shellcheck disable=SC2086,SC2090

				        container_name=$(docker run \

				          ${GPU_FLAG:-} \

				          -e BUILD_ENVIRONMENT \

				          -e PR_NUMBER \

				          -e GITHUB_ACTIONS \

				          -e GITHUB_REPOSITORY \

				          -e GITHUB_WORKFLOW \

				          -e GITHUB_JOB \

				          -e GITHUB_RUN_ID \

				          -e GITHUB_RUN_NUMBER \

				          -e GITHUB_RUN_ATTEMPT \

				          -e JOB_ID \

				          -e JOB_NAME \

				          -e BASE_SHA \

				          -e BRANCH \

				          -e SHA1 \

				          -e AWS_DEFAULT_REGION \

				          -e IN_WHEEL_TEST \

				          -e SHARD_NUMBER \

				          -e TEST_CONFIG \

				          -e NUM_TEST_SHARDS \

				          -e REENABLED_ISSUES \

				          -e CONTINUE_THROUGH_ERROR \

				          -e VERBOSE_TEST_LOGS \

				          -e NO_TEST_TIMEOUT \

				          -e NO_TD \

				          -e TD_DISTRIBUTED \

				          -e PR_LABELS \

				          -e MAX_JOBS="$(nproc --ignore=2)" \

				          -e SCCACHE_BUCKET \

				          -e SCCACHE_S3_KEY_PREFIX \

				          -e XLA_CUDA \

				          -e XLA_CLANG_CACHE_S3_BUCKET_NAME \

				          -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK \

				          -e PYTORCH_TEST_RERUN_DISABLED_TESTS \

				          -e SKIP_SCCACHE_INITIALIZATION=1 \

				          -e HUGGING_FACE_HUB_TOKEN \

				          -e DASHBOARD_TAG \

				          --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \

				          --security-opt seccomp=unconfined \

				          --cap-add=SYS_PTRACE \

				          --ipc=host \

				          --shm-size="${SHM_SIZE}" \

				          --tty \

				          --detach \

				          --name="${container_name}" \

				          --user jenkins \

				          -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \

				          -w /var/lib/jenkins/workspace \

				          "${DOCKER_IMAGE}"

				        )

				        # Propagate download.pytorch.org IP to container

				        grep download.pytorch.org /etc/hosts | docker exec -i "${container_name}" sudo bash -c "/bin/cat >> /etc/hosts"

				        echo "DOCKER_CONTAINER_ID=${container_name}" >> "${GITHUB_ENV}"

				        docker exec -t "${container_name}" sh -c "pip install $(echo dist/*.whl)[opt-einsum] && ${TEST_COMMAND}"

				    - name: Upload pytest cache if tests failed

				      uses: ./.github/actions/pytest-cache-upload

				      continue-on-error: true

				      if: failure() && steps.test.conclusion && steps.test.conclusion == 'failure'

				      with:

				        cache_dir: .pytest_cache

				        shard: ${{ matrix.shard }}

				        sha: ${{ github.event.pull_request.head.sha || github.sha }}

				        test_config: ${{ matrix.config }}

				        job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}

				    - name: Print remaining test logs

				      shell: bash

				      if: always() && steps.test.conclusion

				      run: |

				        cat test/**/*_toprint.log || true

				    - name: Stop monitoring script

				      if: always() && steps.monitor-script.outputs.monitor-script-pid

				      shell: bash

				      continue-on-error: true

				      env:

				        MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }}

				      run: |

				        kill "$MONITOR_SCRIPT_PID"

				    - name: Upload test artifacts

				      uses: ./.github/actions/upload-test-artifacts

				      if: always() && steps.test.conclusion && steps.test.conclusion != 'skipped'

				      with:

				        file-suffix: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }}

				        use-gha: ${{ inputs.use-gha }}

				        s3-bucket: ${{ inputs.s3-bucket }}

				    - name: Collect backtraces from coredumps (if any)

				      if: always()

				      shell: bash

				      run: |

				        # shellcheck disable=SC2156

				        find . -iname "core.[1-9]*" -exec docker exec "${DOCKER_CONTAINER_ID}" sh -c "gdb python {} -ex 'bt' -ex 'q'" \;

				    - name: Store Core dumps on S3

				      uses: seemethere/upload-artifact-s3@v5

				      if: failure()

				      with:

				        name: coredumps-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}

				        retention-days: 14

				        if-no-files-found: ignore

				        path: ./**/core.[1-9]*

				    - name: Teardown Linux

				      uses: pytorch/test-infra/.github/actions/teardown-linux@main

				      if: always()

				    # NB: We are currently having an intermittent GPU-related issue on G5 runners with

				    # A10G GPU. Once this happens, trying to reset the GPU as done in setup-nvidia does

				    # not seem to help. Here are some symptoms:

				    #   * Calling nvidia-smi times out after 60 seconds

				    #   * nvidia-smi fails with an "unable to determine the device handle for GPU

				    #     unknown error" message

				    #   * Tests fail with a missing CUDA GPU error when initializing CUDA in PyTorch

				    #   * Running docker with --gpus all fails with an error response from the daemon

				    #

				    # As both the root cause and recovery path are unclear, let's take the runner out of

				    # service so that it doesn't get any more jobs

				    - name: Check NVIDIA driver installation step

				      if: failure() && steps.install-nvidia-driver.outcome && steps.install-nvidia-driver.outcome != 'skipped'

				      shell: bash

				      env:

				        RUNNER_WORKSPACE: ${{ runner.workspace }}

				      run: |

				        set +e

				        set -x

				        nvidia-smi

				        # NB: Surprisingly, nvidia-smi command returns successfully with return code 0 even in

				        # the case where the driver has already crashed as it still can get the driver version

				        # and some basic information like the bus ID.  However, the rest of the information

				        # would be missing (ERR!), for example:

				        #

				        # +-----------------------------------------------------------------------------+

				        # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |

				        # |-------------------------------+----------------------+----------------------+

				        # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |

				        # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |

				        # |                               |                      |               MIG M. |

				        # |===============================+======================+======================|

				        # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |

				        # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |

				        # |                               |                      |                 ERR! |

				        # +-------------------------------+----------------------+----------------------+

				        #

				        # +-----------------------------------------------------------------------------+

				        # | Processes:                                                                  |

				        # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |

				        # |        ID   ID                                                   Usage      |

				        # |=============================================================================|

				        # +-----------------------------------------------------------------------------+

				        #

				        # This should be reported as a failure instead, as it is guaranteed to fail when

				        # Docker tries to run with --gpus all

				        #

				        # So, the correct check here is to query one of the missing pieces of info, like

				        # the GPU name, so that the command can fail accordingly

				        nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0

				        NVIDIA_SMI_STATUS=$?

				        # These are acceptable return codes from nvidia-smi as copied from setup-nvidia GitHub action

				        if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then

				          echo "NVIDIA driver installation has failed, shutting down the runner..."

				          .github/scripts/stop_runner_service.sh

				        fi

				        # For runners with multiple GPUs, we also want to confirm that the number of GPUs is a

				        # power of 2, i.e. 1, 2, 4, or 8. This is to avoid flaky test issues when one GPU fails

				        # https://github.com/pytorch/test-infra/issues/4000

				        GPU_COUNT=$(nvidia-smi --list-gpus | wc -l)

				        NVIDIA_SMI_STATUS=$?

				        # These are acceptable return codes from nvidia-smi as copied from setup-nvidia GitHub action

				        if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then

				          echo "NVIDIA driver installation has failed, shutting down the runner..."

				          .github/scripts/stop_runner_service.sh

				        fi

				        # Check the GPU count to be a power of 2

				        if [ "$GPU_COUNT" -le 8 ] && [ "$GPU_COUNT" -ne 1 ] && [ "$GPU_COUNT" -ne 2 ] && [ "$GPU_COUNT" -ne 4 ] && [ "$GPU_COUNT" -ne 8 ]; then

				          echo "NVIDIA driver detects $GPU_COUNT GPUs. The runner has a broken GPU, shutting it down..."

				          .github/scripts/stop_runner_service.sh

				        fi
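The final check in the step above flags any GPU count of 8 or fewer that is not 1, 2, 4, or 8. A minimal sketch of that predicate as a standalone function lets it be exercised without `nvidia-smi`. The name `gpu_count_is_suspicious` is hypothetical. As in the original condition, a count of 0 is also flagged, while counts above 8 are never flagged.

```shell
#!/usr/bin/env bash
# Hypothetical helper reproducing the GPU count condition from the step above.
# Returns 0 (true) when the count would trigger a runner shutdown.
gpu_count_is_suspicious() {
  local count=$1
  [ "$count" -le 8 ] && [ "$count" -ne 1 ] && [ "$count" -ne 2 ] \
    && [ "$count" -ne 4 ] && [ "$count" -ne 8 ]
}

# Walk through a few sample counts and report how each is classified.
for n in 1 2 3 4 7 8 16; do
  if gpu_count_is_suspicious "$n"; then
    echo "$n GPUs: suspicious"
  else
    echo "$n GPUs: ok"
  fi
done
```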

									
										6

.github/actions/pytest-cache-download/action.yml
									
												
				@ -9,6 +9,10 @@ inputs:

				  job_identifier:

				    description: Text that uniquely identifies a given job type within a workflow. All shards of a job should share the same job identifier.

				    required: true

				  s3_bucket:

				    description: S3 bucket to download PyTest cache

				    required: false

				    default: "gha-artifacts"

				runs:

				  using: composite

				@ -30,6 +34,7 @@ runs:

				        CACHE_DIR: ${{ inputs.cache_dir }}

				        JOB_IDENTIFIER: ${{ inputs.job_identifier }}

				        REPO: ${{ github.repository }}

				        BUCKET: ${{ inputs.s3_bucket }}

				      run: |

				        python3 .github/scripts/pytest_cache.py \

				          --download \

				@ -38,3 +43,4 @@ runs:

				          --job_identifier $JOB_IDENTIFIER \

				          --temp_dir $RUNNER_TEMP \

				          --repo $REPO \

				          --bucket $BUCKET \

									
										40

.github/actions/setup-linux/action.yml
									
												
				@ -15,10 +15,12 @@ runs:

				          category=$1

				          # If it is GCP runner (runner name contains gcp), do not run this

				          runner_name_str=${{ runner.name }}

				          if [[ $runner_name_str != *"gcp"* ]]; then

				            curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"

				          else

				          if [[ -f /.inarc ]]; then

				            echo "ARC Runner, no info on ec2 metadata"

				          elif [[ $runner_name_str == *"gcp"* ]]; then

				            echo "Runner is from Google Cloud Platform, No info on ec2 metadata"

				          else

				            curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"

				          fi

				        }

				        echo "ami-id: $(get_ec2_metadata ami-id)"

				@ -26,8 +28,14 @@ runs:

				        echo "instance-type: $(get_ec2_metadata instance-type)"

				        echo "system info $(uname -a)"

				    - name: Check if in an ARC runner

				      shell: bash

				      id: check_arc_runner

				      run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)"  >> $GITHUB_OUTPUT

				    - name: Start docker if docker daemon is not running

				      shell: bash

				      if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}

				      run: |

				        if systemctl is-active --quiet docker; then

				            echo "Docker daemon is running...";

				@ -58,6 +66,7 @@ runs:

				        env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"

				    - name: Kill any existing containers, clean up images

				      if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}

				      shell: bash

				      run: |

				        # ignore expansion of "docker ps -q" since it could be empty

				@ -96,3 +105,28 @@ runs:

				        echo "${RESOLVED_IP} ${PT_DOMAIN}" | sudo tee -a /etc/hosts

				        cat /etc/hosts

				    - name: Check that the docker daemon is running

				      shell: bash

				      continue-on-error: true

				      if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'true' }}

				      run: |

				        set +x

				        max_attempts=30

				        delay=10

				        attempt=1

				        for attempt in $(seq 1 $max_attempts); do

				          echo "Attempt $attempt of $max_attempts: Checking if Docker daemon is running..."

				          if docker info > /dev/null 2>&1; then

				            echo "Docker is running. Proceeding with the next steps"

				            exit 0

				          else

				            echo "Docker is not running yet."

				            echo "Retrying in $delay seconds..."

				            sleep $delay

				          fi

				        done

				        echo "Reached maximum attempts to connect to Docker. Exiting."

				        exit 1
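The loop above is a fixed-delay retry: it probes `docker info` up to `max_attempts` times and sleeps `delay` seconds after each failure. The same pattern can be sketched as a reusable function. The name `retry_until_ok` and its argument order are assumptions made for illustration and are not defined by the action. Unlike the original loop, this sketch skips the sleep after the last failed attempt.

```shell
#!/usr/bin/env bash
# Hypothetical generic version of the fixed-delay retry loop.
# Usage: retry_until_ok MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until_ok() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt
  for attempt in $(seq 1 "$max_attempts"); do
    # Run the probe command quietly; only its exit status matters.
    if "$@" > /dev/null 2>&1; then
      echo "Succeeded on attempt $attempt of $max_attempts"
      return 0
    fi
    # Skip the sleep after the final failed attempt (the original does not).
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "Reached maximum attempts ($max_attempts)"
  return 1
}

# `true` succeeds immediately; `false` never succeeds.
retry_until_ok 3 0 true
retry_until_ok 2 0 false || echo "gave up"
```

Under these assumptions, the step's behavior would correspond roughly to `retry_until_ok 30 10 docker info`.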

									
										11

.github/actions/test-pytorch-binary/action.yml
									
												
				@ -35,7 +35,7 @@ runs:

				          "${DOCKER_IMAGE}"

				        )

				        if [[ "${GPU_ARCH_TYPE}" != "rocm" && "${BUILD_ENVIRONMENT}" != "linux-aarch64-binary-manywheel" ]]; then

				        if [[ "${GPU_ARCH_TYPE}" != "rocm" && "${BUILD_ENVIRONMENT}" != "linux-aarch64-binary-manywheel" && "${BUILD_ENVIRONMENT}" != "linux-s390x-binary-manywheel" ]]; then

				          # Propagate download.pytorch.org IP to container. This is only needed on Linux non aarch64 runner

				          grep download.pytorch.org /etc/hosts | docker exec -i "${container_name}" bash -c "/bin/cat >> /etc/hosts"

				        fi

				@ -44,3 +44,12 @@ runs:

				        # Generate test script

				        docker exec -t -w "${PYTORCH_ROOT}" -e OUTPUT_SCRIPT="/run.sh" "${container_name}" bash -c "bash .circleci/scripts/binary_linux_test.sh"

				        docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh"

				    - name: Cleanup docker

				      if: always() && env.BUILD_ENVIRONMENT == 'linux-s390x-binary-manywheel'

				      shell: bash

				      run: |

				        # on s390x stop the container for clean worker stop

				        # ignore expansion of "docker ps -q" since it could be empty

				        # shellcheck disable=SC2046

				        docker stop $(docker ps -q) || true

									
										33

.github/actions/upload-test-artifacts/action.yml
									
												
				@ -11,6 +11,10 @@ inputs:

				      Suffix to add to the filename of the artifacts. This should include the

				      workflow job id, see [Job id in artifacts].

				    required: true

				  s3-bucket:

				    description: S3 bucket to download builds

				    required: false

				    default: "gha-artifacts"

				runs:

				  using: composite

				@ -42,7 +46,7 @@ runs:

				      env:

				        FILE_SUFFIX: ${{ inputs.file-suffix }}

				      run: |

				        # Remove any previous test reports if they exist

				        # Remove any previous usage logs if they exist

				        rm -f logs-*.zip

				        # this workflow is also run in bazel build test, but we don't generate usage reports for it

				        # so check to see if the file exists first

				@ -53,6 +57,18 @@ runs:

				            zip -r "logs-${FILE_SUFFIX}.zip" test -i '*.log'

				        fi

				    - name: Zip debugging artifacts for upload

				      if: runner.os != 'Windows' && !inputs.use-gha

				      shell: bash

				      env:

				        FILE_SUFFIX: ${{ inputs.file-suffix }}

				      run: |

				        # Remove any previous debugging artifacts if they exist

				        rm -f debug-*.zip

				        if [ -d 'test/debug' ]; then

				          zip -r "debug-${FILE_SUFFIX}.zip" test/debug

				        fi

				    # Windows zip

				    - name: Zip JSONs for upload

				      if: runner.os == 'Windows' && !inputs.use-gha

				@ -87,6 +103,7 @@ runs:

				      uses: seemethere/upload-artifact-s3@v5

				      if: ${{ !inputs.use-gha }}

				      with:

				        s3-bucket: ${{ inputs.s3-bucket }}

				        s3-prefix: |

				          ${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact

				        retention-days: 14

				@ -97,6 +114,7 @@ runs:

				      uses: seemethere/upload-artifact-s3@v5

				      if: ${{ !inputs.use-gha }}

				      with:

				        s3-bucket: ${{ inputs.s3-bucket }}

				        s3-prefix: |

				          ${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact

				        retention-days: 14

				@ -108,12 +126,25 @@ runs:

				      if: ${{ !inputs.use-gha }}

				      continue-on-error: true

				      with:

				        s3-bucket: ${{ inputs.s3-bucket }}

				        s3-prefix: |

				          ${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact

				        retention-days: 14

				        if-no-files-found: ignore

				        path: logs-*.zip

				    - name: Store Debug Artifacts on S3

				      uses: seemethere/upload-artifact-s3@v5

				      if: ${{ !inputs.use-gha }}

				      continue-on-error: true

				      with:

				        s3-bucket: ${{ inputs.s3-bucket }}

				        s3-prefix: |

				          ${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact

				        retention-days: 14

				        if-no-files-found: ignore

				        path: debug-*.zip

				    # GHA upload

				    - name: Store Test Downloaded JSONs on Github

				      uses: actions/upload-artifact@v3

2

.github/ci_commit_pins/audio.txt vendored


 @ -1 +1 @@
 aeb554d3e2f7855b7abe5120c282f59648ed7a
 b829e936f7cc61b48149f5f957a451a38bf2a178

2

.github/ci_commit_pins/vision.txt vendored


 @ -1 +1 @@
 c127da8b5e2e8f44b50994c6cb931bcca267cfe
 d23a6e1664d20707c11781299611436e1f0c104f

2

.github/ci_commit_pins/xla.txt vendored


 @ -1 +1 @@
 a632930bfde19ffb361cdf5c31a7682af4e67
 f0b61e5d782913a0fc7743812f2a8e522189111

									
13

.github/label_to_label.yml vendored Normal file
												
				@ -0,0 +1,13 @@

				# Use this to auto apply labels based on other labels.  Applies to both PRs and

				# issues. Currently only supports any and all

				- any:

				  - "module: custom operators"

				  - "module: aotdispatch"

				  then:

				  - "module: pt2-dispatcher"

				- any:

				  - "module: dynamo"

				  - "module: pt2-dispatcher"

				  - "module: inductor"

				  then:

				  - "oncall: pt2"
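The `any` / `then` rules above can be read as: if an issue or PR carries any of the listed labels, also apply the labels under `then`. The following is a minimal sketch of that reading, not the bot's actual implementation. The helper name `apply_label_rules` is hypothetical, and applying rules repeatedly until nothing changes (so that `module: aotdispatch` eventually yields `oncall: pt2`) is an assumption.

```python
from typing import Dict, List, Set

# Rules copied from the YAML above; each rule has "any" (or "all") plus "then".
RULES: List[Dict[str, List[str]]] = [
    {
        "any": ["module: custom operators", "module: aotdispatch"],
        "then": ["module: pt2-dispatcher"],
    },
    {
        "any": ["module: dynamo", "module: pt2-dispatcher", "module: inductor"],
        "then": ["oncall: pt2"],
    },
]


def apply_label_rules(labels: Set[str], rules: List[Dict[str, List[str]]]) -> Set[str]:
    """Hypothetical helper: apply rules until no new label is added (assumed semantics)."""
    result = set(labels)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if "any" in rule:
                matched = any(label in result for label in rule["any"])
            elif "all" in rule:
                matched = all(label in result for label in rule["all"])
            else:
                matched = False
            new_labels = set(rule["then"]) - result
            if matched and new_labels:
                result |= new_labels
                changed = True
    return result


expanded = apply_label_rules({"module: aotdispatch"}, RULES)
print(sorted(expanded))
```

Under these assumptions a single `module: aotdispatch` label picks up `module: pt2-dispatcher` from the first rule and then `oncall: pt2` from the second.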

									
14

.github/labeler.yml vendored
												
				@ -35,6 +35,9 @@

				- test/distributed/tensor/parallel/test_fsdp_2d_parallel.py

				- torch/distributed/_tensor/**

				- torch/distributed/fsdp/**

				- torch/csrc/inductor/**

				- test/cpp/aoti_abi_check/**

				- test/cpp/aoti_inference/**

				"module: cpu":

				- aten/src/ATen/cpu/**

				@ -55,6 +58,17 @@

				- third_party/mkl-dnn.BUILD

				- torch/csrc/jit/codegen/onednn/**

				- test/test_jit_llga_fuser.py

				- test/test_mkldnn.py

				"ciflow/linux-aarch64":

				- third_party/ideep

				- caffe2/ideep/**

				- caffe2/python/ideep/**

				- cmake/Modules/FindMKLDNN.cmake

				- third_party/mkl-dnn.BUILD

				- torch/csrc/jit/codegen/onednn/**

				- test/test_jit_llga_fuser.py

				- test/test_mkldnn.py

				"module: amp (automated mixed precision)":

				- torch/amp/**

									
154

.github/lf-canary-scale-config.yml vendored Normal file
												
				@ -0,0 +1,154 @@

				# Defines runner types that will be provisioned by LF Self-hosted

				# runners for pytorch/pytorch-canary and their labels.

				#

				# Runners listed here will be available as self hosted runners.

				# Configuration is directly pulled from the main branch.

				#

				# Default values:

				#

				# runner_types:

				#   runner_label: # label to specify in the Github Actions workflow

				#     instance_type: m4.large

				#     os: linux

				#     max_available: 20

				#     disk_size: 50

				#     is_ephemeral: true

				runner_types:

				  lf.c.linux.12xlarge:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				  lf.c.linux.24xl.spr-metal:

				    disk_size: 200

				    instance_type: c7i.metal-24xl

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.c.linux.16xlarge.spr:

				    disk_size: 200

				    instance_type: c7i.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.c.linux.12xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: true

				    max_available: 300

				    os: linux

				  lf.c.linux.16xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.c.linux.24xlarge:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: linux

				  lf.c.linux.2xlarge:

				    disk_size: 150

				    instance_type: c5.2xlarge

				    is_ephemeral: false

				    max_available: 3120

				    os: linux

				  lf.c.linux.4xlarge:

				    disk_size: 150

				    instance_type: c5.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				  lf.c.linux.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.4xlarge

				    is_ephemeral: false

				    max_available: 520

				    os: linux

				  lf.c.linux.8xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.8xlarge

				    is_ephemeral: false

				    max_available: 400

				    os: linux

				  lf.c.linux.g4dn.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.12xlarge

				    is_ephemeral: false

				    max_available: 50

				    os: linux

				  lf.c.linux.g4dn.metal.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.metal

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.c.linux.g5.48xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.48xlarge

				    is_ephemeral: false

				    max_available: 20

				    os: linux

				  lf.c.linux.g5.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.12xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				  lf.c.linux.g5.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 1200

				    os: linux

				  lf.c.linux.large:

				    disk_size: 15

				    instance_type: c5.large

				    is_ephemeral: false

				    os: linux

				  lf.c.linux.arm64.2xlarge:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				  lf.c.linux.arm64.m7g.2xlarge:

				    disk_size: 256

				    instance_type: m7g.2xlarge

				    is_ephemeral: false

				    max_available: 20

				    os: linux

				  lf.c.windows.4xlarge:

				    disk_size: 256

				    instance_type: c5d.4xlarge

				    is_ephemeral: true

				    max_available: 420

				    os: windows

				  lf.c.windows.4xlarge.nonephemeral:

				    disk_size: 256

				    instance_type: c5d.4xlarge

				    is_ephemeral: false

				    max_available: 420

				    os: windows

				  lf.c.windows.8xlarge.nvidia.gpu:

				    disk_size: 256

				    instance_type: p3.2xlarge

				    is_ephemeral: true

				    max_available: 150

				    os: windows

				  lf.c.windows.8xlarge.nvidia.gpu.nonephemeral:

				    disk_size: 256

				    instance_type: p3.2xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: windows

				  lf.c.windows.g5.4xlarge.nvidia.gpu:

				    disk_size: 256

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: windows

									
154

.github/lf-scale-config.yml vendored Normal file
												
				@ -0,0 +1,154 @@

				# Defines runner types that will be provisioned by LF Self-hosted

				# runners for pytorch/pytorch and their labels.

				#

				# Runners listed here will be available as self hosted runners.

				# Configuration is directly pulled from the main branch.

				#

				# Default values:

				#

				# runner_types:

				#   runner_label: # label to specify in the Github Actions workflow

				#     instance_type: m4.large

				#     os: linux

				#     max_available: 20

				#     disk_size: 50

				#     is_ephemeral: true

				runner_types:

				  lf.linux.12xlarge:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				  lf.linux.24xl.spr-metal:

				    disk_size: 200

				    instance_type: c7i.metal-24xl

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.linux.16xlarge.spr:

				    disk_size: 200

				    instance_type: c7i.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.linux.12xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: true

				    max_available: 300

				    os: linux

				  lf.linux.16xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.linux.24xlarge:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: linux

				  lf.linux.2xlarge:

				    disk_size: 150

				    instance_type: c5.2xlarge

				    is_ephemeral: false

				    max_available: 3120

				    os: linux

				  lf.linux.4xlarge:

				    disk_size: 150

				    instance_type: c5.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				  lf.linux.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.4xlarge

				    is_ephemeral: false

				    max_available: 520

				    os: linux

				  lf.linux.8xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.8xlarge

				    is_ephemeral: false

				    max_available: 400

				    os: linux

				  lf.linux.g4dn.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.12xlarge

				    is_ephemeral: false

				    max_available: 50

				    os: linux

				  lf.linux.g4dn.metal.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.metal

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.linux.g5.48xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.48xlarge

				    is_ephemeral: false

				    max_available: 20

				    os: linux

				  lf.linux.g5.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.12xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				  lf.linux.g5.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 1200

				    os: linux

				  lf.linux.large:

				    disk_size: 15

				    instance_type: c5.large

				    is_ephemeral: false

				    os: linux

				  lf.linux.arm64.2xlarge:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				  lf.linux.arm64.m7g.2xlarge:

				    disk_size: 256

				    instance_type: m7g.2xlarge

				    is_ephemeral: false

				    max_available: 20

				    os: linux

				  lf.windows.4xlarge:

				    disk_size: 256

				    instance_type: c5d.4xlarge

				    is_ephemeral: true

				    max_available: 420

				    os: windows

				  lf.windows.4xlarge.nonephemeral:

				    disk_size: 256

				    instance_type: c5d.4xlarge

				    is_ephemeral: false

				    max_available: 420

				    os: windows

				  lf.windows.8xlarge.nvidia.gpu:

				    disk_size: 256

				    instance_type: p3.2xlarge

				    is_ephemeral: true

				    max_available: 150

				    os: windows

				  lf.windows.8xlarge.nvidia.gpu.nonephemeral:

				    disk_size: 256

				    instance_type: p3.2xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: windows

				  lf.windows.g5.4xlarge.nvidia.gpu:

				    disk_size: 256

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: windows
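The header comment of this config lists default values, and one entry (`lf.linux.large`) omits `max_available`. A minimal sketch of how such defaults could be layered under per-runner overrides follows. The helper name `resolve_runner` is hypothetical, and the assumption that the provisioner fills missing keys from the commented defaults is not confirmed by the diff itself.

```python
from typing import Any, Dict

# Default values as listed in the config's header comment.
DEFAULTS: Dict[str, Any] = {
    "instance_type": "m4.large",
    "os": "linux",
    "max_available": 20,
    "disk_size": 50,
    "is_ephemeral": True,
}


def resolve_runner(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: overlay one runner entry on top of the defaults."""
    return {**DEFAULTS, **overrides}


# The lf.linux.large entry from the config above (no max_available key).
lf_linux_large = resolve_runner(
    {
        "disk_size": 15,
        "instance_type": "c5.large",
        "is_ephemeral": False,
        "os": "linux",
    }
)
print(lf_linux_large)
```

With this reading, `lf.linux.large` would end up with `max_available` of 20 while keeping its own `disk_size` of 15.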

									
23

.github/merge_rules.yaml vendored
												
				@ -28,12 +28,13 @@

				  - caffe2/python/onnx/**

				  approved_by:

				  - BowenBao

				  - abock

				  - justinchuby

				  - liqunfu

				  - shubhambhokare1

				  - thiagocrepaldi

				  - titaiwangms

				  - wschin

				  - xadupre

				  mandatory_checks_name:

				  - EasyCLA

				  - Lint

				@ -236,6 +237,24 @@

				  - Lint

				  - pull

				- name: XPU ATen

				  patterns:

				  - aten/src/ATen/xpu/**

				  - c10/xpu/**

				  - torch/csrc/xpu/**

				  - torch/xpu/**

				  - test/xpu/**

				  - third_party/xpu.txt

				  - .ci/docker/ci_commit_pins/triton-xpu.txt

				  approved_by:

				  - EikanWang

				  - jgong5

				  - gujinghui

				  mandatory_checks_name:

				  - EasyCLA

				  - Lint

				  - pull

				- name: Distributions

				  patterns:

				  - torch/distributions/**

				@ -357,12 +376,14 @@

				- name: CPU inductor

				  patterns:

				  - torch/_inductor/mkldnn_lowerings.py

				  - torch/_inductor/fx_passes/mkldnn_fusion.py

				  - torch/_inductor/fx_passes/quantization.py

				  - torch/_inductor/codegen/cpp.py

				  - test/inductor/test_mkldnn_pattern_matcher.py

				  - test/inductor/test_cpu_repo.py

				  - test/inductor/test_cpu_cpp_wrapper.py

				  - aten/src/ATen/cpu/**

				  - aten/src/ATen/native/quantized/cpu/**

				  - test/quantization/core/test_quantized_op.py

				  - torch/ao/quantization/quantizer/x86_inductor_quantizer.py

									
7

.github/pytorch-probot.yml vendored
												
				@ -7,6 +7,9 @@ ciflow_push_tags:

				- ciflow/binaries_wheel

				- ciflow/inductor

				- ciflow/inductor-perf-compare

				- ciflow/inductor-micro-benchmark

				- ciflow/inductor-cu124

				- ciflow/linux-aarch64

				- ciflow/mps

				- ciflow/nightly

				- ciflow/periodic

				@ -15,9 +18,11 @@ ciflow_push_tags:

				- ciflow/trunk

				- ciflow/unstable

				- ciflow/xpu

				- ciflow/torchbench

				retryable_workflows:

				- lint

				- pull

				- trunk

				- linux-binary

				- windows-binary

				labeler_config: labeler.yml

				label_to_label_config: label_to_label.yml

4

.github/requirements-gha-cache.txt vendored


 @ -5,11 +5,11 @@
 #   functorch/docs/requirements.txt
 #   .ci/docker/requirements-ci.txt
 boto3==1.19.12
 jinja2==3.0.1
 jinja2==3.1.4
 lintrunner==0.10.7
 ninja==1.10.0.post1
 nvidia-ml-py==11.525.84
 pyyaml==6.0
 requests==2.31.0
 requests==2.32.2
 rich==10.9.0
 rockset==1.0.3

3

.github/requirements/conda-env-Linux-X64.txt vendored


 @ -4,6 +4,5 @@ mkl-include=2022.1.0
 ninja=1.10.2
 numpy=1.23.3
 pyyaml=6.0
 requests=2.31.0
 setuptools=68.2.2
 typing-extensions=4.3.0
 typing-extensions=4.9.0

3

.github/requirements/conda-env-iOS.txt vendored


 @ -3,6 +3,5 @@ cmake=3.22.1
 ninja=1.10.2
 numpy=1.23.3
 pyyaml=6.0
 requests=2.31.0
 setuptools=68.2.2
 typing-extensions=4.3.0
 typing-extensions=4.9.0

2

.github/requirements/conda-env-macOS-ARM64 vendored


 @ -2,7 +2,7 @@ numpy=1.22.3
 pyyaml=6.0
 setuptools=61.2.0
 cmake=3.22.*
 typing-extensions=4.3.0
 typing-extensions=4.9.0
 dataclasses=0.8
 pip=22.2.2
 pillow=10.0.1

2

.github/requirements/conda-env-macOS-X64 vendored


 @ -4,7 +4,7 @@ numpy=1.21.2
 pyyaml=5.3
 setuptools=46.0.0
 cmake=3.22.*
 typing-extensions=4.3.0
 typing-extensions=4.9.0
 dataclasses=0.8
 pip=22.2.2
 pillow=10.0.1

2

.github/requirements/pip-requirements-iOS.txt vendored


 @ -1,4 +1,4 @@
 # iOS simulator requirements
 coremltools==5.0b5
 protobuf==3.20.2
 optree==0.9.1
 optree==0.11.0

5

.github/requirements/pip-requirements-macOS.txt vendored


 @ -26,4 +26,7 @@ pytest-cpp==2.3.0
 rockset==1.0.3
 z3-solver==4.12.2.0
 tensorboard==2.13.0
 optree==0.9.1
 optree==0.11.0
 # NB: test_hparams_* from test_tensorboard is failing with protobuf 5.26.0 in
 # which the stringify metadata is wrong when escaping double quote
 protobuf==3.20.2

									
99

.github/scripts/amd/package_triton_wheel.sh vendored Executable file
												
				@ -0,0 +1,99 @@

				set -ex

				# If ROCM_HOME isn't set, fall back to ROCM_PATH if set, otherwise /opt/rocm

				ROCM_HOME="${ROCM_HOME:-${ROCM_PATH:-/opt/rocm}}"

				# Find the rocm_version.h header file to extract the ROCm version

				rocm_version_h="${ROCM_HOME}/include/rocm-core/rocm_version.h"

				if [ ! -f "$rocm_version_h" ]; then

				    rocm_version_h="${ROCM_HOME}/include/rocm_version.h"

				fi

				# Error out if rocm_version.h not found

				if [ ! -f "$rocm_version_h" ]; then

				    echo "Error: rocm_version.h not found in expected locations." >&2

				    exit 1

				fi

				# Extract major, minor and patch ROCm version numbers

				MAJOR_VERSION=$(grep 'ROCM_VERSION_MAJOR' "$rocm_version_h" | awk '{print $3}')

				MINOR_VERSION=$(grep 'ROCM_VERSION_MINOR' "$rocm_version_h" | awk '{print $3}')

				PATCH_VERSION=$(grep 'ROCM_VERSION_PATCH' "$rocm_version_h" | awk '{print $3}')

				ROCM_INT=$(($MAJOR_VERSION * 10000 + $MINOR_VERSION * 100 + $PATCH_VERSION))

				echo "ROCm version: $ROCM_INT"

				# Check TRITON_ROCM_DIR is set

				if [[ -z "${TRITON_ROCM_DIR}" ]]; then

				    export TRITON_ROCM_DIR=third_party/amd/backend

				fi

				# Remove packaged libs and headers

				rm -rf $TRITON_ROCM_DIR/include/*

				LIBTINFO_PATH="/usr/lib64/libtinfo.so.5"

				LIBNUMA_PATH="/usr/lib64/libnuma.so.1"

				LIBELF_PATH="/usr/lib64/libelf.so.1"

				OS_SO_PATHS=(

				    $LIBELF_PATH

				    $LIBNUMA_PATH

				    $LIBTINFO_PATH

				)

				for lib in "${OS_SO_PATHS[@]}"

				do

				    cp $lib $TRITON_ROCM_DIR/lib/

				done

				# Required ROCm libraries

				if [[ "${MAJOR_VERSION}" == "6" ]]; then

				    libamdhip="libamdhip64.so.6"

				else

				    libamdhip="libamdhip64.so.5"

				fi

				# Required ROCm libraries - ROCm 6.0

				ROCM_SO=(

				    "${libamdhip}"

				    "libhsa-runtime64.so.1"

				    "libamd_comgr.so.2"

				    "libdrm.so.2"

				    "libdrm_amdgpu.so.1"

				)

				if [[ $ROCM_INT -ge 60100 ]]; then

				    ROCM_SO+=("librocprofiler-register.so.0")

				fi

				for lib in "${ROCM_SO[@]}"

				do

				    file_path=($(find $ROCM_HOME/lib/ -name "$lib")) # First search in lib

				    if [[ -z $file_path ]]; then

				        if [ -d "$ROCM_HOME/lib64/" ]; then

				            file_path=($(find $ROCM_HOME/lib64/ -name "$lib")) # Then search in lib64

				        fi

				    fi

				    if [[ -z $file_path ]]; then

				        file_path=($(find $ROCM_HOME/ -name "$lib")) # Then search in ROCM_HOME

				    fi

				    if [[ -z $file_path ]]; then

				        file_path=($(find /opt/ -name "$lib")) # Then search in /opt

				    fi

				    if [[ -z $file_path ]]; then

				            echo "Error: Library file $lib is not found." >&2

				            exit 1

				    fi

				    cp $file_path $TRITON_ROCM_DIR/lib

				    # When running locally, and not building a wheel, we need to satisfy shared objects requests that don't look for versions

				    LINKNAME=$(echo $lib | sed -e 's/\.so.*/.so/g')

				    ln -sf $lib $TRITON_ROCM_DIR/lib/$LINKNAME

				done

				# Copy Include Files

				cp -r $ROCM_HOME/include/hip $TRITON_ROCM_DIR/include

				# Copy linker

				mkdir -p $TRITON_ROCM_DIR/llvm/bin

				cp $ROCM_HOME/llvm/bin/ld.lld $TRITON_ROCM_DIR/llvm/bin/
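The script above packs the ROCm major, minor and patch numbers into one integer (`major * 10000 + minor * 100 + patch`) so that a check such as `-ge 60100` means "ROCm 6.1.0 or newer". The same arithmetic is sketched below in Python. The function name `rocm_version_int` and the sample header text are illustrative assumptions; the `#define NAME value` layout is inferred from the script's use of `awk '{print $3}'`.

```python
import re


def rocm_version_int(header_text: str) -> int:
    """Hypothetical helper: encode ROCm major/minor/patch as a single integer."""

    def field(name: str) -> int:
        # Assumes lines of the form "#define ROCM_VERSION_<NAME> <number>".
        match = re.search(
            rf"^\s*#define\s+ROCM_VERSION_{name}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            raise ValueError(f"ROCM_VERSION_{name} not found")
        return int(match.group(1))

    return field("MAJOR") * 10000 + field("MINOR") * 100 + field("PATCH")


# Illustrative header contents, not taken from a real ROCm install.
SAMPLE_HEADER = """\
#define ROCM_VERSION_MAJOR 6
#define ROCM_VERSION_MINOR 1
#define ROCM_VERSION_PATCH 2
"""

rocm_int = rocm_version_int(SAMPLE_HEADER)
needs_rocprofiler_register = rocm_int >= 60100
print(rocm_int, needs_rocprofiler_register)
```

The encoding compares correctly as long as the minor and patch numbers stay below 100.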

									
103

.github/scripts/amd/patch_triton_wheel.sh vendored Executable file
												
				@ -0,0 +1,103 @@

				#!/bin/bash

				set -x

				if [ -z "$1" ]; then

				    echo "Need wheel location argument" && exit 1

				fi

				WHEELHOUSE_DIR=$1

				PATCHELF_BIN=patchelf

				ROCM_LIB=backends/amd/lib

				ROCM_LD=backends/amd/llvm/bin

				PREFIX=triton

				fname_without_so_number() {

				    LINKNAME=$(echo $1 | sed -e 's/\.so.*/.so/g')

				    echo "$LINKNAME"

				}

				replace_needed_sofiles() {

				    find $1 -name '*.so*' -o -name 'ld.lld' | while read sofile; do

				        origname=$2

				        patchedname=$3

				        if [[ "$origname" != "$patchedname" ]]; then

				            set +e

				            origname=$($PATCHELF_BIN --print-needed $sofile | grep "$origname.*")

				            ERRCODE=$?

				            set -e

				            if [ "$ERRCODE" -eq "0" ]; then

				                echo "patching $sofile entry $origname to $patchedname"

				                $PATCHELF_BIN --replace-needed $origname $patchedname $sofile

				            fi

				        fi

				    done

				}

				mkdir  -p "/tmp_dir"

				pushd /tmp_dir

				for pkg in /$WHEELHOUSE_DIR/*triton*.whl; do

				    echo "Modifying $pkg"

				    rm -rf tmp

				    mkdir -p tmp

				    cd tmp

				    cp $pkg .

				    unzip -q $(basename $pkg)

				    rm -f $(basename $pkg)

				    $PATCHELF_BIN --set-rpath ${LD_SO_RPATH:-'$ORIGIN:$ORIGIN/../../lib'} $PREFIX/$ROCM_LD/ld.lld

				    $PATCHELF_BIN --print-rpath $PREFIX/$ROCM_LD/ld.lld

				    # Modify libtriton.so as it sits in _C directory apart from its dependencies

				    find $PREFIX/_C -type f -name "*.so*" | while read sofile; do

				        echo "Setting rpath of $sofile"

				        $PATCHELF_BIN --set-rpath ${C_SO_RPATH:-'$ORIGIN:$ORIGIN/'../$ROCM_LIB} ${FORCE_RPATH:-} $sofile

				        $PATCHELF_BIN --print-rpath $sofile

				    done

				    # All bundled dependencies live in a single lib directory

				    deps=()

				    deps_soname=()

				    while read sofile; do

				        echo "Setting rpath of $sofile to ${LIB_SO_RPATH:-'$ORIGIN'}"

				        $PATCHELF_BIN --set-rpath ${LIB_SO_RPATH:-'$ORIGIN'} ${FORCE_RPATH:-} $sofile

				        $PATCHELF_BIN --print-rpath $sofile

				        deps+=("$sofile")

				        deps_soname+=("$(basename $sofile)")

				    done < <(find $PREFIX/$ROCM_LIB -type f -name "*.so*")

				    patched=()

				    for filepath in "${deps[@]}"; do

				        filename=$(basename $filepath)

				        destpath=$PREFIX/$ROCM_LIB/$filename

				        if [[ "$filepath" != "$destpath" ]]; then

				            cp $filepath $destpath

				        fi

				        patchedpath=$(fname_without_so_number $destpath)

				        patchedname=$(basename $patchedpath)

				        if [[ "$destpath" != "$patchedpath" ]]; then

				            mv $destpath $patchedpath

				        fi

				        patched+=("$patchedname")

				        echo "Copied $filepath to $patchedpath"

				    done

				    # Go through all required shared objects and see if any of our other objects are dependents.  If so, replace so.ver with so

				    for ((i=0;i<${#deps[@]};++i)); do

				        echo "replacing "${deps_soname[i]} ${patched[i]}

				        replace_needed_sofiles $PREFIX/$ROCM_LIB ${deps_soname[i]} ${patched[i]}

				        replace_needed_sofiles $PREFIX/_C ${deps_soname[i]} ${patched[i]}

				        replace_needed_sofiles $PREFIX/$ROCM_LD ${deps_soname[i]} ${patched[i]}

				    done

				    # Re-bundle whl with so adjustments

				    zip -rqy $(basename $pkg) *

				    if [[ -z "${MANYLINUX_VERSION}" ]]; then

				        newpkg=$pkg

				    else

				        newpkg=$(echo $pkg | sed -e "s/\linux_x86_64/${MANYLINUX_VERSION}/g")

				    fi

				    # Remove original whl

				    rm -f $pkg

				    # Move rebuilt whl to original location with new name.

				    mv $(basename $pkg) $newpkg

				done
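The `fname_without_so_number` function in the script above strips everything after the first `.so` using `sed -e 's/\.so.*/.so/g'`. A Python rendering of the same substitution is sketched below to show its effect on a few names. This is an illustration of the regular expression only, not part of the packaging flow.

```python
import re


def fname_without_so_number(name: str) -> str:
    """Drop the version suffix after the first '.so', mirroring the sed expression."""
    return re.sub(r"\.so.*", ".so", name)


for example in ("libamdhip64.so.6", "backends/amd/lib/libdrm.so.2", "libtriton.so"):
    print(example, "->", fname_without_so_number(example))
```

Because the pattern matches the first `.so` in the string, a name that contains `.so` earlier than its real suffix would be cut at that earlier point. The library names listed in the scripts above do not have that shape.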

									
54

.github/scripts/build_triton_wheel.py vendored
												
				@ -10,9 +10,6 @@ from typing import Optional

				SCRIPT_DIR = Path(__file__).parent

				REPO_DIR = SCRIPT_DIR.parent.parent

				# TODO: Remove me once Triton version is again in sync for vanilla and ROCm

				ROCM_TRITION_VERSION = "2.1.0"

				def read_triton_pin(rocm_hash: bool = False) -> str:

				    triton_file = "triton.txt" if not rocm_hash else "triton-rocm.txt"

				@ -32,27 +29,6 @@ def check_and_replace(inp: str, src: str, dst: str) -> str:

				    return inp.replace(src, dst)

				def patch_setup_py(

				    path: Path,

				    *,

				    version: str,

				    name: str = "triton",

				    expected_version: Optional[str] = None,

				) -> None:

				    with open(path) as f:

				        orig = f.read()

				    # Replace name

				    orig = check_and_replace(orig, 'name="triton",', f'name="{name}",')

				    # Replace version

				    if not expected_version:

				        expected_version = read_triton_version()

				    orig = check_and_replace(

				        orig, f'version="{expected_version}",', f'version="{version}",'

				    )

				    with open(path, "w") as f:

				        f.write(orig)

				def patch_init_py(

				    path: Path, *, version: str, expected_version: Optional[str] = None

				) -> None:

				@ -92,14 +68,20 @@ def build_triton(

				    with TemporaryDirectory() as tmpdir:

				        triton_basedir = Path(tmpdir) / "triton"

				        triton_pythondir = triton_basedir / "python"

				        triton_repo = "https://github.com/openai/triton"

				        if build_rocm:

				            triton_repo = "https://github.com/ROCmSoftwarePlatform/triton"

				            triton_pkg_name = "pytorch-triton-rocm"

				        else:

				            triton_repo = "https://github.com/openai/triton"

				            triton_pkg_name = "pytorch-triton"

				        check_call(["git", "clone", triton_repo], cwd=tmpdir)

				        check_call(["git", "checkout", commit_hash], cwd=triton_basedir)

				        if release:

				            ver, rev, patch = version.split(".")

				            check_call(

				                ["git", "checkout", f"release/{ver}.{rev}.x"], cwd=triton_basedir

				            )

				        else:

				            check_call(["git", "checkout", commit_hash], cwd=triton_basedir)

				        if build_conda:

				            with open(triton_basedir / "meta.yaml", "w") as meta:

				                print(

				@ -155,18 +137,15 @@ def build_triton(

				        patch_init_py(

				            triton_pythondir / "triton" / "__init__.py",

				            version=f"{version}",

				            expected_version=ROCM_TRITION_VERSION if build_rocm else None,

				            expected_version=None,

				        )

				        if build_rocm:

				            # TODO: Remove me when ROCM triton is updated

				            patch_setup_py(

				                triton_pythondir / "setup.py",

				                name=triton_pkg_name,

				                version=f"{version}",

				                expected_version=ROCM_TRITION_VERSION,

				            check_call(

				                [f"{SCRIPT_DIR}/amd/package_triton_wheel.sh"],

				                cwd=triton_basedir,

				                shell=True,

				            )

				            check_call("scripts/amd/setup_rocm_libs.sh", cwd=triton_basedir, shell=True)

				            print("ROCm libraries setup for triton installation...")

				        check_call(

				@ -177,7 +156,10 @@ def build_triton(

				        shutil.copy(whl_path, Path.cwd())

				        if build_rocm:

				            check_call("scripts/amd/fix_so.sh", cwd=triton_basedir, shell=True)

				            check_call(

				                [f"{SCRIPT_DIR}/amd/patch_triton_wheel.sh", Path.cwd()],

				                cwd=triton_basedir,

				            )

				        return Path.cwd() / whl_path.name
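In the release path of the hunk above, the branch to check out is derived from the version string by splitting on dots and replacing the patch component with `x`. That derivation is isolated in the sketch below. The function name `release_branch` is hypothetical, and the version string used in the example is illustrative rather than taken from the repository.

```python
def release_branch(version: str) -> str:
    """Hypothetical helper: map 'MAJOR.MINOR.PATCH' to 'release/MAJOR.MINOR.x'."""
    # Unpacking raises ValueError unless the version has exactly three parts,
    # which matches the behaviour of the three-way split in the diff.
    ver, rev, _patch = version.split(".")
    return f"release/{ver}.{rev}.x"


print(release_branch("3.0.0"))
```

A version with more or fewer than three dot-separated parts fails at the unpacking step, so callers would need to pass a plain three-part version.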

									
2

.github/scripts/cherry_pick.py vendored
												
				@ -29,7 +29,7 @@ def parse_args() -> Any:

				        "--onto-branch", type=str, required=True, help="the target release branch"

				    )

				    parser.add_argument(

				        "--github-actor", type=str, required=True, help="all the world’s a stage"

				        "--github-actor", type=str, required=True, help="all the world's a stage"

				    )

				    parser.add_argument(

				        "--classification",

									
6

.github/scripts/comment_on_pr.py vendored
												
				@ -23,8 +23,10 @@ def main() -> None:

				    job_link = f"[job]({run_url})" if run_url is not None else "job"

				    msg = (

				        f"The {args.action} {job_link} was canceled. If you believe this is a mistake,"

				        + f" then you can re trigger it through [pytorch-bot]({BOT_COMMANDS_WIKI})."

				        f"The {args.action} {job_link} was canceled or timed out. This most often happens if two merge requests were issued"

				        + " for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish."

				        + " In the latter case, please do not hesitate to reissue the merge command.\n"

				        + f" For more information see [pytorch-bot wiki]({BOT_COMMANDS_WIKI})."

				    )

				    gh_post_pr_comment(org, project, args.pr_num, msg)

									
60

.github/scripts/delete_old_branches.py vendored
												
				@ -2,6 +2,7 @@

				import os

				import re

				from datetime import datetime

				from functools import lru_cache

				from pathlib import Path

				from typing import Any, Callable, Dict, List, Set

				@ -18,7 +19,7 @@ ESTIMATED_TOKENS = [0]

				TOKEN = os.environ["GITHUB_TOKEN"]

				if not TOKEN:

				    raise Exception("GITHUB_TOKEN is not set")

				    raise Exception("GITHUB_TOKEN is not set")  # noqa: TRY002

				REPO_ROOT = Path(__file__).parent.parent.parent

				@ -187,6 +188,17 @@ def get_recent_prs() -> Dict[str, Any]:

				    return prs_by_branch_base

				@lru_cache(maxsize=1)

				def get_open_prs() -> List[Dict[str, Any]]:

				    return paginate_graphql(

				        GRAPHQL_OPEN_PRS,

				        {"owner": "pytorch", "repo": "pytorch"},

				        lambda data: False,

				        lambda res: res["data"]["repository"]["pullRequests"]["nodes"],

				        lambda res: res["data"]["repository"]["pullRequests"]["pageInfo"],

				    )

				def get_branches_with_magic_label_or_open_pr() -> Set[str]:

				    pr_infos: List[Dict[str, Any]] = paginate_graphql(

				        GRAPHQL_NO_DELETE_BRANCH_LABEL,

				@ -196,15 +208,7 @@ def get_branches_with_magic_label_or_open_pr() -> Set[str]:

				        lambda res: res["data"]["repository"]["label"]["pullRequests"]["pageInfo"],

				    )

				    pr_infos.extend(

				        paginate_graphql(

				            GRAPHQL_OPEN_PRS,

				            {"owner": "pytorch", "repo": "pytorch"},

				            lambda data: False,

				            lambda res: res["data"]["repository"]["pullRequests"]["nodes"],

				            lambda res: res["data"]["repository"]["pullRequests"]["pageInfo"],

				        )

				    )

				    pr_infos.extend(get_open_prs())

				    # Get the most recent PR for each branch base (group gh together)

				    branch_bases = set()

				@ -270,5 +274,41 @@ def delete_branches() -> None:

				        delete_branch(git_repo, branch)

				def delete_old_ciflow_tags() -> None:

				    # Deletes ciflow tags if they are associated with a closed PR or a specific

				    # commit.  Lightweight tags don't have information about the date they were

				    # created, so we can't check how old they are.  The script just assumes that

				    # ciflow tags should be deleted regardless of creation date.

				    git_repo = GitRepo(str(REPO_ROOT), "origin", debug=True)

				    def delete_tag(tag: str) -> None:

				        print(f"Deleting tag {tag}")

				        ESTIMATED_TOKENS[0] += 1

				        delete_branch(git_repo, f"refs/tags/{tag}")

				    tags = git_repo._run_git("tag").splitlines()

				    open_pr_numbers = [x["number"] for x in get_open_prs()]

				    for tag in tags:

				        try:

				            if ESTIMATED_TOKENS[0] > 400:

				                print("Estimated tokens exceeded, exiting")

				                break

				            if not tag.startswith("ciflow/"):

				                continue

				            re_match_pr = re.match(r"^ciflow\/.*\/(\d{5,6})$", tag)

				            re_match_sha = re.match(r"^ciflow\/.*\/([0-9a-f]{40})$", tag)

				            if re_match_pr:

				                pr_number = int(re_match_pr.group(1))

				                if pr_number in open_pr_numbers:

				                    continue

				                delete_tag(tag)

				            elif re_match_sha:

				                delete_tag(tag)

				        except Exception as e:

				            print(f"Failed to check tag {tag}: {e}")

				if __name__ == "__main__":

				    delete_branches()

				    delete_old_ciflow_tags()
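The tag cleanup added above decides per tag using two regular expressions: one for tags ending in a 5 or 6 digit PR number, one for tags ending in a 40-character commit SHA. The sketch below isolates that decision as a pure function so it can be checked without a git checkout or the GitHub API. The function name `should_delete_tag` is hypothetical, and the token budget check from the diff is left out.

```python
import re
from typing import Set

# Same patterns as in the diff above.
PR_TAG = re.compile(r"^ciflow\/.*\/(\d{5,6})$")
SHA_TAG = re.compile(r"^ciflow\/.*\/([0-9a-f]{40})$")


def should_delete_tag(tag: str, open_pr_numbers: Set[int]) -> bool:
    """Hypothetical helper: True if the ciflow tag would be deleted."""
    if not tag.startswith("ciflow/"):
        return False
    pr_match = PR_TAG.match(tag)
    if pr_match:
        # Tags for PRs that are still open are kept.
        return int(pr_match.group(1)) not in open_pr_numbers
    # Tags pinned to a specific commit are always deleted.
    return SHA_TAG.match(tag) is not None


open_prs = {128372}
for tag in (
    "ciflow/trunk/128372",
    "ciflow/trunk/128300",
    "ciflow/inductor/" + "a" * 40,
    "v2.3.0",
):
    print(tag, should_delete_tag(tag, open_prs))
```

Tags that start with `ciflow/` but match neither pattern are left alone, which matches the `if` / `elif` structure in the diff.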

									
										52

.github/scripts/docathon-label-sync.py
									
										vendored
									
										Normal file
									
				@ -0,0 +1,52 @@

				import os

				import re

				import sys

				from github import Github

				def main() -> None:

				    token = os.environ.get("GITHUB_TOKEN")

				    repo_owner = "pytorch"

				    repo_name = "pytorch"

				    pull_request_number = int(sys.argv[1])

				    g = Github(token)

				    repo = g.get_repo(f"{repo_owner}/{repo_name}")

				    pull_request = repo.get_pull(pull_request_number)

				    pull_request_body = pull_request.body

				    # PR without description

				    if pull_request_body is None:

				        return

				    # get issue number from the PR body

				    if not re.search(r"#\d{1,6}", pull_request_body):

				        print("The pull request does not mention an issue.")

				        return

				    issue_number = int(re.findall(r"#(\d{1,6})", pull_request_body)[0])

				    issue = repo.get_issue(issue_number)

				    issue_labels = issue.labels

				    docathon_label_present = any(

				        label.name == "docathon-h1-2024" for label in issue_labels

				    )

				    # if the issue has a docathon label, add all labels from the issue to the PR.

				    if not docathon_label_present:

				        print("The 'docathon-h1-2024' label is not present in the issue.")

				        return

				    pull_request_labels = pull_request.get_labels()

				    pull_request_label_names = [label.name for label in pull_request_labels]

				    issue_label_names = [label.name for label in issue_labels]

				    labels_to_add = [

				        label for label in issue_label_names if label not in pull_request_label_names

				    ]

				    if not labels_to_add:

				        print("The pull request already has the same labels.")

				        return

				    pull_request.add_to_labels(*labels_to_add)

				    print("Labels added to the pull request!")

				if __name__ == "__main__":

				    main()
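The two pure steps of this script (finding the referenced issue and working out which labels are missing) can be sketched without the GitHub API. The function names `first_issue_number` and `labels_to_add` below are assumptions for illustration only; the regular expression and the list comprehension follow the diff.

```python
import re
from typing import List, Optional


def first_issue_number(body: Optional[str]) -> Optional[int]:
    """Return the first '#NNN' reference in a PR body, or None if there is none."""
    if body is None:
        # PR without a description.
        return None
    match = re.search(r"#(\d{1,6})", body)
    return int(match.group(1)) if match else None


def labels_to_add(issue_labels: List[str], pr_labels: List[str]) -> List[str]:
    """Labels present on the issue but missing from the PR, in issue order."""
    return [name for name in issue_labels if name not in pr_labels]
```

Note that only the first `#NNN` reference in the body is considered, which matches the `re.findall(...)[0]` call in the script.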

BIN
.github/scripts/drci_mocks.json.gz vendored

Binary file not shown.

									
										117

.github/scripts/filter_test_configs.py
									
										vendored
									
				@ -1,6 +1,7 @@

				#!/usr/bin/env python3

				import json

				import logging

				import os

				import re

				import subprocess

				@ -8,6 +9,7 @@ import sys

				import warnings

				from enum import Enum

				from functools import lru_cache

				from logging import info

				from typing import Any, Callable, Dict, List, Optional, Set

				from urllib.request import Request, urlopen

				@ -17,33 +19,7 @@ REENABLE_TEST_REGEX = "(?i)(Close(d|s)?|Resolve(d|s)?|Fix(ed|es)?) (#|https://gi

				PREFIX = "test-config/"

				# Same as shard names

				VALID_TEST_CONFIG_LABELS = {

				    f"{PREFIX}{label}"

				    for label in {

				        "backwards_compat",

				        "crossref",

				        "default",

				        "deploy",

				        "distributed",

				        "docs_tests",

				        "dynamo",

				        "force_on_cpu",

				        "functorch",

				        "inductor",

				        "inductor_distributed",

				        "inductor_huggingface",

				        "inductor_timm",

				        "inductor_torchbench",

				        "jit_legacy",

				        "multigpu",

				        "nogpu_AVX512",

				        "nogpu_NO_AVX2",

				        "slow",

				        "tsan",

				        "xla",

				    }

				}

				logging.basicConfig(level=logging.INFO)

				def is_cuda_or_rocm_job(job_name: Optional[str]) -> bool:

				@ -90,6 +66,12 @@ def parse_args() -> Any:

				    parser.add_argument(

				        "--test-matrix", type=str, required=True, help="the original test matrix"

				    )

				    parser.add_argument(

				        "--selected-test-configs",

				        type=str,

				        default="",

				        help="a comma-separated list of test configurations from the test matrix to keep",

				    )

				    parser.add_argument(

				        "--workflow", type=str, help="the name of the current workflow, i.e. pull"

				    )

				@ -155,19 +137,25 @@ def get_labels(pr_number: int) -> Set[str]:

				    }

				def filter_labels(labels: Set[str], label_regex: Any) -> Set[str]:

				    """

				    Return the list of matching labels

				    """

				    return {l for l in labels if re.match(label_regex, l)}

				def filter(test_matrix: Dict[str, List[Any]], labels: Set[str]) -> Dict[str, List[Any]]:

				    """

				    Select the list of test config to run from the test matrix. The logic works

				    as follows:

				    If the PR has one or more labels as specified in the VALID_TEST_CONFIG_LABELS set, only

				    these test configs will be selected.  This also works with ciflow labels, for example,

				    if a PR has both ciflow/trunk and test-config/functorch, only trunk functorch builds

				    and tests will be run

				    If the PR has one or more test-config labels as specified, only these test configs

				    will be selected.  This also works with ciflow labels, for example, if a PR has both

				    ciflow/trunk and test-config/functorch, only trunk functorch builds and tests will

				    be run.

				    If the PR has no test-config label, all tests are run as usual.

				    """

				    filtered_test_matrix: Dict[str, List[Any]] = {"include": []}

				    for entry in test_matrix.get("include", []):

				@ -177,23 +165,46 @@ def filter(test_matrix: Dict[str, List[Any]], labels: Set[str]) -> Dict[str, Lis

				        label = f"{PREFIX}{config_name.strip()}"

				        if label in labels:

				            print(

				                f"Select {config_name} because label {label} is presented in the pull request by the time the test starts"

				            )

				            msg = f"Select {config_name} because label {label} is present in the pull request by the time the test starts"

				            info(msg)

				            filtered_test_matrix["include"].append(entry)

				    valid_test_config_labels = labels.intersection(VALID_TEST_CONFIG_LABELS)

				    if not filtered_test_matrix["include"] and not valid_test_config_labels:

				        # Found no valid label and the filtered test matrix is empty, return the same

				    test_config_labels = filter_labels(labels, re.compile(f"{PREFIX}.+"))

				    if not filtered_test_matrix["include"] and not test_config_labels:

				        info("Found no test-config label on the PR, so all test configs are included")

				        # Found no test-config label and the filtered test matrix is empty, return the same

				        # test matrix as before so that all tests can be run normally

				        return test_matrix

				    else:

				        msg = f"Found {test_config_labels} on the PR so only these test configs are run"

				        info(msg)

				        # When the filtered test matrix contains matches or if a valid test config label

				        # is found in the PR, return the filtered test matrix

				        return filtered_test_matrix

				def filter_selected_test_configs(

				    test_matrix: Dict[str, List[Any]], selected_test_configs: Set[str]

				) -> Dict[str, List[Any]]:

				    """

				    Keep only the selected configs if the list is not empty. Otherwise, keep all test configs.

				    This filter is used when the workflow is dispatched manually.

				    """

				    if not selected_test_configs:

				        return test_matrix

				    filtered_test_matrix: Dict[str, List[Any]] = {"include": []}

				    for entry in test_matrix.get("include", []):

				        config_name = entry.get("config", "")

				        if not config_name:

				            continue

				        if config_name in selected_test_configs:

				            filtered_test_matrix["include"].append(entry)

				    return filtered_test_matrix

				def set_periodic_modes(

				    test_matrix: Dict[str, List[Any]], job_name: Optional[str]

				) -> Dict[str, List[Any]]:

				@ -374,30 +385,33 @@ def process_jobs(

				        # - If the target record has the job (config) name, only that test config

				        #   will be skipped or marked as unstable

				        if not target_job_cfg:

				            print(

				            msg = (

				                f"Issue {target_url} created by {author} has {issue_type.value} "

				                + f"all CI jobs for {workflow} / {job_name}"

				            )

				            info(msg)

				            return _filter_jobs(

				                test_matrix=test_matrix,

				                issue_type=issue_type,

				            )

				        if target_job_cfg == BUILD_JOB_NAME:

				            print(

				            msg = (

				                f"Issue {target_url} created by {author} has {issue_type.value} "

				                + f"the build job for {workflow} / {job_name}"

				            )

				            info(msg)

				            return _filter_jobs(

				                test_matrix=test_matrix,

				                issue_type=issue_type,

				            )

				        if target_job_cfg in (TEST_JOB_NAME, BUILD_AND_TEST_JOB_NAME):

				            print(

				            msg = (

				                f"Issue {target_url} created by {author} has {issue_type.value} "

				                + f"all the test jobs for {workflow} / {job_name}"

				            )

				            info(msg)

				            return _filter_jobs(

				                test_matrix=test_matrix,

				                issue_type=issue_type,

				@ -463,7 +477,7 @@ def parse_reenabled_issues(s: Optional[str]) -> List[str]:

				def get_reenabled_issues(pr_body: str = "") -> List[str]:

				    default_branch = os.getenv("GIT_DEFAULT_BRANCH", "main")

				    default_branch = f"origin/{os.environ.get('GIT_DEFAULT_BRANCH', 'main')}"

				    try:

				        commit_messages = subprocess.check_output(

				            f"git cherry -v {default_branch}".split(" ")

				@ -494,10 +508,15 @@ def perform_misc_tasks(

				        "ci-no-test-timeout", check_for_setting(labels, pr_body, "ci-no-test-timeout")

				    )

				    set_output("ci-no-td", check_for_setting(labels, pr_body, "ci-no-td"))

				    # Only relevant for the one linux distributed cuda job, delete this when TD

				    # is rolled out completely

				    set_output(

				        "ci-td-distributed", check_for_setting(labels, pr_body, "ci-td-distributed")

				    )

				    # Obviously, if the job name includes unstable, then this is an unstable job

				    is_unstable = job_name and IssueType.UNSTABLE.value in job_name

				    if not is_unstable and test_matrix:

				    if not is_unstable and test_matrix and test_matrix.get("include"):

				        # Even when the job name doesn't mention unstable, we will also mark it as

				        # unstable when the test matrix only includes unstable jobs. Basically, this

				        # logic allows build or build-and-test jobs to be marked as unstable too.

				@ -567,6 +586,16 @@ def main() -> None:

				        # No PR number, no tag, we can just return the test matrix as it is

				        filtered_test_matrix = test_matrix

				    if args.selected_test_configs:

				        selected_test_configs = {

				            v.strip().lower()

				            for v in args.selected_test_configs.split(",")

				            if v.strip()

				        }

				        filtered_test_matrix = filter_selected_test_configs(

				            filtered_test_matrix, selected_test_configs

				        )

				    if args.event_name == "schedule" and args.schedule == "29 8 * * *":

				        # we don't want to run the mem leak check or disabled tests on normal

				        # periodically scheduled jobs, only the ones at this time
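To illustrate how `--selected-test-configs` narrows the matrix, here is a self-contained sketch that repeats the parsing step from `main` and the filtering step from `filter_selected_test_configs`. The helper name `parse_selected` and the sample matrix in the usage note are assumptions for illustration, not part of the script.

```python
from typing import Any, Dict, List, Set


def parse_selected(raw: str) -> Set[str]:
    """Split a comma-separated option value, dropping blanks and lowercasing."""
    return {v.strip().lower() for v in raw.split(",") if v.strip()}


def filter_selected_test_configs(
    test_matrix: Dict[str, List[Any]], selected_test_configs: Set[str]
) -> Dict[str, List[Any]]:
    """Keep only the selected configs; an empty selection keeps everything."""
    if not selected_test_configs:
        return test_matrix

    filtered: Dict[str, List[Any]] = {"include": []}
    for entry in test_matrix.get("include", []):
        config_name = entry.get("config", "")
        if not config_name:
            # Entries without a config name are skipped, as in the diff.
            continue
        if config_name in selected_test_configs:
            filtered["include"].append(entry)
    return filtered
```

With a matrix holding `default` and `functorch` entries, passing `parse_selected("functorch")` keeps only the `functorch` entry, while an empty string returns the matrix unchanged.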

									
										97

.github/scripts/generate_binary_build_matrix.py
									
										vendored
									
				@ -13,16 +13,16 @@ architectures:

				import os

				from typing import Dict, List, Optional, Tuple

				CUDA_ARCHES = ["11.8", "12.1"]

				CUDA_ARCHES = ["11.8", "12.1", "12.4"]

				CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1"}

				CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1", "12.4": "12.4.0"}

				CUDA_ARCHES_CUDNN_VERSION = {"11.8": "8", "12.1": "8"}

				CUDA_ARCHES_CUDNN_VERSION = {"11.8": "9", "12.1": "9", "12.4": "9"}

				ROCM_ARCHES = ["5.7", "6.0"]

				ROCM_ARCHES = ["6.0", "6.1"]

				CPU_CXX11_ABI_ARCH = ["cpu-cxx11-abi"]

				@ -31,12 +31,18 @@ CPU_CXX11_ABI_ARCH = ["cpu-cxx11-abi"]

				CPU_AARCH64_ARCH = ["cpu-aarch64"]

				CPU_S390X_ARCH = ["cpu-s390x"]

				CUDA_AARCH64_ARCH = ["cuda-aarch64"]

				PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {

				    "11.8": (

				        "nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | "  # noqa: B950

				        "nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				@ -49,7 +55,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {

				        "nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "  # noqa: B950

				        "nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				@ -58,6 +64,20 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {

				        "nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'"

				    ),

				    "12.4": (

				        "nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'"

				    ),

				}

				@ -116,6 +136,10 @@ def arch_type(arch_version: str) -> str:

				        return "cpu-cxx11-abi"

				    elif arch_version in CPU_AARCH64_ARCH:

				        return "cpu-aarch64"

				    elif arch_version in CPU_S390X_ARCH:

				        return "cpu-s390x"

				    elif arch_version in CUDA_AARCH64_ARCH:

				        return "cuda-aarch64"

				    else:  # arch_version should always be "cpu" in this case

				        return "cpu"

				@ -135,6 +159,8 @@ WHEEL_CONTAINER_IMAGES = {

				    "cpu": f"pytorch/manylinux-builder:cpu-{DEFAULT_TAG}",

				    "cpu-cxx11-abi": f"pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-{DEFAULT_TAG}",

				    "cpu-aarch64": f"pytorch/manylinuxaarch64-builder:cpu-aarch64-{DEFAULT_TAG}",

				    "cpu-s390x": f"pytorch/manylinuxs390x-builder:cpu-s390x-{DEFAULT_TAG}",

				    "cuda-aarch64": f"pytorch/manylinuxaarch64-builder:cuda12.4-{DEFAULT_TAG}",

				}

				CONDA_CONTAINER_IMAGES = {

				@ -191,7 +217,9 @@ def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:

				        "cpu": "cpu",

				        "cpu-aarch64": "cpu",

				        "cpu-cxx11-abi": "cpu-cxx11-abi",

				        "cpu-s390x": "cpu",

				        "cuda": f"cu{gpu_arch_version.replace('.', '')}",

				        "cuda-aarch64": "cu124",

				        "rocm": f"rocm{gpu_arch_version}",

				    }.get(gpu_arch_type, gpu_arch_version)

				@ -272,11 +300,11 @@ def generate_libtorch_matrix(

				                    "libtorch_variant": libtorch_variant,

				                    "libtorch_config": abi_version if os == "windows" else "",

				                    "devtoolset": abi_version if os != "windows" else "",

				                    "container_image": LIBTORCH_CONTAINER_IMAGES[

				                        (arch_version, abi_version)

				                    ]

				                    if os != "windows"

				                    else "",

				                    "container_image": (

				                        LIBTORCH_CONTAINER_IMAGES[(arch_version, abi_version)]

				                        if os != "windows"

				                        else ""

				                    ),

				                    "package_type": "libtorch",

				                    "build_name": f"libtorch-{gpu_arch_type}{gpu_arch_version}-{libtorch_variant}-{abi_version}".replace(

				                        ".", "_"

				@ -292,8 +320,8 @@ def generate_wheels_matrix(

				    python_versions: Optional[List[str]] = None,

				) -> List[Dict[str, str]]:

				    package_type = "wheel"

				    if os == "linux" or os == "linux-aarch64":

				        # NOTE: We only build manywheel packages for x86_64 and aarch64 linux

				    if os == "linux" or os == "linux-aarch64" or os == "linux-s390x":

				        # NOTE: We only build manywheel packages for x86_64 and aarch64 and s390x linux

				        package_type = "manywheel"

				    if python_versions is None:

				@ -309,22 +337,36 @@ def generate_wheels_matrix(

				        elif os == "linux-aarch64":

				            # Only want the one arch as the CPU type is different and

				            # uses different build/test scripts

				            arches = ["cpu-aarch64"]

				            arches = ["cpu-aarch64", "cuda-aarch64"]

				        elif os == "linux-s390x":

				            # Only want the one arch as the CPU type is different and

				            # uses different build/test scripts

				            arches = ["cpu-s390x"]

				    ret: List[Dict[str, str]] = []

				    for python_version in python_versions:

				        for arch_version in arches:

				            gpu_arch_type = arch_type(arch_version)

				            # Disable py3.12 builds for ROCm because of triton dependency

				            # on llnl-hatchet, which doesn't have py3.12 wheels available

				            if gpu_arch_type == "rocm" and python_version == "3.12":

				                continue

				            gpu_arch_version = (

				                ""

				                if arch_version == "cpu"

				                or arch_version == "cpu-cxx11-abi"

				                or arch_version == "cpu-aarch64"

				                or arch_version == "cpu-s390x"

				                or arch_version == "cuda-aarch64"

				                else arch_version

				            )

				            # 12.1 linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install

				            if arch_version in ["12.1", "11.8"] and os == "linux":

				            if (

				                arch_version in ["12.4", "12.1", "11.8"]

				                and os == "linux"

				                or arch_version == "cuda-aarch64"

				            ):

				                ret.append(

				                    {

				                        "python_version": python_version,

				@ -333,10 +375,16 @@ def generate_wheels_matrix(

				                        "desired_cuda": translate_desired_cuda(

				                            gpu_arch_type, gpu_arch_version

				                        ),

				                        "devtoolset": "",

				                        "devtoolset": (

				                            "cxx11-abi" if arch_version == "cuda-aarch64" else ""

				                        ),

				                        "container_image": WHEEL_CONTAINER_IMAGES[arch_version],

				                        "package_type": package_type,

				                        "pytorch_extra_install_requirements": PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version],  # fmt: skip

				                        "pytorch_extra_install_requirements": (

				                            PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version]  # fmt: skip

				                            if os != "linux-aarch64"

				                            else ""

				                        ),

				                        "build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}".replace(  # noqa: B950

				                            ".", "_"

				                        ),

				@ -351,21 +399,24 @@ def generate_wheels_matrix(

				                        "desired_cuda": translate_desired_cuda(

				                            gpu_arch_type, gpu_arch_version

				                        ),

				                        "devtoolset": "cxx11-abi"

				                        if arch_version == "cpu-cxx11-abi"

				                        else "",

				                        "devtoolset": (

				                            "cxx11-abi" if arch_version == "cpu-cxx11-abi" else ""

				                        ),

				                        "container_image": WHEEL_CONTAINER_IMAGES[arch_version],

				                        "package_type": package_type,

				                        "build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}".replace(

				                            ".", "_"

				                        ),

				                        "pytorch_extra_install_requirements":

				                        PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.1"]  # fmt: skip

				                        if os != "linux" else "",

				                        "pytorch_extra_install_requirements": (

				                            PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.1"]  # fmt: skip

				                            if os != "linux"

				                            else ""

				                        ),

				                    }

				                )

				    return ret

				validate_nccl_dep_consistency("12.4")

				validate_nccl_dep_consistency("12.1")

				validate_nccl_dep_consistency("11.8")
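Two details of this file are easy to misread in diff form: the `desired_cuda` mapping and the operator precedence in the new `if` condition. The sketch below is an illustration under stated assumptions. The body of `translate_desired_cuda` is reconstructed from the entries visible in the hunk (the opening `return {` is not shown in the diff), and `uses_extra_requirements_branch` is a hypothetical name for the condition.

```python
from typing import List

# CUDA versions named in the condition, copied from the diff above.
CUDA_WITH_EXTRA_REQUIREMENTS: List[str] = ["12.4", "12.1", "11.8"]


def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:
    # Mapping as shown in the hunk; unknown types fall back to the version string.
    return {
        "cpu": "cpu",
        "cpu-aarch64": "cpu",
        "cpu-cxx11-abi": "cpu-cxx11-abi",
        "cpu-s390x": "cpu",
        "cuda": f"cu{gpu_arch_version.replace('.', '')}",
        "cuda-aarch64": "cu124",
        "rocm": f"rocm{gpu_arch_version}",
    }.get(gpu_arch_type, gpu_arch_version)


def uses_extra_requirements_branch(arch_version: str, os: str) -> bool:
    # `and` binds tighter than `or`, so the condition in the diff groups like this:
    # (CUDA version on x86_64 linux) or (the cuda-aarch64 arch on any os).
    return (
        arch_version in CUDA_WITH_EXTRA_REQUIREMENTS and os == "linux"
    ) or arch_version == "cuda-aarch64"
```

Because of that grouping, `cuda-aarch64` takes the first branch regardless of `os`, which is why the diff then blanks `pytorch_extra_install_requirements` when `os` is `linux-aarch64`.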

									
										35

.github/scripts/generate_ci_workflows.py
									
										vendored
									
				@ -5,11 +5,11 @@ import sys

				from dataclasses import asdict, dataclass, field

				from pathlib import Path

				from typing import Dict, Iterable, List, Literal, Set

				from typing_extensions import TypedDict  # Python 3.11+

				import generate_binary_build_matrix  # type: ignore[import]

				import jinja2

				from typing_extensions import TypedDict  # Python 3.11+

				Arch = Literal["windows", "linux", "macos"]

				@ -60,7 +60,7 @@ class BinaryBuildWorkflow:

				    branches: str = "nightly"

				    # Mainly for macos

				    cross_compile_arm64: bool = False

				    macos_runner: str = "macos-12-xl"

				    macos_runner: str = "macos-14-xlarge"

				    def __post_init__(self) -> None:

				        if self.abi_version:

				@ -95,6 +95,7 @@ class OperatingSystem:

				    MACOS = "macos"

				    MACOS_ARM64 = "macos-arm64"

				    LINUX_AARCH64 = "linux-aarch64"

				    LINUX_S390X = "linux-s390x"

				LINUX_BINARY_BUILD_WORFKLOWS = [

				@ -156,7 +157,7 @@ LINUX_BINARY_SMOKE_WORKFLOWS = [

				        package_type="manywheel",

				        build_configs=generate_binary_build_matrix.generate_wheels_matrix(

				            OperatingSystem.LINUX,

				            arches=["11.8", "12.1"],

				            arches=["11.8", "12.1", "12.4"],

				            python_versions=["3.8"],

				        ),

				        branches="main",

				@ -284,7 +285,7 @@ MACOS_BINARY_BUILD_WORKFLOWS = [

				            libtorch_variants=["shared-with-deps"],

				        ),

				        cross_compile_arm64=False,

				        macos_runner="macos-13-xlarge",

				        macos_runner="macos-14-xlarge",

				        ciflow_config=CIFlowConfig(

				            labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH},

				            isolated_workflow=True,

				@ -297,7 +298,7 @@ MACOS_BINARY_BUILD_WORKFLOWS = [

				            OperatingSystem.MACOS_ARM64

				        ),

				        cross_compile_arm64=False,

				        macos_runner="macos-13-xlarge",

				        macos_runner="macos-14-xlarge",

				        ciflow_config=CIFlowConfig(

				            labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},

				            isolated_workflow=True,

				@ -307,7 +308,7 @@ MACOS_BINARY_BUILD_WORKFLOWS = [

				        os=OperatingSystem.MACOS_ARM64,

				        package_type="conda",

				        cross_compile_arm64=False,

				        macos_runner="macos-13-xlarge",

				        macos_runner="macos-14-xlarge",

				        build_configs=generate_binary_build_matrix.generate_conda_matrix(

				            OperatingSystem.MACOS_ARM64

				        ),

				@ -332,6 +333,20 @@ AARCH64_BINARY_BUILD_WORKFLOWS = [

				    ),

				]

				S390X_BINARY_BUILD_WORKFLOWS = [

				    BinaryBuildWorkflow(

				        os=OperatingSystem.LINUX_S390X,

				        package_type="manywheel",

				        build_configs=generate_binary_build_matrix.generate_wheels_matrix(

				            OperatingSystem.LINUX_S390X

				        ),

				        ciflow_config=CIFlowConfig(

				            labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},

				            isolated_workflow=True,

				        ),

				    ),

				]

				def main() -> None:

				    jinja_env = jinja2.Environment(

				@ -350,6 +365,10 @@ def main() -> None:

				            jinja_env.get_template("linux_binary_build_workflow.yml.j2"),

				            AARCH64_BINARY_BUILD_WORKFLOWS,

				        ),

				        (

				            jinja_env.get_template("linux_binary_build_workflow.yml.j2"),

				            S390X_BINARY_BUILD_WORKFLOWS,

				        ),

				        (

				            jinja_env.get_template("linux_binary_build_workflow.yml.j2"),

				            LINUX_BINARY_SMOKE_WORKFLOWS,

				@ -378,7 +397,9 @@ def main() -> None:

				    for template, workflows in template_and_workflows:

				        # added Iterable check to appease the mypy gods

				        if not isinstance(workflows, Iterable):

				            raise Exception(f"How is workflows not iterable? {workflows}")

				            raise Exception(  # noqa: TRY002

				                f"How is workflows not iterable? {workflows}"

				            )  # noqa: TRY002

				        for workflow in workflows:

				            workflow.generate_workflow_file(workflow_template=template)

									
										14

.github/scripts/generate_docker_release_matrix.py
									
										vendored
									
				@ -21,6 +21,8 @@ DOCKER_IMAGE_TYPES = ["runtime", "devel"]

				def generate_docker_matrix() -> Dict[str, List[Dict[str, str]]]:

				    ret: List[Dict[str, str]] = []

				    # CUDA amd64 Docker images are available as both runtime and devel while

				    # CPU arm64 image is only available as runtime.

				    for cuda, version in generate_binary_build_matrix.CUDA_ARCHES_FULL_VERSION.items():

				        for image in DOCKER_IMAGE_TYPES:

				            ret.append(

				@ -31,9 +33,19 @@ def generate_docker_matrix() -> Dict[str, List[Dict[str, str]]]:

				                        cuda

				                    ],

				                    "image_type": image,

				                    "platform": "linux/arm64,linux/amd64",

				                    "platform": "linux/amd64",

				                }

				            )

				    ret.append(

				        {

				            "cuda": "cpu",

				            "cuda_full_version": "",

				            "cudnn_version": "",

				            "image_type": "runtime",

				            "platform": "linux/arm64",

				        }

				    )

				    return {"include": ret}
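The shape of the resulting matrix can be sketched in isolation. This is a rough illustration, not the script itself: the version tables are copied from the `generate_binary_build_matrix.py` diff above, and the keys of the CUDA entries (partly elided in the hunk) are assumed to match the keys of the CPU entry.

```python
from typing import Dict, List

# Copied from the generate_binary_build_matrix.py diff above.
CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1", "12.4": "12.4.0"}
CUDA_ARCHES_CUDNN_VERSION = {"11.8": "9", "12.1": "9", "12.4": "9"}
DOCKER_IMAGE_TYPES = ["runtime", "devel"]


def generate_docker_matrix() -> Dict[str, List[Dict[str, str]]]:
    ret: List[Dict[str, str]] = []
    # CUDA images: amd64 only, in both runtime and devel flavours.
    for cuda, version in CUDA_ARCHES_FULL_VERSION.items():
        for image in DOCKER_IMAGE_TYPES:
            ret.append(
                {
                    # Assumed key names, mirroring the CPU entry below.
                    "cuda": cuda,
                    "cuda_full_version": version,
                    "cudnn_version": CUDA_ARCHES_CUDNN_VERSION[cuda],
                    "image_type": image,
                    "platform": "linux/amd64",
                }
            )
    # CPU image: arm64 only, runtime flavour only.
    ret.append(
        {
            "cuda": "cpu",
            "cuda_full_version": "",
            "cudnn_version": "",
            "image_type": "runtime",
            "platform": "linux/arm64",
        }
    )
    return {"include": ret}
```

Under these assumptions the matrix has seven entries: three CUDA versions times two image types, plus one CPU arm64 runtime image.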

									
										3

.github/scripts/get_workflow_job_id.py
									
										vendored
									
				@ -4,6 +4,7 @@

				import argparse

				import json

				import operator

				import os

				import re

				import sys

				@ -126,7 +127,7 @@ def find_job_id_name(args: Any) -> Tuple[str, str]:

				    # Sort the jobs list by start time, in descending order. We want to get the most

				    # recently scheduled job on the runner.

				    jobs.sort(key=lambda job: job["started_at"], reverse=True)

				    jobs.sort(key=operator.itemgetter("started_at"), reverse=True)

				    for job in jobs:

				        if job["runner_name"] == args.runner_name:
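The switch from a lambda to `operator.itemgetter` does not change behavior. A small sketch of the lookup follows, assuming a hypothetical helper `most_recent_job_for_runner` and assuming that `started_at` values are ISO 8601 strings in a single format and time zone, so that string order equals chronological order.

```python
import operator
from typing import Any, Dict, List, Optional


def most_recent_job_for_runner(
    jobs: List[Dict[str, Any]], runner_name: str
) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: newest job (by started_at) that ran on the runner."""
    # itemgetter("started_at") is equivalent to lambda job: job["started_at"].
    ordered = sorted(jobs, key=operator.itemgetter("started_at"), reverse=True)
    for job in ordered:
        if job["runner_name"] == runner_name:
            return job
    return None
```

Unlike `jobs.sort(...)` in the script, `sorted` leaves the caller's list untouched; that is a choice made for this sketch only.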

									
										99

.github/scripts/get_workflow_type.py
									
										vendored
									
										Normal file
									
				@ -0,0 +1,99 @@

				import json

				from argparse import ArgumentParser

				from typing import Any

				from github import Auth, Github

				from github.Issue import Issue

				WORKFLOW_TYPE_LABEL = "label"

				WORKFLOW_TYPE_RG = "rg"

				WORKFLOW_TYPE_BOTH = "both"

				def parse_args() -> Any:

				    parser = ArgumentParser("Get dynamic rollout settings")

				    parser.add_argument("--github-token", type=str, required=True, help="GitHub token")

				    parser.add_argument(

				        "--github-repo",

				        type=str,

				        required=False,

				        default="pytorch/test-infra",

				        help="GitHub repo to get the issue",

				    )

				    parser.add_argument(

				        "--github-issue", type=int, required=True, help="GitHub issue number"

				    )

				    parser.add_argument(

				        "--github-user", type=str, required=True, help="GitHub username"

				    )

				    parser.add_argument(

				        "--github-branch", type=str, required=True, help="Current GitHub branch"

				    )

				    return parser.parse_args()

				def get_gh_client(github_token: str) -> Github:

				    auth = Auth.Token(github_token)

				    return Github(auth=auth)

				def get_issue(gh: Github, repo: str, issue_num: int) -> Issue:

				    repo = gh.get_repo(repo)

				    return repo.get_issue(number=issue_num)

				def is_exception_branch(branch: str) -> bool:

				    return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}

				def get_workflow_type(issue: Issue, username: str) -> str:

				    user_list = issue.get_comments()[0].body.split("\r\n")

				    try:

				        run_option = issue.get_comments()[1].body.split("\r\n")[0]

				    except Exception as e:

				        run_option = "single"

				    if user_list[0] == "!":

				        # Use old runners for everyone

				        return WORKFLOW_TYPE_LABEL

				    elif user_list[1] == "*":

				        if run_option == WORKFLOW_TYPE_BOTH:

				            # Use ARC runners and old runners for everyone

				            return WORKFLOW_TYPE_BOTH

				        else:

				            # Use only ARC runners for everyone

				            return WORKFLOW_TYPE_RG

				    elif username in user_list:

				        if run_option == WORKFLOW_TYPE_BOTH:

				            # Use ARC runners and old runners for a specific user

				            return WORKFLOW_TYPE_BOTH

				        else:

				            # Use only ARC runners for a specific user

				            return WORKFLOW_TYPE_RG

				    else:

				        # Use old runners by default

				        return WORKFLOW_TYPE_LABEL

				def main() -> None:

				    args = parse_args()

				    if is_exception_branch(args.github_branch):

				        output = {"workflow_type": WORKFLOW_TYPE_LABEL}

				    else:

				        try:

				            gh = get_gh_client(args.github_token)

				            issue = get_issue(gh, args.github_repo, args.github_issue)

				            output = {"workflow_type": get_workflow_type(issue, args.github_user)}

				        except Exception as e:

				            output = {"workflow_type": WORKFLOW_TYPE_LABEL}

				    json_output = json.dumps(output)

				    print(json_output)

				if __name__ == "__main__":

				    main()
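
The selection logic in `get_workflow_type` can be exercised without a GitHub client. The following is a minimal sketch, assuming the two issue comments are passed in as plain strings. `decide_workflow_type` is a hypothetical helper, not part of the script, and the `len(user_list) > 1` guard is an addition that the original code does not have (there, a one-line first comment raises `IndexError`, which `main()` catches and maps to `label`).

```python
from typing import Optional

WORKFLOW_TYPE_LABEL = "label"
WORKFLOW_TYPE_RG = "rg"
WORKFLOW_TYPE_BOTH = "both"


def decide_workflow_type(
    first_comment: str, second_comment: Optional[str], username: str
) -> str:
    # Hypothetical stand-in for get_workflow_type() that takes comment bodies
    # directly instead of a PyGithub Issue object.
    user_list = first_comment.split("\r\n")
    # A missing second comment falls back to "single", as in the script.
    run_option = (
        second_comment.split("\r\n")[0] if second_comment is not None else "single"
    )

    if user_list[0] == "!":
        # Old runners for everyone
        return WORKFLOW_TYPE_LABEL

    # Assumption: guard against a one-line comment; the original indexes directly.
    everyone = len(user_list) > 1 and user_list[1] == "*"
    if everyone or username in user_list:
        # ARC runners, optionally together with the old runners
        if run_option == WORKFLOW_TYPE_BOTH:
            return WORKFLOW_TYPE_BOTH
        return WORKFLOW_TYPE_RG

    # Old runners by default
    return WORKFLOW_TYPE_LABEL
```

The comment bodies used to try this out (for example `"users\r\n@alice"`) are made-up data; the real format of the rollout issue is not shown in this diff.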

BIN
.github/scripts/gql_mocks.json.gz vendored


Binary file not shown.

									
										3

.github/scripts/lintrunner.sh
									
										vendored
									
												
				@ -6,6 +6,9 @@ CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")

				eval "$(command conda 'shell.bash' 'hook' 2> /dev/null)"

				conda activate "${CONDA_ENV}"

				# Use uv to speed up lintrunner init

				python3 -m pip install uv==0.1.45

				CACHE_DIRECTORY="/tmp/.lintbin"

				# Try to recover the cached binaries

				if [[ -d "${CACHE_DIRECTORY}" ]]; then

									
										29

.github/scripts/pytest_caching_utils.py
									
										vendored
									
												
				@ -18,6 +18,7 @@ PYTEST_CACHE_KEY_PREFIX = "pytest_cache"

				PYTEST_CACHE_DIR_NAME = ".pytest_cache"

				BUCKET = "gha-artifacts"

				LASTFAILED_FILE_PATH = Path("v/cache/lastfailed")

				TD_HEURISTIC_PREVIOUSLY_FAILED_ADDITIONAL = "previous_failures_additional.json"

				# Temp folders

				ZIP_UPLOAD = "zip-upload"

				@ -191,6 +192,10 @@ def _merge_pytest_caches(

				        pytest_cache_dir_to_merge_from, pytest_cache_dir_to_merge_into

				    )

				    _merge_additional_failures_files(

				        pytest_cache_dir_to_merge_from, pytest_cache_dir_to_merge_into

				    )

				def _merge_lastfailed_files(source_pytest_cache: Path, dest_pytest_cache: Path) -> None:

				    # Simple cases where one of the files doesn't exist

				@ -232,3 +237,27 @@ def _merged_lastfailed_content(

				            del to_lastfailed[""]

				    return to_lastfailed

				def _merge_additional_failures_files(

				    source_pytest_cache: Path, dest_pytest_cache: Path

				) -> None:

				    # Simple cases where one of the files doesn't exist

				    source_lastfailed_file = (

				        source_pytest_cache / TD_HEURISTIC_PREVIOUSLY_FAILED_ADDITIONAL

				    )

				    dest_lastfailed_file = dest_pytest_cache / TD_HEURISTIC_PREVIOUSLY_FAILED_ADDITIONAL

				    if not source_lastfailed_file.exists():

				        return

				    if not dest_lastfailed_file.exists():

				        copy_file(source_lastfailed_file, dest_lastfailed_file)

				        return

				    # Both files exist, so we need to merge them

				    from_lastfailed = load_json_file(source_lastfailed_file)

				    to_lastfailed = load_json_file(dest_lastfailed_file)

				    merged_content = list(set(from_lastfailed + to_lastfailed))

				    # Save the results

				    write_json_file(dest_lastfailed_file, merged_content)
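
The three cases handled by `_merge_additional_failures_files` (no source, no destination, both present) can be sketched with the standard library alone. This is an illustrative version under stated assumptions: `merge_json_lists` is a hypothetical name, `shutil` and `json` stand in for the module's own `copy_file`, `load_json_file`, and `write_json_file` helpers, and the result is sorted for reproducibility, whereas the original `list(set(...))` leaves the order unspecified.

```python
import json
import shutil
from pathlib import Path


def merge_json_lists(source: Path, dest: Path) -> None:
    # Nothing to merge if the source file is absent.
    if not source.exists():
        return

    # If only the source exists, a plain copy is enough.
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, dest)
        return

    # Both files exist: take the union of the two lists of test names.
    from_items = json.loads(source.read_text())
    to_items = json.loads(dest.read_text())
    # Assumption: sorting is added here for a stable result; the original
    # code writes list(set(...)) without sorting.
    merged = sorted(set(from_items + to_items))
    dest.write_text(json.dumps(merged))
```

Sorting only matters if the file contents are compared or diffed later; for membership checks the unordered form is equivalent.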

									
										28

.github/scripts/sync_distributed_folder_prototype.sh
									
										vendored
									
										Executable file
									
												
				@ -0,0 +1,28 @@

				#!/bin/bash

				set -eoux pipefail

				SYNC_BRANCH=fbcode/pytorch-stable-prototype

				git config user.email "fake@example.com"

				git config user.name  "PyTorch Stable Bot"

				git fetch origin main

				git fetch origin "$SYNC_BRANCH"

				git checkout "$SYNC_BRANCH"

				for SHA in $(git log 4333e122d4b74cdf84351ed2907045c6a767b4cd..origin/main --pretty="%h" --reverse -- torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed)

				do

				    # `git merge-base --is-ancestor` exits with code 0 if the given SHA is an ancestor, and non-0 otherwise

				    if git merge-base --is-ancestor $SHA HEAD || [[ $(git log --grep="(cherry picked from commit $SHA") ]]

				    then

				        echo "Skipping $SHA"

				        continue

				    fi

				    echo "Copying $SHA"

				    git cherry-pick -x "$SHA"

				done

				if [[ "${WITH_PUSH}" == true ]]; then

				  git push

				fi

									
										20

.github/scripts/td_llm_indexer.sh
									
										vendored
									
										Normal file
									
												
				@ -0,0 +1,20 @@

				#!/bin/bash

				set -euxo pipefail

				# Download requirements

				cd llm-target-determinator

				pip install -q -r requirements.txt

				cd ../codellama

				pip install -e .

				# Run indexer

				cd ../llm-target-determinator

				torchrun \

				    --standalone \

				    --nnodes=1 \

				    --nproc-per-node=1 \

				    indexer.py \

				    --experiment-name indexer-files \

				    --granularity FILE

									
										60

.github/scripts/test_filter_test_configs.py
									
										vendored
									
												
				@ -9,6 +9,7 @@ from unittest import main, mock, TestCase

				import yaml

				from filter_test_configs import (

				    filter,

				    filter_selected_test_configs,

				    get_labels,

				    mark_unstable_jobs,

				    parse_reenabled_issues,

				@ -17,7 +18,6 @@ from filter_test_configs import (

				    remove_disabled_jobs,

				    set_periodic_modes,

				    SUPPORTED_PERIODICAL_MODES,

				    VALID_TEST_CONFIG_LABELS,

				)

				@ -273,13 +273,13 @@ class TestConfigFilter(TestCase):

				        testcases = [

				            {

				                "test_matrix": '{include: [{config: "default", runner: "linux"}]}',

				                "expected": '{"include": [{"config": "default", "runner": "linux"}]}',

				                "description": "No match, keep the same test matrix",

				                "expected": '{"include": []}',

				                "description": "Request test-config/cfg but the test matrix doesn't have it",

				            },

				            {

				                "test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "plain-cfg"}]}',

				                "expected": '{"include": [{"config": "default", "runner": "linux"}, {"config": "plain-cfg"}]}',

				                "description": "No match because there is no prefix or suffix, keep the same test matrix",

				                "expected": '{"include": []}',

				                "description": "A valid test config label needs to start with test-config/",

				            },

				            {

				                "test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", shard: 1}]}',

				@ -294,9 +294,8 @@ class TestConfigFilter(TestCase):

				            )

				            self.assertEqual(case["expected"], json.dumps(filtered_test_matrix))

				    def test_filter_with_valid_label(self) -> None:

				    def test_filter_with_test_config_label(self) -> None:

				        mocked_labels = {f"{PREFIX}cfg", "ciflow/trunk"}

				        VALID_TEST_CONFIG_LABELS.add(f"{PREFIX}cfg")

				        testcases = [

				            {

				@ -317,6 +316,51 @@ class TestConfigFilter(TestCase):

				            )

				            self.assertEqual(case["expected"], json.dumps(filtered_test_matrix))

				    def test_filter_selected_test_configs(self) -> None:

				        testcases = [

				            {

				                "test_matrix": '{include: [{config: "default"}]}',

				                "selected_test_configs": "",

				                "expected": '{"include": [{"config": "default"}]}',

				                "description": "No selected test configs",

				            },

				            {

				                "test_matrix": '{include: [{config: "default"}]}',

				                "selected_test_configs": "foo",

				                "expected": '{"include": []}',

				                "description": "A different test config is selected",

				            },

				            {

				                "test_matrix": '{include: [{config: "default"}]}',

				                "selected_test_configs": "foo, bar",

				                "expected": '{"include": []}',

				                "description": "A different set of test configs is selected",

				            },

				            {

				                "test_matrix": '{include: [{config: "default"}]}',

				                "selected_test_configs": "foo, bar,default",

				                "expected": '{"include": [{"config": "default"}]}',

				                "description": "One of the test configs is selected",

				            },

				            {

				                "test_matrix": '{include: [{config: "default"}, {config: "bar"}]}',

				                "selected_test_configs": "foo, bar,Default",

				                "expected": '{"include": [{"config": "default"}, {"config": "bar"}]}',

				                "description": "Several test configs are selected",

				            },

				        ]

				        for case in testcases:

				            selected_test_configs = {

				                v.strip().lower()

				                for v in case["selected_test_configs"].split(",")

				                if v.strip()

				            }

				            filtered_test_matrix = filter_selected_test_configs(

				                yaml.safe_load(case["test_matrix"]), selected_test_configs

				            )

				            self.assertEqual(case["expected"], json.dumps(filtered_test_matrix))

				    def test_set_periodic_modes(self) -> None:

				        testcases: List[Dict[str, str]] = [

				            {

				@ -641,6 +685,7 @@ class TestConfigFilter(TestCase):

				            ci_verbose_test_logs: bool = False,

				            ci_no_test_timeout: bool = False,

				            ci_no_td: bool = False,

				            ci_td_distributed: bool = False,

				            is_unstable: bool = False,

				            reenabled_issues: str = "",

				        ) -> str:

				@ -649,6 +694,7 @@ class TestConfigFilter(TestCase):

				                f"ci-verbose-test-logs={ci_verbose_test_logs}\n"

				                f"ci-no-test-timeout={ci_no_test_timeout}\n"

				                f"ci-no-td={ci_no_td}\n"

				                f"ci-td-distributed={ci_td_distributed}\n"

				                f"is-unstable={is_unstable}\n"

				                f"reenabled-issues={reenabled_issues}\n"

				            )
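
The behavior that `test_filter_selected_test_configs` expects can be summarized in a few lines. The implementation of `filter_selected_test_configs` is not part of this diff, so the following is only a sketch inferred from the test table: `keep_selected_configs` and `parse_selection` are hypothetical names, and the real function may differ in details that the tests do not cover.

```python
from typing import Any, Dict, List, Set


def parse_selection(raw: str) -> Set[str]:
    # Same normalization as in the test: split on commas, trim, lowercase,
    # and drop empty entries.
    return {v.strip().lower() for v in raw.split(",") if v.strip()}


def keep_selected_configs(
    test_matrix: Dict[str, List[Any]], selected: Set[str]
) -> Dict[str, List[Any]]:
    # An empty selection means "no filtering": the matrix is returned as is.
    if not selected:
        return test_matrix

    # Otherwise keep only the entries whose config name was selected.
    return {
        "include": [
            entry
            for entry in test_matrix.get("include", [])
            if entry.get("config", "") in selected
        ]
    }
```

Lowercasing the selection is why the `"foo, bar,Default"` case in the test still matches the `default` config.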

									
										74

.github/scripts/test_trymerge.py
									
										vendored
									
												
				@ -205,7 +205,6 @@ def mocked_read_merge_rules(repo: Any, org: str, project: str) -> List[MergeRule

				            approved_by=["pytorch/metamates", "ngimel"],

				            mandatory_checks_name=[

				                "Lint",

				                "Facebook CLA Check",

				                "pull / linux-xenial-cuda11.3-py3.7-gcc7 / build",

				            ],

				            ignore_flaky_failures=True,

				@ -398,7 +397,7 @@ class TestTryMerge(TestCase):

				    def test_gql_retrieve_checksuites(self, *args: Any) -> None:

				        "Fetch comments and conclusions for PR with 60 commits"

				        pr = GitHubPR("pytorch", "pytorch", 94787)

				        self.assertEqual(len(pr.get_checkrun_conclusions()), 183)

				        self.assertEqual(len(pr.get_checkrun_conclusions()), 182)

				    def test_team_members(self, *args: Any) -> None:

				        "Test fetching team members works"

				@ -742,6 +741,30 @@ class TestBypassFailures(TestCase):

				        self.assertTrue(len(failed) == 0)

				        self.assertTrue(len(ignorable["UNSTABLE"]) == 1)

				        # Add another test case where there is no unstable keyword in the job name, but

				        # the job has already been marked as unstable

				        pr = GitHubPR("pytorch", "executorch", 3318)

				        checks = pr.get_checkrun_conclusions()

				        checks = get_classifications(

				            pr.pr_num,

				            pr.project,

				            checks,

				            [],

				        )

				        print(checks)

				        workflow_name = "test-llama-app"

				        job_name = "mobile-job (android)"

				        self.assertTrue(

				            checks[f"Android / {workflow_name} / {job_name}"].classification

				            == "UNSTABLE"

				        )

				        pending, failed, ignorable = categorize_checks(

				            checks, list(checks.keys()), ok_failed_checks_threshold=1

				        )

				        self.assertTrue(len(pending) == 0)

				        self.assertTrue(len(failed) == 0)

				        self.assertTrue(len(ignorable["UNSTABLE"]) == 1)

				    def test_get_classifications_broken_trunk(self, *args: Any) -> None:

				        # The mock merge base is the actual value returned by gh_fetch_merge_base

				        test_cases = [

				@ -750,13 +773,13 @@ class TestBypassFailures(TestCase):

				                # than the one on the base commit. This should still count as broken trunk

				                "pr_num": 104214,

				                "related_failure_count": 0,

				                "unrelated_failure_count": 1,

				                "flaky_or_broken_trunk": 1,

				            },

				            {

				                # This PR had one broken trunk failure and it used ghstack

				                "pr_num": 105145,

				                "related_failure_count": 0,

				                "unrelated_failure_count": 1,

				                "flaky_or_broken_trunk": 1,

				            },

				            {

				                # The failure on the merge base was retried successfully and

				@ -765,20 +788,20 @@ class TestBypassFailures(TestCase):

				                # be used to detect broken trunk

				                "pr_num": 107160,

				                "related_failure_count": 0,

				                "unrelated_failure_count": 4,

				                "flaky_or_broken_trunk": 1,

				            },

				            {

				                # This PR used Dr.CI broken trunk classification

				                "pr_num": 111253,

				                "related_failure_count": 1,

				                "unrelated_failure_count": 2,

				                "flaky_or_broken_trunk": 1,

				            },

				        ]

				        for case in test_cases:

				            pr_num = case["pr_num"]

				            related_failure_count = case["related_failure_count"]

				            unrelated_failure_count = case["unrelated_failure_count"]

				            flaky_or_broken_trunk = case["flaky_or_broken_trunk"]

				            pr = GitHubPR("pytorch", "pytorch", pr_num)

				            checks = pr.get_checkrun_conclusions()

				@ -800,7 +823,7 @@ class TestBypassFailures(TestCase):

				            )

				            self.assertTrue(len(pending) == 0)

				            self.assertTrue(

				                len(failed) == unrelated_failure_count + related_failure_count

				                len(failed) == flaky_or_broken_trunk + related_failure_count

				            )

				    def test_ignore_current(self, *args: Any) -> None:

				@ -833,6 +856,41 @@ class TestBypassFailures(TestCase):

				        self.assertTrue(len(ignorable["FLAKY"]) == 4)

				        self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 2)

				    def test_get_classifications_wrong_workflow_name(self, *args: Any) -> None:

				        pr = GitHubPR("pytorch", "pytorch", 123104)

				        checks = pr.get_checkrun_conclusions()

				        check_name = "linux-binary-conda / conda-py3_8-cuda11_8-build / build"

				        check_name_workflow_path = ".github/workflows/generated-linux-binary-conda-nightly.yml / conda-py3_8-cuda11_8-build / build"

				        # Mock a check where the workflow name uses the full path

				        checks[check_name_workflow_path] = JobCheckState(

				            check_name_workflow_path,

				            checks[check_name].url,

				            checks[check_name].status,

				            checks[check_name].classification,

				            checks[check_name].job_id,

				            checks[check_name].title,

				            checks[check_name].summary,

				        )

				        del checks[check_name]

				        checks = get_classifications(

				            pr.pr_num,

				            pr.project,

				            checks,

				            [],

				        )

				        pending, failed, ignorable = categorize_checks(

				            checks,

				            list(checks.keys()),

				        )

				        self.assertTrue(len(pending) == 0)

				        self.assertTrue(len(failed) == 0)

				        self.assertTrue(len(ignorable["FLAKY"]) == 1)

				        self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 0)

				    @mock.patch("trymerge.read_merge_rules", side_effect=xla_merge_rules)

				    def test_dont_ignore_flaky_failures(self, *args: Any) -> None:

				        """

									
										94

.github/scripts/trymerge.py
									
										vendored
									
												
				@ -123,6 +123,7 @@ fragment PRCheckSuites on CheckSuiteConnection {

				        workflow {

				          name

				        }

				        databaseId

				        url

				      }

				      checkRuns(first: 50) {

				@ -1398,7 +1399,10 @@ def find_matching_merge_rule(

				        )

				        required_checks = list(

				            filter(

				                lambda x: "EasyCLA" in x or not skip_mandatory_checks, mandatory_checks

				                lambda x: ("EasyCLA" in x)

				                or ("Facebook CLA Check" in x)

				                or not skip_mandatory_checks,

				                mandatory_checks,

				            )

				        )

				        pending_checks, failed_checks, _ = categorize_checks(

				@ -1409,6 +1413,13 @@ def find_matching_merge_rule(

				            else 0,

				        )

				        # categorize_checks assumes all tests are required if required_checks is empty.

				        # this is a workaround as we want to keep that behavior for categorize_checks

				        # generally.

				        if not required_checks:

				            pending_checks = []

				            failed_checks = []

				        hud_link = f"https://hud.pytorch.org/{pr.org}/{pr.project}/commit/{pr.last_commit()['oid']}"

				        if len(failed_checks) > 0:

				            if reject_reason_score < 30000:

				@ -1608,28 +1619,59 @@ def remove_job_name_suffix(name: str, replacement: str = ")") -> str:

				def is_broken_trunk(

				    name: str,

				    check: JobCheckState,

				    drci_classifications: Any,

				) -> bool:

				    if not name or not drci_classifications:

				    if not check or not drci_classifications:

				        return False

				    name = check.name

				    job_id = check.job_id

				    # Consult the list of broken trunk failures from Dr.CI

				    return any(

				        name == broken_trunk["name"]

				        (name == broken_trunk["name"]) or (job_id and job_id == broken_trunk["id"])

				        for broken_trunk in drci_classifications.get("BROKEN_TRUNK", [])

				    )

				def is_flaky(

				    name: str,

				def is_unstable(

				    check: JobCheckState,

				    drci_classifications: Any,

				) -> bool:

				    if not name or not drci_classifications:

				    if not check or not drci_classifications:

				        return False

				    name = check.name

				    job_id = check.job_id

				    # The job name has the unstable keyword. This is the original way to mark a job

				    # as unstable on HUD, Dr.CI, and trymerge

				    if "unstable" in name:

				        return True

				    # Consult the list of unstable failures from Dr.CI

				    return any(

				        (name == unstable["name"] or (job_id and job_id == unstable["id"]))

				        for unstable in drci_classifications.get("UNSTABLE", [])

				    )

				def is_flaky(

				    check: JobCheckState,

				    drci_classifications: Any,

				) -> bool:

				    if not check or not drci_classifications:

				        return False

				    name = check.name

				    job_id = check.job_id

				    # Consult the list of flaky failures from Dr.CI

				    return any(name == flaky["name"] for flaky in drci_classifications.get("FLAKY", []))

				    return any(

				        (name == flaky["name"] or (job_id and job_id == flaky["id"]))

				        for flaky in drci_classifications.get("FLAKY", [])

				    )

				def is_invalid_cancel(

				@ -1702,7 +1744,7 @@ def get_classifications(

				        if check.status == "SUCCESS" or check.status == "NEUTRAL":

				            continue

				        if "unstable" in name:

				        if is_unstable(check, drci_classifications):

				            checks_with_classifications[name] = JobCheckState(

				                check.name,

				                check.url,

				@ -1716,7 +1758,7 @@ def get_classifications(

				        # NB: It's important to note that when it comes to ghstack and broken trunk classification,

				        # Dr.CI uses the base of the whole stack

				        if is_broken_trunk(name, drci_classifications):

				        if is_broken_trunk(check, drci_classifications):

				            checks_with_classifications[name] = JobCheckState(

				                check.name,

				                check.url,

				@ -1728,7 +1770,7 @@ def get_classifications(

				            )

				            continue

				        elif is_flaky(name, drci_classifications):

				        elif is_flaky(check, drci_classifications):

				            checks_with_classifications[name] = JobCheckState(

				                check.name,

				                check.url,

				@ -1985,10 +2027,8 @@ def categorize_checks(

				    pending_checks: List[Tuple[str, Optional[str], Optional[int]]] = []

				    failed_checks: List[Tuple[str, Optional[str], Optional[int]]] = []

				    # ok_failed_checks is used with ok_failed_checks_threshold while ignorable_failed_checks

				    # is used to keep track of all ignorable failures when saving the merge record on Rockset

				    ok_failed_checks: List[Tuple[str, Optional[str], Optional[int]]] = []

				    ignorable_failed_checks: Dict[str, List[Any]] = defaultdict(list)

				    # failed_checks_categorization is used to keep track of all ignorable failures when saving the merge record on Rockset

				    failed_checks_categorization: Dict[str, List[Any]] = defaultdict(list)

				    # If required_checks is not set or empty, consider all names are relevant

				    relevant_checknames = [

				@ -2016,36 +2056,38 @@ def categorize_checks(

				            continue

				        elif not is_passing_status(check_runs[checkname].status):

				            target = (

				                ignorable_failed_checks[classification]

				                failed_checks_categorization[classification]

				                if classification

				                in ("IGNORE_CURRENT_CHECK", "BROKEN_TRUNK", "FLAKY", "UNSTABLE")

				                else failed_checks

				            )

				            target.append((checkname, url, job_id))

				            if classification in ("BROKEN_TRUNK", "FLAKY", "UNSTABLE"):

				                ok_failed_checks.append((checkname, url, job_id))

				    flaky_or_broken_trunk = (

				        failed_checks_categorization["BROKEN_TRUNK"]

				        + failed_checks_categorization["FLAKY"]

				    )

				    if ok_failed_checks:

				    if flaky_or_broken_trunk:

				        warn(

				            f"The following {len(ok_failed_checks)} checks failed but were likely due to flakiness or broken trunk: "

				            + ", ".join([x[0] for x in ok_failed_checks])

				            f"The following {len(flaky_or_broken_trunk)} checks failed but were likely due to flakiness or broken trunk: "

				            + ", ".join([x[0] for x in flaky_or_broken_trunk])

				            + (

				                f" but this is greater than the threshold of {ok_failed_checks_threshold} so merge will fail"

				                if ok_failed_checks_threshold is not None

				                and len(ok_failed_checks) > ok_failed_checks_threshold

				                and len(flaky_or_broken_trunk) > ok_failed_checks_threshold

				                else ""

				            )

				        )

				    if (

				        ok_failed_checks_threshold is not None

				        and len(ok_failed_checks) > ok_failed_checks_threshold

				        and len(flaky_or_broken_trunk) > ok_failed_checks_threshold

				    ):

				        failed_checks = failed_checks + ok_failed_checks

				        failed_checks = failed_checks + flaky_or_broken_trunk

				    # The list of ignorable_failed_checks is returned so that it can be saved into the Rockset merge record

				    return (pending_checks, failed_checks, ignorable_failed_checks)

				    # The list of failed_checks_categorization is returned so that it can be saved into the Rockset merge record

				    return (pending_checks, failed_checks, failed_checks_categorization)

				def merge(
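
The `categorize_checks` hunk above changes which ignorable failures count against `ok_failed_checks_threshold`: after the change only `BROKEN_TRUNK` and `FLAKY` are counted, while `UNSTABLE` failures are recorded but no longer push a PR over the threshold. A minimal sketch of that rule follows, assuming failures arrive as simple `(name, classification)` pairs; `split_failures` is a hypothetical helper and leaves out the pending and required-check handling of the real function.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

IGNORABLE = ("IGNORE_CURRENT_CHECK", "BROKEN_TRUNK", "FLAKY", "UNSTABLE")


def split_failures(
    failures: List[Tuple[str, Optional[str]]],
    threshold: Optional[int],
) -> Tuple[List[str], Dict[str, List[str]]]:
    failed: List[str] = []
    categorized: Dict[str, List[str]] = defaultdict(list)

    for name, classification in failures:
        if classification in IGNORABLE:
            # Ignorable failures are tracked per classification.
            categorized[classification].append(name)
        else:
            failed.append(name)

    # Only these two categories are compared with the threshold.
    flaky_or_broken_trunk = categorized["BROKEN_TRUNK"] + categorized["FLAKY"]
    if threshold is not None and len(flaky_or_broken_trunk) > threshold:
        # Too many "ignorable" failures: treat them as real failures.
        failed = failed + flaky_or_broken_trunk

    return failed, categorized
```

With a threshold of `None` nothing is ever promoted to a real failure, which matches the `is not None` guard in the diff.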

									
										6

.github/scripts/tryrebase.py
									
										vendored
									
												
				@ -60,7 +60,7 @@ def rebase_onto(

				    repo._run_git("rebase", onto_branch, branch)

				    if repo.rev_parse(branch) == repo.rev_parse(onto_branch):

				        raise Exception(SAME_SHA_ERROR)

				        raise Exception(SAME_SHA_ERROR)  # noqa: TRY002

				    if dry_run:

				        push_result = repo._run_git("push", "--dry-run", "-f", remote_url, refspec)

				@ -100,7 +100,7 @@ def rebase_ghstack_onto(

				    repo._run_git("rebase", onto_branch, orig_ref)

				    if repo.rev_parse(orig_ref) == repo.rev_parse(onto_branch):

				        raise Exception(SAME_SHA_ERROR)

				        raise Exception(SAME_SHA_ERROR)  # noqa: TRY002

				    # steal the identity of the committer of the commit on the orig branch

				    email = repo._run_git("log", orig_ref, "--pretty=format:%ae", "-1")

				@ -126,7 +126,7 @@ def rebase_ghstack_onto(

				        print(push_result)

				        if ghstack_result.returncode != 0:

				            print(ghstack_result.stderr.decode("utf-8"))

				            raise Exception(f"\n```{push_result}```")

				            raise Exception(f"\n```{push_result}```")  # noqa: TRY002

				        # The contents of a successful push result should look like:

				        # Summary of changes (ghstack 0.6.0)

24

.github/templates/linux_binary_build_workflow.yml.j2 vendored


 @ -33,6 +33,8 @@ env:
   # Needed for conda builds
   {%- if "aarch64" in build_environment %}
   ALPINE_IMAGE: "arm64v8/alpine"
   {%- elif "s390x" in build_environment %}
   ALPINE_IMAGE: "docker.io/s390x/alpine"
   {%- else %}
   ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
   {%- endif %}
 @ -46,7 +48,7 @@ env:
   PYTORCH_FINAL_PACKAGE_DIR: /artifacts
   PYTORCH_ROOT: /pytorch
   SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
   SKIP_ALL_TESTS: 1
   SKIP_ALL_TESTS: 0
 !{{ common.concurrency(build_environment) }}
 jobs:
 @ -56,8 +58,11 @@ jobs:
     uses: ./.github/workflows/_binary-build-linux.yml
     with:!{{ upload.binary_env_as_input(config) }}
       {%- if "aarch64" in build_environment %}
       runs_on: linux.arm64.2xlarge
       runs_on: linux.arm64.m7g.4xlarge
       ALPINE_IMAGE: "arm64v8/alpine"
       {%- elif "s390x" in build_environment %}
       runs_on: linux.s390x
       ALPINE_IMAGE: "docker.io/s390x/alpine"
       {%- elif "conda" in build_environment and config["gpu_arch_type"] == "cuda" %}
       runs_on: linux.24xlarge
       {%- endif %}
 @ -66,12 +71,17 @@ jobs:
       {%- if config.pytorch_extra_install_requirements is defined and config.pytorch_extra_install_requirements|d('')|length > 0  %}
       PYTORCH_EXTRA_INSTALL_REQUIREMENTS: !{{ config.pytorch_extra_install_requirements }}
       {%- endif %}
       {%- if config["gpu_arch_type"] == "cuda-aarch64" %}
       timeout-minutes: 420
       {%- endif %}
     secrets:
       github-token: ${{ secrets.GITHUB_TOKEN }}
   {%- if config["gpu_arch_type"] != "cuda-aarch64" %}
   !{{ config["build_name"] }}-test:  # Testing
     if: ${{ github.repository_owner == 'pytorch' }}
     needs: !{{ config["build_name"] }}-build
 {%- if config["gpu_arch_type"] != "rocm" %}
     {%- if config["gpu_arch_type"] != "rocm" %}
     uses: ./.github/workflows/_binary-test-linux.yml
     with:!{{ upload.binary_env_as_input(config) }}
       build_name: !{{ config["build_name"] }}
 @ -79,6 +89,9 @@ jobs:
       {%- if "aarch64" in build_environment %}
       runs_on: linux.arm64.2xlarge
       ALPINE_IMAGE: "arm64v8/alpine"
       {%- elif "s390x" in build_environment %}
       runs_on: linux.s390x
       ALPINE_IMAGE: "docker.io/s390x/alpine"
       {%- elif config["gpu_arch_type"] == "rocm" %}
       runs_on: linux.rocm.gpu
       {%- elif config["gpu_arch_type"] == "cuda" %}
 @ -88,7 +101,7 @@ jobs:
       {%- endif %}
     secrets:
       github-token: ${{ secrets.GITHUB_TOKEN }}
 {%- else %}
     {%- else %}
     runs-on: linux.rocm.gpu
     timeout-minutes: !{{ common.timeout_minutes }}
     !{{ upload.binary_env(config) }}
 @ -113,7 +126,8 @@ jobs:
         uses: ./pytorch/.github/actions/test-pytorch-binary
       - name: Teardown ROCm
         uses: ./.github/actions/teardown-rocm
 {%- endif %}
     {%- endif %}
   {%- endif %}
 {%- if branches == "nightly" %}
   !{{ upload.upload_binaries(config) }}

2

.github/templates/macos_binary_build_workflow.yml.j2 vendored


 @ -48,7 +48,7 @@ env:
   BUILD_ENVIRONMENT: !{{ build_environment }}
   GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
   PR_NUMBER: ${{ github.event.pull_request.number }}
   SKIP_ALL_TESTS: 1
   SKIP_ALL_TESTS: 0
 {%- if cross_compile_arm64 %}
   CROSS_COMPILE_ARM64: 1
 {% endif %}

4

.github/templates/upload.yml.j2 vendored


 @ -57,7 +57,11 @@
       id-token: write
       contents: read
 {%- if has_test %}
     {%- if config["gpu_arch_type"] == "cuda-aarch64" %}
     needs: !{{ config["build_name"] }}-build
     {%- else %}
     needs: !{{ config["build_name"] }}-test
     {%- endif %}
 {%- else %}
     needs: !{{ config["build_name"] }}-build
 {%- endif %}

									
										7

.github/workflows/_bazel-build-test.yml
									
										vendored
									
												
				@ -86,9 +86,14 @@ jobs:

				        with:

				          docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}

				      - name: Check if in an ARC runner

				        shell: bash

				        id: check_arc_runner

				        run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"

				      - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG

				        uses: pytorch/test-infra/.github/actions/setup-nvidia@main

				        if: ${{ inputs.cuda-version != 'cpu' }}

				        if: ${{ inputs.cuda-version != 'cpu' && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}

				      - name: Output disk space left

				        run: |

									
39 .github/workflows/_binary-build-linux.yml vendored
												
@@ -12,10 +12,15 @@ on:
         type: string
         description: The build environment
       runs_on:
-          required: false
-          default: linux.12xlarge
-          type: string
-          description: Hardware to run this "build"job on, linux.12xlarge or linux.arm64.2xlarge.
+        required: false
+        default: linux.12xlarge
+        type: string
+        description: Hardware to run this "build"job on, linux.12xlarge or linux.arm64.2xlarge.
+      timeout-minutes:
+        required: false
+        default: 210
+        type: number
+        description: timeout for the job
       ALPINE_IMAGE:
         required: false
         type: string
@@ -78,7 +83,7 @@ on:
 jobs:
   build:
     runs-on: ${{ inputs.runs_on }}
-    timeout-minutes: 180
+    timeout-minutes: ${{ inputs.timeout-minutes }}
     env:
       PYTORCH_ROOT: ${{ inputs.PYTORCH_ROOT }}
       BUILDER_ROOT: ${{ inputs.BUILDER_ROOT }}
@@ -139,6 +144,7 @@ jobs:
         run: env
       - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
+        if: inputs.build_environment != 'linux-s390x-binary-manywheel'
         uses: pytorch/test-infra/.github/actions/setup-ssh@main
         continue-on-error: true
         with:
@@ -147,12 +153,14 @@ jobs:
       - name: Checkout PyTorch
         uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
         with:
-          no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' }}
+          no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' || inputs.build_environment == 'linux-s390x-binary-manywheel' }}
       - name: Setup Linux
+        if: inputs.build_environment != 'linux-s390x-binary-manywheel'
         uses: ./.github/actions/setup-linux
       - name: Chown workspace
+        if: inputs.build_environment != 'linux-s390x-binary-manywheel'
         uses: ./.github/actions/chown-workspace
         with:
           ALPINE_IMAGE: ${{ inputs.ALPINE_IMAGE }}
@@ -165,7 +173,7 @@ jobs:
           rm -rf "${GITHUB_WORKSPACE}"
           mkdir "${GITHUB_WORKSPACE}"
-          if [[ ${{ inputs.build_environment }} == 'linux-aarch64-binary-manywheel' ]]; then
+          if [[ ${{ inputs.build_environment }} == 'linux-aarch64-binary-manywheel' ]] || [[ ${{ inputs.build_environment }} == 'linux-s390x-binary-manywheel' ]] ; then
             rm -rf "${RUNNER_TEMP}/artifacts"
             mkdir "${RUNNER_TEMP}/artifacts"
           fi
@@ -212,7 +220,7 @@ jobs:
             ]}
       - name: Pull Docker image
-        if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' }}
+        if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' && inputs.build_environment != 'linux-s390x-binary-manywheel' }}
         uses: pytorch/test-infra/.github/actions/pull-docker-image@main
         with:
           docker-image: ${{ inputs.DOCKER_IMAGE }}
@@ -254,7 +262,7 @@ jobs:
           fi
       - name: Chown artifacts
-        if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' }}
+        if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' && inputs.build_environment != 'linux-s390x-binary-manywheel' }}
         shell: bash
         run: |
           # Ensure the working directory gets chowned back to the current user
@@ -269,11 +277,20 @@ jobs:
             ${{ runner.temp }}/artifacts/*
       - name: Teardown Linux
-        if: always()
+        if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'
         uses: pytorch/test-infra/.github/actions/teardown-linux@main
       - name: Chown workspace
-        if: always()
+        if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'
         uses: ./pytorch/.github/actions/chown-workspace
         with:
           ALPINE_IMAGE: ${{ inputs.ALPINE_IMAGE }}
+      - name: Cleanup docker
+        if: always() && inputs.build_environment == 'linux-s390x-binary-manywheel'
+        shell: bash
+        run: |
+          # on s390x stop the container for clean worker stop
+          # ignore expansion of "docker ps -q" since it could be empty
+          # shellcheck disable=SC2046
+          docker stop $(docker ps -q) || true
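
Note on the workspace cleanup step in _binary-build-linux.yml: the chained `[[ ... ]] || [[ ... ]]` test selects the build environments that get a fresh `${RUNNER_TEMP}/artifacts` directory. If more environments are added later, a `case` statement is one way to keep that list readable. The sketch below is an illustration under that assumption only; `uses_runner_temp_artifacts` is a hypothetical function name and is not defined anywhere in the PyTorch workflows.

```shell
# Illustrative only: uses_runner_temp_artifacts is a hypothetical helper that
# expresses the same two-environment check as the workflow's [[ ... ]] chain.
# Returns 0 (true) for environments that recreate ${RUNNER_TEMP}/artifacts.
uses_runner_temp_artifacts() {
  case "$1" in
    linux-aarch64-binary-manywheel|linux-s390x-binary-manywheel)
      return 0
      ;;
    *)
      return 1
      ;;
  esac
}

# Usage: report the decision for one environment name.
if uses_runner_temp_artifacts "linux-s390x-binary-manywheel"; then
  echo "would recreate the artifacts directory"
fi
```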

									
11 .github/workflows/_binary-test-linux.yml vendored
												
@@ -127,6 +127,7 @@ jobs:
           } >> "${GITHUB_ENV} }}"
       - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
+        if: inputs.build_environment != 'linux-s390x-binary-manywheel'
         uses: pytorch/test-infra/.github/actions/setup-ssh@main
         continue-on-error: true
         with:
@@ -136,12 +137,14 @@ jobs:
       - name: Checkout PyTorch
         uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
         with:
-          no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' }}
+          no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' || inputs.build_environment == 'linux-s390x-binary-manywheel' }}
       - name: Setup Linux
+        if: inputs.build_environment != 'linux-s390x-binary-manywheel'
         uses: ./.github/actions/setup-linux
       - name: Chown workspace
+        if: inputs.build_environment != 'linux-s390x-binary-manywheel'
         uses: ./.github/actions/chown-workspace
         with:
           ALPINE_IMAGE: ${{ inputs.ALPINE_IMAGE }}
@@ -203,7 +206,7 @@ jobs:
         if: ${{ inputs.GPU_ARCH_TYPE == 'cuda' && steps.filter.outputs.is-test-matrix-empty == 'False' }}
       - name: Pull Docker image
-        if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' }}
+        if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' && inputs.build_environment != 'linux-s390x-binary-manywheel' }}
         uses: pytorch/test-infra/.github/actions/pull-docker-image@main
         with:
           docker-image: ${{ inputs.DOCKER_IMAGE }}
@@ -213,11 +216,11 @@ jobs:
         uses: ./pytorch/.github/actions/test-pytorch-binary
       - name: Teardown Linux
-        if: always()
+        if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'
         uses: pytorch/test-infra/.github/actions/teardown-linux@main
       - name: Chown workspace
-        if: always()
+        if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'
         uses: ./pytorch/.github/actions/chown-workspace
         with:
           ALPINE_IMAGE: ${{ inputs.ALPINE_IMAGE }}

									
35 .github/workflows/_docs.yml vendored
												
@@ -28,7 +28,21 @@ on:
         description: |
           If this is set, our linter will use this to make sure that every other
           job with the same `sync-tag` is identical.
+      s3-bucket:
+        description: S3 bucket to download artifact
+        required: false
+        type: string
+        default: "gha-artifacts"
+      aws-role-to-assume:
+        description: role to assume for downloading artifacts
+        required: false
+        type: string
+        default: ""
+      upload-aws-role-to-assume:
+        description: role to assume for downloading artifacts
+        required: false
+        type: string
+        default: ""
     secrets:
       GH_PYTORCHBOT_TOKEN:
         required: false
@@ -82,6 +96,14 @@ jobs:
       - name: Setup Linux
         uses: ./.github/actions/setup-linux
+      - name: configure aws credentials
+        if : ${{ inputs.aws-role-to-assume != '' }}
+        uses: aws-actions/configure-aws-credentials@v3
+        with:
+          role-to-assume: ${{ inputs.aws-role-to-assume }}
+          role-session-name: gha-linux-test
+          aws-region: us-east-1
       - name: Calculate docker image
         id: calculate-docker-image
         uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
@@ -97,6 +119,7 @@ jobs:
         uses: ./.github/actions/download-build-artifacts
         with:
           name: ${{ inputs.build-environment }}
+          s3-bucket: ${{ inputs.s3-bucket }}
       - name: Generate netrc (only for docs-push)
         if: inputs.push
@@ -156,6 +179,14 @@ jobs:
         uses: ./.github/actions/chown-workspace
         if: always()
+      - name: configure aws credentials
+        if : ${{ inputs.upload-aws-role-to-assume != '' }}
+        uses: aws-actions/configure-aws-credentials@v3
+        with:
+          role-to-assume: ${{ inputs.upload-aws-role-to-assume }}
+          role-session-name: gha-linux-test
+          aws-region: us-east-1
       - name: Upload Python Docs Preview
         uses: seemethere/upload-artifact-s3@v5
         if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'python' && steps.build-docs.outcome == 'success' }}
@@ -163,7 +194,7 @@ jobs:
           retention-days: 14
           s3-bucket: doc-previews
           if-no-files-found: error
-          path: pytorch.github.io/docs/main/
+          path: pytorch_docs/main/
           s3-prefix: pytorch/pytorch/${{ github.event.pull_request.number }}
       - name: Upload C++ Docs Preview

									
109 .github/workflows/_linux-build-label.yml vendored Normal file
												
@@ -0,0 +1,109 @@
+name: linux-build
+on:
+  workflow_call:
+    inputs:
+      build-environment:
+        required: true
+        type: string
+        description: Top-level label for what's being built/tested.
+      docker-image-name:
+        required: true
+        type: string
+        description: Name of the base docker image to build with.
+      build-generates-artifacts:
+        required: false
+        type: boolean
+        default: true
+        description: If set, upload generated build artifacts.
+      build-with-debug:
+        required: false
+        type: boolean
+        default: false
+        description: If set, build in debug mode.
+      sync-tag:
+        required: false
+        type: string
+        default: ""
+        description: |
+          If this is set, our linter will use this to make sure that every other
+          job with the same `sync-tag` is identical.
+      cuda-arch-list:
+        required: false
+        type: string
+        default: "5.2"
+        description: Runner label to select worker type
+      runner:
+        required: false
+        type: string
+        default: "linux.2xlarge"
+        description: |
+          List of CUDA architectures CI build should target.
+      test-matrix:
+        required: false
+        type: string
+        description: |
+          An option JSON description of what test configs to run later on. This
+          is moved here from the Linux test workflow so that we can apply filter
+          logic using test-config labels earlier and skip unnecessary builds
+      s3-bucket:
+        description: S3 bucket to download artifact
+        required: false
+        type: string
+        default: "gha-artifacts"
+      aws-role-to-assume:
+        description: role to assume for downloading artifacts
+        required: false
+        type: string
+        default: ""
+    secrets:
+      HUGGING_FACE_HUB_TOKEN:
+        required: false
+        description: |
+          HF Auth token to avoid rate limits when downloading models or datasets from hub
+    outputs:
+      docker-image:
+        value: ${{ jobs.build.outputs.docker-image }}
+        description: The docker image containing the built PyTorch.
+      test-matrix:
+        value: ${{ jobs.build.outputs.test-matrix }}
+        description: An optional JSON description of what test configs to run later on.
+jobs:
+  build:
+    # Don't run on forked repos
+    if: github.repository_owner == 'pytorch'
+    runs-on: ${{ inputs.runner }}
+    timeout-minutes: 240
+    outputs:
+      docker-image: ${{ steps.linux-build.outputs.docker-image }}
+      test-matrix: ${{ steps.linux-build.outputs.test-matrix }}
+    steps:
+      - name: Setup SSH (Click me for login details)
+        uses: pytorch/test-infra/.github/actions/setup-ssh@main
+        with:
+          github-secret: ${{ secrets.GITHUB_TOKEN }}
+      # [pytorch repo ref]
+      # Use a pytorch/pytorch reference instead of a reference to the local
+      # checkout because when we run this action we don't *have* a local
+      # checkout. In other cases you should prefer a local checkout.
+      - name: Checkout PyTorch
+        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
+      - name: Linux Build
+        id: linux-build
+        uses: ./.github/actions/linux-build
+        with:
+          build-environment: ${{ inputs.build-environment }}
+          docker-image-name: ${{ inputs.docker-image-name }}
+          build-generates-artifacts: ${{ inputs.build-generates-artifacts }}
+          build-with-debug: ${{ inputs.build-with-debug }}
+          sync-tag: ${{ inputs.sync-tag }}
+          cuda-arch-list: ${{ inputs.cuda-arch-list }}
+          test-matrix: ${{ inputs.test-matrix }}
+          s3-bucket: ${{ inputs.s3-bucket }}
+          aws-role-to-assume: ${{ inputs.aws-role-to-assume }}
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

Compare commits

3756 Commits v2.3.0 ... cslpull85

1 .bazelignore
5 .ci/docker/aotriton_version.txt Normal file
138 .ci/docker/build.sh
12 .ci/docker/centos-rocm/Dockerfile
2 .ci/docker/ci_commit_pins/executorch.txt
2 .ci/docker/ci_commit_pins/triton-rocm.txt
1 .ci/docker/ci_commit_pins/triton-xpu.txt Normal file
2 .ci/docker/ci_commit_pins/triton.txt
2 .ci/docker/common/install_acl.sh
5 .ci/docker/common/install_amdsmi.sh Normal file
23 .ci/docker/common/install_aotriton.sh Executable file
3 .ci/docker/common/install_base.sh
24 .ci/docker/common/install_conda.sh
14 .ci/docker/common/install_cudnn.sh
11 .ci/docker/common/install_cusparselt.sh
11 .ci/docker/common/install_db.sh
8 .ci/docker/common/install_onnx.sh
3 .ci/docker/common/install_protobuf.sh
122 .ci/docker/common/install_rocm.sh
5 .ci/docker/common/install_triton.sh
6 .ci/docker/common/install_vision.sh
25 .ci/docker/common/install_xpu.sh
31 .ci/docker/requirements-ci.txt
5 .ci/docker/ubuntu-cuda/Dockerfile
14 .ci/docker/ubuntu-rocm/Dockerfile
18 .ci/docker/ubuntu-xpu/Dockerfile
8 .ci/docker/ubuntu/Dockerfile
4 .ci/onnx/common.sh
14 .ci/onnx/test.sh
66 .ci/pytorch/build.sh
2 .ci/pytorch/common_utils.sh
2 .ci/pytorch/docs-test.sh
37 .ci/pytorch/install_cache_xla.sh Executable file
8 .ci/pytorch/multigpu-test.sh
6 .ci/pytorch/perf_test/compare_with_baseline.py
9 .ci/pytorch/python_doc_push_script.sh
208 .ci/pytorch/test.sh
37 .ci/pytorch/win-test-helpers/build_pytorch.bat
9 .circleci/scripts/binary_linux_test.sh
4 .circleci/scripts/binary_populate_env.sh
4 .clang-tidy
1 .flake8
1 .gitattributes vendored
15 .github/ISSUE_TEMPLATE/pt2-bug-report.yml vendored
31 .github/actionlint.yaml vendored
11 .github/actions/download-build-artifacts/action.yml vendored
14 .github/actions/filter-test-configs/action.yml vendored
207 .github/actions/linux-build/action.yml vendored Normal file
384 .github/actions/linux-test/action.yml vendored Normal file
6 .github/actions/pytest-cache-download/action.yml vendored
40 .github/actions/setup-linux/action.yml vendored
11 .github/actions/test-pytorch-binary/action.yml vendored
33 .github/actions/upload-test-artifacts/action.yml vendored
2 .github/ci_commit_pins/audio.txt vendored
2 .github/ci_commit_pins/vision.txt vendored
2 .github/ci_commit_pins/xla.txt vendored
13 .github/label_to_label.yml vendored Normal file
14 .github/labeler.yml vendored
154 .github/lf-canary-scale-config.yml vendored Normal file
154 .github/lf-scale-config.yml vendored Normal file
23 .github/merge_rules.yaml vendored
7 .github/pytorch-probot.yml vendored
4 .github/requirements-gha-cache.txt vendored
3 .github/requirements/conda-env-Linux-X64.txt vendored
3 .github/requirements/conda-env-iOS.txt vendored
2 .github/requirements/conda-env-macOS-ARM64 vendored
2 .github/requirements/conda-env-macOS-X64 vendored
2 .github/requirements/pip-requirements-iOS.txt vendored
5 .github/requirements/pip-requirements-macOS.txt vendored
99 .github/scripts/amd/package_triton_wheel.sh vendored Executable file
103 .github/scripts/amd/patch_triton_wheel.sh vendored Executable file
54 .github/scripts/build_triton_wheel.py vendored
2 .github/scripts/cherry_pick.py vendored
6 .github/scripts/comment_on_pr.py vendored
60 .github/scripts/delete_old_branches.py vendored
52 .github/scripts/docathon-label-sync.py vendored Normal file
BIN .github/scripts/drci_mocks.json.gz vendored
117 .github/scripts/filter_test_configs.py vendored


97 .github/scripts/generate_binary_build_matrix.py vendored